GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

Researchers from Tsinghua University and Shanghai AI Laboratory have introduced a novel robotic pipeline designed to tackle the complex, real-world challenge of retrieving individual garments from a pile, a significant step toward practical home-assistant robotics. The work addresses a critical limitation in the field: most prior research simplifies the problem by assuming a single, isolated garment. By handling cluttered piles directly, the pipeline bridges the gap between controlled lab demonstrations and the messy reality of laundry day.

Key Takeaways

  • A new robotic pipeline integrates vision-language models (VLMs) and visual affordance perception to retrieve specific garments from a pile based on natural language instructions.
  • The system guarantees retrieval of exactly one garment per attempt, preventing tangled messes and establishing a reliable foundation for downstream tasks like folding or hanging.
  • It employs the Segment Anything Model 2 (SAM2) for garment segmentation and introduces a mask fine-tuning mechanism to correct suboptimal initial segmentations.
  • A dual-arm robotic framework is deployed to handle large, long, or sagging garments that are problematic for single-arm systems.
  • The method's effectiveness is validated across diverse tasks and scenarios in both real-world and simulation environments.

A Multi-Modal Pipeline for Garment Pile Manipulation

The proposed pipeline represents a sophisticated integration of high-level reasoning and low-level robotic control. At its core, it leverages a Vision-Language Model (VLM) to interpret natural language commands (e.g., "retrieve the blue shirt") and plan the retrieval action. To give the VLM a precise understanding of the chaotic pile, the system first uses the visual segmentation model SAM2 to identify and segment individual garments, providing the VLM with crucial visual cues about each item's state and position.
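To make the data flow concrete, the sketch below shows a minimal, hypothetical version of this segmentation-then-grounding step in Python. The segment_garments and query_vlm helpers are placeholders standing in for SAM2 and the VLM; they are assumptions for illustration, not the paper's actual interfaces.

```python
# Hypothetical sketch of the perception + language-grounding step described above.
# segment_garments and query_vlm are stand-ins for SAM2 and a VLM, not real APIs.
from dataclasses import dataclass

import numpy as np


@dataclass
class GarmentMask:
    mask: np.ndarray   # H x W boolean segmentation mask for one garment
    label: str         # description assigned by the VLM, e.g. "blue shirt"


def segment_garments(rgb: np.ndarray) -> list[GarmentMask]:
    """Placeholder for SAM2-style instance segmentation of the garment pile."""
    h, w, _ = rgb.shape
    dummy = np.zeros((h, w), dtype=bool)
    dummy[: h // 2, : w // 2] = True
    return [GarmentMask(mask=dummy, label="blue shirt")]


def query_vlm(instruction: str, candidates: list[GarmentMask]) -> GarmentMask:
    """Placeholder for VLM reasoning: pick the mask that best matches the command."""
    matches = [c for c in candidates if c.label in instruction.lower()]
    return matches[0] if matches else candidates[0]


def plan_retrieval(rgb: np.ndarray, instruction: str) -> GarmentMask:
    candidates = segment_garments(rgb)         # one mask per garment in the pile
    return query_vlm(instruction, candidates)  # ground the language command


if __name__ == "__main__":
    image = np.zeros((480, 640, 3), dtype=np.uint8)
    target = plan_retrieval(image, "retrieve the blue shirt")
    print(target.label, int(target.mask.sum()), "mask pixels")
```

Keeping perception and reasoning behind narrow interfaces like these is also what makes it straightforward to swap in stronger segmenters or VLMs later, a point the outlook section returns to.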

Recognizing that automated segmentation can be imperfect, the researchers integrated a mask fine-tuning mechanism. This allows the system to iteratively refine segmentation masks when initial results are suboptimal, ensuring accurate target identification before any physical interaction. For the physical retrieval, the pipeline uses visual affordance perception—learning from data which points on a garment are graspable—to determine the optimal pick point. Finally, a dual-arm cooperation framework is employed to manage complex scenarios, such as large bedsheets or garments that sag excessively when grasped at a single point, coordinating the arms to achieve clean, single-garment extraction.
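The two control-side mechanisms above can be illustrated with a short, hedged sketch: iteratively re-estimating a mask while a quality score stays below a threshold, and selecting the grasp point as the highest-scoring pixel of an affordance map restricted to the target mask. The quality function and the blob-pruning refinement below are illustrative assumptions, not the paper's fine-tuning mechanism or learned affordance model.

```python
# Hedged sketch: (1) refine a segmentation mask until a quality score clears a
# threshold, (2) pick a grasp point from an affordance heatmap over the target.
# The refinement here just keeps the largest connected blob; the real system
# would re-prompt the segmenter, and the affordance map would come from a
# learned model rather than random values.
import numpy as np
from scipy import ndimage


def largest_component(mask: np.ndarray) -> np.ndarray:
    """Keep only the largest connected region of a boolean mask."""
    labeled, n = ndimage.label(mask)
    if n <= 1:
        return mask
    sizes = ndimage.sum(mask, labeled, index=range(1, n + 1))
    return labeled == (int(np.argmax(sizes)) + 1)


def refine_mask(mask: np.ndarray, quality_fn, max_iters: int = 3,
                threshold: float = 0.8) -> np.ndarray:
    """Iteratively refine the mask while its quality score stays below threshold."""
    for _ in range(max_iters):
        if quality_fn(mask) >= threshold:
            break
        mask = largest_component(mask)
    return mask


def select_grasp_point(affordance: np.ndarray, target_mask: np.ndarray):
    """Return the (row, col) of the highest affordance score inside the target."""
    scores = np.where(target_mask, affordance, -np.inf)
    return np.unravel_index(int(np.argmax(scores)), scores.shape)


if __name__ == "__main__":
    mask = np.zeros((480, 640), dtype=bool)
    mask[100:300, 200:400] = True      # main garment region
    mask[10:20, 10:20] = True          # spurious fragment to be pruned
    mask = refine_mask(mask, quality_fn=lambda m: 0.5)  # toy score forces refinement
    affordance = np.random.rand(480, 640)               # stand-in affordance heatmap
    print(select_grasp_point(affordance, mask))
```

One plausible way a bimanual extension could slot into this sketch is to select a second grasp point for the other arm from the remaining high-affordance region, though the paper's dual-arm coordination logic is not detailed here.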

Industry Context & Analysis

This research directly confronts a major bottleneck in domestic robotics. While companies like Boston Dynamics (with Stretch) and Toyota Research Institute have shown impressive dexterous manipulation, their public demonstrations often focus on rigid or single objects. Garment manipulation, particularly from piles, remains a "grand challenge" due to the infinite configurations, deformability, and occlusion of fabrics. The authors' approach of combining VLMs with affordance learning follows a powerful trend seen in general robot learning, exemplified by systems like Google's RT-2, which also fuses vision-language pre-training with robotic control. However, this work is specialized for the uniquely difficult domain of textiles.

The technical choice to use SAM2 is significant. Its predecessor, SAM, amassed over 45,000 GitHub stars for its zero-shot segmentation capability, but it is not designed for sequential robotic decision-making. This pipeline effectively repurposes SAM2 as a perception module within a closed-loop control system. The guarantee of "exactly one garment" retrieval is a critical metric of practical utility. In contrast, many prior academic works in this domain, often benchmarked within frameworks such as ClothSim or Real2Sim2Real, report success rates that can drop precipitously when moving from single garments to piles. This pipeline's explicit design for pile manipulation represents a targeted advancement over more general-purpose "pick-and-place" systems that would likely fail or tangle multiple garments.

Furthermore, the adoption of a dual-arm system reflects an industry shift. While much academic and startup research (e.g., Covariant in logistics) optimizes for single-arm speed, complex deformable object handling often necessitates bimanual coordination, an approach championed by research institutions like MIT's CSAIL and UC Berkeley's AUTOLAB. The market need is clear: the global service robotics market, which includes domestic assistants, is projected to exceed $100 billion by 2030, but adoption hinges on solving such unstructured tasks.

What This Means Going Forward

This work establishes a robust foundation for the next generation of home and logistics robotics. The immediate beneficiaries are robotics researchers and companies developing solutions for laundry, warehouse apparel handling, and elderly care assistance, where managing clothing is a fundamental task. By solving the "first step" of clean retrieval from a pile, it unlocks the automation of downstream workflows like folding, sorting, and hanging, which have been the focus of companies like FoldiMate (though with limited commercial success due to handling constraints).

Looking ahead, the integration pathway demonstrated here—VLMs for planning, foundation models (SAM2) for perception, and affordance learning for control—is likely to become a blueprint for tackling other complex manipulation domains involving deformable objects, such as grocery bagging or kitchen pantry organization. A key area to watch will be the system's performance under extreme clutter or with highly similar garments, which will test the limits of the VLM's reasoning and the segmentation fine-tuning. Furthermore, the translation from controlled lab validation to cost-effective, reliable consumer hardware remains a significant hurdle. As the underlying models continue to improve (e.g., with more capable VLMs and faster segmentation), the core architecture proposed in this paper could rapidly accelerate the timeline for robots that can truly navigate the messy, unstructured world of human homes.
