GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

Researchers from Tsinghua University and Shanghai AI Laboratory developed GarmentPile++, a robotic system that retrieves specific garments from cluttered piles using vision-language reasoning. The system integrates SAM2 for garment segmentation with visual affordance perception and dual-arm manipulation to ensure reliable single-garment retrieval. This represents significant progress toward practical robotic laundry assistance beyond simplified single-garment scenarios.

Researchers from Tsinghua University and Shanghai AI Laboratory have introduced a novel robotic system designed to tackle one of the most persistent and messy challenges in home robotics: retrieving a single, specific garment from a disorganized pile. This work, detailed in the paper "GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning," represents a significant step toward practical robotic laundry and closet assistance by moving beyond the simplified, single-garment scenarios that dominate current research.

Key Takeaways

  • The proposed pipeline integrates vision-language models (VLMs) for high-level reasoning with visual affordance perception for precise low-level action, enabling robots to follow language commands like "retrieve the blue shirt."
  • It employs the Segment Anything Model 2 (SAM2) to segment individual garments in a pile, providing critical visual cues to the VLM, and includes a mask fine-tuning mechanism to correct poor initial segmentation.
  • A dual-arm robotic framework is deployed to handle large, long, or sagging garments that are difficult for a single arm to manage, ensuring clean and safe retrieval.
  • The system is explicitly designed to guarantee the retrieval of exactly one garment per attempt, establishing a reliable foundation for downstream tasks like folding or hanging.
  • Effectiveness was validated across diverse tasks and scenarios in both real-world and simulation environments, demonstrating robust generalization.

A Multi-Stage Pipeline for Complex Garment Retrieval

The core innovation of the GarmentPile++ system is its structured pipeline that bridges the gap between high-level instruction and low-level manipulation. When presented with a pile of garments and a command, the system first uses the visual segmentation model SAM2 to identify and isolate potential garment masks within the pile. This step is crucial, as a pile presents severe visual occlusion and entanglement. These segmented masks are then fed into a Vision-Language Model (VLM) alongside the user's instruction. The VLM reasons over these visual cues to identify the target garment and, critically, to propose a suitable grasping point.
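To make the first half of this pipeline concrete, here is a minimal Python sketch of the segment-then-reason flow. Everything in it is illustrative: the function names, the `GarmentCandidate` structure, and the stubbed bodies are assumptions standing in for SAM2 and a VLM, not the authors' actual code.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GarmentCandidate:
    mask: np.ndarray                  # boolean HxW mask for one garment
    grasp_point: tuple | None = None  # pixel (u, v) proposed by the VLM


def segment_garments(rgb: np.ndarray) -> list:
    """Hypothetical wrapper around SAM2's automatic mask generation:
    return one GarmentCandidate per visible garment, filtering out
    proposals too small or too large to plausibly be clothing."""
    raise NotImplementedError("plug in SAM2 here")


def select_target(candidates: list, instruction: str) -> GarmentCandidate:
    """Hypothetical VLM query: show the model the masked crops plus an
    instruction like "retrieve the blue shirt" and have it return the
    matching candidate, annotated with a proposed grasp point."""
    raise NotImplementedError("plug in a VLM (e.g. GPT-4V or LLaVA) here")


def plan_retrieval(rgb: np.ndarray, instruction: str) -> GarmentCandidate:
    # Stage 1: isolate per-garment masks despite occlusion and entanglement.
    candidates = segment_garments(rgb)
    # Stage 2: the VLM grounds the language command in those visual cues
    # and proposes where to grasp the chosen garment.
    return select_target(candidates, instruction)
```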

This point is not chosen arbitrarily. The system employs a visual affordance perception model that evaluates the proposed point for its likelihood of enabling a successful, single-garment retrieval. If the initial segmentation from SAM2 is suboptimal—a common issue with highly textured or crumpled fabrics—a mask fine-tuning mechanism refines the boundaries. Finally, the action is executed by a dual-arm robot. This bimanual coordination is essential for practical deployment; a single arm often causes excessive sagging or fails to fully extract large items like dresses or long-sleeve shirts, leading to failed retrievals or tangled piles.
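Continuing the hypothetical interfaces above, the following sketch shows how the affordance gate, the mask fine-tuning fallback, and the single- versus dual-arm dispatch might compose. The threshold constants and helper names are assumed for illustration; the paper does not publish these values.

```python
AFFORDANCE_THRESHOLD = 0.7  # assumed cutoff for a "safe" single-garment grasp
DUAL_ARM_EXTENT_PX = 300    # assumed mask extent above which one arm sags


def affordance_score(rgb, grasp_point) -> float:
    """Hypothetical affordance model: estimated probability that grasping
    here extracts exactly one garment."""
    raise NotImplementedError("plug in the affordance perception model")


def refine_mask(rgb, candidate):
    """Hypothetical mask fine-tuning pass for crumpled or highly textured
    fabric, returning a candidate with corrected boundaries and grasp."""
    raise NotImplementedError


def grasp_single_arm(candidate):   # hypothetical single-arm primitive
    raise NotImplementedError


def grasp_dual_arm(candidate):     # hypothetical bimanual primitive
    raise NotImplementedError


def execute_retrieval(rgb, candidate) -> None:
    if affordance_score(rgb, candidate.grasp_point) < AFFORDANCE_THRESHOLD:
        # Poor initial segmentation is the common failure mode here, so
        # refine the mask and re-propose the grasp before acting.
        candidate = refine_mask(rgb, candidate)

    # Large, long, or sagging garments are risky for one gripper; dispatch
    # to the bimanual controller when the mask's pixel extent suggests so.
    ys, xs = candidate.mask.nonzero()
    extent = max(int(xs.max() - xs.min()), int(ys.max() - ys.min()))
    if extent > DUAL_ARM_EXTENT_PX:
        grasp_dual_arm(candidate)
    else:
        grasp_single_arm(candidate)
```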

Industry Context & Analysis

This research directly addresses a major bottleneck in the commercialization of home-assistant robots. While companies like Matic are developing specialized floor-cleaning robots and FoldiMate (though now defunct) showcased single-garment folding, robust manipulation of unstructured piles remains an unsolved problem. Most academic benchmarks in the cloth and deformable-object manipulation literature focus on isolated garments. The GarmentPile++ pipeline's explicit goal of "exactly one garment" retrieval is a critical metric for real-world utility, moving beyond simple success/failure rates to measure task cleanliness and preparation for next-step operations.

Technically, the integration of foundation models is a growing trend in robotics, exemplified by projects like Google's RT-2. However, unlike RT-2's end-to-end learning, this work uses a more modular, neuro-symbolic approach. The VLM (likely similar to GPT-4V or an open-source alternative like LLaVA) handles semantic reasoning ("the blue shirt"), while dedicated perception models handle geometric affordance. This hybrid method offers greater interpretability and safety—key concerns for physical systems in human environments—compared to a monolithic neural network. The use of SAM2, the successor to Meta's widely adopted SAM (with over 45,000 GitHub stars), indicates a push toward state-of-the-art, generalizable perception components.
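One way to picture that modular, neuro-symbolic split is as two narrow interfaces, one semantic and one geometric. The sketch below frames this with Python Protocols; the interface names and the sparse grasp-point sampling are illustrative assumptions, not the paper's design.

```python
from typing import Protocol

import numpy as np


class SemanticReasoner(Protocol):
    """VLM-style module: answers "which garment matches the instruction?"."""
    def pick_target(self, masks: list, instruction: str) -> int: ...


class AffordanceModel(Protocol):
    """Perception module: answers "how safe is grasping at this point?"."""
    def score(self, image: np.ndarray, point: tuple) -> float: ...


def choose_grasp(image, masks, instruction,
                 reasoner: SemanticReasoner, affordance: AffordanceModel):
    # Semantic and geometric reasoning stay behind separate interfaces, so
    # either module can be inspected or swapped independently: the
    # interpretability argument for this hybrid design over an end-to-end
    # policy.
    idx = reasoner.pick_target(masks, instruction)
    ys, xs = masks[idx].nonzero()
    step = max(1, len(xs) // 64)  # score a sparse sample of mask pixels
    best = max(
        ((int(x), int(y)) for x, y in zip(xs[::step], ys[::step])),
        key=lambda p: affordance.score(image, p),
    )
    return idx, best
```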

The emphasis on dual-arm manipulation also highlights a key hardware-software co-design insight. Much prior research, including many recent efforts built around a single Franka Emika arm, relies on one manipulator. This paper argues that for long-horizon deformable-object tasks, bimanual manipulation is not a luxury but a necessity, aligning with advanced industrial systems while pushing the capability into far more complex, unstructured domains.

What This Means Going Forward

For the robotics industry, this pipeline establishes a new benchmark for evaluating garment manipulation systems. Success can no longer be measured in controlled, single-item settings. Future competitive benchmarks will need to feature piles of mixed garments, measuring metrics like single-retrieval success rate, time-to-retrieve, and post-retrieval pile disorder. This raises the bar for everyone, from academic labs to startups like Matic or Misty Robotics exploring home assistance.
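To pin down what such benchmark metrics could look like in practice, here is a small aggregation sketch over hypothetical trial logs. The field names and the disorder measure are illustrative assumptions, since the article does not fix exact definitions.

```python
from statistics import mean


def summarize_trials(trials: list) -> dict:
    """Aggregate pile-retrieval metrics from per-trial logs.

    Each trial dict is assumed to contain:
      retrieved_count        - garments extracted by the attempt
      seconds                - wall-clock time for the attempt
      disorder_before/after  - any scalar pile-disorder measure, e.g.
                               mean pairwise overlap between garment masks
    """
    return {
        # Strict success: exactly one garment, not "at least one".
        "single_retrieval_success_rate":
            mean(t["retrieved_count"] == 1 for t in trials),
        "mean_time_to_retrieve_s": mean(t["seconds"] for t in trials),
        # Positive values mean the attempt left the pile messier.
        "mean_disorder_increase":
            mean(t["disorder_after"] - t["disorder_before"] for t in trials),
    }
```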

The immediate beneficiaries are research institutions and companies developing practical home and logistics robots. The modular architecture allows for swapping in improved components—a more capable VLM, a faster segmentation model, or more advanced grasp planners—as each sub-field advances. In the near term, we can expect to see this methodology applied to other "pile" problems in warehouses, such as retrieving specific items from bins of soft goods (the "bin picking" problem for deformables).

Watch for two key developments next. First, the integration of this pipeline with a full downstream task chain, such as retrieving a shirt and then successfully folding it, to demonstrate true end-to-end laundry automation. Second, the release of the associated simulation environment and dataset (as suggested by the project page) will be critical. Open benchmarks like these accelerate progress by allowing the community to test and compare approaches, much as ImageNet did for computer vision. If the provided simulation is high-fidelity and comprehensive, it could quickly become the standard testbed for this challenging domain, propelling the field closer to robots that can truly tackle the messy reality of a laundry basket.
