GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

Researchers from Tsinghua University and Shanghai AI Laboratory developed GarmentPile2, a robotic system that retrieves specific garments from cluttered piles using vision-language models and visual affordance perception. The system guarantees retrieval of exactly one garment per attempt and employs SAM2 segmentation with mask fine-tuning for robust performance. It utilizes a dual-arm cooperation framework for handling complex garments and has been validated in both real-world and simulation environments.

Researchers from Tsinghua University and Shanghai AI Laboratory have introduced a novel robotic system designed to tackle one of the most persistent and messy challenges in home robotics: retrieving a single, specific garment from a disorganized pile. This work, "GarmentPile2," represents a significant step beyond prior research, which typically assumes a single, isolated garment, by directly addressing the complex, real-world scenario of cluttered laundry.

Key Takeaways

  • The proposed pipeline integrates vision-language models (VLMs) for high-level reasoning with visual affordance perception for precise low-level action, enabling robots to follow language instructions to retrieve garments from a pile.
  • A key innovation is the guarantee of retrieving exactly one garment per attempt, a critical requirement for reliable downstream tasks like folding or hanging.
  • The system employs the SAM2 visual segmentation model to provide detailed visual cues to the VLM and incorporates a mask fine-tuning mechanism to correct suboptimal initial segmentations.
  • For handling large, long, or sagging garments, the pipeline utilizes a dual-arm cooperation framework to overcome limitations of single-arm manipulation.
  • The method's effectiveness has been validated across diverse tasks and scenarios in both real-world and simulation environments.

A Technical Breakdown of the GarmentPile2 Pipeline

The core challenge in piled garment manipulation is the "visual clutter" problem, where garments are heavily occluded and entangled. The GarmentPile2 pipeline addresses this through a sophisticated, multi-stage architecture. First, the SAM2 (Segment Anything Model 2) model performs instance segmentation on the pile, generating individual masks for potential garments. These masks are not final action targets; instead, they provide rich, structured visual cues—such as shape, overlap, and potential grasp points—to a downstream Vision-Language Model.
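To make the notion of structured visual cues concrete, here is a minimal, hedged sketch (not the authors' implementation): given per-garment binary masks from any SAM2-style segmenter, it computes simple per-mask statistics, such as pixel area, bounding box, and a rough occlusion fraction, that could be serialized into a text prompt for the VLM. The function name summarize_masks and the exact cue set are illustrative assumptions.

```python
import numpy as np

def summarize_masks(masks: list[np.ndarray]) -> list[dict]:
    """Turn binary instance masks (H x W bool arrays) into compact,
    structured cues that a VLM prompt can reference by index.
    Illustrative only; the cue set here is an assumption."""
    cues = []
    for i, mask in enumerate(masks):
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue  # skip empty masks
        area = int(mask.sum())
        bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
        # Rough occlusion cue: fraction of this mask shared with any other mask.
        others = np.zeros_like(mask)
        for j, other in enumerate(masks):
            if j != i:
                others |= other
        overlap = float((mask & others).sum()) / max(area, 1)
        cues.append({"id": i, "area_px": area, "bbox_xyxy": bbox,
                     "occluded_fraction": round(overlap, 3)})
    return cues

# Toy example with two synthetic, overlapping "garment" masks.
h, w = 64, 64
a = np.zeros((h, w), dtype=bool); a[10:40, 10:40] = True
b = np.zeros((h, w), dtype=bool); b[30:60, 30:60] = True
print(summarize_masks([a, b]))
```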

The VLM, equipped with these visual prompts and a natural language instruction (e.g., "retrieve the blue t-shirt"), performs the high-level reasoning. It must identify the target garment within the segmented pile and reason about the safest, cleanest retrieval strategy that will isolate only that item. This plan is then translated into actionable coordinates via a visual affordance perception module, which produces grasp predictions conditioned on the garment's perceived state. A novel mask fine-tuning mechanism acts as a corrective loop, refining SAM2's initial segmentation if the affordance model or VLM indicates poor candidate regions, ensuring robustness against imperfect visual input.
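The exact interfaces are not described in enough detail here to reproduce, so the sketch below only illustrates the control flow under stated assumptions: the VLM selects a target mask from the cues and the instruction, an affordance model proposes a grasp with a confidence score, and a mask-refinement retry fires when that score is low. All names (plan_retrieval, vlm_select, predict_affordance, refine_mask) and the threshold are hypothetical.

```python
from typing import Callable
import numpy as np

def plan_retrieval(
    instruction: str,
    masks: list,
    cues: list,
    vlm_select: Callable,          # (instruction, cues) -> chosen mask index
    predict_affordance: Callable,  # (mask) -> (grasp_xy, confidence score)
    refine_mask: Callable,         # (mask) -> refined mask (fine-tuning step)
    min_score: float = 0.5,        # arbitrary illustrative threshold
    max_refinements: int = 2,
) -> dict:
    """Hypothetical plan-and-correct loop: select a target with the VLM,
    query the affordance model, and refine the mask when the grasp looks weak."""
    target_id = vlm_select(instruction, cues)
    mask = masks[target_id]
    grasp_xy, score = predict_affordance(mask)
    for _ in range(max_refinements):
        if score >= min_score:
            break
        mask = refine_mask(mask)               # corrective loop on a poor segmentation
        grasp_xy, score = predict_affordance(mask)
    return {"target_id": target_id, "grasp_xy": grasp_xy, "score": score}
```

Keeping the VLM, the affordance model, and the refinement step behind separate interfaces mirrors the modularity discussed later in this article: each decision can be logged and inspected before an arm ever moves.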

Finally, the system decides on a manipulation strategy. For simple retrievals, a single arm executes the grasp. However, for complex cases, such as a large dress where a single grasp point causes excessive sagging and risks dragging other garments, or long sleeves that require support, the dual-arm cooperation framework is engaged. One arm performs the primary retrieval grasp while the second provides supportive manipulation to isolate the target, demonstrating the system's reasoning about how deformable garments behave.
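As a rough illustration of how the single-arm versus dual-arm decision could be driven by segmentation cues alone, the heuristic below escalates to dual-arm cooperation for masks that are very large or strongly elongated. The thresholds and the function choose_arm_strategy are assumptions for illustration, not the paper's actual policy.

```python
import numpy as np

def choose_arm_strategy(mask: np.ndarray,
                        large_area_frac: float = 0.15,   # illustrative threshold
                        elongation_ratio: float = 3.0) -> str:
    """Pick dual-arm cooperation for garments that cover a large fraction of
    the view or are strongly elongated (long sleeves, dresses), where a single
    grasp point would cause sagging and drag neighbouring items."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return "skip"                      # empty mask: nothing to grasp
    h, w = mask.shape
    area_frac = mask.sum() / (h * w)
    extent_x = xs.max() - xs.min() + 1
    extent_y = ys.max() - ys.min() + 1
    elongation = max(extent_x, extent_y) / max(min(extent_x, extent_y), 1)
    if area_frac > large_area_frac or elongation > elongation_ratio:
        return "dual_arm"                  # second arm supports while the first grasps
    return "single_arm"
```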

Industry Context & Analysis

Garment manipulation sits at the difficult intersection of robotic vision, dexterous manipulation, and unstructured environment reasoning. While companies like Boston Dynamics showcase stunning mobility, and OpenAI (with its earlier Dactyl project) and Google DeepMind have made strides in dexterous hand manipulation, applying these skills to highly deformable objects like fabric remains a niche but critical frontier for domestic utility. Most academic and industry demos, such as those from UC Berkeley's AUTOLAB or Toyota Research Institute, focus on folding a single, pre-isolated towel or shirt. GarmentPile2's explicit focus on the "pile" scenario directly attacks a major bottleneck preventing translation from lab demos to practical home assistants.

Technically, the integration of foundation models is the standout trend. Using SAM2 for segmentation and a VLM for planning reflects a move away from training monolithic, task-specific models from scratch, a process that requires massive, hard-to-acquire robotic datasets. Instead, it leverages the robust, pre-trained visual and reasoning priors of these large models. This approach mirrors the "embodied AI" paradigm seen in projects like Google's RT-2, which fine-tunes VLMs for direct robot action output. However, GarmentPile2's architecture is more modular, using the VLM specifically for symbolic planning and state understanding before handing off to a dedicated affordance model, potentially offering more interpretability and control.

The guarantee of "one garment per attempt" is a crucial engineering metric often glossed over in research. In real-world terms, a 95% success rate on single-garment folding is impressive, but if the robot fails to cleanly isolate a garment from a pile 30% of the time, the overall system failure rate becomes unacceptable. By designing and evaluating for this specific metric, the researchers are prioritizing system-level reliability. The use of dual-arm coordination further highlights an understanding that software intelligence must be matched with appropriate hardware capability to solve physically constrained problems, a lesson evident in the evolution of industrial robotics towards collaborative, multi-arm cells.
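To put rough numbers on that compounding effect (using the illustrative figures from the paragraph above, not reported results): a 70% chance of cleanly isolating one garment, multiplied by a 95% folding success rate, leaves an end-to-end success rate of about 66% per attempt.

```python
# Illustrative arithmetic only; these are the hypothetical rates quoted above,
# not measured results from the paper.
isolation_success = 0.70   # pile retrieval isolates exactly one garment
folding_success = 0.95     # downstream folding on a cleanly isolated garment

end_to_end = isolation_success * folding_success
print(f"End-to-end success per attempt: {end_to_end:.1%}")  # ~66.5%
```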

What This Means Going Forward

This research provides a robust template for solving a wider class of "cluttered retrieval" problems in robotics, from tidying children's toys to sorting recyclables. The modular pipeline—segmentation, language-guided reasoning, affordance prediction, and adaptive manipulation—is a framework that can be adapted to other deformable objects or entangled items. For the domestic robotics industry, which has seen cautious progress from companies like iRobot (floor cleaning) to more ambitious startups, this work chips away at a major perceptual and physical barrier. A robot that can reliably sort and retrieve laundry represents a tangible leap towards general-purpose home assistance.

Looking ahead, the next challenges are scalability and speed. The current pipeline, reliant on large models like SAM2 and a VLM, may face latency issues in real-time operation. Future work will likely focus on distilling these models into more efficient, specialized networks and improving the closed-loop feedback between perception and action. Furthermore, benchmarking on standardized, complex garment piles, akin to what the YCB object set provides for rigid-object manipulation, will be essential for comparing progress across the field. As these technologies mature, we can expect to see them first in structured commercial settings like textile handling or logistics before trickling into consumer products, gradually transforming the long-held vision of a robot laundry-folder from science fiction into a measurable engineering reality.

Frequently Asked Questions