Multimodal AI
Advances in multimodal large-model techniques and applications that fuse text, images, audio, and video.
PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters
PlaneCycle is a training-free operator that enables pretrained 2D vision foundation models such as DINOv3 to process 3D volumetric data without adapters or fine-tuning.
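The digest does not spell out the operator itself; as a rough illustration of the slice-wise lifting idea it gestures at, here is a minimal sketch assuming a frozen 2D encoder is applied along each of the three volume axes and the resulting feature volumes are averaged. The function and the averaging fusion rule are our assumptions, not PlaneCycle's published operator.

```python
import torch

def lift_2d_to_3d(encoder, volume):
    """Apply a frozen 2D encoder slice-wise along each axis of a 3D volume
    and average the three resulting feature volumes.

    volume: (C, D, H, W) tensor; encoder maps (N, C, A, B) -> (N, F, A, B)
    and must accept the varying slice sizes. Illustrative lifting scheme,
    not the published PlaneCycle operator.
    """
    feats = []
    # Cycle through the three orthogonal slicing planes.
    for axis_perm, inv_perm in [((1, 0, 2, 3), (1, 0, 2, 3)),   # axial: slices over D
                                ((2, 0, 1, 3), (1, 2, 0, 3)),   # coronal: slices over H
                                ((3, 0, 1, 2), (1, 2, 3, 0))]:  # sagittal: slices over W
        slices = volume.permute(*axis_perm)       # (S, C, A, B) stack of 2D slices
        with torch.no_grad():
            f = encoder(slices)                   # (S, F, A, B); encoder stays frozen
        feats.append(f.permute(*inv_perm))        # back to a (F, D, H, W) layout
    return torch.stack(feats).mean(dim=0)         # fuse the three plane views
```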
GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning
Researchers from Tsinghua University and Shanghai AI Laboratory developed GarmentPile++, a robotic system that reliably retrieves target garments from cluttered piles using affordance-driven vision-language reasoning.
Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Crab⁺ is a novel audio-visual large language model (AV-LLM) that addresses the negative transfer problem in multimodal learning through explicit cooperation between the audio and visual modalities.
Efficient Point Cloud Processing with High-Dimensional Positional Encoding and Non-Local MLPs
HPENet introduces a two-stage abstraction-refinement framework that combines a High-dimensional Positional Encoding (HPE) module with non-local MLPs for efficient point cloud processing.
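The digest does not define the HPE module; one common way to lift 3D coordinates into a high-dimensional space is sinusoidal Fourier features, sketched below as an assumed stand-in. The published module may differ.

```python
import torch

def high_dim_positional_encoding(xyz, num_freqs=8):
    """Map 3D point coordinates to a higher-dimensional embedding via
    sinusoids at geometrically spaced frequencies (Fourier features).
    Illustrative stand-in for HPENet's HPE module, which may differ.

    xyz: (N, 3) coordinates -> returns (N, 3 * 2 * num_freqs).
    """
    freqs = 2.0 ** torch.arange(num_freqs, dtype=xyz.dtype)   # (F,)
    angles = xyz.unsqueeze(-1) * freqs                        # (N, 3, F)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)     # (N, 3, 2F)
    return enc.flatten(start_dim=1)                           # (N, 6F)

points = torch.rand(1024, 3)
print(high_dim_positional_encoding(points).shape)  # torch.Size([1024, 48])
```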
DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval
DQE-CIR (Distinctive Query Embeddings via Learnable Attribute Weights and Target Relative Negative Sampling) is a composed image retrieval method that addresses relevance suppression and semantic confusion in standard contrastive learning.
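A minimal sketch of one plausible reading of the two named ingredients, under stated assumptions: "learnable attribute weights" modeled as a learned per-dimension reweighting of the fused query embedding, and "target relative negative sampling" modeled as choosing hard negatives by their similarity to the target image. Neither reading is confirmed by the digest.

```python
import torch
import torch.nn.functional as F

class WeightedQueryHead(torch.nn.Module):
    """Learnable per-dimension attribute weights applied to the fused
    (reference image + modification text) query embedding. An assumed
    reading of 'learnable attribute weights', not the paper's code."""
    def __init__(self, dim):
        super().__init__()
        self.attr_weights = torch.nn.Parameter(torch.ones(dim))

    def forward(self, query_emb):
        return F.normalize(query_emb * self.attr_weights, dim=-1)

def target_relative_infonce(query, target, gallery, k=16, tau=0.07):
    """InfoNCE where negatives are the gallery items most similar to the
    *target* (hard, target-relative negatives). Assumes the target itself
    is excluded from the gallery. query, target: (D,) unit vectors;
    gallery: (G, D) unit vectors."""
    sim_to_target = gallery @ target                  # (G,) candidate-to-target similarity
    neg = gallery[sim_to_target.topk(k).indices]      # hard negatives chosen relative to the target
    pos_logit = (query @ target).unsqueeze(0)         # (1,)
    neg_logits = neg @ query                          # (k,)
    logits = torch.cat([pos_logit, neg_logits]) / tau
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```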
Discriminative Perception via Anchored Description for Reasoning Segmentation
Researchers introduced Discriminative Perception via Anchored Description (DPAD), a training framework that addresses unfocused reasoning in multimodal models for reasoning segmentation.
When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models
Cornell University researchers developed 'Face Pareidolia as a Probe of Visual Interpretation Under Ambiguity', a diagnostic framework that evaluates how six vision models (CLIP-B/32, CLIP-L/14, LLaVA-1.5-7B, ViT, YOLOv8, and RetinaFace) interpret ambiguous visual patterns using face pareidolia, the perception of faces in non-face stimuli.
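CLIP is among the probed models; one simple way to probe whether a model reads an ambiguous image as a face is zero-shot prompt scoring, sketched below. The prompts, the input file name, and the scoring protocol are illustrative, not the study's.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot probe: does the model read an ambiguous image as a face?
# Prompts and scoring here are illustrative, not the paper's protocol.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("pareidolia_example.jpg")  # hypothetical input image
prompts = ["a photo of a human face", "a photo of an inanimate object"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

print({p: round(float(s), 3) for p, s in zip(prompts, probs[0])})
```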
GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery
GeoSeg is a training-free framework that enables multimodal large language models (MLLMs) to perform zero-shot, reasoning-driven, instruction-grounded object segmentation in remote sensing imagery.
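The digest names no components; structurally, a training-free reason-then-segment pipeline can be composed from a frozen MLLM and a promptable mask generator. The sketch below uses hypothetical stubs for both stages; neither interface is GeoSeg's actual API.

```python
# Structural sketch of a training-free reason-then-segment pipeline in the
# spirit of GeoSeg. Both helpers are hypothetical stand-ins: the digest does
# not specify which MLLM or mask generator the paper composes.

def mllm_locate(image, instruction):
    """Ask a frozen multimodal LLM to reason over the instruction and return
    a coarse bounding box (x0, y0, x1, y1) for the referred object."""
    raise NotImplementedError("plug in any off-the-shelf MLLM here")

def promptable_segment(image, box):
    """Refine the coarse box into a pixel mask with a promptable segmenter
    (e.g., a SAM-style model prompted with the box)."""
    raise NotImplementedError("plug in any box-promptable segmenter here")

def geoseg_like(image, instruction):
    box = mllm_locate(image, instruction)   # stage 1: language reasoning -> region
    return promptable_segment(image, box)   # stage 2: region -> mask, no training
```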
BLOCK: An Open-Source Bi-Stage MLLM Character-to-Skin Pipeline for Minecraft
BLOCK is an open-source bi-stage AI pipeline that transforms arbitrary character concepts and textual descriptions into functional, pixel-perfect Minecraft skins.
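As a small aside on what "functional" minimally requires, a generated skin must match the game's skin format (a 64x64 modern or 64x32 legacy PNG with an alpha channel). The check below is our illustration of that constraint, not part of BLOCK.

```python
from PIL import Image

def is_valid_minecraft_skin(path):
    """Check the basic format constraints a generated skin must satisfy:
    a 64x64 (modern) or 64x32 (legacy) PNG with alpha. A format check only;
    it does not verify BLOCK's UV-region semantics."""
    img = Image.open(path)
    has_alpha = "A" in img.getbands() or "transparency" in img.info
    return img.format == "PNG" and img.size in ((64, 64), (64, 32)) and has_alpha
```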
Towards Generalized Multimodal Homography Estimation
Researchers developed a training method for multimodal homography estimation that generates synthetic, unaligned training data, significantly improving generalization across image modalities.
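A standard recipe for synthesizing homography supervision is four-corner perturbation, sketched below; how the paper additionally makes the pairs multimodal and unaligned is not specified in this digest.

```python
import numpy as np
import cv2

def make_synthetic_pair(image, max_shift=32, rng=None):
    """Create a (source, warped, H) training triple by randomly perturbing
    the four image corners and warping: the classic recipe for synthesizing
    homography supervision. Illustrative; not the paper's full data pipeline."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    src = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    dst = src + rng.uniform(-max_shift, max_shift, size=(4, 2)).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)        # ground-truth homography
    warped = cv2.warpPerspective(image, H, (w, h))   # second view of the scene
    return image, warped, H
```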
Cross-Modal Mapping and Dual-Branch Reconstruction for 2D-3D Multimodal Industrial Anomaly Detection
CMDR-IAD is a novel unsupervised framework for multimodal industrial anomaly detection that fuses 2D RGB and 3D geometric data through bidirectional cross-modal mapping and dual-branch reconstruction, achieving state-of-the-art performance.
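A generic reconstruction-based scoring sketch of the dual-branch idea, assuming each branch yields a reconstruction whose per-pixel error is fused into one anomaly map. CMDR-IAD's cross-modal mapping and exact fusion are not detailed in this digest.

```python
import torch

def fused_anomaly_map(rgb, rgb_recon, geo, geo_recon, alpha=0.5):
    """Score anomalies as per-pixel reconstruction error, fused across the
    2D (RGB) and 3D (geometry) branches. A generic reconstruction-based
    sketch, not the paper's scoring function.

    All tensors are (C, H, W); returns an (H, W) anomaly map.
    """
    err_rgb = (rgb - rgb_recon).pow(2).mean(dim=0)   # 2D branch error
    err_geo = (geo - geo_recon).pow(2).mean(dim=0)   # 3D branch error
    return alpha * err_rgb + (1 - alpha) * err_geo   # higher = more anomalous
```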
From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning
Researchers identified 'Lazy Attention Localization', a fundamental inefficiency in multimodal AI training, and introduced the Visual Attention Score (VAS), an attention-based metric that quantifies a model's attention to visual tokens (with a reported correlation of r = 0.961), proposing an attention-guided cold-start that reshapes multimodal reasoning from narrow to panoramic vision.
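An assumed reading of the metric: the fraction of attention mass that text queries place on visual tokens, averaged over heads and query positions. The published definition may aggregate layers or tokens differently.

```python
import torch

def visual_attention_score(attn, visual_token_mask):
    """Fraction of attention mass placed on visual tokens, averaged over
    heads and query positions. An assumed reading of the paper's Visual
    Attention Score, not its published definition.

    attn: (heads, Q, K) post-softmax attention from one layer.
    visual_token_mask: (K,) bool, True where the key is a visual token.
    """
    mass_on_visual = attn[..., visual_token_mask].sum(dim=-1)  # (heads, Q)
    return mass_on_visual.mean().item()
```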
Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning
Research from the University of Texas at Austin demonstrates that large-scale pretrained Vision-Language-Action (VLA) models are surprisingly resistant to catastrophic forgetting in continual learning.
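For reference, the quantity at stake is the standard continual-learning forgetting metric; the sketch below computes the conventional definition, not code from the paper.

```python
import numpy as np

def average_forgetting(acc_matrix):
    """Standard forgetting metric: for each task, the drop from its best
    accuracy during sequential training to its accuracy after the final
    task, averaged over all but the last task. The common textbook
    definition, not the paper's code.

    acc_matrix[i, j]: accuracy on task j after training on task i (T x T).
    """
    acc = np.asarray(acc_matrix, dtype=float)
    T = acc.shape[0]
    assert T >= 2, "forgetting needs at least two tasks"
    drops = [acc[:T - 1, j].max() - acc[T - 1, j] for j in range(T - 1)]
    return float(np.mean(drops))  # near zero = resistant to forgetting
```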