Multimodal AI

Advances in multimodal large-model technology and applications spanning text, images, audio, and video.

PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters

Multimodal · March 8, 2026

PlaneCycle is a training-free operator that enables pretrained 2D vision foundation models to process 3D volumetric data...

arXiv cs.AI · Read full article →

PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters

Multimodal · March 8, 2026

PlaneCycle is a novel training-free operator that enables 2D vision foundation models to process 3D volumetric data with...

arXiv cs.AI · Read full article →

PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters

Multimodal · March 8, 2026

PlaneCycle is a novel training-free operator that enables pretrained 2D foundation models like DINOv3 to process 3D volu...

arXiv cs.AI · Read full article →

GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

Multimodal · March 8, 2026

Researchers from Tsinghua University and Shanghai AI Laboratory developed GarmentPile++, a robotic system that reliably ...

arXiv cs.AI · Read full article →

GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

Multimodal · March 8, 2026

Researchers from Tsinghua University and Shanghai AI Laboratory developed GarmentPile++, a robotic system that retrieves ...

arXiv cs.AI · Read full article →

GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

Multimodal · March 8, 2026

Researchers from Tsinghua University and Shanghai AI Laboratory developed GarmentPile++, a robotic pipeline that retriev...

arXiv cs.AI · Read full article →

GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

Multimodal · March 8, 2026

GarmentPile++ is a robotic system developed by Tsinghua University and Shanghai AI Laboratory that enables robots to retr...

arXiv cs.AI · Read full article →

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Multimodal · March 8, 2026

Crab⁺ is a novel audio-visual large language model (AV-LLM) that overcomes negative transfer in multimodal AI, where con...

arXiv cs.AI · Read full article →

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Multimodal · March 8, 2026

Crab⁺ is a novel audio-visual large language model (AV-LLM) that successfully reverses negative transfer in multimodal A...

arXiv cs.AI · Read full article →

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Multimodal · March 8, 2026

Crab⁺ is a novel Audio-Visual Large Language Model that solves the negative transfer problem in multimodal AI, where con...

arXiv cs.AI · Read full article →

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Multimodal · March 8, 2026

Crab⁺ is a novel audio-visual large language model (AV-LLM) that solves the negative transfer problem in multimodal AI, ...

arXiv cs.AI · Read full article →

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Multimodal · March 8, 2026

Crab⁺ is a novel audio-visual AI framework that successfully reverses the negative transfer problem in multimodal learni...

arXiv cs.AI · Read full article →

Efficient Point Cloud Processing with High-Dimensional Positional Encoding and Non-Local MLPs

Multimodal · March 8, 2026

HPENet introduces a novel two-stage abstraction-refinement framework with high-dimensional positional encoding for effic...

arXiv cs.AI · Read full article →

Efficient Point Cloud Processing with High-Dimensional Positional Encoding and Non-Local MLPs

Multimodal · March 8, 2026

HPENet introduces a novel High-dimensional Positional Encoding (HPE) module and non-local MLPs for efficient point cloud...

arXiv cs.AI · Read full article →

Efficient Point Cloud Processing with High-Dimensional Positional Encoding and Non-Local MLPs

Multimodal · March 8, 2026

HPENet introduces a novel framework for efficient point cloud processing using high-dimensional positional encoding (HPE...

arXiv cs.AI · Read full article →

Efficient Point Cloud Processing with High-Dimensional Positional Encoding and Non-Local MLPs

Multimodal · March 8, 2026

HPENet introduces a novel framework for point cloud processing using high-dimensional positional encoding (HPE) and non-...

arXiv cs.AI · Read full article →

DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval

Multimodal · March 8, 2026

DQE-CIR (Distinctive Query Embeddings via Learnable Attribute Weights and Target Relative Negative Sampling) is a novel ...

arXiv cs.AI · Read full article →

DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval

Multimodal · March 8, 2026

DQE-CIR is a novel composed image retrieval method that addresses relevance suppression and semantic confusion through l...

arXiv cs.AI · Read full article →

DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval

Multimodal · March 8, 2026

DQE-CIR introduces a novel approach to composed image retrieval that addresses fundamental flaws in standard contrastive...

arXiv cs.AI · Read full article →

Discriminative Perception via Anchored Description for Reasoning Segmentation

Multimodal · March 8, 2026

Researchers from Zhejiang University and Westlake University introduced Discriminative Perception via Anchored Descripti...

arXiv cs.AI · Read full article →

Discriminative Perception via Anchored Description for Reasoning Segmentation

Multimodal · March 8, 2026

Researchers introduced Discriminative Perception via Anchored Description (DPAD), a novel method that improves multimoda...

arXiv cs.AI · Read full article →

Discriminative Perception via Anchored Description for Reasoning Segmentation

Multimodal · March 8, 2026

Researchers from University of Technology Sydney developed Discriminative Perception via Anchored Description (DPAD),...

arXiv cs.AI · Read full article →

Discriminative Perception via Anchored Description for Reasoning Segmentation

Multimodal · March 8, 2026

Researchers developed DPAD (Discriminative Perception via Anchored Description), a novel training framework that improve...

arXiv cs.AI · Read full article →

Discriminative Perception via Anchored Description for Reasoning Segmentation

Multimodal · March 8, 2026

The DPAD (Discriminative Perception via Anchored Description) method addresses unfocused reasoning in AI visual segmenta...

arXiv cs.AI · Read full article →

When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

Multimodal · March 7, 2026

Cornell University researchers developed a diagnostic framework to evaluate how AI vision models interpret ambiguous vis...

arXiv cs.AI · Read full article →

When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

Multimodal · March 7, 2026

A new diagnostic framework evaluates six computer vision models using face pareidolia, the perception of faces in non-fac...

arXiv cs.AI · Read full article →

When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

Multimodal · March 7, 2026

The study 'Face Pareidolia as a Probe of Visual Interpretation Under Ambiguity' introduces a diagnostic framework using ...

arXiv cs.AI · Read full article →

When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

Multimodal · March 7, 2026

A new diagnostic framework evaluates six vision models (CLIP-B/32, CLIP-L/14, LLaVA-1.5-7B, ViT, YOLOv8, and RetinaFace) o...

arXiv cs.AI · Read full article →

When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

Multimodal · March 7, 2026

A diagnostic framework using face pareidolia reveals how six computer vision models interpret ambiguous visual patterns....

arXiv cs.AI · Read full article →

GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery

Multimodal · March 7, 2026

GeoSeg is a novel training-free framework that enables multimodal large language models to perform instruction-grounded ...

arXiv cs.AI · Read full article →

GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery

Multimodal · March 7, 2026

GeoSeg is a novel training-free framework that enables large multimodal models to perform instruction-based segmentation...

arXiv cs.AI · Read full article →

GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery

Multimodal · March 7, 2026

GeoSeg is a novel training-free framework that enables large multimodal models to perform complex, instruction-based obj...

arXiv cs.AI · Read full article →

GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery

Multimodal · March 7, 2026

GeoSeg is a novel training-free framework that enables Multimodal Large Language Models (MLLMs) to perform zero-shot, re...

arXiv cs.AI · Read full article →

BLOCK: An Open-Source Bi-Stage MLLM Character-to-Skin Pipeline for Minecraft

Multimodal · March 7, 2026

BLOCK is an open-source bi-stage AI pipeline that generates functional Minecraft skins from arbitrary character concepts...

arXiv cs.AI · Read full article →

BLOCK: An Open-Source Bi-Stage MLLM Character-to-Skin Pipeline for Minecraft

Multimodal · March 7, 2026

BLOCK is an open-source bi-stage AI pipeline that generates detailed Minecraft character skins from text descriptions. T...

arXiv cs.AI · Read full article →

BLOCK: An Open-Source Bi-Stage MLLM Character-to-Skin Pipeline for Minecraft

Multimodal · March 7, 2026

BLOCK is an open-source, bi-stage AI pipeline that transforms arbitrary character concepts into functional Minecraft ski...

arXiv cs.AI · Read full article →

BLOCK: An Open-Source Bi-Stage MLLM Character-to-Skin Pipeline for Minecraft

Multimodal · March 7, 2026

BLOCK is an open-source bi-stage AI pipeline that transforms arbitrary character concepts into pixel-perfect Minecraft s...

arXiv cs.AI · Read full article →

BLOCK: An Open-Source Bi-Stage MLLM Character-to-Skin Pipeline for Minecraft

Multimodal · March 7, 2026

BLOCK is an open-source bi-stage AI pipeline that transforms textual character descriptions into functional Minecraft sk...

arXiv cs.AI · Read full article →

Towards Generalized Multimodal Homography Estimation

Multimodal · March 7, 2026

Researchers developed a novel training method for homography estimation that significantly improves generalization acros...

arXiv cs.AI · Read full article →

Towards Generalized Multimodal Homography Estimation

Multimodal · March 7, 2026

Researchers have developed a novel method for training homography estimation models that significantly improves generali...

arXiv cs.AI · Read full article →

Towards Generalized Multimodal Homography Estimation

Multimodal · March 7, 2026

Researchers have developed a novel method for training homography estimation models that significantly improves their ab...

arXiv cs.AI · Read full article →

Towards Generalized Multimodal Homography Estimation

Multimodal · March 7, 2026

Researchers have developed a novel training method for multimodal homography estimation that generates synthetic, unalig...

arXiv cs.AI · Read full article →

Cross-Modal Mapping and Dual-Branch Reconstruction for 2D-3D Multimodal Industrial Anomaly Detection

Multimodal · March 7, 2026

CMDR-IAD is a novel unsupervised framework for industrial anomaly detection that achieves state-of-the-art performance o...

arXiv cs.AI · Read full article →

Cross-Modal Mapping and Dual-Branch Reconstruction for 2D-3D Multimodal Industrial Anomaly Detection

Multimodal · March 7, 2026

CMDR-IAD is a novel unsupervised framework for industrial anomaly detection that integrates 2D RGB and 3D geometric data...

arXiv cs.AI · Read full article →

Cross-Modal Mapping and Dual-Branch Reconstruction for 2D-3D Multimodal Industrial Anomaly Detection

Multimodal · March 7, 2026

CMDR-IAD is a novel unsupervised framework for industrial anomaly detection that combines bidirectional cross-modal mapp...

arXiv cs.AI · Read full article →

Cross-Modal Mapping and Dual-Branch Reconstruction for 2D-3D Multimodal Industrial Anomaly Detection

Multimodal · March 7, 2026

CMDR-IAD is a novel unsupervised framework for multimodal industrial anomaly detection that fuses 2D RGB and 3D geometri...

arXiv cs.AI · Read full article →

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

Multimodal · March 7, 2026

Researchers have identified 'Lazy Attention Localization' as a fundamental flaw in multimodal AI training where models f...

arXiv cs.AI · Read full article →

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

Multimodal · March 7, 2026

Researchers introduced Visual Attention Score (VAS) to quantify multimodal model attention to visual tokens, finding str...

arXiv cs.AI · Read full article →

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

Multimodal · March 7, 2026

This research introduces the Visual Attention Score (VAS), an attention-based metric showing strong correlation (r=0.961...

arXiv cs.AI · Read full article →

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

Multimodal · March 7, 2026

The discovery of "Lazy Attention Localization" in multimodal AI training reveals a fundamental inefficiency in how model...

arXiv cs.AI · Read full article →

Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

Multimodal · March 7, 2026

Research from the University of Texas at Austin demonstrates that large-scale Vision-Language-Action (VLA) models exhibi...

arXiv cs.AI · Read full article →

Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

Multimodal · March 7, 2026

Research from the University of Texas at Austin reveals that large-scale Vision-Language-Action (VLA) models demonstrate...

arXiv cs.AI · Read full article →

Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

Multimodal · March 7, 2026

Large-scale Vision-Language-Action (VLA) models demonstrate remarkable resistance to catastrophic forgetting in continua...

arXiv cs.AI · Read full article →

Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

Multimodal · March 7, 2026

Research from the University of Texas at Austin demonstrates that large-scale pretrained Vision-Language-Action (VLA) mo...

arXiv cs.AI · Read full article →

Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

Multimodal · March 7, 2026

Google DeepMind research demonstrates that large-scale pretrained Vision-Language-Action (VLA) models exhibit remarkable...

arXiv cs.AI · Read full article →