Multimodal AI

Technical and application advances in multimodal large models that fuse text, images, audio, and video.

Multimodal

Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion

arXiv:2602.21646v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have achieved notable success i...

Multimodal

Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics

arXiv:2602.19674v2 Announce Type: replace-cross Abstract: Remote monitoring of heart failure (HF) via speech signals pro...

Multimodal

Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers

arXiv:2602.18022v2 Announce Type: replace-cross Abstract: Training-free control over editing intensity is a critical req...

Multimodal

Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection

arXiv:2602.17484v2 Announce Type: replace-cross Abstract: Image Copy Detection (ICD) aims to identify manipulated conten...

Multimodal

Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT

arXiv:2602.10359v2 Announce Type: replace-cross Abstract: Purpose: Translating foundation models into clinical practice ...

Multimodal

FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

arXiv:2601.08026v3 Announce Type: replace-cross Abstract: Scientific compound figures combine multiple labeled panels in...

Multimodal

KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification

arXiv:2512.09069v2 Announce Type: replace-cross Abstract: Age-related macular degeneration (AMD) and choroidal neovascul...

Multimodal

Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

arXiv:2512.08639v2 Announce Type: replace-cross Abstract: Aerial Vision-and-Language Navigation (VLN) aims to enable unm...

Multimodal

RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation

arXiv:2511.06899v3 Announce Type: replace-cross Abstract: Large Vision-Language Models (LVLMs) excel in multimodal reaso...

Multimodal

Uncovering Grounding IDs: How External Cues Shape Multimodal Binding

arXiv:2509.24072v4 Announce Type: replace-cross Abstract: Large vision-language models (LVLMs) show strong performance a...

Multimodal

Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

arXiv:2509.23744v2 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) promise enhanced reas...

Multimodal

EO-1: An Open Unified Embodied Foundation Model for General Robot Control

arXiv:2508.21112v5 Announce Type: replace-cross Abstract: The human ability to seamlessly perform multimodal reasoning a...

Multimodal

Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection

arXiv:2506.01085v2 Announce Type: replace-cross Abstract: Instruction tuning has been central to the success of recent v...

Multimodal

Renaissance: Investigating the Pretraining of Vision-Language Encoders

arXiv:2411.06657v2 Announce Type: replace-cross Abstract: In the past several years there has been an explosion of avail...

Multimodal

Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models

arXiv:2406.17115v3 Announce Type: replace-cross Abstract: Despite the outstanding performance in multimodal tasks, Large...

Multimodal

NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors

arXiv:2602.22144v1 Announce Type: cross Abstract: Object hallucination is a critical issue in Large Vision-Language Mode...

Multimodal

TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition

arXiv:2602.22039v1 Announce Type: cross Abstract: Low-resource automatic speech recognition (ASR) continues to pose sign...

Multimodal

A Framework for Cross-Domain Generalization in Coronary Artery Calcium Scoring Across Gated and Non-Gated Computed Tomography

arXiv:2602.21935v1 Announce Type: cross Abstract: Coronary artery calcium (CAC) scoring is a key predictor of cardiovasc...

Multimodal

DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

arXiv:2602.21864v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) have emerged as versatile solutions for ...

Multimodal

Excitation: Momentum For Experts

arXiv:2602.21798v1 Announce Type: cross Abstract: We propose Excitation, a novel optimization framework designed to acce...

Multimodal

Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models

arXiv:2602.21779v1 Announce Type: cross Abstract: Current Vision-Language Models (VLMs) for deepfake detection excel at ...

Multimodal

UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation

arXiv:2602.21772v1 Announce Type: cross Abstract: A universal audio representation should capture fine-grained speech cu...

Multimodal

Generalisation of RLHF under Reward Shift and Clipped KL Regularisation

arXiv:2602.21765v1 Announce Type: cross Abstract: Alignment and adaptation in large language models heavily rely on rein...

Multimodal

Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models

arXiv:2602.21704v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) exhibit outstanding performance o...

Multimodal

CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

arXiv:2602.21655v1 Announce Type: cross Abstract: Image captioning remains a fundamental task for vision language unders...

Multimodal

Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration

arXiv:2602.21647v1 Announce Type: cross Abstract: This paper presents and evaluates an optimized cascaded Nepali speech-...

Multimodal

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

arXiv:2602.21531v1 Announce Type: cross Abstract: General-purpose robots must master long-horizon manipulation, defined ...

Multimodal

A Knowledge-Driven Approach to Music Segmentation, Music Source Separation and Cinematic Audio Source Separation

arXiv:2602.21476v1 Announce Type: cross Abstract: We propose a knowledge-driven, model-based approach to segmenting audi...

Multimodal

Adversarial Robustness of Deep Learning-Based Thyroid Nodule Segmentation in Ultrasound

arXiv:2602.21452v1 Announce Type: cross Abstract: Introduction: Deep learning-based segmentation models are increasingly...

Multimodal

Causal Decoding for Hallucination-Resistant Multimodal Large Language Models

arXiv:2602.21441v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) deliver detailed responses on...

Multimodal

AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression

arXiv:2602.21233v1 Announce Type: cross Abstract: This technical report introduces AngelSlim, a comprehensive and versat...

Multimodal

Architecture-Agnostic Curriculum Learning for Document Understanding: Empirical Evidence from Text-Only and Multimodal

arXiv:2602.21225v1 Announce Type: cross Abstract: We investigate whether progressive data scheduling -- a curriculum lea...

Multimodal

Trump claims tech companies will sign deals next week to pay for their own power supply

President Donald Trump tried to quell Americans' concerns about rising electricity costs during his State of the Union s...

Multimodal

ICLR 2026 Oral | When Emotion Recognition Is No Longer a Classification Task: EmotionThinker Teaches SpeechLLM to "Explain Emotions"

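Speech Emotion Recognition (SER) has so far largely followed a single paradigm: speech in, emotion label out. This setup works in engineering terms, but it is an oversimplification at the cognitive level. In human communication, judging emotion is never a matter of "picking a label"...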

Multimodal

Transforming Science Learning Materials in the Era of Artificial Intelligence

arXiv:2602.18470v2 Announce Type: replace-cross Abstract: The integration of artificial intelligence (AI) into science e...

Multimodal

AI-Mediated Feedback Improves Student Revisions: A Randomized Trial with FeedbackWriter in a Large Undergraduate Course

arXiv:2602.16820v2 Announce Type: replace-cross Abstract: Despite growing interest in using LLMs to generate feedback on...

Multimodal

Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction

arXiv:2602.13314v3 Announce Type: replace-cross Abstract: Millimeter-wave (mmWave) radar provides reliable perception in...

Multimodal

CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models

arXiv:2602.02620v2 Announce Type: replace-cross Abstract: Cryo-electron microscopy (cryo-EM) has revolutionized structur...

Multimodal

Generating metamers of human scene understanding

arXiv:2601.11675v3 Announce Type: replace-cross Abstract: Human vision combines low-resolution "gist" information from t...

Multimodal

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

arXiv:2601.10611v2 Announce Type: replace-cross Abstract: Today's strongest video-language models (VLMs) remain propriet...

Multimodal

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

arXiv:2601.09708v2 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) tasks require reasoning over comp...

Multimodal

On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

arXiv:2510.00037v4 Announce Type: replace-cross Abstract: In Vision-Language-Action (VLA) models, robustness to real-wor...

Multimodal

Synthesis of discrete-continuous quantum circuits with multimodal diffusion models

arXiv:2506.01666v2 Announce Type: replace-cross Abstract: Efficiently compiling quantum operations remains a major bottl...

Multimodal

Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence

arXiv:2502.17028v3 Announce Type: replace-cross Abstract: Vision-language alignment is crucial for various downstream ta...

Multimodal

MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents

arXiv:2511.23055v2 Announce Type: replace Abstract: Theory of Mind (ToM) refers to the ability to infer others' mental s...

Multimodal

VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation

arXiv:2602.21054v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) frequently hallucinate, limiting ...

Multimodal

Multimodal MRI Report Findings Supervised Brain Lesion Segmentation with Substructures

arXiv:2602.20994v1 Announce Type: cross Abstract: Report-supervised (RSuper) learning seeks to alleviate the need for de...

Multimodal

Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

arXiv:2602.20981v1 Announce Type: cross Abstract: Scaling multimodal alignment between video and audio is challenging, p...

Multimodal

Training-Free Intelligibility-Guided Observation Addition for Noisy ASR

arXiv:2602.20967v1 Announce Type: cross Abstract: Automatic speech recognition (ASR) degrades severely in noisy environm...

Multimodal

EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations

arXiv:2602.20958v1 Announce Type: cross Abstract: Search and rescue (SAR) operations require rapid responses to save liv...

Multimodal

E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications

arXiv:2602.20877v1 Announce Type: cross Abstract: Multimodal recommender systems (MMRSs) enhance collaborative filtering...

Multimodal

SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing

arXiv:2602.20751v1 Announce Type: cross Abstract: Designing aligned and robust rewards for open-ended generation remains...

Multimodal

Communication-Inspired Tokenization for Structured Image Representations

arXiv:2602.20731v1 Announce Type: cross Abstract: Discrete image tokenizers have emerged as a key component of modern vi...

Multimodal

Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video

arXiv:2602.20658v1 Announce Type: cross Abstract: Manual lifting tasks are a major contributor to work-related musculosk...

Multimodal

Enhancing Hate Speech Detection on Social Media: A Comparative Analysis of Machine Learning Models and Text Transformation Approaches

arXiv:2602.20634v1 Announce Type: cross Abstract: The proliferation of hate speech on social media platforms has necessi...

Multimodal

How Do Inpainting Artifacts Propagate to Language?

arXiv:2602.20520v1 Announce Type: cross Abstract: We study how visual artifacts introduced by diffusion-based inpainting...

Multimodal

Learning During Detection: Continual Learning for Neural OFDM Receivers via DMRS

arXiv:2602.20361v1 Announce Type: cross Abstract: Deep neural networks (DNNs) have been increasingly explored for receiv...

Multimodal

Hierarchical Molecular Representation Learning via Fragment-Based Self-Supervised Embedding Prediction

arXiv:2602.20344v1 Announce Type: cross Abstract: Graph self-supervised learning (GSSL) has demonstrated strong potentia...

Multimodal

Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking

arXiv:2602.20330v1 Announce Type: cross Abstract: Vision-language models (VLMs) are powerful but remain opaque black box...

Multimodal

Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory

arXiv:2602.20323v1 Announce Type: cross Abstract: Reliable object manipulation requires understanding physical propertie...