Multimodal AI
Advances in the technology and applications of large multimodal models that fuse text, images, audio, and video.
Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
arXiv:2602.21646v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have achieved notable success i...
Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
arXiv:2602.19674v2 Announce Type: replace-cross Abstract: Remote monitoring of heart failure (HF) via speech signals pro...
Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers
arXiv:2602.18022v2 Announce Type: replace-cross Abstract: Training-free control over editing intensity is a critical req...
Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection
arXiv:2602.17484v2 Announce Type: replace-cross Abstract: Image Copy Detection (ICD) aims to identify manipulated conten...
Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT
arXiv:2602.10359v2 Announce Type: replace-cross Abstract: Purpose: Translating foundation models into clinical practice ...
FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
arXiv:2601.08026v3 Announce Type: replace-cross Abstract: Scientific compound figures combine multiple labeled panels in...
KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification
arXiv:2512.09069v2 Announce Type: replace-cross Abstract: Age-related macular degeneration (AMD) and choroidal neovascul...
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
arXiv:2512.08639v2 Announce Type: replace-cross Abstract: Aerial Vision-and-Language Navigation (VLN) aims to enable unm...
RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
arXiv:2511.06899v3 Announce Type: replace-cross Abstract: Large Vision-Language Models (LVLMs) excel in multimodal reaso...
Uncovering Grounding IDs: How External Cues Shape Multimodal Binding
arXiv:2509.24072v4 Announce Type: replace-cross Abstract: Large vision-language models (LVLMs) show strong performance a...
Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
arXiv:2509.23744v2 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) promise enhanced reas...
EO-1: An Open Unified Embodied Foundation Model for General Robot Control
arXiv:2508.21112v5 Announce Type: replace-cross Abstract: The human ability to seamlessly perform multimodal reasoning a...
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
arXiv:2506.01085v2 Announce Type: replace-cross Abstract: Instruction tuning has been central to the success of recent v...
Renaissance: Investigating the Pretraining of Vision-Language Encoders
arXiv:2411.06657v2 Announce Type: replace-cross Abstract: In the past several years there has been an explosion of avail...
Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models
arXiv:2406.17115v3 Announce Type: replace-cross Abstract: Despite the outstanding performance in multimodal tasks, Large...
NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
arXiv:2602.22144v1 Announce Type: cross Abstract: Object hallucination is a critical issue in Large Vision-Language Mode...
TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
arXiv:2602.22039v1 Announce Type: cross Abstract: Low-resource automatic speech recognition (ASR) continues to pose sign...
A Framework for Cross-Domain Generalization in Coronary Artery Calcium Scoring Across Gated and Non-Gated Computed Tomography
arXiv:2602.21935v1 Announce Type: cross Abstract: Coronary artery calcium (CAC) scoring is a key predictor of cardiovasc...
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
arXiv:2602.21864v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) have emerged as versatile solutions for ...
Excitation: Momentum For Experts
arXiv:2602.21798v1 Announce Type: cross Abstract: We propose Excitation, a novel optimization framework designed to acce...
Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
arXiv:2602.21779v1 Announce Type: cross Abstract: Current Vision-Language Models (VLMs) for deepfake detection excel at ...
UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation
arXiv:2602.21772v1 Announce Type: cross Abstract: A universal audio representation should capture fine-grained speech cu...
Generalisation of RLHF under Reward Shift and Clipped KL Regularisation
arXiv:2602.21765v1 Announce Type: cross Abstract: Alignment and adaptation in large language models heavily rely on rein...
Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
arXiv:2602.21704v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) exhibit outstanding performance o...
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
arXiv:2602.21655v1 Announce Type: cross Abstract: Image captioning remains a fundamental task for vision language unders...
Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration
arXiv:2602.21647v1 Announce Type: cross Abstract: This paper presents and evaluates an optimized cascaded Nepali speech-...
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
arXiv:2602.21531v1 Announce Type: cross Abstract: General-purpose robots must master long-horizon manipulation, defined ...
A Knowledge-Driven Approach to Music Segmentation, Music Source Separation and Cinematic Audio Source Separation
arXiv:2602.21476v1 Announce Type: cross Abstract: We propose a knowledge-driven, model-based approach to segmenting audi...
Adversarial Robustness of Deep Learning-Based Thyroid Nodule Segmentation in Ultrasound
arXiv:2602.21452v1 Announce Type: cross Abstract: Introduction: Deep learning-based segmentation models are increasingly...
Causal Decoding for Hallucination-Resistant Multimodal Large Language Models
arXiv:2602.21441v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) deliver detailed responses on...
AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression
arXiv:2602.21233v1 Announce Type: cross Abstract: This technical report introduces AngelSlim, a comprehensive and versat...
Architecture-Agnostic Curriculum Learning for Document Understanding: Empirical Evidence from Text-Only and Multimodal
arXiv:2602.21225v1 Announce Type: cross Abstract: We investigate whether progressive data scheduling -- a curriculum lea...
Trump claims tech companies will sign deals next week to pay for their own power supply
President Donald Trump tried to quell Americans' concerns about rising electricity costs during his State of the Union s...
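ICLR 2026 Oral | When Emotion Recognition Is No Longer a Classification Problem: EmotionThinker Teaches SpeechLLMs to "Explain Emotions"
Speech Emotion Recognition (SER) has largely followed a single paradigm: speech in, emotion label out. This setup works in engineering terms but is oversimplified at the cognitive level. In human communication, judging emotion has never been a matter of "label selection"...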
Transforming Science Learning Materials in the Era of Artificial Intelligence
arXiv:2602.18470v2 Announce Type: replace-cross Abstract: The integration of artificial intelligence (AI) into science e...
AI-Mediated Feedback Improves Student Revisions: A Randomized Trial with FeedbackWriter in a Large Undergraduate Course
arXiv:2602.16820v2 Announce Type: replace-cross Abstract: Despite growing interest in using LLMs to generate feedback on...
Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction
arXiv:2602.13314v3 Announce Type: replace-cross Abstract: Millimeter-wave (mmWave) radar provides reliable perception in...
CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models
arXiv:2602.02620v2 Announce Type: replace-cross Abstract: Cryo-electron microscopy (cryo-EM) has revolutionized structur...
Generating metamers of human scene understanding
arXiv:2601.11675v3 Announce Type: replace-cross Abstract: Human vision combines low-resolution "gist" information from t...
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
arXiv:2601.10611v2 Announce Type: replace-cross Abstract: Today's strongest video-language models (VLMs) remain propriet...
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
arXiv:2601.09708v2 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) tasks require reasoning over comp...
On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
arXiv:2510.00037v4 Announce Type: replace-cross Abstract: In Vision-Language-Action (VLA) models, robustness to real-wor...
Synthesis of discrete-continuous quantum circuits with multimodal diffusion models
arXiv:2506.01666v2 Announce Type: replace-cross Abstract: Efficiently compiling quantum operations remains a major bottl...
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
arXiv:2502.17028v3 Announce Type: replace-cross Abstract: Vision-language alignment is crucial for various downstream ta...
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
arXiv:2511.23055v2 Announce Type: replace Abstract: Theory of Mind (ToM) refers to the ability to infer others' mental s...
VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
arXiv:2602.21054v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) frequently hallucinate, limiting ...
Multimodal MRI Report Findings Supervised Brain Lesion Segmentation with Substructures
arXiv:2602.20994v1 Announce Type: cross Abstract: Report-supervised (RSuper) learning seeks to alleviate the need for de...
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
arXiv:2602.20981v1 Announce Type: cross Abstract: Scaling multimodal alignment between video and audio is challenging, p...
Training-Free Intelligibility-Guided Observation Addition for Noisy ASR
arXiv:2602.20967v1 Announce Type: cross Abstract: Automatic speech recognition (ASR) degrades severely in noisy environm...
EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations
arXiv:2602.20958v1 Announce Type: cross Abstract: Search and rescue (SAR) operations require rapid responses to save liv...
E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications
arXiv:2602.20877v1 Announce Type: cross Abstract: Multimodal recommender systems (MMRSs) enhance collaborative filtering...
SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
arXiv:2602.20751v1 Announce Type: cross Abstract: Designing aligned and robust rewards for open-ended generation remains...
Communication-Inspired Tokenization for Structured Image Representations
arXiv:2602.20731v1 Announce Type: cross Abstract: Discrete image tokenizers have emerged as a key component of modern vi...
Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video
arXiv:2602.20658v1 Announce Type: cross Abstract: Manual lifting tasks are a major contributor to work-related musculosk...
Enhancing Hate Speech Detection on Social Media: A Comparative Analysis of Machine Learning Models and Text Transformation Approaches
arXiv:2602.20634v1 Announce Type: cross Abstract: The proliferation of hate speech on social media platforms has necessi...
How Do Inpainting Artifacts Propagate to Language?
arXiv:2602.20520v1 Announce Type: cross Abstract: We study how visual artifacts introduced by diffusion-based inpainting...
Learning During Detection: Continual Learning for Neural OFDM Receivers via DMRS
arXiv:2602.20361v1 Announce Type: cross Abstract: Deep neural networks (DNNs) have been increasingly explored for receiv...
Hierarchical Molecular Representation Learning via Fragment-Based Self-Supervised Embedding Prediction
arXiv:2602.20344v1 Announce Type: cross Abstract: Graph self-supervised learning (GSSL) has demonstrated strong potentia...
Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking
arXiv:2602.20330v1 Announce Type: cross Abstract: Vision-language models (VLMs) are powerful but remain opaque black box...
Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory
arXiv:2602.20323v1 Announce Type: cross Abstract: Reliable object manipulation requires understanding physical propertie...