Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding

Video TokenCom is a novel framework for semantic-aware video transmission that uses discrete tokens as unified units for computation and communication. The system integrates user-provided textual descriptions with adaptive source-channel coding and Unequal Error Protection (UEP) to prioritize semantic fidelity under bandwidth constraints. Experimental validation shows it outperforms traditional and semantic communication baselines across various SNR conditions.

Token Communication: A New Paradigm for Semantic-Aware Video Transmission

Researchers have introduced Video TokenCom, a framework for next-generation wireless networks built on the Token Communication (TokenCom) paradigm, which treats discrete tokens as the unified unit of both computation and communication. The framework targets intent-driven, multi-rate video transmission with Unequal Error Protection (UEP). It combines user-provided textual descriptions with adaptive source-channel coding to prioritize semantic fidelity, so that the visual content the user cares about is preserved even under severe bandwidth constraints. The authors report that it outperforms both traditional and semantic communication baselines.

Bridging Language and Vision for Intent-Driven Communication

The proposed framework begins by transforming a video stream into discrete semantic tokens using a pretrained video tokenizer. The core innovation lies in its use of text-conditioned vision-language modeling combined with optical-flow propagation. This joint process takes the user's textual intent, such as "a person running", and uses it to identify the corresponding video tokens and track them across both space and time. The system can then distinguish tokens that are semantically critical to the user's request from those that are peripheral.
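The tracking step can be illustrated with a minimal sketch. It assumes the vision-language model has already produced a boolean relevance mask over the token grid of one frame, and that optical flow has been reduced to an integer displacement per token. The names `propagate_mask`, `mask_t`, and `flow_t` are hypothetical and are not taken from the paper, whose actual propagation method may differ.

```python
from typing import List, Tuple

Mask = List[List[bool]]
# Assumed representation: one (dy, dx) displacement per token, in token units.
Flow = List[List[Tuple[int, int]]]


def propagate_mask(mask: Mask, flow: Flow) -> Mask:
    """Forward-warp a token relevance mask from frame t to frame t+1."""
    h, w = len(mask), len(mask[0])
    nxt = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            dy, dx = flow[y][x]
            ny, nx = y + dy, x + dx
            # Tokens that move outside the grid are simply dropped.
            if 0 <= ny < h and 0 <= nx < w:
                nxt[ny][nx] = True
    return nxt


# Hypothetical 3x4 token grid: the text-matched object covers two tokens.
mask_t = [
    [False, False, False, False],
    [False, True, True, False],
    [False, False, False, False],
]
# Every token moves one position to the right between frames.
flow_t = [[(0, 1)] * 4 for _ in range(3)]
mask_t1 = propagate_mask(mask_t, flow_t)
```

In this toy case the two relevant tokens shift one column to the right, so the text-conditioned model does not need to be rerun on the next frame to know where the intended content is.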

Following token identification, the system applies a semantic-aware multi-rate bit-allocation strategy. Tokens deemed highly relevant to the user's described intent are encoded with full codebook precision to maintain detail. In contrast, non-intended or background tokens are represented through a reduced-precision differential encoding scheme. This prioritization lowers the bit rate while preserving the perceived semantic quality of the intended content.
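The rate effect of this allocation can be estimated with simple arithmetic. The sketch below assumes a full-precision token costs enough bits to index the whole codebook, and that each background token costs a fixed, smaller number of bits. The codebook size of 1024, the 4-bit background budget, and the 20/80 split are illustrative assumptions, not figures from the paper, and the differential encoding itself is not modeled.

```python
from typing import List, Tuple


def full_precision_bits(codebook_size: int) -> int:
    """Bits needed to index any entry of a codebook with `codebook_size` entries."""
    return max(1, (codebook_size - 1).bit_length())


def frame_bits(relevant: List[bool], codebook_size: int,
               background_bits: int) -> Tuple[int, int]:
    """Return (multi-rate bits, uniform full-precision bits) for one frame."""
    full = full_precision_bits(codebook_size)
    n_intended = sum(1 for r in relevant if r)
    n_background = len(relevant) - n_intended
    multi_rate = n_intended * full + n_background * background_bits
    uniform = len(relevant) * full
    return multi_rate, uniform


# Assumed example: 100 tokens per frame, 20 of them match the user's intent.
relevant = [True] * 20 + [False] * 80
multi_rate, uniform = frame_bits(relevant, codebook_size=1024, background_bits=4)
saving = 1 - multi_rate / uniform  # fraction of source bits saved
```

Under these assumed numbers the frame costs 520 bits instead of 1000, a 48% reduction, while every intended token keeps its full 10-bit index.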

Dynamic Adaptation to Network Conditions

To remain robust in real-world scenarios, the framework incorporates a dynamic source and channel coding adaptation scheme. This component continuously monitors available resources and link conditions, such as signal-to-noise ratio (SNR), and adapts both the bit allocation for source tokens and the level of channel coding protection accordingly. As a result, high-priority semantic tokens receive stronger error protection when channel conditions deteriorate.
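One simple way to picture UEP-based adaptation is a lookup from measured SNR to a pair of channel code rates, one per priority class, where a lower rate means more redundancy. The thresholds and rates in `UEP_TABLE` below are invented for illustration and do not come from the paper, which may use a different adaptation rule entirely.

```python
import math
from fractions import Fraction

# Hypothetical table: (minimum SNR in dB, code rate for intended tokens,
# code rate for background tokens). Lower rate = stronger protection.
UEP_TABLE = [
    (15.0, Fraction(5, 6), Fraction(5, 6)),
    (8.0, Fraction(2, 3), Fraction(5, 6)),
    (0.0, Fraction(1, 2), Fraction(3, 4)),
]
# Used when the SNR falls below every threshold in the table.
FALLBACK = (Fraction(1, 3), Fraction(2, 3))


def select_code_rates(snr_db: float):
    """Pick (intended_rate, background_rate) for the current link SNR."""
    for min_snr, r_intended, r_background in UEP_TABLE:
        if snr_db >= min_snr:
            return r_intended, r_background
    return FALLBACK


def channel_bits(source_bits: int, rate: Fraction) -> int:
    """Coded bits needed to carry `source_bits` at the given code rate."""
    return math.ceil(source_bits / rate)


# Example: 200 intended-token bits and 320 background bits at 3 dB SNR.
r_int, r_bg = select_code_rates(3.0)
total_coded = channel_bits(200, r_int) + channel_bits(320, r_bg)
```

In every row the intended tokens get a code rate no higher than the background tokens, so protection is unequal only when the channel is poor enough to need it; at high SNR both classes share the same light coding.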

Experiments across several video datasets are reported to support the design. Video TokenCom achieved better perceptual and semantic quality metrics than conventional and existing semantic communication baselines, and it remained reliable across a wide range of SNR conditions.

Why This Matters: The Future of Efficient Networks

This research, detailed in the preprint arXiv:2603.02470v1, represents a concrete step toward the TokenCom paradigm inspired by Large AI Models (LAMs). It moves beyond merely transmitting bits to communicating actionable, goal-oriented information.

  • Bandwidth Efficiency: By transmitting only what matters to the user, networks can support higher-quality services without proportional increases in bandwidth.
  • User-Centric Design: It pioneers a shift where communication systems directly interpret and fulfill human intent, enabling more natural human-machine interaction.
  • Robust Performance: The integration of UEP and dynamic adaptation makes semantic communication practical for fluctuating real-world wireless environments, a key hurdle for deployment.
