Artic: A Framework for Low-Latency AI Video Chat

TLDR: The research paper introduces Artic, a new framework for AI Video Chat that addresses latency challenges by shifting focus from human video perception to AI video understanding. Artic employs context-aware video streaming to reduce bitrate by prioritizing important video regions, and a loss-resilient adaptive frame rate to minimize retransmissions. It also proposes DeViBench, a benchmark for evaluating MLLM accuracy under varying video quality.

Real-time communication has taken a surprising turn with the emergence of AI Video Chat, where one participant is not a human but a Multimodal Large Language Model (MLLM). This new paradigm aims to make interactions with AI feel as intuitive as face-to-face conversations. However, this shift introduces significant challenges, particularly concerning latency, as the time-consuming MLLM inference process leaves minimal room for video streaming.

The research paper “Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI” by Jiangkai Wu, Zhiyuan Ren, Liming Liu, and Xinggong Zhang of Peking University introduces Artic, an AI-oriented Real-time Communication framework designed to tackle these challenges. Artic redefines network requirements, moving from the traditional focus on “humans watching video” to “AI understanding video.”

The Core Problem: Latency in AI Video Chat

Traditional real-time communication (RTC) systems like WebRTC are optimized for human-to-human interaction, where the human on the other end can respond almost instantly. In such scenarios, transmission latency accounts for most of the end-to-end delay. With AI Video Chat, however, MLLMs generate responses autoregressively, which is computationally intensive. Even for audio-only inputs, the processing latency can be hundreds of milliseconds. To maintain a fluent interactive experience, the total end-to-end latency needs to stay below 300 ms. Once inference time is subtracted, very little of that budget remains for video transmission, so network uncertainty and instability become critical bottlenecks.
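
As a rough illustration of how tight this budget is, the sketch below subtracts an assumed inference time from the 300 ms target and converts what is left into the largest frame that fits through an uplink. The function names are hypothetical, and the example numbers used afterwards (250 ms inference, 20 ms one-way delay, 1000 kbps uplink) are illustrative assumptions, not measurements from the paper. Encoding time is ignored for simplicity.

```python
def transmission_budget_ms(total_budget_ms: float, inference_ms: float) -> float:
    """Time left for everything except MLLM inference (never negative)."""
    return max(0.0, total_budget_ms - inference_ms)


def max_frame_bits(budget_ms: float, one_way_delay_ms: float, uplink_kbps: float) -> float:
    """Largest frame, in bits, that can be fully received within the budget.

    Assumes a fixed one-way propagation delay and a constant uplink rate.
    kbps * ms = bits, because 1000 bit/s * 0.001 s = 1 bit.
    """
    sending_time_ms = max(0.0, budget_ms - one_way_delay_ms)
    return uplink_kbps * sending_time_ms
```

With the assumed numbers, `transmission_budget_ms(300, 250)` leaves 50 ms, and `max_frame_bits(50, 20, 1000)` gives 30000 bits, so a single retransmission round trip could consume the whole remaining budget.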

Artic’s Innovative Solutions

Artic addresses these issues with three key contributions:

1. Context-Aware Video Streaming

One of Artic’s main innovations is its Context-Aware Video Streaming. In traditional video streaming, reducing bitrate (to lower latency) often means degrading overall video quality. However, for MLLMs, not all parts of a video are equally important. Artic recognizes that the MLLM’s accuracy depends on the current chat context. For example, if a user asks about a score in a game, the scoreboard region is crucial, while other areas might be less relevant. Artic leverages models like CLIP (Contrastive Language-Image Pre-Training) to compute the semantic correlation between user words and different video regions. It then allocates more bitrate to the regions most correlated with the user’s words, the “chat-important” regions, and significantly less to irrelevant ones. This approach reduces the overall bitrate while maintaining MLLM accuracy, because the AI still receives high-quality information for the parts of the video that matter to the question.
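
The allocation idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the text and region embeddings have already been produced by a CLIP-like model, and it assumes a simple linear mapping from relevance to a quantization parameter (QP), where a lower QP means higher quality. The function names and the `qp_min`/`qp_max` defaults are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def allocate_qp(text_emb, region_embs, qp_min=20, qp_max=45):
    """Map each region's relevance to the user's words onto a QP value.

    The most relevant region gets qp_min (best quality), the least relevant
    gets qp_max. The linear mapping is an assumption made for illustration.
    """
    sims = [cosine_similarity(text_emb, r) for r in region_embs]
    lo, hi = min(sims), max(sims)
    span = hi - lo
    qps = []
    for s in sims:
        # Normalize to [0, 1]; if all regions tie, treat them all as important.
        importance = (s - lo) / span if span > 0 else 1.0
        qps.append(round(qp_max - importance * (qp_max - qp_min)))
    return qps
```

For a toy query embedding `[1, 0]` and region embeddings `[[1, 0], [0, 1], [1, 1]]`, `allocate_qp` returns `[20, 45, 27]`: the region aligned with the query is encoded at the best quality, the orthogonal one at the worst.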

2. Loss-Resilient Adaptive Frame Rate

Packet loss is another major contributor to latency, as lost packets often require retransmission. Artic introduces a Loss-Resilient Adaptive Frame Rate mechanism. MLLMs typically process videos at very low frame rates (e.g., 2 frames per second), even if the client transmits at a higher rate (e.g., 30 FPS). This means many received frames are redundant for the MLLM’s immediate processing. Artic exploits this redundancy: if an expected frame is lost or delayed, the MLLM can simply use a previously received, redundant frame as a substitute, without waiting for retransmission. This effectively turns higher frame rates into a form of Forward Error Correction (FEC) that improves resilience to packet loss. To avoid wasting bitrate on redundancy that is not needed, Artic adapts the frame rate to the current loss rate.
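
A minimal sketch of both halves of this mechanism is shown below. It assumes frames are lost independently with probability `loss_rate`, so that the chance of losing all `k` candidate frames for one MLLM sample is `loss_rate ** k`; the paper's actual adaptation rule may differ. The function names, the 1% miss target, and the 30 FPS cap are hypothetical choices for illustration.

```python
def min_send_fps(loss_rate, mllm_fps=2, target_miss=0.01, max_fps=30):
    """Smallest send rate whose redundancy keeps the miss probability low.

    Sends `copies` frames per MLLM sample, choosing the fewest copies such
    that loss_rate ** copies <= target_miss, capped at max_fps.
    """
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    max_copies = max_fps // mllm_fps
    copies = 1
    while copies < max_copies and loss_rate ** copies > target_miss:
        copies += 1
    return copies * mllm_fps


def pick_frame(received_ts, sample_ts, max_age_s):
    """Return the newest received frame timestamp usable for this sample.

    A frame is usable if it arrived no later than the sample time and is at
    most max_age_s old. Returns None if every candidate was lost.
    """
    candidates = [t for t in received_ts if sample_ts - max_age_s <= t <= sample_ts]
    return max(candidates) if candidates else None
```

Under these assumptions, a loss-free link needs only 2 FPS, a 20% loss rate calls for 6 FPS, and a 50% loss rate calls for 14 FPS. On the receiver side, `pick_frame([0.0, 0.25, 0.75], 0.5, 0.5)` returns `0.25`, the newest frame that arrived in time.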

3. DeViBench: A New Benchmark for AI Video Chat Quality

To properly evaluate the impact of video streaming quality on MLLM accuracy, the authors introduce the Degraded Video Understanding Benchmark (DeViBench), which they describe as the first benchmark of its kind. Existing benchmarks for MLLMs focus on testing their intelligence with high-quality videos and often feature simple, high-level questions. DeViBench instead automatically constructs quality-sensitive Question-Answer (QA) samples. It transcodes videos to low-bitrate versions and uses MLLMs themselves to generate and filter QA pairs that are specifically challenging under degraded video quality. The resulting benchmark reflects real-world scenarios in which subtle visual details are crucial for MLLM understanding.
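
One plausible reading of the filtering step is sketched below: keep a QA pair only if the model answers it correctly on the original video but incorrectly on the low-bitrate version. This criterion is an assumption made for illustration, and the actual DeViBench pipeline may use different rules. `answer_fn` is a hypothetical stand-in for a call to an MLLM, and `QASample` is a hypothetical container.

```python
from typing import Any, Callable, Iterable, List, NamedTuple


class QASample(NamedTuple):
    question: str
    answer: str


def filter_quality_sensitive(
    samples: Iterable[QASample],
    answer_fn: Callable[[Any, str], str],
    original_video: Any,
    degraded_video: Any,
) -> List[QASample]:
    """Keep samples answered correctly at high quality but not at low quality."""
    kept = []
    for sample in samples:
        correct_original = answer_fn(original_video, sample.question) == sample.answer
        correct_degraded = answer_fn(degraded_video, sample.question) == sample.answer
        if correct_original and not correct_degraded:
            kept.append(sample)
    return kept
```

Questions that the model answers correctly at both quality levels are discarded because they say nothing about the effect of streaming quality, which matches the article's point that simple, high-level questions are not enough.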

Key Differences from Traditional RTC

The paper highlights fundamental differences between AI Video Chat and traditional RTC:

The Quality of Experience (QoE) shifts from human perceptual quality (e.g., visual clarity, temporal stability) to MLLM response accuracy and latency.
The receiver (MLLM) throughput is significantly lower than the sender (user) throughput, as MLLMs downsample video inputs considerably.
Uplink (user sending video to cloud-based MLLM) becomes more critical than downlink, as MLLMs typically respond with lower-bitrate audio or text.

Future Directions

The researchers also discuss several open questions and future work. These include developing proactive context-aware mechanisms that can identify important video regions even without explicit user words, exploring MLLM long-term memory by creating semantic layered video streaming, and implementing context-aware token pruning to further accelerate MLLM inference by removing irrelevant visual tokens.

Artic represents a significant step forward in optimizing real-time communication for the age of AI, paving the way for more natural and responsive interactions between humans and intelligent machines.