
Unpacking AI Video Chat Performance: A Deep Dive into Mainstream Applications

TLDR: This research paper presents the first systematic performance measurement of AI video chat systems, which emerged in 2025. It introduces a comprehensive benchmark evaluating five mainstream AI video chatbots (ChatGPT, Gemini, Grok, Doubao, Yuanbao) across four dimensions: quality, latency, internal mechanisms, and system overhead. Key findings reveal significant variations in response delay and quality, with no single application excelling in both. AI video chat performance still lags human interaction, particularly in areas like proactive output and complex visual-audio integration. The study identifies diverse network protocols, bitrate strategies, and input modalities, alongside varying client-side resource consumption. It also highlights that AI chatbots require higher bandwidth for visual tasks compared to traditional video calls. The paper provides a crucial baseline for future optimizations in this rapidly evolving field.

In 2025, a new feature emerged from Large Language Model (LLM) services like ChatGPT, Gemini, and Grok: AI video chat. This innovation allows users to interact with AI agents through real-time video communication, much like chatting with another person. This development holds significant business opportunities and societal impact, building on the already mature video chat industry and the projected trillion-dollar LLM market.

Despite its growing importance, there hasn’t been a systematic study to characterize the performance of these new AI video chat systems. This research paper addresses that gap by proposing a comprehensive benchmark. It evaluates AI video chat performance across four key dimensions: quality, latency, internal mechanisms, and system overhead. Using custom testbeds, the researchers evaluated five mainstream AI video chatbots: ChatGPT (OpenAI), Gemini (Google), Grok (xAI), Doubao (ByteDance), and Yuanbao (Tencent).

Understanding AI Video Chat: A New Frontier

AI video chat fundamentally differs from traditional video chat, AI text/audio chat, and video analytics. In traditional video chat, the receiver is human, and the focus is on perceptual quality and end-to-end latency. With AI video chat, the receiver is a machine, shifting the emphasis to the AI’s accuracy and responsiveness. Unlike text or audio-only AI, video chat must handle continuous, high-bitrate video streams, demanding significant computational power and stateful processing to integrate visual information with ongoing dialogue.

The typical AI video chat system works by streaming audio and video from a user’s device to an LLM hosted in the cloud. The system then generates and returns a synthesized audio response. This often involves a pipeline: speech-to-text (STT) transcribes user audio, the LLM processes this text and video frames, and text-to-speech (TTS) converts the LLM’s response into audio. Some systems even process raw audio directly without STT and TTS.
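The cascaded pipeline described above can be sketched as follows. This is a minimal illustration with stub stages, not any vendor's actual API; function names and signatures are invented for clarity.

```python
# Illustrative sketch of the cascaded pipeline (STT -> LLM -> TTS).
# Every component here is a stub standing in for a real service.

def speech_to_text(audio_chunk: bytes) -> str:
    """Stub STT stage: a real system would run ASR on the audio stream."""
    return audio_chunk.decode("utf-8", errors="ignore")

def llm_respond(transcript: str, frames: list) -> str:
    """Stub multimodal LLM call fusing the transcript with sampled frames."""
    return f"Heard {transcript!r}; saw {len(frames)} frame(s)."

def text_to_speech(text: str) -> bytes:
    """Stub TTS stage: a real system would synthesize audio here."""
    return text.encode("utf-8")

def handle_turn(audio_chunk: bytes, frames: list) -> bytes:
    transcript = speech_to_text(audio_chunk)      # 1. transcribe user audio
    reply_text = llm_respond(transcript, frames)  # 2. LLM on text + video frames
    return text_to_speech(reply_text)             # 3. synthesize the audio reply

reply = handle_turn(b"what is on my desk?", frames=[b"frame0", b"frame1"])
```

Audio-native systems collapse the first and last stages into the model itself, which is why they can pick up non-speech cues that a text transcript discards.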

Key Findings from the Benchmark

The study uncovered several major insights into the current state of AI video chat:

Response Delay: Current AI video chatbots are far from achieving seamless human-like conversations. Response delays vary significantly, with Grok averaging 2.5 seconds, while ChatGPT and Doubao can reach up to 5 seconds at the 90th percentile, and Gemini sometimes extends to 8 seconds. These delays are influenced by both the AI’s processing time and resource availability.

Specific Use Cases: Applications behave differently when handling unique AI video chat scenarios. For instance, ChatGPT can recall information from video content more than 10 minutes old, whereas Yuanbao can only respond based on the current video frame.

Network Stack Diversity: There’s no single standard for the underlying network technology. ChatGPT and Grok use RTP or a customized version, while Gemini employs the QUIC protocol. Video bitrate and framerate also vary widely across applications, by factors of 4x and 10x respectively.

Response Quality: The quality of AI video chatbot responses still has considerable room for improvement. For example, none of the tested AI video chatbots can interrupt a user or proactively generate output like a human would; they only respond passively after detecting user speech.

Detailed Evaluation Dimensions

Quality Assessment

  • Visual-related response quality: This includes real-time visual understanding (e.g., object counting), contextual understanding, and omni-source understanding (integrating visual and auditory inputs). The study found that AI performance in real-time video chat is significantly lower than in traditional, offline AI evaluations, largely due to the computational demands and limited visual memory in real-time scenarios. Challenges also arise in distinguishing user voice from environmental sounds.
  • Chatbot-related response quality: This covers scenarios specific to AI video chat, such as visual content memory (how long the AI can recall visual details), visual named entity recognition (identifying brands, dishes, attractions), and math problem solving. ChatGPT showed the longest visual memory (over 10 minutes), while Yuanbao had none. Doubao was the only agent to demonstrate competence in math problem solving, especially geometry.
  • Perceptual quality: Evaluated by speaking rate and response duration. ChatGPT and Doubao had the longest response durations, while Gemini was the shortest. The differences were mainly attributed to varying token limits rather than speaking speed.
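Speaking rate, one of the perceptual-quality metrics above, can be computed from a response transcript and its duration. The word-based proxy below is an assumption for illustration; the paper does not specify its exact formula.

```python
def speaking_rate_wps(transcript: str, duration_s: float) -> float:
    """Speaking rate in words per second, a simple proxy for the
    perceptual-quality metric (illustrative, not the paper's formula)."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return len(transcript.split()) / duration_s
```

Comparing this rate across agents separates "speaks faster" from "speaks longer", which is how the study attributes duration differences to token limits rather than speech speed.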

Latency Analysis

Latency, crucial for user experience, was measured in two ways:

  • Response delay: The time from when the user stops speaking to when the AI begins its audible response. All applications showed minimum delays exceeding 1.5 seconds, far from the sub-second ideal for human conversations. Yuanbao had the lowest average delay (2.5 seconds), and ChatGPT the longest (4 seconds). Client location and server load (peak vs. off-peak hours) significantly impacted these delays.
  • Video chat setup time: The total delay from initiating the chat until the AI is fully connected and ready to interact. ChatGPT exhibited the longest setup time (over 4 seconds), while other applications were ready in about 2 seconds or less.
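Response delay as defined above can be estimated from two time-aligned audio tracks by finding where user speech ends and agent speech begins. The energy-threshold detector below is a simplifying assumption; a real testbed would use proper voice activity detection.

```python
def first_voiced_index(samples, threshold=0.02):
    """Index of the first sample whose magnitude exceeds the threshold."""
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            return i
    return None

def response_delay_s(user_track, agent_track, sample_rate=16000,
                     threshold=0.02):
    """Delay from the last voiced user sample to the first voiced agent
    sample. Assumes two time-aligned mono tracks of one conversation."""
    last_user = max(i for i, s in enumerate(user_track)
                    if abs(s) > threshold)
    first_agent = first_voiced_index(agent_track, threshold)
    return (first_agent - last_user) / sample_rate
```

Running this detector over many turns, at different hours and client locations, is enough to reproduce percentile statistics like the 90th-percentile delays cited above.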

Internal Mechanisms

The researchers delved into the technical design choices:

  • Network protocols: Most applications (ChatGPT, Grok, Doubao, Yuanbao) use the traditional RTP/RTCP stack, while Gemini uses QUIC. Grok employs a customized RTP version.
  • Uplink & Downlink rate: Uplink bitrates varied significantly, with ChatGPT and Grok reaching nearly 2000 kbps, and Gemini as low as 500 kbps. Most applications use content-adaptive bitrates, reducing data for low-motion videos, but Grok uses a static high rate, and Gemini has a low maximum limit. Downlink bitrates were negligible as they only carry audio.
  • Framerate: All applications except ChatGPT (20-30 fps) operated at substantially lower framerates (1-10 fps). This reflects both a design choice and a technical limitation: unlike human viewers, AI models often need only a few information-rich frames for analysis.
  • Video packet sending pattern: ChatGPT and Grok use a paced, consistent sending pattern, while Gemini, Doubao, and Yuanbao adopt a bursty transmission.
  • Input modality: ChatGPT and Gemini use end-to-end audio-native models, capable of interpreting non-verbal sounds like music or laughter. Grok, Doubao, and Yuanbao rely on a cascaded pipeline that converts speech to text, losing non-speech information.
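The content-adaptive bitrate behavior noted in the list above can be sketched as a simple mapping from scene motion to a target uplink rate. The linear rule and the floor/ceiling values are assumptions for illustration, chosen only to match the roughly 500-2000 kbps range observed across apps.

```python
def target_uplink_kbps(motion_score: float,
                       floor_kbps: int = 500,
                       ceiling_kbps: int = 2000) -> int:
    """Content-adaptive uplink bitrate: low-motion scenes get fewer
    bits, high-motion scenes approach the ceiling. motion_score is a
    normalised [0, 1] estimate; the linear rule and the bounds are
    illustrative, not any app's measured controller."""
    motion_score = min(max(motion_score, 0.0), 1.0)
    return int(floor_kbps + motion_score * (ceiling_kbps - floor_kbps))
```

Under this model, Grok's behavior corresponds to pinning the output at the ceiling regardless of motion, while Gemini's corresponds to a much lower ceiling.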

System Overhead

Client-side resource consumption was also measured:

Yuanbao showed the highest CPU usage (over 300%), and Doubao had the largest memory footprint (nearly 9%). Grok stood out for its minimal resource consumption, with CPU usage below 100% and memory under 6%.
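CPU figures like these (where over 100% means more than one core) can be sampled per process. The stdlib-only sketch below measures this process's own utilisation over a short window; the paper's actual profiling tooling is not specified, so this is purely illustrative.

```python
import os
import time

def cpu_percent(window_s: float = 0.2) -> float:
    """Rough CPU utilisation of this process over a short window,
    using only the standard library. A value above 100 would mean
    more than one core busy (illustrative measurement only)."""
    t0, w0 = os.times(), time.monotonic()
    deadline = w0 + window_s
    while time.monotonic() < deadline:  # busy loop so there is load to see
        pass
    t1, w1 = os.times(), time.monotonic()
    used = (t1.user + t1.system) - (t0.user + t0.system)
    return 100.0 * used / (w1 - w0)
```

Sampling a chat client this way during an active session, alongside its resident memory, yields the kind of per-app comparison reported above.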

Impact of Bandwidth Constraints

The study also examined how performance degrades under limited bandwidth. Traditional video calls (like WhatsApp) can function reliably at very low bitrates (around 100 kbps). However, AI chatbots require significantly higher bandwidth for their visual functionality. Gemini performed best, needing 300 kbps to maintain visual capability, while Doubao required 800 kbps. Grok struggled with text recognition even at high bitrates. This indicates that AI chatbots have inherent limitations in low-bandwidth environments compared to traditional RTC applications.
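The thresholds reported above can be encoded as a small lookup to reason about which apps stay visually functional at a given uplink rate. The table contains only the figures stated in this article; the helper function is an invented convenience, not part of the study.

```python
# Minimum uplink bitrate (kbps) at which each app kept its visual
# capability in the bandwidth-constraint experiments (values as
# reported above; WhatsApp-style calls manage at roughly 100 kbps).
VISUAL_FLOOR_KBPS = {"Gemini": 300, "Doubao": 800}
TRADITIONAL_CALL_KBPS = 100

def visually_usable(available_kbps: int) -> list:
    """Which measured apps retain visual capability at this bitrate."""
    return sorted(app for app, floor in VISUAL_FLOOR_KBPS.items()
                  if available_kbps >= floor)
```

At 500 kbps, for example, only Gemini keeps its visual functionality even though a traditional call would run comfortably, which is the gap the study highlights.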

Conclusion

This pioneering research provides a crucial baseline for understanding the real-world performance of AI video chat systems. It highlights that while applications like ChatGPT, Gemini, and Doubao offer relatively good quality, they still fall short of seamless human interaction, particularly in terms of response delay. Yuanbao offers faster responses but at lower quality. The study identifies unique system bottlenecks and diverse design choices across applications, laying a foundation for future optimization efforts in this rapidly evolving field. For more detail, refer to the full research paper.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
