Multi-TW: A New Benchmark for Multimodal AI in Traditional Chinese

TLDR: Multi-TW is the first benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on Traditional Chinese question answering, focusing on both performance and inference latency. It comprises 900 multiple-choice questions, equally split between image-text and audio-text pairs, derived from authentic proficiency tests. Experiments using Multi-TW reveal that while closed-source models generally outperform open-source ones, open-source models can excel in audio tasks. Crucially, end-to-end any-to-any MLLMs demonstrate significant latency advantages over Vision-Language Models (VLMs) combined with separate audio transcription, highlighting the need for better Traditional Chinese fine-tuning and efficient multimodal architectures.

Multimodal Large Language Models (MLLMs) process and reason over information from several sources, such as images, audio, and text, and so aim to overcome the limitations of traditional language models that handle only text. However, comprehensive evaluation benchmarks have been lacking, especially for languages like Traditional Chinese, and especially for assessing a model across all three modalities together with its inference speed.

To address this gap, a new research paper introduces Multi-TW, the first benchmark specifically designed for evaluating the performance and latency of any-to-any multimodal models in Traditional Chinese. It gives researchers and developers a way to rigorously test and improve MLLMs for a wider range of real-world applications.

What is Multi-TW?

Multi-TW is a unique dataset comprising 900 multiple-choice questions. These questions are carefully crafted from authentic Traditional Chinese proficiency tests developed in collaboration with the Steering Committee for the Test of Proficiency-Huayu (SC-TOP) in Taiwan. The dataset is equally divided into two main types: 450 image-text pairs and 450 audio-text pairs. This balanced design allows for direct comparison of model performance on visual versus auditory inputs when paired with Traditional Chinese text.

The questions in Multi-TW cover a diverse range of tasks to thoroughly evaluate multimodal understanding. For audio-based questions, tasks include Dialogue Comprehension and Passage Comprehension. Vision-based questions encompass Dialogue Comprehension, Image Comprehension, Reading Comprehension, Sentence-to-Image Matching, and Image-to-Sentence Matching. This variety ensures that models are tested on a broad spectrum of capabilities.
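To make the dataset layout concrete, here is a minimal sketch of how one item and the task taxonomy might be represented. The class name, field names, and file layout are assumptions made for illustration; the article does not describe the official schema. Only the task names and the two modalities come from the text above.

```python
from collections import Counter
from dataclasses import dataclass

# Task names as listed in the article, grouped by the modality they belong to.
AUDIO_TASKS = {"Dialogue Comprehension", "Passage Comprehension"}
VISION_TASKS = {
    "Dialogue Comprehension",
    "Image Comprehension",
    "Reading Comprehension",
    "Sentence-to-Image Matching",
    "Image-to-Sentence Matching",
}


@dataclass(frozen=True)
class MultiTWItem:
    """One multiple-choice question (hypothetical schema, not the official one)."""

    item_id: str
    modality: str     # "image" or "audio"
    task: str         # one of the task names above
    question: str     # Traditional Chinese question text
    choices: tuple    # answer options
    answer: str       # label of the correct option
    media_path: str   # path to the paired image or audio file

    def __post_init__(self):
        # Reject unknown modalities first, then tasks that do not fit the modality.
        if self.modality not in ("image", "audio"):
            raise ValueError(f"unknown modality: {self.modality}")
        allowed = VISION_TASKS if self.modality == "image" else AUDIO_TASKS
        if self.task not in allowed:
            raise ValueError(f"task {self.task!r} is not a {self.modality} task")


def modality_counts(items):
    """Count items per modality, e.g. to confirm a 450/450 split."""
    return Counter(item.modality for item in items)
```

On the full benchmark, `modality_counts` would be expected to report 450 image items and 450 audio items, matching the balanced design described above.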

Why is Multi-TW Important?

Existing benchmarks often focus on only two modalities (e.g., text and vision) or are primarily designed for English. Multi-TW fills a critical void by offering a comprehensive evaluation across textual, visual, and acoustic modalities specifically for Traditional Chinese. Furthermore, unlike many benchmarks that prioritize only accuracy, Multi-TW also evaluates model inference time, which is vital for real-world applications where both speed and precision are essential.

The data for Multi-TW was meticulously constructed from September to December 2023, using publicly available sources and a standardized workflow. A custom interface was developed to streamline data collection and labeling. Each item underwent a rigorous quality control process, including completeness checks, file consistency verification (e.g., image format, audio clarity), and label accuracy verification, ensuring high data integrity.
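The mechanical part of such a quality-control pass could look roughly like the sketch below. It is not the authors' tooling: the field names, the accepted file extensions, and the `check_item` helper are all assumptions. Checks such as audio clarity or whether a label is actually correct need human review and are not covered here.

```python
from pathlib import Path

# Assumed file formats; the article does not say which formats were used.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}
AUDIO_SUFFIXES = {".wav", ".mp3"}
REQUIRED_FIELDS = ("modality", "question", "choices", "answer", "media_path")


def check_item(item: dict) -> list:
    """Return a list of problems found in one item (empty list means it passed)."""
    # Completeness check: every required field must be present and non-empty.
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not item.get(name)]
    if problems:
        return problems

    # File consistency check: the media file extension must match the modality.
    expected = {"image": IMAGE_SUFFIXES, "audio": AUDIO_SUFFIXES}.get(item["modality"])
    suffix = Path(item["media_path"]).suffix.lower()
    if expected is None:
        problems.append(f"unknown modality: {item['modality']}")
    elif suffix not in expected:
        problems.append(f"unexpected file type {suffix!r} for {item['modality']} item")

    # Structural label check: the answer must be one of the option labels.
    # `choices` is assumed to be a dict mapping option label to option text.
    if item["answer"] not in item["choices"]:
        problems.append(f"answer {item['answer']!r} is not among the choices")
    return problems
```

Running `check_item` over every record and collecting the non-empty results would give reviewers a short list of items to inspect by hand.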

Key Findings from Experiments

The researchers conducted extensive experiments using Multi-TW to benchmark various publicly available multimodal language models. They evaluated both “any-to-any” models (which directly process text, image, and audio inputs) and Vision-Language Models (VLMs) that used a separate Automatic Speech Recognition (ASR) system, such as Whisper-large, to transcribe audio into text before processing.

Here are some of the key observations:

Closed-Source vs. Open-Source Models: Closed-source models, such as Google’s Gemini series, generally showed superior performance across both image and audio modalities. However, open-source models, particularly those like the Qwen2.5-Omni series and Baichuan-Omni-1.5 (which are primarily trained on Simplified Chinese), demonstrated competitive accuracy on Traditional Chinese inputs, especially in audio-text tasks. This highlights the potential for open-source models to excel in specific areas.
Performance Gaps: A significant performance difference was noted between open-source and closed-source models, particularly in the image-text domain. This suggests a strong need for dedicated Traditional Chinese fine-tuning and more robust vision components in open-source any-to-any models.
Latency Advantages: A crucial finding relates to inference latency. End-to-end any-to-any models showed notable speed advantages, completing the 900-item benchmark significantly faster (467–744 seconds) compared to VLMs coupled with an ASR pipeline (1,187–2,131 seconds). This indicates that integrated multimodal architectures are more efficient for processing audio inputs.
Model Specifics: Among the evaluated VLMs, Qwen2.5-VL-7B-Instruct and UI-TARS-1.5-7B performed best, suggesting that extensive pre-training on relevant Chinese-language corpora is a critical factor for strong performance. Models with less exposure to Traditional Chinese data, despite large parameter counts, showed lower performance.
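The accuracy and latency comparison above can be illustrated with a small harness. This is a sketch under stated assumptions, not the paper's evaluation code: `evaluate`, `make_asr_then_vlm`, and the item fields are hypothetical names, and the models are passed in as plain callables. The only figures taken from the article are the reported wall-clock ranges and the 900-item benchmark size.

```python
import time

N_ITEMS = 900
# Wall-clock ranges reported in the article, in seconds for the full benchmark.
REPORTED_SECONDS = {"any_to_any": (467, 744), "vlm_plus_asr": (1187, 2131)}


def evaluate(model, items, clock=time.perf_counter):
    """Run `model` on every item and report accuracy plus elapsed time."""
    items = list(items)
    if not items:
        raise ValueError("no items to evaluate")
    start = clock()
    correct = sum(1 for item in items if model(item) == item["answer"])
    elapsed = clock() - start
    return {
        "accuracy": correct / len(items),
        "total_seconds": elapsed,
        "seconds_per_item": elapsed / len(items),
    }


def make_asr_then_vlm(transcribe, vlm):
    """Build a two-stage pipeline: transcribe audio first, then query a VLM."""
    def pipeline(item):
        if item["modality"] == "audio":
            # The extra transcription pass is where the added latency comes from.
            item = {**item, "transcript": transcribe(item["media_path"])}
        return vlm(item)
    return pipeline


def seconds_per_item(total_seconds, n_items=N_ITEMS):
    """Average time per question for a given total run time."""
    return total_seconds / n_items


def speedup_bounds(fast=REPORTED_SECONDS["any_to_any"], slow=REPORTED_SECONDS["vlm_plus_asr"]):
    """Smallest and largest ratio obtainable from the two reported ranges."""
    return slow[0] / fast[1], slow[1] / fast[0]
```

Applied to the reported ranges, `seconds_per_item` gives roughly 0.5 to 0.8 seconds per question for any-to-any models and roughly 1.3 to 2.4 seconds for VLMs with an ASR stage. `speedup_bounds` gives a ratio between about 1.6 and 4.6, with the caveat that it compares the extremes of the two ranges rather than matched pairs of models.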

Conclusion and Future Directions

The introduction of Multi-TW marks a significant step forward in evaluating MLLMs, particularly for Traditional Chinese. The benchmark provides invaluable insights into the capabilities and limitations of current models, emphasizing the importance of both accuracy and efficiency. The findings underscore the urgent need for more appropriate architectural designs and targeted fine-tuning data to achieve robust multimodal integration, especially for Traditional Chinese.

Future work will explore how cross-lingual transfer capabilities influence the performance of Simplified Chinese-trained models on Traditional Chinese reasoning tasks. Researchers also plan to evaluate latency under more rigorous, parallelized conditions and expand Multi-TW to include generative tasks and more complex reasoning scenarios. For more details, you can refer to the full research paper: Multi-TW: Benchmarking Multimodal Models on Traditional Chinese Question Answering in Taiwan.
