Multi-TW: A New Benchmark for Multimodal AI in Traditional Chinese

TLDR: Multi-TW is the first benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on Traditional Chinese question answering, focusing on both performance and inference latency. It comprises 900 multiple-choice questions, equally split between image-text and audio-text pairs, derived from authentic proficiency tests. Experiments using Multi-TW reveal that while closed-source models generally outperform open-source ones, open-source models can excel in audio tasks. Crucially, end-to-end any-to-any MLLMs demonstrate significant latency advantages over Vision-Language Models (VLMs) combined with separate audio transcription, highlighting the need for better Traditional Chinese fine-tuning and efficient multimodal architectures.

Multimodal Large Language Models (MLLMs) process and reason over several kinds of input, such as images, audio, and text, and so aim to overcome the limits of traditional language models that handle text only. Evaluating them remains difficult, however. Comprehensive benchmarks are scarce, especially for languages such as Traditional Chinese, and few benchmarks assess a model across all three modalities while also measuring its inference speed.

To address this crucial gap, a new research paper introduces Multi-TW, the first benchmark specifically designed for evaluating the performance and latency of any-to-any multimodal models in Traditional Chinese. This innovative benchmark provides a much-needed tool for researchers and developers to rigorously test and improve MLLMs for a wider range of real-world applications.

What is Multi-TW?

Multi-TW is a unique dataset comprising 900 multiple-choice questions. These questions are carefully crafted from authentic Traditional Chinese proficiency tests developed in collaboration with the Steering Committee for the Test of Proficiency-Huayu (SC-TOP) in Taiwan. The dataset is equally divided into two main types: 450 image-text pairs and 450 audio-text pairs. This balanced design allows for direct comparison of model performance on visual versus auditory inputs when paired with Traditional Chinese text.

The questions in Multi-TW cover a diverse range of tasks to thoroughly evaluate multimodal understanding. For audio-based questions, tasks include Dialogue Comprehension and Passage Comprehension. Vision-based questions encompass Dialogue Comprehension, Image Comprehension, Reading Comprehension, Sentence-to-Image Matching, and Image-to-Sentence Matching. This variety ensures that models are tested on a broad spectrum of capabilities.
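To make the dataset's shape concrete, here is a minimal sketch of how one benchmark item might be represented. The task names come from the description above; the `Item` class, its field names, and the validation rule are hypothetical illustrations, not the benchmark's actual file format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple

# Task names as listed in the article, grouped by modality.
AUDIO_TASKS = ("Dialogue Comprehension", "Passage Comprehension")
VISION_TASKS = (
    "Dialogue Comprehension",
    "Image Comprehension",
    "Reading Comprehension",
    "Sentence-to-Image Matching",
    "Image-to-Sentence Matching",
)


@dataclass(frozen=True)
class Item:
    """One multiple-choice question (hypothetical schema)."""

    modality: str              # "image" or "audio"
    task: str                  # one of the task names above
    media_path: str            # path to the image or audio file
    question: str              # Traditional Chinese question text
    choices: Tuple[str, ...]   # answer options
    answer: str                # label of the correct option, e.g. "A"

    def __post_init__(self) -> None:
        # Reject task names that do not belong to the item's modality.
        if self.modality == "image":
            allowed = VISION_TASKS
        elif self.modality == "audio":
            allowed = AUDIO_TASKS
        else:
            raise ValueError(f"unknown modality: {self.modality!r}")
        if self.task not in allowed:
            raise ValueError(f"{self.task!r} is not a {self.modality} task")


def modality_counts(items: Iterable[Item]) -> Counter:
    """Count items per modality; the full benchmark would give 450 each."""
    return Counter(item.modality for item in items)
```

With a schema like this, checking the 450/450 split is a one-line call to `modality_counts` on the loaded items.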

Why is Multi-TW Important?

Existing benchmarks often focus on only two modalities (e.g., text and vision) or are primarily designed for English. Multi-TW fills a critical void by offering a comprehensive evaluation across textual, visual, and acoustic modalities specifically for Traditional Chinese. Furthermore, unlike many benchmarks that prioritize only accuracy, Multi-TW also evaluates model inference time, which is vital for real-world applications where both speed and precision are essential.
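As an illustration of scoring speed alongside accuracy, the sketch below times a whole evaluation run and reports both numbers. It is an assumption-laden example, not the paper's harness: `predict` stands in for any model call, and items are simplified to `(prompt, gold_label)` pairs.

```python
import time
from typing import Callable, Iterable, Tuple


def evaluate(
    predict: Callable[[str], str],
    items: Iterable[Tuple[str, str]],
) -> Tuple[float, float]:
    """Return (accuracy, total wall-clock seconds) for one pass over items.

    `predict` is a hypothetical stand-in for a model call that maps a
    prompt to an answer label such as "A".
    """
    correct = 0
    total = 0
    start = time.perf_counter()
    for prompt, gold in items:
        if predict(prompt) == gold:
            correct += 1
        total += 1
    elapsed = time.perf_counter() - start
    accuracy = correct / total if total else 0.0
    return accuracy, elapsed
```

Timing the full pass, rather than individual calls, matches the way the article reports latency as total seconds for the benchmark.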

The data for Multi-TW was meticulously constructed from September to December 2023, using publicly available sources and a standardized workflow. A custom interface was developed to streamline data collection and labeling. Each item underwent a rigorous quality control process, including completeness checks, file consistency verification (e.g., image format, audio clarity), and label accuracy verification, ensuring high data integrity.

Key Findings from Experiments

The researchers used Multi-TW to benchmark a range of publicly available multimodal language models. They evaluated both “any-to-any” models, which process text, image, and audio inputs directly, and Vision-Language Models (VLMs) paired with a separate Automatic Speech Recognition (ASR) system, such as Whisper-large, that transcribes audio into text before the VLM processes it.
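The difference between the two setups can be sketched as two small functions. Everything here is a hypothetical stand-in: `asr`, `vlm`, and `model` are plain callables with assumed signatures, and the prompt template is invented for illustration rather than taken from the paper.

```python
from typing import Callable

# Assumed signatures, for illustration only.
Transcriber = Callable[[bytes], str]      # audio -> transcript (an ASR system)
TextModel = Callable[[str], str]          # text prompt -> answer label
OmniModel = Callable[[bytes, str], str]   # (audio, question) -> answer label


def answer_cascaded(
    asr: Transcriber, vlm: TextModel, audio: bytes, question: str
) -> str:
    """VLM + ASR pipeline: transcribe first, then answer from text."""
    transcript = asr(audio)  # stage 1: speech to text
    prompt = f"Transcript: {transcript}\nQuestion: {question}"
    return vlm(prompt)       # stage 2: text-only question answering


def answer_end_to_end(model: OmniModel, audio: bytes, question: str) -> str:
    """Any-to-any model: a single call consumes the raw audio directly."""
    return model(audio, question)
```

The cascaded path always pays for two model invocations per audio question, which is one plausible reason an integrated model can finish sooner.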

Here are some of the key observations:

  • Closed-Source vs. Open-Source Models: Closed-source models, such as Google’s Gemini series, generally performed better across both image and audio modalities. However, open-source models such as the Qwen2.5-Omni series and Baichuan-Omni-1.5, which are trained primarily on Simplified Chinese, achieved competitive accuracy on Traditional Chinese inputs, especially in audio-text tasks. Open-source models can therefore excel in specific areas.
  • Performance Gaps: A significant performance difference was noted between open-source and closed-source models, particularly in the image-text domain. This suggests a strong need for dedicated Traditional Chinese fine-tuning and more robust vision components in open-source any-to-any models.
  • Latency Advantages: End-to-end any-to-any models were notably faster, completing the 900-item benchmark in 467–744 seconds, versus 1,187–2,131 seconds for VLMs coupled with an ASR pipeline. This indicates that integrated multimodal architectures process audio inputs more efficiently.
  • Model Specifics: Among the evaluated VLMs, Qwen2.5-VL-7B-Instruct and UI-TARS-1.5-7B performed best, suggesting that extensive pre-training on relevant Chinese-language corpora is a critical factor for strong performance. Models with less exposure to Traditional Chinese data, despite large parameter counts, showed lower performance.
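To put the latency ranges in perspective, the short calculation below derives rough speedup bounds and per-item averages from the figures quoted above. It takes the article's numbers at face value and assumes they cover all 900 items; the bounds compare the extremes of the two ranges, not matched pairs of models.

```python
from typing import Tuple

# Total seconds reported in the article (min, max) for each setup.
ANY_TO_ANY_SECONDS = (467, 744)
VLM_ASR_SECONDS = (1187, 2131)
N_ITEMS = 900  # assumption: the totals cover the full benchmark


def speedup_bounds(
    fast: Tuple[float, float], slow: Tuple[float, float]
) -> Tuple[float, float]:
    """Smallest and largest ratio of slow-setup time to fast-setup time."""
    low = min(slow) / max(fast)    # fastest cascade vs. slowest any-to-any
    high = max(slow) / min(fast)   # slowest cascade vs. fastest any-to-any
    return low, high


def seconds_per_item(
    totals: Tuple[float, float], n_items: int = N_ITEMS
) -> Tuple[float, float]:
    """Average seconds per question at each end of a reported range."""
    return tuple(t / n_items for t in totals)


if __name__ == "__main__":
    low, high = speedup_bounds(ANY_TO_ANY_SECONDS, VLM_ASR_SECONDS)
    print(f"speedup between {low:.2f}x and {high:.2f}x")
    print("any-to-any s/item:", seconds_per_item(ANY_TO_ANY_SECONDS))
    print("VLM + ASR s/item:", seconds_per_item(VLM_ASR_SECONDS))
```

Under these assumptions the any-to-any models come out roughly 1.6 to 4.6 times faster, or about half a second to under a second per question against more than a second for the cascaded setups.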

Conclusion and Future Directions

The introduction of Multi-TW marks a significant step forward in evaluating MLLMs, particularly for Traditional Chinese. The benchmark provides invaluable insights into the capabilities and limitations of current models, emphasizing the importance of both accuracy and efficiency. The findings underscore the urgent need for more appropriate architectural designs and targeted fine-tuning data to achieve robust multimodal integration, especially for Traditional Chinese.

Future work will explore how cross-lingual transfer capabilities influence the performance of Simplified Chinese-trained models on Traditional Chinese reasoning tasks. Researchers also plan to evaluate latency under more rigorous, parallelized conditions and expand Multi-TW to include generative tasks and more complex reasoning scenarios. For more details, you can refer to the full research paper: Multi-TW: Benchmarking Multimodal Models on Traditional Chinese Question Answering in Taiwan.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
