TowerVision: Bridging Language Barriers in AI’s Vision-Language Understanding

TLDR: TowerVision is a new suite of open multilingual vision-language models (VLMs) developed to address the English-centric bias in existing AI models. Created by André G. Viveiros and colleagues, these models excel in image-text and video-text tasks across 20 languages, demonstrating enhanced cultural understanding and multimodal translation capabilities. The project introduces VisionBlocks, a high-quality, curated multilingual dataset, and provides a detailed training methodology. Key findings include the importance of multilingual text backbones and vision encoders, the effectiveness of high-quality English captions for initial alignment, and how expanding language coverage in training data improves cross-lingual generalization, even for unseen languages. TowerVision aims to advance culturally diverse multimodal AI research.

In the rapidly evolving landscape of artificial intelligence, Vision-Language Models (VLMs) have made remarkable strides, allowing computers to understand and process both images and text. However, a significant challenge has persisted: most of these advanced models are primarily designed and trained using English-centric data, limiting their effectiveness and applicability in a diverse, multilingual world.

A recent research paper, titled “TowerVision: Understanding and Improving Multilinguality in Vision-Language Models,” by André G. Viveiros and a team of researchers, addresses this critical gap. The paper introduces TowerVision, a new family of open multilingual VLMs that are specifically engineered to perform well across multiple languages and cultures, for both image-text and video-text tasks. You can find the full research paper here: TowerVision Research Paper.

The Multilingual Challenge and TowerVision’s Approach

The core problem lies in the scarcity of high-quality multilingual vision-text data. While text-only multilingual datasets are relatively abundant, finding diverse image-text pairs in many languages is difficult. TowerVision tackles this by leveraging large-scale text-only multilingual data to strengthen its language understanding, complemented by multimodal multilingual examples obtained through translation and high-quality synthetic generation.

The models are built upon Tower+, a multilingual text-only model, and are designed to support 20 languages and dialects. This comprehensive approach allows TowerVision to achieve competitive performance on various multilingual and multimodal benchmarks, showing particular strength in tasks that require cultural understanding and multimodal translation.

VisionBlocks: A Curated Multilingual Dataset

A key contribution of this work is the release of VisionBlocks, a high-quality, curated vision-language dataset. Creating such a dataset is challenging due to the limited availability of human-written vision-text data and the difficulties in filtering low-quality samples. VisionBlocks aggregates and filters data from multiple sources, enhancing it with new translated and synthetically generated content. This includes:

  • Existing English and multilingual vision-text data, focusing on quality over sheer scale.
  • Translated captions from high-quality English datasets into target languages, carefully filtered for accuracy.
  • Synthetic captions generated by advanced APIs to improve coverage of fine-grained visual details and provide instruction-like supervision.
  • Text-only data (EuroBlocks) to maintain the language model’s text performance.
  • Translated multilingual video data, extending the model’s capabilities to the video modality.
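As a rough illustration of how such a mixture could be assembled, the sketch below groups samples by source and drops those below a quality threshold. The `Sample` fields, the source labels, the quality scores, and the 0.5 threshold are all hypothetical placeholders; the article does not describe the actual filtering criteria used for VisionBlocks.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    source: str     # hypothetical label: existing, translated, synthetic, text_only, video
    language: str   # language code of the caption or text
    text: str
    quality: float  # hypothetical quality score in [0, 1], not defined in the article


def build_mixture(samples, min_quality=0.5):
    """Group samples by source, keeping only those at or above min_quality."""
    mixture = defaultdict(list)
    for sample in samples:
        if sample.quality >= min_quality:
            mixture[sample.source].append(sample)
    return dict(mixture)


# Toy inputs for illustration only; these are not VisionBlocks data.
samples = [
    Sample("existing", "en", "A dog runs on a beach.", 0.9),
    Sample("translated", "pt", "Um cão corre na praia.", 0.8),
    Sample("translated", "de", "Hund Strand laufen laufen", 0.2),  # low quality, dropped
    Sample("synthetic", "en", "A brown dog sprints along wet sand.", 0.95),
    Sample("text_only", "fr", "Traduisez cette phrase en anglais.", 0.7),
    Sample("video", "es", "Un perro corre por la playa.", 0.85),
]

mixture = build_mixture(samples)
for source, kept in mixture.items():
    print(source, [s.language for s in kept])
```

Keeping the sources separate until the end makes it easy to hold one portion back, for example the video samples that the training recipe reserves for its final stage.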

Architecture and Training

TowerVision’s architecture combines three main components: a multilingual text-only backbone (Tower+), a Vision Transformer encoder (SigLIP2) for visual inputs, and a connector module to align visual features with the text embedding space. The training process involves three stages:

  1. Projector Pretraining: Initially, the model is trained to predict captions from images using diverse, high-quality English captions, with the vision encoder and language model backbone frozen.
  2. Vision Finetuning: The full model is then unfrozen and trained on the complete VisionBlocks dataset (excluding video data), using high-dynamic resolution for images. This stage produces the TowerVision model.
  3. Video Finetuning: Finally, the video portion of VisionBlocks is used to finetune TowerVision, resulting in the TowerVideo model, which extends the approach to the video modality.
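The three stages can be summarized as a small schedule that records which components are updated at each step. This is a minimal sketch, not the authors' training code: the component and dataset names are illustrative, and the assumption that every component is trainable during video finetuning is ours, since the article only states what is frozen in the first stage and unfrozen in the second.

```python
from dataclasses import dataclass

# The three architectural components named in the article.
COMPONENTS = ("vision_encoder", "connector", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    data: str             # illustrative dataset label, not an official name
    trainable: frozenset  # components whose weights are updated in this stage
    output: str


STAGES = (
    # Stage 1: encoder and backbone are frozen, so only the connector learns.
    Stage("projector_pretraining", "english_captions",
          frozenset({"connector"}), "aligned connector"),
    # Stage 2: the full model is unfrozen and trained on non-video data.
    Stage("vision_finetuning", "visionblocks_without_video",
          frozenset(COMPONENTS), "TowerVision"),
    # Stage 3: ASSUMPTION, all components stay trainable; the article does not say.
    Stage("video_finetuning", "visionblocks_video",
          frozenset(COMPONENTS), "TowerVideo"),
)


def frozen_components(stage):
    """Return the components that are not updated in the given stage."""
    return tuple(c for c in COMPONENTS if c not in stage.trainable)


for stage in STAGES:
    frozen = ", ".join(frozen_components(stage)) or "none"
    print(f"{stage.name}: data={stage.data}, frozen={frozen}, output={stage.output}")
```

In a real training framework, the `trainable` set would typically be translated into per-parameter gradient flags before each stage begins.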

Key Findings and Performance

TowerVision models demonstrate strong performance, especially on culturally aware benchmarks such as ALM-Bench, where they outperform existing approaches trained on substantially larger datasets. They are less competitive on OCR-related tasks because VisionBlocks contains limited OCR-focused data, although they still outperform some baselines.

Interestingly, the smaller TowerVision-2B model remains competitive with larger models on multilingual tasks, highlighting the efficiency of its design choices. Scaling from 2B to 9B parameters consistently improves performance, indicating that the training recipe scales well.

The research also delves into crucial design choices:

  • Multilingual Backbones: Using Tower+ as a backbone consistently outperforms Gemma2, emphasizing the importance of a strong multilingual foundation for cross-modal understanding.
  • Multilingual-aware Vision Encoders: SigLIP2, trained on diverse multilingual data, provides an advantage in low-data regimes, though extensive multilingual fine-tuning can compensate for less specialized encoders.
  • Alignment Data: High-quality English captions are sufficient for the initial projector pretraining phase, with little to no positive effect from adding multilingual data at this early stage.
  • Language Coverage: Expanding the number of languages in the training data consistently improves performance and cross-lingual generalization, even for languages not explicitly included in the training set.
  • Video Fine-tuning: Incorporating multilingual data during video fine-tuning significantly enhances cross-lingual reasoning without compromising English performance.

In conclusion, TowerVision represents a significant step forward in developing inclusive and culturally aware multilingual vision-language models. By publicly releasing their models, data, and training recipes, the researchers aim to foster further advancements in this critical area, helping to narrow the performance gap with English-centric systems.

Rhea Bhattacharya
Rhea Bhattacharya is an AI correspondent with a keen eye for cultural, social, and ethical trends in Generative AI. With a background in sociology and digital ethics, she delivers high-context stories that explore the intersection of AI with everyday lives, governance, and global equity. Her news coverage is analytical, human-centric, and always ahead of the curve. You can reach her at: [email protected]
