TowerVision: Bridging Language Barriers in AI's Vision-Language Understanding

TLDR: TowerVision is a new suite of open multilingual vision-language models (VLMs) developed to address the English-centric bias in existing AI models. Created by André G. Viveiros and colleagues, these models excel in image-text and video-text tasks across 20 languages, demonstrating enhanced cultural understanding and multimodal translation capabilities. The project introduces VisionBlocks, a high-quality, curated multilingual dataset, and provides a detailed training methodology. Key findings include the importance of multilingual text backbones and vision encoders, the effectiveness of high-quality English captions for initial alignment, and how expanding language coverage in training data improves cross-lingual generalization, even for unseen languages. TowerVision aims to advance culturally diverse multimodal AI research.

In the rapidly evolving landscape of artificial intelligence, Vision-Language Models (VLMs) have made remarkable strides, allowing computers to understand and process both images and text. However, a significant challenge has persisted: most of these advanced models are primarily designed and trained using English-centric data, limiting their effectiveness and applicability in a diverse, multilingual world.

A recent research paper, titled “TowerVision: Understanding and Improving Multilinguality in Vision-Language Models,” by André G. Viveiros and a team of researchers, addresses this critical gap. The paper introduces TowerVision, a new family of open multilingual VLMs that are specifically engineered to perform well across multiple languages and cultures, for both image-text and video-text tasks. You can find the full research paper here: TowerVision Research Paper.

The Multilingual Challenge and TowerVision’s Approach

The core problem lies in the scarcity of high-quality multilingual vision-text data. While text-only multilingual datasets are relatively abundant, finding diverse image-text pairs in many languages is difficult. TowerVision tackles this by leveraging large-scale text-only multilingual data to strengthen its language understanding, complemented by multimodal multilingual examples obtained through translation and high-quality synthetic generation.

The models are built upon Tower+, a multilingual text-only model, and are designed to support 20 languages and dialects. This comprehensive approach allows TowerVision to achieve competitive performance on various multilingual and multimodal benchmarks, showing particular strength in tasks that require cultural understanding and multimodal translation.

VisionBlocks: A Curated Multilingual Dataset

A key contribution of this work is the release of VisionBlocks, a high-quality, curated vision-language dataset. Creating such a dataset is challenging due to the limited availability of human-written vision-text data and the difficulties in filtering low-quality samples. VisionBlocks aggregates and filters data from multiple sources, enhancing it with new translated and synthetically generated content. This includes:

Existing English and multilingual vision-text data, focusing on quality over sheer scale.
Translated captions from high-quality English datasets into target languages, carefully filtered for accuracy.
Synthetic captions generated by advanced APIs to improve coverage of fine-grained visual details and provide instruction-like supervision.
Text-only data (EuroBlocks) to maintain the language model’s text performance.
Translated multilingual video data, extending the model’s capabilities to the video modality.
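
As a rough illustration of how a mixture like this could be assembled, the sketch below tags each sample with its source, drops low-scoring translations, and separates video from the rest. It is not the authors' pipeline: the `Sample` fields, the `quality` score, and the 0.8 threshold are hypothetical names and values chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical source tags: "existing", "translated", "synthetic",
    # "text_only", or "video".
    source: str
    language: str
    text: str
    # Hypothetical translation-quality score in [0, 1]; the article does
    # not say how translations are scored or filtered.
    quality: float = 1.0


def filter_translations(samples, min_quality=0.8):
    """Keep every non-translated sample; keep translations at or above the threshold."""
    return [
        s for s in samples
        if s.source != "translated" or s.quality >= min_quality
    ]


def split_by_modality(samples):
    """Separate video samples from everything else (image-text and text-only)."""
    non_video = [s for s in samples if s.source != "video"]
    video = [s for s in samples if s.source == "video"]
    return non_video, video
```

Keeping the video portion separate mirrors the article's description of training, where video data is only used in the final stage.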

Architecture and Training

TowerVision’s architecture combines three main components: a multilingual text-only backbone (Tower+), a Vision Transformer encoder (SigLIP2) for visual inputs, and a connector module to align visual features with the text embedding space. The training process involves three stages:

Projector Pretraining: Initially, the model is trained to predict captions from images using diverse, high-quality English captions, with the vision encoder and language model backbone frozen.
Vision Finetuning: The full model is then unfrozen and trained on the complete VisionBlocks dataset (excluding video data), using high-dynamic resolution for images. This stage produces the TowerVision model.
Video Finetuning: Finally, the video portion of VisionBlocks is used to finetune TowerVision, resulting in the TowerVideo model, which extends the model's capabilities to the video modality.
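
The three stages can be summarized as a schedule of which components are updated and on what data. The sketch below is only an illustration under stated assumptions: the component and dataset labels are made-up identifiers, and the article does not say which components are trainable during video finetuning, so the third stage assumes the full model stays unfrozen.

```python
from dataclasses import dataclass

# The three architectural components named in the article.
COMPONENTS = ("vision_encoder", "connector", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components whose weights are updated
    data: str             # illustrative label for the training data


STAGES = (
    # Stage 1: only the connector learns; encoder and backbone are frozen.
    Stage("projector_pretraining", frozenset({"connector"}), "english_captions"),
    # Stage 2: the full model is unfrozen and trained without video data.
    Stage("vision_finetuning", frozenset(COMPONENTS), "visionblocks_without_video"),
    # Stage 3: ASSUMPTION: all components remain trainable (not stated in the article).
    Stage("video_finetuning", frozenset(COMPONENTS), "visionblocks_video"),
)


def frozen_components(stage):
    """Return the components that are NOT updated in the given stage."""
    return [c for c in COMPONENTS if c not in stage.trainable]


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, "| frozen:", frozen_components(stage), "| data:", stage.data)
```

In a real training framework, the same schedule would typically be applied by switching gradient updates on or off for each component before a stage begins.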

Key Findings and Performance

TowerVision models demonstrate strong performance, especially on culturally aware benchmarks such as ALM-Bench, where they outperform existing approaches trained on substantially larger datasets. They are less competitive on OCR-related tasks because VisionBlocks contains limited OCR-focused data, but they still outperform some baselines.

Interestingly, the smaller TowerVision-2B model is competitive with larger models on multilingual benchmarks, highlighting the efficiency of its design choices. Scaling from 2B to 9B parameters consistently improves performance, which suggests the training recipe scales well.

The research also delves into crucial design choices:

Multilingual Backbones: Using Tower+ as a backbone consistently outperforms Gemma2, emphasizing the importance of a strong multilingual foundation for cross-modal understanding.
Multilingual-aware Vision Encoders: SigLIP2, trained on diverse multilingual data, provides an advantage in low-data regimes, though extensive multilingual fine-tuning can compensate for less specialized encoders.
Alignment Data: High-quality English captions are sufficient for the initial projector pretraining phase, with little to no positive effect from adding multilingual data at this early stage.
Language Coverage: Expanding the number of languages in the training data consistently improves performance and cross-lingual generalization, even for languages not explicitly included in the training set.
Video Fine-tuning: Incorporating multilingual data during video fine-tuning significantly enhances cross-lingual reasoning without compromising English performance.

In conclusion, TowerVision represents a significant step forward in developing inclusive and culturally aware multilingual vision-language models. By publicly releasing their models, data, and training recipes, the researchers aim to foster further advancements in this critical area, helping to narrow the performance gap with English-centric systems.
