NeoBabel: Advancing Inclusive Image Generation Across Languages

TLDR: NeoBabel is a novel multilingual text-to-image generation framework that directly supports six languages (English, Chinese, Dutch, French, Hindi, Persian) without relying on translation. It achieves state-of-the-art performance on multilingual benchmarks, maintains strong English capabilities, and is two to four times smaller than the English-only models it matches or exceeds. The project also releases an open toolkit, including code, models, and a large multilingual dataset, aiming to promote equitable and culturally aligned generative AI.

A new research paper introduces NeoBabel, a groundbreaking framework designed to overcome the English-centric bias prevalent in text-to-image generation. This bias has historically created significant barriers for non-English speakers, leading to digital inequities and cultural misalignments. Current systems often rely on translation pipelines, which can introduce problems like semantic drift, increased computational overhead, and a loss of cultural nuance.

NeoBabel aims to set a new standard for performance, efficiency, and inclusivity in visual generation. It directly supports six languages: English, Chinese, Dutch, French, Hindi, and Persian, eliminating the need for translation layers. The model achieves this by combining large-scale multilingual pretraining with high-resolution instruction tuning.

To evaluate its capabilities, the researchers expanded two existing English-only benchmarks, GenEval and DPG-Bench, into multilingual equivalents: m-GenEval and m-DPG. NeoBabel demonstrates state-of-the-art multilingual performance while maintaining strong capabilities in English. It scored 0.75 on m-GenEval and 0.68 on m-DPG, outperforming leading models on both benchmarks by significant margins, even though some of those competitors are built on multilingual base language models.

The effectiveness of NeoBabel’s targeted alignment training is evident in its ability to preserve and extend cross-lingual generalization. The framework also introduces two new metrics, Cross-Lingual Consistency (CLC) and Code Switching Similarity (CSS), to rigorously assess multilingual alignment and robustness to prompts that mix multiple languages.
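The article does not give the formulas behind CLC or CSS. Purely as an illustration, one plausible way to score cross-lingual consistency is the mean pairwise cosine similarity between embeddings of images generated from the same prompt written in different languages. The function names, the embedding vectors, and the averaging scheme below are assumptions for this sketch, not the paper's definition.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_lingual_consistency(embeddings_by_lang):
    """Hypothetical consistency score (NOT the paper's formula).

    embeddings_by_lang maps a language code to the embedding of the image
    generated from that language's version of one prompt. The score is the
    average cosine similarity over all unordered language pairs.
    """
    pairs = list(combinations(sorted(embeddings_by_lang), 2))
    if not pairs:
        raise ValueError("need embeddings for at least two languages")
    total = sum(
        cosine(embeddings_by_lang[a], embeddings_by_lang[b]) for a, b in pairs
    )
    return total / len(pairs)
```

Under this assumed definition, a score near 1 would mean the model produces nearly the same image regardless of the prompt's language, while a low score would signal that the output drifts from one language to another.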

Remarkably, NeoBabel matches or exceeds the performance of English-only models while being two to four times smaller. This efficiency is a significant advantage for real-world deployment: the model processes multilingual prompts 2.8 times faster and uses 59% less memory than traditional translation-then-generation pipelines.
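To make the two efficiency figures concrete, the short sketch below applies them to a translation-then-generation baseline. Only the 2.8x speedup and the 59% memory reduction come from the article; the function name and any baseline numbers passed to it are hypothetical.

```python
SPEEDUP = 2.8            # reported: prompts processed 2.8 times faster
MEMORY_REDUCTION = 0.59  # reported: 59% less memory


def direct_generation_cost(pipeline_seconds, pipeline_memory_gb):
    """Estimate direct multilingual cost from a translate-then-generate baseline.

    The baseline inputs are illustrative placeholders, not measured values.
    Returns (seconds, memory_gb) implied by the reported ratios.
    """
    seconds = pipeline_seconds / SPEEDUP
    memory_gb = pipeline_memory_gb * (1.0 - MEMORY_REDUCTION)
    return seconds, memory_gb
```

For example, a hypothetical pipeline that needs 28 seconds and 10 GB per prompt would correspond to roughly 10 seconds and 4.1 GB with direct generation, since a 59% reduction leaves 41% of the original memory footprint.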

The core of NeoBabel’s architecture involves a multilingual transformer backbone. It utilizes the Gemma-2 tokenizer for text and the MAGVIT-v2 quantizer for images, creating a unified multimodal embedding space. This allows the model to process both text and image inputs natively, learning cross-modal compositionality and semantic alignment without needing separate components for different modalities or tasks.
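A common way to build a unified embedding space from a text tokenizer and an image quantizer is to place both sets of token IDs in one shared vocabulary, offsetting the image codes past the text range. The sketch below illustrates that general idea only; the vocabulary sizes are toy values, and the article does not confirm that NeoBabel lays out its vocabulary this way.

```python
TEXT_VOCAB_SIZE = 1000      # toy value, not the real Gemma-2 vocabulary size
IMAGE_CODEBOOK_SIZE = 512   # toy value, not the real MAGVIT-v2 codebook size


def to_unified_sequence(text_ids, image_codes):
    """Concatenate text tokens and image codes into one shared ID space.

    Text IDs keep their values in [0, TEXT_VOCAB_SIZE); image codes are
    shifted into [TEXT_VOCAB_SIZE, TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE).
    """
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_ids):
        raise ValueError("text id out of range")
    if any(not 0 <= c < IMAGE_CODEBOOK_SIZE for c in image_codes):
        raise ValueError("image code out of range")
    return list(text_ids) + [TEXT_VOCAB_SIZE + c for c in image_codes]


def split_unified_sequence(sequence):
    """Recover (text_ids, image_codes) from a unified sequence."""
    text_ids = [t for t in sequence if t < TEXT_VOCAB_SIZE]
    image_codes = [t - TEXT_VOCAB_SIZE for t in sequence if t >= TEXT_VOCAB_SIZE]
    return text_ids, image_codes
```

With a layout like this, a single transformer can use one embedding table and one output head for both modalities, which is one way to avoid separate components per modality.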

The training process for NeoBabel is progressive, starting with three stages of pretraining to build foundational visual understanding and scale alignment with large multilingual datasets. This is followed by two stages of instruction tuning, which refine the model’s ability to interpret and execute complex, multilingual instructions at high resolution.

A key contribution of this work is the release of an open toolkit, including all code, model checkpoints, a curated dataset of 124 million multilingual text-image pairs, and standardized multilingual evaluation protocols. This open-source approach is intended to foster further inclusive AI research and advance the field.

The research underscores that multilingual capability is not a compromise but rather a catalyst for improved robustness, efficiency, and cultural fidelity in generative AI. For more details, you can refer to the research paper.

Ananya Rao (https://blogs.edgentiq.com)
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
