TL;DR: The Qwen Team has officially open-sourced Qwen-Image, a state-of-the-art image foundation model designed for advanced text-to-image generation and precise text-image-to-image editing. This new model excels in rendering complex text, including both English and Chinese, and demonstrates superior performance across various benchmarks, marking a significant advancement in multimodal AI.
The Qwen Team has announced the open-sourcing of Qwen-Image, an innovative image foundation model that promises to redefine the landscape of AI-driven visual content creation and manipulation. Released on August 26, 2025, Qwen-Image is engineered to handle sophisticated text-to-image (T2I) generation and text-image-to-image (TI2I) editing tasks, showcasing remarkable capabilities in rendering intricate text and achieving high scores on industry benchmarks.
At its core, Qwen-Image integrates a sophisticated architecture comprising the Qwen2.5-VL multimodal language model for processing text inputs, a Variational AutoEncoder (VAE) for encoding and decoding images, and a Multimodal Diffusion Transformer (MMDiT) responsible for the actual image generation. The combined system is particularly adept at text rendering, excelling in both English and Chinese and preserving typographic details, layout coherence, and contextual harmony with striking accuracy.
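For developers who want to experiment, the model should be straightforward to load through Hugging Face diffusers. The sketch below is a minimal text-to-image example; the "Qwen/Qwen-Image" model id and the standard DiffusionPipeline arguments are assumptions based on the open-weights release, not details confirmed by this article:

```python
# Minimal text-to-image sketch. The "Qwen/Qwen-Image" model id and its
# availability through diffusers' DiffusionPipeline are assumptions here.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",           # assumed Hugging Face Hub id for the weights
    torch_dtype=torch.bfloat16,  # reduced precision to fit a single GPU
)
pipe.to("cuda")

# A prompt that exercises the bilingual text-rendering strength noted above.
prompt = (
    'A cozy coffee-shop storefront at dusk, neon sign reading "Qwen Coffee", '
    "a chalkboard menu written in both English and Chinese."
)
image = pipe(
    prompt=prompt,
    width=1328,                  # matches the model's final training resolution
    height=1328,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(42),  # reproducible output
).images[0]
image.save("qwen_image_demo.png")
```

Prompts that mix English and Chinese text, as above, are exactly the cases where the MMDiT-based text rendering is claimed to shine.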
The model’s performance has been rigorously evaluated across a suite of T2I and TI2I benchmarks, including DPG, GenEval, GEdit, and ImgEdit, where it consistently achieved the highest overall scores. Qwen-Image has also made a strong showing in AI Arena, an open benchmarking platform built on the Elo rating system for human evaluation of generated images, where it currently ranks third alongside five high-quality closed models, including GPT Image 1.
Beyond generation, Qwen-Image offers extensive image editing functionalities, enabling advanced operations such as style transfer, object insertion or removal, detail enhancement, and even human pose manipulation with intuitive input and coherent output. It also supports a range of image understanding tasks, including object detection, semantic segmentation, depth and edge estimation, novel view synthesis, and super-resolution, which are viewed as specialized forms of intelligent image editing.
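The editing workflow can be sketched in the same way. The "Qwen/Qwen-Image-Edit" checkpoint id and the image-plus-prompt call signature below are illustrative assumptions, not details taken from the announcement:

```python
# Text-image-to-image (TI2I) editing sketch. The "Qwen/Qwen-Image-Edit"
# checkpoint id and the image-plus-prompt call signature are assumptions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit",      # hypothetical editing checkpoint id
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

source = load_image("storefront.png")  # the image to edit
edited = pipe(
    image=source,                # conditioning image for the TI2I task
    prompt="Remove the bicycle in the foreground and repaint the door red",
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
edited.save("storefront_edited.png")
```

Framing object removal, style transfer, and pose manipulation as a single instruction-following interface is what lets the team treat tasks like segmentation and super-resolution as "specialized forms of intelligent image editing."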
Developing Qwen-Image involved a comprehensive data pipeline. The Qwen Team ‘collected and annotated billions of image-text pairs’ from diverse categories: nature (approximately 55%), design (around 27%, including text-rich images such as paintings, posters, and GUIs), people, and synthetic data. This initial dataset underwent rigorous filtering to ensure high quality. The team also employed a progressive training strategy, increasing image resolution from 256×256 to 640×640 and finally to 1328×1328 pixels while moving the model from simple to progressively more complex textual inputs.
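As a purely conceptual illustration of that progressive schedule (and emphatically not the Qwen Team's actual training code), the resolution curriculum might look like this:

```python
# Conceptual sketch of the progressive-resolution curriculum (256 -> 640 ->
# 1328) described above. Purely illustrative; not the Qwen Team's code.
from PIL import Image

RESOLUTION_SCHEDULE = [256, 640, 1328]

def batches_at(images, resolution):
    """Yield source images resized to the current curriculum resolution."""
    for img in images:
        yield img.resize((resolution, resolution), Image.LANCZOS)

# Stand-in for the real billion-pair dataset: one dummy high-res image.
dataset = [Image.new("RGB", (2048, 2048), color="white")]

for resolution in RESOLUTION_SCHEDULE:
    for sample in batches_at(dataset, resolution):
        # A real pipeline would run a diffusion training step on `sample`
        # here, paired with progressively more complex text prompts.
        pass
```

The intuition behind such curricula is that cheap low-resolution passes teach global composition before expensive high-resolution passes refine fine detail, including legible text.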
According to the Qwen Team, ‘Qwen-Image is more than a state-of-the-art image generation model—it represents a paradigm shift in how we conceptualize and build multimodal foundation models. Its contributions extend beyond technical benchmarks, challenging the community to rethink the roles of generative models in perception, interface design, and cognitive modeling…As we continue to scale and refine such systems, the boundary between visual understanding and generation will blur further, paving the way for truly interactive, intuitive, and intelligent multimodal agents.’
This open-source release, under the Apache 2.0 license, makes Qwen-Image a versatile tool for artists, designers, storytellers, and developers, fostering innovation in the multimodal AI space.


