TLDR: QZhou-Embedding is a new general-purpose text embedding model from Kingsoft AI, built on Qwen2.5-7B-Instruct. It uses a unified multi-task framework, LLM-powered data synthesis (paraphrasing, augmentation, hard negative generation), and a two-stage training strategy. The model achieved state-of-the-art results on MTEB and CMTEB benchmarks, demonstrating the importance of high-quality, diverse data for advanced text representation.
Kingsoft AI has released QZhou-Embedding, a general-purpose text embedding model built on Qwen2.5-7B-Instruct. The model reached state-of-the-art results on the MTEB and CMTEB leaderboards, two major benchmarks for text embedding models.
What are Text Embeddings?
Text embedding models underpin many AI applications, including search engines, question-answering systems, and recommendation engines. They convert natural language text into numerical vectors so that texts with similar meaning land close together, which lets computers compare text by meaning rather than by exact wording. The better the embedding, the more accurately a system can rank or match text for a given query.
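To make "comparing text based on its meaning" concrete, here is a minimal sketch of how embedding vectors are typically compared with cosine similarity. The three-dimensional vectors are made up for illustration and do not come from QZhou-Embedding; real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (illustrative assumption, not real model output).
query = [0.9, 0.1, 0.0]
documents = {
    "related": [0.8, 0.2, 0.1],
    "unrelated": [0.0, 0.1, 0.9],
}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

A search system built on embeddings does essentially this at scale: embed the query, then return the stored texts whose vectors score highest.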
A Unified Approach to Learning
QZhou-Embedding introduces a unified multi-task framework that handles diverse types of text data and optimizes training. This framework includes specialized data transformation techniques to adapt various data formats for retrieval, natural language inference (NLI), and classification tasks. For instance, it can convert news articles, academic papers, and Q&A datasets into a format suitable for training, treating titles as queries and bodies as positive samples for retrieval, or scoring sentence pairs for semantic similarity in NLI tasks.
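The transformations described above can be sketched roughly as follows. This is an illustration only: the `TrainingExample` record and both helper functions are hypothetical names, not the format or code used by the QZhou-Embedding authors.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingExample:
    """Hypothetical unified record shared by all task types."""
    task: str                      # "retrieval", "nli", or "classification"
    anchor: str                    # query or first sentence
    target: str                    # positive passage or second sentence
    score: Optional[float] = None  # similarity score, used for sentence pairs

def from_titled_document(title: str, body: str) -> TrainingExample:
    """Treat a document title as the query and its body as the positive sample."""
    return TrainingExample(task="retrieval", anchor=title.strip(), target=body.strip())

def from_scored_pair(sentence_a: str, sentence_b: str, score: float) -> TrainingExample:
    """Keep a sentence pair together with its semantic similarity score."""
    return TrainingExample(
        task="nli", anchor=sentence_a.strip(), target=sentence_b.strip(), score=score
    )

news = from_titled_document(" New embedding model released ", "Kingsoft AI has released ...")
pair = from_scored_pair("A man is cooking.", "Someone prepares food.", 0.8)
print(news.task, "|", news.anchor)
print(pair.task, "|", pair.score)
```

Mapping every source into one record type is what allows a single training loop to consume news articles, papers, Q&A data, and sentence pairs alike.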
Smart Data Generation with AI
A central part of QZhou-Embedding is its data synthesis pipeline, which uses large language model (LLM) APIs to create higher-quality and more varied training data. The pipeline employs three techniques:
- Paraphrasing: Generates structurally diverse sentences that retain the original meaning, making the model robust to different phrasing.
- Augmentation: Expands the semantic diversity of the data by exploring different topics, aspects, and viewpoints related to the original text.
- Hard Negative Example Generation: Creates challenging negative examples that are difficult for the model to distinguish from positive ones, pushing the model to learn finer semantic differences.
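To see why hard negatives push a model harder than easy ones, consider a minimal sketch that assumes an InfoNCE-style contrastive loss. The article does not state which loss QZhou-Embedding uses, so the loss form, the temperature value, and the similarity scores below are all assumptions chosen for illustration.

```python
import math

def info_nce_loss(sim_pos, sim_negs, temperature=0.05):
    """Contrastive loss for one query: -log softmax probability of the positive.

    sim_pos: similarity between the query and its positive sample.
    sim_negs: similarities between the query and each negative sample.
    temperature: scaling factor (0.05 is an assumed, illustrative value).
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Subtract the max before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

# Same positive, but one batch has a negative that is nearly as similar as the positive.
easy_loss = info_nce_loss(0.8, [0.1, 0.0])
hard_loss = info_nce_loss(0.8, [0.75, 0.0])
print(f"easy negatives: {easy_loss:.6f}")
print(f"hard negatives: {hard_loss:.6f}")
```

With easy negatives the loss is already close to zero, so the model receives almost no learning signal. A negative that scores nearly as high as the positive keeps the loss large until the model learns to separate the two.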
Two Stages of Training for Robust Performance
The model undergoes a two-stage training process. The first stage focuses exclusively on building strong retrieval capabilities using retrieval-oriented data. The second stage then integrates both retrieval and non-retrieval tasks, fine-tuning the model for a broader range of applications while maintaining its robust retrieval performance. This strategy ensures that QZhou-Embedding develops a solid foundation before extending its capabilities across various tasks.
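The staging can be sketched as a simple filter over the training pool. The task labels, the example records, and the helper names are hypothetical; the actual data mixture and sampling used for QZhou-Embedding are not described in this article.

```python
def tasks_for_stage(stage):
    """Return the task types allowed in a given training stage."""
    if stage == 1:
        # Stage 1: retrieval-oriented data only.
        return {"retrieval"}
    if stage == 2:
        # Stage 2: retrieval plus non-retrieval tasks.
        return {"retrieval", "nli", "classification"}
    raise ValueError(f"unknown stage: {stage}")

def build_stage_pool(examples, stage):
    """Keep only the examples whose task is allowed in this stage."""
    allowed = tasks_for_stage(stage)
    return [ex for ex in examples if ex["task"] in allowed]

# Tiny illustrative dataset (made up).
examples = [
    {"task": "retrieval", "text": "title -> body"},
    {"task": "nli", "text": "sentence pair"},
    {"task": "classification", "text": "labeled text"},
]

stage1_pool = build_stage_pool(examples, stage=1)
stage2_pool = build_stage_pool(examples, stage=2)
print(len(stage1_pool), len(stage2_pool))
```

Keeping retrieval data in the second pool reflects the stated goal of broadening the model without giving up the retrieval ability built in the first stage.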
Achieving State-of-the-Art Results
QZhou-Embedding achieved state-of-the-art results on both the MTEB (Massive Text Embedding Benchmark) and CMTEB (Chinese Massive Text Embedding Benchmark) leaderboards as of August 27, 2025. It also performs strongly on specific task categories such as Reranking and Clustering. These results support Kingsoft AI’s approach and point to the role of high-quality, diverse training data, especially data produced or refined with LLM-driven generation.
The researchers emphasize that leveraging LLMs to optimize data quality is key to breakthroughs in embedding models. The model weights are publicly available on HuggingFace, and evaluation code and instructions can be found on GitHub for reproducibility. For more technical details, you can read the full research paper: QZhou-Embedding Technical Report.


