Tencent's Conan-embedding-v2: Advancing Text Representation Across Languages

TLDR: Conan-embedding-v2 is a 1.4-billion-parameter large language model developed by Tencent, trained from scratch to excel in text embedding tasks. It addresses common limitations of fine-tuning existing LLMs by bridging data and training gaps. Key innovations include incorporating diverse news and multilingual data during pre-training, a soft-masking mechanism for better representation learning, a novel cross-lingual retrieval dataset for 26 languages, and dynamic hard negative mining. The model achieves state-of-the-art performance on both English and Chinese text embedding benchmarks (MTEB) while maintaining an efficient size and fast inference speed, and supports Matryoshka Representation Learning (MRL).

In the rapidly evolving landscape of artificial intelligence, text embeddings have emerged as a cornerstone technology, transforming how machines understand and process human language. These embeddings convert words, sentences, or entire documents into numerical vectors, allowing similar texts to be represented closely in a high-dimensional space. This capability significantly boosts the performance of various downstream tasks, from information retrieval to sentiment analysis.
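To make "represented closely" concrete, the short sketch below compares vectors by cosine similarity, a common closeness measure for embeddings. The three-dimensional vectors are invented for illustration and were not produced by Conan-embedding-v2 or any other model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumption: real embeddings have far more dimensions).
cat = [0.9, 0.1, 0.2]
kitten = [0.8, 0.2, 0.3]
invoice = [0.1, 0.9, 0.1]

# Related texts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

In a retrieval system, the same comparison would be run between a query embedding and each candidate document embedding, with the highest-scoring documents returned first.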

While large language models (LLMs) have shown remarkable prowess in generating these text embeddings, previous approaches often faced inherent limitations. Many models relied on fine-tuning existing LLMs, such as Mistral-7B, using methods like LoRA. However, this approach can be constrained by fundamental differences between how LLMs are initially trained and what embedding models require. There’s often a “data gap,” where the base LLM’s training data doesn’t align well with the needs of embedding tasks, and a “training gap,” because LLMs are trained to predict the next token under a causal mask, while embedding models need a holistic view of the whole sentence, which calls for a bidirectional mask.

Addressing these challenges head-on, researchers from Tencent’s Basic Algorithm Center have introduced Conan-embedding-v2, a 1.4-billion-parameter LLM trained entirely from scratch and then fine-tuned specifically as a text embedder. This approach aims to bridge the data and training gaps described above, leading to more effective and comprehensive text representations.

Bridging the Data Gap

To tackle the data mismatch, Conan-embedding-v2 incorporates extensive news data and multilingual pairs during its initial LLM pre-training phase. This strategic inclusion helps align the model’s foundational understanding with the diverse data requirements of embedding tasks. Building on this, the team developed a unique cross-lingual retrieval dataset. This dataset enables the LLM to better integrate embeddings across 26 different languages, facilitating bidirectional search and narrowing the representation gap between them. This is a significant step towards truly cohesive multilingual understanding, as demonstrated by how the model successfully unified embeddings for six diverse languages into a single distribution, unlike vanilla methods where languages clustered separately.

Overcoming the Training Gap with Soft Masking

The difference in masking mechanisms between LLMs (causal) and embedding models (bidirectional) is a critical hurdle. Switching abruptly from one to the other can hinder effective learning. Conan-embedding-v2 introduces a soft-masking mechanism that allows for a gradual transition from the causal mask to the bidirectional one. This mechanism progressively updates attention weights and dynamically reduces the mask’s rank, enabling the model to learn richer, more comprehensive feature representations during early training stages. According to the authors, this soft transition helps the model avoid getting stuck in local minima, a common issue with abrupt changes.
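The idea of a gradual transition can be sketched as a mask whose "future" positions are weighted by a value that grows from 0 to 1 over training. This is only an illustration under stated assumptions: the function names, the linear schedule, and the per-entry weighting are hypothetical and are not taken from the paper, whose exact formulation is not given here.

```python
def soft_mask(seq_len, alpha):
    """Build a seq_len x seq_len mask of attention weights.

    Entry [i][j] is the weight with which token i may attend to token j.
    Past and current positions (j <= i) always get weight 1.0.
    Future positions (j > i) get weight alpha:
      alpha = 0.0 -> causal mask, alpha = 1.0 -> bidirectional mask.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return [
        [1.0 if j <= i else alpha for j in range(seq_len)]
        for i in range(seq_len)
    ]

def linear_schedule(step, total_steps):
    """Hypothetical schedule: alpha rises linearly, then stays at 1.0."""
    return min(1.0, step / total_steps)

# Watch a 3-token mask open up over 100 (assumed) transition steps.
for step in (0, 50, 100):
    print(step, soft_mask(3, linear_schedule(step, 100)))
```

At alpha = 0 the mask is the lower-triangular matrix of ones, which has full rank; at alpha = 1 it is the all-ones matrix, which has rank 1. That is at least consistent with the article's remark that the mask's rank is reduced as training moves toward bidirectional attention.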

Dynamic Hard Negative Mining for Sharper Learning

Beyond soft masking, the paper also proposes a dynamic hard negative mining (DHNM) method. Unlike traditional approaches that rely on fixed hard negatives identified during preprocessing, DHNM continuously assesses the difficulty of negative samples throughout the training process. If a negative example becomes too easy, it’s dynamically replaced with a more challenging one from a candidate pool. This ensures that the model is constantly exposed to difficult examples, refining its ability to distinguish between similar but distinct texts without introducing additional computational overhead.
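A minimal sketch of the replacement step might look like the following. The function name, the dot-product score, the threshold rule, and the toy data are all assumptions made for illustration; the paper's actual difficulty criterion is not described here. The sketch also recomputes scores for clarity, whereas a real training loop would presumably reuse scores it already has.

```python
def dot(a, b):
    """Dot product, used here as a stand-in similarity score."""
    return sum(x * y for x, y in zip(a, b))

def refresh_negatives(query, negatives, pool, easy_threshold):
    """Swap out negatives the model already separates easily.

    query: query embedding.
    negatives, pool: lists of (text_id, embedding) pairs.
    A negative is treated as "too easy" when its score against the query
    falls below easy_threshold (a hypothetical criterion).
    """
    used = {text_id for text_id, _ in negatives}
    # Unused candidates, hardest (most similar to the query) first.
    spares = sorted(
        (c for c in pool if c[0] not in used),
        key=lambda c: dot(query, c[1]),
        reverse=True,
    )
    refreshed = []
    for text_id, emb in negatives:
        score = dot(query, emb)
        # Replace only if a strictly harder spare candidate is available.
        if score < easy_threshold and spares and dot(query, spares[0][1]) > score:
            refreshed.append(spares.pop(0))
        else:
            refreshed.append((text_id, emb))
    return refreshed

# Toy data, invented for illustration.
query = [1.0, 0.0]
negatives = [("n1", [0.9, 0.1]), ("n2", [0.1, 0.9])]
pool = [("n1", [0.9, 0.1]), ("n3", [0.7, 0.3]), ("n4", [0.2, 0.8])]

updated = refresh_negatives(query, negatives, pool, easy_threshold=0.5)
print([text_id for text_id, _ in updated])
```

Here "n1" is still hard (score 0.9) and is kept, while "n2" has become easy (score 0.1) and is replaced by "n3", the hardest unused candidate in the pool.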

State-of-the-Art Performance and Practical Efficiency

Conan-embedding-v2 has demonstrated impressive results, achieving state-of-the-art performance on both the Massive Text Embedding Benchmark (MTEB) for English and the Chinese MTEB. It excels across various tasks, including classification, clustering, pair classification, reranking, retrieval, and semantic textual similarity. Notably, it achieves these results with a relatively compact architecture of approximately 1.4 billion parameters, while producing 3584-dimensional embeddings, more dimensions than many larger models offer.

From a practical standpoint, Conan-embedding-v2 is also efficient. Its reported inference time is 5.14 minutes (measured on a single 910B GPU for English queries), making it one of the fastest models evaluated. Furthermore, it supports Matryoshka Representation Learning (MRL), which allows embeddings of different dimensions to be produced from the same model, a capability shared by only a few other models. This balance of high performance, efficient size, and versatile features makes Conan-embedding-v2 a significant advancement in text embedding technology.
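In practice, MRL-trained embeddings are typically shortened by keeping the leading components and rescaling the result to unit length. The helper below sketches that step. The function name and the toy vector are hypothetical, and the dimensions Conan-embedding-v2 actually supports are not stated in this article.

```python
import math

def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]

# Toy 4-dimensional vector standing in for a full-size embedding.
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_embedding(full, 2)
print(short)
```

Shorter vectors reduce storage and speed up similarity search, usually at some cost in accuracy, so the dimension is a trade-off that each application can set for itself.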

Conclusion and Future Outlook

Conan-embedding-v2 represents a significant leap forward in text embedding models by meticulously addressing the foundational data and training gaps. By training an LLM from scratch with specialized techniques like soft masking, a cross-lingual retrieval dataset, and dynamic hard negative mining, it sets a new benchmark for performance and efficiency. The authors hope this work will inspire further research in embedding training methods and plan to continue updating the model to enhance its capabilities, including exploring cross-modal retrieval in the future.
