New F2LLM Models Deliver Top Embedding Performance Using Only Open-Source Data

TLDR: F2LLM is a new family of open-source embedding models (0.6B, 1.7B, 4B) that rank near the top of the MTEB English leaderboard. Unlike many top models, F2LLM is trained solely on 6 million query-document-negative tuples drawn from open-source, non-synthetic datasets, making it a cost-effective and reproducible baseline. The 4B model ranks 2nd among models of roughly 4B parameters and 7th overall, while the 1.7B model ranks 1st in the 1B to 2B parameter range.

Text embedding models have become crucial for a wide range of applications, from information retrieval to classification. They transform text into numerical vectors so that texts with similar meaning can be compared, searched, and grouped efficiently. A new suite of models named F2LLM (Foundation to Feature Large Language Models) reports strong MTEB results while addressing the cost and reproducibility problems common among leading models.

F2LLM introduces three models of varying sizes: 0.6 billion, 1.7 billion, and 4 billion parameters. What sets F2LLM apart from many existing top-ranking embedding models is its unique approach to training. While many leading models rely on extensive contrastive pretraining, complex training pipelines, and costly synthetic data generated by other large language models, F2LLM takes a different path.

Instead, F2LLM is directly fine-tuned from foundation models using a carefully curated dataset of 6 million query-document-negative tuples. Crucially, this entire dataset is sourced from open-source, non-synthetic datasets. This strategy not only makes F2LLM a more budget-friendly option but also significantly enhances its reproducibility, allowing other researchers to replicate and build upon its success without prohibitive costs or proprietary data.
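To make the data format concrete, here is a minimal Python sketch of what one query-document-negative tuple could look like. The class name, field names, validation rule, and example text are all assumptions for illustration; the actual schema of the released dataset may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrainingTuple:
    """Hypothetical container for one training example (names are illustrative)."""
    query: str
    document: str  # text that should match the query
    negatives: List[str] = field(default_factory=list)  # texts that should not match


def is_well_formed(example: TrainingTuple) -> bool:
    """Check that all fields are non-empty and no negative duplicates the positive."""
    if not example.query.strip() or not example.document.strip():
        return False
    if not example.negatives:
        return False
    return all(n.strip() and n != example.document for n in example.negatives)


# Made-up example text, not taken from the F2LLM dataset.
example = TrainingTuple(
    query="how do text embedding models work",
    document="Embedding models map text to vectors so that similar texts end up close together.",
    negatives=["The MTEB leaderboard ranks models across many tasks."],
)
```

Calling `is_well_formed(example)` on the sample above returns `True`. A tuple with no negatives, or with a negative identical to the positive document, would be rejected under this assumed rule.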

The performance of F2LLM has been rigorously evaluated on the MTEB (Massive Text Embedding Benchmark) English leaderboard, a widely recognized benchmark for text embedding models. The results are impressive: F2LLM-4B, the largest model in the suite, secured the 2nd position among models with approximately 4 billion parameters and ranked 7th overall. Even more remarkably, F2LLM-1.7B achieved the top spot among models in the 1 billion to 2 billion parameter range, making it an excellent choice for applications with limited computational resources.

One area where F2LLM particularly shines is in clustering tasks, where the 4B model achieved a score of 68.54, setting a new record among all models evaluated. This highlights the model’s strong ability to group similar texts together effectively.

F2LLM uses Qwen3 models as the backbone and applies contrastive fine-tuning for two epochs. The training data is a large-scale composite of 4.9 million retrieval samples, 0.2 million classification samples, and 0.8 million clustering samples (about 5.9 million in total, in line with the roughly 6 million tuples cited above), all in a uniform format. This diverse dataset, combined with a single-stage training approach, demonstrates that high performance can be achieved without complex multi-stage pipelines or synthetic data.
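As a rough illustration of contrastive fine-tuning on such tuples, the sketch below computes an InfoNCE-style loss for a single query, one positive document, and a list of negatives. This is a generic textbook formulation, not necessarily the exact loss in the F2LLM report; the function names, the use of cosine similarity, and the temperature of 0.05 are assumptions, and real training would typically operate on batches of model-produced embeddings.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.05,  # assumed value, chosen only for illustration
) -> float:
    """Cross-entropy over similarities, with the positive document as the target."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Negative log-probability assigned to the positive (index 0).
    return log_sum - logits[0]


# Toy 2-D embeddings: the positive points the same way as the query.
low = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Same vectors with the roles swapped: the "positive" is now orthogonal.
high = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy setup `low` is close to zero and `high` is much larger, which is the behavior the loss is meant to have: it pushes the query embedding toward its matching document and away from the negatives.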

By fully open-sourcing its model checkpoints, training dataset, and training code, F2LLM aims to provide a robust, reproducible, and cost-effective baseline for future research in text embedding models. This initiative is expected to foster further innovation and accessibility in the field of natural language processing.

For more in-depth technical details, you can refer to the full research paper available here: F2LLM Technical Report.
