TLDR: F2LLM is a new family of open-source embedding models (0.6B, 1.7B, 4B) that achieve state-of-the-art performance on the MTEB English leaderboard. Unlike many top models, F2LLM is trained solely on 6 million open-source, non-synthetic query-document-negative tuples, making it a cost-effective and reproducible baseline. The 4B model ranks 2nd among models of roughly 4B parameters and 7th overall, while the 1.7B model ranks 1st among models in the 1B to 2B parameter range.
Text embedding models have become crucial for a wide array of applications, from information retrieval to classification. These models transform text into numerical vectors, so that texts with similar meaning can be compared, searched, and grouped computationally. Recently, a new suite of models named F2LLM (Foundation to Feature Large Language Models) has emerged, promising state-of-the-art performance while addressing the cost and reproducibility problems that affect many leading embedding models.
F2LLM introduces three models of varying sizes: 0.6 billion, 1.7 billion, and 4 billion parameters. What sets F2LLM apart from many existing top-ranking embedding models is its unique approach to training. While many leading models rely on extensive contrastive pretraining, complex training pipelines, and costly synthetic data generated by other large language models, F2LLM takes a different path.
Instead, F2LLM is directly fine-tuned from foundation models using a carefully curated dataset of 6 million query-document-negative tuples. Crucially, this entire dataset is sourced from open-source, non-synthetic datasets. This strategy not only makes F2LLM a more budget-friendly option but also significantly enhances its reproducibility, allowing other researchers to replicate and build upon its success without prohibitive costs or proprietary data.
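To make the tuple format concrete, here is a minimal Python sketch of what one query-document-negative training example could look like. The class name `TrainingTuple`, its field names, the `task` label, and the `is_valid` check are all hypothetical and chosen for illustration; the actual schema of the released dataset may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class TrainingTuple:
    """One training example (hypothetical schema, not the released format)."""
    query: str                 # the text to embed and match
    document: str              # a text that is relevant to the query (positive)
    negatives: List[str] = field(default_factory=list)  # texts that are not relevant
    task: str = "retrieval"    # assumed label: retrieval, classification, or clustering


def is_valid(example: TrainingTuple) -> bool:
    """Accept a tuple only if every text is non-empty and no negative
    is identical to the positive document."""
    texts = [example.query, example.document, *example.negatives]
    return all(t.strip() for t in texts) and example.document not in example.negatives


example = TrainingTuple(
    query="what does MTEB measure",
    document="MTEB is a benchmark for text embedding models.",
    negatives=["Paris is the capital of France."],
)
print(is_valid(example))
```

Storing every task in one shape like this is one way to read the report's claim that all samples are "formatted uniformly": a single training loop can then consume retrieval, classification, and clustering data without task-specific code paths.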
The performance of F2LLM has been rigorously evaluated on the MTEB (Massive Text Embedding Benchmark) English leaderboard, a widely recognized benchmark for text embedding models. The results are impressive: F2LLM-4B, the largest model in the suite, secured the 2nd position among models with approximately 4 billion parameters and ranked 7th overall. Even more remarkably, F2LLM-1.7B achieved the top spot among models in the 1 billion to 2 billion parameter range, making it an excellent choice for applications with limited computational resources.
One area where F2LLM particularly shines is in clustering tasks, where the 4B model achieved a score of 68.54, setting a new record among all models evaluated. This highlights the model’s strong ability to group similar texts together effectively.
F2LLM uses Qwen3 models as the backbone and applies contrastive fine-tuning for two epochs. The training data is a large-scale composite of 4.9 million retrieval samples, 0.2 million classification samples, and 0.8 million clustering samples, which together make up the roughly 6 million tuples (5.9 million by these rounded counts), all formatted uniformly. This diverse dataset, combined with a single-stage training approach, demonstrates that high performance can be achieved without complex multi-stage pipelines or synthetic data.
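For readers unfamiliar with contrastive fine-tuning, the following is a minimal sketch of a commonly used contrastive objective (an InfoNCE-style loss over one positive and several negatives), written in plain Python on toy vectors. It is not taken from the F2LLM training code: the function names, the use of cosine similarity, and the temperature value of 0.05 are assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not from the report
) -> float:
    """InfoNCE-style loss: negative log-probability of the positive
    under a softmax over scaled similarities to all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Toy 2-D "embeddings": the query points along the first axis.
query = [1.0, 0.0]
aligned_loss = contrastive_loss(query, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
misaligned_loss = contrastive_loss(query, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
print(aligned_loss < misaligned_loss)
```

The loss is small when the query embedding is closer to its positive document than to any negative, and large otherwise, so minimizing it pulls matching texts together and pushes non-matching texts apart in the embedding space.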
By fully open-sourcing its model checkpoints, training dataset, and training code, F2LLM aims to provide a robust, reproducible, and cost-effective baseline for future research in text embedding models. This initiative is expected to foster further innovation and accessibility in the field of natural language processing.
For more in-depth technical details, see the full research paper: F2LLM Technical Report.


