TLDR: The COMP-COMP framework offers a new approach to building benchmarks for domain-specific Large Language Models (LLMs). Moving beyond traditional ‘scaling law’ methods, it emphasizes balancing ‘comprehensiveness’ (broad semantic coverage) and ‘compactness’ (efficient, non-redundant data). Demonstrated with XUBench, an academic benchmark, the framework shows improved evaluation efficiency and effectiveness by optimizing both corpus and question-answer set construction, leading to better performance for specialized LLMs while significantly reducing resource requirements.
Large Language Models (LLMs) have shown incredible versatility, but for specialized fields like law, medicine, or academia, general-purpose models often fall short. This has led to the rise of domain-specific LLMs, which are designed to provide highly accurate and precise answers tailored to particular areas. However, effectively evaluating these specialized AI models presents a significant challenge: how do you build a benchmark that truly tests their capabilities without being overly cumbersome or missing crucial aspects of the domain?
Traditionally, many domain-specific benchmarks have relied on a ‘scaling law’ approach, meaning they use massive amounts of data for training or generate vast numbers of questions to ensure broad coverage. While this seems logical, a recent research paper, “Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach” by Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, and Qing Li, argues that this isn’t always the most effective strategy. The full paper is worth reading for the complete details.
The authors introduce a novel framework called COMP-COMP, which stands for Comprehensiveness-Compactness. This framework proposes an iterative method for building benchmarks that balances two key principles: comprehensiveness and compactness. Comprehensiveness ensures that the benchmark covers the full semantic range of a domain, preventing models from ‘forgetting’ broader knowledge. Compactness, on the other hand, focuses on enhancing precision by making sure the data and questions are efficient and non-redundant.
How COMP-COMP Works
The COMP-COMP framework operates through a dynamic, iterative process. It continuously assesses and refines both the training data (corpus) and the question-answer (QA) sets. It does this by encoding the semantics of the data into a unified space and using a technique called Gaussian Kernel Density Estimation (KDE) to measure how well the domain is covered (comprehensiveness) and how focused the data is (compactness).
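To make the density-estimation idea concrete, here is a minimal sketch of how a Gaussian KDE over a shared embedding space could yield a coverage score and a redundancy signal. This is not the authors’ actual formulation: the function name, bandwidth, and density threshold are illustrative assumptions, and the embeddings are assumed to come from some sentence encoder.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def coverage_and_redundancy(domain_embs, corpus_embs, bandwidth=0.5, density_floor=1e-3):
    """Illustrative only: estimate how well a corpus covers a domain
    (comprehensiveness) and how concentrated its entries are (a proxy
    for redundancy, the flip side of compactness)."""
    # Fit a Gaussian KDE over the corpus embeddings in the shared semantic space.
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(corpus_embs)

    # Comprehensiveness: fraction of domain concepts that fall in regions
    # where the corpus assigns non-negligible density.
    domain_density = np.exp(kde.score_samples(domain_embs))
    comprehensiveness = float((domain_density > density_floor).mean())

    # Redundancy signal: corpus entries sitting in already-dense regions
    # add little new coverage and are candidates for removal.
    self_density = np.exp(kde.score_samples(corpus_embs))
    return comprehensiveness, self_density
```

In practice, the bandwidth and threshold would need to be tuned per domain; the point is simply that both principles can be read off the same density estimate.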
For building the corpus, the framework iteratively adds new data only if it fills a semantic gap and doesn’t introduce too much redundancy. This ensures the training data is both broad and efficient. For generating questions, COMP-COMP aims for diversity and representativeness. It identifies areas in the corpus that are underrepresented by existing questions and then generates new questions specifically for those areas. Importantly, it also incorporates user-interest-oriented questions, such as those from public forums, to ensure the benchmark reflects real-world queries.
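A rough sketch of the admit-or-reject loop for corpus construction might look like the following. Here a cosine-similarity cutoff stands in for the paper’s KDE-based redundancy check, and every name and threshold is hypothetical.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def grow_corpus(candidates, embeddings, max_similarity=0.85):
    """Hypothetical greedy pass: keep a candidate document only if it is not
    nearly identical, semantically, to anything already admitted."""
    kept_docs, kept_embs = [], []
    for doc, emb in zip(candidates, embeddings):
        # Admit the document if it adds new semantic material rather than
        # duplicating what the corpus already covers.
        if all(cosine(emb, kept) < max_similarity for kept in kept_embs):
            kept_docs.append(doc)
            kept_embs.append(emb)
    return kept_docs, kept_embs
```

In the actual framework, the same density estimate that guides corpus growth also flags under-covered regions, which is where new questions are targeted.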
XUBench: A Case Study in Academia
To demonstrate the effectiveness of their framework, the researchers applied COMP-COMP to create XUBench, a large-scale benchmark for the academic domain. XUBench includes nearly 25,000 questions of various types, including binary (yes/no), multiple-choice (MCQ), multi-answer (MAQ), and open-ended questions. It also integrates discussions from user forums, making it highly relevant to real academic queries.
The construction of XUBench showed impressive results. The comprehensiveness aspect expanded semantic coverage significantly, capturing 98% of academic concepts related to staff and courses. Simultaneously, the compactness principle helped eliminate 68% of redundant entries compared to unfiltered web crawls, proving that a smaller, more focused dataset can be more effective.
Key Experimental Findings
The paper also explores how different approaches to training LLMs perform on XUBench, specifically In-Context Learning (ICL) and Supervised Fine-tuning (SFT).
- RAG’s Impact: Retrieval-Augmented Generation (RAG) consistently improved performance across most question types for both ICL and SFT models. This highlights RAG’s ability to enhance precision and recall by providing relevant external knowledge.
- Few-Shot Learning Challenges: While few-shot learning can help with simple question formats, it often struggles with answer diversity and can even degrade performance for more complex questions, suggesting that models might overfit to limited examples.
- Fine-tuning Benefits: Supervised Fine-tuning (SFT) significantly improved recall and precision, especially for open-ended questions. However, the study also noted instances where SFT models experienced ‘catastrophic forgetting,’ losing some general capabilities while gaining domain-specific knowledge. This points to an ongoing challenge in balancing specialized and general abilities in LLMs.
- Efficiency Gains: An ablation study on the COMP-COMP framework’s parameters (t_d for questions and t_c for the corpus) revealed remarkable efficiency. The optimized benchmark achieved similar performance to conventional benchmarks while using only 1.7% of the questions and 46.4% of the corpus components. This demonstrates the framework’s ability to maintain evaluation quality with drastically reduced resources.
In conclusion, the COMP-COMP framework offers a principled and efficient way to construct domain-specific LLM benchmarks. By prioritizing both comprehensive coverage and compact, non-redundant data, it provides valuable insights for developing more effective and sustainable evaluation systems for specialized AI models across various fields.


