TLDR: A new research paper introduces KMMLU-REDUX and KMMLU-PRO, two Korean expert-level benchmarks for evaluating LLMs. KMMLU-REDUX refines an existing benchmark with technical qualification exam questions, removing errors and contamination. KMMLU-PRO is a new benchmark based on Korean National Professional Licensure exams, assessing LLMs’ practical knowledge in high-stakes professions like law and medicine. The study reveals LLMs struggle with region-specific legal and accounting knowledge, highlighting the need for locally adapted benchmarks and balanced competence across subjects.
Evaluating the capabilities of Large Language Models (LLMs) in real-world scenarios, especially in specialized professional fields, requires robust and reliable benchmarks. A new research paper introduces two significant Korean expert-level benchmarks: KMMLU-REDUX and KMMLU-PRO. These benchmarks aim to address critical issues found in existing evaluation methods, such as data contamination and a lack of focus on practical, industry-specific knowledge.
Addressing Benchmark Challenges
Previous benchmarks, like the widely used MMLU for general knowledge, have faced reliability concerns because their publicly available problems invite data contamination. Similar issues were identified in KMMLU, a benchmark for Korean expert-level knowledge. The original KMMLU dataset, compiled by crawling various exam websites, contained noisy samples, including questions that explicitly revealed their answers and questions citing non-existent references; it also suffered contamination between its training and test sets, as well as overlap with common web corpora like FineWeb2.
To overcome these limitations, the researchers behind this new paper focused on creating high-quality, contamination-free benchmarks. While human-authored benchmarks offer high quality, they are costly and difficult to update regularly. This new approach leverages official, real-world professional exams to ensure both quality and relevance.
Introducing KMMLU-REDUX: A Refined Technical Benchmark
KMMLU-REDUX is a meticulously refined version of the original KMMLU. It consists of 2,587 problems derived from Korean National Technical Qualification (KNTQ) exams. These exams are designed to assess practical technical competencies required in various industrial fields and typically demand a bachelor’s degree or at least nine years of professional experience, making them highly challenging.
The construction of KMMLU-REDUX involved a rigorous process of filtering out easier problems and extensive denoising. Researchers manually reviewed the dataset to eliminate errors, performed decontamination to prevent data leakage from pre-training corpora, and removed all duplicate questions. This ensures that KMMLU-REDUX provides a reliable and challenging evaluation of LLMs’ industrial knowledge across 14 diverse domains, including Safety Management, Mechanical Engineering, and Information & Communication.
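The paper does not spell out the exact decontamination mechanics, but the filtering described above can be illustrated with a common approach: flagging questions whose character n-grams overlap heavily with a reference corpus, and dropping normalized duplicates. The n-gram length and overlap threshold below are illustrative defaults, not the authors' settings.

```python
# Hedged sketch of n-gram decontamination and deduplication.
# The n=13 gram size and 0.5 overlap threshold are illustrative
# assumptions, not the pipeline actually used for KMMLU-REDUX.
from typing import Iterable, Set


def char_ngrams(text: str, n: int = 13) -> Set[str]:
    """Character n-grams of a whitespace/case-normalized string."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def is_contaminated(question: str, corpus_docs: Iterable[str],
                    n: int = 13, threshold: float = 0.5) -> bool:
    """Flag a question if a large fraction of its n-grams appear in any corpus doc."""
    q_grams = char_ngrams(question, n)
    if not q_grams:
        return False
    for doc in corpus_docs:
        overlap = len(q_grams & char_ngrams(doc, n)) / len(q_grams)
        if overlap >= threshold:
            return True
    return False


def dedupe(questions: list) -> list:
    """Drop exact duplicates after whitespace/case normalization, keeping first seen."""
    seen, kept = set(), []
    for q in questions:
        key = " ".join(q.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(q)
    return kept
```

In practice such filters run against large pre-training corpora (the paper names FineWeb2 as one contamination source), so the per-document loop would be replaced by an inverted index or Bloom filter over corpus n-grams.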
Unveiling KMMLU-PRO: Assessing Professional Licensure
KMMLU-PRO is a newly developed, highly challenging benchmark comprising 2,822 problems from Korean National Professional Licensure (KNPL) exams. These are high-stakes, annually administered exams for highly specialized professions in Korea, such as lawyers, accountants, and physicians. The benchmark includes 14 professions from diverse domains, reflecting the advanced knowledge, critical reasoning, and ethical judgment required in these fields.
Unlike previous methods, data for KMMLU-PRO was collected directly from official government sources, ensuring accuracy and avoiding potential errors from online text. Human annotators then meticulously reviewed the parsed questions, converting images and tables into text where possible. A key feature of KMMLU-PRO is its commitment to annual updates with the latest exam questions, maintaining long-term reliability and preventing contamination. The evaluation process for KMMLU-PRO simulates real-world assessment standards by incorporating official exam pass criteria, aligning model performance with human professional standards.
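Evaluating against pass criteria rather than raw accuracy can be sketched as follows. Many Korean licensure exams require both a per-subject minimum score and an overall average; the specific floors below (40/100 per subject, 60/100 average) are illustrative placeholders, not the official criteria used in the paper.

```python
# Hedged sketch: pass/fail scoring under exam-style criteria.
# The 40-point subject floor and 60-point average cutoff are
# illustrative assumptions, not KMMLU-PRO's official thresholds.

def passes_exam(subject_scores: dict,
                subject_floor: float = 40.0,
                average_cutoff: float = 60.0) -> bool:
    """Apply exam-style pass criteria to per-subject scores on a 0-100 scale."""
    if any(score < subject_floor for score in subject_scores.values()):
        return False  # one weak subject fails the entire exam
    average = sum(subject_scores.values()) / len(subject_scores)
    return average >= average_cutoff
```

This scoring scheme captures why high average accuracy alone is not enough: a model scoring {"civil law": 35, "tax law": 95, "ethics": 95} averages 75 yet still fails, because one subject falls below the floor.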
Key Findings from LLM Evaluations
Extensive evaluations were conducted on various LLMs using both KMMLU-REDUX and KMMLU-PRO. The results showed that models with reasoning capabilities generally performed better. On KMMLU-REDUX, LLMs demonstrated robust performance in engineering domains but struggled significantly in specialized fields like Mining & Resources and Architecture.
For KMMLU-PRO, the assessment focused on whether models could pass professional licensure exams. While many state-of-the-art models performed strongly in the medicine domain, meeting passing criteria for most medical licenses, they largely failed to meet the passing criteria for law-related and tax & accounting licenses. This highlights a crucial point: even models with high overall accuracy may lack the balanced competence across subjects required for real-world certification exams, especially in fields governed by specific national laws.
The Importance of Local Context
The research underscores the critical importance of locally adapted benchmarks. A comparison with translated datasets like MMMLU (Korean) revealed significant performance gaps in categories such as law, where in-depth knowledge of specific Korean laws is essential. In contrast, the gap was smaller in medicine, where domain knowledge is more globally consistent. This finding emphasizes that benchmarks reflecting authentic professional knowledge specific to a region are vital for accurately assessing LLM capabilities for practical applications.
The paper also explored the impact of reasoning budget and prompt language. While increasing reasoning efforts generally improved performance, this wasn’t uniform across all licenses, suggesting that more reasoning doesn’t always guarantee a boost in highly specialized areas. Additionally, some LLMs, particularly the Llama-4 series, showed a notable drop in performance when prompted in Korean compared to English, raising concerns about consistency in multilingual settings.
This work provides a foundational step towards more rigorous evaluation and continued advancement of LLMs' real-world competence in Korean industries. For more details, refer to the full research paper.