
New Benchmarks Evaluate LLM Professional Knowledge in Korea

TLDR: A new research paper introduces KMMLU-REDUX and KMMLU-PRO, two Korean expert-level benchmarks for evaluating LLMs. KMMLU-REDUX refines an existing benchmark with technical qualification exam questions, removing errors and contamination. KMMLU-PRO is a new benchmark based on Korean National Professional Licensure exams, assessing LLMs’ practical knowledge in high-stakes professions like law and medicine. The study reveals LLMs struggle with region-specific legal and accounting knowledge, highlighting the need for locally adapted benchmarks and balanced competence across subjects.

Evaluating the capabilities of Large Language Models (LLMs) in real-world scenarios, especially in specialized professional fields, requires robust and reliable benchmarks. A new research paper introduces two significant Korean expert-level benchmarks: KMMLU-REDUX and KMMLU-PRO. These benchmarks aim to address critical issues found in existing evaluation methods, such as data contamination and a lack of focus on practical, industry-specific knowledge.

Addressing Benchmark Challenges

Previous benchmarks, like the widely used MMLU for general knowledge, have faced concerns regarding reliability due to publicly available problems leading to potential data contamination. Similar issues were identified in KMMLU, a benchmark for Korean expert-level knowledge. The original KMMLU dataset, compiled by crawling various exam websites, contained noisy samples, including questions that explicitly revealed answers, non-existent references, and even contamination between training and test sets, as well as with common web corpora like FineWeb2.

To overcome these limitations, the researchers behind this new paper focused on creating high-quality, contamination-free benchmarks. While human-authored benchmarks offer high quality, they are costly and difficult to update regularly. This new approach leverages official, real-world professional exams to ensure both quality and relevance.

Introducing KMMLU-REDUX: A Refined Technical Benchmark

KMMLU-REDUX is a meticulously refined version of the original KMMLU. It consists of 2,587 problems derived from Korean National Technical Qualification (KNTQ) exams. These exams are designed to assess practical technical competencies required in various industrial fields and typically demand a bachelor’s degree or at least nine years of professional experience, making them highly challenging.

The construction of KMMLU-REDUX involved a rigorous process of filtering out easier problems and extensive denoising. Researchers manually reviewed the dataset to eliminate errors, performed decontamination to prevent data leakage from pre-training corpora, and removed all duplicate questions. This ensures that KMMLU-REDUX provides a reliable and challenging evaluation of LLMs’ industrial knowledge across 14 diverse domains, including Safety Management, Mechanical Engineering, and Information & Communication.
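The denoising steps described above can be sketched in code. The paper's exact pipeline is not detailed in this article, so the following is an illustrative sketch assuming a common approach: exact-duplicate removal plus n-gram-overlap contamination checks against a pre-training corpus (the function names, `n`, and `threshold` values are hypothetical choices, not the researchers' settings).

```python
# Hypothetical sketch of benchmark denoising: exact-duplicate removal and
# n-gram-overlap decontamination against pre-training corpus documents.

def ngrams(text, n=8):
    """Return the set of word-level n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question, corpus_docs, n=8, threshold=0.5):
    """Flag a question whose n-grams substantially overlap any corpus doc."""
    q = ngrams(question, n)
    if not q:  # too short to judge reliably
        return False
    for doc in corpus_docs:
        overlap = len(q & ngrams(doc, n)) / len(q)
        if overlap >= threshold:
            return True
    return False

def dedupe(questions):
    """Drop exact duplicates (whitespace/case-insensitive), preserving order."""
    seen, kept = set(), []
    for q in questions:
        key = " ".join(q.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(q)
    return kept
```

In practice, production decontamination pipelines also normalize punctuation and hash the n-grams for scale, but the overlap-threshold idea is the same.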

Unveiling KMMLU-PRO: Assessing Professional Licensure

KMMLU-PRO is a newly developed, highly challenging benchmark comprising 2,822 problems from Korean National Professional Licensure (KNPL) exams. These are high-stakes, annually administered exams for highly specialized professions in Korea, such as lawyers, accountants, and physicians. The benchmark includes 14 professions from diverse domains, reflecting the advanced knowledge, critical reasoning, and ethical judgment required in these fields.

Unlike previous methods, data for KMMLU-PRO was collected directly from official government sources, ensuring accuracy and avoiding potential errors from online text. Human annotators then meticulously reviewed the parsed questions, converting images and tables into text where possible. A key feature of KMMLU-PRO is its commitment to annual updates with the latest exam questions, maintaining long-term reliability and preventing contamination. The evaluation process for KMMLU-PRO simulates real-world assessment standards by incorporating official exam pass criteria, aligning model performance with human professional standards.
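Scoring against official pass criteria can be sketched as follows. The thresholds below follow a rule common to many Korean licensure exams (no subject below 40% and a 60% overall average); the actual criteria vary by license and the specific numbers here are assumptions for illustration, not figures from the paper.

```python
# Illustrative pass/fail check against exam-style criteria: every subject
# must clear a floor, and the average must clear a higher bar.

def passes_exam(subject_scores, min_subject=0.40, min_average=0.60):
    """subject_scores: dict mapping subject name -> accuracy in [0, 1]."""
    scores = list(subject_scores.values())
    if not scores:
        return False
    no_subject_fail = all(s >= min_subject for s in scores)
    meets_average = sum(scores) / len(scores) >= min_average
    return no_subject_fail and meets_average
```

Under such criteria, a model scoring 90% in one subject but 30% in another fails despite a high average, which is exactly why balanced competence across subjects matters for certification-style evaluation.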

Key Findings from LLM Evaluations

Extensive evaluations were conducted on various LLMs using both KMMLU-REDUX and KMMLU-PRO. The results showed that models with reasoning capabilities generally performed better. On KMMLU-REDUX, LLMs demonstrated robust performance in engineering domains but struggled significantly in specialized fields like Mining & Resources and Architecture.

For KMMLU-PRO, the assessment focused on whether models could pass professional licensure exams. While many state-of-the-art models performed strongly in the medicine domain, meeting passing criteria for most medical licenses, they largely failed to meet the passing criteria for law-related and tax & accounting licenses. This highlights a crucial point: even models with high overall accuracy may not possess the balanced competence across subjects required for real-world certification exams, especially in fields governed by specific national laws.


The Importance of Local Context

The research underscores the critical importance of locally adapted benchmarks. A comparison with translated datasets like MMMLU (Korean) revealed significant performance gaps in categories such as law, where in-depth knowledge of specific Korean laws is essential. In contrast, the gap was smaller in medicine, where domain knowledge is more globally consistent. This finding emphasizes that benchmarks reflecting authentic professional knowledge specific to a region are vital for accurately assessing LLM capabilities for practical applications.

The paper also explored the impact of reasoning budget and prompt language. While increasing reasoning efforts generally improved performance, this wasn’t uniform across all licenses, suggesting that more reasoning doesn’t always guarantee a boost in highly specialized areas. Additionally, some LLMs, particularly the Llama-4 series, showed a notable drop in performance when prompted in Korean compared to English, raising concerns about consistency in multilingual settings.

This work provides a foundational step towards more rigorous evaluation and continued advancement of LLMs’ real-world competence in Korean industries. For more details, refer to the full research paper.

Nikhil Patel
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
