TLDR: KITE is a new benchmark designed to assess how well Large Language Models (LLMs) follow instructions in Korean. It addresses the English-centric bias of current evaluations by providing both general and Korean-specific tasks that account for the language’s unique syntax, honorifics, and dual numbering systems. Experiments show that while advanced models like GPT-4o perform well, Korean-specific LLMs still need significant improvement in instruction following, highlighting the benchmark’s role in driving culturally and linguistically inclusive AI development.
Large Language Models (LLMs) are becoming increasingly vital for a wide range of applications, from powering conversational AI to assisting with complex reasoning tasks. Their ability to understand and follow instructions is a cornerstone of their effectiveness. However, evaluations in the field have focused predominantly on English, often overlooking the rich linguistic and cultural diversity of other languages.
This oversight is particularly pronounced for languages like Korean, which possesses a unique syntax, intricate morphological features, a sophisticated honorific system, and dual numbering systems. These characteristics present distinct challenges that English-centric benchmarks simply cannot capture, leading to an incomplete assessment of LLMs’ capabilities in a multilingual world.
Introducing KITE: A Dedicated Korean Benchmark
To bridge this critical gap, researchers have introduced the Korean Instruction-following Task Evaluation (KITE). This comprehensive benchmark is specifically designed to evaluate the open-ended instruction-following abilities of LLMs in Korean. Unlike previous Korean benchmarks that primarily focused on factual knowledge or multiple-choice questions, KITE directly targets diverse, open-ended instruction tasks, providing a more nuanced understanding of a model’s true proficiency.
The KITE benchmark is divided into two main versions: KITE General and KITE Korean. KITE General comprises 427 instructions derived from existing English instruction-following datasets, translated and meticulously filtered to ensure relevance within the Korean context. This allows for consistent evaluation metrics and direct performance comparisons across languages.
KITE Korean, on the other hand, consists of 100 instructions created from scratch. These instructions are tailored to address the complex linguistic subtleties and cultural specificities unique to Korean. The development of KITE Korean involved a detailed analysis of Korean grammatical structures, linguistic features, and cultural practices. It includes specialized instruction categories such as the following (with illustrative prompts sketched after the list):
- Acrostic Poem: Evaluating the model’s ability to generate structured poetry where each line starts with a specific letter from a given word.
- Post-position Drop: Testing the model’s understanding of Korean grammar by requiring sentences to be formed without postpositions, while preserving meaning.
- Honorifics: Assessing the model’s proficiency in switching between different levels of politeness (honorific and informal speech).
- Native/Sino Korean Number System: Evaluating the model’s capability to understand and interchangeably use the two distinct number systems in Korean.
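To make these categories concrete, the sketch below pairs each one with a hypothetical prompt of the kind a model might receive. The Korean strings are illustrative examples written for this post (with English glosses in comments), not instructions drawn from the KITE dataset:

```python
# Hypothetical prompts illustrating the four KITE Korean categories
# described above -- these are not the actual benchmark instructions.
KOREAN_TASK_EXAMPLES = {
    # "Write an acrostic poem for 'banana'; each line must begin
    # with the corresponding syllable (ba-na-na)."
    "acrostic_poem": "'바나나'로 삼행시를 지어 주세요. 각 행은 해당 글자로 시작해야 합니다.",
    # "Rewrite the following sentence without postpositions while
    # preserving its meaning: 'I go to school.'"
    "postposition_drop": "다음 문장을 조사 없이 같은 의미로 다시 써 주세요: '나는 학교에 간다.'",
    # "Convert the following informal sentence to honorific speech:
    # 'Did you eat?'"
    "honorifics": "다음 문장을 존댓말로 바꿔 주세요: '밥 먹었어?'",
    # "Read '3:30' out loud in Korean" -- the correct reading mixes the
    # native system for hours (세 시) with Sino-Korean for minutes (삼십 분).
    "number_system": "'3시 30분'을 한글로 읽어 주세요.",
}

for category, prompt in KOREAN_TASK_EXAMPLES.items():
    print(f"[{category}] {prompt}")
```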
The KITE evaluation pipeline combines automated metrics with human assessments, providing deep insights into the strengths and weaknesses of various models. The strong correlation between human and automated evaluations confirms KITE’s reliability in assessing instruction-following capabilities.
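The paper validates its automated judge against human raters; the snippet below is a minimal sketch of how such an agreement check can be computed. The scores are made-up placeholders, and scipy is assumed to be available:

```python
# Minimal sketch: checking agreement between human ratings and an
# automated metric. All scores below are made-up placeholders.
from scipy.stats import pearsonr, spearmanr

human_scores = [4.5, 3.0, 5.0, 2.5, 4.0, 3.5]  # hypothetical human ratings per response
auto_scores = [4.2, 3.1, 4.8, 2.0, 4.1, 3.7]   # hypothetical automated scores

r, _ = pearsonr(human_scores, auto_scores)     # linear agreement
rho, _ = spearmanr(human_scores, auto_scores)  # rank agreement
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
# High correlations suggest the automated metric can stand in for
# human judgment at benchmark scale.
```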
Key Findings from the Evaluation
Experiments were conducted on a selection of generic and Korean-specific LLMs across various ‘shot’ settings (zero-shot, one-shot, three-shot, five-shot). Key observations include:
- Consistent Performance: Models like GPT-4o demonstrated high and consistent performance across both KITE General and KITE Korean benchmarks, indicating strong generalization capabilities.
- Room for Improvement: Despite being trained specifically for Korean, models such as SOLAR 1 Mini Chat, HyperCLOVA X 003, and EEVE v1.0 10.8b Instruct generally lagged behind advanced generic models like GPT-4o in Korean proficiency, highlighting a significant need for further research and development in language-specific instruction following.
- Impact of Shot Settings: Surprisingly, the study found that performance on instruction-following tasks did not consistently improve as more ‘shots’ (in-context examples) were provided; see the prompt-construction sketch after this list. This suggests that the varied nature of instructions among examples might play a role, and robust models like GPT-4o remained stable despite these variations.
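For readers unfamiliar with shot settings, the sketch below shows one common way a k-shot prompt is assembled, with demonstrations simply prepended before the target instruction. The helper and the Korean examples are hypothetical, not taken from the KITE codebase:

```python
# Minimal sketch of k-shot prompt construction: k demonstration
# (instruction, response) pairs are prepended before the target
# instruction. Names and examples are illustrative only.
def build_k_shot_prompt(demos, target_instruction, k):
    parts = []
    for instruction, response in demos[:k]:
        # "지시" = instruction, "응답" = response
        parts.append(f"지시: {instruction}\n응답: {response}")
    parts.append(f"지시: {target_instruction}\n응답:")
    return "\n\n".join(parts)

demos = [
    ("1부터 5까지 세어 주세요.", "하나, 둘, 셋, 넷, 다섯"),  # count 1-5 in native Korean numbers
    ("'봄'으로 한 줄 시를 지어 주세요.", "봄바람이 분다."),   # one-line poem about 'spring'
]

# k=0 reproduces the zero-shot setting; k=2 prepends both demonstrations.
print(build_k_shot_prompt(demos, "다음 문장을 존댓말로 바꿔 주세요: '고마워.'", k=2))
```

Because the prepended demonstrations can differ in kind from the target instruction (a counting task before an honorifics task, as here), adding more of them does not necessarily help, which is consistent with the study’s observation.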
The findings underscore that achieving proficiency in instruction following requires specialized tuning and targeted refinement, distinct from capabilities in reasoning or commonsense knowledge. The researchers emphasize that dedicated benchmarks like KITE are crucial for capturing the full spectrum of LLM capabilities in multilingual and cross-cultural contexts.
By publicly releasing the KITE dataset and code, the researchers aim to foster further research on culturally and linguistically inclusive LLM development, inspiring similar efforts for other underrepresented languages. This work represents a vital step towards ensuring that LLMs can be effectively used in real-world applications requiring instruction following across diverse linguistic environments. You can find the full research paper here: KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models.