TLDR: HanjaBridge is a novel pre-training framework that significantly improves Korean LLM performance by resolving semantic ambiguity caused by homophonous Sino-Korean words. It achieves this by presenting all possible Hanja (Chinese character) candidates during training, forcing the model to learn contextual disambiguation. Coupled with knowledge distillation to prevent catastrophic forgetting, HanjaBridge boosts Korean understanding (21% relative improvement on KoBALT) while preserving English proficiency, and importantly, its benefits persist without needing Hanja augmentation during inference, ensuring practical efficiency.
Large language models (LLMs) have made incredible strides in understanding and generating human language, but they often struggle with "low-resource" languages like Korean. A significant reason is Korean's linguistic structure, particularly its many Sino-Korean words (words borrowed from Chinese): distinct words with different meanings are often pronounced identically and become indistinguishable when written in the phonetic Hangul script. This creates a high degree of semantic ambiguity that makes it hard for LLMs to interpret context accurately.
Imagine a word like “의사” in Korean. In Hangul, it’s always written the same way, but it can mean “doctor” (醫師), “intention” (意思), “patriot” (義士), or “deliberation” (議事) depending on the context. For an LLM, distinguishing between these meanings without additional cues is a major hurdle. This is where a new approach called HanjaBridge comes in.
Introducing HanjaBridge: Bridging the Semantic Gap
Proposed by Seungho Choi from Wisenut, HanjaBridge is a novel technique designed to resolve this semantic ambiguity by injecting explicit meaning cues into Korean LLMs during their training. Instead of simply trying to guess the meaning from context, HanjaBridge leverages Hanja, the traditional Chinese characters that many Korean words originate from. While modern Korean primarily uses Hangul, Hanja still carries rich semantic and etymological information.
The core idea is clever: when an ambiguous Hangul word appears, HanjaBridge doesn’t just pick one Hanja. Instead, it presents the model with all possible Hanja candidates for that word. For example, if the word “가격” (price/hit) appears, the model sees both 價格 (‘price’) and 加擊 (‘hit’) alongside it. This forces the model to learn contextual disambiguation – it has to figure out which Hanja candidate makes the most sense in the given sentence. This process is like giving the model a hint, guiding its attention to the correct meaning.
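The paper's exact preprocessing pipeline isn't reproduced here, but the augmentation idea can be sketched in a few lines of Python. The bracketed `word(候補/候補)` format and the candidate dictionary below are illustrative assumptions, not the paper's actual code:

```python
# Illustrative Hangul-to-Hanja candidate dictionary (assumed, not from the paper).
HANJA_CANDIDATES = {
    "가격": ["價格", "加擊"],                 # 'price' vs. 'hit'
    "의사": ["醫師", "意思", "義士", "議事"],  # doctor / intention / patriot / deliberation
}

def augment_with_hanja(sentence: str, k: int = 8) -> str:
    """Append up to k known Hanja candidates after each ambiguous word."""
    out = []
    for word in sentence.split():
        candidates = HANJA_CANDIDATES.get(word, [])[:k]
        if candidates:
            # The model must learn which candidate fits this context.
            out.append(f"{word}({'/'.join(candidates)})")
        else:
            out.append(word)
    return " ".join(out)

print(augment_with_hanja("새로운 가격 정책"))
# -> 새로운 가격(價格/加擊) 정책
```

The cap of `k = 8` candidates mirrors the ablation result discussed below, where too many candidates were found to introduce noise.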
To ensure this new information is effectively integrated, HanjaBridge also adjusts the model’s internal attention mechanism. It allows the Korean word to “look at” all its Hanja candidates, but it prevents the Hanja candidates from looking at each other. This ensures each Hanja retains its distinct meaning while still informing the Korean word’s representation.
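As a minimal sketch, this asymmetric rule can be expressed as a boolean attention mask: start from full attention, then block the Hanja candidate positions from attending to one another. This illustrates the masking rule as described, not the paper's implementation:

```python
import torch

def hanja_attention_mask(seq_len: int, hanja_positions: list[int]) -> torch.Tensor:
    """Boolean mask (True = attention allowed). All positions, including the
    Hangul word, can attend to the Hanja candidates, but the candidates are
    blocked from attending to each other so each retains its own meaning."""
    mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
    for i in hanja_positions:
        for j in hanja_positions:
            if i != j:
                mask[i, j] = False  # candidate-to-candidate attention blocked
    return mask

# Tokens 3 and 4 are Hanja candidates for the Hangul word at position 2.
m = hanja_attention_mask(seq_len=6, hanja_positions=[3, 4])
print(m[2, 3].item(), m[3, 4].item())  # True (word -> candidate), False (candidate -> candidate)
```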
Preventing Forgetting and Boosting Performance
A common problem when continually training LLMs on new data or languages is “catastrophic forgetting,” where the model loses its proficiency in previously learned languages. HanjaBridge addresses this by incorporating “token-level knowledge distillation.” In this process, an original, pre-trained LLM acts as a “teacher,” and the HanjaBridge-trained model acts as a “student.” The student learns to mimic the teacher’s internal representations, especially for other languages like English, thereby preserving its existing multilingual capabilities while specializing in Korean.
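Token-level knowledge distillation is typically implemented as a KL-divergence term between the teacher's and student's next-token distributions, computed per token. The sketch below shows this standard formulation; the temperature and loss weighting are assumed hyperparameters, not values from the paper:

```python
import torch
import torch.nn.functional as F

def token_level_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        temperature: float = 1.0) -> torch.Tensor:
    """Average per-token KL divergence from the frozen teacher's next-token
    distribution to the student's. Logit shapes: (batch, seq_len, vocab)."""
    vocab = student_logits.size(-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1).view(-1, vocab)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# During continual pre-training this term would typically be added to the
# usual language-modeling loss, e.g. loss = lm_loss + alpha * kd_loss.
```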
The researchers also expanded the tokenizer’s vocabulary to include these new Hanja characters. This helps prevent “semantic fragmentation,” a problem where Korean words are broken down into many smaller, less meaningful sub-words by standard tokenizers, making it harder for the model to grasp their full meaning.
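With the Hugging Face `transformers` API, this kind of vocabulary expansion looks roughly as follows; the Hanja list is a placeholder, and any characters Qwen's tokenizer already covers would simply not be re-added:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")

# Illustrative Hanja additions, not the paper's actual vocabulary list.
new_hanja = ["醫師", "意思", "義士", "議事", "價格", "加擊"]
num_added = tokenizer.add_tokens(new_hanja)

# Grow the embedding matrix so the new token ids get trainable vectors.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} Hanja tokens")
```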
Impressive Results and Practical Benefits
HanjaBridge was tested on the Qwen 2.5-3B model using a high-quality Korean text dataset. The results were significant:
- On the challenging KoBALT-Hard benchmark, which assesses deep linguistic understanding in Korean, HanjaBridge achieved a remarkable 21% relative improvement.
- It also outperformed other models on KoBEST-General, a suite for general Korean natural language understanding.
- Crucially, HanjaBridge successfully prevented catastrophic forgetting. While a standard continual pre-training baseline showed a significant decline in English performance, the HanjaBridge model maintained its English proficiency, even slightly exceeding some baselines. This demonstrates strong positive cross-lingual transfer: the semantic alignment between Korean and Chinese via shared Hanja benefits other languages too.
- The study also found an optimal number of Hanja candidates (k=8) for best performance, suggesting that too many candidates can introduce noise.
One of the most practical contributions of HanjaBridge is its efficiency during inference. The performance gains achieved through Hanja augmentation persist even when the Hanja characters are omitted at inference time. This means there’s no additional run-time cost or increase in token length when using the model in real-world applications. The model learns the disambiguation during training and can apply that knowledge without needing the Hanja hints later.
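A quick way to see the token-length point is to tokenize the same prompt with and without the training-style hints (using the base model's tokenizer; the augmented format is the illustrative one from the earlier sketch):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
plain = "새로운 가격 정책"                 # what the deployed model actually receives
augmented = "새로운 가격(價格/加擊) 정책"  # training-time form only
print(len(tok(plain)["input_ids"]), len(tok(augmented)["input_ids"]))
```

Since only the plain form is used at inference, prompts are exactly as long as they would be for the base model.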
Looking Ahead
HanjaBridge represents a significant step forward in enhancing LLMs for low-resource languages like Korean. By intelligently leveraging the historical and semantic connection between Korean and Chinese through Hanja, it provides a robust solution to the problem of semantic ambiguity. This approach not only boosts Korean language understanding but also maintains multilingual capabilities and offers practical efficiency for deployment. The full research paper can be found here.