TLDR: This research introduces multi-language models for on-the-fly syntax highlighting, addressing the limitations of current single-language models. By leveraging a novel Token Normalization technique and few-shot learning, the new approach allows a single model to accurately highlight up to six mainstream programming languages. This significantly reduces deployment complexity and training costs, enabling efficient and scalable real-time code highlighting even with limited training data for new languages.
Syntax highlighting is a fundamental feature in today’s software development tools, making code easier to read and understand by applying distinct colors to different parts of the code. Imagine looking at a block of code where keywords, variables, and comments all blend into one color – it would be incredibly difficult to follow. This is why “on-the-fly” syntax highlighting, where code is instantly colored as it’s typed or viewed online, is so crucial for platforms like code review tools, online editors, and code snippet displays.
However, achieving accurate and real-time highlighting, especially in web-based development environments, has been a significant challenge. Traditional methods often rely on complex sets of rules or grammar parsers for each language, which are slow, resource-intensive, and struggle with incomplete or incorrect code – a common scenario in real-time editing. State-of-the-art solutions have turned to Convolutional Neural Networks (CNNs) to learn these highlighting patterns, effectively “deep abstracting” the slow, brute-force methods into fast, statistical models. While successful, these models have a major drawback: each model is designed for a single programming language. This means that for every language a development environment supports, a separate model needs to be trained, maintained, and deployed, leading to increased complexity and operational costs.
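To make that setup concrete, a per-language highlighter of this kind boils down to a small sequence-labeling network: it embeds the lexer's token IDs and predicts a highlighting class for every token. Below is a minimal sketch in PyTorch (the paper does not prescribe a framework, and the class name, dimensions, and layer sizes here are illustrative):

```python
import torch
import torch.nn as nn

class HighlightingCNN(nn.Module):
    """Sketch of a per-language token-classification CNN."""
    def __init__(self, vocab_size: int, num_classes: int, embed_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # A 1D convolution over the token sequence captures local syntactic context.
        self.conv = nn.Conv1d(embed_dim, 128, kernel_size=5, padding=2)
        self.out = nn.Linear(128, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) of language-specific lexer token IDs
        x = self.embed(token_ids).transpose(1, 2)     # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x)).transpose(1, 2)  # (batch, seq_len, 128)
        return self.out(x)                            # per-token highlighting logits

# The single-language approach needs one such model per language,
# each trained on its own lexer's vocabulary:
# java_model = HighlightingCNN(java_vocab_size, num_highlight_classes)
# python_model = HighlightingCNN(python_vocab_size, num_highlight_classes)
```

Because every model is tied to one lexer's token IDs, supporting six languages means training, versioning, and serving six separate networks.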
Introducing a Unified Solution
A new research paper titled “Multi Language Models for On-the-Fly Syntax Highlighting” by Marco Edoardo Palma, Pooja Rani, and Harald C. Gall introduces an approach that overcomes these limitations. The authors propose a unified model capable of accurately highlighting six mainstream programming languages (Java, Kotlin, Python, C++, C#, and JavaScript) within a single instance. This cuts deployment complexity by a factor of six and even improves performance on languages the model has not explicitly been trained on.
Read the full research paper here.
Key Innovations: Token Normalization and Few-Shot Learning
The paper highlights two core innovations that make this multi-language capability possible. First is a novel technique called Token Normalization (TN). Current models rely on language-specific “token IDs” – unique numerical values assigned to different parts of the code by each language’s parser. This means that even if two languages have a similar concept (like a plus sign or a variable name), their token IDs would be different, preventing a single model from recognizing the pattern across languages. Token Normalization addresses this by mapping equivalent lexical elements (like a ‘+’ operator or a general identifier) to a fixed, universal token type. This ensures that the model receives consistent input for similar constructs across different languages, greatly enhancing its ability to generalize.
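To illustrate the idea, Token Normalization can be pictured as a per-language lookup into one shared vocabulary. The category names and mapping tables below are invented for this sketch; the paper defines its own taxonomy of normalized token types:

```python
# Shared vocabulary of normalized token types (illustrative, not the paper's exact set).
UNIVERSAL_TOKENS = {"IDENTIFIER": 0, "KEYWORD": 1, "OPERATOR": 2,
                    "STRING_LITERAL": 3, "NUMBER_LITERAL": 4, "COMMENT": 5}

# Per-language tables mapping each lexer's native token names to universal types.
JAVA_MAP = {"Identifier": "IDENTIFIER", "ADD": "OPERATOR", "DecimalLiteral": "NUMBER_LITERAL"}
PYTHON_MAP = {"NAME": "IDENTIFIER", "PLUS": "OPERATOR", "NUMBER": "NUMBER_LITERAL"}

def normalize(native_tokens: list[str], lang_map: dict[str, str]) -> list[int]:
    """Map a language-specific token stream to universal token IDs."""
    return [UNIVERSAL_TOKENS[lang_map[tok]] for tok in native_tokens]

# The expression 'x + 1', lexed by two different parsers,
# now produces the same input sequence for the model:
assert normalize(["Identifier", "ADD", "DecimalLiteral"], JAVA_MAP) == \
       normalize(["NAME", "PLUS", "NUMBER"], PYTHON_MAP)
```

With inputs normalized this way, a single embedding table serves all languages, which is what lets one model generalize across them.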
The second innovation is the exploration of few-shot learning. Traditionally, training these syntax highlighting models requires massive datasets – around 13,000 samples per language, generated by slow brute-force highlighters. This process is time-consuming and resource-intensive. Few-shot learning aims to drastically reduce this cost by allowing the model to learn new languages from a very small number of manually generated “oracle” samples, sometimes as few as 10. The proposed Token Normalization step further boosts the model’s accuracy even with these limited samples, making the training process far more efficient and scalable.
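The training loop itself does not need to change for few-shot learning; what shrinks is the dataset. Here is a rough sketch of adapting a model to a new language from a handful of oracle samples (again in PyTorch; the function, sample format, and hyperparameters are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

def few_shot_adapt(model: nn.Module,
                   oracle_samples: list[tuple[torch.Tensor, torch.Tensor]],
                   epochs: int = 50, lr: float = 1e-3) -> None:
    """Fine-tune on a handful of manually highlighted 'oracle' samples."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for token_ids, labels in oracle_samples:    # as few as ~10 samples
            logits = model(token_ids.unsqueeze(0))  # (1, seq_len, num_classes)
            loss = loss_fn(logits.squeeze(0), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Because Token Normalization gives every language the same input vocabulary, a model adapted this way starts from patterns already learned on other languages rather than from scratch, which is why so few samples can suffice.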
Impact and Future Outlook
The research demonstrates that these multi-language models perform on par with their single-language counterparts, achieving near-perfect accuracy for both valid and incomplete code snippets. This means developers can expect the same high-quality highlighting with the added benefit of reduced system overhead. The ability to train models with significantly less data also opens doors for quicker adaptation to new or evolving programming languages.
This work paves the way for more efficient, scalable, and cost-effective syntax highlighting across a wide range of programming languages in online development tools. It transforms the landscape from maintaining a multitude of specialized models to deploying a single, adaptable solution, ultimately enhancing developer productivity and the user experience in multi-language coding environments.