TLDR: F2STRANS is a two-stage framework for improving code translation by Large Language Models (LLMs). A first ‘functional learning’ stage ensures the correctness of translated code by fine-tuning on high-quality, functionally consistent code pairs; a second ‘style learning’ stage then improves readability using both positive and negative stylistic examples. With this approach, smaller LLMs like Qwen1.5B outperform larger models such as GPT-4 across diverse code translation scenarios, addressing critical challenges in real-world software development.
Large Language Models (LLMs) have made significant progress in translating code from one programming language to another. This task is crucial for updating applications or migrating software. However, a major hurdle remains: ensuring the translated code is not only functionally correct but also easy to read and maintain. Poorly structured or inconsistent code can be a significant burden for developers, often taking more time to understand than to write from scratch.
Addressing these challenges, researchers have introduced a new approach called F2STRANS. This method is designed to progressively enhance LLMs’ performance in code translation by focusing on two key aspects: functional correctness and code readability. The framework operates in two distinct stages.
Functional Learning: Ensuring Correctness
The first stage, functional learning, optimizes the functional accuracy of the translated code. It relies on high-quality pairs of source and target code, carefully selected from online programming platforms so that both snippets in a pair produce identical outputs for the same inputs. This process involves ‘relevance-driven code pair selection’ to find similar solutions and ‘differential testing’ to verify that both programs in a pair behave identically. By fine-tuning LLMs on this meticulously curated data, F2STRANS ensures that the translated code retains its original functionality.
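To make the differential-testing idea concrete, here is a minimal Python sketch: two candidate programs are run on the same test inputs, and a pair is kept only if their outputs agree everywhere. The command names, file paths, and helper functions below are illustrative assumptions, not the paper's actual pipeline.

```python
import subprocess

def run(cmd: list[str], test_input: str) -> str:
    """Run a program with the given stdin and capture its stdout."""
    result = subprocess.run(
        cmd, input=test_input, capture_output=True, text=True, timeout=10
    )
    return result.stdout.strip()

def differential_test(source_cmd: list[str], target_cmd: list[str],
                      test_inputs: list[str]) -> bool:
    """Accept a code pair only if both programs agree on every test input."""
    for test_input in test_inputs:
        if run(source_cmd, test_input) != run(target_cmd, test_input):
            return False  # behavioral divergence: discard this pair
    return True

# Example: compare a Python solution against its compiled C++ counterpart
# (commands and inputs are illustrative)
inputs = ["3\n1 2 3\n", "5\n5 4 3 2 1\n"]
if differential_test(["python3", "solution.py"], ["./solution_cpp"], inputs):
    print("Pair kept: functionally consistent on all tests")
```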
Style Learning: Improving Readability
Even if code is functionally correct, it might lack readability due to inconsistencies in variable naming, function signatures, or overall structure. The second stage, style learning, tackles this by improving the stylistic and structural quality of the translated code. This stage incorporates both ‘positive’ and ‘negative’ style examples. Positive examples are translations that maintain stylistic consistency with the source code, typically generated by a powerful LLM such as Qwen32B and filtered through a ‘style consensus selection’ mechanism; negative examples are translations that deviate stylistically. By learning from both, the model learns to prioritize stylistic consistency, making its translations far more readable.
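The paper's exact loss functions are not reproduced in this summary, but one common way to learn from paired positive and negative examples is a margin-based contrastive objective over sequence log-likelihoods. The PyTorch sketch below (all names, shapes, and the margin value are assumptions) pushes the likelihood of the stylistically consistent translation above that of the deviating one.

```python
import torch

def sequence_log_prob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of `labels` under `logits`.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len)
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum(dim=-1)

def style_contrastive_loss(pos_logits, pos_labels, neg_logits, neg_labels,
                           margin: float = 1.0) -> torch.Tensor:
    """Hinge loss pushing the stylistically consistent (positive) translation
    at least `margin` nats above the deviating (negative) one."""
    gap = (sequence_log_prob(pos_logits, pos_labels)
           - sequence_log_prob(neg_logits, neg_labels))
    return torch.clamp(margin - gap, min=0.0).mean()

# Toy usage with random tensors (shapes only; not a real LLM's outputs)
batch, seq_len, vocab = 2, 8, 100
pos_logits = torch.randn(batch, seq_len, vocab)
neg_logits = torch.randn(batch, seq_len, vocab)
pos_labels = torch.randint(vocab, (batch, seq_len))
neg_labels = torch.randint(vocab, (batch, seq_len))
print(style_contrastive_loss(pos_logits, pos_labels, neg_logits, neg_labels))
```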
A New Benchmark for Evaluation
To rigorously test F2STRANS and overcome limitations of existing benchmarks (like outdated code or insufficient test cases), the researchers developed a new, comprehensive code translation benchmark. This new benchmark includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations, allowing for thorough evaluations of both functional accuracy and stylistic quality. The benchmark covers 20 diverse code translation scenarios across five programming languages: C, C++, Go, Java, and Python.
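The count of 20 scenarios is presumably every ordered (source, target) pair of the five languages, i.e. 5 × 4 = 20 translation directions; a quick check:

```python
from itertools import permutations

languages = ["C", "C++", "Go", "Java", "Python"]
# Every ordered (source, target) pair of distinct languages
scenarios = list(permutations(languages, 2))
print(len(scenarios))  # 20
```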
Impressive Results
Experiments conducted on both the new benchmark and traditional datasets demonstrate that F2STRANS significantly improves code translation performance. Remarkably, this approach enables smaller LLMs, such as Qwen1.5B, to outperform larger, more established models like prompt-enhanced Qwen32B and even GPT-4 on average across the 20 code translation scenarios. This highlights the effectiveness of the function-to-style guiding paradigm in making LLMs more efficient and capable for code translation tasks.
A detailed look into the components of F2STRANS through ablation studies revealed that each part, from the relevance-driven data selection to the specific loss functions used in style learning, contributes significantly to the overall performance gains. The research also showed that the style guidance is particularly impactful: by teaching LLMs to adhere to the source code’s style, it helps them avoid superficial mistakes and improves the rate at which compilation errors are corrected.
This work represents a significant step forward in making LLM-generated code translations more reliable and developer-friendly, paving the way for their more effective adoption in real-world software development and maintenance. For more details, you can refer to the original research paper.