TLDR: F2STRANS is a two-stage framework for improving code translation by Large Language Models (LLMs). A first ‘functional learning’ stage ensures the correctness of translated code by fine-tuning on high-quality, functionally consistent code pairs; a second ‘style learning’ stage then improves readability using both positive and negative stylistic examples. With this approach, smaller LLMs like Qwen1.5B outperform larger models such as GPT-4 across diverse code translation scenarios, addressing critical challenges in real-world software development.
Large Language Models (LLMs) have made significant progress in translating code from one programming language to another. This task is crucial for updating applications or migrating software. However, a major hurdle remains: ensuring the translated code is not only functionally correct but also easy to read and maintain. Poorly structured or inconsistent code can be a significant burden for developers, often taking more time to understand than to write from scratch.
Addressing these challenges, researchers have introduced a new approach called F2STRANS. This method is designed to progressively enhance LLMs’ performance in code translation by focusing on two key aspects: functional correctness and code readability. The framework operates in two distinct stages.
Functional Learning: Ensuring Correctness
The first stage, functional learning, optimizes the functional accuracy of the translated code. It relies on high-quality pairs of source and target code, carefully selected from online programming platforms so that both snippets in a pair produce identical outputs for the same inputs. This process involves ‘relevance-driven code pair selection’ to find similar solutions and ‘differential testing’ to verify that both programs in a pair behave identically. By fine-tuning LLMs on this meticulously curated data, F2STRANS ensures that the translated code retains its original functionality.
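To make the differential-testing idea concrete, here is a minimal Python sketch: two candidate programs are run on the same test inputs, and a pair is kept only if their outputs agree everywhere. The command names, file paths, and helper functions below are illustrative assumptions, not the paper's actual pipeline.

```python
import subprocess

def run(cmd: list[str], test_input: str) -> str:
    """Run a program with the given stdin and capture its stdout."""
    result = subprocess.run(
        cmd, input=test_input, capture_output=True, text=True, timeout=10
    )
    return result.stdout.strip()

def differential_test(source_cmd: list[str], target_cmd: list[str],
                      test_inputs: list[str]) -> bool:
    """Accept a code pair only if both programs agree on every test input."""
    for test_input in test_inputs:
        if run(source_cmd, test_input) != run(target_cmd, test_input):
            return False  # behavioral divergence: discard this pair
    return True

# Example: compare a Python solution against its compiled C++ counterpart
# (commands and inputs are illustrative)
inputs = ["3\n1 2 3\n", "5\n5 4 3 2 1\n"]
if differential_test(["python3", "solution.py"], ["./solution_cpp"], inputs):
    print("Pair kept: functionally consistent on all tests")
```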
Style Learning: Improving Readability
Even if code is functionally correct, it might lack readability due to inconsistencies in variable naming, function signatures, or overall structure. The second stage, style learning, tackles this by improving the stylistic and structural quality of the translated code. This stage incorporates both ‘positive’ and ‘negative’ style examples. Positive examples are translations that maintain stylistic consistency with the source code, typically generated by a powerful LLM such as Qwen32B and filtered through a ‘style consensus selection’ mechanism; negative examples are translations that deviate stylistically. By learning from both, the model learns to prioritize stylistic consistency, making its translations far more readable.
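The paper's exact loss functions are not reproduced in this summary, but one common way to learn from paired positive and negative examples is a margin-based contrastive objective over sequence log-likelihoods. The PyTorch sketch below (all names, shapes, and the margin value are assumptions) pushes the likelihood of the stylistically consistent translation above that of the deviating one.

```python
import torch

def sequence_log_prob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of `labels` under `logits`.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len)
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum(dim=-1)

def style_contrastive_loss(pos_logits, pos_labels, neg_logits, neg_labels,
                           margin: float = 1.0) -> torch.Tensor:
    """Hinge loss pushing the stylistically consistent (positive) translation
    at least `margin` nats above the deviating (negative) one."""
    gap = (sequence_log_prob(pos_logits, pos_labels)
           - sequence_log_prob(neg_logits, neg_labels))
    return torch.clamp(margin - gap, min=0.0).mean()

# Toy usage with random tensors (shapes only; not a real LLM's outputs)
batch, seq_len, vocab = 2, 8, 100
pos_logits = torch.randn(batch, seq_len, vocab)
neg_logits = torch.randn(batch, seq_len, vocab)
pos_labels = torch.randint(vocab, (batch, seq_len))
neg_labels = torch.randint(vocab, (batch, seq_len))
print(style_contrastive_loss(pos_logits, pos_labels, neg_logits, neg_labels))
```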
A New Benchmark for Evaluation
To rigorously test F2STRANS and overcome limitations of existing benchmarks (like outdated code or insufficient test cases), the researchers developed a new, comprehensive code translation benchmark. This new benchmark includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations, allowing for thorough evaluations of both functional accuracy and stylistic quality. The benchmark covers 20 diverse code translation scenarios across five programming languages: C, C++, Go, Java, and Python.
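The count of 20 scenarios is presumably every ordered (source, target) pair of the five languages, i.e. 5 × 4 = 20 translation directions; a quick check:

```python
from itertools import permutations

languages = ["C", "C++", "Go", "Java", "Python"]
# Every ordered (source, target) pair of distinct languages
scenarios = list(permutations(languages, 2))
print(len(scenarios))  # 20
```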
Impressive Results
Experiments conducted on both the new benchmark and traditional datasets demonstrate that F2STRANS significantly improves code translation performance. Remarkably, this approach enables smaller LLMs, such as Qwen1.5B, to outperform larger, more established models like prompt-enhanced Qwen32B and even GPT-4 on average across the 20 code translation scenarios. This highlights the effectiveness of the function-to-style guiding paradigm in making LLMs more efficient and capable for code translation tasks.
A detailed look into the components of F2STRANS through ablation studies revealed that each part, from the relevance-driven data selection to the specific loss functions used in style learning, contributes significantly to the overall performance gains. The research also showed that the style guidance is particularly impactful: by teaching LLMs to adhere to the source code’s style, it helps them avoid superficial mistakes and improves the rate at which compilation errors are corrected.
This work represents a significant step forward in making LLM-generated code translations more reliable and developer-friendly, paving the way for their more effective adoption in real-world software development and maintenance. For more details, you can refer to the original research paper.