TLDR: A new research paper investigates whether execution trace-based semantic information helps Code Large Language Models (LLMs) in understanding and generating code. The study introduces a framework to integrate various trace representations (NExT, SemCoder, Code Executor, Concise) into LLM training and inference. Surprisingly, the findings suggest that current trace-based semantic information has limited usefulness for fine-tuning Code LLMs and offers mixed benefits during test-time scaling, challenging previous assumptions. The paper highlights the need for new semantic representations and integration strategies to truly enhance Code LLM performance.
Code Large Language Models, or Code LLMs, have rapidly transformed the landscape of programming, offering impressive capabilities in tasks like code generation, repair, and summarization. However, despite their advancements, these models still face significant hurdles, particularly in understanding the dynamic, runtime behavior of programs and reasoning about their actual functionality.
A recent study titled “Do Code Semantics Help? A Comprehensive Study on Execution Trace-Based Information for Code Large Language Models” by Jian Wang, Xiaofei Xie, Qiang Hu, Shangqing Liu, and Yi Li, delves into these limitations. The researchers highlight two primary issues: Code LLMs struggle to interpret what programs truly do during execution, and the semantic information, such as execution traces, is often represented inconsistently across different methods, hindering the models’ ability to generalize and reason effectively.
To tackle these challenges, the researchers introduced a generic framework designed to integrate semantic information, specifically execution traces, into prompts relevant to coding tasks. This framework allowed for a comprehensive investigation into how semantic information could enhance the reasoning abilities of Code LLMs. The study focused on evaluating the usefulness of trace-based semantic information in both supervised fine-tuning (SFT) and the post-training inference phase of these models.
Understanding Execution Traces
A key component of their framework is the trace adapter, which supports various ways of representing execution traces. These include:
- NExT: Integrates execution traces directly into the code as inline comments, showing variable changes on each line.
- SemCoder: Uses natural language to describe execution traces, offering line-by-line explanations of execution status, variable changes, and input-output relationships.
- Code Executor: Records variable state changes line-by-line, similar to NExT, but presents these traces separately from the code.
- Concise: A simplified version of Code Executor, it records only the value changes of variables line-by-line, omitting variables whose values remain unchanged.
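To make the Concise representation concrete, here is a minimal Python sketch of a line-level tracer that records only the variables whose values changed, built on the standard-library `sys.settrace` hook. This is an illustration under our own assumptions, not the paper's trace adapter; the helper names are ours.

```python
import sys

def concise_trace(fn, *args):
    """Run fn(*args) and record, per line event, only the variables
    whose values changed (a Concise-style trace). Note: a 'line'
    event fires *before* a line runs, so a change made by one line
    is observed at the next event."""
    events = []
    prev = {}

    def tracer(frame, event, arg):
        nonlocal prev
        if event == "line" and frame.f_code is fn.__code__:
            cur = dict(frame.f_locals)
            changed = {k: v for k, v in cur.items()
                       if k not in prev or prev[k] != v}
            if changed:  # omit events where nothing changed
                events.append((frame.f_lineno, changed))
            prev = cur
        return tracer  # keep tracing inside this call

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, events

def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, trace = concise_trace(running_sum, 3)
print(result)  # 3
for lineno, changed in trace:
    print(lineno, changed)
```

For `running_sum(3)` the tracer emits entries such as `{'i': 0}`, `{'i': 1}`, `{'total': 1}`, skipping lines where no value changed. The NExT and Code Executor representations would carry similar content but render it as inline comments or as a full per-line state log, respectively.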
Surprising Findings on Fine-Tuning
The experimental results presented a surprising finding that challenges previous assumptions: integrating trace-based semantic information into fine-tuning datasets did not significantly improve the code generation capabilities of Code LLMs. For program repair tasks, only SemCoder showed limited improvements, and for code synthesis and reasoning tasks, models trained without trace information performed better in more than half the cases.
This suggests that while semantic information might boost the prediction confidence of Code LLMs, it doesn’t necessarily translate into increased prediction correctness during fine-tuning. The researchers concluded that there isn’t a single trace representation that consistently outperforms others in this context, though SemCoder and SemCoder (GPT4o) were noted as relatively better for program repair and code reasoning tasks, respectively.
Insights into Test-Time Scaling
The study also explored the impact of semantic information during test-time scaling, where models iteratively refine solutions or generate multiple candidates. Here, test-time scaling strategies generally enhanced the code generation ability of Code LLMs. However, as with fine-tuning, the overall usefulness of semantic information remained ambiguous: in over half of the cases, adding semantic information to the input prompt did not help Code LLMs produce more correct code compared to leaving it out.
An exception was the Concise representation, which performed no worse than the ‘without trace’ baseline in most cases, indicating its potential for guiding LLMs in generating more accurate code during inference. The study also found that closed-source LLMs generally performed better at test time compared to open-source counterparts.
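The iterative-refinement flavor of test-time scaling can be sketched as a generate-execute-feedback loop. The sketch below is a hypothetical illustration, not the study's setup: the LLM call is stubbed with a fixed list of candidate programs, and the failure feedback stands in for where an execution trace would be appended to the prompt.

```python
def run_tests(code_str, tests):
    """Execute a candidate solution and return (passed, feedback)."""
    env = {}
    try:
        exec(code_str, env)
        for inp, expected in tests:
            got = env["solve"](inp)
            if got != expected:
                return False, f"solve({inp!r}) returned {got!r}, expected {expected!r}"
        return True, "all tests passed"
    except Exception as e:
        return False, f"execution error: {e}"

def refine(candidates, tests, max_rounds=3):
    """Try candidates in order, feeding failure feedback (where a
    trace representation would be inserted) back into the prompt."""
    prompt = "Write solve(x) that doubles x."
    for round_no, code in zip(range(max_rounds), candidates):
        passed, feedback = run_tests(code, tests)
        if passed:
            return code, round_no
        # In a real pipeline, an execution trace would be added here.
        prompt += f"\nPrevious attempt failed: {feedback}"
    return None, max_rounds

# Stubbed "LLM outputs": first attempt buggy, second correct.
attempts = [
    "def solve(x):\n    return x + 1",   # wrong
    "def solve(x):\n    return 2 * x",   # correct
]
tests = [(3, 6), (0, 0)]
best, rounds = refine(attempts, tests)
print(rounds)  # 1 (succeeded on the second attempt)
```

The study's finding is about the feedback step: in over half of the cases, enriching that feedback with trace-based semantics did not yield more correct code than plain failure signals, with the Concise representation being the most reliable exception.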
Future Directions
The paper concludes that existing trace-based code semantic information does not significantly enhance the fine-tuning and test-time scaling of Code LLMs. This opens up crucial avenues for future research, emphasizing the need to design new forms of semantic representations that better align with how Code LLMs process and understand code. Future work should also explore more effective strategies for integrating semantic information into model training and inference pipelines, including architectural modifications, specialized pretraining objectives, or more adaptive prompting techniques.
For a deeper dive into the methodology and detailed results, you can access the full research paper here.


