Ericsson's Journey into AI-Powered Code Review Automation

TLDR: Ericsson has developed and evaluated a lightweight automated code review tool leveraging Large Language Models (LLMs) and static program analysis. The system extracts the ‘enclosing method’ of modified code lines to provide context to the LLM, which then generates concise and relevant reviews. Preliminary evaluations with expert developers show promising results in reducing cognitive burden and improving efficiency, despite some areas for improvement. The project aims to integrate seamlessly into existing workflows and has a future roadmap including advanced prompting, RAG, and multi-agent systems.

Code review is a cornerstone of software quality assurance, alongside testing and static analysis. However, it often demands significant time and expertise from senior developers, creating a bottleneck in the development lifecycle and diverting them from primary tasks like writing new features and fixing bugs. Recognizing this challenge, Ericsson has explored the use of Large Language Models (LLMs) to automate the code review process.

In their recent work, Ericsson describes their experience in developing a lightweight tool that combines LLMs with static program analysis. The goal is to alleviate the cognitive burden on experienced developers by providing timely and consistent feedback as code is submitted through code review and version control systems such as Gerrit and Git.

A Lightweight Approach to Automated Review

Unlike some approaches that require extensive and costly pre-training or fine-tuning of LLMs, Ericsson opted for a more agile, lightweight method. Their solution focuses on intelligently preparing the input for the LLM. When a developer modifies Java code, the tool uses static program analysis (specifically, the Tree-sitter parser) to identify the ‘enclosing method’: the function or method that contains the changed lines. This contextual information is crucial for the LLM to generate relevant and accurate reviews.
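As a rough illustration of what enclosing-method extraction involves, the Python sketch below finds the method around a changed line by counting braces. It is a deliberately simplified stand-in for the Tree-sitter parsing described above, not Ericsson's implementation. The function name `enclosing_method` is hypothetical, and the sketch assumes a single top-level class in which every depth-two block is a method and no string literal or comment contains a brace.

```python
from typing import Optional, Tuple


def enclosing_method(source: str, changed_line: int) -> Optional[Tuple[int, int, str]]:
    """Return (start_line, end_line, text) of the block that encloses changed_line.

    Line numbers are 1-indexed. Assumption: depth 1 is the class body and each
    depth-2 block is a method (nested classes, initializer blocks and braces in
    strings or comments are not handled by this brace-counting heuristic).
    """
    lines = source.splitlines()
    depth = 0
    start = None  # line on which the current depth-2 block was opened
    for lineno, line in enumerate(lines, start=1):
        for ch in line:
            if ch == "{":
                depth += 1
                if depth == 2 and start is None:
                    start = lineno
            elif ch == "}":
                if depth == 2 and start is not None:
                    # Closing the depth-2 block: check whether it covers the change.
                    if start <= changed_line <= lineno:
                        return start, lineno, "\n".join(lines[start - 1:lineno])
                    start = None
                depth -= 1
    return None  # the changed line is outside every method (e.g. a field)
```

A real parser such as Tree-sitter works on the syntax tree instead of raw characters, so it does not share these limitations; the sketch only shows the idea of mapping a changed line to the smallest surrounding method.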

The team experimented with various prompting strategies, moving beyond simple requests like “Please generate a code review for the following code.” They found that effective prompts needed to ensure reviews were concise, human-like, focused on the enclosing method (to prevent ‘hallucinations’ of irrelevant code elements), and avoided generating new code in the output. Post-processing steps were also implemented to refine the LLM’s output, including summarizing and ranking reviews, and validating them with human experts to recalibrate prompts.
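The prompt constraints listed above can be sketched as a small prompt builder. The wording of the instructions, the `>>` marker for changed lines, and the function name `build_review_prompt` are all assumptions made for illustration; the article does not publish the actual prompt text.

```python
def build_review_prompt(method_source: str, changed_lines: list[int], first_line: int = 1) -> str:
    """Assemble a review prompt for one enclosing method.

    changed_lines holds file line numbers; first_line is the file line number
    of the method's first line. Changed lines are flagged with '>>'.
    """
    numbered = []
    for offset, text in enumerate(method_source.splitlines()):
        lineno = first_line + offset
        marker = ">>" if lineno in changed_lines else "  "
        numbered.append(f"{marker} {lineno:4d} | {text}")

    # Hypothetical wording that mirrors the four constraints from the article.
    rules = [
        "- Keep the review concise.",
        "- Write the way a human reviewer would.",
        "- Comment only on the method shown below; do not refer to code that is not shown.",
        "- Do not write new code in your answer.",
    ]
    return "\n".join(
        ["You are reviewing a Java change. Lines marked '>>' were modified.", ""]
        + rules
        + ["", "Method under review:"]
        + numbered
    )
```

Keeping the rules in a plain list makes it easy to recalibrate the prompt after human validation, which the article names as part of the workflow.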

Practical considerations were central to their design. The tool aims to generate reviews that are relevant, concise, and accurate, while being fast, cost-efficient, and easy to integrate into existing development workflows. Security, logging feedback for continuous improvement, and incorporating human validation were also key aspects.

The Automated Code Review Pipeline

The process begins by extracting the latest code changes from the Gerrit API. For each change, the system identifies the modified files and their diffs. The critical step then involves extracting the enclosing Java function for each diff, providing the necessary context. This contextualized code snippet is then fed to an LLM, such as Code Llama, with a suitable prompt. Finally, the LLM’s generated review undergoes post-processing, which includes presenting, saving, and summarizing the feedback. The tool is integrated into developers’ workflows via a web-based user interface and a plugin for Visual Studio Code.
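One step in this pipeline, turning a diff into the line numbers that changed, can be sketched as follows. The sketch assumes the diff for a single file is available in unified format; the article does not say which diff representation the tool reads from Gerrit, and the name `changed_new_lines` is hypothetical.

```python
import re

# Matches a unified-diff hunk header such as "@@ -4,5 +4,6 @@" and captures
# the starting line number on the new-file side.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def changed_new_lines(unified_diff: str) -> list[int]:
    """Return new-file line numbers of added or modified lines (single-file diff)."""
    changed = []
    new_lineno = None  # None until the first hunk header has been seen
    for line in unified_diff.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            new_lineno = int(header.group(1))
            continue
        if new_lineno is None:
            continue  # skip the "---" / "+++" file headers before the first hunk
        if line.startswith("+"):
            changed.append(new_lineno)
            new_lineno += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue  # removed lines and "\ No newline" markers are not in the new file
        else:
            new_lineno += 1  # context line, present in both versions
    return changed
```

Each returned line number can then be mapped to its enclosing method, so the LLM receives one contextualized snippet per modified method.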

Evaluating the Solution

Ericsson conducted surveys with experienced developers to evaluate their automated code review system. The evaluation addressed three key questions:

How good is the code review generation by the LLMs? Experts reviewed LLM-generated feedback for 10 Java code snippets. While there were positive comments (e.g., appreciating suggestions for meaningful variable names), there were also neutral and negative remarks. Negative feedback highlighted issues like incorrect variable types, irrelevant or incorrect reviews, missing abstractions, or excessive verbosity.
Which LLM produced the best review? In a pairwise comparison, experts compared reviews from four smaller models (Llama 2 13B, Code Llama 13B, Llama 2 7B, Code Llama 7B). Preliminary results suggested that the Code Llama 13B model performed better than the others.
How good is the code review tool in practice? Nine expert developers used the tool for fifteen days. Four out of nine agreed the tool saved them time and improved their overall coding efficiency. Common criticisms included reviews merely explaining the code, being factually incorrect, or focusing on irrelevant areas. Usage frequency varied, with two developers using it regularly and five sometimes.
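A pairwise comparison like the one in the second question can be tallied by counting how often each model's review is preferred. The sketch below is only an illustration of that bookkeeping: the function name `tally_pairwise` is hypothetical and the article does not describe how Ericsson aggregated the experts' choices.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally_pairwise(judgments: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count wins per model from (model_a, model_b, winner) judgments."""
    wins = Counter()
    for model_a, model_b, winner in judgments:
        if winner not in (model_a, model_b):
            raise ValueError(f"winner {winner!r} is not one of the compared models")
        # Register both models so that a model with zero wins still appears.
        wins[model_a] += 0
        wins[model_b] += 0
        wins[winner] += 1
    return wins
```

For example, with placeholder model names (not the study's data), `tally_pairwise([("x", "y", "x"), ("x", "z", "x"), ("y", "z", "z")])` gives two wins to `x`, one to `z`, and none to `y`.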

Additional experiments showed the LLM could find relatively easy logical bugs in adversarial prompting scenarios and generally stuck to commenting only on changed lines. The time taken for LLMs to generate reviews was consistently around 5-6 seconds, regardless of snippet length.

Looking Ahead

Ericsson’s study demonstrates a practical, lightweight approach to automated code review using LLMs that can be integrated into existing development systems without expensive fine-tuning or reliance on external black-box tools. The initial results are promising, validating that the method can reduce redundant feedback and enhance usability.

The research is ongoing, with a roadmap for future improvements. This includes expanding user surveys and experimenting with new LLMs and advanced prompting strategies such as zero-shot, few-shot, and chain-of-thought prompting. Future phases will explore Retrieval-Augmented Generation (RAG) and Graph-RAG to provide more context from documentation and past reviews, as well as a multi-agent framework in which specialized AI agents handle different review tasks and continuously learn from feedback. The ultimate goal is seamless integration with various internal software engineering tools at Ericsson. You can read the full research paper here.
