TLDR: IBM Research developed an automated system to evaluate the quality of COBOL-to-Java code transformations performed by IBM watsonx Code Assistant for Z (WCA4Z). This system addresses challenges in evaluating LLM-based translators by combining precise analytic checkers (for syntax, semantics, and hallucinations) with holistic LLM-as-a-Judge (LaaJ) techniques. It provides scalable, multi-faceted evaluations, supports continuous integration, enables large-scale benchmarking, and generates actionable insights for developers and project managers, reducing reliance on manual review and facilitating the modernization of legacy mainframe applications.
In the rapidly evolving landscape of artificial intelligence, large language models (LLMs) are transforming software development. IBM’s watsonx Code Assistant for Z (WCA4Z) stands out by applying these advanced AI capabilities to mainframe environments, specifically addressing the challenge of modernizing legacy COBOL applications by transforming them into Java code.
Mainframes, despite the rise of cloud computing, remain essential for mission-critical applications in sectors like banking and government. COBOL, a language developed in the late 1950s, is still widely used on these systems, with an estimated 200 billion lines of live code. However, maintaining these legacy systems is becoming increasingly difficult due to a shrinking pool of skilled COBOL developers. WCA4Z aims to bridge this gap by leveraging generative AI to automate the modernization process, including the complex task of COBOL-to-Java code transformation.
The process within WCA4Z involves a fine-tuned LLM that converts COBOL into optimized, object-oriented Java. Unlike older tools that might produce “JOBOL” (Java code that merely mimics COBOL’s structure and idioms), WCA4Z uses a two-phase, semantically driven approach. First, a Class Designer analyzes the COBOL program to propose a Java class design. Once approved, the system generates Java class files with method headers. The second phase focuses on method-level transformation, allowing developers to review, edit, and approve the generated Java code, ensuring accuracy and alignment with the intended design.
Evaluating the Transformation Quality
A significant challenge lies in evaluating the quality of these LLM-based code translations. LLMs are often seen as “black boxes,” offering little insight into their internal reasoning, and they can sometimes produce errors or “hallucinations.” Furthermore, proving the exact functional equivalence between a COBOL program and its Java translation is a complex, often undecidable problem due to the vast semantic differences between the languages.
To address these challenges, IBM Research – Israel developed an automated evaluation system. This system is a data-driven pipeline that collects translation results, processes them, and assesses quality using a multi-faceted approach. It supports continuous integration workflows and large-scale benchmarking, significantly reducing reliance on manual review by human subject-matter experts.
The Hybrid Evaluation Approach
The system employs a combination of techniques to provide a comprehensive assessment:
- Static Analytic Checkers: These are highly accurate, rule-based checks that analyze the translated Java code without executing it. They include:
  - Syntactic Checks: Basic validations like ensuring the output is not empty, doesn’t contain endlessly repeated text, and is parsable Java code.
  - Semantic Checks: More in-depth analysis to verify correct translation of specific elements, such as variable usage, procedure calls (COBOL PERFORM statements), and middleware calls (e.g., CICS, IMS, SQL). These checks compare elements in the COBOL control flow graph with corresponding elements in the Java parse tree. The system also identifies “hallucinations” – Java code elements generated without a corresponding COBOL source.
- Dynamic Testing (Compilation and Execution): While conceptually ideal, compiling and executing the translated Java code on Z platforms is complex, requiring full program contexts and high-quality test data. Currently, this aspect primarily helps uncover issues in the class designer component rather than in the LLM translation itself, since error-free translation of complete programs remains a work in progress.
- LLM as a Judge (LaaJ): This innovative approach uses another LLM to holistically assess translation quality. LaaJs can understand both syntactic structures and semantic nuances, rating translations on a seven-point scale developed with domain experts. The LaaJ’s effectiveness is refined through human evaluations and “partial order benchmarks,” where different quality variants of translations are used to train the judge.
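To make the analytic checkers above concrete, here is a minimal sketch of what such rule-based checks might look like. All function names are hypothetical, and the brace-balancing check is a crude stand-in for a real Java parser; the hallucination check assumes COBOL paragraph names have already been extracted from the control flow graph and normalized to Java-style camelCase.

```python
import re

def check_not_empty(java_src: str) -> bool:
    """Syntactic check: reject empty or whitespace-only outputs."""
    return bool(java_src.strip())

def check_no_degenerate_repetition(java_src: str, max_repeats: int = 5) -> bool:
    """Syntactic check: reject outputs where the same line repeats many
    times in a row -- a common LLM failure mode ("endlessly repeated text")."""
    lines = [l.strip() for l in java_src.splitlines() if l.strip()]
    run = 1
    for prev, cur in zip(lines, lines[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return False
    return True

def check_braces_balanced(java_src: str) -> bool:
    """Crude stand-in for a real parsability check: braces must balance."""
    depth = 0
    for ch in java_src:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def find_hallucinated_calls(java_src: str, cobol_paragraphs: set) -> set:
    """Semantic check: flag method calls in the Java output that have no
    counterpart in the COBOL source. `cobol_paragraphs` holds paragraph
    names from the COBOL control flow graph (hypothetical input format)."""
    called = set(re.findall(r"\b(\w+)\s*\(", java_src))
    java_builtins = {"println", "printf", "main", "valueOf", "equals"}
    return called - cobol_paragraphs - java_builtins
```

These checks are precise but deliberately narrow: each one validates a single property and says nothing about overall translation quality, which is exactly why the system pairs them with a holistic judge.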
The paper highlights that no single evaluation method is a “silver bullet.” Instead, combining precise but partial analytic checkers with comprehensive but less precise LLM-as-a-Judge techniques yields a balanced and informative evaluation.
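The combination described above could be structured as a simple gated pipeline: the precise analytic checks run first and act as hard gates, and only outputs that pass them receive a holistic 1–7 LaaJ rating. This is a hypothetical sketch, not the paper's actual architecture; the `judge` callable stands in for a real LLM-as-a-Judge invocation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Evaluation:
    """Combined verdict: hard failures from analytic checks
    plus an optional holistic LaaJ score on a 1-7 scale."""
    failed_checks: List[str] = field(default_factory=list)
    laaj_score: Optional[int] = None  # None when hard failures make judging moot

def evaluate(java_src: str,
             checks: Dict[str, Callable[[str], bool]],
             judge: Callable[[str], int]) -> Evaluation:
    # Precise-but-partial analytic checks run first, as fail-fast gates.
    failed = [name for name, check in checks.items() if not check(java_src)]
    if failed:
        return Evaluation(failed_checks=failed)
    # Comprehensive-but-less-precise LaaJ rates the surviving candidates.
    return Evaluation(laaj_score=judge(java_src))

# Usage with a stub judge (a real system would call an LLM here):
checks = {"non_empty": lambda s: bool(s.strip())}
verdict = evaluate("class A {}", checks, judge=lambda s: 6)
```

The gating order reflects the paper's trade-off: cheap deterministic checks filter out unambiguous failures, so the expensive, noisier judge is spent only on plausible translations.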
Actionable Insights and Future Directions
The evaluation system synthesizes results into actionable insights for various stakeholders. Project managers can view high-level quality assessments and compare different LLM versions, while technical leads can delve into specific issues, identifying problematic COBOL statements or translation patterns. The system uses platforms like Grafana to visualize data, including heatmaps showing average LaaJ scores for different COBOL statements, helping pinpoint areas needing improvement.
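The numbers behind such a heatmap can be produced with a straightforward aggregation. The sketch below assumes a hypothetical record format in which each translated method carries the set of COBOL statement types it contains and its LaaJ score; the statement names and scores are illustrative, not taken from the paper.

```python
from collections import defaultdict

def average_scores_by_statement(records):
    """Aggregate per-method LaaJ scores (1-7) by the COBOL statement
    types each method contains, yielding one heatmap cell per statement."""
    totals = defaultdict(lambda: [0, 0])  # statement -> [score_sum, count]
    for stmt_types, score in records:
        for stmt in stmt_types:
            totals[stmt][0] += score
            totals[stmt][1] += 1
    return {stmt: s / n for stmt, (s, n) in totals.items()}

# Illustrative records: (COBOL statement types in the method, LaaJ score)
records = [
    ({"PERFORM", "MOVE"}, 6),
    ({"PERFORM", "EXEC CICS"}, 3),
    ({"MOVE"}, 7),
]
averages = average_scores_by_statement(records)
# PERFORM averages lower than MOVE here, flagging it as a harder construct
```

A dashboard such as Grafana would then render these per-statement averages as heatmap cells, making low-scoring constructs visually obvious.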
This evaluation framework has been instrumental in improving the quality of WCA4Z’s code transformation component since its early stages. It provides clear visibility into performance, aiding in the identification and resolution of significant issues. The foundational components of the platform are also being adapted for evaluating other WCA4Z features, such as code explanation and generation.
Looking ahead, the team is continuously enhancing the system. Future work includes bridging more complex semantic gaps between COBOL and Java, incorporating deeper domain-specific knowledge into LaaJs through prompt engineering or fine-tuning, and automatically generating new benchmarks based on identified challenging translation areas. This ongoing development ensures that IBM watsonx Code Assistant for Z continues to deliver high-quality, modernized codebases for critical mainframe applications. You can learn more about this work in the research paper available here: Quality Evaluation of COBOL to Java Code Transformation.


