TLDR: IBM Research developed an automated system to evaluate the quality of COBOL-to-Java code transformations performed by IBM watsonx Code Assistant for Z (WCA4Z). This system addresses challenges in evaluating LLM-based translators by combining precise analytic checkers (for syntax, semantics, and hallucinations) with holistic LLM-as-a-Judge (LaaJ) techniques. It provides scalable, multi-faceted evaluations, supports continuous integration, enables large-scale benchmarking, and generates actionable insights for developers and project managers, reducing reliance on manual review and facilitating the modernization of legacy mainframe applications.
In the rapidly evolving landscape of artificial intelligence, large language models (LLMs) are transforming software development. IBM’s watsonx Code Assistant for Z (WCA4Z) stands out by applying these advanced AI capabilities to mainframe environments, specifically addressing the challenge of modernizing legacy COBOL applications by transforming them into Java code.
Mainframes, despite the rise of cloud computing, remain essential for mission-critical applications in sectors like banking and government. COBOL, a language developed in the late 1950s, is still widely used on these systems, with an estimated 200 billion lines of live code. However, maintaining these legacy systems is becoming increasingly difficult due to a shrinking pool of skilled COBOL developers. WCA4Z aims to bridge this gap by leveraging generative AI to automate the modernization process, including the complex task of COBOL-to-Java code transformation.
The process within WCA4Z involves a fine-tuned LLM that converts COBOL into optimized, object-oriented Java. Unlike older tools that might produce “JOBOL” (Java code that merely mimics COBOL’s structure and idioms), WCA4Z uses a two-phase, semantically driven approach. First, a Class Designer analyzes the COBOL program to propose a Java class design. Once approved, the system generates Java class files with method headers. The second phase focuses on method-level transformation, allowing developers to review, edit, and approve the generated Java code, ensuring accuracy and alignment with the intended design.
Evaluating the Transformation Quality
A significant challenge lies in evaluating the quality of these LLM-based code translations. LLMs are often seen as “black boxes,” offering little insight into their internal reasoning, and they can sometimes produce errors or “hallucinations.” Furthermore, proving the exact functional equivalence between a COBOL program and its Java translation is a complex, often undecidable problem due to the vast semantic differences between the languages.
To address these challenges, IBM Research – Israel developed an automated evaluation system. This system is a data-driven pipeline that collects translation results, processes them, and assesses quality using a multi-faceted approach. It supports continuous integration workflows and large-scale benchmarking, significantly reducing reliance on manual review by human subject-matter experts.
The Hybrid Evaluation Approach
The system employs a combination of techniques to provide a comprehensive assessment:
- Static Analytic Checkers: These are highly accurate, rule-based checks that analyze the translated Java code without executing it. They include:
  - Syntactic Checks: Basic validations like ensuring the output is not empty, doesn’t contain endlessly repeated text, and is parsable Java code.
  - Semantic Checks: More in-depth analysis to verify correct translation of specific elements, such as variable usage, procedure calls (COBOL PERFORM statements), and middleware calls (e.g., CICS, IMS, SQL). These checks compare elements in the COBOL control flow graph with corresponding elements in the Java parse tree. The system also identifies “hallucinations” – Java code elements generated without a corresponding COBOL source.
- Dynamic Testing (Compilation and Execution): While conceptually ideal, compiling and executing the translated Java code on Z platforms is complex, requiring full program contexts and high-quality test data. Currently, this aspect primarily helps uncover issues in the class designer component rather than in the LLM translation itself, since error-free translation of complete programs remains a work in progress.
- LLM as a Judge (LaaJ): This innovative approach uses another LLM to holistically assess translation quality. LaaJs can understand both syntactic structures and semantic nuances, rating translations on a seven-point scale developed with domain experts. The LaaJ’s effectiveness is refined through human evaluations and “partial order benchmarks,” where different quality variants of translations are used to train the judge.
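To make the analytic checkers above concrete, here is a minimal sketch of what such rule-based checks might look like. All function names are hypothetical, and the brace-balancing check is a crude stand-in for a real Java parser; the hallucination check assumes COBOL paragraph names have already been extracted from the control flow graph and normalized to Java-style camelCase.

```python
import re

def check_not_empty(java_src: str) -> bool:
    """Syntactic check: reject empty or whitespace-only outputs."""
    return bool(java_src.strip())

def check_no_degenerate_repetition(java_src: str, max_repeats: int = 5) -> bool:
    """Syntactic check: reject outputs where the same line repeats many
    times in a row -- a common LLM failure mode ("endlessly repeated text")."""
    lines = [l.strip() for l in java_src.splitlines() if l.strip()]
    run = 1
    for prev, cur in zip(lines, lines[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return False
    return True

def check_braces_balanced(java_src: str) -> bool:
    """Crude stand-in for a real parsability check: braces must balance."""
    depth = 0
    for ch in java_src:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def find_hallucinated_calls(java_src: str, cobol_paragraphs: set) -> set:
    """Semantic check: flag method calls in the Java output that have no
    counterpart in the COBOL source. `cobol_paragraphs` holds paragraph
    names from the COBOL control flow graph (hypothetical input format)."""
    called = set(re.findall(r"\b(\w+)\s*\(", java_src))
    java_builtins = {"println", "printf", "main", "valueOf", "equals"}
    return called - cobol_paragraphs - java_builtins
```

These checks are precise but deliberately narrow: each one validates a single property and says nothing about overall translation quality, which is exactly why the system pairs them with a holistic judge.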
The paper highlights that no single evaluation method is a “silver bullet.” Instead, combining precise but partial analytic checkers with comprehensive but less precise LLM-as-a-Judge techniques yields a balanced and informative evaluation.
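The combination described above could be structured as a simple gated pipeline: the precise analytic checks run first and act as hard gates, and only outputs that pass them receive a holistic 1–7 LaaJ rating. This is a hypothetical sketch, not the paper's actual architecture; the `judge` callable stands in for a real LLM-as-a-Judge invocation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Evaluation:
    """Combined verdict: hard failures from analytic checks
    plus an optional holistic LaaJ score on a 1-7 scale."""
    failed_checks: List[str] = field(default_factory=list)
    laaj_score: Optional[int] = None  # None when hard failures make judging moot

def evaluate(java_src: str,
             checks: Dict[str, Callable[[str], bool]],
             judge: Callable[[str], int]) -> Evaluation:
    # Precise-but-partial analytic checks run first, as fail-fast gates.
    failed = [name for name, check in checks.items() if not check(java_src)]
    if failed:
        return Evaluation(failed_checks=failed)
    # Comprehensive-but-less-precise LaaJ rates the surviving candidates.
    return Evaluation(laaj_score=judge(java_src))

# Usage with a stub judge (a real system would call an LLM here):
checks = {"non_empty": lambda s: bool(s.strip())}
verdict = evaluate("class A {}", checks, judge=lambda s: 6)
```

The gating order reflects the paper's trade-off: cheap deterministic checks filter out unambiguous failures, so the expensive, noisier judge is spent only on plausible translations.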
Actionable Insights and Future Directions
The evaluation system synthesizes results into actionable insights for various stakeholders. Project managers can view high-level quality assessments and compare different LLM versions, while technical leads can delve into specific issues, identifying problematic COBOL statements or translation patterns. The system uses platforms like Grafana to visualize data, including heatmaps showing average LaaJ scores for different COBOL statements, helping pinpoint areas needing improvement.
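The numbers behind such a heatmap can be produced with a straightforward aggregation. The sketch below assumes a hypothetical record format in which each translated method carries the set of COBOL statement types it contains and its LaaJ score; the statement names and scores are illustrative, not taken from the paper.

```python
from collections import defaultdict

def average_scores_by_statement(records):
    """Aggregate per-method LaaJ scores (1-7) by the COBOL statement
    types each method contains, yielding one heatmap cell per statement."""
    totals = defaultdict(lambda: [0, 0])  # statement -> [score_sum, count]
    for stmt_types, score in records:
        for stmt in stmt_types:
            totals[stmt][0] += score
            totals[stmt][1] += 1
    return {stmt: s / n for stmt, (s, n) in totals.items()}

# Illustrative records: (COBOL statement types in the method, LaaJ score)
records = [
    ({"PERFORM", "MOVE"}, 6),
    ({"PERFORM", "EXEC CICS"}, 3),
    ({"MOVE"}, 7),
]
averages = average_scores_by_statement(records)
# PERFORM averages lower than MOVE here, flagging it as a harder construct
```

A dashboard such as Grafana would then render these per-statement averages as heatmap cells, making low-scoring constructs visually obvious.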
This evaluation framework has been instrumental in improving the quality of WCA4Z’s code transformation component since its early stages. It provides clear visibility into performance, aiding in the identification and resolution of significant issues. The foundational components of the platform are also being adapted for evaluating other WCA4Z features, such as code explanation and generation.
Looking ahead, the team is continuously enhancing the system. Future work includes bridging more complex semantic gaps between COBOL and Java, incorporating deeper domain-specific knowledge into LaaJs through prompt engineering or fine-tuning, and automatically generating new benchmarks based on identified challenging translation areas. This ongoing development ensures that IBM watsonx Code Assistant for Z continues to deliver high-quality, modernized codebases for critical mainframe applications. You can learn more about this work in the research paper available here: Quality Evaluation of COBOL to Java Code Transformation.


