TLDR: CrossPL is the first benchmark designed to evaluate how well Large Language Models (LLMs) generate code that lets different programming languages work together, focusing specifically on Inter-Process Communication (IPC). The benchmark comprises 1,982 tasks across six languages and seven IPC techniques, and the evaluation found that current LLMs significantly underperform in these cross-language scenarios. Performance varied by programming language and IPC technique, with C++ and gRPC yielding better results than Go and Pipe. Surprisingly, larger model sizes and the ‘thinking mode’ in some LLMs did not consistently improve performance, highlighting a need for more specialized research and development in this area.
In today’s complex software world, it’s increasingly common for different parts of a single system to be written in multiple programming languages. This approach allows developers to use each language’s unique strengths, leading to better performance, modularity, and scalability. However, making these diverse language components work together smoothly, or ‘interoperate,’ introduces significant challenges.
Large Language Models (LLMs) have made impressive strides in generating code, becoming valuable tools in software development. Yet, a crucial question has remained largely unanswered: Can LLMs accurately generate code that enables cross-programming language (CPL) interoperability? Existing benchmarks for LLM code generation primarily focus on single-language tasks or translating code between languages, rather than evaluating their ability to create code that facilitates direct interaction between different languages.
To address this critical gap, researchers have introduced CrossPL, the first benchmark specifically designed to systematically evaluate LLMs’ capability in generating CPL-interoperating code. CrossPL focuses on Inter-Process Communication (IPC), a common mechanism for different software processes (potentially written in different languages) to communicate with each other. IPC methods include technologies like Sockets, gRPC, HTTP, and message queues.
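To make the idea of IPC concrete, here is a minimal sketch of socket-based communication, not a task taken from the benchmark. Both ends are written in Python so the example runs on its own, but the wire format (one JSON object per line over TCP) is language-neutral, so either side could be replaced by a Java, Go, or C++ process. The function names `serve_once`, `call_add`, and `demo`, and the `a`/`b`/`sum` message fields, are assumptions made for illustration.

```python
import json
import socket
import threading


def serve_once(server_sock: socket.socket) -> None:
    """Accept one connection, read one JSON line, reply with one JSON line."""
    conn, _addr = server_sock.accept()
    with conn, conn.makefile("r", encoding="utf-8") as reader:
        request = json.loads(reader.readline())
        reply = {"sum": request["a"] + request["b"]}
        conn.sendall((json.dumps(reply) + "\n").encode("utf-8"))


def call_add(port: int, a: int, b: int) -> int:
    """Client side: send a request line and parse the reply line."""
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall((json.dumps({"a": a, "b": b}) + "\n").encode("utf-8"))
        with client.makefile("r", encoding="utf-8") as reader:
            return json.loads(reader.readline())["sum"]


def demo() -> int:
    """Run the server in a background thread and make one call to it."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server_sock.listen(1)
    port = server_sock.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(server_sock,))
    worker.start()
    try:
        return call_add(port, 2, 3)
    finally:
        worker.join()
        server_sock.close()
```

The point of the sketch is that the two sides share only a protocol, not a runtime: anything that can open a TCP connection and emit the agreed message format can take part.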
The creation of CrossPL was a meticulous process. It involved analyzing over 19,000 multi-language projects from GitHub using 156 specially designed ‘finite state machines’ (FSMs) to identify and characterize CPL interaction patterns. An LLM-based pipeline was then developed to automatically extract relevant code snippets, generate clear task instructions, and validate the functional correctness of the code. The benchmark ultimately comprises 1,982 tasks, covering six widely used programming languages (Java, Python, Go, JavaScript, PHP, C++) and seven representative IPC techniques.
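The article does not show what the paper's FSMs look like, so the following is only a hypothetical sketch of how a finite state machine could flag an IPC usage pattern in source code. It recognizes a Python TCP client by looking for three steps in order: creating a socket, connecting, and sending or receiving. The state names, the regular expressions, and the function `detects_tcp_client` are all assumptions for illustration and are not taken from CrossPL.

```python
import re

# Hypothetical transition table: state -> list of (pattern, next_state).
# These patterns are illustrative and are NOT the FSMs used by CrossPL.
TRANSITIONS = {
    "START": [(re.compile(r"socket\.socket\("), "CREATED")],
    "CREATED": [(re.compile(r"\.connect\("), "CONNECTED")],
    "CONNECTED": [(re.compile(r"\.(sendall|send|recv)\("), "MATCHED")],
}


def detects_tcp_client(source: str) -> bool:
    """Return True if the source creates a socket, connects, then sends/receives."""
    state = "START"
    for line in source.splitlines():
        # Try each outgoing transition of the current state on this line.
        for pattern, next_state in TRANSITIONS.get(state, []):
            if pattern.search(line):
                state = next_state
                break
        if state == "MATCHED":
            return True
    return False
```

Because the machine only advances when the steps appear in order, a file that merely mentions `send` without first creating and connecting a socket is not flagged.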
Key Findings from the Evaluation
The researchers evaluated 14 state-of-the-art general-purpose LLMs and 6 code-oriented LLMs released in the past three years on CrossPL. The results revealed several important insights:
First, even the best-performing models struggled significantly with CPL scenarios. This indicates that while LLMs might excel at single-language coding tasks, generating code for cross-language interactions remains a major hurdle.
Second, LLMs’ performance varied considerably across different programming languages. They performed best on C++ CPL tasks, likely because C++ is often used for low-level system programming and the protocols covered (like TCP, UDP, HTTP, WebSocket) are well-documented and frequently appear in training data. Conversely, performance was weaker in Go, possibly due to Go’s structural characteristics (lack of native classes) which might not align well with LLMs primarily trained on class-based languages.
Third, the models showed varying effectiveness across different IPC techniques. Higher-level, more structured protocols like gRPC yielded better results, likely due to their standardized syntax and schema-driven design. In contrast, lower-level, more flexible mechanisms like HTTP and platform-dependent ones like Pipe presented greater challenges, possibly due to their diverse implementations and less strict interfaces.
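For contrast with schema-driven protocols, here is a minimal sketch of pipe-based IPC, again not drawn from the benchmark. A parent process sends text to a child over the child's standard input and reads the result from its standard output. The child here is another Python interpreter so the example is self-contained, but it could be any executable. The names `CHILD_CODE` and `shout_via_pipe` are assumptions for illustration.

```python
import subprocess
import sys

# The child reads everything from stdin and writes it back upper-cased.
CHILD_CODE = "import sys; sys.stdout.write(sys.stdin.read().upper())"


def shout_via_pipe(text: str) -> str:
    """Send text to a child process through a pipe and return its output."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=text,           # written to the child's stdin pipe
        capture_output=True,  # stdout and stderr are read back through pipes
        text=True,
        check=True,           # raise if the child exits with an error
    )
    return completed.stdout
```

Nothing in a pipe describes the message format: it is a plain byte stream, and both sides have to agree informally on encoding and framing. That lack of an explicit interface is one plausible reading of why such mechanisms are harder to generate correctly than a protocol with a declared schema.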
Finally, the study explored the impact of model characteristics, specifically focusing on the Qwen3 model family. Surprisingly, performance on IPC code generation did not consistently improve with larger model sizes. Furthermore, incorporating a ‘thinking mode’ (designed for general-purpose reasoning) sometimes led to a decline in performance. This suggests that such reasoning might not be well-suited for the highly structured and protocol-driven nature of IPC code.
Looking Ahead
The findings from CrossPL underscore the need for more targeted research on improving LLMs’ ability to generate CPL-interoperating code. This capability is crucial for the future development of complex, multi-language software systems. The CrossPL benchmark and its associated code are publicly available, giving researchers a resource for further exploring and improving LLMs in this challenging area of software engineering. More details are available in the CrossPL research paper.


