TLDR: CABench is the first public benchmark for Composable AI, featuring 70 complex tasks and a pool of roughly 700 ready-to-use models. It introduces an evaluation framework and provides initial baselines, showing that while Large Language Models (LLMs) can attempt these tasks, human-designed solutions currently perform far better. The research highlights the potential of Composable AI for solving complex real-world problems and the need for more advanced methods to automatically generate effective AI pipelines.
Artificial Intelligence (AI) has made incredible strides, integrating into various fields from healthcare to autonomous driving. However, many real-world problems are inherently complex, requiring multiple steps, different types of data, and various processing components. Imagine a medical system that needs to analyze images, understand clinical notes, and assess risk – building a single, all-encompassing AI model for such a task is often impractical.
This challenge highlights the need for a more flexible and modular approach. Instead of creating massive, monolithic AI models from scratch for every complex task, a promising solution lies in Composable AI (CA). This paradigm involves breaking down complex AI problems into smaller, more manageable sub-tasks and then solving each sub-task by combining existing, well-trained AI models.
Introducing CABench: A New Standard for Composable AI
Despite the potential of Composable AI, the systematic evaluation of methods in this area remains largely unexplored. This is where CABench comes in: the researchers present it as the first public benchmark designed specifically for Composable AI. CABench comprises 70 realistic, complex AI tasks and a curated pool of approximately 700 ready-to-use models spanning multiple data types and domains. The benchmark also includes an evaluation framework that assesses Composable AI solutions end to end.
The creation of CABench was guided by several key principles: realism (tasks derived from popular real-world datasets), decomposability (tasks that can be naturally broken down), solvability (tasks solvable with the provided model pool, often requiring ‘glue code’ for integration), diversity (covering a wide range of domains), and evaluability (clear input/output specifications and metrics).
How Composable AI Works
At its core, Composable AI aims to automatically decompose a complex task into sub-tasks, select appropriate models from a pool to solve each sub-task, and then compose these selected models into a coherent, executable pipeline. This pipeline is often represented as a directed acyclic graph (DAG), where each node is either an AI model or a ‘glue code’ module. Glue code is crucial for handling data pre-processing, format transformations, and integrating diverse outputs, ensuring different models can work together seamlessly.
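The DAG view of a pipeline can be made concrete with a small sketch. This is not CABench's implementation: `run_pipeline`, the `nodes`/`edges` dictionaries, and the toy string functions are all hypothetical names chosen for illustration. Each node is a plain callable (standing in for either a model or a glue-code module), and nodes are executed in topological order so every node sees the outputs of its predecessors.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable


def run_pipeline(
    nodes: dict[str, Callable[..., Any]],
    edges: dict[str, list[str]],
    inputs: dict[str, Any],
) -> dict[str, Any]:
    """Execute a pipeline described as a DAG (hypothetical helper).

    nodes:  node name -> callable (a model or a glue-code function)
    edges:  node name -> ordered list of predecessor names
    inputs: name -> raw input value for the source nodes
    """
    results = dict(inputs)
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(edges).static_order():
        if name in results:  # raw input, nothing to compute
            continue
        args = [results[pred] for pred in edges.get(name, [])]
        results[name] = nodes[name](*args)
    return results


# Toy example: two "models" consume the same input, glue code joins them.
nodes = {
    "upper": str.upper,
    "exclaim": lambda s: s + "!",
    "join": lambda a, b: f"{a} {b}",  # glue code merging two outputs
}
edges = {
    "upper": ["text"],
    "exclaim": ["text"],
    "join": ["upper", "exclaim"],
}
out = run_pipeline(nodes, edges, {"text": "hi"})
print(out["join"])
```

Because the graph is handed to `TopologicalSorter`, a pipeline that accidentally contains a cycle fails with `graphlib.CycleError` instead of looping forever.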
For instance, a task might involve determining if an audio claim is supported by textual evidence in an image. A Composable AI solution could involve a Speech-to-Text model to transcribe the audio, a Text Extraction model (OCR) for the image, and a Similarity Measurement model to compare the two. Crucially, glue code would be used for cleaning the text and interpreting the similarity score into a final verdict.
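The claim-verification example above can be sketched as follows. Every function here is a hypothetical stand-in: the two "models" simply pass through strings that represent the audio and the image, `difflib.SequenceMatcher` substitutes for a real similarity model, and the `0.8` threshold is an arbitrary assumption. The point is only to show where glue code sits between the model calls.

```python
import re
from difflib import SequenceMatcher


def speech_to_text(audio: str) -> str:
    """Stub for a Speech-to-Text model; the 'audio' is already a transcript."""
    return audio


def extract_text(image: str) -> str:
    """Stub for an OCR model; the 'image' is already its embedded text."""
    return image


def similarity(a: str, b: str) -> float:
    """Stand-in for a similarity model: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def clean_text(text: str) -> str:
    """Glue code: lowercase, drop punctuation, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(no_punct.split())


def verify_claim(audio: str, image: str, threshold: float = 0.8) -> str:
    """Compose the three models, with glue code before and after."""
    claim = clean_text(speech_to_text(audio))
    evidence = clean_text(extract_text(image))
    score = similarity(claim, evidence)
    # Glue code: turn the raw similarity score into a final verdict.
    return "supported" if score >= threshold else "not supported"


print(verify_claim("The sky is BLUE.", "the sky is blue"))
print(verify_claim("The sky is blue.", "Stock prices fell sharply"))
```

In a real solution the stubs would be replaced by models selected from the pool, while `clean_text` and the thresholding step would remain as glue code.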
Human vs. Machine: Initial Baselines
To establish initial performance baselines, the researchers compared human-designed solutions against two Large Language Model (LLM)-based approaches: Prompt-to-Solve and Prompt-to-Pipeline. Prompt-to-Solve directly asks the LLM to solve the task, leveraging its vast pre-trained knowledge. Prompt-to-Pipeline, on the other hand, instructs the LLM to act as a Composable AI system, decomposing the task, selecting models, and generating an executable pipeline.
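The difference between the two baselines is mainly in what the LLM is asked to produce. The sketch below illustrates that difference with two prompt builders; the function names and the prompt wording are assumptions made for illustration and are not the prompts used by the researchers.

```python
def build_prompt_to_solve(task_description: str, task_input: str) -> str:
    """Prompt-to-Solve: ask the LLM for the answer itself (hypothetical wording)."""
    return (
        "Solve the following task directly.\n"
        f"Task: {task_description}\n"
        f"Input: {task_input}\n"
        "Answer:"
    )


def build_prompt_to_pipeline(task_description: str, model_pool: dict[str, str]) -> str:
    """Prompt-to-Pipeline: ask the LLM for a pipeline, not an answer (hypothetical wording)."""
    pool_lines = "\n".join(f"- {name}: {desc}" for name, desc in model_pool.items())
    return (
        "Act as a Composable AI system.\n"
        "1. Decompose the task into sub-tasks.\n"
        "2. Select a model from the pool for each sub-task.\n"
        "3. Write executable pipeline code that connects the selected models.\n"
        f"Task: {task_description}\n"
        f"Available models:\n{pool_lines}"
    )


# Hypothetical model names and descriptions, for illustration only.
pool = {"model_a": "speech-to-text", "model_b": "text extraction (OCR)"}
solve_prompt = build_prompt_to_solve("Summarize the passage.", "Some passage text")
pipeline_prompt = build_prompt_to_pipeline("Check an audio claim against an image.", pool)
print(pipeline_prompt)
```

Note that only the second prompt exposes the model pool: in Prompt-to-Solve the LLM relies on its own pre-trained knowledge, whereas in Prompt-to-Pipeline its output is code that must then be executed against the selected models.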
The results showed a significant performance gap between human-designed and LLM-generated solutions. Prompt-to-Solve generally outperformed Prompt-to-Pipeline, especially on typical natural language processing tasks such as summarization and translation, but it struggled with specialized tasks such as speech recognition and sentence similarity. The human-designed reference solutions consistently outperformed both LLM-based strategies across all task types and complexities. On average, human solutions were 90% better than Prompt-to-Solve and 6.7 times better than Prompt-to-Pipeline.
These results indicate that LLMs, while powerful, are currently limited in their ability to orchestrate multi-model pipelines effectively, especially for tasks requiring precise compositional reasoning and structured execution. The findings underscore the potential of Composable AI for tackling complex real-world problems and the need for more capable and robust automated methods for generating effective AI pipelines.
CABench serves as a foundational step, providing a rigorous and reproducible environment for future research in building scalable and efficient AI systems through the principled reuse of existing resources.


