TLDR: COCORELI is a novel hybrid AI framework that uses medium-sized LLM agents, abstraction mechanisms, and a discourse module to accurately follow complex language instructions, minimize hallucinations, and perform spatial reasoning. It significantly outperforms larger LLM-based systems in collaborative construction tasks and demonstrates strong generalization abilities in API completion, effectively identifying missing information and learning abstract functions from context.
Large Language Models (LLMs) have made incredible strides, but they often hit roadblocks when faced with real-world tasks that demand precise instruction following, spatial reasoning, and freedom from fabricated information, a failure mode known as hallucination. These challenges are particularly evident in complex scenarios where LLMs need to plan, use multiple tools, or learn from limited examples.
Introducing COCORELI: A Smarter Approach to Language Instructions
A new framework called COCORELI, which stands for Cooperative, Compositional Reconstitution & Execution of Language Instructions, offers a promising solution. Developed by researchers including Swarnadeep Bhar, Omar Naim, Eleni Metheniti, Bastien Navarri, Loïc Cabannes, Morteza Ezzabady, and Nicholas Asher, COCORELI is a hybrid agent system designed to overcome these limitations. What’s particularly impressive is that it achieves this using medium-sized LLMs, outperforming systems that rely on much larger models.
How COCORELI Works
COCORELI’s strength lies in its modular design, integrating several specialized LLM agents with innovative abstraction mechanisms and a ‘discourse module’. This allows it to dynamically learn high-level representations of an environment directly from user instructions. The system’s architecture includes:
- Discourse Module: This is a crucial component that generates clarification questions when an agent needs more information to execute a task. This proactive questioning significantly reduces the chance of the system hallucinating missing details.
- Instruction Parser: It interacts with the user to extract key information about objects and their desired locations from natural language instructions.
- Locator: This agent takes information from the parser to determine precise coordinates in the 3D environment. If details are incomplete, it triggers the discourse module for clarification.
- Builder: Checks an external memory for known structures or uses instructions to construct new ones. It also uses the discourse module if instructions are unclear.
- External Memory: Stores predefined functions and previously created shapes as relational graphs, enabling COCORELI to recall and adapt complex structures.
- Executor: Combines information from the Builder and Locator to produce a JSON object, which can then be run as a deterministic program to build or modify structures.
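The flow through these modules can be pictured with a minimal sketch. Everything below is illustrative: the class and function names, the stubbed parser output, and the placeholder coordinates are my own assumptions, not the authors' actual code, and the real system would call LLM agents where the stubs sit.

```python
import json

class NeedClarification(Exception):
    """Raised so the discourse module can query the user instead of guessing."""

def parse_instruction(instruction: str) -> dict:
    """Instruction Parser (stub): extract the object and its desired location."""
    # A real system would call an LLM agent here.
    return {"object": "red nut", "location": "on top of the washer"}

def locate(parsed: dict) -> dict:
    """Locator (stub): resolve a description to 3D coordinates, or ask for more."""
    if parsed.get("location") is None:
        # Missing detail: trigger the discourse module rather than hallucinate.
        raise NeedClarification("Where should the part go?")
    parsed["coords"] = (3, 0, 2)  # placeholder coordinates
    return parsed

def execute(parsed: dict) -> str:
    """Executor: emit a JSON action that a deterministic program can run."""
    return json.dumps({"action": "place",
                       "part": parsed["object"],
                       "at": parsed["coords"]})

plan = execute(locate(parse_instruction("Put a red nut on the washer")))
print(plan)
```

The key design point the sketch tries to capture is the last step: because the Executor's output is plain JSON run by a deterministic program, the actual building step cannot hallucinate.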
One of COCORELI’s standout features is its ability to learn new complex object functions through abstraction. This means it can take a specific instruction, like building a ‘tower made of three red nuts’, abstract its parameters (color, parts, location), and then recreate a similar structure with different specifications later on. This function-based approach is highly efficient, using one function for a complex structure rather than many individual placement instructions.
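One way to picture this abstraction step is as a parameterized function learned from the concrete instruction. This is a minimal sketch under my own assumptions: the `make_tower` name, its parameters, and the action format are illustrative, not the paper's implementation.

```python
def make_tower(color: str, part: str, count: int, base: tuple) -> list:
    """Abstracted 'tower' function: stack `count` parts of `color` at `base`."""
    x, y, z = base
    # One function call replaces `count` individual placement instructions.
    return [{"action": "place", "part": f"{color} {part}", "at": [x, y + i, z]}
            for i in range(count)]

# Learned once from "a tower made of three red nuts"...
original = make_tower("red", "nut", 3, (0, 0, 0))
# ...then reused later with different specifications.
variant = make_tower("blue", "washer", 5, (4, 0, 4))
```

Storing the structure as one function rather than a list of placements is what makes the external memory compact and the recreation step trivial.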
Testing COCORELI in a Challenging 3D World
To evaluate its capabilities, COCORELI was tested on an ‘ENVIRONMENT’ task, a collaborative construction challenge in a 3D grid. This environment is more complex than typical benchmarks like Minecraft, featuring a larger grid, more diverse object types (like nuts, washers, bridges that occupy multiple spaces), and strict physics rules such as gravity. The tasks ranged from placing single parts and sequences of parts to constructing complex shapes, handling underspecified instructions, and learning abstract functions from context.
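To give a feel for the kind of physics constraint involved, here is a toy gravity check, my own simplification rather than the benchmark's actual rules: a part may only occupy a cell if it rests on the ground or directly on another part.

```python
def is_supported(occupied: set, cell: tuple) -> bool:
    """Toy gravity rule: a cell is placeable if it is on the ground (y == 0)
    or sits directly above an occupied cell."""
    x, y, z = cell
    return y == 0 or (x, y - 1, z) in occupied

occupied = {(0, 0, 0)}
print(is_supported(occupied, (0, 1, 0)))  # True: rests on the part below
print(is_supported(occupied, (2, 1, 2)))  # False: floating, violates gravity
```

A planner that ignores a rule like this emits physically impossible builds, which is one reason a deterministic execution layer matters in this environment.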
COCORELI was compared against two baseline systems: a single LLM using a Chain-of-Thought (CoT) approach and an agentic LLM system. Both baselines used much larger LLMs (Claude 3.5 Sonnet, GPT-4.1, and LLaMA 3-70b), whereas COCORELI ran on LLaMA-3.1 8B.
Impressive Results and Versatility
The results were compelling. COCORELI consistently outperformed the baselines across various tasks:
- It excelled at identifying part types, colors, and coordinates, especially in sequences of instructions where CoT LLMs struggled with the second object.
- For constructing complex shapes, COCORELI demonstrated a higher overall accuracy in following instructions and was the only system capable of partially parsing instructions for a very complex ‘Moroccan bridge’ structure that stumped other models.
- Its clarification loop proved highly effective in handling underspecified instructions, correctly detecting missing information, asking for it, and then accurately parsing the complete instruction without hallucinating. This was a significant weakness for the CoT and even the agentic LLM baselines in more complex underspecified scenarios.
- COCORELI was the only system capable of learning and reproducing all novel shapes from abstract instructions, showcasing its superior abstraction and generalization abilities.
Beyond the ENVIRONMENT tasks, COCORELI also demonstrated its versatility by successfully applying its in-context function learning to the ToolBench API completion task, where it achieved 100% precision and recall in function reuse, unlike the CoT baseline. This highlights its robustness and transferability to different domains.
In conclusion, COCORELI represents a significant step forward in developing more reliable and capable AI agents. By combining medium-sized LLMs with a sophisticated modular architecture, including a discourse module for clarifications and powerful abstraction capabilities, it effectively addresses key limitations of current LLMs in complex, real-world tasks. Full details are available in the research paper.