
ParaCook: A New Benchmark for Time-Efficient Multi-Agent Planning with LLMs

TLDR: ParaCook is a new benchmark for evaluating how well Large Language Models (LLMs) can plan tasks for multiple agents to complete efficiently, focusing on minimizing time rather than just correctness. Inspired by the ‘Overcooked’ game, it uses cooking scenarios to test LLMs’ ability to handle parallel and asynchronous operations. Experiments show that even the best-performing models, such as GPT-5, still lag significantly behind humans in success rate, time efficiency, and coordination, especially on complex tasks. The research also reveals that LLMs have strong high-level planning potential in abstract settings but struggle to translate those plans into efficient actions in an embodied environment, highlighting the need for better structured planning approaches.

In the rapidly evolving world of artificial intelligence, Large Language Models (LLMs) have demonstrated impressive capabilities in planning complex, real-world tasks. However, a significant challenge remains: ensuring these plans are not just correct, but also time-efficient, especially in scenarios involving multiple agents working in parallel. This is the core problem that a new research paper, ParaCook: On Time-Efficient Planning for Multi-Agent Systems, aims to address.

Authored by a collaborative team including Shiqi Zhang, Xinbei Ma, Yunqing Xu, Zouying Cao, Pengrui Lu, and Zhuosheng Zhang from Shanghai Jiao Tong University, and Haobo Yuan, Tiancheng Shen, and Ming-Hsuan Yang from the University of California, Merced, the paper introduces ParaCook, a novel benchmark designed to evaluate time-efficient collaborative planning in multi-agent systems.

The Need for Time-Efficient Planning

Current benchmarks for LLM agents primarily focus on task completion and adherence to rules. While important, this overlooks a crucial aspect of real-world applicability: time efficiency. Imagine a team of robots in a factory or a group of virtual assistants; their success isn’t just about finishing tasks, but doing so quickly and efficiently. Complex tasks often involve both parallel and sequential steps. A single agent might multitask (e.g., chopping vegetables while water boils), and multiple agents can distribute workloads (e.g., cooking different dishes simultaneously). These forms of parallelism – within an agent and between agents – are vital for overall time efficiency.

Existing benchmarks like AsyncHow, Robotouille, and CookBench have limitations. Some oversimplify tasks, assume unlimited resources, or focus only on single-agent asynchronous planning. Others are too broad to isolate and evaluate time efficiency effectively. This gap motivated the creation of ParaCook.

Introducing ParaCook: A Kitchen for LLM Agents

ParaCook is inspired by the popular ‘Overcooked’ game, setting up a 2D grid-world kitchen environment where multiple agents (cooks) must collaborate to prepare dishes. The environment provides a simplified action space (MoveTo, Interact, Process, Wait, Finish) to ensure the primary challenge is strategic parallel planning, rather than complex low-level actions.
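
To make that action space concrete, here is a minimal sketch of how a plan over these actions might be represented in Python. The class names, fields, and targets are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ActionType(Enum):
    MOVE_TO = auto()   # walk to a target workstation or tile
    INTERACT = auto()  # pick up / put down an item
    PROCESS = auto()   # chop, cook, plate, etc.
    WAIT = auto()      # stay idle, e.g. while something cooks
    FINISH = auto()    # declare the order complete

@dataclass
class Action:
    agent_id: int
    kind: ActionType
    target: str | None = None  # e.g. "cutting_board" or "stove"

# A multi-agent plan can then be a list of per-timestep action sets, one entry per agent.
plan = [
    [Action(0, ActionType.MOVE_TO, "cutting_board"), Action(1, ActionType.MOVE_TO, "stove")],
    [Action(0, ActionType.PROCESS, "lettuce"),       Action(1, ActionType.WAIT)],
]
```

Keeping the action vocabulary this small shifts the difficulty to deciding who does what, and when, rather than how to perform each low-level step.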

The benchmark instantiates planning concepts through:

  • Task Decomposition: Recipes are broken down into subtasks with dependencies, forming a Directed Acyclic Graph (DAG).
  • Time Delay: Accounts for inherent execution delays, like cooking times, allowing for parallel execution.
  • Parallel Execution: Supports both intra-agent (e.g., one agent managing multiple tasks) and inter-agent (multiple agents working together) parallelism.
  • Actual Execution Time: Incorporates real-world factors like travel time between workstations.

Tasks are structured around various recipes (simple, intermediate, complex) and orders (combinations of dishes), allowing for controllable concurrency challenges. The kitchen map itself is configurable, enabling dynamic control over size, workstation arrangement, and agent count, which helps in fine-tuning the difficulty and evaluating different levels of parallelism.
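
As a rough illustration of task decomposition and time delay, the sketch below models a hypothetical soup recipe as a small dependency DAG and computes the earliest possible finish time when independent branches run in parallel. The subtasks, durations, and helper function are invented for illustration and do not come from the benchmark.

```python
# Hypothetical recipe decomposition: each subtask has a duration (its time delay),
# and dependencies say which subtasks must finish before another can start (a DAG).
durations = {"chop_vegetables": 3, "boil_water": 5, "cook_soup": 4, "plate": 1}
depends_on = {
    "chop_vegetables": [],
    "boil_water": [],
    "cook_soup": ["chop_vegetables", "boil_water"],
    "plate": ["cook_soup"],
}

def earliest_finish(task: str) -> int:
    """Earliest finish time assuming enough agents that independent branches overlap."""
    start = max((earliest_finish(dep) for dep in depends_on[task]), default=0)
    return start + durations[task]

# Chopping and boiling can overlap, so the lower bound is 5 + 4 + 1 = 10,
# not the sequential 3 + 5 + 4 + 1 = 13.
print(earliest_finish("plate"))  # 10
```

A planner that ignores this structure and executes the subtasks strictly in sequence would need 13 time units instead of 10 in this toy example, which is exactly the kind of inefficiency ParaCook is designed to surface.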

Measuring Success: Metrics in ParaCook

ParaCook evaluates agent performance using several key metrics; a rough sketch of how they might be computed follows the list:

  • Success Rate (SR): Measures the percentage of tasks where all dishes are completed correctly.
  • Order Completion Time (OCT): The total time elapsed until all orders are finished. This includes penalized (pOCT) and normalized (nOCT) variants to account for failures and compare efficiency among successful runs.
  • Movement Distance (MD): The average travel distance of all agents, indicating spatial efficiency.
  • Agent Utilization (AU): The proportion of time agents spend actively working, reflecting coordination quality.
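
The following sketch shows one plausible way per-episode logs could be aggregated into these metrics. The field names and the exact penalization and normalization schemes are assumptions made for illustration; the paper's formal definitions may differ.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool
    completion_time: float  # time until all orders are finished
    travel_distance: float  # tiles moved, summed over all agents
    busy_time: float        # time spent actively working, summed over all agents
    num_agents: int

def evaluate(episodes: list[Episode], time_limit: float) -> dict[str, float]:
    n = len(episodes)
    sr = sum(e.success for e in episodes) / n
    # Penalized OCT: failed episodes are charged the full time limit (assumed convention).
    poct = sum(e.completion_time if e.success else time_limit for e in episodes) / n
    md = sum(e.travel_distance / e.num_agents for e in episodes) / n
    au = sum(e.busy_time / (e.completion_time * e.num_agents) for e in episodes) / n
    return {"SR": sr, "pOCT": poct, "MD": md, "AU": au}
```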

Key Findings: LLMs vs. Humans

The research conducted comprehensive experiments with state-of-the-art LLMs, including GPT-5, Gemini-2.5-Pro, DeepSeek-V3.2-Exp, Claude-Opus-4.1, and Qwen3-Max-Preview. The findings reveal that while LLMs show promise, there’s a significant gap compared to human performance:

  • LLMs Struggle with Complexity: GPT-5 achieved the highest average success rate (65.0%) but saw significant drops on complex tasks. Other models performed even worse, with some failing almost completely on medium and hard tasks.
  • Time Efficiency is a Challenge: GPT-5 demonstrated the best time efficiency among LLMs, but all models were substantially slower than human baselines. Inefficient scheduling and failure to parallelize actions directly led to longer completion times and higher movement costs.
  • Humans Excel: Human participants achieved a perfect 100% success rate across all difficulty levels, demonstrating superior robustness, time efficiency, and spatial optimization. Humans completed tasks faster and with far less movement than even the best LLMs.
  • Chain-of-Thought (CoT) Prompting: The effectiveness of CoT prompting varied. It amplified performance for strong models like GPT-5 but could destabilize moderately capable ones and offered only limited help to weaker models.
  • Abstract Potential: Interestingly, when given abstract planning tasks (without environmental interaction), top LLMs achieved near-optimal performance (within 1-7% of the theoretical optimum). This suggests LLMs have strong inherent reasoning capabilities for high-level scheduling, but struggle when translating these plans into concrete actions within an embodied environment.


The Path Forward

ParaCook highlights that while LLMs can generate correct plans for less complex tasks, their execution strategies are not yet as temporally and spatially optimized as humans', especially under high coordination demands. The contrast between strong performance on abstract tasks and inefficiency in the embodied ParaCook environment underscores the need for structured approaches, such as hierarchical planning frameworks that separate high-level scheduling from detailed action execution.
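
One way to picture such a separation is sketched below: a high-level scheduler assigns abstract subtasks to agents, and a separate low-level step expands each subtask into concrete ParaCook-style actions. The scheduling rule, subtask names, and expansion table are toy assumptions, not a method proposed in the paper.

```python
# Toy high-level scheduler: round-robin assignment of abstract subtasks to agents.
def schedule(subtasks: list[str], num_agents: int) -> dict[int, list[str]]:
    assignment: dict[int, list[str]] = {a: [] for a in range(num_agents)}
    for i, task in enumerate(subtasks):
        assignment[i % num_agents].append(task)
    return assignment

# Toy low-level expansion: turn one abstract subtask into concrete actions.
def expand(subtask: str) -> list[str]:
    station = {"chop": "cutting_board", "cook": "stove", "plate": "counter"}[subtask]
    return [f"MoveTo({station})", "Interact", f"Process({subtask})"]

assignment = schedule(["chop", "cook", "plate"], num_agents=2)
plans = {agent: [act for task in tasks for act in expand(task)]
         for agent, tasks in assignment.items()}
print(plans)  # agent 0 handles chop and plate; agent 1 handles cook
```

Keeping scheduling abstract lets the model exploit the reasoning strength it already shows on abstract tasks, while the expansion step handles travel and interaction details.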

ParaCook provides a scalable and adjustable framework for developing and assessing time efficiency-aware multi-agent planning, laying a crucial foundation for advancing LLM agents that can truly exploit concurrency and collaboration in complex, dynamic environments.

