
L-MARS: A Multi-Agent System for Precise Legal Question Answering

TLDR: L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search) is a novel AI system designed to significantly reduce hallucinations and uncertainty in legal question answering. It employs a coordinated multi-agent workflow that decomposes queries, conducts targeted searches across diverse legal sources (web, local RAG, case law), and uses a Judge Agent to verify evidence for sufficiency, jurisdiction, and temporal validity. Evaluated on a new 2025 legal benchmark, L-MARS substantially improves factual accuracy and reduces uncertainty compared to traditional large language models, demonstrating a scalable blueprint for deploying LLMs in high-stakes legal domains.

Large language models (LLMs) have shown great promise in legal tasks, from interpreting statutes to assisting with case law retrieval. However, their direct application often leads to significant challenges like hallucinations—confidently stated but factually incorrect answers—and uncertainty, which can carry substantial real-world risks in the legal domain. Traditional methods like fine-tuning are costly and struggle to keep up with rapidly changing laws, while standard Retrieval-Augmented Generation (RAG) can miss crucial legal evidence, leading to incomplete or inaccurate reasoning.

To address these critical issues, researchers have introduced L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a sophisticated system designed to enhance the accuracy and reliability of legal question answering. L-MARS stands out by employing a coordinated multi-agent approach that combines iterative reasoning with intelligent search and rigorous verification.

How L-MARS Works

Unlike single-pass RAG systems, L-MARS breaks down complex legal queries into smaller, manageable subproblems. It then conducts targeted searches across a variety of sources, including up-to-date web information via the Serper API, a curated local legal database, and authoritative case law through the CourtListener API. A crucial component of L-MARS is its Judge Agent, which meticulously verifies the sufficiency, jurisdiction, and temporal validity of the retrieved evidence before any answer is synthesized. This iterative loop of reasoning, searching, and verification helps maintain coherence, filter out noisy information, and ground answers in credible legal authority.
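To make the Judge Agent's role concrete, here is a minimal sketch of such a verification pass. The `Evidence` fields, the `min_sources` threshold, and the check logic are illustrative assumptions, not the paper's actual implementation:

```python
# Hedged sketch of a Judge-Agent-style verification pass: checks the three
# properties named above (sufficiency, jurisdiction, temporal validity).
from dataclasses import dataclass
from datetime import date

@dataclass
class Evidence:
    text: str
    jurisdiction: str      # e.g. "US-federal" or a state code (assumed schema)
    effective_date: date   # when the cited authority took effect
    source: str

def judge(evidence: list[Evidence], target_jurisdiction: str,
          as_of: date, min_sources: int = 2) -> dict:
    """Return a verdict on the evidence set before any answer is synthesized."""
    in_scope = [e for e in evidence if e.jurisdiction == target_jurisdiction]
    current = [e for e in in_scope if e.effective_date <= as_of]
    return {
        "jurisdiction_ok": len(in_scope) == len(evidence),
        "temporally_valid": len(current) == len(in_scope),
        "sufficient": len(current) >= min_sources,
    }
```

A real Judge Agent would reason over the evidence text with an LLM rather than apply hard-coded rules, but the gatekeeping pattern is the same: no answer is composed until the verdict passes.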

L-MARS operates in two distinct modes:

  • Simple Mode: This mode offers a faster, single-pass pipeline. A Query Agent generates structured intents, which the Search Agent uses to retrieve evidence. The Summary Agent then composes an answer with citations.
  • Multi-Turn Mode: This is the more robust, iterative mode. A Query Agent refines the user’s question, potentially asking clarifying follow-ups. The Search Agent performs deep searches, and a Judge Agent evaluates the sufficiency and relevance of the evidence. If the evidence is insufficient, the system refines the query and repeats the search-judge-refine loop until an evidence sufficiency threshold is met or a maximum number of iterations is reached. Finally, a Summary Agent synthesizes the answer.
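The Multi-Turn Mode loop can be sketched in a few lines. This is a schematic of the control flow only; the agent callables (`refine`, `search`, `judge`, `summarize`) and the iteration cap are assumptions standing in for the paper's Query, Search, Judge, and Summary Agents:

```python
# Hedged sketch of the search-judge-refine loop described above.
def multi_turn_answer(question, refine, search, judge, summarize, max_iters=3):
    query = refine(question, None)      # Query Agent: initial refinement
    evidence = []
    for _ in range(max_iters):
        evidence += search(query)       # Search Agent: retrieve more evidence
        verdict = judge(question, evidence)  # Judge Agent: sufficiency check
        if verdict["sufficient"]:
            break
        # Insufficient: refine the query using the judge's feedback and retry.
        query = refine(question, verdict)
    return summarize(question, evidence)    # Summary Agent: final answer
```

The loop terminates either when the Judge Agent deems the evidence sufficient or when `max_iters` is exhausted, matching the stopping conditions described above.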

The system’s agentic search capabilities are particularly noteworthy. It can trigger retrieval when a knowledge gap is identified or when external evidence is predicted to reduce uncertainty. It uses both a ‘Basic Search’ for quick exploration and an ‘Enhanced Search’ for detailed content extraction from top URLs, cleaning HTML and parsing PDFs to get precise information. For local documents, it maintains a RAG index, and for case law, it integrates directly with CourtListener, allowing it to query by party name, citation, or keyword.
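The trigger-and-escalate policy might look like the following sketch. The uncertainty threshold and the snippet-length heuristic are invented for illustration; `basic` and `enhanced` stand in for the Basic Search and Enhanced Search tools described above:

```python
# Hedged sketch of an agentic-search policy: retrieve only when a knowledge
# gap is predicted, and escalate from snippet search to full-page extraction
# when the quick results look too thin. Thresholds are assumptions.
def agentic_search(query, predicted_uncertainty, basic, enhanced,
                   threshold=0.5, min_snippet_chars=200):
    if predicted_uncertainty < threshold:
        return []  # no knowledge gap predicted: skip retrieval entirely
    snippets = basic(query)  # Basic Search: quick exploration via snippets
    # Escalate to Enhanced Search (full content extraction) if results are thin.
    if sum(len(s) for s in snippets) < min_snippet_chars:
        return enhanced(query)
    return snippets
```

The key design point is that retrieval is conditional, not reflexive: the system spends the latency of enhanced extraction only when cheap snippets fail to cover the query.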

Evaluating L-MARS

To assess its effectiveness, L-MARS was evaluated on LegalSearchQA, a new benchmark of 200 up-to-date multiple-choice legal questions from 2025. This benchmark specifically tests a system’s end-to-end ability to retrieve and reason over external legal sources, focusing on recent federal regulatory actions, tax provisions, immigration, and technology law.

The evaluation used several metrics:

  • Accuracy: Measures the fraction of correctly answered multiple-choice questions.
  • U-Score: A rule-based metric (ranging from 0 to 1, lower is better) that quantifies uncertainty by assessing hedging cues, temporal vagueness, citation sufficiency, jurisdictional specificity, and decisiveness.
  • LLM-as-Judge: OpenAI’s o3 model was used to provide qualitative ratings on factual accuracy, evidence grounding, clarity of reasoning, and uncertainty calibration.
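Since the U-Score is rule-based, its shape is easy to illustrate. The paper's exact rules and weights are not reproduced here; this toy version simply averages one 0/1 penalty per listed component, so lower is better:

```python
# Hedged, toy approximation of a U-Score-style uncertainty metric.
# The cue lists, regexes, and equal weighting are illustrative assumptions.
import re

HEDGES = ("might", "may", "possibly", "it depends", "generally")
VAGUE_TIME = ("recently", "currently", "as of now")

def u_score(answer: str) -> float:
    """Average of five 0/1 penalties in [0, 1]; lower means less uncertainty."""
    text = answer.lower()
    penalties = [
        any(h in text for h in HEDGES),                     # hedging cues
        any(v in text for v in VAGUE_TIME),                 # temporal vagueness
        not re.search(r"\b\d+\s+u\.s\.c\.|\bv\.\s", text),  # citation missing
        not any(j in text for j in ("jurisdiction", "federal", "state")),
        text.rstrip().endswith("?"),                        # indecisive ending
    ]
    return sum(penalties) / len(penalties)
```

A decisive, cited, jurisdiction-specific answer scores near 0, while a hedged, citation-free one scores near 1, which is the gradient the real metric is designed to capture.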

Key Findings

The results demonstrated that both L-MARS variants significantly outperformed pure LLM inference. L-MARS achieved up to 98% accuracy compared to 86–89% for baseline LLMs. The U-Score sharply decreased, indicating a marked reduction in hedging, vagueness, and unsupported conclusions. LLM-as-Judge ratings consistently ranked L-MARS outputs higher, especially the multi-turn variant, which produced more thorough and contextually grounded answers.

While L-MARS offers substantial improvements in accuracy and reliability, it does incur higher latency. Baseline LLMs answered within 1–4 seconds, whereas L-MARS took 13.6 seconds for Simple Mode and 55.7 seconds for Multi-Turn Mode, reflecting the added retrieval and reasoning steps. A case study highlighted how L-MARS successfully identified a specific 30-day timeline from a 2025 Executive Order, a detail that a pure LLM incorrectly approximated based on general patterns.

The L-MARS framework provides a flexible architecture that can switch between low-latency simple mode and high-accuracy multi-turn mode, offering a reproducible blueprint for high-stakes domains such as law. The code for L-MARS is available for further exploration. You can read the full research paper here.

Limitations and Future Directions

Despite its advancements, L-MARS’s performance is still dependent on the quality of retrieval from search engines and legal databases. Its higher latency might also be a barrier for real-time applications. Future work aims to improve the Judge Agent’s ability to avoid over-rejecting partially relevant sources and to expand evaluations to broader cross-jurisdictional and multilingual contexts.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
