
L-MARS: A Multi-Agent System for Precise Legal Question Answering

TLDR: L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search) is a novel AI system designed to significantly reduce hallucinations and uncertainty in legal question answering. It employs a coordinated multi-agent workflow that decomposes queries, conducts targeted searches across diverse legal sources (web, local RAG, case law), and uses a Judge Agent to verify evidence for sufficiency, jurisdiction, and temporal validity. Evaluated on a new 2025 legal benchmark, L-MARS substantially improves factual accuracy and reduces uncertainty compared to traditional large language models, demonstrating a scalable blueprint for deploying LLMs in high-stakes legal domains.

Large language models (LLMs) have shown great promise in legal tasks, from interpreting statutes to assisting with case law retrieval. However, their direct application often leads to significant challenges like hallucinations—confidently stated but factually incorrect answers—and uncertainty, which can carry substantial real-world risks in the legal domain. Traditional methods like fine-tuning are costly and struggle to keep up with rapidly changing laws, while standard Retrieval-Augmented Generation (RAG) can miss crucial legal evidence, leading to incomplete or inaccurate reasoning.

To address these critical issues, researchers have introduced L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a sophisticated system designed to enhance the accuracy and reliability of legal question answering. L-MARS stands out by employing a coordinated multi-agent approach that combines iterative reasoning with intelligent search and rigorous verification.

How L-MARS Works

Unlike single-pass RAG systems, L-MARS breaks down complex legal queries into smaller, manageable subproblems. It then conducts targeted searches across a variety of sources, including up-to-date web information via the Serper API, a curated local legal database, and authoritative case law through the CourtListener API. A crucial component of L-MARS is its Judge Agent, which meticulously verifies the sufficiency, jurisdiction, and temporal validity of the retrieved evidence before any answer is synthesized. This iterative loop of reasoning, searching, and verification helps maintain coherence, filter out noisy information, and ground answers in credible legal authority.
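To make the Judge Agent's role concrete, here is a minimal sketch of such a verification pass. The `Evidence` fields, the `min_sources` threshold, and the check logic are illustrative assumptions, not the paper's actual implementation:

```python
# Hedged sketch of a Judge-Agent-style verification pass: checks the three
# properties named above (sufficiency, jurisdiction, temporal validity).
from dataclasses import dataclass
from datetime import date

@dataclass
class Evidence:
    text: str
    jurisdiction: str      # e.g. "US-federal" or a state code (assumed schema)
    effective_date: date   # when the cited authority took effect
    source: str

def judge(evidence: list[Evidence], target_jurisdiction: str,
          as_of: date, min_sources: int = 2) -> dict:
    """Return a verdict on the evidence set before any answer is synthesized."""
    in_scope = [e for e in evidence if e.jurisdiction == target_jurisdiction]
    current = [e for e in in_scope if e.effective_date <= as_of]
    return {
        "jurisdiction_ok": len(in_scope) == len(evidence),
        "temporally_valid": len(current) == len(in_scope),
        "sufficient": len(current) >= min_sources,
    }
```

A real Judge Agent would reason over the evidence text with an LLM rather than apply hard-coded rules, but the gatekeeping pattern is the same: no answer is composed until the verdict passes.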

L-MARS operates in two distinct modes:

  • Simple Mode: This mode offers a faster, single-pass pipeline. A Query Agent generates structured intents, which the Search Agent uses to retrieve evidence. The Summary Agent then composes an answer with citations.
  • Multi-Turn Mode: This is the more robust, iterative mode. A Query Agent refines the user’s question, potentially asking clarifying follow-ups. The Search Agent performs deep searches, and a Judge Agent evaluates the sufficiency and relevance of the evidence. If the evidence is insufficient, the system refines the query and repeats the search-judge-refine loop until an evidence sufficiency threshold is met or a maximum number of iterations is reached. Finally, a Summary Agent synthesizes the answer.
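The Multi-Turn Mode loop can be sketched in a few lines. This is a schematic of the control flow only; the agent callables (`refine`, `search`, `judge`, `summarize`) and the iteration cap are assumptions standing in for the paper's Query, Search, Judge, and Summary Agents:

```python
# Hedged sketch of the search-judge-refine loop described above.
def multi_turn_answer(question, refine, search, judge, summarize, max_iters=3):
    query = refine(question, None)      # Query Agent: initial refinement
    evidence = []
    for _ in range(max_iters):
        evidence += search(query)       # Search Agent: retrieve more evidence
        verdict = judge(question, evidence)  # Judge Agent: sufficiency check
        if verdict["sufficient"]:
            break
        # Insufficient: refine the query using the judge's feedback and retry.
        query = refine(question, verdict)
    return summarize(question, evidence)    # Summary Agent: final answer
```

The loop terminates either when the Judge Agent deems the evidence sufficient or when `max_iters` is exhausted, matching the stopping conditions described above.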

The system’s agentic search capabilities are particularly noteworthy. It can trigger retrieval when a knowledge gap is identified or when external evidence is predicted to reduce uncertainty. It uses both a ‘Basic Search’ for quick exploration and an ‘Enhanced Search’ for detailed content extraction from top URLs, cleaning HTML and parsing PDFs to get precise information. For local documents, it maintains a RAG index, and for case law, it integrates directly with CourtListener, allowing it to query by party name, citation, or keyword.
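The trigger-and-escalate policy might look like the following sketch. The uncertainty threshold and the snippet-length heuristic are invented for illustration; `basic` and `enhanced` stand in for the Basic Search and Enhanced Search tools described above:

```python
# Hedged sketch of an agentic-search policy: retrieve only when a knowledge
# gap is predicted, and escalate from snippet search to full-page extraction
# when the quick results look too thin. Thresholds are assumptions.
def agentic_search(query, predicted_uncertainty, basic, enhanced,
                   threshold=0.5, min_snippet_chars=200):
    if predicted_uncertainty < threshold:
        return []  # no knowledge gap predicted: skip retrieval entirely
    snippets = basic(query)  # Basic Search: quick exploration via snippets
    # Escalate to Enhanced Search (full content extraction) if results are thin.
    if sum(len(s) for s in snippets) < min_snippet_chars:
        return enhanced(query)
    return snippets
```

The key design point is that retrieval is conditional, not reflexive: the system spends the latency of enhanced extraction only when cheap snippets fail to cover the query.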

Evaluating L-MARS

To assess its effectiveness, L-MARS was evaluated on LegalSearchQA, a new benchmark of 200 up-to-date multiple-choice legal questions from 2025. This benchmark specifically tests a system’s end-to-end ability to retrieve and reason over external legal sources, focusing on recent federal regulatory actions, tax provisions, immigration, and technology law.

The evaluation used several metrics:

  • Accuracy: Measures the fraction of correctly answered multiple-choice questions.
  • U-Score: A rule-based metric (ranging from 0 to 1, lower is better) that quantifies uncertainty by assessing hedging cues, temporal vagueness, citation sufficiency, jurisdictional specificity, and decisiveness.
  • LLM-as-Judge: OpenAI’s o3 model was used to provide qualitative ratings on factual accuracy, evidence grounding, clarity of reasoning, and uncertainty calibration.
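Since the U-Score is rule-based, its shape is easy to illustrate. The paper's exact rules and weights are not reproduced here; this toy version simply averages one 0/1 penalty per listed component, so lower is better:

```python
# Hedged, toy approximation of a U-Score-style uncertainty metric.
# The cue lists, regexes, and equal weighting are illustrative assumptions.
import re

HEDGES = ("might", "may", "possibly", "it depends", "generally")
VAGUE_TIME = ("recently", "currently", "as of now")

def u_score(answer: str) -> float:
    """Average of five 0/1 penalties in [0, 1]; lower means less uncertainty."""
    text = answer.lower()
    penalties = [
        any(h in text for h in HEDGES),                     # hedging cues
        any(v in text for v in VAGUE_TIME),                 # temporal vagueness
        not re.search(r"\b\d+\s+u\.s\.c\.|\bv\.\s", text),  # citation missing
        not any(j in text for j in ("jurisdiction", "federal", "state")),
        text.rstrip().endswith("?"),                        # indecisive ending
    ]
    return sum(penalties) / len(penalties)
```

A decisive, cited, jurisdiction-specific answer scores near 0, while a hedged, citation-free one scores near 1, which is the gradient the real metric is designed to capture.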

Key Findings

The results demonstrated that both L-MARS variants significantly outperformed pure LLM inference. L-MARS achieved up to 98% accuracy compared to 86–89% for baseline LLMs. The U-Score sharply decreased, indicating a marked reduction in hedging, vagueness, and unsupported conclusions. LLM-as-Judge ratings consistently ranked L-MARS outputs higher, especially the multi-turn variant, which produced more thorough and contextually grounded answers.

While L-MARS offers substantial improvements in accuracy and reliability, it does incur higher latency. Baseline LLMs answered within 1–4 seconds, whereas L-MARS took 13.6 seconds for Simple Mode and 55.7 seconds for Multi-Turn Mode, reflecting the added retrieval and reasoning steps. A case study highlighted how L-MARS successfully identified a specific 30-day timeline from a 2025 Executive Order, a detail that a pure LLM incorrectly approximated based on general patterns.

The L-MARS framework provides a flexible architecture that can switch between low-latency simple mode and high-accuracy multi-turn mode, offering a reproducible blueprint for high-stakes domains such as law. The code for L-MARS is available for further exploration. You can read the full research paper here.

Limitations and Future Directions

Despite its advancements, L-MARS’s performance is still dependent on the quality of retrieval from search engines and legal databases. Its higher latency might also be a barrier for real-time applications. Future work aims to improve the Judge Agent’s ability to avoid over-rejecting partially relevant sources and to expand evaluations to broader cross-jurisdictional and multilingual contexts.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
