
Benchmarking AI for Legal Contract Review: Insights from ContractEval

TLDR: ContractEval is a new benchmark evaluating 19 large language models (LLMs), both proprietary and open-source, on their ability to identify legal risks in commercial contracts at a clause level. It uses the CUAD dataset and assesses correctness, output effectiveness, and “laziness.” Key findings show proprietary models generally outperform open-source ones, larger open-source models have diminishing returns, “thinking” modes can reduce correctness, and open-source models often miss relevant clauses. The study highlights the need for targeted fine-tuning of open-source LLMs for high-stakes legal applications.

In the complex world of commercial transactions, managing legal risk is paramount. Before a major deal, like a technology company acquiring a video game company, legal teams meticulously review contracts to uncover any existing or potential liabilities. This process, which involves identifying specific clauses related to intellectual property, shareholder agreements, or service contracts, is crucial for understanding legal obligations and potential issues that could impact the transaction. However, this essential contract review is notoriously time-consuming, expensive, and often involves junior legal assistants manually extracting relevant clauses, a task that can cost hundreds of thousands of dollars.

Despite the clear need for efficiency, the potential of large language models (LLMs) in specialized legal domains, particularly for contract review and legal risk assessment, has remained largely unexplored. Most prior research has focused on areas like legal case retrieval or judgment prediction, which differ significantly from the precise, span-level extraction required for contract review. Furthermore, law firms face strict obligations to protect client confidentiality, leading to a growing interest in deploying open-source LLMs locally to minimize data exposure and comply with privacy rules. This raises a critical question: Can open-source LLMs automate contract review while maintaining data privacy and matching the performance of proprietary models?

Introducing ContractEval: A New Benchmark for Legal LLMs

To address this gap, a new research paper introduces ContractEval, the first benchmark designed to systematically evaluate both open-source and proprietary LLMs on clause-level legal risk identification in commercial contracts. ContractEval assesses 19 leading models, 4 proprietary and 15 open-source, across 41 common legal risk categories. The benchmark is built upon the Contract Understanding Atticus Dataset (CUAD), an expert-annotated dataset featuring high-quality labels from real-world contracts collected from the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system.
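
To get a feel for what CUAD annotations look like, here is a minimal sketch that loads the dataset, assuming its public Hugging Face release (the "cuad" dataset id, SQuAD-style schema); the paper's exact preprocessing pipeline may differ.

```python
# Minimal sketch of inspecting CUAD, assuming its public Hugging Face
# release ("cuad", SQuAD-style schema); the paper's preprocessing may differ.
from datasets import load_dataset

cuad = load_dataset("cuad", split="test")
example = cuad[0]
print(example["question"])           # a clause-category question
print(example["context"][:300])      # the contract text
print(example["answers"]["text"])    # expert-annotated clause spans
```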

The models were selected based on their ability to handle large contexts (at least 128k tokens to cover long legal documents), their recency (released around mid-2025), and the practical benefits of open-source models, such as data confidentiality and cost efficiency. The primary task for the LLMs was to act as junior legal assistants, extracting exact sentences from a contract that directly address a given legal question. If no relevant clause was found, they were instructed to respond with “no related clause.”
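
This task framing maps naturally onto a prompt template. Below is a minimal sketch; the wording is hypothetical, since the article does not reproduce the benchmark's actual instructions.

```python
# Hypothetical prompt template reflecting the task setup described above;
# the benchmark's actual instructions may be worded differently.
def build_prompt(contract_text: str, question: str) -> str:
    return (
        "You are a junior legal assistant reviewing a commercial contract.\n"
        f"Question: {question}\n"
        "Extract the exact sentence(s) from the contract that directly "
        "address this question. Quote them verbatim, without paraphrasing.\n"
        "If no clause is relevant, respond exactly with: no related clause\n\n"
        f"Contract:\n{contract_text}"
    )
```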

How LLMs Were Evaluated

ContractEval evaluates model performance from three key perspectives (sketched in code after the list):

  • Correctness of Risk Identification: Measured using F1 and F2 scores, this assesses how accurately LLMs identify and extract relevant clauses, simulating how senior lawyers would evaluate a junior assistant’s work.
  • Output Effectiveness and Conciseness: Using the Jaccard similarity coefficient, this metric reflects how precisely the model’s output matches the ground truth without including unnecessary content, crucial for senior lawyers who value concise and accurate summaries.
  • Laziness Detection: This measures the false “no related clause” rate, indicating how often models incorrectly state that no relevant clause exists when one is present. A high rate suggests a model’s inability or low confidence in retrieving information, which can have serious consequences in legal services.
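
To make these metrics concrete, here is a minimal sketch of all three computed over word sets. This illustrates the metric definitions only, not the paper's exact tokenization or aggregation; note that F2 (beta = 2) weights recall more heavily than precision, matching the legal preference for not missing clauses.

```python
# Minimal sketch of ContractEval's three metric families on word sets;
# the paper's exact tokenization and aggregation may differ.
def f_beta(pred: str, gold: str, beta: float = 1.0) -> float:
    p, g = set(pred.lower().split()), set(gold.lower().split())
    if not p or not g:
        return 0.0
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def jaccard(pred: str, gold: str) -> float:
    p, g = set(pred.lower().split()), set(gold.lower().split())
    return len(p & g) / len(p | g) if p | g else 0.0

def false_no_clause_rate(preds: list[str], golds: list[str]) -> float:
    # Fraction of cases where a relevant clause exists but the model
    # answered "no related clause" -- the paper's "laziness" signal.
    misses = sum(
        1 for pr, gd in zip(preds, golds)
        if gd and pr.strip().lower() == "no related clause"
    )
    has_clause = sum(1 for gd in golds if gd)
    return misses / has_clause if has_clause else 0.0
```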

Key Findings from ContractEval

The benchmark yielded several important insights:

Proprietary Models Lead: Proprietary models like GPT-4.1 and GPT-4.1 mini consistently outperformed open-source models in both correctness (F1 and F2 scores) and output effectiveness (Jaccard similarity). They also demonstrated a more balanced performance across these metrics.

Open-Source Models Show Potential but Lag: While some open-source models, particularly certain Qwen3-8B variants, were competitive on correctness, they generally fell behind proprietary models. For instance, Qwen3-8B in “thinking” mode achieved an F1 score of 0.540, still about 16% lower than GPT-4.1. Many smaller or less-tuned open-source models performed poorly, indicating they are not yet suitable for accurate clause-level legal risk identification.

Impact of Model Size: For open-source models, scaling up generally improved correctness, but with diminishing returns. The Qwen3-8B model often achieved the highest F1 scores within its family, outperforming both smaller and larger variants, suggesting an optimal balance. However, larger models generally showed better output effectiveness (Jaccard similarity).

Reasoning Strategies: Some open-source models offer a “thinking” mode for step-by-step reasoning. While this can be beneficial for complex tasks, ContractEval found that thinking mode often improved output effectiveness but reduced correctness. This trade-off suggests that for precise span identification tasks like clause extraction, over-explaining or including irrelevant clauses can be a drawback.

Missing Relevant Risks: Open-source models exhibited a higher false “no related clause” rate, meaning they more frequently missed relevant clauses entirely. For example, Qwen3-8B-AWQ in non-thinking mode missed nearly one-third of relevant clauses. Minimizing these false negatives is critical in legal review to avoid overlooking important issues.

Effects of Quantization: Quantization, a technique to improve inference efficiency and reduce GPU costs, resulted in a slight performance drop, especially when combined with the “thinking” mode. This highlights a trade-off between efficiency and accuracy for legal tasks.
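
As an illustration of the knobs involved, here is a minimal sketch of loading the quantized checkpoint and switching thinking mode off, assuming the public Qwen/Qwen3-8B-AWQ release on the Hugging Face Hub and a recent transformers version; this is not the paper's evaluation harness.

```python
# Minimal sketch: load an AWQ-quantized Qwen3 checkpoint and disable its
# "thinking" mode. Assumes the public Qwen/Qwen3-8B-AWQ release and a
# recent transformers version; not the paper's actual harness.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "Quote the clause on uncapped liability, or reply 'no related clause'."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    enable_thinking=False,  # Qwen3's chat-template switch for reasoning traces
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```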

Performance Variation by Category: Both proprietary and open-source models struggled with less common or more nuanced legal categories, such as “Uncapped Liability” or “Joint IP Ownership,” often showing near-zero F1 scores. This indicates a need for domain-specific fine-tuning to address imbalances among categories.
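
A category-level breakdown like the one behind this finding can be sketched with a simple aggregation; the schema and numbers below are illustrative placeholders, not results from the paper.

```python
# Illustrative per-category aggregation; schema and values are placeholders,
# not numbers from the paper.
import pandas as pd

def per_category_f1(records: list[dict]) -> pd.Series:
    # records: [{"category": ..., "f1": ...}] for each (contract, question) pair
    return pd.DataFrame(records).groupby("category")["f1"].mean().sort_values()

print(per_category_f1([
    {"category": "Uncapped Liability", "f1": 0.0},  # placeholder value
    {"category": "Governing Law", "f1": 0.8},       # placeholder value
]))
```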

Takeaways for Legal Practice and Future Research

The findings suggest that while top proprietary LLMs are approaching the performance of junior legal assistants in identifying relevant clauses, they still require oversight from senior professionals. Open-source models, despite lagging, show strong potential, especially given their appeal for local deployment due to confidentiality and cost. The research points to three key areas for improvement in open-source models: fine-tuning for more accurate clause identification, addressing performance gaps in rare or complex clauses, and mitigating the tendency to overuse “no related clause” responses.

ContractEval provides a robust benchmark to guide the future development of LLMs for legal applications, bridging the gap between LLM research and the legal industry. For more details, you can read the full research paper here.

Karthik Mehta (https://blogs.edgentiq.com)

Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
