Advancing Legal AI: NOWJ Team's Hybrid Framework Excels at COLIEE 2025

TLDR: The NOWJ team participated in all five tasks at COLIEE 2025, presenting a multi-stage framework that integrates embedding models and large language models (LLMs) for legal retrieval and entailment. Their approach combined pre-ranking models, semantic representations, and advanced LLMs for summarization, relevance scoring, and contextual re-ranking. Notably, they achieved first place in the Legal Case Entailment task (Task 2) with an F1 score of 0.3195, using a two-stage retrieval system followed by contextualized LLM analysis. They also secured third place in Statute Law Retrieval (Task 3) and Legal Judgment Prediction (Task 5), demonstrating the effectiveness of hybrid models in complex legal information processing.

Legal text processing, a specialized area combining law and information science, is advancing rapidly thanks to artificial intelligence (AI) and large language models (LLMs). The Competition on Legal Information Extraction/Entailment (COLIEE) is an annual event that gives researchers a platform to explore and evaluate cutting-edge techniques on complex, real-world judicial problems.

At COLIEE 2025, the NOWJ team made a significant impact, participating across all five tasks and achieving remarkable results, including a first-place finish in the Legal Case Entailment task. Their success stems from a comprehensive multi-stage framework that systematically integrates various AI methodologies, from traditional information retrieval techniques to advanced generative models.

A Holistic Approach to Legal AI

The NOWJ team’s core strategy revolved around a hybrid model approach, combining pre-ranking models like BM25, BERT, and monoT5 with embedding-based semantic representations (BGE-m3, LLM2Vec), and sophisticated Large Language Models such as Qwen-2, QwQ-32B, and DeepSeek-V3. These models were deployed for tasks ranging from summarization and relevance scoring to contextual re-ranking and complex reasoning.

COLIEE 2025 featured five distinct tasks, spanning both case law (Federal Court of Canada and Japanese court decisions) and statute law (Japanese Civil Code). These included Legal Case Retrieval (Task 1), Legal Case Entailment (Task 2), Statute Law Retrieval (Task 3), Legal Textual Entailment (Task 4), and Legal Judgment Prediction (Task 5).

Task 1: Legal Case Retrieval

This task focused on identifying relevant precedents that support a given case decision. To cope with the excessive length and complex logical structure of legal documents, the NOWJ team proposed a four-stage framework: data pre-processing to remove noise and irrelevant information, abstractive summarization using Qwen-2.5 to create concise case summaries, pre-ranking with BGE-m3 for initial candidate selection, and re-ranking with fine-tuned BGE-m3 or LLM2Vec. A majority voting post-processing step then combined the outputs for improved performance. The team placed fourth in this task.
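The majority voting post-processing step can be illustrated with a minimal sketch. It assumes each re-ranker run yields a list of candidate case IDs and that a case is kept when more than half of the runs propose it; the function name, the `min_votes` parameter, and the toy case IDs are hypothetical, and the team's actual voting rule may differ.

```python
from collections import Counter


def majority_vote(runs, min_votes=None):
    """Keep case IDs proposed by at least `min_votes` runs.

    By default a strict majority of runs is required (an assumption,
    not necessarily the rule used by the NOWJ team).
    """
    if min_votes is None:
        min_votes = len(runs) // 2 + 1
    # Count each case at most once per run, even if a run repeats it.
    counts = Counter(case_id for run in runs for case_id in set(run))
    return sorted(cid for cid, n in counts.items() if n >= min_votes)


# Toy outputs from three hypothetical re-ranker runs for one query case.
runs = [
    ["c1", "c2", "c3"],
    ["c2", "c3", "c5"],
    ["c3", "c4"],
]
final = majority_vote(runs)
```

Raising `min_votes` trades recall for precision: requiring all three runs to agree would keep only the candidates every re-ranker retrieved.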

Task 2: Legal Case Entailment – A Winning Strategy

The NOWJ team truly excelled in Task 2, which aimed to identify specific paragraphs within retrieved cases that entail a given decision. This fine-grained challenge required deep legal text understanding. Their three-stage pipeline combined lexical pre-ranking (BM25) for efficient filtering, semantic re-ranking (fine-tuned BERT and monoT5) to capture deep semantic relationships, and a crucial LLM-based analysis stage. In this final stage, advanced LLMs like DeepSeek-V3 and QwQ-32B were prompted to consider multiple candidate paragraphs holistically, making more informed entailment judgments. This innovative approach, particularly the voting strategy between two distinct LLMs, led to a first-place finish with an F1 score of 0.3195, highlighting the power of multi-perspective LLM-based verification.
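To make the lexical pre-ranking stage concrete, here is a minimal pure-Python sketch of BM25 scoring used to shortlist candidate paragraphs before any semantic or LLM-based stage. It uses a common BM25 variant with whitespace tokenization and default-style parameters (`k1=1.5`, `b=0.75`); the function names, parameter values, and toy paragraphs are assumptions for illustration, not the team's configuration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use a proper one.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def pre_rank(query, docs, top_k=2):
    """Return indices of the top_k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


paragraphs = [
    "the applicant failed to establish a well-founded fear of persecution",
    "costs are awarded to the respondent",
    "the officer's decision was unreasonable because the evidence of persecution was ignored",
]
query = "the decision was unreasonable because evidence of persecution was ignored"
shortlist = pre_rank(query, paragraphs, top_k=2)
```

Only the shortlisted paragraphs would then be passed to the more expensive re-rankers and LLM prompts, which is the point of placing a cheap lexical filter first.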

Task 3: Statute Law Retrieval

For this task, participants had to retrieve relevant articles from the Japanese Civil Code for a given legal question. The NOWJ team leveraged a combination of Bi-Encoder models (like bge, e5, stella, NV-Embed) for initial vector representations and Cross-Encoder models (bge-reranker, gte-reranker) for precise re-ranking. They employed various ensemble strategies, including a LightGBM model, grid search-optimized linear weights, and a similarity-informed voting ensemble. Their best approach, using optimized linear weights from three top-performing base models, secured third place on the leaderboard with an F2 score of 0.7702.
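The grid search over linear ensemble weights can be sketched as follows. The sketch assumes three base models that each assign a relevance score to candidate articles, a fixed decision threshold, and mean per-query F2 as the objective; the function names, the step size, the threshold, and the toy scores are all hypothetical and not taken from the team's system.

```python
from itertools import product


def f2_score(predicted, gold):
    """F2 weights recall higher than precision: 5PR / (4P + R)."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 5 * precision * recall / (4 * precision + recall)


def ensemble_predict(model_scores, weights, threshold):
    """model_scores: one {article_id: score} dict per base model."""
    articles = set().union(*model_scores)
    combined = {
        a: sum(w * s.get(a, 0.0) for w, s in zip(weights, model_scores))
        for a in articles
    }
    return {a for a, s in combined.items() if s >= threshold}


def grid_search(queries, step=0.25, threshold=0.5):
    """Try all weight triples on a grid that sum to 1; keep the best."""
    steps = round(1 / step)
    best_weights, best_f2 = None, -1.0
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        weights = (i * step, j * step, (steps - i - j) * step)
        mean_f2 = sum(
            f2_score(ensemble_predict(ms, weights, threshold), gold)
            for ms, gold in queries
        ) / len(queries)
        if mean_f2 > best_f2:
            best_weights, best_f2 = weights, mean_f2
    return best_weights, best_f2


# Toy validation data: (scores from three models, gold articles) per query.
queries = [
    ([{"art709": 0.9, "art710": 0.2},
      {"art709": 0.4, "art710": 0.8},
      {"art709": 0.7, "art715": 0.6}], {"art709"}),
    ([{"art415": 0.3, "art416": 0.9},
      {"art415": 0.9, "art416": 0.1},
      {"art415": 0.8}], {"art415"}),
]
best_weights, best_f2 = grid_search(queries)
```

Because the single-model weightings (1, 0, 0), (0, 1, 0), and (0, 0, 1) are themselves points on the grid, the selected ensemble can never score worse on the validation data than the best individual model.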

Task 4: Legal Textual Entailment

This task involved a Yes/No question-answering system based on a legal question and relevant articles. The NOWJ team used an LLM-based framework, deploying open-source LLMs such as Qwen-2, Llama-3, and Mixtral. They experimented with both zero-shot and few-shot prompting, followed by answer processing and a majority voting mechanism to combine LLM outputs. Despite the LLMs' strong in-context learning abilities, this real-world legal entailment task remained challenging, and the team placed sixth. The outcome points to data augmentation and fine-tuning for complex legal reasoning as directions for future research.
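The answer processing step matters because LLMs rarely return a bare label. A minimal sketch, assuming free-form responses in English and a heuristic that treats the last standalone "yes" or "no" as the model's verdict, might look like this; the function names, the tie-breaking default, and the sample outputs are hypothetical.

```python
import re
from collections import Counter


def extract_answer(raw_output):
    """Map a free-form LLM response to 'Y', 'N', or None if unparseable.

    Heuristic assumption: the last standalone yes/no is the verdict.
    """
    matches = re.findall(r"\b(yes|no)\b", raw_output.lower())
    if not matches:
        return None
    return "Y" if matches[-1] == "yes" else "N"


def vote(outputs, default="N"):
    """Majority vote over parsed answers; ties and empty votes use default."""
    answers = [a for a in map(extract_answer, outputs) if a is not None]
    counts = Counter(answers)
    if counts["Y"] == counts["N"]:
        return default
    return "Y" if counts["Y"] > counts["N"] else "N"


# Toy responses from four hypothetical model/prompt combinations.
outputs = [
    "Answer: Yes",
    "The statement is not supported. No.",
    "yes, under Article 96",
    "I cannot determine this.",
]
label = vote(outputs)
```

Unparseable responses are dropped rather than counted as "No", so a model that declines to answer does not bias the vote.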


Task 5: Legal Judgment Prediction (Pilot Task)

The pilot task focused on predicting legal judgments in Japanese tort cases, including Tort Prediction (TP) and Rationale Extraction (RE). The NOWJ team explored two main approaches: a hierarchical language model (Inter-Span Transformer architecture with ModernBERT-Ja-310M and a Conditional Random Field layer) combined with heuristic post-processing, and a clustering-based method using DeepSeek-V3. Their enhanced hierarchical model with post-processing achieved the best results for the team, securing third place in both TP (67.1% accuracy) and RE (69.2% F1 score), demonstrating the value of refining predictions based on consistency between claim patterns and final decisions.

The NOWJ team’s performance at COLIEE 2025 underscores the significant potential of integrating traditional information retrieval techniques with contemporary generative AI models for legal information processing. Their methodologies, particularly the multi-stage ensemble framework and the successful application of LLMs for contextual re-ranking and analysis, provide a valuable reference for future advancements in legal AI. For more details, see the team's full research paper.
