TLDR: The NOWJ team participated in all five tasks at COLIEE 2025, presenting a multi-stage framework that integrates embedding models and large language models (LLMs) for legal retrieval and entailment. Their approach combined pre-ranking models, semantic representations, and advanced LLMs for summarization, relevance scoring, and contextual re-ranking. Notably, they achieved first place in the Legal Case Entailment task (Task 2) with an F1 score of 0.3195, using two-stage retrieval followed by contextualized LLM analysis. They also secured third place in Statute Law Retrieval (Task 3) and Legal Judgment Prediction (Task 5), demonstrating the effectiveness of hybrid models in complex legal information processing.
Legal text processing, a specialized area combining law and information science, is advancing rapidly thanks to artificial intelligence (AI) and large language models (LLMs). The Competition on Legal Information Extraction/Entailment (COLIEE) is an annual event where researchers explore and evaluate cutting-edge techniques on complex, real-world judicial problems.
At COLIEE 2025, the NOWJ team participated in all five tasks and posted strong results, including a first-place finish in the Legal Case Entailment task. Their success stems from a multi-stage framework that systematically integrates a range of AI methodologies, from traditional information retrieval techniques to advanced generative models.
A Holistic Approach to Legal AI
The NOWJ team’s core strategy revolved around a hybrid model approach, combining pre-ranking models like BM25, BERT, and monoT5 with embedding-based semantic representations (BGE-m3, LLM2Vec), and sophisticated Large Language Models such as Qwen-2, QwQ-32B, and DeepSeek-V3. These models were deployed for tasks ranging from summarization and relevance scoring to contextual re-ranking and complex reasoning.
COLIEE 2025 featured five distinct tasks, spanning both case law (Federal Court of Canada and Japanese court decisions) and statute law (Japanese Civil Code). These included Legal Case Retrieval (Task 1), Legal Case Entailment (Task 2), Statute Law Retrieval (Task 3), Legal Textual Entailment (Task 4), and Legal Judgment Prediction (Task 5).
Task 1: Legal Case Retrieval
This task focused on identifying relevant precedents that support a given case decision. Facing challenges like the excessive length and complex logical structure of legal documents, the NOWJ team proposed a four-stage framework. This involved meticulous data pre-processing to remove noise and irrelevant information, abstractive summarization using Qwen-2.5 to create concise case summaries, a pre-ranking step with BGE-m3 for initial candidate selection, and a re-ranking phase utilizing fine-tuned BGE-m3 or LLM2Vec. Finally, a majority voting post-processing step combined the outputs for improved performance. The team secured fourth place in this task, demonstrating robust performance.
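The majority-voting post-processing step can be illustrated with a minimal sketch. It assumes each run (for example, one per re-ranker) outputs a list of candidate case IDs and that a case is kept when more than half of the runs return it. The function name, the threshold, and the case IDs are illustrative assumptions, not details taken from the team's paper.

```python
from collections import Counter


def majority_vote(runs, min_votes=None):
    """Keep candidate case IDs returned by at least `min_votes` runs.

    `runs` is a list of candidate lists, one per model or configuration.
    By default a strict majority of runs is required (an assumption).
    """
    if min_votes is None:
        min_votes = len(runs) // 2 + 1
    # Count each case at most once per run.
    counts = Counter(case_id for run in runs for case_id in set(run))
    kept = [case_id for case_id, n in counts.items() if n >= min_votes]
    # Deterministic order: most votes first, then by ID.
    return sorted(kept, key=lambda case_id: (-counts[case_id], case_id))


# Hypothetical outputs of three retrieval runs for one query case.
runs = [
    ["case_12", "case_07", "case_33"],
    ["case_07", "case_33", "case_90"],
    ["case_07", "case_12", "case_41"],
]
selected = majority_vote(runs)
print(selected)
```

Raising `min_votes` trades recall for precision: with `min_votes=3`, only the case returned by every run survives.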
Task 2: Legal Case Entailment – A Winning Strategy
The NOWJ team truly excelled in Task 2, which aimed to identify specific paragraphs within retrieved cases that entail a given decision. This fine-grained challenge required deep legal text understanding. Their three-stage pipeline combined lexical pre-ranking (BM25) for efficient filtering, semantic re-ranking (fine-tuned BERT and monoT5) to capture deep semantic relationships, and a crucial LLM-based analysis stage. In this final stage, advanced LLMs like DeepSeek-V3 and QwQ-32B were prompted to consider multiple candidate paragraphs holistically, making more informed entailment judgments. This innovative approach, particularly the voting strategy between two distinct LLMs, led to a first-place finish with an F1 score of 0.3195, highlighting the power of multi-perspective LLM-based verification.
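To make the first stage concrete, here is a minimal sketch of lexical pre-ranking with the standard Okapi BM25 formula, written in plain Python. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the `top_k` cutoff, and the example paragraphs are all assumptions for illustration; the team's actual BM25 configuration is not described here.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def pre_rank(query, paragraphs, top_k=2):
    """Return indices of the `top_k` paragraphs with the highest BM25 score."""
    def tokenize(text):
        return text.lower().split()  # naive tokenizer (assumption)

    scores = bm25_scores(tokenize(query), [tokenize(p) for p in paragraphs])
    order = sorted(range(len(paragraphs)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


# Hypothetical candidate paragraphs from a retrieved case.
paragraphs = [
    "the applicant failed to establish a risk of persecution",
    "costs are awarded to the respondent",
    "the officer ignored evidence of persecution risk",
]
shortlist = pre_rank("risk of persecution", paragraphs, top_k=2)
print(shortlist)
```

In a pipeline like the one described, only the shortlisted paragraphs would be passed on to the more expensive semantic re-rankers and LLM analysis.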
Task 3: Statute Law Retrieval
For this task, participants had to retrieve relevant articles from the Japanese Civil Code for a given legal question. The NOWJ team leveraged a combination of Bi-Encoder models (like bge, e5, stella, NV-Embed) for initial vector representations and Cross-Encoder models (bge-reranker, gte-reranker) for precise re-ranking. They employed various ensemble strategies, including a LightGBM model, grid search-optimized linear weights, and a similarity-informed voting ensemble. Their best approach, using optimized linear weights from three top-performing base models, secured third place on the leaderboard with an F2 score of 0.7702.
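The idea of grid search-optimized linear weights can be sketched as follows. This example assumes each base model yields one relevance score per candidate article, that weights are searched on a coarse grid and constrained to sum to 1, and that the objective is the mean per-query F2 on a validation set. The function names, the grid step, the top-1 cutoff, and the toy scores are hypothetical; how COLIEE aggregates F2 across queries is not covered here.

```python
from itertools import product


def f_beta(predicted, gold, beta=2.0):
    """F-beta score for one query; beta=2 weights recall above precision."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def ensemble_top_k(model_scores, weights, k=1):
    """Combine per-model article scores linearly and return the top-k articles."""
    articles = model_scores[0].keys()
    combined = {
        a: sum(w * scores[a] for w, scores in zip(weights, model_scores))
        for a in articles
    }
    return sorted(combined, key=combined.get, reverse=True)[:k]


def grid_search(queries, step=0.25):
    """Try all weight vectors on the grid that sum to 1; keep the best mean F2."""
    n_models = len(queries[0]["scores"])
    grid = [i * step for i in range(int(1 / step) + 1)]
    best_weights, best_f2 = None, -1.0
    for weights in product(grid, repeat=n_models):
        if abs(sum(weights) - 1.0) > 1e-9:
            continue
        mean_f2 = sum(
            f_beta(ensemble_top_k(q["scores"], weights), q["gold"]) for q in queries
        ) / len(queries)
        if mean_f2 > best_f2:
            best_weights, best_f2 = weights, mean_f2
    return best_weights, best_f2


# Toy validation set: two base models, two queries (all values are made up).
queries = [
    {"scores": [{"X": 0.9, "Y": 0.1}, {"X": 0.4, "Y": 0.6}], "gold": ["X"]},
    {"scores": [{"P": 0.6, "Q": 0.4}, {"P": 0.1, "Q": 0.9}], "gold": ["Q"]},
]
best_weights, best_f2 = grid_search(queries)
print(best_weights, best_f2)
```

In this toy set each model alone gets one query wrong, while a blended weighting gets both right, which is the kind of complementarity an ensemble is meant to exploit.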
Task 4: Legal Textual Entailment
This task involved answering a Yes/No question given a legal statement and the relevant articles. The NOWJ team used an LLM-based framework built on open-source LLMs such as Qwen-2, Llama-3, and Mixtral. They experimented with both zero-shot and few-shot prompting, followed by answer processing and a majority voting mechanism to combine LLM outputs. Despite their strong in-context learning abilities, the LLMs still struggled with this real-world legal entailment task, and the team placed sixth. This outcome points to future research directions in data augmentation and fine-tuning for complex legal reasoning.
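The answer-processing and majority-voting steps might look roughly like the sketch below. It assumes each LLM returns free-form text, that the first standalone "yes" or "no" in that text is taken as the answer, and that ties or unparseable outputs fall back to a default label. The parsing rule, the `default` value, and the sample outputs are illustrative assumptions rather than the team's actual procedure.

```python
import re
from collections import Counter


def parse_answer(raw_output):
    """Map free-form LLM output to 'Y', 'N', or None if no answer is found.

    Takes the first standalone 'yes' or 'no' (a simplifying assumption).
    """
    match = re.search(r"\b(yes|no)\b", raw_output.lower())
    if match is None:
        return None
    return "Y" if match.group(1) == "yes" else "N"


def majority_answer(raw_outputs, default="N"):
    """Return the most common parsed answer; use `default` on ties or no votes."""
    votes = Counter(a for a in map(parse_answer, raw_outputs) if a is not None)
    if not votes:
        return default
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default  # tie between 'Y' and 'N'
    return ranked[0][0]


# Hypothetical raw outputs from four prompts or models for one question.
outputs = [
    "Answer: Yes.",
    "No, because the article does not apply.",
    "yes",
    "I cannot determine the answer.",
]
final = majority_answer(outputs)
print(final)
```

Unparseable outputs are simply dropped from the vote here; a stricter setup could instead re-prompt the model.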
Task 5: Legal Judgment Prediction (Pilot Task)
The pilot task focused on predicting legal judgments in Japanese tort cases, including Tort Prediction (TP) and Rationale Extraction (RE). The NOWJ team explored two main approaches: a hierarchical language model (Inter-Span Transformer architecture with ModernBERT-Ja-310M and a Conditional Random Field layer) combined with heuristic post-processing, and a clustering-based method using DeepSeek-V3. Their enhanced hierarchical model with post-processing achieved the best results for the team, securing third place in both TP (67.1% accuracy) and RE (69.2% F1 score), demonstrating the value of refining predictions based on consistency between claim patterns and final decisions.
The NOWJ team’s performance at COLIEE 2025 underscores the significant potential of integrating traditional information retrieval techniques with contemporary generative AI models for legal information processing. Their innovative methodologies, particularly the multi-stage ensemble framework and the successful application of LLMs for contextual re-ranking and analysis, provide a valuable reference for future advancements in legal AI. For more details, you can read the full research paper here.


