TLDR: J1-ENVS is a novel interactive and dynamic legal environment designed to benchmark LLM-based legal agents across six real-world Chinese legal scenarios, categorized into three complexity levels. Coupled with J1-EVAL, a fine-grained evaluation framework, it assesses both task performance and procedural compliance. Experiments with 17 LLMs, including GPT-4o, reveal that while models possess legal knowledge, they struggle significantly with procedural execution in dynamic settings, highlighting challenges in achieving true legal intelligence.
A new research paper introduces J1-ENVS, a groundbreaking interactive and dynamic legal environment designed to benchmark the capabilities of Large Language Model (LLM)-based agents in real-world legal scenarios. This innovative framework aims to bridge the gap between traditional static benchmarks and the complex, evolving nature of actual legal practice.
The J1-ENVS environment, developed with guidance from legal experts, comprises six distinct scenarios drawn from Chinese legal practices. These scenarios are categorized into three levels of increasing environmental complexity:
Level I: Foundational Legal Interactions
This level includes Knowledge Questioning (KQ) and Legal Consultation (LC). Here, the LLM agent simulates a legal trainee, engaging in progressive dialogues with the general public to answer legal questions or provide case-specific advice. The interactions are designed to test the agent’s ability to respond accurately and proactively in dynamic settings.
Level II: Document Drafting
Level II focuses on Complaint Drafting (CD) and Defence Drafting (DD). In these scenarios, the legal agent acts as a practicing lawyer, guiding litigants step-by-step to collect necessary information and produce legally compliant documents. This level assesses the agent’s ability to manage multi-turn interactions and adhere to procedural requirements for document generation.
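The Level-II interaction pattern can be pictured as a slot-filling dialogue: the agent asks for each required fact in turn, refuses to proceed on empty answers, and only then renders the document. The sketch below is purely illustrative; the field names, template, and `ask` callback are our assumptions, not the paper's actual schema.

```python
# Illustrative sketch of the drafting loop: elicit required facts turn by turn,
# then fill a minimal complaint template. Field names are hypothetical.
REQUIRED_FIELDS = ["plaintiff", "defendant", "claim", "facts"]

def draft_complaint(ask):
    """`ask(field)` stands in for one agent-litigant dialogue turn and returns the answer."""
    info = {}
    for field in REQUIRED_FIELDS:
        answer = ask(field).strip()
        if not answer:
            # A real agent would re-prompt; here we just flag the missing information.
            raise ValueError(f"litigant gave no answer for required field: {field}")
        info[field] = answer
    lines = ["Civil Complaint"] + [f"{field.capitalize()}: {info[field]}" for field in REQUIRED_FIELDS]
    return "\n".join(lines)
```

In the benchmark itself the "litigant" is a simulated interlocutor, so the quality of the agent's questions, not just the final document, is part of what gets scored.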
Level III: Courtroom Simulations
The most complex level involves Civil Court (CI) and Criminal Court (CR) simulations. These environments feature multiple participants and are governed by strict procedural norms. The LLM agent takes on the role of a judge, facilitating interactions between parties, ensuring procedural compliance, and ultimately rendering legally valid judgments. This level tests advanced reasoning, procedural adherence, and multi-party interaction capabilities.
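One way to see why this level is hard is that courtroom procedure is stateful: phases must occur in order, and a judge agent that jumps ahead commits a violation even if its final verdict is correct. The toy environment below is our own hedged sketch of that idea; the phase names and scoring rule are invented for illustration and do not come from the paper.

```python
# Hypothetical turn-based courtroom environment: the judge agent must advance
# through procedural phases in order; out-of-order requests count as violations.
PHASES = ["court_opening", "investigation", "debate", "final_statements", "judgment"]

class CourtroomEnv:
    def __init__(self):
        self.phase_idx = 0   # next phase the procedure expects
        self.violations = 0  # count of out-of-order attempts

    def act(self, requested_phase: str) -> str:
        """The judge requests a phase; only the expected next phase advances the trial."""
        expected = PHASES[self.phase_idx]
        if requested_phase != expected:
            self.violations += 1
            return f"violation: expected {expected}"
        self.phase_idx += 1
        return f"ok: {requested_phase} completed"

    def procedural_score(self) -> float:
        """Toy procedural-following score: completed phases minus violations, normalized."""
        return max(0.0, (self.phase_idx - self.violations) / len(PHASES))
```

For example, opening court, then skipping straight to the debate, then correctly running the investigation would complete two of five phases with one violation, yielding a score of 0.2 under this toy rule.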
To evaluate agent performance within J1-ENVS, the researchers also introduced J1-EVAL, a fine-grained evaluation framework. J1-EVAL assesses both task completion and procedural compliance across varying levels of legal proficiency, from trainee to lawyer to judge. It employs a mix of rule-based and LLM-based automatic evaluation methods, with explicit ground-truth references for each task. Metrics include binary accuracy and non-binary scores for open-ended questions; format-following and document-quality scores for drafting tasks; and procedural-following, judgment, reasoning, law-accuracy, crime-accuracy, and verdict-deviation scores for court scenarios.
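Mechanically, an evaluation framework like this combines cheap rule-based checks (exact-match accuracy on closed questions) with rubric-based scores produced by an LLM judge, then aggregates them. The snippet below is a minimal sketch of that aggregation under our own assumptions; the metric names and equal weighting are illustrative and are not taken from J1-EVAL's actual formulas.

```python
# Minimal sketch of mixed rule-based / judge-based scoring. All names and
# weights are illustrative assumptions, not J1-EVAL's published formulas.
def binary_accuracy(predictions, references):
    """Rule-based metric: fraction of exact matches on closed-form questions."""
    assert len(predictions) == len(references) and references
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def weighted_overall(scores, weights):
    """Combine per-metric scores (each in [0, 1]) into one overall score."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total

# Hypothetical per-task results for one agent run
scores = {
    "knowledge_accuracy": binary_accuracy(["A", "C", "B"], ["A", "C", "D"]),  # rule-based
    "procedural_following": 0.5,  # e.g. assigned by an LLM judge against a rubric
    "document_quality": 0.8,      # likewise judge-assigned
}
weights = {"knowledge_accuracy": 1.0, "procedural_following": 1.0, "document_quality": 1.0}
overall = weighted_overall(scores, weights)
```

The practical point of the mixed design is that rule-based metrics are reproducible but only cover tasks with a single correct answer, while open-ended dialogue and document quality require a graded judgment.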
Extensive experiments were conducted on 17 different LLM agents, including proprietary models like GPT-4o and Claude-3.7, as well as various open-source and legal-specific models. The findings reveal that while many models demonstrate a solid understanding of legal knowledge, they significantly struggle with procedural execution in dynamic environments. Even GPT-4o, considered a state-of-the-art model, achieved an overall performance score below 60%. This highlights persistent challenges in achieving true dynamic legal intelligence in AI agents.
The research paper emphasizes that current legal-specific LLMs, despite performing well on existing static benchmarks, perform markedly worse in interactive and dynamic settings, underscoring a key limitation of their interactive capabilities. The study also found that effective reasoning and adherence to procedural protocols over long legal contexts are crucial for accurate legal judgments.
The introduction of J1-ENVS and J1-EVAL marks a significant shift in the paradigm for legal intelligence, moving from static evaluations to dynamic, interactive assessments. Beyond evaluation, the framework can be extended to data generation and reinforcement-learning training for future legal AI systems. For more details, refer to the full research paper.
While the benchmark provides a comprehensive simulation of real-world legal practice, the authors note limitations, such as the primary focus on procedural flow rather than complex capabilities like retrieving statutory provisions or consulting precedent databases. Future research is expected to build upon this foundation by incorporating such advanced functionalities to further enhance realism and applicability.


