TLDR: JustEva is an open-source toolkit designed to evaluate the fairness of Large Language Models (LLMs) in legal contexts. It features a 65-factor label system, three core fairness metrics (inconsistency, bias, imbalanced inaccuracy), robust statistical inference, and visualizations. Using the JudiFair dataset, JustEva can identify systematic fairness issues in LLMs, revealing significant inconsistencies, biases, and imbalanced inaccuracies in leading models like Gemini Flash 1.5, GLM 4, and Qwen2.5 72B Instruct. The toolkit aims to provide a practical solution for auditing and improving algorithmic fairness in the legal domain.
The increasing integration of Large Language Models (LLMs) into legal practice brings significant advancements but also raises critical questions about judicial fairness. Given the complex, often opaque nature of these AI systems, ensuring their impartiality is paramount to upholding justice. A new study introduces JustEva, a comprehensive, open-source toolkit designed specifically to evaluate the fairness of LLMs in legal tasks.
JustEva addresses the urgent need for a robust method to assess whether AI tools in legal settings might inadvertently perpetuate or introduce biases. The toolkit is built on several key advantages that make it a powerful resource for developers, researchers, and legal professionals alike.
Key Features of JustEva
At its core, JustEva utilizes a meticulously structured label system that covers 65 extra-legal factors. These factors, ranging from demographic characteristics to procedural elements, are identified by legal experts as potential influences on judicial decision-making. This extensive set of labels allows for a granular and comprehensive examination of fairness.
The toolkit employs three core fairness metrics to provide a multi-faceted evaluation:
- Inconsistency: This metric measures how stable an LLM’s predictions are when minor, controlled changes are made to case features. High inconsistency suggests that an LLM might deliver different judgments based on irrelevant or unstable details, which is highly undesirable in legal contexts.
- Bias: JustEva identifies systematic biases by estimating the marginal impact of each extra-legal label on sentencing outcomes. It uses advanced statistical methods to determine if certain factors, like gender or crime type, systematically influence an LLM’s decisions.
- Imbalanced Inaccuracy: Beyond just disparate treatment, this metric examines whether prediction errors are disproportionately larger for certain groups or label values. It highlights if an LLM is less accurate for specific categories, leading to unfair outcomes.
To ensure the reliability of its findings, JustEva incorporates robust statistical inference methods, including high-dimensional fixed-effects linear regression and a Bernoulli test to account for multiple comparisons. This rigorous approach helps distinguish meaningful fairness disparities from random variations. Furthermore, the toolkit provides informative visualizations, presenting detailed statistics and tables in an easy-to-understand format, which enhances clarity and interpretation of the results.
How JustEva Works
JustEva supports a complete evaluation workflow through two main types of experiments. Users can either generate structured outputs from LLMs using a provided dataset or conduct statistical analysis and inference on existing LLM outputs. The toolkit is designed to be user-friendly, allowing configuration and evaluation of LLMs through custom APIs and settings without requiring coding skills.
The toolkit’s backend is implemented in Python and integrates with platforms like OpenRouter, enabling users to query various LLMs. For data analysis, it leverages PyStata, which seamlessly integrates Python and Stata to perform advanced statistical and regression analyses. The results are then processed into readable formats, including JSON for visualization and Excel tables for detailed review.
The foundation of JustEva’s evaluation is the JudiFair dataset, a comprehensive collection of 177,100 unique case facts derived from real Chinese judicial documents. This dataset is annotated with the 65 legal labels and includes counterfactual variants, allowing for precise analysis of how specific label changes affect model outputs.
Also Read:
- Understanding Large Language Models in Legal AI: A Deep Dive into Current Trends and Future Paths
- Unmasking License Conflicts in Open-Source AI: A Deep Dive into Compliance Challenges
Empirical Findings and Impact
The researchers applied JustEva to evaluate three widely-used LLMs: Gemini Flash 1.5, GLM 4, and Qwen2.5 72B Instruct. The empirical application revealed significant fairness deficiencies across all models. For instance, each model showed substantial inconsistency, with over 10% of documents producing varied predictions for each label on average. Moreover, systematic biases and imbalanced inaccuracies were found to be statistically significant at the 1% level, as determined by the Bernoulli test.
These findings underscore a general pattern of significant fairness problems in current LLMs when applied to legal tasks, highlighting the urgent need for more fair and trustworthy AI legal tools. JustEva offers a transparent, scalable, and reproducible workflow for identifying and addressing these critical judicial fairness concerns.
In conclusion, JustEva provides a practical and convenient solution for auditing LLMs in legal tasks, contributing to broader efforts to build transparent, fair, and trustworthy AI systems in the legal domain. For more details, you can read the full research paper: JustEva: A Toolkit to Evaluate LLM Fairness in Legal Knowledge Inference.


