TLDR: This research introduces CLoVE and GloVE, two algorithms designed to enhance the transparency of LLM-as-a-Judge systems. CLoVE generates verifiable, concept-based, contrastive local explanations (BECAUSE-DESPITE format) for individual LLM decisions. GloVE then aggregates these local explanations into a high-level, verifiable, rule-based global policy. Evaluated on harm detection tasks, the approach demonstrates high fidelity to LLM decisions and robustness against text perturbations and adversarial attacks. A user study also indicated improved perceived usefulness of GloVE explanations, contributing to more trustworthy AI evaluation.
Large Language Models (LLMs) are increasingly being used to evaluate text, a practice known as LLM-as-a-Judge. This approach is gaining traction for augmenting or even replacing human annotations, especially at scale. However, the opaque nature of LLM decision-making raises significant concerns about potential biases, reliability, and fairness, particularly when these models are applied to critical tasks.
Understanding how an LLM-as-a-Judge arrives at its conclusions is crucial for ensuring its safe, correct, and unbiased operation. While human feedback is ideal, it’s often costly and difficult to scale. LLM-as-a-Judge offers a promising alternative, but its inherent encoding of specific worldviews and normative assumptions, derived from training data, can go unnoticed without rigorous evaluation and transparent explanations.
Traditional local explanations, such as Chain of Thought (CoT) prompting, can explain individual decisions but are often unreliable and may not reflect the true causal reasons behind a model’s output. Furthermore, local explanations don’t provide a comprehensive understanding of the LLM-as-a-Judge’s overall decision-making logic or ‘policy’. This is where the work presented in the research paper, Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations, comes into play.
Introducing CLoVE: Contrastive Local Verifiable Explanations
The paper proposes CLoVE, a novel algorithm designed to generate high-level, verifiable, concept-based local explanations for an LLM-as-a-Judge’s decisions. CLoVE provides contrastive explanations in a ‘BECAUSE-DESPITE’ format, offering nuanced insights from multiple viewpoints. For instance, an explanation might state that a prompt was classified as harmful BECAUSE of a ‘weapon making request’ DESPITE a ‘video game context’. This format helps users understand both the supporting and conflicting concepts influencing a decision.
CLoVE operates in three key steps:
- Generator: An LLM is prompted to provide initial reasoning for a decision, which is then summarized into candidate concepts.
- Local Word-Based Explainer: An algorithm like LIME identifies the most important words in the input that contributed to the LLM-as-a-Judge’s decision.
- Verifier: Another LLM assesses whether the generated concepts are genuinely supported by the important words identified by the explainer, mitigating the risk of hallucination.
This ensures that the concepts used in the explanation are causally grounded in the input text and truly influenced the model’s decision.
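The three steps above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the paper's implementation: the judge is a hypothetical keyword scorer instead of an LLM, the candidate concepts are supplied by hand instead of being generated, leave-one-out scoring stands in for LIME, and the verifier is a simple overlap check rather than a second LLM. All names (`judge_score`, `word_importance`, `clove_explain`, `HARM_WEIGHTS`) are made up for this example.

```python
# Hypothetical toy judge: integer word weights stand in for an LLM-as-a-Judge.
HARM_WEIGHTS = {"bomb": 6, "build": 2, "game": -3}
THRESHOLD = 4  # a score >= THRESHOLD is labeled "harmful"


def judge_score(words):
    """Toy harm score: sum of the weights of the words present."""
    return sum(HARM_WEIGHTS.get(w, 0) for w in words)


def word_importance(words):
    """Leave-one-out importance (a simple stand-in for LIME).

    Positive values push the score toward "harmful".
    Repeated words keep only the value of their last occurrence.
    """
    base = judge_score(words)
    return {w: base - judge_score(words[:i] + words[i + 1:])
            for i, w in enumerate(words)}


def clove_explain(text, candidate_concepts, top_k=3):
    """Return (label, BECAUSE concepts, DESPITE concepts).

    candidate_concepts maps each concept to the words that would evidence it.
    In the approach described above an LLM proposes these; here they are given.
    """
    words = text.lower().split()
    label = "harmful" if judge_score(words) >= THRESHOLD else "harmless"

    # Explainer step: keep the top_k words with non-zero importance.
    ranked = sorted(word_importance(words).items(), key=lambda kv: -abs(kv[1]))
    important = {w: s for w, s in ranked[:top_k] if s != 0}

    # Verifier step (toy): a concept survives only if its evidence words are
    # among the important words; the sign decides BECAUSE versus DESPITE.
    sign = 1 if label == "harmful" else -1
    because, despite = [], []
    for concept, evidence in candidate_concepts.items():
        support = sign * sum(important.get(w, 0) for w in evidence)
        if support > 0:
            because.append(concept)
        elif support < 0:
            despite.append(concept)
        # support == 0: the concept is not grounded in the input, so drop it.
    return label, because, despite


candidates = {
    "weapon making request": {"build", "bomb"},
    "video game context": {"game"},
    "medical emergency": {"hospital"},  # ungrounded candidate, gets filtered out
}
label, because, despite = clove_explain("how to build a bomb in my game", candidates)
print(f"{label} BECAUSE {because} DESPITE {despite}")
```

The ungrounded "medical emergency" candidate is dropped because none of its evidence words appear among the important words, which is the role the verifier plays in filtering hallucinated concepts.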
Introducing GloVE: Global Verifiable Explanations
Building on CLoVE, the researchers introduce GloVE, an algorithm for summarizing these individual local explanations into a high-level, verifiable, rule-based global policy for the LLM-as-a-Judge. GloVE aims to condense the numerous local rules into an interpretable format while preserving the relationships between supporting and conflicting concepts.
GloVE represents the collection of local explanations as a K-partite graph, where nodes are concepts and arcs connect concepts from the same local rule. To summarize this graph, GloVE employs an iterative clustering process. In each iteration, similar concepts are grouped, and an LLM proposes potential common labels for these clusters. A factuality assessor, FactReasoner, then verifies these labels, ensuring they are entailed by the original concepts with high probability. This verified label replaces the clustered concepts, effectively merging them into a higher-level concept. This process is repeated, gradually distilling the local explanations into a compact, faithful global policy.
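The iterative merge loop can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: Jaccard token overlap replaces the similarity measure, shared tokens replace the LLM-proposed label, a subset check replaces FactReasoner's entailment test, and concepts are merged only within the same role (BECAUSE or DESPITE) so that supporting and conflicting concepts stay separate. The function names are hypothetical and concepts are assumed to be non-empty strings.

```python
from itertools import combinations


def tokens(concept):
    return set(concept.lower().split())


def jaccard(a, b):
    """Token-overlap similarity between two concept strings."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def propose_label(cluster):
    """Stand-in for an LLM-proposed label: the tokens shared by all members."""
    shared = set.intersection(*(tokens(c) for c in cluster))
    return " ".join(w for w in cluster[0].lower().split() if w in shared)


def verify_label(label, cluster):
    """Stand-in for an entailment check: every label token is in every member."""
    return bool(label) and all(tokens(label) <= tokens(c) for c in cluster)


def merge_once(rules, role, threshold):
    """Merge the most similar verified pair of concepts in one role, in place."""
    concepts = sorted({c for r in rules for c in r[role]})
    pairs = sorted(combinations(concepts, 2), key=lambda p: -jaccard(*p))
    for a, b in pairs:
        if jaccard(a, b) < threshold:
            break  # remaining pairs are even less similar
        label = propose_label([a, b])
        if verify_label(label, [a, b]):
            for r in rules:
                r[role] = sorted({label if c in (a, b) else c for c in r[role]})
            return True
    return False


def summarize(local_rules, threshold=0.3):
    """Repeat merging until nothing changes, then drop duplicate rules."""
    rules = [{role: list(cs) for role, cs in r.items()} for r in local_rules]
    while any(merge_once(rules, role, threshold) for role in ("because", "despite")):
        pass
    seen, policy = set(), []
    for r in rules:
        key = (tuple(r["because"]), tuple(r["despite"]))
        if key not in seen:
            seen.add(key)
            policy.append(r)
    return policy


local_rules = [
    {"because": ["weapon making request"], "despite": ["video game context"]},
    {"because": ["weapon purchase request"], "despite": ["video game context"]},
    {"because": ["self harm instructions"], "despite": []},
]
for rule in summarize(local_rules):
    print("harmful BECAUSE", rule["because"], "DESPITE", rule["despite"])
```

In this toy version the proposed label always passes verification because it is built from the shared tokens. With an LLM proposing labels, the proposal can be wrong, which is why a separate verification step matters.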
Experimental Validation and User Study
The researchers evaluated GloVE on seven standard benchmark datasets for content harm detection, using two prominent guardrail models: Granite Guardian and LlamaGuard. The results showed that GloVE explanations achieve consistently high fidelity to the LLM-as-a-Judge’s decisions, often outperforming a baseline approach (GELPE) in accuracy and F1 score, at the cost of only a minimal drop in performance.
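One way to read fidelity here is as agreement between the decisions predicted by the extracted rules and the decisions the judge actually made, with the judge's labels treated as ground truth. The sketch below computes accuracy and F1 on that basis. It is an assumption about how such a score could be computed, not the paper's evaluation code, and the function name and label strings are hypothetical.

```python
def fidelity(judge_labels, rule_labels, positive="harmful"):
    """Accuracy and F1 of rule predictions against the judge's own decisions."""
    if len(judge_labels) != len(rule_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(judge_labels, rule_labels))
    tp = sum(j == positive and r == positive for j, r in pairs)
    fp = sum(j != positive and r == positive for j, r in pairs)
    fn = sum(j == positive and r != positive for j, r in pairs)
    accuracy = sum(j == r for j, r in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


judge = ["harmful", "harmful", "harmless", "harmless"]
rules = ["harmful", "harmless", "harmless", "harmless"]
accuracy, f1 = fidelity(judge, rules)
print(f"accuracy={accuracy:.2f} f1={f1:.2f}")
```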
Furthermore, GloVE showed high robustness to various text perturbations, including paraphrasing strategies (like hiding content, elaborating, or substituting words) and simple adversarial attacks (such as removing spaces, changing case, inserting punctuation, or swapping words). This indicates that the extracted global policies remain stable even when the input text is slightly altered.
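The simple adversarial attacks listed above are easy to reproduce in code. The following sketch implements one hypothetical version of each and measures how often a predictor's decision survives them. The exact perturbation procedures and the `agreement_rate` metric are assumptions for illustration; the source does not specify how the attacks were generated or scored.

```python
import random


def remove_space(text, rng):
    """Delete one randomly chosen space."""
    idxs = [i for i, ch in enumerate(text) if ch == " "]
    if not idxs:
        return text
    i = rng.choice(idxs)
    return text[:i] + text[i + 1:]


def change_case(text, rng):
    """Flip the case of every letter (rng is unused, kept for a uniform API)."""
    return text.swapcase()


def insert_punctuation(text, rng):
    """Insert one punctuation mark at a random position."""
    i = rng.randrange(len(text) + 1)
    return text[:i] + rng.choice(".,!?") + text[i:]


def swap_words(text, rng):
    """Swap one random pair of adjacent words."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def agreement_rate(predict, texts, perturbations, seed=0):
    """Fraction of (text, perturbation) pairs whose prediction is unchanged."""
    rng = random.Random(seed)
    total = same = 0
    for text in texts:
        for perturb in perturbations:
            total += 1
            same += predict(perturb(text, rng)) == predict(text)
    return same / total if total else 1.0


# Hypothetical predictor standing in for a rule-based global policy.
def toy_policy(text):
    return "harmful" if "bomb" in text.lower() else "harmless"


texts = ["how to build a bomb", "nice weather today"]
attacks = [remove_space, change_case, insert_punctuation, swap_words]
print(agreement_rate(toy_policy, texts, attacks))
```

A rate close to 1.0 means the decisions are stable under the chosen perturbations; comparing the rate across perturbation types shows which attacks a policy is most sensitive to.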
A user study was also conducted to assess user understanding and satisfaction. While participants who saw GloVE explanations showed only a marginal increase in accuracy when predicting LLM-as-a-Judge decisions compared to the baseline, they reported significantly higher satisfaction in terms of perceived usefulness of the explanations. This highlights the ongoing challenge of balancing faithfulness with interpretability in global explanations.
Conclusion
The CLoVE and GloVE pipeline represents a significant step towards making LLM-as-a-Judge systems more transparent and trustworthy. The research was conducted by Jasmina Gajcin, Erik Miehling, Rahul Nair, Elizabeth Daly, Radu Marinescu, and Seshu Tirupathi from IBM Research Ireland. By generating verifiable, concept-based local explanations and then summarizing them into faithful, interpretable global policies, it provides crucial tools for understanding and overseeing AI evaluators in high-stakes domains.


