TLDR: MArgE is a novel framework that enhances claim verification by structuring evidence from multiple large language models (LLMs) into formal argument trees. It addresses the limitations of unstructured multi-LLM interactions and unfaithful Chain-of-Thought outputs by having LLMs generate pro/con arguments, meshing them, scoring their quality with an external LLM, and then applying formal argumentation semantics to derive a justifiable true/false prediction. This approach significantly improves accuracy and transparency compared to existing methods.
In the rapidly evolving landscape of artificial intelligence, Large Language Models (LLMs) have become indispensable tools, capable of generating coherent text, answering complex questions, and assisting in decision-making across various fields. However, their immense power comes with serious challenges: errors such as hallucinations, and a lack of transparency in their reasoning processes, especially when they are used for high-stakes applications like claim verification.
Traditional methods of combining insights from multiple LLMs often involve unstructured interactions, like free-form debates or simple voting. While these approaches can improve performance, they frequently leave the final decision ambiguous and difficult to justify. Imagine a group of experts debating a complex issue; if their final conclusion is not clearly tied to their individual arguments, it is hard to trust that conclusion or understand how it was reached. Similarly, Chain-of-Thought (CoT) outputs from single LLMs, intended to expose reasoning, can be unfaithful to the model’s true internal process or even contain conflicting information.
Introducing MArgE: A Structured Approach to Claim Verification
A new framework called MArgE (Meshing Argumentative Evidence from Multiple Large Language Models for Justifiable Claim Verification) has emerged to address these critical limitations. MArgE introduces a novel way to bring formal structure to the evidence generated by multiple LLMs, transforming their insights into an inspectable tree of extracted arguments. This framework is inspired by Argumentative LLMs (ArgLLMs), which leverage principles from computational argumentation to build structured reasoning pathways.
The core idea behind MArgE is to create a transparent and justifiable pathway from initial arguments to the final claim verification decision. Instead of simply asking LLMs for a true/false judgment, MArgE prompts them to generate specific arguments that either support or attack a given claim. This process ensures that every piece of evidence contributes to a clear, traceable rationale.
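As a rough illustration, this elicitation could be driven by a prompt template along the following lines. This is a minimal sketch; the wording and the PRO:/CON: labelling convention are assumptions for illustration, not the paper’s actual prompt:

```python
# Illustrative prompt template for eliciting pro/con arguments from an LLM.
PRO_CON_PROMPT = """\
Claim: {claim}

List up to {k} short, self-contained arguments that SUPPORT the claim,
and up to {k} short, self-contained arguments that ATTACK it.
Label each line "PRO:" or "CON:".
"""

print(PRO_CON_PROMPT.format(claim="Vitamin C cures the common cold.", k=3))
```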
How MArgE Works: A Four-Step Process
MArgE operates through a systematic four-step pipeline:
First, for a given claim, multiple LLMs (such as Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, and Phi-3-Mini-4K-Instruct) independently generate trees of supporting (pro) and attacking (con) arguments. Each argument is a concise, self-contained sentence, and the tree structure reflects the model’s reasoning process.
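A minimal sketch of what such a tree might look like as a data structure; the class and field names here are illustrative, not taken from the paper:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Polarity(Enum):
    PRO = "pro"  # the argument supports its parent
    CON = "con"  # the argument attacks its parent


@dataclass
class ArgumentNode:
    text: str                            # one concise, self-contained sentence
    polarity: Optional[Polarity]         # None for the root claim itself
    base_score: Optional[float] = None   # filled in later by the scoring LLM
    children: list["ArgumentNode"] = field(default_factory=list)


# A tiny tree one model might produce for an example claim
claim = ArgumentNode("Vitamin C cures the common cold.", polarity=None)
claim.children = [
    ArgumentNode("Clinical trials show no reduction in cold incidence.", Polarity.CON),
    ArgumentNode("Vitamin C plays a role in normal immune function.", Polarity.PRO),
]
```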
Second, these individual argument trees are then ‘meshed’ or combined into a single, unified structure. This can be done by simply concatenating all arguments or by using semantic merging, which identifies and combines similar arguments generated by different models to avoid redundancy.
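The semantic-merging variant could plausibly be implemented with sentence embeddings, collapsing arguments whose cosine similarity exceeds a threshold. The embedding model and the 0.85 threshold below are assumptions chosen for illustration, not values specified by the paper:

```python
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

# Assumed embedding model and similarity threshold; the paper leaves these open.
model = SentenceTransformer("all-MiniLM-L6-v2")
SIM_THRESHOLD = 0.85


def merge_similar(arguments: list[str]) -> list[str]:
    """Collapse near-duplicate arguments from different models into one."""
    embeddings = model.encode(arguments, convert_to_tensor=True)
    keep = [True] * len(arguments)
    for i, j in combinations(range(len(arguments)), 2):
        if keep[i] and keep[j]:
            if util.cos_sim(embeddings[i], embeddings[j]).item() > SIM_THRESHOLD:
                keep[j] = False  # drop the later duplicate, keep the first
    return [arg for arg, kept in zip(arguments, keep) if kept]
```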
Third, an external, often more powerful, LLM (like GPT-4o mini) is used to assign a ‘base score’ to each argument and, optionally, to the claim itself. This score represents the intrinsic quality of the argument, independent of its attackers or supporters. This external scoring step is crucial as it acts like a ‘reward model,’ helping to filter out or down-weight obviously hallucinated or irrelevant arguments, ensuring that only high-quality evidence significantly impacts the final decision.
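A hedged sketch of how this external scoring call might look using the OpenAI client; the exact prompt wording and the single-number reply convention are assumptions, not the paper’s scoring prompt:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def base_score(argument: str, claim: str) -> float:
    """Ask an external LLM to rate an argument's intrinsic quality in [0, 1].

    The prompt wording is illustrative; the paper's actual scoring
    prompt may differ.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Claim: {claim}\n"
                f"Argument: {argument}\n"
                "On a scale from 0.0 (irrelevant or hallucinated) to 1.0 "
                "(highly convincing), how good is this argument on its own, "
                "ignoring any counter-arguments? Reply with a single number."
            ),
        }],
        temperature=0.0,
    )
    return float(response.choices[0].message.content.strip())
```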
Finally, formal gradual semantics from computational argumentation theory (such as DF-QuAD) are applied. These semantics propagate the influence of each argument’s base score through the entire meshed tree, iteratively updating the strength of each node, including the root claim. The final strength of the claim is then used to make a true/false prediction, typically by thresholding at 0.5.
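To make the propagation concrete, here is a compact sketch of the DF-QuAD combination rule applied bottom-up over the `ArgumentNode` tree from the earlier sketch. Under DF-QuAD, attacker and supporter strengths are each aggregated as 1 − Π(1 − vᵢ), and the node’s base score τ is then pulled down toward 0 or up toward 1 by the difference between the two aggregates:

```python
import math


def aggregate(strengths: list[float]) -> float:
    """DF-QuAD aggregation: probabilistic sum, 0 for an empty set."""
    return 1.0 - math.prod(1.0 - s for s in strengths)


def dfquad_strength(node: ArgumentNode) -> float:
    """Recursively compute a node's dialectical strength under DF-QuAD.

    Uses ArgumentNode/Polarity from the Step 1 sketch; assumes every
    node's base_score has been filled in by the scoring step.
    """
    v_s = aggregate([dfquad_strength(c) for c in node.children
                     if c.polarity == Polarity.PRO])
    v_a = aggregate([dfquad_strength(c) for c in node.children
                     if c.polarity == Polarity.CON])
    tau = node.base_score
    if v_a >= v_s:
        return tau - tau * (v_a - v_s)        # attackers dominate: weaken
    return tau + (1.0 - tau) * (v_s - v_a)    # supporters dominate: strengthen


# Continuing the tiny example tree: assign base scores from the scoring step
claim.base_score = 0.5
claim.children[0].base_score = 0.9   # strong CON argument
claim.children[1].base_score = 0.4   # weaker PRO argument

strength = dfquad_strength(claim)    # 0.5 - 0.5 * (0.9 - 0.4) = 0.25
verdict = strength >= 0.5            # False: the claim is predicted untrue
```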
Why Justifiability Matters
MArgE’s strength lies in its commitment to justifiability. Unlike unstructured LLM outputs or debates, MArgE’s merged and scored argument tree provides an intuitive and inspectable rationale. The discrete nature of the arguments and their quality scores offers human-interpretable signals, making it easier to understand how a decision was reached. This framework also helps mitigate issues like ‘echo chambers’ in multi-LLM debates or the misleading influence of adversarial agents, as unreliable arguments are explicitly marked down.
Furthermore, by grounding the reasoning in formal argumentative semantics, MArgE ensures that the final verdict is not only transparent but also logically consistent with established theories in computational argumentation.
Performance and Impact
Experimental evaluations on various claim verification datasets (TruthfulClaim, StrategyClaim, and MedClaim) demonstrate that MArgE significantly outperforms single LLMs, ensembles of LLMs, existing ArgLLMs, and even prior methods for unstructured multi-LLM debates. Notably, when a stronger model like GPT-4o mini is used for argument scoring, MArgE shows substantial accuracy improvements, often surpassing the performance of GPT-4o mini itself when prompted directly.
The research highlights that the most significant performance gain comes from allowing the scoring model to evaluate the quality of the claim itself, in addition to the generated arguments. MArgE also exhibits remarkable consistency across different runs, suggesting its suitability for real-world applications where reproducibility is key.
MArgE represents a significant step forward in making AI’s reasoning more transparent, reliable, and justifiable for critical tasks like claim verification. By structuring the collective intelligence of multiple LLMs through formal argumentation, it paves the way for more trustworthy AI systems. You can read the full research paper here.