TLDR: The Guess2Graph (G2G) framework introduces a novel method for integrating unreliable expert knowledge into causal discovery algorithms. Instead of replacing statistical tests, G2G uses expert guesses to guide the *sequence* of those tests, preserving statistical consistency while significantly improving performance in finite-sample settings. Two implementations, PC-Guess and gPC-Guess, demonstrate that algorithmic redesign (gPC-Guess) yields the larger gains, even with large language model experts, delivering improvements that grow monotonically with expert accuracy and remain bounded when the expert is wrong.
Causal discovery, the process of uncovering cause-and-effect relationships from data, is a cornerstone of scientific understanding and decision-making. However, a significant challenge arises when dealing with limited data samples: traditional causal discovery algorithms often struggle to perform accurately. This limitation can lead to unstable or inaccurate causal graphs, sometimes even contradicting established domain knowledge.
A promising avenue to overcome these finite-sample issues is the integration of expert knowledge. Historically, this has involved human experts providing constraints to guide the discovery process. More recently, large language models (LLMs) have emerged as potential scalable proxies for human experts, capable of suggesting causal constraints based on their vast training data. Yet, both human experts and LLMs are fallible; their input can be biased, inconsistent, or outright incorrect. Existing methods that incorporate such unreliable expert knowledge, either as hard constraints or soft priors, often lack theoretical guarantees and can even lead to unbounded errors if the expert advice is misleading.
Introducing Guess2Graph (G2G)
A new framework, called Guess2Graph (G2G), addresses this critical problem by proposing a principled approach to leverage fallible expert knowledge without sacrificing statistical rigor. The core idea behind G2G is to use expert guesses to guide the *sequence* of statistical tests performed by causal discovery algorithms, rather than replacing these tests or imposing rigid constraints. This ensures that all decisions remain grounded in statistical evidence, preserving the fundamental soundness of the algorithms.
The G2G framework is built upon three key criteria:
- Statistical Consistency (C1): Regardless of the expert’s quality, the algorithm is guaranteed to recover the true causal graph as the sample size grows.
- Monotonic Improvement (C2): The algorithm’s performance in finite-sample settings improves consistently as the expert’s accuracy increases.
- Finite-Sample Robustness (C3): There’s an expert accuracy threshold (e.g., better than random) above which the algorithm’s performance with expert guidance is guaranteed to be no worse than without it.
How G2G Works: Guiding the Test Sequence
Many causal discovery algorithms involve subroutines that perform sequences of statistical tests, often in a random order. G2G identifies these subroutines and replaces the random sampling with an expert-guided ordering. For instance, in constraint-based methods, G2G uses an expert’s predicted causal structure to prioritize which edges to test first. If an expert believes an edge is false (i.e., does not exist), G2G will test that edge earlier. Correctly removing false edges early on can simplify subsequent tests by reducing the size of adjacency sets, which are crucial for determining conditional independencies.
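The edge-ordering idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `expert_guided_order`, the toy edge list, and the `expert_absent` set are all assumptions made for the example.

```python
def expert_guided_order(candidate_edges, expert_absent):
    """Reorder skeleton-phase edge tests so that edges the expert
    predicts are absent come first.

    Only the priority changes: the same conditional-independence
    tests are still run on every edge, so statistical decisions
    remain grounded in the data. Correct early removals shrink the
    adjacency sets used by later, higher-order tests.
    """
    predicted_false = [e for e in candidate_edges if e in expert_absent]
    predicted_true = [e for e in candidate_edges if e not in expert_absent]
    return predicted_false + predicted_true


# Toy example: four candidate edges, expert doubts two of them.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
expert_absent = {("A", "C"), ("C", "D")}  # fallible guess, may be wrong

print(expert_guided_order(edges, expert_absent))
```

Note that a wrong guess only costs test ordering, not correctness: an edge the expert wrongly doubts is still retained if the data never show conditional independence.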
The framework also considers guiding the ‘Edge Prune’ subroutine, which tests individual edges with various conditioning sets. While the order of these tests doesn’t affect accuracy, it can significantly impact runtime. G2G can prioritize conditioning sets that the expert predicts are ‘d-separating’ (meaning they render two variables conditionally independent), leading to faster discovery.
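Prioritizing a predicted d-separating set can be sketched as follows. Again this is a hedged illustration under assumed names (`ordered_conditioning_sets`, `predicted_sep`), not the authors' code: it enumerates the usual subsets of an edge's adjacency set but moves the expert's predicted separating set to the front, so a correct guess ends the search for that edge sooner.

```python
from itertools import combinations

def ordered_conditioning_sets(adj, predicted_sep):
    """Enumerate conditioning sets for one edge, expert's guess first.

    `adj` is the adjacency set for the pair under test; `predicted_sep`
    is the set the expert believes renders the pair conditionally
    independent. Ordering affects only runtime: whichever set first
    yields conditional independence removes the edge either way.
    """
    sets = []
    for k in range(len(adj) + 1):
        for s in combinations(sorted(adj), k):
            sets.append(frozenset(s))
    pred = frozenset(predicted_sep)
    if pred in sets:
        sets.remove(pred)
        sets.insert(0, pred)
    return sets


# Edge (X, Y) with neighbours {Z, W}; expert guesses {Z} d-separates.
print(ordered_conditioning_sets({"Z", "W"}, {"Z"}))
```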
Two Implementations: PC-Guess and gPC-Guess
The researchers developed two specific implementations of G2G:
- PC-Guess: This augments the well-known PC algorithm. While it maintains statistical consistency and shows some performance gains with expert accuracy, its improvements are modest. This is because the PC algorithm’s rigid, level-by-level structure (prioritizing smaller conditioning sets first) limits how much it can benefit from expert guidance, even perfect guidance.
- gPC-Guess: This is a redesigned variant of the PC algorithm, specifically engineered to be more receptive to expert input. By removing the level-by-level constraint, gPC-Guess can act immediately on expert predictions, allowing false edges with larger minimal d-separating sets to be removed earlier. This design fully achieves all three criteria (C1-C3) and offers provable end-to-end finite-sample performance improvements that increase monotonically with expert quality.
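The structural difference between the two variants can be sketched as two scheduling policies. This is a simplified sketch, not the paper's algorithms: tests are modeled as `(edge, conditioning_set)` pairs, and `expert_first` is an assumed set of tests the expert suggests running immediately.

```python
def pc_style_schedule(tests):
    """PC-style: strictly level by level, smaller conditioning sets
    first. Expert guidance can only reorder tests within a level."""
    return sorted(tests, key=lambda t: len(t[1]))

def gpc_guess_schedule(tests, expert_first):
    """gPC-Guess-style sketch: expert-suggested tests run immediately,
    even when their conditioning sets are large, so false edges with
    larger minimal d-separating sets can be removed early. Remaining
    tests keep the usual size order."""
    suggested = [t for t in tests if t in expert_first]
    rest = sorted((t for t in tests if t not in expert_first),
                  key=lambda t: len(t[1]))
    return suggested + rest


tests = [
    (("A", "B"), frozenset()),
    (("A", "C"), frozenset({"B"})),
    (("B", "D"), frozenset({"A", "C"})),  # size-2 conditioning set
]
expert_first = {(("B", "D"), frozenset({"A", "C"}))}

# PC-style must wait two levels before the size-2 test; the
# gPC-Guess-style schedule runs it first on the expert's suggestion.
print(gpc_guess_schedule(tests, expert_first))
```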
Empirical Validation and Real-World Impact
Experiments on both synthetic and real-world datasets (like the Sachs protein signaling data) validate the theoretical distinctions. PC-Guess showed modest gains (up to 5%), confirming the limitations of simply augmenting existing rigid algorithms. In contrast, gPC-Guess achieved significantly stronger gains, with up to 30% performance improvements when experts were accurate. These results held true even when using a large language model expert (Claude Opus 4.1), where gPC-Guess achieved a 15% performance boost over baselines.
Further experiments confirmed that all methods converge to perfect accuracy with increasing sample size (C1). The value of expert guidance also increased in high-dimensional, low-sample settings, where data-driven methods typically struggle. Importantly, even when expert predictions were worse than random, the performance drop was bounded (around 8%), demonstrating the robustness of the G2G framework compared to traditional methods that risk unbounded error.
The Guess2Graph framework, particularly its gPC-Guess instantiation, offers a robust and effective way to integrate fallible expert knowledge into causal discovery. By guiding the sequence of statistical tests rather than replacing them, it ensures statistical consistency while unlocking significant performance improvements in practical, finite-sample scenarios. For more details, see the full research paper: Guess2Graph: When and How Can Unreliable Experts Safely Boost Causal Discovery in Finite Samples?


