TLDR: SIEVE is a community-driven framework that introduces “Confidence Cards” to provide machine-readable, verifiable, and statistically-backed certifications for the quality of code datasets. It addresses the current lack of auditable quality guarantees in public code datasets by enabling continuous, transparent auditing through sponsors, validators, and arbiters, leveraging smart contracts for trust and reproducibility, ultimately reducing duplicated cleaning efforts and increasing trust in data.
In the world of artificial intelligence and software engineering, code datasets are fundamental. They power everything from recommendation systems to advanced code-generation tools. However, a significant challenge persists: ensuring the verifiable quality of these datasets. Currently, many public code datasets lack reliable quality guarantees, making it difficult to trust their completeness, cleanliness, or legal compliance. Existing “dataset cards” offer some information, but they are often not auditable and don’t provide statistical assurances, leading to fragmented and costly ad-hoc cleaning efforts by individual teams.
Addressing this critical gap, researchers Fatou Ndiaye MBODJI, El-hacen Diallo, Jordan SAMHI, Kui Liu, Jacques KLEIN, and Tegawendé F. BISSYANDE have introduced SIEVE: a novel, community-driven framework designed to bring verifiable certification to code datasets. SIEVE aims to transform traditional, narrative dataset cards into “Confidence Cards”—machine-readable, verifiable certificates that offer anytime-valid statistical bounds on dataset properties.
Understanding the Problem
The research highlights that while other critical domains like chip design certify quality before use, datasets, which are the bedrock of empirical science, often lack transparent, machine-verifiable certification. Early efforts like “Datasheets for Datasets” and “Data Cards” aimed to standardize human-readable documentation. More recent work, such as Open Datasheets and Croissant-RAI, has focused on machine-readable metadata. However, these initiatives rely heavily on adoption by dataset providers, and in reality, comprehensive dataset documentation remains scarce. An audit of Hugging Face dataset cards, for instance, revealed that only a small percentage of repositories had non-empty cards, and critical sections detailing biases and limitations were often minimal.
Code datasets present unique challenges because they are executable artifacts. Auditing them requires reconstructing complex toolchains, managing dependencies, and running builds and tests, whose outcomes can change over time due to evolving ecosystems, deprecated APIs, and new vulnerabilities. A survey conducted by the researchers identified key properties required by code datasets, such as buildability, test smoke, link validity, dependency health, and license resolution. The survey revealed that popular code dataset cards often fail to document these crucial properties adequately.
Introducing SIEVE: A Framework for Verifiable Certification
SIEVE is proposed as a pioneering solution to provide transparent, machine-verifiable, per-property certificates for code datasets, reporting quality with anytime-valid statistical bounds. The framework is designed to empower a consortium of stakeholders to collaboratively refine dataset quality and properties continuously, replacing ad-hoc cleaning with a proactive certification layer.
The SIEVE framework involves several key actors and components:
- Sponsors: These entities submit datasets for audit, cover processing costs, and provide incentives for validators. They also define acceptable error bounds and coverage parameters for properties.
- Validators: Dataset users (e.g., researchers, engineers) who derive public samples, run lightweight property checks (oracles) on these samples, and may define properties aligned with their needs.
- Arbiters: These reproduce validator evidence, aggregate results, and attest to the current confidence score. The role can be filled by different parties, from academic reviewers to AI models, and their outputs are auditable.
- Smart Contract: SIEVE leverages a smart contract as a trust anchor to ensure transparency and verifiability. It stores dataset and audit rules, fixes public randomness for unbiased sampling, manages funds, and maintains an append-only log of attestations and challenges. All actual checks run off-chain, with only commitments stored on-chain.
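To illustrate how a fixed public seed and on-chain commitments could make sampling and evidence reproducible, here is a minimal Python sketch. It is not SIEVE's actual protocol: the function names `sample_indices` and `commitment`, the choice of SHA-256, and the JSON evidence layout are all assumptions made for illustration.

```python
import hashlib
import json


def sample_indices(seed: bytes, dataset_size: int, k: int) -> list[int]:
    """Derive k distinct item indices from a public seed.

    Anyone holding the same seed recomputes the same sample, so a
    validator cannot cherry-pick items. (Hypothetical scheme; the
    modulo step has a negligible bias for a 256-bit digest.)
    """
    if k > dataset_size:
        raise ValueError("cannot sample more items than the dataset holds")
    chosen: list[int] = []
    seen: set[int] = set()
    counter = 0
    while len(chosen) < k:
        # Hash seed || counter to get the next pseudo-random draw.
        digest = hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        idx = int.from_bytes(digest, "big") % dataset_size
        counter += 1
        if idx not in seen:  # skip repeats: sample without replacement
            seen.add(idx)
            chosen.append(idx)
    return chosen


def commitment(evidence: dict) -> str:
    """Return a hex digest of canonically serialized evidence.

    Only this digest would be stored on-chain; the evidence itself
    stays off-chain and can be checked against the digest later.
    """
    payload = json.dumps(evidence, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because both functions are deterministic, an arbiter can rerun `sample_indices` with the locked seed and recompute `commitment` over the published evidence to confirm that nothing was altered.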
Confidence Cards: The Core of SIEVE
A “Confidence Card” is a machine-readable record for a specific dataset version and a binary property (e.g., buildability: violation/no-violation). It provides current evidence, including the sample count, observed violations, a live interval for the true violation rate, and a decision state (Clean, Dirty, or Pending). These cards are updated as more items are checked and can be replayed by any third party. SIEVE uses anytime-valid confidence sequences, which provide a continuously valid interval for the true violation rate, regardless of when monitoring stops.
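As a rough sketch of what such a record might look like, the Python below models a card and computes a live interval. The interval here is a deliberately simple, conservative construction (a Hoeffding bound combined with a union bound over time, spending error budget alpha / (n(n+1)) at step n); it is a stand-in chosen for clarity, not necessarily the confidence sequence the SIEVE authors use. The class name `ConfidenceCard` and its field names are assumptions.

```python
import math
from dataclasses import dataclass


def confidence_sequence(n: int, violations: int, alpha: float = 0.05) -> tuple[float, float]:
    """Interval for the true violation rate that holds at every n at once.

    Stand-in construction: Hoeffding's inequality at each step with
    error budget alpha / (n * (n + 1)). These budgets sum to alpha over
    all n >= 1, so the chance that the interval ever misses the true
    rate is at most alpha, whenever monitoring stops.
    """
    if n <= 0:
        return (0.0, 1.0)  # no evidence yet: the trivial interval
    alpha_n = alpha / (n * (n + 1))
    radius = math.sqrt(math.log(2.0 / alpha_n) / (2.0 * n))
    p_hat = violations / n
    return (max(0.0, p_hat - radius), min(1.0, p_hat + radius))


@dataclass
class ConfidenceCard:
    """Hypothetical card for one dataset version and one binary property."""
    dataset_version: str
    property_name: str
    samples: int = 0
    violations: int = 0
    alpha: float = 0.05

    def record(self, violated: bool) -> None:
        """Fold one oracle outcome into the running counts."""
        self.samples += 1
        self.violations += int(violated)

    @property
    def interval(self) -> tuple[float, float]:
        """Current bounds on the true violation rate."""
        return confidence_sequence(self.samples, self.violations, self.alpha)
```

A third party replaying the same sequence of outcomes through `record` ends up with the same counts and therefore the same interval, which is what makes a card of this kind checkable. Tighter confidence sequences exist and would reach a decision with fewer samples.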
The SIEVE workflow begins with sponsors submitting datasets along with defined properties and oracles. A smart contract locks a public randomness seed for unbiased sampling. Validators then derive the sampled items from that seed, compute property outcomes, and publish evidence off-chain. Arbiters reproduce this evidence, update the Confidence Card, and co-sign attestations. A stopping rule determines whether a dataset is “Clean” (the upper bound of the violation rate is below the accepted error), “Dirty” (the lower bound is above the accepted error), or “Pending” (neither condition holds yet). Once a terminal decision is reached, the per-property card is stored and referenced on-chain.
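The stopping rule described above can be sketched in a few lines of Python. The names `Decision` and `decide` are hypothetical, and the interval bounds are assumed to come from whatever confidence sequence the card maintains.

```python
from enum import Enum


class Decision(Enum):
    CLEAN = "Clean"
    DIRTY = "Dirty"
    PENDING = "Pending"


def decide(lower: float, upper: float, accepted_error: float) -> Decision:
    """Map the current interval on the violation rate to a decision state.

    Clean:   the whole interval lies below the accepted error.
    Dirty:   the whole interval lies above the accepted error.
    Pending: the interval still straddles the threshold, so more
             samples are needed before a terminal decision.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    if upper < accepted_error:
        return Decision.CLEAN
    if lower > accepted_error:
        return Decision.DIRTY
    return Decision.PENDING
```

For instance, with an accepted error of 0.05, an interval of (0.0, 0.03) is Clean, (0.08, 0.20) is Dirty, and (0.01, 0.10) remains Pending.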
This approach promises less duplicated cleaning effort, lower onboarding costs for downstream users (as cards are portable), and increased trust for all stakeholders, as decisions are auditable and resistant to manipulation, all without requiring full rescans of entire datasets.
Future Directions
The researchers outline future plans to mature SIEVE, focusing on integrating it into developer tools (like VS Code/JetBrains) to capture evidence with minimal friction, improving efficiency and cost tracking, and deploying multi-dataset pilots to demonstrate its value in real-world settings. The ultimate goal is a reproducible pipeline in which evidence capture is nearly effortless, cards certify properties with anytime-valid bounds, and both the reduction in duplicated cleaning effort and the gain in trust can be measured.
SIEVE represents a significant step towards a more trustworthy and efficient ecosystem for code datasets, transforming subjective claims into objective, verifiable statistical certificates. For more details, see the full research paper.


