
A New AI-Powered Pipeline for Transparent and Efficient Full-Text Screening in Systematic Reviews

TL;DR: The paper introduces an auditable pipeline for full-text screening in systematic reviews, combining contrastive semantic highlighting, Mamdani fuzzy logic, and LLM judgment. This approach reframes inclusion/exclusion as a fuzzy decision, achieving significantly higher recall (82.81% macro recall) compared to statistical (62.50%) and crisp (57.81%) baselines, while reducing screening time from 20 minutes to under 1 minute per article and maintaining high human-machine agreement and auditability.

Systematic reviews are a cornerstone of evidence-based medicine and research, providing reliable summaries of scientific knowledge. However, the sheer volume of new publications daily has created a significant bottleneck in the review process, particularly during full-text screening. This stage requires careful reading of lengthy, diverse documents where crucial information can be scattered and often ambiguous, leading to reviewer fatigue and inconsistencies.

The challenge lies in the inherently ‘fuzzy’ nature of full-text decision-making. Unlike simple yes/no questions, eligibility often depends on graded, partial, and distributed evidence. Traditional methods, such as crisp rule-based systems, struggle with this vagueness, offering limited transparency and an inability to represent partial relevance. Probabilistic machine learning, while an improvement, often provides opaque scores without clear explanations, and large language models (LLMs) alone may not guarantee consistent recall or calibrated decisions for borderline cases.

A new pipeline has been developed that addresses these limitations by reframing inclusion/exclusion as a fuzzy decision problem. This innovative system integrates contrastive semantic highlighting, Mamdani fuzzy inference, and LLM judgment to create a scalable and auditable solution.

How the Pipeline Works

The process begins by parsing research papers into overlapping chunks of text. These chunks are then embedded using a domain-adapted model. For each inclusion criterion (such as Population, Intervention, Outcome, Study Approach), the system computes a ‘contrastive similarity’ score. This score rewards alignment with inclusion criteria while penalizing overlap with exclusion criteria, effectively highlighting candidate text spans where evidence might reside.
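The contrastive scoring step can be sketched as follows. This is an illustrative formula, not the paper's exact weighting: it assumes cosine similarity over embedding vectors and a simple subtraction of the best-matching exclusion score (the `alpha` penalty weight is a hypothetical parameter).

```python
import numpy as np

def contrastive_similarity(chunk_vec, include_vecs, exclude_vecs, alpha=1.0):
    """Score a text chunk against one criterion: reward alignment with
    inclusion-criterion embeddings, penalize overlap with exclusion-criterion
    embeddings. Illustrative sketch; the paper's exact formula may differ."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    inc = max(cos(chunk_vec, v) for v in include_vecs)   # best inclusion match
    exc = max(cos(chunk_vec, v) for v in exclude_vecs)   # best exclusion match
    return inc - alpha * exc
```

Chunks scoring highest under this contrast become the highlighted candidate spans passed to the next stage.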

Next, a Mamdani fuzzy inference system takes these similarity scores and an ‘ambiguity margin’ to produce a graded inclusion degree. Instead of a rigid binary decision, fuzzy logic allows for degrees of membership, reflecting the nuanced nature of evidence. This system uses a set of human-aligned rules to weigh signal strength and contextual nuance, mirroring how expert panels reason.
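A minimal Mamdani system of this kind can be sketched as below. The rule base, triangular membership functions, and universes here are hypothetical stand-ins for the paper's human-aligned rules; only the general mechanics (min for AND, max for OR, max-aggregation, centroid defuzzification) follow the standard Mamdani scheme.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-9),
                                 (c - x) / (c - b + 1e-9)), 0.0)

def mamdani_inclusion(similarity, margin):
    """Graded inclusion degree from a similarity score and an ambiguity
    margin (both assumed scaled to [0, 1]). Hypothetical rule base:
      R1: IF similarity high AND margin wide  THEN inclusion high
      R2: IF similarity medium                THEN inclusion medium
      R3: IF similarity low  OR  margin narrow THEN inclusion low"""
    y = np.linspace(0.0, 1.0, 101)            # output universe: inclusion degree
    sim_low  = tri(similarity, -0.5, 0.0, 0.5)
    sim_med  = tri(similarity,  0.0, 0.5, 1.0)
    sim_high = tri(similarity,  0.5, 1.0, 1.5)
    mar_narrow = tri(margin, -0.5, 0.0, 0.5)
    mar_wide   = tri(margin,  0.5, 1.0, 1.5)
    r1 = min(sim_high, mar_wide)              # AND -> min
    r2 = sim_med
    r3 = max(sim_low, mar_narrow)             # OR  -> max
    agg = np.maximum.reduce([                 # clip each output set, aggregate
        np.minimum(r1, tri(y,  0.5, 1.0, 1.5)),
        np.minimum(r2, tri(y,  0.0, 0.5, 1.0)),
        np.minimum(r3, tri(y, -0.5, 0.0, 0.5)),
    ])
    return float((agg * y).sum() / (agg.sum() + 1e-9))  # centroid defuzzification
```

Strong, unambiguous evidence yields a degree near 1; weak or ambiguous evidence yields a low degree rather than an outright rejection.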

Finally, a domain-adapted Large Language Model acts as a judge. It reviews the fuzzy-selected and highlighted text spans, providing a ternary decision (YES/NO/MAYBE), a confidence score, and a concise, criterion-referenced explanation. Crucially, if the LLM deems evidence insufficient, the fuzzy membership is attenuated rather than forced into a hard exclusion, preserving recall for borderline cases. The entire process is logged, ensuring end-to-end traceability and auditability.

Key Advantages and Performance

The fuzzy system demonstrated significant improvements in recall compared to statistical and crisp baselines. In a pilot study on an all-positive gold set of 16 full-text articles, the fuzzy system achieved a macro recall of 82.81% across four criteria, substantially outperforming statistical (62.50%) and crisp (57.81%) baselines. For strict ‘all-criteria’ inclusion, the fuzzy approach included 50% of articles, compared to 25% for the statistical system and 12.5% for the crisp system.
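To make the macro-recall figure concrete: on an all-positive set of 16 articles, per-criterion recall is simply the fraction of articles correctly included for that criterion, and macro recall averages this over the four criteria. The per-criterion split below is hypothetical; only the 16-article set size and the 82.81% average come from the paper.

```python
# Hypothetical per-criterion inclusion counts out of 16 (summing to 53/64,
# which reproduces the reported macro recall); the real split is not given.
per_criterion_included = [14, 13, 13, 13]
recalls = [n / 16 for n in per_criterion_included]
macro_recall = sum(recalls) / len(recalls)
print(round(macro_recall * 100, 2))  # 82.81
```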

Beyond performance, the pipeline offers remarkable efficiency and transparency. A pilot user study showed that screening time per article was reduced from approximately 20 minutes to under 1 minute. The system is also cost-effective, processing around 200 documents for under US$5 in LLM API fees. Furthermore, cross-model agreement (GPT-5 vs. GPT-4.1-mini) on chunk-level justifications was 98.27%, and human-machine agreement was 96.07%, indicating stable and reliable rationales.

This auditable chain from document text to decision, coupled with clear LLM explanations, addresses a major barrier to AI adoption in systematic reviews. The modular and model-agnostic design also ensures scalability and adaptability to various review topics.

In conclusion, by embracing the inherent fuzziness of full-text evidence, this pipeline offers a robust, transparent, and efficient solution for systematic review screening. It transforms a labor-intensive bottleneck into a streamlined process, suitable for ‘living’ reviews and institutional deployment. For more details, you can read the full research paper here.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach out to her at: [email protected]
