SurGE: A New Benchmark for Evaluating Automated Scientific Survey Generation

TLDR: SurGE (Survey Generation Evaluation) is a new benchmark introduced by researchers from Tsinghua University to evaluate automated scientific survey generation in computer science. It comprises test instances with expert-written surveys and cited references, alongside a large academic corpus of over one million papers. SurGE proposes an automated evaluation framework across four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Initial evaluations show that even advanced LLM-based systems struggle with the complexity of the task, highlighting the need for further research in effectively synthesizing and structuring information from a vast literature pool.

The rapid expansion of academic literature has made the traditional method of manually creating scientific survey articles increasingly difficult. These surveys are crucial for summarizing research progress, but the sheer volume of new papers, especially in fields like computer science, overwhelms individual researchers. While large language models (LLMs) show great promise in automating this process, a significant hurdle has been the absence of standardized benchmarks and evaluation methods to properly assess their performance.

To address this critical gap, researchers Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Jiaxin Mao, Ziyi Ye, and Yiqun Liu from Tsinghua University introduced SurGE (Survey Generation Evaluation). This innovative benchmark is specifically designed for evaluating scientific survey generation within the computer science domain. SurGE provides a robust framework for testing and comparing different LLM-based approaches, pushing the boundaries of automated academic writing.

What is SurGE?

SurGE is composed of two main elements:

  • Test Instances: Each instance includes a topic description, an expert-written survey (serving as the ‘ground truth’), and the complete list of its cited references. These ground-truth surveys are carefully selected from highly cited, peer-reviewed survey papers published between 2020 and 2024, ensuring academic significance and reliability.
  • Academic Corpus: A vast collection of over one million computer science papers, primarily sourced from arXiv metadata, acts as the retrieval pool. This corpus allows systems to gather relevant documents for survey generation.

Automated Evaluation Framework

Beyond just providing data, SurGE also introduces an automated evaluation framework that measures generated surveys across four key dimensions:

  • Information Coverage: This assesses how comprehensively a generated survey includes the essential literature cited in the ground-truth survey. It is measured as the recall of ground-truth references (a minimal sketch of this computation follows this list).
  • Referencing Accuracy: This is evaluated hierarchically at the document, section, and sentence levels, checking that cited papers are relevant to the overall topic, placed in appropriate sections, and directly support the specific claims they accompany. An AI model judges relevance, and any cited reference not found in the academic corpus is counted as a ‘hallucination’ (also covered in the sketch below).
  • Structural Organization: This dimension uses two metrics. The Structure Quality Score (SQS) employs a strong LLM (GPT-4o) to holistically score the overall outline quality, while Soft-Heading Recall (SHR) measures fine-grained alignment of headings and subheadings, using semantic similarity to tolerate variations in wording (see the SHR sketch after this list).
  • Content Quality: Traditional metrics such as ROUGE and BLEU are applied at the section level, supplemented by a ‘Logic Score’ from GPT-4o that assesses readability and coherence (a section-level ROUGE sketch also follows this list).
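
To make the coverage and hallucination checks concrete, here is a minimal sketch in Python. The set-based formulation and the function names are illustrative assumptions, not the benchmark’s actual implementation, which may normalize paper identifiers differently:

```python
def coverage_recall(generated_refs: set[str], ground_truth_refs: set[str]) -> float:
    """Fraction of the ground-truth survey's references that the generated survey cites."""
    if not ground_truth_refs:
        return 0.0
    return len(generated_refs & ground_truth_refs) / len(ground_truth_refs)


def hallucinated_refs(generated_refs: set[str], corpus_ids: set[str]) -> set[str]:
    """Cited references that do not exist anywhere in the retrieval corpus."""
    return generated_refs - corpus_ids


# Toy usage with arXiv-style IDs (hypothetical values):
gt = {"2301.00001", "2302.00002", "2303.00003"}
gen = {"2301.00001", "2399.99999"}  # one genuine citation, one invented ID
corpus = {"2301.00001", "2302.00002", "2303.00003"}

print(coverage_recall(gen, gt))        # 0.333... (1 of 3 ground-truth refs covered)
print(hallucinated_refs(gen, corpus))  # {'2399.99999'} -> flagged as a hallucination
```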
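
A soft-recall metric over headings can likewise be sketched as threshold matching on embedding similarity. The encoder choice (all-MiniLM-L6-v2 via sentence-transformers) and the 0.7 threshold are assumptions made for illustration; SurGE’s exact SHR formulation may differ:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not SurGE's documented choice


def soft_heading_recall(gen_headings: list[str], gt_headings: list[str],
                        threshold: float = 0.7) -> float:
    """Count a ground-truth heading as recalled if some generated heading
    is semantically close enough (cosine similarity >= threshold)."""
    if not gt_headings or not gen_headings:
        return 0.0
    gen_emb = model.encode(gen_headings, normalize_embeddings=True)
    gt_emb = model.encode(gt_headings, normalize_embeddings=True)
    sims = gt_emb @ gen_emb.T  # cosine similarities, since embeddings are unit-normalized
    return float((sims.max(axis=1) >= threshold).sum()) / len(gt_headings)


print(soft_heading_recall(
    ["Methods for Retrieval", "Evaluation of Generation", "Open Problems"],
    ["Retrieval Methods", "Generation Quality"],
))
```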
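
For the section-level content metrics, the widely used rouge-score package can serve as a stand-in. Pairing generated sections with ground-truth sections is shown here as a simple zip, which is an assumption about the alignment step:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)


def section_level_rouge(gt_sections: list[str], gen_sections: list[str]) -> float:
    """Average ROUGE-L F1 over aligned (ground-truth, generated) section pairs."""
    pairs = list(zip(gt_sections, gen_sections))
    if not pairs:
        return 0.0
    scores = [scorer.score(gt, gen)["rougeL"].fmeasure for gt, gen in pairs]
    return sum(scores) / len(scores)


print(section_level_rouge(
    ["Retrieval-augmented generation grounds LLM output in retrieved documents."],
    ["RAG systems ground language model outputs in documents fetched by a retriever."],
))
```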

How Surveys are Generated and Evaluated

The task is formalized as a two-stage process: first, a retrieval module collects a set of topic-relevant papers from the academic corpus (Document Collection), and then a generative model composes a structured survey based on these papers, including proper citations (Survey Generation).
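
In code, this two-stage formulation reduces to a simple interface. The sketch below is hypothetical: `retriever` and `generator` stand in for whatever retrieval module and LLM a given system plugs in, and none of these names come from the SurGE release:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Survey:
    topic: str
    sections: list[tuple[str, str]]  # (heading, body text) pairs
    citations: list[str]             # IDs of papers cited in the text


def build_survey(topic: str, corpus: list[dict],
                 retriever: Callable, generator: Callable, k: int = 500) -> Survey:
    # Stage 1 - Document Collection: gather topic-relevant papers from the corpus.
    candidates = retriever(topic, corpus, top_k=k)
    # Stage 2 - Survey Generation: compose a structured, properly cited survey
    # grounded in the retrieved papers.
    return generator(topic, candidates)


# Toy stand-ins so the sketch runs end to end:
toy_corpus = [{"id": "2301.00001", "title": "A Topic-Relevant Paper"}]
retrieve = lambda topic, corpus, top_k: corpus[:top_k]
generate = lambda topic, docs: Survey(topic, [("Introduction", "...")],
                                      [d["id"] for d in docs])

print(build_survey("retrieval-augmented generation", toy_corpus, retrieve, generate))
```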

The researchers evaluated several LLM-based baseline systems, including standard Retrieval-Augmented Generation (RAG), AutoSurvey, and StepSurvey. These baselines all use a shared ‘Paper Retriever’ to find relevant documents before the generation stage.

Key Findings and Challenges

The experiments revealed that even state-of-the-art systems face significant challenges in survey generation. A major bottleneck was identified in the generation stage itself: while the Paper Retriever could make a substantial portion of ground-truth references available, the LLMs struggled to incorporate that material into the final survey. For instance, the best-performing baseline, StepSurvey, achieved a final Coverage of only 6.30%, even though the retriever made 36.65% of the ground-truth references available.

However, the study also highlighted the strengths of more advanced, structured approaches:

  • AutoSurvey excelled in Section-Level and Sentence-Level Relevance, indicating that its iterative planning and section-by-section refinement lead to more accurate citation placement. It also achieved the highest Structure Quality Score.
  • StepSurvey performed best in overall Coverage and Document-Level Relevance, suggesting its multi-phase workflow is effective at broadening the scope of included works and aligning cited literature with the main survey theme. It also showed strong content quality and logical coherence.

These findings underscore the importance of iterative refinement and structured planning in automated survey generation. The benchmark demonstrates that while LLMs are powerful, there’s substantial room for improvement, particularly in boosting coverage and refining the interplay between local and global survey organization. SurGE is an open-source project, with all code, data, and models available on GitHub, fostering reproducible research and future advancements in this challenging area. You can find more details about this research paper here.
