SurGE: A New Benchmark for Evaluating Automated Scientific Survey Generation

TLDR: SurGE (Survey Generation Evaluation) is a new benchmark introduced by researchers from Tsinghua University to evaluate automated scientific survey generation in computer science. It comprises test instances with expert-written surveys and cited references, alongside a large academic corpus of over one million papers. SurGE proposes an automated evaluation framework across four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Initial evaluations show that even advanced LLM-based systems struggle with the complexity of the task, highlighting the need for further research in effectively synthesizing and structuring information from a vast literature pool.

The rapid expansion of academic literature has made the traditional method of manually creating scientific survey articles increasingly difficult. These surveys are crucial for summarizing research progress, but the sheer volume of new papers, especially in fields like computer science, overwhelms individual researchers. While large language models (LLMs) show great promise in automating this process, a significant hurdle has been the absence of standardized benchmarks and evaluation methods to properly assess their performance.

To address this critical gap, researchers Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Jiaxin Mao, Ziyi Ye, and Yiqun Liu from Tsinghua University introduced SurGE (Survey Generation Evaluation). This innovative benchmark is specifically designed for evaluating scientific survey generation within the computer science domain. SurGE provides a robust framework for testing and comparing different LLM-based approaches, pushing the boundaries of automated academic writing.

What is SurGE?

SurGE is composed of two main elements:

  • Test Instances: Each instance includes a topic description, an expert-written survey (serving as the ‘ground truth’), and the complete list of its cited references. These ground-truth surveys are carefully selected from highly cited, peer-reviewed survey papers published between 2020 and 2024, ensuring academic significance and reliability.
  • Academic Corpus: A vast collection of over one million computer science papers, primarily sourced from arXiv metadata, acts as the retrieval pool. This corpus allows systems to gather relevant documents for survey generation.

Automated Evaluation Framework

Beyond just providing data, SurGE also introduces an automated evaluation framework that measures generated surveys across four key dimensions:

  • Information Coverage: This assesses how comprehensively a generated survey includes the essential literature cited in the ground-truth survey. It is measured as the recall of ground-truth references (a minimal sketch of this computation follows this list).
  • Referencing Accuracy: This is evaluated hierarchically at the document, section, and sentence levels, checking that cited papers are relevant to the overall topic, placed in appropriate sections, and directly support the specific claims they accompany. An AI model judges relevance, and any cited reference not found in the academic corpus is counted as a ‘hallucination’ (also covered in the sketch below).
  • Structural Organization: This dimension uses two metrics. The Structure Quality Score (SQS) employs a strong LLM (GPT-4o) to holistically score the overall outline quality, while Soft-Heading Recall (SHR) measures fine-grained alignment of headings and subheadings, using semantic similarity to tolerate variations in wording (see the SHR sketch after this list).
  • Content Quality: Traditional metrics such as ROUGE and BLEU are applied at the section level, supplemented by a ‘Logic Score’ from GPT-4o that assesses readability and coherence (a section-level ROUGE sketch also follows this list).
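
To make the coverage and hallucination checks concrete, here is a minimal sketch in Python. The set-based formulation and the function names are illustrative assumptions, not the benchmark’s actual implementation, which may normalize paper identifiers differently:

```python
def coverage_recall(generated_refs: set[str], ground_truth_refs: set[str]) -> float:
    """Fraction of the ground-truth survey's references that the generated survey cites."""
    if not ground_truth_refs:
        return 0.0
    return len(generated_refs & ground_truth_refs) / len(ground_truth_refs)


def hallucinated_refs(generated_refs: set[str], corpus_ids: set[str]) -> set[str]:
    """Cited references that do not exist anywhere in the retrieval corpus."""
    return generated_refs - corpus_ids


# Toy usage with arXiv-style IDs (hypothetical values):
gt = {"2301.00001", "2302.00002", "2303.00003"}
gen = {"2301.00001", "2399.99999"}  # one genuine citation, one invented ID
corpus = {"2301.00001", "2302.00002", "2303.00003"}

print(coverage_recall(gen, gt))        # 0.333... (1 of 3 ground-truth refs covered)
print(hallucinated_refs(gen, corpus))  # {'2399.99999'} -> flagged as a hallucination
```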
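
A soft-recall metric over headings can likewise be sketched as threshold matching on embedding similarity. The encoder choice (all-MiniLM-L6-v2 via sentence-transformers) and the 0.7 threshold are assumptions made for illustration; SurGE’s exact SHR formulation may differ:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not SurGE's documented choice


def soft_heading_recall(gen_headings: list[str], gt_headings: list[str],
                        threshold: float = 0.7) -> float:
    """Count a ground-truth heading as recalled if some generated heading
    is semantically close enough (cosine similarity >= threshold)."""
    if not gt_headings or not gen_headings:
        return 0.0
    gen_emb = model.encode(gen_headings, normalize_embeddings=True)
    gt_emb = model.encode(gt_headings, normalize_embeddings=True)
    sims = gt_emb @ gen_emb.T  # cosine similarities, since embeddings are unit-normalized
    return float((sims.max(axis=1) >= threshold).sum()) / len(gt_headings)


print(soft_heading_recall(
    ["Methods for Retrieval", "Evaluation of Generation", "Open Problems"],
    ["Retrieval Methods", "Generation Quality"],
))
```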
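
For the section-level content metrics, the widely used rouge-score package can serve as a stand-in. Pairing generated sections with ground-truth sections is shown here as a simple zip, which is an assumption about the alignment step:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)


def section_level_rouge(gt_sections: list[str], gen_sections: list[str]) -> float:
    """Average ROUGE-L F1 over aligned (ground-truth, generated) section pairs."""
    pairs = list(zip(gt_sections, gen_sections))
    if not pairs:
        return 0.0
    scores = [scorer.score(gt, gen)["rougeL"].fmeasure for gt, gen in pairs]
    return sum(scores) / len(scores)


print(section_level_rouge(
    ["Retrieval-augmented generation grounds LLM output in retrieved documents."],
    ["RAG systems ground language model outputs in documents fetched by a retriever."],
))
```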

How Surveys are Generated and Evaluated

The task is formalized as a two-stage process: first, a retrieval module collects a set of topic-relevant papers from the academic corpus (Document Collection), and then a generative model composes a structured survey based on these papers, including proper citations (Survey Generation).
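
In code, this two-stage formulation reduces to a simple interface. The sketch below is hypothetical: `retriever` and `generator` stand in for whatever retrieval module and LLM a given system plugs in, and none of these names come from the SurGE release:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Survey:
    topic: str
    sections: list[tuple[str, str]]  # (heading, body text) pairs
    citations: list[str]             # IDs of papers cited in the text


def build_survey(topic: str, corpus: list[dict],
                 retriever: Callable, generator: Callable, k: int = 500) -> Survey:
    # Stage 1 - Document Collection: gather topic-relevant papers from the corpus.
    candidates = retriever(topic, corpus, top_k=k)
    # Stage 2 - Survey Generation: compose a structured, properly cited survey
    # grounded in the retrieved papers.
    return generator(topic, candidates)


# Toy stand-ins so the sketch runs end to end:
toy_corpus = [{"id": "2301.00001", "title": "A Topic-Relevant Paper"}]
retrieve = lambda topic, corpus, top_k: corpus[:top_k]
generate = lambda topic, docs: Survey(topic, [("Introduction", "...")],
                                      [d["id"] for d in docs])

print(build_survey("retrieval-augmented generation", toy_corpus, retrieve, generate))
```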

The researchers evaluated several LLM-based baseline systems, including standard Retrieval-Augmented Generation (RAG), AutoSurvey, and StepSurvey. These baselines all use a shared ‘Paper Retriever’ to find relevant documents before the generation stage.

Key Findings and Challenges

The experiments revealed that even state-of-the-art systems face significant challenges in survey generation. A major bottleneck was identified in the generation stage itself: while the Paper Retriever could make a substantial portion of ground-truth references available, the LLMs struggled to incorporate that material into the final survey. For instance, the best-performing baseline, StepSurvey, achieved a final Coverage of only 6.30%, even though the retriever made 36.65% of the ground-truth references available.

However, the study also highlighted the strengths of more advanced, structured approaches:

  • AutoSurvey excelled in Section-Level and Sentence-Level Relevance, indicating that its iterative planning and section-by-section refinement lead to more accurate citation placement. It also achieved the highest Structure Quality Score.
  • StepSurvey performed best in overall Coverage and Document-Level Relevance, suggesting its multi-phase workflow is effective at broadening the scope of included works and aligning cited literature with the main survey theme. It also showed strong content quality and logical coherence.

These findings underscore the importance of iterative refinement and structured planning in automated survey generation. The benchmark demonstrates that while LLMs are powerful, there’s substantial room for improvement, particularly in boosting coverage and refining the interplay between local and global survey organization. SurGE is an open-source project, with all code, data, and models available on GitHub, fostering reproducible research and future advancements in this challenging area. You can find more details about this research paper here.
