
Introducing Secu-Table: A New Dataset for AI to Understand Cybersecurity Data

TLDR: The Secu-Table dataset is a new, comprehensive collection of over 1,500 security-related tables, extracted from CVE and CWE sources and annotated against Wikidata and the SEPSES CSKG. It aims to address the lack of domain-specific tabular datasets for evaluating semantic table interpretation (STI) systems, especially those based on Large Language Models (LLMs), in cybersecurity. The dataset includes controlled errors to test AI robustness and is publicly available for the SemTab 2025 challenge, with plans for quarterly updates and expansion.

In the rapidly evolving landscape of cybersecurity, the ability of artificial intelligence systems, particularly large language models (LLMs), to understand and interpret complex security data presented in tables is crucial. However, a significant challenge has been the lack of publicly available, domain-specific tabular datasets for evaluating these systems. This gap makes it difficult to accurately assess how well AI can process and make sense of security-related information.

Addressing this critical need, a new research paper introduces the Secu-Table dataset, a comprehensive collection designed specifically for evaluating semantic table interpretation (STI) systems in the security domain. This dataset is a major step forward, providing a standardized resource for researchers and developers working on AI applications in cybersecurity.

What is Secu-Table?

Secu-Table is a robust dataset comprising over 1,500 tables and more than 15,000 entities. These tables are meticulously constructed from security data extracted from well-known sources such as Common Vulnerabilities and Exposures (CVE) and Common Weakness Enumeration (CWE), both fundamental to understanding and categorizing cybersecurity threats and weaknesses.

The dataset is not just a collection of tables; it’s also richly annotated. The annotations link table elements to existing knowledge graphs (KGs), specifically Wikidata, a general-purpose knowledge base, and the SEmantic Processing of Security Event Streams CyberSecurity Knowledge Graph (SEPSES CSKG), a specialized security KG. This annotation process adds semantic meaning to the raw tabular data, allowing AI systems to understand the context and relationships within the data.

Why is Secu-Table Important?

Current security datasets often suffer from several limitations. They are scattered across the internet in various formats (CSV, JSON, XML), making it hard to get a unified view. Many focus on narrow attack vectors, limiting their general applicability, and often lack the detailed annotations necessary for training advanced supervised learning models. Secu-Table tackles these issues by providing a consolidated, richly annotated, and structured resource.
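To make the consolidation step concrete, here is a minimal sketch of the kind of normalization involved: flattening a JSON-formatted vulnerability record into a CSV row. The field names are illustrative assumptions, not the actual CVE/NVD schema or Secu-Table's pipeline.

```python
import csv
import io
import json

# Illustrative JSON record; the keys are hypothetical, not the real CVE schema.
record = json.loads(
    '{"id": "CVE-2021-44228", "cwe": "CWE-502", "severity": "critical"}'
)

# Normalize into a CSV row, the unified tabular form Secu-Table targets.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "cwe", "severity"])
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())
```

The same pattern generalizes to XML sources: parse each record into a flat dictionary, then emit one CSV row per record.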

The creation of Secu-Table is particularly relevant in the context of the SemTab challenge, an international initiative aimed at benchmarking systems that match tabular data to knowledge graphs. The 2025 SemTab challenge will specifically use Secu-Table to evaluate the performance of various STI systems, especially those built on open-source LLMs.

How Was Secu-Table Built?

The construction of Secu-Table involved a multi-step process. Initially, data curators with strong backgrounds in semantic web and knowledge graphs were recruited. They identified CVE and CWE as primary data sources. The data was then manually parsed from various formats into CSV files, forming the raw tables. These tables were then manually annotated, a meticulous and time-consuming task, to ensure high quality. The annotation process involved three key tasks:

  • Cell Entity Annotation (CEA): Mapping individual table elements to entities in a knowledge graph.
  • Column Type Annotation (CTA): Identifying the types of data contained within table columns.
  • Column Property Annotation (CPA): Defining the relationships between different columns in a table.
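The three tasks above can be sketched as SemTab-style target records, where each annotation ties a table coordinate to a knowledge-graph IRI. The table IDs, column indices, and Wikidata-style IRIs below are placeholders for illustration, not entries from the actual Secu-Table ground truth.

```python
# CEA: (table_id, row, col) -> entity IRI (placeholder values)
cea = [
    ("CVE_table_001", 1, 0, "http://www.wikidata.org/entity/Q000001"),
]

# CTA: (table_id, col) -> type IRI (placeholder values)
cta = [
    ("CVE_table_001", 0, "http://www.wikidata.org/entity/Q000002"),
]

# CPA: (table_id, col_from, col_to) -> property IRI (placeholder values)
cpa = [
    ("CVE_table_001", 0, 1, "http://www.wikidata.org/prop/direct/P0001"),
]

def cea_lookup(records, table_id, row, col):
    """Return the annotated entity IRI for one cell, or None if unannotated."""
    for t, r, c, iri in records:
        if (t, r, c) == (table_id, row, col):
            return iri
    return None

print(cea_lookup(cea, "CVE_table_001", 1, 0))
```

An STI system is then scored by comparing its predicted IRIs against such ground-truth records, task by task.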

To make the dataset a robust benchmark for AI systems, various errors and ambiguities were intentionally introduced. The tables mix error-free data (20%) with missing context (26%), misspelling errors (26.6%), and annotation errors (26.26%). This simulates real-world data challenges, pushing LLMs to perform well even in imperfect scenarios.
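Two of these error categories can be sketched with simple perturbation functions. This is an illustration of the general idea of controlled noise injection, not the procedure the authors actually used.

```python
import random

def inject_misspelling(cell: str, rng: random.Random) -> str:
    """Swap two adjacent characters to simulate a misspelling error."""
    if len(cell) < 2:
        return cell
    i = rng.randrange(len(cell) - 1)
    return cell[:i] + cell[i + 1] + cell[i] + cell[i + 2:]

def inject_missing_context(row: list, rng: random.Random) -> list:
    """Blank out one cell to simulate missing context."""
    j = rng.randrange(len(row))
    return ["" if k == j else v for k, v in enumerate(row)]

rng = random.Random(42)
print(inject_misspelling("buffer overflow", rng))
print(inject_missing_context(["CVE-2021-44228", "CWE-502", "critical"], rng))
```

Seeding the random generator keeps the perturbations reproducible, which matters when the same corrupted tables must be shared across challenge participants.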


Availability and Future Plans

The Secu-Table dataset is publicly available for research purposes on Hugging Face. All associated code is also released on GitLab under the MIT license, ensuring transparency and reproducibility. The current version, Secu-Table v2, includes 1,554 tables, with a subset of 76 tables serving as ground truth for initial evaluations and the rest for testing.

The researchers plan quarterly updates, expanding the dataset to include more security data sources such as Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK), Common Configuration Enumeration (CCE), Common Platform Enumeration (CPE), Common Vulnerability Scoring System (CVSS), Open Worldwide Application Security Project (OWASP), and Security Content Automation Protocol (SCAP). They are also exploring semi-automatic annotation approaches to scale the dataset further.

Preliminary evaluations using open-source LLMs like Falcon3-7b-instruct and Mistral-7B-Instruct, as well as the closed-source GPT-4o mini, have already been conducted, establishing a baseline for future research. This dataset promises to significantly advance the field of semantic table interpretation in cybersecurity, enabling more intelligent and robust AI systems. You can find more details about this work in the full research paper: Secu-Table: a Comprehensive security table dataset for evaluating semantic table interpretation systems.
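Baselines like these are typically compared using precision, recall, and F1 over the annotation tasks. The following is a simplified sketch of such scoring for CEA, not the official SemTab evaluator.

```python
def cea_score(gold: dict, pred: dict):
    """Precision, recall, and F1 for cell entity annotation.

    gold and pred map (table_id, row, col) coordinates to entity IRIs.
    """
    correct = sum(1 for cell, iri in pred.items() if gold.get(cell) == iri)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: the system annotates one of two gold cells correctly.
gold = {("t1", 0, 0): "Q1", ("t1", 1, 0): "Q2"}
pred = {("t1", 0, 0): "Q1"}
print(cea_score(gold, pred))
```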

Dev Sundaram
Dev Sundaram is an investigative tech journalist with a nose for exclusives and leaks. With stints in cybersecurity and enterprise AI reporting, Dev thrives on breaking big stories—product launches, funding rounds, regulatory shifts—and giving them context. He believes journalism should push the AI industry toward transparency and accountability, especially as Generative AI becomes mainstream. You can reach him at: [email protected]
