
Introducing Secu-Table: A New Dataset for AI to Understand Cybersecurity Data

TLDR: The Secu-Table dataset is a new, comprehensive collection of over 1,500 security-related tables, extracted from CVE and CWE sources and annotated against Wikidata and the SEPSES CSKG. It aims to address the lack of domain-specific tabular datasets for evaluating semantic table interpretation (STI) systems, especially those based on Large Language Models (LLMs), in cybersecurity. The dataset includes controlled errors to test AI robustness and is publicly available for the SemTab 2025 challenge, with plans for quarterly updates and expansion.

In the rapidly evolving landscape of cybersecurity, the ability of artificial intelligence systems, particularly large language models (LLMs), to understand and interpret complex security data presented in tables is crucial. However, a significant challenge has been the lack of publicly available, domain-specific tabular datasets for evaluating these systems. This gap makes it difficult to accurately assess how well AI can process and make sense of security-related information.

Addressing this critical need, a new research paper introduces the Secu-Table dataset, a comprehensive collection designed specifically for evaluating semantic table interpretation (STI) systems in the security domain. This dataset is a major step forward, providing a standardized resource for researchers and developers working on AI applications in cybersecurity.

What is Secu-Table?

Secu-Table is a robust dataset comprising over 1,500 tables and more than 15,000 entities. These tables are meticulously constructed from security data extracted from well-known sources such as Common Vulnerabilities and Exposures (CVE) and Common Weakness Enumeration (CWE), both fundamental to understanding and categorizing cybersecurity threats and weaknesses.

The dataset is not just a collection of tables; it’s also richly annotated. The annotations link table elements to existing knowledge graphs (KGs), specifically Wikidata, a general-purpose knowledge base, and the SEmantic Processing of Security Event Streams CyberSecurity Knowledge Graph (SEPSES CSKG), a specialized security KG. This annotation process adds semantic meaning to the raw tabular data, allowing AI systems to understand the context and relationships within the data.

Why is Secu-Table Important?

Current security datasets often suffer from several limitations. They are scattered across the internet in various formats (CSV, JSON, XML), making it hard to get a unified view. Many focus on narrow attack vectors, limiting their general applicability, and often lack the detailed annotations necessary for training advanced supervised learning models. Secu-Table tackles these issues by providing a consolidated, richly annotated, and structured resource.
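To make the consolidation step concrete, here is a minimal sketch of the kind of normalization involved: flattening a JSON-formatted vulnerability record into a CSV row. The field names are illustrative assumptions, not the actual CVE/NVD schema or Secu-Table's pipeline.

```python
import csv
import io
import json

# Illustrative JSON record; the keys are hypothetical, not the real CVE schema.
record = json.loads(
    '{"id": "CVE-2021-44228", "cwe": "CWE-502", "severity": "critical"}'
)

# Normalize into a CSV row, the unified tabular form Secu-Table targets.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "cwe", "severity"])
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())
```

The same pattern generalizes to XML sources: parse each record into a flat dictionary, then emit one CSV row per record.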

The creation of Secu-Table is particularly relevant in the context of the SemTab challenge, an international initiative aimed at benchmarking systems that match tabular data to knowledge graphs. The 2025 SemTab challenge will specifically use Secu-Table to evaluate the performance of various STI systems, especially those built on open-source LLMs.

How Was Secu-Table Built?

The construction of Secu-Table involved a multi-step process. Initially, data curators with strong backgrounds in semantic web and knowledge graphs were recruited. They identified CVE and CWE as primary data sources. The data was then manually parsed from various formats into CSV files, forming the raw tables. These tables were then manually annotated, a meticulous and time-consuming task, to ensure high quality. The annotation process involved three key tasks:

  • Cell Entity Annotation (CEA): Mapping individual table elements to entities in a knowledge graph.
  • Column Type Annotation (CTA): Identifying the types of data contained within table columns.
  • Column Property Annotation (CPA): Defining the relationships between different columns in a table.
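The three tasks above can be sketched as SemTab-style target records, where each annotation ties a table coordinate to a knowledge-graph IRI. The table IDs, column indices, and Wikidata-style IRIs below are placeholders for illustration, not entries from the actual Secu-Table ground truth.

```python
# CEA: (table_id, row, col) -> entity IRI (placeholder values)
cea = [
    ("CVE_table_001", 1, 0, "http://www.wikidata.org/entity/Q000001"),
]

# CTA: (table_id, col) -> type IRI (placeholder values)
cta = [
    ("CVE_table_001", 0, "http://www.wikidata.org/entity/Q000002"),
]

# CPA: (table_id, col_from, col_to) -> property IRI (placeholder values)
cpa = [
    ("CVE_table_001", 0, 1, "http://www.wikidata.org/prop/direct/P0001"),
]

def cea_lookup(records, table_id, row, col):
    """Return the annotated entity IRI for one cell, or None if unannotated."""
    for t, r, c, iri in records:
        if (t, r, c) == (table_id, row, col):
            return iri
    return None

print(cea_lookup(cea, "CVE_table_001", 1, 0))
```

An STI system is then scored by comparing its predicted IRIs against such ground-truth records, task by task.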

To make the dataset a robust benchmark for AI systems, various errors and ambiguities were intentionally introduced. The tables mix error-free data (20%) with missing context (26%), misspelling errors (26.6%), and annotation errors (26.26%). This simulates real-world data challenges, pushing LLMs to perform well even in imperfect scenarios.
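Two of these error categories can be sketched with simple perturbation functions. This is an illustration of the general idea of controlled noise injection, not the procedure the authors actually used.

```python
import random

def inject_misspelling(cell: str, rng: random.Random) -> str:
    """Swap two adjacent characters to simulate a misspelling error."""
    if len(cell) < 2:
        return cell
    i = rng.randrange(len(cell) - 1)
    return cell[:i] + cell[i + 1] + cell[i] + cell[i + 2:]

def inject_missing_context(row: list, rng: random.Random) -> list:
    """Blank out one cell to simulate missing context."""
    j = rng.randrange(len(row))
    return ["" if k == j else v for k, v in enumerate(row)]

rng = random.Random(42)
print(inject_misspelling("buffer overflow", rng))
print(inject_missing_context(["CVE-2021-44228", "CWE-502", "critical"], rng))
```

Seeding the random generator keeps the perturbations reproducible, which matters when the same corrupted tables must be shared across challenge participants.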


Availability and Future Plans

The Secu-Table dataset is publicly available for research purposes on Hugging Face. All associated code is also released on GitLab under the MIT license, ensuring transparency and reproducibility. The current version, Secu-Table v2, includes 1,554 tables, with a subset of 76 tables serving as ground truth for initial evaluations and the rest for testing.

The researchers plan quarterly updates, expanding the dataset to include more security data sources such as Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK), Common Configuration Enumeration (CCE), Common Platform Enumeration (CPE), Common Vulnerability Scoring System (CVSS), Open Worldwide Application Security Project (OWASP), and Security Content Automation Protocol (SCAP). They are also exploring semi-automatic annotation approaches to scale the dataset further.

Preliminary evaluations using open-source LLMs like Falcon3-7b-instruct and Mistral-7B-Instruct, as well as the closed-source GPT-4o mini, have already been conducted, establishing a baseline for future research. This dataset promises to significantly advance the field of semantic table interpretation in cybersecurity, enabling more intelligent and robust AI systems. You can find more details about this work in the full research paper: Secu-Table: a Comprehensive security table dataset for evaluating semantic table interpretation systems.
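Baselines like these are typically compared using precision, recall, and F1 over the annotation tasks. The following is a simplified sketch of such scoring for CEA, not the official SemTab evaluator.

```python
def cea_score(gold: dict, pred: dict):
    """Precision, recall, and F1 for cell entity annotation.

    gold and pred map (table_id, row, col) coordinates to entity IRIs.
    """
    correct = sum(1 for cell, iri in pred.items() if gold.get(cell) == iri)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: the system annotates one of two gold cells correctly.
gold = {("t1", 0, 0): "Q1", ("t1", 1, 0): "Q2"}
pred = {("t1", 0, 0): "Q1"}
print(cea_score(gold, pred))
```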

Dev Sundaram
Dev Sundaram is an investigative tech journalist with a nose for exclusives and leaks. With stints in cybersecurity and enterprise AI reporting, Dev thrives on breaking big stories—product launches, funding rounds, regulatory shifts—and giving them context. He believes journalism should push the AI industry toward transparency and accountability, especially as Generative AI becomes mainstream. You can reach him at: [email protected]
