Enhancing Cyber Threat Intelligence Mapping with AI-Generated Data

TLDR: SynthCTI is a novel framework that leverages Large Language Models (LLMs) to generate high-quality synthetic Cyber Threat Intelligence (CTI) sentences. This data augmentation strategy addresses the scarcity and imbalance of labeled CTI data, significantly improving the accuracy of mapping threat descriptions to MITRE ATT&CK techniques. The approach leads to substantial performance gains, particularly for underrepresented techniques, and enables smaller AI models to outperform larger ones trained without augmentation, offering a more efficient solution for cybersecurity analysis.

Understanding and responding to adversarial behavior is central to cybersecurity. Cyber Threat Intelligence (CTI) supports this by extracting insights from large volumes of unstructured threat data. A key challenge in CTI is mapping threat descriptions to the MITRE ATT&CK framework, a comprehensive knowledge base of adversary tactics and techniques. Traditionally, this mapping has been a manual, labor-intensive process that demands significant expert knowledge.

Automating this critical task faces two major hurdles: a scarcity of high-quality, labeled CTI data and a severe imbalance in existing datasets, where some techniques have many examples while others have very few. While advanced AI models, particularly Large Language Models (LLMs), have shown promise, most research has focused on improving the models themselves rather than addressing the fundamental data limitations.

A new framework called SynthCTI steps in to tackle these data challenges head-on. SynthCTI is designed to generate high-quality synthetic CTI sentences, specifically for those MITRE ATT&CK techniques that are underrepresented in existing datasets. This innovative approach aims to enrich training data, making AI models more robust and effective.

How SynthCTI Works

SynthCTI operates in two main phases: System Training and System Deployment. The training phase focuses on augmenting the data. Starting from a collection of cybersecurity threat descriptions, each labeled with a specific MITRE ATT&CK technique, SynthCTI first clusters the sentences within each technique by semantic similarity. This splits each technique into semantically coherent subgroups.
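The per-technique grouping step can be illustrated with a minimal sketch. The article does not say which clustering algorithm or sentence representation SynthCTI uses, so the bag-of-words vectors, the greedy single-pass clustering, the 0.3 similarity threshold, and all function names below are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict


def vectorize(sentence):
    """Bag-of-words term counts (a stand-in for a real sentence embedding)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_within_technique(sentences, threshold=0.3):
    """Greedy single-pass clustering (assumed, not SynthCTI's algorithm).

    Each sentence joins the first cluster whose seed sentence is similar
    enough; otherwise it starts a new cluster.
    """
    clusters = []  # list of (seed_vector, member_sentences)
    for sentence in sentences:
        vec = vectorize(sentence)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(sentence)
                break
        else:
            clusters.append((vec, [sentence]))
    return [members for _, members in clusters]


def cluster_dataset(labeled):
    """Group (sentence, technique) pairs by technique, then cluster each group."""
    by_technique = defaultdict(list)
    for sentence, technique in labeled:
        by_technique[technique].append(sentence)
    return {t: cluster_within_technique(s) for t, s in by_technique.items()}
```

Clustering inside each technique, rather than across the whole dataset, keeps every subgroup tied to a single label, so anything derived from a cluster stays attached to the correct technique.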

From these clusters, the framework extracts key features. This includes selecting a few representative examples (called ‘few-shots’), identifying central topics, extracting important keywords, and even finding synonyms for those keywords to introduce lexical variety. It also analyzes the ‘tone’ of the text (e.g., formal, neutral) and the typical ‘text type’ (e.g., short, multi-sentence descriptions). All this extracted information is then used to construct detailed prompts.
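As a rough illustration of how such features might be assembled into a prompt, the sketch below uses a simple frequency count for keywords and a plain text template. The `ClusterFeatures` structure, the stopword list, the template wording, and the function names are all hypothetical; the article does not describe SynthCTI's actual extraction methods or prompt format.

```python
from collections import Counter
from dataclasses import dataclass, field

# Tiny illustrative stopword list (assumption, not from the article).
STOPWORDS = {"the", "a", "an", "to", "of", "and", "with", "for", "in", "on", "is", "are"}


def top_keywords(sentences, k=3):
    """Most frequent non-stopword tokens across the cluster's sentences."""
    counts = Counter(
        w for s in sentences for w in s.lower().split() if w not in STOPWORDS
    )
    return [w for w, _ in counts.most_common(k)]


@dataclass
class ClusterFeatures:
    """Hypothetical container for the features extracted from one cluster."""
    technique: str
    topic: str
    few_shots: list
    keywords: list
    synonyms: dict = field(default_factory=dict)
    tone: str = "neutral"
    text_type: str = "short, single-sentence description"


def build_prompt(features, n_sentences=5):
    """Render the extracted features as a structured generation prompt."""
    lines = [
        f"Generate {n_sentences} new cyber threat intelligence sentences "
        f"describing MITRE ATT&CK technique {features.technique}.",
        f"Topic: {features.topic}",
        f"Tone: {features.tone}",
        f"Text type: {features.text_type}",
        "Keywords: " + ", ".join(features.keywords),
    ]
    for word, alternatives in features.synonyms.items():
        lines.append(f"Synonyms for '{word}': " + ", ".join(alternatives))
    lines.append("Examples:")
    lines += [f"- {s}" for s in features.few_shots]
    return "\n".join(lines)
```

Keeping the features in one structure makes it easy to vary a single element per request, for example swapping in a synonym or a different few-shot sample, while the rest of the prompt stays fixed.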

These structured prompts are fed into a powerful LLM, such as Gemma-3, guiding it to generate new, synthetic CTI sentences. The goal is to produce sentences that are not only lexically diverse but also semantically faithful to the original data, maintaining the correct cybersecurity context and nuances. These newly generated synthetic examples are then combined with the original training data.
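The combination step might look something like the sketch below. The article does not state how many synthetic sentences are generated per technique or whether duplicates are filtered, so the `target` of 20 examples per technique and the exact-match deduplication are assumptions made purely for illustration.

```python
from collections import Counter


def synthetic_budget(labels, target=20):
    """Synthetic sentences to request per technique so each reaches `target`.

    `target` is an assumed parameter; techniques already at or above it
    are left out of the result.
    """
    counts = Counter(labels)
    return {t: target - n for t, n in counts.items() if n < target}


def merge(original, synthetic):
    """Append synthetic (sentence, technique) pairs to the original data.

    Synthetic sentences that exactly repeat an existing sentence
    (case-insensitive) are dropped; this filter is an assumption.
    """
    seen = {s.strip().lower() for s, _ in original}
    merged = list(original)
    for sentence, technique in synthetic:
        key = sentence.strip().lower()
        if key not in seen:
            seen.add(key)
            merged.append((sentence, technique))
    return merged
```

Computing the budget from the label counts concentrates generation on the underrepresented techniques, which is where the article says SynthCTI is aimed.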

Finally, this enriched dataset is used to fine-tune various pre-trained LLMs, adapting them specifically for multi-class classification of MITRE ATT&CK techniques. In the deployment phase, these fine-tuned models can then assist CTI analysts by automatically mapping sentences from new threat reports to their corresponding MITRE ATT&CK techniques, accelerating analysis and informing decision-making.

Significant Improvements and Key Findings

SynthCTI was evaluated on two publicly available CTI datasets: CTI-to-MITRE and TRAM. Classification performance improved consistently, particularly in the F1-macro score. F1-macro is the unweighted average of the per-class F1 scores, which makes it a crucial metric for imbalanced datasets: every class counts equally, including the rare ones.
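To see why F1-macro matters on imbalanced data, the following sketch computes it from scratch and compares it with plain accuracy. The labels and predictions are made-up toy values, not results from the paper.

```python
def f1_per_class(y_true, y_pred):
    """F1 score for each class that appears in the labels or predictions."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: rare classes count as much as common ones."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: a classifier that always predicts the majority technique.
y_true = ["T1566"] * 8 + ["T1053"] * 2
y_pred = ["T1566"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")  # accuracy = 0.80
print(f"macro F1 = {macro_f1(y_true, y_pred):.2f}")  # macro F1 = 0.44
```

The majority-class predictor looks acceptable by accuracy, but its F1 of zero on the rare technique pulls the macro average down to about 0.44. That gap is what makes the metric sensitive to improvements on underrepresented techniques.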

For instance, ALBERT, a relatively lightweight LLM, improved its F1-macro score from 0.35 to 0.52 on the CTI-to-MITRE dataset, a relative gain of nearly 49%. SecureBERT, a domain-specific LLM, also saw substantial gains, rising from 0.4412 to 0.6558, which is likewise a relative gain of about 49%. A notable finding was that smaller models augmented with SynthCTI often outperformed larger models trained without any data augmentation. This shows that high-quality synthetic data can compensate for model size, making efficient and deployable CTI classification systems more feasible.
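The relative gains follow directly from the reported scores. The short check below uses only the four F1-macro values quoted above; the helper name `relative_gain` is just for illustration.

```python
def relative_gain(before, after):
    """Improvement expressed as a fraction of the starting score."""
    return (after - before) / before


reported = [
    ("ALBERT", 0.35, 0.52),
    ("SecureBERT", 0.4412, 0.6558),
]

for name, before, after in reported:
    print(
        f"{name}: +{after - before:.4f} absolute, "
        f"{relative_gain(before, after):.1%} relative"
    )
# ALBERT: +0.1700 absolute, 48.6% relative
# SecureBERT: +0.2146 absolute, 48.6% relative
```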

Beyond performance, SynthCTI also demonstrated faster training convergence. Models trained with augmented data learned more quickly and reached higher performance levels in fewer training steps, which is a significant advantage for security teams needing to regularly update their models to keep pace with new threats.

Understanding the Generated Data

An in-depth analysis of the synthetic data revealed that the quality of generation is strongly linked to the amount of original data available for a given technique. Techniques with very few original examples (fewer than 10) sometimes led to synthetic data that overfitted specific entities or showed less semantic consistency. However, techniques with a moderate number of original examples (around 20 or more) consistently produced high-quality, diverse, and semantically coherent synthetic sentences.
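One practical way to act on this observation is to flag techniques by how many original examples they have before generating data for them. The sketch below is a hypothetical helper, not part of SynthCTI. It takes the two thresholds mentioned above (fewer than 10, and around 20 or more) and treats the range in between as uncharacterized, since the article does not describe it.

```python
from collections import Counter


def generation_risk(labels, low=10, reliable=20):
    """Bucket each technique by its number of original examples.

    Thresholds follow the counts quoted in the article; the bucket
    names are illustrative assumptions.
    """
    buckets = {}
    for technique, n in Counter(labels).items():
        if n < low:
            # Synthetic data may overfit entities or drift semantically.
            buckets[technique] = "review manually"
        elif n >= reliable:
            # Generation was reported as consistently high quality.
            buckets[technique] = "likely reliable"
        else:
            # Not described in the article.
            buckets[technique] = "not characterized"
    return buckets
```

A call such as `generation_risk(train_labels)` on the training labels would then indicate which techniques' synthetic sentences deserve a manual spot check before they are merged into the training set.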

The research also highlighted inherent ambiguities between some MITRE ATT&CK technique definitions, which occasionally surfaced during the augmentation process. This suggests potential areas for refining the framework’s boundaries in future work.

In conclusion, SynthCTI presents a powerful, LLM-driven data augmentation pipeline that significantly enhances the automated classification of CTI sentences into MITRE ATT&CK techniques. By strategically generating high-quality synthetic data, it not only boosts the performance of existing models but also enables smaller, more resource-efficient models to achieve competitive results, paving the way for more practical and effective cybersecurity solutions. You can read the full research paper here.
