Enhancing Cyber Threat Intelligence Mapping with AI-Generated Data

TLDR: SynthCTI is a novel framework that leverages Large Language Models (LLMs) to generate high-quality synthetic Cyber Threat Intelligence (CTI) sentences. This data augmentation strategy addresses the scarcity and imbalance of labeled CTI data, significantly improving the accuracy of mapping threat descriptions to MITRE ATT&CK techniques. The approach leads to substantial performance gains, particularly for underrepresented techniques, and enables smaller AI models to outperform larger ones trained without augmentation, offering a more efficient solution for cybersecurity analysis.

Defending against attackers depends on understanding how they behave. Cyber Threat Intelligence (CTI) supports this by extracting insights from large volumes of unstructured threat data. A key challenge in CTI is mapping threat descriptions to the MITRE ATT&CK framework, a comprehensive knowledge base of adversary tactics and techniques. Traditionally, this mapping has been a manual, labor-intensive process that demands significant expert knowledge.

Automating this critical task faces two major hurdles: a scarcity of high-quality, labeled CTI data and a severe imbalance in existing datasets, where some techniques have many examples while others have very few. While advanced AI models, particularly Large Language Models (LLMs), have shown promise, most research has focused on improving the models themselves rather than addressing the fundamental data limitations.

A new framework called SynthCTI steps in to tackle these data challenges head-on. SynthCTI is designed to generate high-quality synthetic CTI sentences, specifically for those MITRE ATT&CK techniques that are underrepresented in existing datasets. This innovative approach aims to enrich training data, making AI models more robust and effective.

How SynthCTI Works

SynthCTI operates in two main phases: System Training and System Deployment. The training phase focuses on augmenting the data. The starting point is a collection of cybersecurity threat descriptions, each labeled with a specific MITRE ATT&CK technique. SynthCTI first clusters similar sentences within each technique, which identifies semantically coherent subgroups.
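The per-technique grouping step can be illustrated with a minimal sketch. The article does not name the clustering method, so the token-overlap (Jaccard) similarity, the greedy assignment, the `threshold` value, and all function names below are assumptions chosen to keep the example dependency-free. A real pipeline would more likely cluster sentence embeddings.

```python
from collections import defaultdict

def tokens(sentence):
    """Lowercased word set; a crude stand-in for a sentence embedding."""
    stripped = (w.strip(".,;:()").lower() for w in sentence.split())
    return {w for w in stripped if w}

def jaccard(a, b):
    """Share of words the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_within_technique(labeled, threshold=0.3):
    """Group sentences by technique ID, then greedily cluster each group.

    A sentence joins the first cluster whose seed sentence is similar
    enough; otherwise it starts a new cluster. (Hypothetical helper.)
    """
    by_technique = defaultdict(list)
    for technique_id, sentence in labeled:
        by_technique[technique_id].append(sentence)

    result = {}
    for technique_id, sentences in by_technique.items():
        clusters = []  # each cluster is a list; index 0 is its seed
        for s in sentences:
            for c in clusters:
                if jaccard(tokens(s), tokens(c[0])) >= threshold:
                    c.append(s)
                    break
            else:
                clusters.append([s])
        result[technique_id] = clusters
    return result

# Illustrative sentences, not taken from either dataset.
labeled = [
    ("T1566", "The actor sent a phishing email with a malicious attachment."),
    ("T1566", "A phishing email with a malicious attachment was sent to staff."),
    ("T1566", "Victims were lured to a credential harvesting page via SMS."),
    ("T1059", "The malware runs PowerShell commands to download payloads."),
]
clusters = cluster_within_technique(labeled)
for technique_id, groups in clusters.items():
    print(technique_id, [len(g) for g in groups])
```

Here the two email-based phishing sentences end up in one cluster and the SMS-based one in another, which is the kind of subgroup the feature extraction step would then describe separately.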

From these clusters, the framework extracts key features. This includes selecting a few representative examples (called ‘few-shots’), identifying central topics, extracting important keywords, and even finding synonyms for those keywords to introduce lexical variety. It also analyzes the ‘tone’ of the text (e.g., formal, neutral) and the typical ‘text type’ (e.g., short, multi-sentence descriptions). All this extracted information is then used to construct detailed prompts.
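To make the prompt construction concrete, here is a minimal sketch of how the extracted features could be assembled into a single prompt. The `ClusterFeatures` container, the `build_prompt` function, and the exact prompt wording are hypothetical; the article lists the ingredients (few-shots, topics, keywords, synonyms, tone, text type) but not the template.

```python
from dataclasses import dataclass, field

@dataclass
class ClusterFeatures:
    """Features extracted from one cluster (hypothetical structure)."""
    technique: str
    few_shots: list
    topics: list
    keywords: list
    synonyms: dict = field(default_factory=dict)
    tone: str = "neutral"
    text_type: str = "short description"

def build_prompt(f, n_sentences=5):
    """Assemble a generation prompt from cluster features."""
    lines = [
        f"Generate {n_sentences} new cyber threat intelligence sentences "
        f"describing the MITRE ATT&CK technique: {f.technique}.",
        f"Tone: {f.tone}. Text type: {f.text_type}.",
        "Topics: " + ", ".join(f.topics),
        "Keywords: " + ", ".join(f.keywords),
    ]
    # Synonyms give the model room for lexical variety.
    for keyword, alternatives in f.synonyms.items():
        lines.append(f"You may replace '{keyword}' with: " + ", ".join(alternatives))
    lines.append("Examples:")
    lines += [f"- {s}" for s in f.few_shots]
    return "\n".join(lines)

features = ClusterFeatures(
    technique="T1566 Phishing",
    few_shots=[
        "The actor sent a phishing email with a malicious attachment.",
        "A spearphishing message delivered a weaponized document.",
    ],
    topics=["email delivery", "malicious attachments"],
    keywords=["phishing", "attachment"],
    synonyms={"attachment": ["document", "file"]},
    tone="formal",
)
prompt = build_prompt(features, n_sentences=3)
print(prompt)
```

Keeping the features in a structured object rather than a free-form string makes it straightforward to build one prompt per cluster and to audit which inputs produced which synthetic sentences.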

These structured prompts are fed into a powerful LLM, such as Gemma-3, guiding it to generate new, synthetic CTI sentences. The goal is to produce sentences that are not only lexically diverse but also semantically faithful to the original data, maintaining the correct cybersecurity context and nuances. These newly generated synthetic examples are then combined with the original training data.
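The combination step can be sketched as follows. The duplicate filter is an assumption added for illustration (the article only says the synthetic examples are combined with the original training data), and `normalize` and `merge_datasets` are hypothetical helper names.

```python
from collections import Counter

def normalize(sentence):
    """Case- and whitespace-insensitive form used for duplicate checks."""
    return " ".join(sentence.lower().split())

def merge_datasets(original, synthetic):
    """Append synthetic (technique, sentence) pairs that are not duplicates."""
    seen = {(t, normalize(s)) for t, s in original}
    merged = list(original)
    for t, s in synthetic:
        key = (t, normalize(s))
        if key not in seen:
            seen.add(key)
            merged.append((t, s))
    return merged

# Illustrative data: T1003 is underrepresented in the original set.
original = [
    ("T1566", "The actor sent a phishing email with a malicious attachment."),
    ("T1566", "A spearphishing link led to a fake login page."),
    ("T1566", "Victims received emails carrying weaponized documents."),
    ("T1003", "The tool dumps credentials from LSASS memory."),
]
synthetic = [
    ("T1003", "The implant extracts password hashes from the SAM database."),
    ("T1003", "Attackers harvested credentials by reading LSASS process memory."),
    ("T1003", "The tool dumps credentials from  LSASS memory."),  # duplicate
]
merged = merge_datasets(original, synthetic)
print(Counter(t for t, _ in original))
print(Counter(t for t, _ in merged))
```

After merging, the rare technique has three examples instead of one, while the exact duplicate is dropped so it does not inflate the class without adding variety.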

Finally, this enriched dataset is used to fine-tune various pre-trained LLMs, adapting them specifically for multi-class classification of MITRE ATT&CK techniques. In the deployment phase, these fine-tuned models can then assist CTI analysts by automatically mapping sentences from new threat reports to their corresponding MITRE ATT&CK techniques, accelerating analysis and informing decision-making.

Significant Improvements and Key Findings

SynthCTI was evaluated on two publicly available CTI datasets: CTI-to-MITRE and TRAM. Classification performance improved consistently, particularly the F1-macro score. This metric matters for imbalanced datasets because it averages the per-class F1 scores with equal weight, so rare techniques count as much as common ones.
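A small worked example shows why F1-macro is more informative than accuracy on imbalanced data. The labels and predictions below are invented for illustration; the functions follow the standard definition of per-class F1 averaged with equal weight.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Nine examples of a common technique, one of a rare technique.
y_true = ["T1566"] * 9 + ["T1003"]
# A model that ignores the rare class and always predicts the common one.
y_pred = ["T1566"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"F1-macro = {f1_macro(y_true, y_pred):.3f}")
```

The model is right 90% of the time, yet its F1-macro is below 0.5 because the rare class scores zero and drags the average down. This is the behavior that makes the metric sensitive to gains on underrepresented techniques.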

For instance, the F1-macro score of ALBERT, a relatively lightweight LLM, rose from 0.35 to 0.52 on the CTI-to-MITRE dataset, a relative gain of nearly 49%. SecureBERT, a domain-specific LLM, also improved substantially, from 0.4412 to 0.6558. A notable finding was that smaller models augmented with SynthCTI often outperformed larger models trained without any data augmentation. This suggests that high-quality generated data can compensate for model size, making efficient and deployable CTI classification systems more feasible.
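The relative gains can be checked with a few lines of arithmetic. This sketch uses only the scores quoted above; the ALBERT figures appear to be rounded to two decimals, so its result is approximate.

```python
def relative_gain(before, after):
    """Improvement as a fraction of the starting score."""
    return (after - before) / before

albert = relative_gain(0.35, 0.52)
securebert = relative_gain(0.4412, 0.6558)

print(f"ALBERT:     {albert:.1%}")      # ALBERT:     48.6%
print(f"SecureBERT: {securebert:.1%}")  # SecureBERT: 48.6%
```

Both models gain roughly the same proportion of their baseline score, even though SecureBERT starts from a higher baseline.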

Beyond performance, SynthCTI also demonstrated faster training convergence. Models trained with augmented data learned more quickly and reached higher performance levels in fewer training steps, which is a significant advantage for security teams needing to regularly update their models to keep pace with new threats.

Understanding the Generated Data

An in-depth analysis of the synthetic data revealed that generation quality is strongly linked to the amount of original data available for a given technique. Techniques with very few original examples (fewer than 10) sometimes yielded synthetic data that overfitted to specific entities or showed less semantic consistency. Techniques with a moderate number of original examples (around 20 or more), by contrast, consistently produced high-quality, diverse, and semantically coherent synthetic sentences.
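One practical use of this finding is to triage techniques by how many original examples they have before trusting their synthetic data. The sketch below is an assumption, not part of SynthCTI: the two thresholds come from the figures quoted above, while the bucket names and the `generation_bucket` function are hypothetical. The article does not describe the range between 10 and 20, so that range is labeled as unreported rather than guessed.

```python
from collections import Counter

LOW, MODERATE = 10, 20  # thresholds quoted in the article

def generation_bucket(n_original):
    """Rough quality expectation for synthetic data (hypothetical buckets)."""
    if n_original < LOW:
        return "review"      # may overfit to specific entities
    if n_original >= MODERATE:
        return "reliable"    # consistently coherent in the reported analysis
    return "unreported"      # the article gives no finding for this range

# Illustrative label counts per technique.
labels = ["T1566"] * 25 + ["T1003"] * 6 + ["T1059"] * 14
buckets = {t: generation_bucket(n) for t, n in Counter(labels).items()}
print(buckets)
```

Synthetic sentences for techniques in the "review" bucket would be natural candidates for manual inspection by an analyst.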

The research also highlighted inherent ambiguities between some MITRE ATT&CK technique definitions, which occasionally surfaced during the augmentation process. This suggests potential areas for refining the framework’s boundaries in future work.

In conclusion, SynthCTI presents a powerful, LLM-driven data augmentation pipeline that significantly enhances the automated classification of CTI sentences into MITRE ATT&CK techniques. By strategically generating high-quality synthetic data, it not only boosts the performance of existing models but also enables smaller, more resource-efficient models to achieve competitive results, paving the way for more practical and effective cybersecurity solutions. You can read the full research paper here.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society.
