Unmasking Stealthy Data Leaks: How Multi-Stage Prompt Attacks Target Enterprise AI

TLDR: This research paper explores multi-stage prompt inference attacks on enterprise Large Language Model (LLM) systems, where attackers chain seemingly benign prompts to gradually extract confidential data like internal documents and emails. It details attack strategies including reconnaissance, exploitation (indirect prompt injection, adaptive questioning), and exfiltration. The paper also proposes a comprehensive defense-in-depth strategy, encompassing anomaly detection, fine-grained access control, prompt sanitization, and architectural modifications like differential privacy training and ‘watchdog’ models, emphasizing the need for layered security to protect sensitive enterprise data from sophisticated AI exploits.

Large Language Models (LLMs) are rapidly becoming integral to enterprise operations, assisting with tasks by leveraging private organizational data. Think of tools like Microsoft 365 Copilot, which integrate LLMs with a company’s emails, documents, and knowledge base to boost productivity. However, this powerful capability introduces new security vulnerabilities, particularly a sophisticated threat known as multi-stage prompt inference attacks.

Understanding the Threat: Multi-Stage Prompt Inference Attacks

Traditionally, prompt injection attacks involve a single, maliciously crafted input designed to manipulate an LLM’s behavior. While these are a concern, a more insidious threat is the multi-stage prompt inference attack. This involves adversaries chaining together seemingly benign prompts over multiple interactions to gradually extract confidential data. Each individual prompt might appear harmless and evade immediate detection, but the cumulative dialogue coerces the model into revealing secrets piece by piece. For instance, an attacker might first ask for meta-information about a document and then, through cleverly disguised follow-ups, extract the actual content. This method can reliably exfiltrate sensitive information, such as internal SharePoint documents or emails, even when standard safety measures are in place.

The paper highlights how these attacks unfold in stages: reconnaissance, exploitation, and exfiltration. Reconnaissance involves asking general questions to understand the LLM’s behavior and boundaries. Exploitation is where the attacker actively extracts information, either through indirect prompt injection (embedding hidden instructions in external content that the LLM processes) or adaptive questioning (systematically asking a series of innocuous questions that incrementally reveal details about a secret). Finally, exfiltration is about getting the sensitive data out, sometimes by tricking the LLM into outputting it in a coded form or via an external channel, as seen in the ‘EchoLeak’ attack chain.
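
To see why adaptive questioning is hard to catch one prompt at a time, consider a toy illustration (not taken from the paper). It assumes a confidential integer, such as a budget figure, and an assistant that will answer harmless-looking comparison questions about it. The function name and the `is_above` callback are hypothetical; the point is only that each answer leaks about one bit, so a handful of innocuous questions adds up to the full value.

```python
def recover_by_comparisons(is_above, low, high):
    """Recover an integer in [low, high] using only yes/no answers.

    `is_above(x)` stands in for asking the assistant an innocuous question
    such as "Is the figure higher than x?". Each answer halves the range.
    """
    questions = 0
    while low < high:
        mid = (low + high) // 2
        questions += 1
        if is_above(mid):
            low = mid + 1   # secret is strictly greater than mid
        else:
            high = mid      # secret is mid or smaller
    return low, questions


if __name__ == "__main__":
    secret = 742_000  # made-up value for the demonstration
    value, asked = recover_by_comparisons(lambda x: secret > x, 0, 1_000_000)
    print(value, asked)
```

For a range of about one million values, the loop needs at most 20 questions, none of which asks for the secret directly. This is why the defenses discussed in the following sections look at the whole conversation rather than at individual prompts.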

Building a Robust Defense Strategy

To counter these evolving threats, the researchers propose a multi-layered defense-in-depth strategy. No single solution is sufficient; instead, a combination of measures is needed:

Anomaly Detection

One key defense is to detect attacks as they happen. Anomaly detection systems can monitor sequences of user prompts and LLM responses for suspicious patterns. This includes looking for unusually pointed questions, semantic similarities between successive queries that suggest systematic probing, or the presence of keywords often associated with prompt injections. By analyzing these conversational features, the system can flag suspicious behavior and potentially intervene, for example, by switching the LLM to a more restrictive mode or alerting an administrator.
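
The monitoring idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's detector: token overlap (Jaccard similarity) stands in for a real semantic similarity model, and the phrase list, thresholds, and function names are all hypothetical placeholders that would need tuning against real logs.

```python
import re

# Hypothetical phrases associated with injection attempts; not from the paper.
SUSPICIOUS_PHRASES = ("ignore previous instructions", "system prompt", "verbatim")


def tokens(text):
    """Lowercased word set; a crude stand-in for a sentence embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def probing_score(prompts, sim_threshold=0.5):
    """Share of successive prompt pairs that are near-repeats,
    plus one point for every prompt containing a suspicious phrase."""
    if len(prompts) < 2:
        pair_rate = 0.0
    else:
        similar = sum(
            1 for a, b in zip(prompts, prompts[1:]) if jaccard(a, b) >= sim_threshold
        )
        pair_rate = similar / (len(prompts) - 1)
    keyword_hits = sum(
        any(phrase in p.lower() for phrase in SUSPICIOUS_PHRASES) for p in prompts
    )
    return pair_rate + keyword_hits


def is_suspicious(prompts, limit=1.0):
    """Flag a session whose score reaches the (assumed) limit."""
    return probing_score(prompts) >= limit
```

A session such as "What is the first word of the Q3 budget memo?", then the second word, then the third, is flagged because every successive pair is nearly identical, even though no single prompt contains a suspicious phrase. What happens after a flag (a more restrictive mode, an administrator alert) is a policy decision outside this sketch.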

Access Control and Context Separation

A fundamental principle is to limit what data the LLM can access and reveal based on the user’s permissions and the query’s context. This means strictly enforcing permissions at the retrieval layer, so that the LLM only accesses data the querying user is authorized to see. Techniques like ‘spotlighting’ are also proposed, which clearly delineate untrusted user input from trusted internal context within the prompt using special tokens. This helps the model distinguish between user instructions and sensitive internal data, making it less likely to mistake an injected instruction for part of its system role.
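
Both ideas can be sketched in a few lines. The example below is an assumption-laden illustration rather than any product's implementation: the `Document` type, the group-based permission model, the naive keyword retrieval, and the `<<...>>` marker strings are all hypothetical. It shows permissions being checked before relevance, and user text being wrapped in delimiters it cannot forge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset


def retrieve_for_user(query, user_groups, corpus):
    """Return matching documents, but only those the user may read."""
    words = set(query.lower().split())
    return [
        d for d in corpus
        if d.allowed_groups & user_groups          # permission check first
        and words & set(d.text.lower().split())    # then a naive keyword match
    ]


# Hypothetical delimiter tokens; a real system would use reserved tokens.
MARKERS = ("<<CONTEXT>>", "<<END_CONTEXT>>", "<<USER>>", "<<END_USER>>")


def strip_markers(text):
    """Remove delimiter look-alikes, repeating until none can be re-formed."""
    changed = True
    while changed:
        changed = False
        for marker in MARKERS:
            if marker in text:
                text = text.replace(marker, "")
                changed = True
    return text


def build_prompt(system_rules, docs, user_input):
    """Assemble a prompt with internal context and user input clearly separated."""
    context = "\n".join(strip_markers(d.text) for d in docs)
    return (
        f"{system_rules}\n"
        f"<<CONTEXT>>\n{context}\n<<END_CONTEXT>>\n"
        f"<<USER>>\n{strip_markers(user_input)}\n<<END_USER>>"
    )
```

Filtering at retrieval time means a document the user cannot read never reaches the prompt, so no amount of clever questioning can coax it out of the model. The delimiters only help if the model has been instructed or trained to treat text inside the user block as data to answer, not as instructions to obey.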

Prompt Sanitization and Content Filtering

This layer focuses on cleaning inputs and outputs. Input sanitization involves removing or escaping special tokens, neutralizing HTML/Markdown that could hide instructions, and filtering out suspicious keywords from user prompts. On the output side, content filters scan the LLM’s answers for sensitive data patterns, similar to Data Loss Prevention (DLP) systems. If sensitive content is detected, it can be redacted or trigger a review.
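
A minimal sketch of both directions is shown below. It is illustrative only: the regular expressions cover a few obvious cases (HTML comments and tags, zero-width characters, email addresses, US-style SSNs), and the function names and pattern set are assumptions. A production filter would need a much broader rule set and likely a dedicated DLP service.

```python
import html
import re

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
HTML_TAG = re.compile(r"<[^>]+>")
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

# Hypothetical sensitive-data patterns, in the spirit of DLP rules.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def sanitize_input(text):
    """Drop markup and invisible characters that could hide instructions."""
    text = HTML_COMMENT.sub("", text)
    text = HTML_TAG.sub("", text)
    text = ZERO_WIDTH.sub("", text)
    # Escape any leftover angle brackets or ampersands.
    return html.escape(text, quote=False).strip()


def redact_output(text):
    """Replace sensitive matches and report which pattern labels fired."""
    found = []
    for label, pattern in SENSITIVE_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED {label.upper()}]", text)
        if count:
            found.append(label)
    return text, found
```

The list of labels returned by `redact_output` can drive the follow-up action: silently redact, block the response, or queue it for human review.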

Architectural and Training-Time Defenses

More profound changes involve modifying the LLM’s architecture or training process. Differential privacy training can provably limit the influence of any single training example on the model’s outputs, reducing its propensity to memorize and regurgitate sensitive training data. Another approach is a ‘two-model’ or ‘tiered’ architecture, where a smaller ‘watchdog’ model oversees the primary LLM’s outputs, specifically trained to detect and prevent the disclosure of sensitive information. Continuous learning from attacks, where new exploit attempts are fed back into training to make the model more robust, is also crucial. Furthermore, running LLMs in secure enclaves and sandboxing their execution can contain the impact if an attack succeeds, for example, by requiring user confirmation for sensitive actions like sending emails.
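
The tiered design and the confirmation step can be sketched as follows. This is a simplified illustration, not the architecture from the paper: a keyword rule stands in for the trained watchdog model, the primary LLM is any callable that maps a prompt to text, and the action names and refusal message are hypothetical.

```python
import re
from typing import Callable

REFUSAL = "I can't share that."
SENSITIVE_ACTIONS = {"send_email", "share_file"}  # hypothetical action names


def rule_based_watchdog(answer: str) -> bool:
    """Stand-in for a trained watchdog model. Returns True to block."""
    return bool(re.search(r"\b(confidential|internal only)\b", answer, re.IGNORECASE))


def guarded_answer(
    prompt: str,
    primary: Callable[[str], str],
    watchdog: Callable[[str], bool] = rule_based_watchdog,
) -> str:
    """Run the primary model, then let the watchdog veto its output."""
    answer = primary(prompt)
    if watchdog(answer):
        return REFUSAL
    return answer


def run_action(action: str, confirm: Callable[[str], bool]) -> str:
    """Require explicit user confirmation before any sensitive action."""
    if action in SENSITIVE_ACTIONS and not confirm(action):
        return "cancelled"
    return "executed"
```

Because the watchdog sees only the primary model's output, an injected instruction aimed at the primary model does not directly reach it. The confirmation gate limits damage in a different way: even if an attack steers the model toward sending data out, the action does not run without the user approving it.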

The Ongoing Arms Race

The research underscores that securing LLMs in enterprise settings requires moving beyond single-turn prompt filtering toward a holistic, multi-stage perspective on both attacks and defenses. It’s an ongoing ‘cat-and-mouse’ game between attackers and defenders. Organizations must adopt a proactive security posture, regularly red-teaming their LLM systems, investing in monitoring tools, and continuously updating safety mechanisms as new vulnerabilities emerge. For more technical details, refer to the full research paper.

By combining multiple defense layers and staying vigilant, enterprises can significantly bolster their LLM security posture, enjoying the productivity benefits of AI assistants without opening the floodgates to their most precious secrets.