TLDR: A new study shows that Large Language Models (LLMs) can have their original input prompts reconstructed from their internal states (ISs), even from deep layers and for very long texts. The researchers developed four inversion attacks (two white-box, two black-box) that substantially improve the accuracy of recovering sensitive inputs. They also found that common defenses such as quantization and differential privacy cannot prevent inversion without severely degrading the LLM’s utility, highlighting a critical privacy vulnerability in how LLMs process and expose internal data.
Large Language Models (LLMs) have become an integral part of our daily lives, powering everything from chatbots to coding assistants. However, their widespread adoption also brings significant privacy and safety concerns. A recent research paper, titled “Depth Gives a False Sense of Privacy: LLM Internal States Inversion,” delves into a critical vulnerability: the ability to reconstruct a user’s original input prompt from the LLM’s internal states (ISs).
Traditionally, these internal states, the intermediate representations an LLM generates as it processes a prompt, were assumed to be impossible to invert back to the original input. This assumption rested on the idea that deep layers hold highly abstract representations and that optimization challenges would prevent such a reversal. The new research challenges that very notion.
The Challenge of Inverting Internal States
The paper highlights two main reasons why inverting LLM internal states is particularly challenging. First, ISs are designed for subsequent inference and contain abstract logical representations, making them inherently different from simpler text embeddings or model outputs that have a more direct semantic relevance to the input. Second, modern LLMs are significantly larger, with more layers, greater width, and much larger vocabularies than older language models, which further complicates the inversion process, especially for deeper layers where information loss can occur.
Novel Inversion Attacks
To demonstrate this privacy risk, the researchers developed four novel inversion attacks, adapting to different levels of access to the LLM’s internal workings:
- White-Box Optimization-Based Attacks: These attacks target scenarios where the attacker has full knowledge of the model’s weights (for example, an honest-but-curious inference server that follows the protocol but inspects the internal states it handles).
The first, called Embedding Recovery (ER), is tailored to shallow internal states. It optimizes a dummy input embedding until the model’s intermediate activations match the observed internal states, then maps each optimized embedding to the closest token in the vocabulary. For deeper layers, ER can struggle because the gradients become unstable.
To address this, the second white-box attack, Token Basis Selection (TBS), instead searches for the projection values onto an orthogonal basis of the input embedding space and composes the inverted embeddings from them. This greatly reduces the search space and stabilizes the optimization, making the attack effective even for deep layers (a minimal sketch of this optimization-based approach appears after this list).
- Black-Box Attacks: These attacks are for more practical scenarios where the attacker can only observe the internal states without knowing the model’s weights (like a third-party auditor).
The first black-box approach involves Model Type Identification. Since many deployed LLMs are derived from open-source base models through fine-tuning or merging, the attacker first tries to identify that base model. If successful, they can replicate the target model’s internal states with a surrogate LLM and then apply the optimization-based attacks.
If the model is fully closed-source or diverges too far from any known base, a Generation-Based Attack is employed instead. This method treats inversion as a translation task: an encoder-decoder model (similar to those used in machine translation) “translates” the observed internal states back into input tokens, with a projection module aligning the internal states to the encoder’s embedding space to improve inversion accuracy (sketched below, after the white-box example).
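To make the white-box idea concrete, here is a minimal sketch of ER-style optimization-based inversion, assuming a Hugging Face-style causal LM that exposes its hidden states; the function name, hyperparameters, and loss choice are illustrative assumptions, not the paper’s exact algorithm:

```python
import torch

def invert_internal_states(model, observed_states, layer_idx, seq_len,
                           steps=500, lr=0.1):
    """Sketch of ER-style white-box inversion: optimize a dummy input
    embedding until the model's hidden states at `layer_idx` match the
    observed internal states, then map each position to its nearest token."""
    hidden_dim = model.config.hidden_size
    # Dummy input embeddings to optimize (white-box: gradients flow through the model).
    dummy = torch.randn(1, seq_len, hidden_dim, requires_grad=True)
    optimizer = torch.optim.Adam([dummy], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        out = model(inputs_embeds=dummy, output_hidden_states=True)
        # Match the observed internal states at the exposed layer.
        loss = torch.nn.functional.mse_loss(out.hidden_states[layer_idx],
                                            observed_states)
        loss.backward()
        optimizer.step()

    # Token recovery: nearest neighbor in the model's input embedding matrix.
    emb_matrix = model.get_input_embeddings().weight        # (vocab, hidden)
    dists = torch.cdist(dummy.detach()[0], emb_matrix)      # (seq_len, vocab)
    return dists.argmin(dim=-1)                             # recovered token ids
```

TBS differs mainly in how the dummy embeddings are parameterized: rather than optimizing them freely, it searches for projection coefficients onto an orthogonal basis of the input embedding space, which shrinks the search space and keeps the gradients stable even at deep layers.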
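The generation-based black-box attack can likewise be sketched as an encoder-decoder “translation” model with a projection module in front of it; the choice of T5 and all module names here are assumptions for illustration, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration

class GenerationInverter(nn.Module):
    """Sketch of a generation-based inverter: a projection module maps the
    observed internal states into the encoder's embedding space, and an
    encoder-decoder model is trained to 'translate' them back into the
    original input tokens."""
    def __init__(self, is_dim, seq2seq_name="t5-base"):
        super().__init__()
        self.seq2seq = T5ForConditionalGeneration.from_pretrained(seq2seq_name)
        enc_dim = self.seq2seq.config.d_model
        # Projection module aligning internal states with the encoder's input space.
        self.proj = nn.Sequential(nn.Linear(is_dim, enc_dim), nn.GELU(),
                                  nn.Linear(enc_dim, enc_dim))

    def forward(self, internal_states, labels=None):
        # internal_states: (batch, seq_len, is_dim) captured from the target LLM.
        return self.seq2seq(inputs_embeds=self.proj(internal_states),
                            labels=labels)

    @torch.no_grad()
    def invert(self, internal_states, max_len=512):
        # Generate the reconstructed prompt token ids from the captured states.
        return self.seq2seq.generate(inputs_embeds=self.proj(internal_states),
                                     max_length=max_len)
```

The inverter would be trained on pairs of captured internal states and their original prompts; at attack time, `invert` generates the reconstructed prompt directly from the observed states.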
Effectiveness and Implications
The extensive evaluation of these attacks on six real-world LLMs, using both short and long prompts from sensitive datasets such as medical consulting and coding assistance, validated their effectiveness. Notably, the TBS attack nearly perfectly inverted a 4,112-token medical consulting prompt from the middle layer of a Llama-3 model, achieving a token-matching F1 score of 86.88. The generation-based attack also achieved high F1 scores for medium-length inputs.
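For context, the token-matching F1 reported above is essentially the harmonic mean of token precision and recall between the recovered and the original prompt; a minimal illustrative version (the paper’s exact tokenization and matching rules may differ):

```python
from collections import Counter

def token_f1(predicted_tokens, reference_tokens):
    """Token-level F1: harmonic mean of precision and recall over the
    multiset overlap between recovered and original prompt tokens."""
    overlap = Counter(predicted_tokens) & Counter(reference_tokens)
    n_common = sum(overlap.values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(predicted_tokens)
    recall = n_common / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)
```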
The study found that domain-specialized models, like Qwen2.5-Coder, can be even more susceptible to inversion. Furthermore, larger models (up to 70B parameters) were also vulnerable, sometimes yielding even better inversion results due to more information being retained in their wider internal states.
Challenges for Defenses
The researchers also evaluated four practical defenses: quantization, dropout, noisy input embeddings, and differential privacy (DP). None of these methods could fully prevent internal state inversion without significantly degrading the model’s utility. For instance, even with dropout probabilities or noise levels high enough to push the model’s performance close to random, a substantial amount of input information could still be recovered.
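As a rough illustration of the noise- and dropout-style defenses evaluated, the internal states could be perturbed before they leave the trusted boundary, as sketched below; the parameters are illustrative, and the study’s point is precisely that values strong enough to block inversion also ruin utility:

```python
import torch

def perturb_internal_states(states, sigma=0.1, dropout_p=0.0):
    """Illustrative defense: add Gaussian noise and/or dropout to internal
    states before exposing them. Per the study, settings strong enough to
    stop inversion also degrade the model's outputs to near-random."""
    noisy = states + sigma * torch.randn_like(states)        # Gaussian noise
    if dropout_p > 0:
        mask = (torch.rand_like(noisy) > dropout_p).float()  # random zeroing
        noisy = noisy * mask / (1.0 - dropout_p)              # rescale as in dropout
    return noisy
```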
This research highlights a critical privacy vulnerability in LLMs, especially in scenarios like collaborative inference or safety auditing where internal states are exposed. The findings suggest that directly protecting internal states is insufficient and that more comprehensive safeguards, possibly involving cryptographic tools or confidential computing, or even architectural changes to LLMs, may be necessary to truly mitigate this risk. For more technical details, you can refer to the full research paper here.


