TLDR: A new study shows that Large Language Models (LLMs) can have their original input prompts reconstructed from their internal states (ISs), even from deep layers and for very long texts. The researchers developed four inversion attacks (two white-box, two black-box) that substantially improve the accuracy of recovering sensitive inputs. They also found that common defenses such as quantization and differential privacy cannot prevent inversion without severely degrading the LLM’s utility, highlighting a critical privacy vulnerability in how LLMs process and expose internal data.
Large Language Models (LLMs) have become an integral part of our daily lives, powering everything from chatbots to coding assistants. However, their widespread adoption also brings significant privacy and safety concerns. A recent research paper, titled “Depth Gives a False Sense of Privacy: LLM Internal States Inversion,” delves into a critical vulnerability: the ability to reconstruct a user’s original input prompt from the LLM’s internal states (ISs).
Traditionally, these internal states, the intermediate representations an LLM generates as it processes a prompt, were assumed to be impossible to invert back to the original input. This assumption rested on the idea that deep layers hold highly abstract representations and that optimization challenges would prevent such a reversal. The new research challenges that very notion.
The Challenge of Inverting Internal States
The paper highlights two main reasons why inverting LLM internal states is particularly challenging. First, ISs are designed for subsequent inference and contain abstract logical representations, making them inherently different from simpler text embeddings or model outputs that have a more direct semantic relevance to the input. Second, modern LLMs are significantly larger, with more layers, greater width, and much larger vocabularies than older language models, which further complicates the inversion process, especially for deeper layers where information loss can occur.
Novel Inversion Attacks
To demonstrate this privacy risk, the researchers developed four novel inversion attacks, adapting to different levels of access to the LLM’s internal workings:
- White-Box Optimization-Based Attacks: These attacks target scenarios where the attacker has full knowledge of the model’s weights (for example, an honest-but-curious inference server that follows the protocol but inspects the internal states it handles).
The first, called Embedding Recovery (ER), is tailored to shallow internal states. It optimizes a dummy input embedding until the model’s intermediate activations match the observed internal states, then maps each optimized embedding to the closest token in the vocabulary. For deeper layers, ER can struggle because the gradients become unstable.
To address this, the second white-box attack, Token Basis Selection (TBS), instead searches for the projection values onto an orthogonal basis of the input embedding space and composes the inverted embeddings from them. This greatly reduces the search space and stabilizes the optimization, making the attack effective even for deep layers (a minimal sketch of this optimization-based approach appears after this list).
- Black-Box Attacks: These attacks are for more practical scenarios where the attacker can only observe the internal states without knowing the model’s weights (like a third-party auditor).
The first black-box approach involves Model Type Identification. Since many deployed LLMs are derived from open-source base models through fine-tuning or merging, the attacker first tries to identify that base model. If successful, they can replicate the target model’s internal states with a surrogate LLM and then apply the optimization-based attacks.
If the model is fully closed-source or diverges too far from any known base, a Generation-Based Attack is employed instead. This method treats inversion as a translation task: an encoder-decoder model (similar to those used in machine translation) “translates” the observed internal states back into input tokens, with a projection module aligning the internal states to the encoder’s embedding space to improve inversion accuracy (sketched below, after the white-box example).
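To make the white-box idea concrete, here is a minimal sketch of ER-style optimization-based inversion, assuming a Hugging Face-style causal LM that exposes its hidden states; the function name, hyperparameters, and loss choice are illustrative assumptions, not the paper’s exact algorithm:

```python
import torch

def invert_internal_states(model, observed_states, layer_idx, seq_len,
                           steps=500, lr=0.1):
    """Sketch of ER-style white-box inversion: optimize a dummy input
    embedding until the model's hidden states at `layer_idx` match the
    observed internal states, then map each position to its nearest token."""
    hidden_dim = model.config.hidden_size
    # Dummy input embeddings to optimize (white-box: gradients flow through the model).
    dummy = torch.randn(1, seq_len, hidden_dim, requires_grad=True)
    optimizer = torch.optim.Adam([dummy], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        out = model(inputs_embeds=dummy, output_hidden_states=True)
        # Match the observed internal states at the exposed layer.
        loss = torch.nn.functional.mse_loss(out.hidden_states[layer_idx],
                                            observed_states)
        loss.backward()
        optimizer.step()

    # Token recovery: nearest neighbor in the model's input embedding matrix.
    emb_matrix = model.get_input_embeddings().weight        # (vocab, hidden)
    dists = torch.cdist(dummy.detach()[0], emb_matrix)      # (seq_len, vocab)
    return dists.argmin(dim=-1)                             # recovered token ids
```

TBS differs mainly in how the dummy embeddings are parameterized: rather than optimizing them freely, it searches for projection coefficients onto an orthogonal basis of the input embedding space, which shrinks the search space and keeps the gradients stable even at deep layers.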
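The generation-based black-box attack can likewise be sketched as an encoder-decoder “translation” model with a projection module in front of it; the choice of T5 and all module names here are assumptions for illustration, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration

class GenerationInverter(nn.Module):
    """Sketch of a generation-based inverter: a projection module maps the
    observed internal states into the encoder's embedding space, and an
    encoder-decoder model is trained to 'translate' them back into the
    original input tokens."""
    def __init__(self, is_dim, seq2seq_name="t5-base"):
        super().__init__()
        self.seq2seq = T5ForConditionalGeneration.from_pretrained(seq2seq_name)
        enc_dim = self.seq2seq.config.d_model
        # Projection module aligning internal states with the encoder's input space.
        self.proj = nn.Sequential(nn.Linear(is_dim, enc_dim), nn.GELU(),
                                  nn.Linear(enc_dim, enc_dim))

    def forward(self, internal_states, labels=None):
        # internal_states: (batch, seq_len, is_dim) captured from the target LLM.
        return self.seq2seq(inputs_embeds=self.proj(internal_states),
                            labels=labels)

    @torch.no_grad()
    def invert(self, internal_states, max_len=512):
        # Generate the reconstructed prompt token ids from the captured states.
        return self.seq2seq.generate(inputs_embeds=self.proj(internal_states),
                                     max_length=max_len)
```

The inverter would be trained on pairs of captured internal states and their original prompts; at attack time, `invert` generates the reconstructed prompt directly from the observed states.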
Effectiveness and Implications
The extensive evaluation of these attacks on six real-world LLMs, using both short and long prompts from sensitive datasets such as medical consulting and coding assistance, validated their effectiveness. Notably, the TBS attack nearly perfectly inverted a 4,112-token medical consulting prompt from the middle layer of a Llama-3 model, achieving a token-matching F1 score of 86.88. The generation-based attack also achieved high F1 scores for medium-length inputs.
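For context, the token-matching F1 reported above is essentially the harmonic mean of token precision and recall between the recovered and the original prompt; a minimal illustrative version (the paper’s exact tokenization and matching rules may differ):

```python
from collections import Counter

def token_f1(predicted_tokens, reference_tokens):
    """Token-level F1: harmonic mean of precision and recall over the
    multiset overlap between recovered and original prompt tokens."""
    overlap = Counter(predicted_tokens) & Counter(reference_tokens)
    n_common = sum(overlap.values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(predicted_tokens)
    recall = n_common / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)
```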
The study found that domain-specialized models, like Qwen2.5-Coder, can be even more susceptible to inversion. Furthermore, larger models (up to 70B parameters) were also vulnerable, sometimes yielding even better inversion results due to more information being retained in their wider internal states.
Challenges for Defenses
The researchers also evaluated four practical defenses: quantization, dropout, noisy input embeddings, and differential privacy (DP). None of these methods could fully prevent internal state inversion without significantly degrading the model’s utility. For instance, even with dropout probabilities or noise levels high enough to push the model’s performance close to random, a substantial amount of input information could still be recovered.
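As a rough illustration of the noise- and dropout-style defenses evaluated, the internal states could be perturbed before they leave the trusted boundary, as sketched below; the parameters are illustrative, and the study’s point is precisely that values strong enough to block inversion also ruin utility:

```python
import torch

def perturb_internal_states(states, sigma=0.1, dropout_p=0.0):
    """Illustrative defense: add Gaussian noise and/or dropout to internal
    states before exposing them. Per the study, settings strong enough to
    stop inversion also degrade the model's outputs to near-random."""
    noisy = states + sigma * torch.randn_like(states)        # Gaussian noise
    if dropout_p > 0:
        mask = (torch.rand_like(noisy) > dropout_p).float()  # random zeroing
        noisy = noisy * mask / (1.0 - dropout_p)              # rescale as in dropout
    return noisy
```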
This research highlights a critical privacy vulnerability in LLMs, especially in scenarios like collaborative inference or safety auditing where internal states are exposed. The findings suggest that directly protecting internal states is insufficient and that more comprehensive safeguards, possibly involving cryptographic tools or confidential computing, or even architectural changes to LLMs, may be necessary to truly mitigate this risk. For more technical details, you can refer to the full research paper here.


