Black-Box LLMs Vulnerable to Replication Through Logit Leakage

TLDR: A new research paper demonstrates a two-stage method to create high-fidelity clones of black-box Large Language Models (LLMs) by exploiting ‘logit leakage’ from their APIs. The attack first reconstructs the LLM’s output projection matrix using Singular Value Decomposition (SVD) from under 10,000 queries. Then, it uses knowledge distillation to train a ‘student’ model to mimic the original LLM’s internal reasoning. This process, completed in under 24 GPU hours, can replicate a 6-layer teacher model with 97.6% hidden-state geometry fidelity and minimal performance degradation, highlighting critical security vulnerabilities in current LLM deployments.

Large Language Models (LLMs) are becoming increasingly vital in critical sectors, from satellite operations to military decision support. These powerful AI systems are often accessed through Application Programming Interfaces (APIs). However, a recent study reveals a significant, often overlooked, vulnerability: when these APIs expose internal model predictions, known as ‘logits’, they create an opening for adversaries to replicate the LLM.

A new research paper titled “Clone What You Can’t Steal: Black-Box LLM Replication via Logit Leakage and Distillation” by Kanchon Gharami, Hansaka Aluvihare, Shafika Showkat Moni, and Berker Peköz of Embry-Riddle Aeronautical University addresses this critical security gap. The authors introduce a novel two-stage method that transforms this partial ‘logit leakage’ into a fully functional, deployable clone of the original black-box LLM.

The Vulnerability: Logit Leakage

Many LLM systems are accessed remotely, and their APIs might inadvertently expose ‘top-k logits’ – essentially, the raw scores for the most probable next words the model is considering. While previous research focused on reconstructing only the final output layer or mimicking surface-level behaviors, this new work demonstrates how to create a comprehensive clone that replicates the target model’s underlying reasoning and generalization capabilities, even under strict query limits.
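To make the term concrete, here is a minimal sketch of what ‘top-k logits’ look like. The six-token vocabulary, the scores, and the helper name `top_k_logits` are made up for illustration and do not correspond to any real API.

```python
import numpy as np

def top_k_logits(logits, k):
    """Return (token_ids, scores) for the k largest logits, highest first."""
    ids = np.argsort(logits)[::-1][:k]
    return ids, logits[ids]

# Toy vocabulary of 6 tokens with made-up raw scores (assumption, not real data).
logits = np.array([2.0, -1.0, 0.5, 3.0, 0.0, 1.0])

ids, scores = top_k_logits(logits, k=3)
print(ids)     # token ids of the three highest-scoring candidates
print(scores)  # their raw, unnormalized scores
```

An API that returns these raw scores reveals more than one that returns only the chosen word, because the scores are a direct linear readout of the model's final hidden state.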

A Two-Stage Replication Process

The researchers developed a two-stage pipeline to achieve this high-fidelity replication:

Stage 1: Stealing the Output Projection Matrix. The first step involves reconstructing the LLM’s ‘output projection matrix’. This is the final component of the model, which translates its internal, abstract understanding into concrete word predictions. By sending fewer than 10,000 carefully crafted queries to the black-box LLM and collecting the top-k logits, the researchers use a mathematical technique called Singular Value Decomposition (SVD) to effectively reverse-engineer this crucial layer. This process is akin to figuring out how the model converts its ‘thoughts’ into actual words, without ever seeing its internal structure.
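The intuition behind the SVD step can be sketched on toy data. This is a simplified illustration, not the paper's procedure: it assumes the attacker sees full logit vectors rather than only the top-k, and the sizes (`vocab`, `hidden`, `n_queries`) are hypothetical. Because every logit vector is a linear function of a much smaller hidden state, the stacked logits have low rank, and the SVD exposes both the hidden size and the subspace spanned by the projection matrix. The matrix is recovered only up to an invertible linear transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, n_queries = 50, 8, 200        # toy sizes (assumptions)

W_true = rng.normal(size=(vocab, hidden))    # secret output projection matrix
H = rng.normal(size=(n_queries, hidden))     # hidden states, never seen by the attacker
Y = H @ W_true.T                             # observed logits, one row per query

# The number of non-negligible singular values reveals the hidden dimension.
U, S, Vt = np.linalg.svd(Y, full_matrices=False)
est_hidden = int(np.sum(S > 1e-8 * S[0]))

# The leading right singular vectors form a basis for the column space of W_true.
W_est = Vt[:est_hidden].T                    # shape (vocab, est_hidden)

# Sanity check: W_true should be expressible as W_est @ A for some square matrix A.
A, *_ = np.linalg.lstsq(W_est, W_true, rcond=None)
residual = np.linalg.norm(W_est @ A - W_true) / np.linalg.norm(W_true)

print(est_hidden)   # matches the hidden size used to generate the data
print(residual)     # near zero: same subspace, different basis
```

In this sketch the attacker needs only more queries than the hidden dimension, which is consistent with the article's point that the query budget can stay small.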

Stage 2: Cloning the Remaining Architecture via Knowledge Distillation. With the output projection matrix recovered and frozen, the attacker still lacks the LLM’s remaining internal layers (the transformer blocks), which cannot be extracted directly. Instead, the researchers employ a technique called ‘knowledge distillation’. They train a smaller ‘student’ model on publicly available datasets. This student model learns to mimic the behavior and internal reasoning patterns of the original ‘teacher’ LLM by observing the teacher’s outputs for the same inputs. The goal is not just to copy the output, but to replicate the teacher’s deeper understanding and generalization abilities.
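A common way to express this kind of training signal is a temperature-softened KL divergence between the teacher's and the student's output distributions. The sketch below shows that generic loss with a frozen projection matrix. The article does not specify the paper's exact objective, so the loss form, the temperature, and all array shapes here are assumptions, and random arrays stand in for real model outputs.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over a batch of softened distributions."""
    log_p_t = log_softmax(teacher_logits / temperature)
    log_p_s = log_softmax(student_logits / temperature)
    kl = (np.exp(log_p_t) * (log_p_t - log_p_s)).sum(axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
W_stolen = rng.normal(size=(50, 8))        # frozen; stands in for the recovered matrix
student_hidden = rng.normal(size=(4, 8))   # would come from the trainable student blocks
teacher_logits = rng.normal(size=(4, 50))  # would come from the black-box API

# The student's hidden states pass through the frozen projection to produce logits.
student_logits = student_hidden @ W_stolen.T
loss = distillation_loss(student_logits, teacher_logits)
print(loss)  # positive; training would adjust the student blocks to reduce it
```

Only the student's transformer blocks would receive gradient updates in such a setup; keeping the projection fixed forces the student's hidden states into the same output space as the teacher's.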

Impressive Results and Efficiency

The findings are striking. A 6-layer student model, for example, successfully recreated 97.6% of the 6-layer teacher model’s internal ‘hidden-state geometry’ – essentially, its internal thought processes. This was achieved with only a modest 7.31% increase in ‘perplexity’ (a measure of how well a language model predicts text, where lower is better). Even more efficiently, a smaller 4-layer variant achieved 17.1% faster inference and an 18.1% reduction in parameters, while still maintaining comparable performance.
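For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability a model assigns to the correct next tokens, and the reported percentage is a relative increase over the teacher. The sketch below uses made-up numbers purely to show the arithmetic; they are not values from the paper.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_increase(student_ppl, teacher_ppl):
    """Percentage by which the student's perplexity exceeds the teacher's."""
    return 100.0 * (student_ppl - teacher_ppl) / teacher_ppl

# A model that gives probability 0.25 to every correct token has perplexity
# of about 4: it is as uncertain as a uniform choice among four tokens.
toy_ppl = perplexity([math.log(0.25)] * 4)

# Hypothetical perplexities chosen only to illustrate the percentage formula.
increase = relative_increase(student_ppl=110.0, teacher_ppl=100.0)
print(toy_ppl, increase)
```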

Crucially, the entire replication attack was completed in under 24 GPU hours and used fewer than 10,000 queries, effectively avoiding common API rate-limit defenses. The cloned models also demonstrated strong generalization capabilities, performing well on unseen data, confirming they captured the teacher model’s latent reasoning rather than just memorizing training prompts.

Urgent Implications for LLM Security

This research highlights an urgent need for enhanced security measures for LLM inference APIs and secure on-premise deployments. The ability of a budget-constrained adversary to quickly clone an LLM raises significant concerns:

Circumvention of Alignment Safeguards: Cloned models could potentially bypass safety and ethical guidelines built into the original LLM.
Unauthorized Redistribution: Proprietary models could be replicated and distributed without permission.
Sensitive Data Leakage: Clones might inadvertently expose sensitive data that the original model had memorized during its training.

The authors emphasize that this work should serve as a reference point for interdisciplinary efforts to secure LLM deployments in high-stakes environments. Future research will explore defenses like adaptive noise injection, behavioral fingerprinting, and secure on-premise inference protocols to counter these emerging threats.
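To give a rough sense of why noise injection is a candidate defense, the toy sketch below adds fixed Gaussian noise to simulated logits and compares the numerical rank before and after. This is not the authors' adaptive scheme, which the article does not describe; the noise scale, the sizes, and the helper `numerical_rank` are assumptions. It shows only that noise removes the exact low-rank signature. A determined attacker could still look for a gap in the singular values, so this is an intuition aid rather than evidence that the defense works.

```python
import numpy as np

def numerical_rank(matrix, rel_tol=1e-8):
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
vocab, hidden, n_queries = 50, 8, 200          # toy sizes (assumptions)

W = rng.normal(size=(vocab, hidden))           # secret projection matrix
H = rng.normal(size=(n_queries, hidden))       # hidden states
clean_logits = H @ W.T

# Hypothetical defense: perturb every returned logit with small Gaussian noise.
noisy_logits = clean_logits + rng.normal(scale=0.1, size=clean_logits.shape)

clean_rank = numerical_rank(clean_logits)      # equals the hidden size
noisy_rank = numerical_rank(noisy_logits)      # no longer exactly low rank
print(clean_rank, noisy_rank)
```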
