TLDR: A new research paper demonstrates a two-stage method to create high-fidelity clones of black-box Large Language Models (LLMs) by exploiting ‘logit leakage’ from their APIs. The attack first reconstructs the LLM’s output projection matrix using Singular Value Decomposition (SVD) from under 10,000 queries. Then, it uses knowledge distillation to train a ‘student’ model to mimic the original LLM’s internal reasoning. This process, completed in under 24 GPU hours, can replicate a 6-layer teacher model with 97.6% hidden-state geometry fidelity and minimal performance degradation, highlighting critical security vulnerabilities in current LLM deployments.
Large Language Models (LLMs) are becoming increasingly vital in critical sectors, from satellite operations to military decision support. These powerful AI systems are often accessed through Application Programming Interfaces (APIs). However, a recent study reveals a significant, often overlooked, vulnerability: when these APIs expose internal model predictions, known as ‘logits’, they create an opening for adversaries to replicate the LLM.
A new research paper titled “Clone What You Can’t Steal: Black-Box LLM Replication via Logit Leakage and Distillation” by Kanchon Gharami, Hansaka Aluvihare, Shafika Showkat Moni, and Berker Peköz from Embry-Riddle Aeronautical University, addresses this critical security gap. The authors introduce a novel, two-stage method that transforms this partial ‘logit leakage’ into a fully functional, deployable clone of the original black-box LLM. You can read the full paper here.
The Vulnerability: Logit Leakage
Many LLM systems are accessed remotely, and their APIs might inadvertently expose ‘top-k logits’ – essentially, the raw scores for the most probable next words the model is considering. While previous research focused on reconstructing only the final output layer or mimicking surface-level behaviors, this new work demonstrates how to create a comprehensive clone that replicates the target model’s underlying reasoning and generalization capabilities, even under strict query limits.
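To make ‘top-k logits’ concrete, here is a minimal Python sketch of what such an API response might contain. The token scores and the `top_k_logits` helper are hypothetical and chosen only for illustration; they do not come from the paper or from any real API.

```python
def top_k_logits(logits, k):
    """Return the k highest-scoring (token, logit) pairs, best first."""
    ranked = sorted(logits.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical raw scores for the next token after some prompt.
full_logits = {"orbit": 9.1, "service": 6.4, "safe": 5.8, "the": 2.3, "banana": -4.0}

# An API that leaks top-3 logits would expose only this slice of the full vector.
exposed = top_k_logits(full_logits, k=3)
print(exposed)  # [('orbit', 9.1), ('service', 6.4), ('safe', 5.8)]
```

Each response reveals only a few raw scores, but those scores are exact linear functions of the model's final hidden state, which is what makes them useful to an attacker.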
A Two-Stage Replication Process
The researchers developed a two-stage pipeline to achieve this high-fidelity replication:
Stage 1: Stealing the Output Projection Matrix. The first step involves reconstructing the LLM’s ‘output projection matrix’. This is the final component of the model that translates its internal, abstract understanding into concrete word predictions. By sending fewer than 10,000 carefully crafted queries to the black-box LLM and collecting the top-k logits, the researchers use a mathematical technique called Singular Value Decomposition (SVD) to effectively reverse-engineer this crucial layer. This process is akin to figuring out how the model converts its ‘thoughts’ into actual words, without ever seeing its internal structure.
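The intuition behind the SVD step can be sketched on a toy model. The NumPy example below is an illustration under simplifying assumptions, not the authors' implementation: it assumes the attacker has already assembled full logit vectors (the paper works from top-k logits, which takes additional effort), and the sizes, `W_true`, and `query_logits` are hypothetical stand-ins. Because every logit vector is the output projection matrix applied to a hidden state, a stack of logit vectors has rank equal to the hidden dimension, and its singular vectors span the same subspace as the secret matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_queries = 200, 16, 64

# Hypothetical stand-in for the black-box model's secret output projection.
W_true = rng.normal(size=(vocab_size, hidden_dim))

def query_logits(n):
    """Simulate n API responses: each row is W_true @ h for an unknown hidden state h."""
    H = rng.normal(size=(n, hidden_dim))
    return H @ W_true.T

Q = query_logits(n_queries)                      # shape (n_queries, vocab_size)
U, S, Vt = np.linalg.svd(Q, full_matrices=False)

# Singular values collapse to ~0 after the hidden dimension, which exposes it.
est_rank = int(np.sum(S > 1e-8 * S[0]))

# The leading right-singular vectors span the column space of W_true, so they
# recover it up to an unknown invertible (hidden_dim x hidden_dim) transform.
W_est = Vt[:est_rank].T                          # shape (vocab_size, est_rank)

# Sanity check (only possible in a simulation): W_true lies in the recovered subspace.
G, *_ = np.linalg.lstsq(W_est, W_true, rcond=None)
rel_err = np.linalg.norm(W_est @ G - W_true) / np.linalg.norm(W_true)
print(est_rank, rel_err)
```

In this toy setting the estimated rank matches the hidden dimension and the residual is at floating-point precision. The recovered matrix is only determined up to a linear change of basis, so the rest of the model still has to be learned around it.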
Stage 2: Cloning the Remaining Architecture via Knowledge Distillation. The remaining internal layers of the LLM (the transformer blocks) cannot be extracted directly from API outputs. Instead, with the stolen output projection matrix held fixed, the researchers employ a technique called ‘knowledge distillation’. They train a ‘student’ model, which can match the teacher’s depth or be shallower, on publicly available datasets. This student model learns to mimic the behavior and internal reasoning patterns of the original ‘teacher’ LLM by observing its outputs for the same inputs. The goal is not just to copy the output, but to replicate the teacher’s deeper understanding and generalization abilities.
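A common way to express ‘learning from the teacher’s outputs’ is a temperature-scaled KL divergence between the teacher’s and the student’s next-token distributions. The sketch below shows that standard distillation loss in plain Python. It is an assumption that the paper uses something of this shape; the article does not give the exact objective, and the logits and temperature here are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over a 4-token vocabulary.
teacher = [4.0, 1.5, 0.5, -2.0]
good_student = [3.8, 1.6, 0.4, -1.9]
poor_student = [0.0, 0.0, 3.0, 0.0]

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, poor_student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it over many inputs pushes the student toward the teacher's behavior.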
Impressive Results and Efficiency
The findings are striking. A 6-layer student model, for example, successfully recreated 97.6% of the 6-layer teacher model’s internal ‘hidden-state geometry’ – essentially, its internal thought processes. This was achieved with only a modest 7.31% increase in ‘perplexity’ (a measure of how well a language model predicts text, where lower is better). Even more efficiently, a smaller 4-layer variant achieved 17.1% faster inference and an 18.1% reduction in parameters, while still maintaining comparable performance.
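For readers unfamiliar with perplexity, the short sketch below shows how it is computed and what a 7.31% increase means in practice. The token probabilities and the teacher perplexity of 20.0 are hypothetical values picked for the arithmetic; the article reports only the relative increase, not the absolute scores.

```python
import math

def perplexity(token_log_probs):
    """Perplexity is the exponential of the average negative log-likelihood per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

def relative_increase(baseline, value):
    """Percentage increase of value over baseline."""
    return (value - baseline) / baseline * 100.0

# A model that assigns probability 0.25 to every correct token has perplexity 4:
# it is as uncertain as a uniform guess among four options.
uniform_guess = perplexity([math.log(0.25)] * 4)

# Hypothetical absolute values that correspond to a 7.31% increase.
teacher_ppl = 20.0
student_ppl = 21.462
print(uniform_guess, relative_increase(teacher_ppl, student_ppl))
```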
Crucially, the entire replication attack was completed in under 24 GPU hours and used fewer than 10,000 queries, effectively avoiding common API rate-limit defenses. The cloned models also demonstrated strong generalization capabilities, performing well on unseen data, confirming they captured the teacher model’s latent reasoning rather than just memorizing training prompts.
Urgent Implications for LLM Security
This research highlights an urgent need for enhanced security measures for LLM inference APIs and secure on-premise deployments. The ability of a cost-limited adversary to quickly clone an LLM raises significant concerns:
- Circumvention of Alignment Safeguards: Cloned models could potentially bypass safety and ethical guidelines built into the original LLM.
- Unauthorized Redistribution: Proprietary models could be replicated and distributed without permission.
- Sensitive Data Leakage: Clones might inadvertently expose sensitive data that the original model had memorized during its training.
The authors emphasize that this work should serve as a reference point for interdisciplinary efforts to secure LLM deployments in high-stakes environments. Future research will explore defenses like adaptive noise injection, behavioral fingerprinting, and secure on-premise inference protocols to counter these emerging threats.
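As a rough illustration of why noise injection is a candidate defense, the NumPy sketch below perturbs simulated logits with small Gaussian noise and compares the numerical rank before and after. This is a simplified, non-adaptive toy under stated assumptions, not a defense proposed or evaluated in the paper: the matrix sizes, the noise scale `sigma`, and the helper names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, hidden_dim, n_queries = 200, 16, 64

# Hypothetical secret output projection and hidden states (toy stand-ins).
W = rng.normal(size=(vocab_size, hidden_dim))
H = rng.normal(size=(n_queries, hidden_dim))
clean_logits = H @ W.T                       # exactly rank hidden_dim

def add_noise(logits, sigma):
    """Perturb every returned logit with independent Gaussian noise."""
    return logits + rng.normal(scale=sigma, size=logits.shape)

def numerical_rank(matrix, rel_tol=1e-8):
    """Count singular values above a relative tolerance."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

clean_rank = numerical_rank(clean_logits)
noisy_rank = numerical_rank(add_noise(clean_logits, sigma=0.01))
print(clean_rank, noisy_rank)
```

Without noise, the rank reveals the hidden dimension exactly; with noise, the matrix becomes numerically full rank. A determined attacker could still look for a gap in the singular value spectrum or average repeated queries, which is presumably why the authors point to adaptive schemes rather than fixed noise.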


