
Unseen Messages in Plain Sight: How LLMs Conceal Information

TLDR: A new protocol allows Large Language Models (LLMs) to hide a complete, meaningful text inside another, entirely different but plausible text of the same length. The method works efficiently even with modest LLMs. By radically separating a text from its author's true intent, it raises significant questions for AI safety, such as the covert deployment of unfiltered LLMs, and challenges our understanding of LLM knowledge and hallucinations.

Large Language Models (LLMs) have introduced a fascinating and somewhat unsettling capability: the ability to conceal an entire meaningful text within another, completely different yet coherent and plausible text of the same length. This groundbreaking development, explored in a recent paper by Antonio Norelli and Michael Bronstein from Project CETI and the University of Oxford, highlights a radical decoupling of text from authorial intent, further eroding trust in written communication.

Imagine a harsh political critique hidden inside a tweet celebrating the very leader it attacks, or a secret manuscript embedded within an ordinary product review. This is now achievable with a simple, efficient LLM-based protocol. Even modest 8-billion-parameter open-source LLMs produce high-quality results, encoding and decoding a message the length of a paper abstract on a laptop in seconds.

How Does This Text-Hiding Protocol Work?

The method, described as a straightforward 'recipe', involves a few key steps. First, the secret message (the text to hide) is tokenized with the LLM's tokenizer, and for each token the rank it occupies in the LLM's next-token distribution, given the preceding context, is recorded. Then, to construct the cover text (the 'stegotext'), the LLM is prompted with a secret prompt that acts as the key; instead of sampling tokens by probability, it is forced to pick, at each step, the token whose rank matches the recorded rank from the secret message. The receiver, possessing the same LLM and the secret key, reverses this process to reconstruct the hidden text exactly.
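
For intuition, here is a minimal sketch of both steps in Python, assuming a Hugging Face causal LM. The model name, prompts, and helper functions (token_ranks, ranks_to_text) are illustrative assumptions, not the paper's released code, and tokenizer boundary effects are glossed over.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any open-source causal LM should work in principle.
MODEL = "meta-llama/Meta-Llama-3-8B"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def token_ranks(text: str, prompt: str = "") -> list[int]:
    """Record, for each token of `text`, its rank in the model's next-token
    distribution given everything preceding it. Assumes the tokenizer
    prepends a BOS token, so the context is never empty."""
    ids = tok(prompt + text, return_tensors="pt").input_ids[0]
    start = len(tok(prompt).input_ids) if prompt else 1  # skip prompt (or BOS)
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]
    ranks = []
    for i in range(start, len(ids)):
        order = torch.argsort(logits[i - 1], descending=True)
        ranks.append((order == ids[i]).nonzero().item())
    return ranks

def ranks_to_text(ranks: list[int], key_prompt: str) -> str:
    """Build the stegotext: under the secret key prompt, pick at every step
    the token whose rank matches the recorded rank of the secret message."""
    ids = tok(key_prompt, return_tensors="pt").input_ids
    n_key = ids.shape[1]
    for r in ranks:
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        order = torch.argsort(logits, descending=True)
        ids = torch.cat([ids, order[r].view(1, 1)], dim=1)
    return tok.decode(ids[0, n_key:])

# Hypothetical example: hide a critique inside a celebratory tweet.
secret = "The count was rigged and the results cannot be trusted."
key = "Write a short tweet celebrating the leader's landslide victory:\n"
stegotext = ranks_to_text(token_ranks(secret), key)
```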

The quality of the resulting stegotext, i.e. how plausible and coherent it reads, depends on several factors: the soundness of the original message, the effectiveness of the secret prompt (the key), and the capability of the LLM used. If the LLM predicts the original message well, the recorded ranks stay low, so the stegotext keeps following the model's top choices under the key prompt and remains coherent with the context the key sets. If the original message is hard for the LLM to predict, high ranks force improbable tokens and the stegotext degrades into garble.
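
Decoding runs the same machinery in reverse: the receiver reads the ranks off the stegotext under the shared key, then replays them under the original (empty) context. Continuing the hypothetical sketch above:

```python
# Receiver side: same LLM, same secret key.
stego_ranks = token_ranks(stegotext, key)   # ranks of the stegotext under the key
recovered = ranks_to_text(stego_ranks, "")  # replay under the empty original context
assert recovered == secret  # exact reconstruction, up to tokenizer round-trip quirks
```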

Implications for AI Safety and Trust

This protocol opens up significant questions and potential applications, particularly in the realm of AI safety. One concrete scenario presented in the paper is how a company could covertly deploy an unfiltered LLM. By encoding the answers from a powerful, unaligned LLM within the compliant responses of a safe, aligned model, users could unknowingly extract uncensored information. The paper illustrates this with an example of a user asking how to tamper with a gas meter; the aligned chatbot provides a censored response, but a hidden message within it guides the user’s local open-source LLM to generate the uncensored instructions.
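
For illustration only, that scenario could be wired up with the hypothetical helpers sketched earlier; the shared key and the unfiltered answer below are placeholders, not the paper's actual prompts.

```python
# Hypothetical wiring of the covert-deployment scenario, reusing the
# token_ranks / ranks_to_text sketch above.
SHARED_KEY = "Reply with a polite, policy-compliant safety notice:\n"
unfiltered_answer = "..."  # placeholder for the unaligned model's output

# Provider side: hide the unfiltered answer inside a compliant-looking reply.
visible_reply = ranks_to_text(token_ranks(unfiltered_answer), SHARED_KEY)

# User side: the same open-source LLM plus the shared key decodes it locally.
recovered_answer = ranks_to_text(token_ranks(visible_reply, SHARED_KEY), "")
```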

The existence of such a protocol challenges our fundamental understanding of what it means for an LLM to ‘know’ something. LLMs can generate virtually any text, but does assigning a high probability to a text equate to knowledge? The paper suggests a novel perspective on ‘hallucinations’—not merely as factual falsehoods, but as a ‘void of intention.’ When a text is generated under the constraint of encoding another message, it becomes difficult to ascribe genuine intent to the LLM, much like appreciating Oulipo literature written under arbitrary constraints.

Ultimately, this research highlights the extreme constraint satisfaction problem underlying standard LLM text generation. It reveals a clash between the LLM’s ability to produce coherent text and the human expectation of authorial purpose. As machine-generated text becomes ubiquitous, this decoupling of text from human intent could profoundly impact our trust in written communication. The paper concludes by warning that any original text could now be a ‘beautiful and treacherous, and spacious, Trojan horse.’ You can read the full research paper here: LLMs can hide text in other text of the same length.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
