AI Models Comprehend Ciphers, But Struggle to Reason in Them

TLDR: A new research paper by Shiyuan Guo, Henry Sleight, and Fabien Roger reveals that while current large language models (LLMs) can accurately translate ciphered text, their reasoning ability significantly declines when they are required to perform computations or complex thought processes in that same ciphered language. The study, which tested 28 different ciphers across multiple models, found that reasoning capability correlates with cipher prevalence in pre-training data and improves very slowly with additional fine-tuning. This suggests that using ciphered reasoning to evade AI monitoring is currently an ineffective strategy for attackers and misaligned models, offering a measure of safety for Chain-of-Thought monitoring systems.

As artificial intelligence agents become more integrated into our daily lives, ensuring their safety and preventing harmful actions is paramount. One crucial method for monitoring AI behavior is Chain-of-Thought (CoT) monitoring, which involves examining the step-by-step reasoning processes of AI models. This allows developers and oversight systems to detect adversarial attacks or signs of misalignment.

However, a recent study explores a potential vulnerability: what if malicious AI agents or attackers try to hide their reasoning in ‘ciphered language’? This refers to text that is encrypted, translated into another language, or compressed, making it illegible to standard CoT monitors. The research, titled ‘Current Language Models Struggle to Reason in Ciphered Language’, investigates whether current large language models (LLMs) are capable of performing such ‘ciphered reasoning’.

The study, conducted by Shiyuan Guo, Henry Sleight, and Fabien Roger, tested up to 10 different models across 28 different ciphers. The authors fine-tuned and prompted these models to reason in each cipher, using accuracy on math problems as a benchmark for reasoning ability. The findings reveal a significant asymmetry: while models can accurately translate ciphered text back to English, demonstrating comprehension, their ability to reason in that same ciphered text drops considerably. Even advanced models struggle with less common ciphers, though they perform better with well-known ones like rot13.
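For readers unfamiliar with rot13, the following minimal Python sketch shows what such a cipher does to a piece of text. It is only an illustration of the cipher itself, not the paper's evaluation code; the sample sentence is made up for this example.

```python
import codecs


def rot13(text: str) -> str:
    """Rotate each ASCII letter 13 places; digits and punctuation are unchanged."""
    return codecs.encode(text, "rot_13")


# Hypothetical sample sentence, not taken from the study.
plain = "Add 17 and 25, then double the result."
ciphered = rot13(plain)

print(ciphered)                  # Nqq 17 naq 25, gura qbhoyr gur erfhyg.
print(rot13(ciphered) == plain)  # True: applying rot13 twice restores the text
```

Decoding the ciphered line back to English corresponds to the translation task that models handle well. The harder task described in the study is producing the intermediate reasoning steps directly in the ciphered form.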

This suggests that the prevalence of a cipher in a model’s pre-training data plays a crucial role. Ciphers that appear more frequently in the data correlate with better reasoning performance. The researchers also identified scaling laws, indicating that improving ciphered reasoning capabilities requires a vast amount of additional fine-tuning data: billions of tokens for even simple ciphers to approach the accuracy of plain-text reasoning. This makes teaching a model to reason in cipher a prohibitively expensive tactic for black-box adversarial fine-tuning attacks.

The researchers also examined which qualitative characteristics of ciphers affect reasoning. They found that ciphers involving significant syntactic and lexical transformations (like changing word structure or replacing content) severely degrade a model’s reasoning ability. In contrast, stylistic changes or the addition of distractor text had a relatively minor impact.
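To make the contrast concrete, here is a small Python sketch of the two kinds of transformation. Both functions are hypothetical illustrations written for this article, not ciphers taken from the paper: `reverse_each_word` changes word structure, while `add_distractors` leaves each reasoning step intact and only interleaves unrelated text.

```python
def reverse_each_word(step: str) -> str:
    """Lexical transformation: reverse the letters inside every word."""
    return " ".join(word[::-1] for word in step.split(" "))


def add_distractors(steps: list[str], distractor: str) -> list[str]:
    """Distractor transformation: keep each step as-is, add filler after it."""
    out = []
    for step in steps:
        out.append(step)
        out.append(distractor)
    return out


# Hypothetical reasoning steps, made up for this example.
steps = ["two plus two is four", "four times three is twelve"]

print([reverse_each_word(s) for s in steps])
print(add_distractors(steps, "The sky is blue."))
```

In the first output every word is altered, so the model has to work with unfamiliar tokens at each step. In the second, the original steps are still readable and only surrounded by noise, which matches the kind of change the study reports as having a relatively minor effect.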

These findings offer a reassuring insight: for now, evading CoT monitoring through ciphered reasoning appears to be an ineffective strategy for current AI models. This is because models struggle to maintain their reasoning prowess when forced to think in encrypted or transformed text. The study also provides valuable guidance for future AI development, suggesting that filtering ciphered text from pre-training data could help prevent the emergence of this capability in more advanced models.

While the study has limitations, such as not attempting to design ciphers specifically to maximize reasoning while minimizing legibility, it marks a significant step in understanding the boundaries of AI reasoning in non-standard linguistic forms. It highlights the ongoing challenge of ensuring AI safety and transparency as models become more sophisticated.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
