TLDR: This paper investigates how increasing inference-time computation affects the robustness of open-source LLMs. While it generally improves robustness against prompt injection and extraction when reasoning steps are hidden, the study reveals an “inverse scaling law”: robustness significantly decreases if intermediate reasoning steps are exposed to adversaries. It also highlights persistent vulnerabilities even when reasoning chains are concealed, particularly with tool-integrated models and advanced extraction attacks, urging careful consideration of deployment contexts.
Large Language Models, or LLMs, have become incredibly powerful, and a key area of research focuses on how to make them even better. One promising method is ‘inference-time scaling,’ which means giving the model more computational power during the moment it’s generating a response, rather than just during its initial training. This approach has shown great potential for boosting LLM capabilities, from complex agent interactions to mathematical problem-solving.
Recent studies, particularly by Zaremba et al. (2025), highlighted that increasing this inference-time computation could significantly enhance the robustness of large, proprietary LLMs against various adversarial attacks. This suggested a powerful new way to make LLM deployments more secure.
However, a new research paper titled “Does More Inference-Time Compute Really Help Robustness?” by Tong Wu, Chong Xiang, Jiachen T. Wang, Weichen Yu, Chawin Sitawarin, Vikash Sehwag, and Prateek Mittal, delves deeper into this topic, addressing critical unanswered questions. Their work systematically investigates how inference-time scaling impacts smaller, open-source reasoning models and, crucially, examines a hidden assumption in prior research: that intermediate reasoning steps are always concealed from potential attackers.
Boosting Robustness with Hidden Reasoning
The researchers first explored whether open-source models like DeepSeek R1, Qwen3, and Phi-reasoning could also benefit from inference-time scaling. They used a straightforward method called ‘budget forcing,’ which essentially controls the length of the reasoning chain an LLM generates before providing a final answer. By increasing this ‘thinking budget,’ they found that these smaller models indeed showed improved robustness, especially against ‘prompt injection’ and ‘prompt extraction’ attacks.
Prompt injection involves embedding malicious instructions within normal input to override the model’s intended behavior (e.g., making it send sensitive data). Prompt extraction, on the other hand, tricks the model into revealing confidential internal instructions or data. The study observed that with more reasoning tokens, models became better at ignoring low-priority malicious instructions and resisting attempts to leak secrets. This aligns with previous findings on proprietary models, extending the applicability of inference-time scaling to a broader range of LLMs.
However, for ‘harmful requests’ (e.g., asking for instructions on illegal activities), the benefits were limited. Models maintained stable robustness but didn’t show significant improvement, suggesting that for inherently ambiguous or unsafe queries, extended reasoning might not be as effective.
The Inverse Scaling Law: When Reasoning is Exposed
The most critical finding of the paper emerges when the implicit assumption of hidden reasoning chains is relaxed. What if adversaries can see the intermediate steps an LLM takes to arrive at its answer? This scenario is relevant for open-source systems or even some commercial APIs that expose these steps.
The researchers hypothesized that exposing reasoning chains would fundamentally change the relationship between inference-time computation and robustness. Their experiments confirmed this: when intermediate reasoning steps were accessible, increasing inference-time computation consistently *reduced* model robustness across all three adversarial settings (prompt injection, prompt extraction, and harmful requests). This is what they term an “inverse scaling law.”
The reason is intuitive: longer, exposed reasoning chains simply provide more opportunities for malicious tokens or sensitive information to appear, making the model more vulnerable. For instance, in prompt extraction, if a secret key appears in the intermediate reasoning, an attacker can directly observe and extract it. Similarly, for harmful requests, an attacker might extract detailed unsafe instructions from the reasoning chain, even if the final answer is a refusal.
Vulnerabilities Even When Reasoning is Hidden
The paper further argues that simply hiding reasoning chains doesn’t solve all robustness issues. Two key scenarios highlight persistent vulnerabilities:
First, modern LLMs increasingly integrate ‘tool-use capabilities,’ allowing them to call external APIs or tools during their reasoning process. Even if the reasoning chain is hidden, an attacker might craft a prompt that triggers an unintended or malicious API call during an intermediate step. The study simulated this and found that robustness against such prompt injection attacks degraded as inference-time computation increased, as longer reasoning chains provided more opportunities to trigger unsafe tool interactions.
Second, even intentionally hidden reasoning chains can be extracted by determined adversaries. The paper references a red-teaming competition where participants successfully revealed internal reasoning steps from proprietary models. This suggests that longer reasoning chains, even when hidden, expand the attack surface and offer more chances for adversaries to reconstruct or infer sensitive internal logic.
Also Read:
- Unmasking Stealthy Data Leaks: How Multi-Stage Prompt Attacks Target Enterprise AI
- Large Language Models: A New Frontier in Cybersecurity
Conclusion: A Complex Trade-Off
The findings collectively demonstrate that the robustness benefits of inference-time scaling are highly dependent on the adversarial setting and how the model is deployed. While increasing inference-time computation can enhance robustness when reasoning chains are hidden, it can be counterproductive if these chains are exposed. Furthermore, new attack vectors emerge with tool-integrated reasoning and advanced extraction techniques, even when reasoning is concealed.
This research urges practitioners to carefully weigh these subtle trade-offs before applying inference-time scaling in security-sensitive, real-world LLM applications, paving the way for more secure and robust AI systems.


