Delethink: Enabling LLMs to Think Longer with Linear Compute

TLDR: A new research paper introduces ‘Markovian Thinking’ and ‘Delethink,’ an RL environment that allows Large Language Models (LLMs) to perform long-chain-of-thought reasoning with linear computational cost and constant memory. By structuring reasoning into fixed-size chunks and carrying over only a small textual state, Delethink overcomes the quadratic scaling issues of traditional methods. This approach matches or surpasses existing LongCoT-RL performance, significantly reduces training costs, and enables superior test-time scaling, demonstrating a path towards highly efficient and scalable reasoning LLMs.

Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks, often by generating a ‘long chain of thought’ (LongCoT) before arriving at an answer. This approach, while powerful, comes with a significant drawback: the computational cost grows quadratically as the length of these thought processes increases. This means that as LLMs try to think longer, the resources required for training and inference skyrocket, making very long reasoning prohibitively expensive and slow.

A new research paper introduces a paradigm called ‘Markovian Thinking’ and its practical implementation, ‘Delethink,’ to address this challenge. The core idea is to decouple the length of an LLM’s thought process from the size of the context it needs to process at any given moment, so that the model can think for much longer without the per-step context growing along with it.

The Problem with Traditional LLM Reasoning

In standard Reinforcement Learning (RL) setups for LLMs, the ‘state’ of the model is typically the initial prompt combined with all the reasoning tokens generated so far. As the LLM thinks, this state continuously grows. For attention-based models, which are common in LLMs, each new token attends to the entire context, so total compute grows quadratically with the thinking length, and memory use grows along with the context. This ‘quadratic growth’ is the bottleneck preventing LLMs from engaging in truly extensive reasoning.
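As a rough illustration of where the quadratic term comes from, assume (this notation is not from the article) a prompt of |q| tokens, n reasoning tokens, and an attention cost per generated token proportional to the number of tokens it attends to, itself included:

```latex
\text{cost}_{\text{LongCoT}}(n) \;\propto\; \sum_{t=1}^{n} \bigl(|q| + t\bigr)
  \;=\; n\,|q| + \frac{n(n+1)}{2} \;=\; O(n^{2})
```

Under this simplified model, doubling n roughly quadruples the attention cost once n is much larger than |q|.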

Introducing Markovian Thinking and Delethink

The researchers propose ‘Markovian Thinking,’ a paradigm where the LLM’s policy advances its reasoning by conditioning on a constant-size state. This means that regardless of how long the model has been thinking, the amount of information it needs to actively process at any single step remains fixed. Delethink is an RL environment designed to train LLMs to become native Markovian Thinkers.

Here’s how Delethink works: Instead of generating one continuous, ever-growing chain of thought, reasoning is structured into a sequence of fixed-size ‘chunks.’ Within each chunk, the model thinks as usual. However, at the boundary of each chunk, the environment resets the context. The next chunk’s prompt is then reinitialized using the original query and a small ‘carryover’ of textual information from the end of the previous chunk. This carryover acts as the ‘textual Markovian state.’ Through RL, the LLM learns to write a concise, sufficient textual state at the end of each chunk, allowing it to seamlessly continue its reasoning after the context reset.
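The chunk-and-reset loop described above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the `delethink_rollout` function, the `Generate` interface, the `counting_model` stand-in for an LLM, and the tiny chunk and carryover sizes are all assumptions made for the example. It also assumes the carryover is simply the last few tokens of the previous context.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: (prompt tokens, max new tokens) -> new tokens.
Generate = Callable[[List[str], int], List[str]]


def delethink_rollout(
    query: List[str],
    generate: Generate,
    chunk_size: int = 8,
    carryover: int = 4,
    max_chunks: int = 5,
    stop_token: str = "<eos>",
) -> Tuple[List[str], int]:
    """Return (full reasoning trace, largest context the model ever saw)."""
    if not 0 < carryover < chunk_size:
        raise ValueError("need 0 < carryover < chunk_size")
    trace: List[str] = []
    state: List[str] = []  # textual Markovian state carried across resets
    peak_context = 0
    for _ in range(max_chunks):
        prompt = query + state            # context reset: query + carryover only
        budget = chunk_size - len(state)  # reasoning context never exceeds chunk_size
        new_tokens = generate(prompt, budget)
        trace.extend(new_tokens)
        peak_context = max(peak_context, len(prompt) + len(new_tokens))
        if stop_token in new_tokens:
            break
        # Keep only the tail of this chunk's reasoning as the next state.
        state = (state + new_tokens)[-carryover:]
    return trace, peak_context


def counting_model(prompt: List[str], max_new_tokens: int, target: int = 20) -> List[str]:
    """Toy 'model': continues counting from the last number in the prompt."""
    last = prompt[-1]
    start = int(last) + 1 if last.isdigit() else 0
    out: List[str] = []
    for n in range(start, start + max_new_tokens):
        if n >= target:
            out.append("<eos>")
            break
        out.append(str(n))
    return out


query = ["count", "to", "twenty"]
trace, peak_context = delethink_rollout(query, counting_model)
print(len(trace), peak_context)
```

The toy model can finish its count across several context resets because everything it needs to continue (the last number) sits in the carried-over tail, while `peak_context` stays bounded by the query length plus `chunk_size` no matter how long the trace gets.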

Significant Computational Benefits

Delethink’s design has a direct consequence: compute grows linearly and memory stays constant with respect to the total thinking length, whereas traditional LongCoT methods scale quadratically. For instance, the paper estimates that for an average thinking length of 96,000 tokens, LongCoT-RL would cost approximately 27 H100-months of training, whereas Delethink would cost only 7 H100-months, roughly a fourfold reduction in training compute.
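To see the linear-versus-quadratic contrast in numbers, here is a minimal back-of-the-envelope sketch. It counts only attention token pairs, ignores the query length and all non-attention compute, and assumes a carryover of half the chunk, so it is not expected to reproduce the paper's H100-month estimates. The function names and the chunk and carryover values are assumptions for illustration.

```python
def pairs(context_len: int) -> int:
    """Attention token pairs when a context of this length is processed from scratch."""
    return context_len * (context_len + 1) // 2


def longcot_cost(n: int) -> int:
    """One ever-growing context holding all n reasoning tokens."""
    return pairs(n)


def delethink_cost(n: int, chunk: int, carry: int) -> int:
    """n reasoning tokens split into chunks whose context never exceeds `chunk`.

    Assumption: the first chunk yields `chunk` new tokens; each later chunk
    re-encodes `carry` carried-over tokens and yields `chunk - carry` new ones.
    """
    if not 0 <= carry < chunk:
        raise ValueError("need 0 <= carry < chunk")
    if n <= chunk:
        return pairs(n)
    step = chunk - carry
    full_chunks, leftover = divmod(n - chunk, step)
    cost = pairs(chunk) * (1 + full_chunks)
    if leftover:
        cost += pairs(carry + leftover)  # final, partially filled chunk
    return cost


if __name__ == "__main__":
    for n in (96_000, 192_000):
        print(n, longcot_cost(n), delethink_cost(n, chunk=8_000, carry=4_000))
```

Doubling the thinking length roughly quadruples the LongCoT count but only about doubles the chunked count, which is the scaling behavior the article describes.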

Empirical results demonstrate that Delethink is highly effective. An R1-Distill 1.5B model trained with Delethink, reasoning in 8K-token chunks, can think up to 24K tokens, matching or even surpassing LongCoT-RL models trained with the same 24K budget on math benchmarks. Furthermore, Delethink shows superior ‘test-time scaling,’ meaning it continues to improve performance when allowed to think beyond its training-time limits, while LongCoT-RL methods tend to plateau.

Why Delethink Works So Well

A key insight from the research is that many off-the-shelf reasoning LLMs, even without explicit training for Markovian Thinking, already exhibit a latent ability to generate ‘Markovian traces’ zero-shot. This means they can naturally produce reasoning sequences that can be effectively chunked and continued with a limited state. This strong initial capability provides a favorable starting point for RL training, making Delethink highly effective at scale.

The researchers also tested Delethink’s compatibility with larger, state-of-the-art models like GPT-OSS 120B and Qwen3 30B-A3B. These models also demonstrated robust Markovian Thinking capabilities zero-shot across diverse tasks, including PhD-level questions, coding, and math competitions, signaling that Delethink can scale with the most advanced LLMs.

Implications for Future LLMs

The success of Markovian Thinking, as demonstrated by Delethink, highlights the RL environment itself as a powerful lever for progress in LLM development. By decoupling thinking length from context size, it opens a path toward efficient, scalable reasoning LLMs that could potentially think for millions of tokens. This paradigm shift could also make non-quadratic sequence architectures (like state-space models or sparse attention mechanisms) particularly beneficial for reasoning models, as they align well with the constant-memory, linear-compute nature of Markovian Thinking.

This research suggests that the way LLMs process and retain information during reasoning can be fundamentally redesigned to overcome current computational barriers, paving the way for more capable and efficient AI systems. You can read the full research paper here.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
