TLDR: A new research paper introduces the 1dCA-Reasoning benchmark to study multi-step reasoning in AI models without memorization. It evaluates various architectures (Transformers, LSTMs, Mamba, ARMT) and depth-extension methods (ACT, GRPO, Chain-of-Thought). Findings show that fixed-depth models struggle beyond one-step reasoning, but techniques like Adaptive Computation Time (ACT) and Reinforcement Learning (GRPO) can extend reasoning to two and three steps, respectively. Chain-of-Thought (CoT) with explicit supervision achieves near-perfect four-step prediction, highlighting the importance of adaptive depth and intermediate representations for genuine generalization in AI.
Large Language Models (LLMs) have shown impressive abilities in problem solving and reasoning, even achieving top ranks in international competitions. However, their capacity for multi-step reasoning remains a significant open question, and it is often unclear whether LLMs truly generalize or simply memorize patterns. A recent study digs into this issue, exploring how different AI architectures and training methods influence a model’s ability to perform complex, multi-step reasoning.
Unpacking AI Reasoning: The Cellular Automata Benchmark
To rigorously test reasoning without relying on memorization, researchers developed a unique benchmark based on one-dimensional Cellular Automata (1dCA). Imagine a simple, digital universe where each cell’s state changes based on a local rule. Given a sequence of states (an ‘orbit’), the AI model’s task is to first figure out this hidden rule and then apply it repeatedly to predict future states. Crucially, the rules used during training are never repeated in testing, forcing the model to genuinely infer and apply rules rather than just recall them.
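To make the setup concrete, here is a minimal sketch of an elementary one-dimensional cellular automaton: the rule is an 8-entry lookup table over each cell’s three-cell neighborhood, and an orbit is the sequence of states produced by applying the rule repeatedly. The neighborhood width, wrap-around boundary, and the Rule 110 example are illustrative assumptions; the benchmark draws its rules from a larger space and, as noted above, never reuses training rules at test time.

```python
# Minimal 1dCA sketch: a rule maps each cell's 3-cell neighborhood to a new
# value; an orbit is the sequence of states from repeated application.
# Neighborhood width, periodic boundary, and names are illustrative assumptions.
def ca_step(state, rule_table):
    n = len(state)
    return [
        rule_table[(state[(i - 1) % n] << 2) | (state[i] << 1) | state[(i + 1) % n]]
        for i in range(n)
    ]

def generate_orbit(initial_state, rule_table, steps):
    orbit = [initial_state]
    for _ in range(steps):
        orbit.append(ca_step(orbit[-1], rule_table))
    return orbit

# Example: Rule 110 as an 8-entry table indexed by the neighborhood
# (left, center, right) read as a 3-bit number.
rule_110 = [0, 1, 1, 1, 0, 1, 1, 0]
orbit = generate_orbit([0, 0, 0, 1, 0, 0, 0, 0], rule_110, steps=6)
```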
The benchmark includes several task variations, illustrated with a short data-layout sketch after this list:
- Orbit-State (O-S): Predict a single future state, ‘k’ steps ahead.
- Orbit-Orbit (O-O): Predict a sequence of future states, simulating step-by-step reasoning.
- Orbit-State and Rule (O-RS): Predict both the future state and the underlying rule.
- Rule and Orbit-State (RO-S): The rule is provided, so the model only needs to learn to apply it.
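As a rough illustration of how such training examples could be laid out, the sketch below slices one orbit into the context a model sees and the targets for the O-S and O-O variants. The split point, lookahead handling, and names are assumptions for illustration, not the paper’s exact data format.

```python
# Hypothetical layout of O-S and O-O examples from one orbit (a list of states
# produced by the hidden rule); names and split points are illustrative only.
def make_examples(orbit, context_len, k):
    context = orbit[:context_len]                     # states shown to the model
    o_s_target = orbit[context_len + k - 1]           # O-S: one state, k steps ahead
    o_o_target = orbit[context_len:context_len + k]   # O-O: the next k states
    return context, o_s_target, o_o_target
```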
Architectural Strengths and Limitations
The study evaluated several popular AI architectures, including Transformers (like GPT-NeoX), Long Short-Term Memory (LSTM) networks, State-Space Models (Mamba), and the Associative Recurrent Memory Transformer (ARMT). Initial findings showed that most models could predict the very next state (k=1) with high accuracy. However, their performance dropped sharply when asked to predict two or more steps ahead (k≥2), even with four layers, which should theoretically allow for some sequential computation.
Interestingly, increasing the number of layers in a Transformer did improve performance for up to three-step predictions, but gains plateaued, and four-step predictions remained challenging. Simply making the model ‘wider’ (increasing embedding dimensions) offered only marginal improvements, highlighting that depth, not width, is key for multi-step reasoning.
Extending Reasoning Depth: New Approaches
Since simply adding more layers has its limits, the researchers investigated methods to extend a model’s ‘effective depth’ during inference:
- Recurrence (ARMT): The Associative Recurrent Memory Transformer (ARMT) showed an advantage, extending generalization to two look-ahead steps. This suggests that processing sequences in segments and maintaining a recurrent memory helps.
- Adaptive Computation Time (ACT): This mechanism lets a model dynamically allocate a variable number of computation steps for each input. When applied to GPT-NeoX, ACT provided roughly one additional effective reasoning step, improving two-step predictions (a minimal halting-loop sketch follows this list).
- Reinforcement Learning (GRPO): Training models with Reinforcement Learning using Group Relative Policy Optimization (GRPO) enabled them to reach three-step predictions without explicit supervision of intermediate reasoning steps; the model learned to ‘think’ before giving a final answer (the group-relative advantage idea is sketched after this list).
- Chain-of-Thought (CoT): When models were trained with explicit, step-by-step supervision, similar to how Chain-of-Thought prompting works, they achieved near-perfect accuracy for up to four-step predictions. This approach essentially turns multi-step reasoning into an autoregressive generation task (see the target-construction sketch after this list).
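For intuition, here is a minimal sketch of an ACT-style halting loop in the spirit of Graves (2016): a learned halting unit decides, per example, how many extra computation steps to spend, and the output is the halting-weighted mix of the intermediate states. The wrapper, step function, threshold, and the omission of the ponder-cost penalty are simplifying assumptions, not the paper’s exact implementation.

```python
# Simplified ACT-style halting loop (after Graves, 2016); this sketch omits
# the ponder-cost penalty and per-token bookkeeping of the full mechanism.
import torch
import torch.nn as nn

class ACTWrapper(nn.Module):
    def __init__(self, step_fn, hidden_dim, max_steps=8, threshold=0.99):
        super().__init__()
        self.step_fn = step_fn               # one extra "ponder" step: h -> h
        self.halt = nn.Linear(hidden_dim, 1)
        self.max_steps = max_steps
        self.threshold = threshold

    def forward(self, h):
        cum_halt = torch.zeros(h.size(0), 1)   # accumulated halting probability
        remainder = torch.ones(h.size(0), 1)   # probability mass not yet spent
        output = torch.zeros_like(h)
        for _ in range(self.max_steps):
            h = self.step_fn(h)
            p = torch.sigmoid(self.halt(h))
            running = (cum_halt + p < self.threshold).float()
            # running examples contribute p; halting ones spend their remainder
            weight = p * running + remainder * (1.0 - running)
            output = output + weight * h
            cum_halt = cum_halt + p * running
            remainder = remainder - p * running
            if running.sum() == 0:
                break
        return output

# Usage: wrap any hidden-state transformation, e.g. a small MLP block.
block = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
act = ACTWrapper(block, hidden_dim=64)
out = act(torch.randn(4, 64))
```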
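The core of GRPO can be illustrated by how it scores a group of sampled answers to the same prompt: each completion’s reward is normalized against the group’s mean and standard deviation, and this relative advantage weights the policy update. The sketch below shows only that advantage computation, assuming a 0/1 correctness reward; the actual policy-gradient and KL terms are omitted.

```python
# Sketch of GRPO's group-relative advantage: sample several answers per prompt,
# score them, and normalize each reward within its group (0/1 reward assumed).
import numpy as np

def grpo_advantages(group_rewards):
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g. 8 sampled completions for one orbit, rewarded 1 if the predicted
# state k steps ahead is exactly correct, else 0.
print(grpo_advantages([1, 0, 0, 1, 0, 0, 0, 1]))
```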
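Conceptually, the CoT setup changes only the supervision target: rather than predicting the state k steps ahead in one shot, the model is trained to generate every intermediate state in order. A minimal sketch of building such a target, assuming a step function like the `ca_step` from the earlier cellular-automaton example:

```python
# Build a chain-of-thought target: the sequence of intermediate states up to
# step k (step_fn could be the ca_step from the earlier 1dCA sketch).
def make_cot_target(last_state, rule_table, k, step_fn):
    states, current = [], last_state
    for _ in range(k):
        current = step_fn(current, rule_table)
        states.append(current)
    return states  # intermediate states, ending with the k-step answer
```

Because each target state depends only on the one before it, the model never has to compose more than one rule application per generated state, which is why this formulation scales to deeper lookaheads.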
Broader Implications for AI Development
These findings have significant implications for how we design and train future AI systems. The research suggests that for complex, multi-step reasoning tasks:
- Relying solely on prompt engineering is unlikely to be sufficient.
- Mechanisms that allow for adaptive computation, like ACT, are promising for efficiently handling varying computational demands.
- Explicit intermediate representations, as seen in Chain-of-Thought, remain the most reliable way to achieve deep generalization.
The study emphasizes that how we train AI models can be as crucial as the models themselves. Objectives that encourage multi-step prediction and mechanisms that adaptively allocate computational depth are decisive for building more capable and genuinely reasoning AI systems. For more details, you can read the full research paper: Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling.


