TLDR: A new study investigates how Large Language Models (LLMs) retrieve information based on its temporal position, rather than just its meaning. Through experiments with repeated tokens and overlapping ‘episodes’, researchers found that both transformer and state-space models exhibit strong temporal biases, favoring information at the beginning or end of a prompt (primacy and recency effects). An ablation study in transformers linked these biases to ‘induction heads’, crucial components for sequential recall. The findings suggest that temporal biases are fundamental to LLM processing, impacting how they learn and retrieve context, and offer insights into the ‘lost in the middle’ phenomenon.
Large Language Models (LLMs) have shown an incredible ability to learn from the information provided directly within their input, a process known as in-context learning. While much attention has been paid to how these models understand meaning, a new study delves into a less explored but equally crucial aspect: how the timing and position of information within a prompt influence what an LLM remembers and retrieves.
This research, titled ‘Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models’, draws a parallel between LLMs and human episodic memory. Just as humans recall events based on when they happened, the study investigates whether LLMs can differentiate and retrieve information based on its temporal separation. The authors, Anooshka Bajaj, Deven Mahesh Mistry, Sahaj Singh Maini, Yash Aggarwal, and Zoran Tiganj from Indiana University Bloomington, designed experiments to isolate these temporal effects, removing semantic distractions to get a clearer picture.
Unpacking Temporal Positional Biases
The first experiment aimed to understand the inherent temporal biases in LLM retrieval, independent of any meaning. Researchers created prompts where a specific token (like ‘A’) was repeated multiple times, separated by sequences of random, unique tokens. A final instance of the fixed token acted as a probe, and the models were tasked with predicting the next token. By shuffling the random tokens, the team ensured that any observed patterns were due to temporal position, not semantic content.
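To make the setup concrete, here is a minimal sketch of how such a prompt could be assembled. It is not the authors' code: the function name, the gap length, and the toy vocabulary are assumptions chosen for illustration, and a real experiment would work with tokenizer IDs rather than strings.

```python
import random

def build_repeated_token_prompt(vocab, fixed_token="A", n_repeats=5, gap=3, seed=0):
    """Build a prompt in which fixed_token repeats, separated by unique fillers.

    Returns (prompt, followers), where followers[i] is the token that comes
    right after the i-th occurrence of fixed_token. All names and defaults
    here are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    pool = [tok for tok in vocab if tok != fixed_token]
    needed = n_repeats * gap
    if needed > len(pool):
        raise ValueError("vocabulary too small for unique filler tokens")
    fillers = rng.sample(pool, needed)  # sampled without replacement, so all unique

    prompt, followers = [], []
    for i in range(n_repeats):
        segment = fillers[i * gap:(i + 1) * gap]
        prompt.append(fixed_token)
        prompt.extend(segment)
        followers.append(segment[0])
    prompt.append(fixed_token)  # final occurrence acts as the probe
    return prompt, followers

toy_vocab = [f"t{i}" for i in range(50)]
prompt, followers = build_repeated_token_prompt(toy_vocab, n_repeats=4, gap=3, seed=1)
print(" ".join(prompt))
print("candidate next tokens, in prompt order:", followers)
```

Changing the seed reshuffles the filler tokens while keeping the positions of the fixed token identical, which is what allows position effects to be averaged over many random prompts.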
The findings were striking: all seven tested models, including both transformer-based (like Llama, Mistral, Qwen, Gemma) and state-space models (like Mamba, Falcon-Mamba, Recurrent-Gemma), consistently showed a preference for predicting the token that immediately followed an earlier occurrence of the repeated token. This indicates a tendency for ‘serial recall’: remembering sequences in the order they were presented. More importantly, the strength of this recall varied significantly with the token’s position in the prompt. Models often showed a bias for information presented at the very beginning (primacy effect) or the very end (recency effect) of the input, leaving content in the middle under-retrieved, a pattern often described as the ‘lost in the middle’ phenomenon. Different models exhibited distinct biases; for instance, Mistral leaned towards recency, while Falcon-Mamba showed a primacy bias.
Testing Episodic Retrieval with Interference
The second experiment pushed the models further, evaluating their ability to retrieve specific temporal sequences, or ‘episodes’, when presented alongside other similar, partially overlapping sequences. Prompts contained five distinct episodes, each with a unique context token, followed by the same fixed token, and then a unique target token (e.g., ‘BAH’, ‘CAF’, ‘XAM’). The models were then probed with a context and fixed token pair (e.g., ‘XA’) and had to predict the correct target token (‘M’).
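The episode design described above can be sketched in a few lines. This is an illustrative reconstruction, not the study's implementation: the function name is hypothetical, single uppercase letters stand in for real tokens, and the episodes are assumed to be concatenated directly with nothing between them.

```python
import random
import string

def build_episode_prompt(n_episodes=5, fixed_token="A", probe_index=0, seed=0):
    """Build overlapping (context, fixed, target) episodes plus a probe.

    Returns (prompt, expected_target, episodes). Letters stand in for tokens,
    and placing episodes back to back is an assumption of this sketch.
    """
    rng = random.Random(seed)
    letters = [c for c in string.ascii_uppercase if c != fixed_token]
    chosen = rng.sample(letters, 2 * n_episodes)  # unique contexts and targets
    contexts, targets = chosen[:n_episodes], chosen[n_episodes:]
    episodes = [(c, fixed_token, t) for c, t in zip(contexts, targets)]

    prompt = [tok for episode in episodes for tok in episode]
    probe_context, _, expected_target = episodes[probe_index]
    prompt += [probe_context, fixed_token]  # probe: context + fixed token
    return prompt, expected_target, episodes

prompt, expected, episodes = build_episode_prompt(probe_index=2, seed=3)
print("prompt:  ", "".join(prompt))
print("expected:", expected)
```

Because every episode shares the same fixed token, the final token alone is ambiguous; only the context token just before it identifies which episode is being probed.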
Most models successfully retrieved the correct target token, demonstrating a capacity for temporal separation. However, this retrieval wasn’t perfect. Smaller peaks corresponding to non-probed episodes were often visible, indicating interference from competing memories. Retrieval was generally strongest for episodes located nearer the end of the prompt, reinforcing the recency bias observed in the first experiment. Mamba and Falcon-Mamba models, in particular, showed less robust retrieval, especially for episodes closer to the end.
The Role of Induction Heads in Transformers
To understand the underlying mechanisms in transformer models, an ablation study was conducted. Researchers focused on ‘induction heads’, specific components within transformer architectures known to be crucial for in-context learning and temporal processing. These heads essentially find previous occurrences of a token and attend to the token that followed it, learning and reproducing sequences based on temporal association.
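The lookup that an induction head is described as performing can be written out as an idealized rule. The sketch below is a toy model of that behavior, not an attention implementation; the function name is hypothetical, and real heads produce soft, position-dependent weights rather than the uniform weighting assumed here.

```python
from collections import Counter

def induction_predictions(tokens):
    """Idealized induction lookup for the last token in a sequence.

    Finds every earlier occurrence of the last token and collects the token
    that followed it, weighting all occurrences equally (an assumption; a
    perfectly position-unbiased head would behave this way).
    """
    query = tokens[-1]
    followers = [
        tokens[i + 1]
        for i in range(len(tokens) - 1)
        if tokens[i] == query
    ]
    if not followers:
        return {}  # the query token never appeared earlier
    counts = Counter(followers)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

print(induction_predictions(["A", "x", "y", "A", "z", "A"]))
# → {'x': 0.5, 'z': 0.5}
```

Under this unbiased rule every follower is equally likely, so any systematic preference for early or late followers in a real model reflects a temporal bias of the kind the study measures.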
By progressively disabling the highest-ranked induction heads, the study found a significant degradation in the models’ ability to perform serial recall and selectively retrieve the correct episode amidst interference. Ablating the same number of randomly selected heads had a much weaker impact, confirming the critical role of induction heads in these temporal processing behaviors. This suggests that these heads are key to how transformers manage and separate temporal context.
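The logic of this comparison, top induction heads versus a random control set of the same size, can be sketched as follows. Everything here is an assumption made for illustration: the helper names, the per-head scores, the choice to draw control heads only from outside the top set, and the use of zero-ablation (the article does not say how heads were disabled).

```python
import random

def ablation_sets(induction_scores, k, seed=0):
    """Pick the k highest-scoring heads and a size-matched random control set.

    induction_scores maps (layer, head) -> score. Control heads are sampled
    from the remaining heads only, which is an assumption of this sketch.
    """
    ranked = sorted(induction_scores, key=induction_scores.get, reverse=True)
    top, rest = ranked[:k], ranked[k:]
    if k > len(rest):
        raise ValueError("not enough remaining heads for a matched control set")
    control = random.Random(seed).sample(rest, k)
    return top, control

def apply_head_mask(head_outputs, ablated):
    """Zero the output vector of every ablated head (zero-ablation assumed)."""
    ablated = set(ablated)
    return {
        head: ([0.0] * len(vec) if head in ablated else list(vec))
        for head, vec in head_outputs.items()
    }

# Hypothetical scores for a toy model with 2 layers x 2 heads.
scores = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.8, (1, 1): 0.2}
outputs = {head: [1.0, 2.0] for head in scores}
top, control = ablation_sets(scores, k=2)
print("induction ablation:", apply_head_mask(outputs, top))
print("random control:    ", apply_head_mask(outputs, control))
```

In a real model the masking would be applied inside the forward pass, and raising k step by step would trace how recall degrades as more heads are removed.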
Broader Implications
This research deepens our understanding of how LLMs process and retrieve information based on its temporal structure. The consistent temporal biases, including primacy and recency effects, suggest that these are fundamental properties of sequential processing in LLMs, not just artifacts of semantic content. Interestingly, state-space models, despite their different architecture, exhibited comparable temporal biases, hinting that these limitations might arise from more fundamental aspects of how context history is maintained and accessed over time.
For the development of future LLMs, these findings highlight that addressing the ‘lost in the middle’ problem requires tackling these fundamental temporal processing limitations. Simple architectural changes alone might not be sufficient. From a cognitive science perspective, this methodology offers a controlled way to compare how different computational architectures handle temporal context and interference, providing insights into memory-like phenomena in artificial intelligence.


