TLDR: A new research paper formalizes how large language models predict the next token as a smooth trajectory on a probability simplex, converging to a softmax equilibrium. It demonstrates that the temperature parameter acts as an exact time-rescaling of this trajectory, while sampling methods like top-k and nucleus sampling restrict the flow to a subset of tokens. The paper also outlines how path-dependent score adjustments can lead to ‘hallucination’-like behavior, offering a rigorous framework for understanding LLM output dynamics.
Large language models (LLMs) generate human-like text, translate between languages, and answer complex questions. At their core, these models predict the next word or ‘token’ in a sequence by scoring every entry in a large vocabulary and then normalizing those scores with the softmax function. That description is operationally correct, but practitioners often add an informal intuition: that the model ‘traverses a manifold’ during decoding. A new research paper by Christopher R. Lee-Jenkins takes up this idea and turns it from a metaphor into a precisely stated and proven theorem.
The paper, titled ‘Manifold Trajectories in Next-Token Prediction: From Replicator Dynamics to Softmax Equilibrium,’ offers a minimal and self-contained account of the decoding step as a constrained variational principle on the probability simplex. Imagine the probability simplex as a geometric shape where each point represents a possible distribution of probabilities over all possible next tokens. The paper demonstrates that the next-token distribution follows a smooth, continuous path within this simplex, eventually settling into what’s called the softmax equilibrium.
The Dynamics of Prediction
The core of this research lies in treating the decoding step as a dynamical system. The paper shows that the discrete, normalization-respecting update of token probabilities corresponds to a classical method known as the multiplicative-weights update. In its continuous-time limit, this discrete update becomes a well-known object from evolutionary biology and game theory: the replicator flow. The replicator flow dictates how the probabilities of different tokens evolve over time while always remaining a valid probability distribution.
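For readers who want the symbols, the relationship between the two updates can be written in its textbook form. This is a sketch rather than the paper's own notation: the payoff vector g(p), the step size \eta, and the iteration index k are placeholder symbols introduced here for illustration.

```latex
% Multiplicative-weights step with step size \eta and payoff vector g(p):
p_i^{(k+1)} = \frac{p_i^{(k)} \exp\!\big(\eta\, g_i(p^{(k)})\big)}
                   {\sum_j p_j^{(k)} \exp\!\big(\eta\, g_j(p^{(k)})\big)}

% Expanding to first order in \eta and letting \eta \to 0 with t = k\eta
% gives the replicator equation:
\dot{p}_i = p_i \Big( g_i(p) - \sum_j p_j\, g_j(p) \Big)

% The right-hand sides sum to zero over i, so \sum_i p_i = 1 is preserved.
```

The division by the normalizing sum in the discrete step is what keeps each iterate a probability distribution; in the continuous limit the same role is played by subtracting the average payoff.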
From these ingredients, the paper proves its ‘manifold-traversal theorem.’ The theorem states that, for a fixed context (the text generated so far) and a fixed temperature setting, the next-token distribution follows a smooth trajectory inside the probability simplex. This trajectory converges to the softmax equilibrium, the stationary point of the flow, which is the same distribution that softmax normalization of the scores produces.
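The convergence described above can be checked numerically with a small sketch. It rests on an assumption that is not confirmed against the paper: that the quantity being maximized is the expected score plus temperature-weighted entropy, which is a standard formulation whose maximizer is softmax(scores / temperature). The function names (`softmax`, `mw_step`, `trajectory`), the example scores, and the step size `eta` are all hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    """Numerically stable softmax of scores / temperature."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mw_step(p, scores, temperature, eta):
    """One multiplicative-weights step for the ASSUMED objective
    <scores, p> + temperature * entropy(p).

    Its gradient is scores - temperature * log(p) up to a constant, so
    p_new is proportional to p**(1 - eta*temperature) * exp(eta*scores).
    """
    log_w = (1.0 - eta * temperature) * np.log(p) + eta * scores
    log_w -= log_w.max()          # shift for numerical stability
    w = np.exp(log_w)
    return w / w.sum()            # renormalize onto the simplex

def trajectory(scores, temperature=1.0, eta=0.1, steps=200):
    """Iterate mw_step from the uniform distribution and record the path."""
    scores = np.asarray(scores, dtype=float)
    p = np.full(scores.shape, 1.0 / scores.size)
    path = [p]
    for _ in range(steps):
        p = mw_step(p, scores, temperature, eta)
        path.append(p)
    return np.array(path)

scores = np.array([2.0, 1.0, 0.5, -1.0])   # hypothetical token scores
path = trajectory(scores, temperature=0.7)
target = softmax(scores, temperature=0.7)
gap = np.abs(path - target).max(axis=1)     # distance to softmax per step
print("initial gap:", gap[0], "final gap:", gap[-1])
```

Every row of `path` is a point on the simplex, and `gap` shrinks toward zero as the iterates approach the softmax distribution. The sketch needs `eta * temperature` to stay below 1 so that each step is a damped move toward the equilibrium.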
Practical Implications for LLM Behavior
The formalization of this dynamic process yields several precise and practical insights for how LLMs behave:
- Temperature as a Time Rescaler: The ‘temperature’ parameter, often used in LLMs to control the randomness of token generation, is shown to act as an exact rescaling of time along the same trajectory. A lower temperature means the distribution moves faster towards its equilibrium, making the model’s choices more deterministic and focused. Conversely, a higher temperature slows down this movement, leading to more diverse and less predictable outputs.
- Top-k and Nucleus Sampling: Popular decoding strategies like top-k and nucleus sampling, which restrict the model to choose from a subset of the most probable tokens, are explained as simply confining this dynamic flow to a ‘face’ of the probability simplex. The underlying dynamics and convergence guarantees remain identical, just within a smaller, constrained space.
- Path-Dependent Score Adjustments and Hallucination: The paper also touches upon how mild, path-dependent adjustments to token scores (e.g., through heuristics or implicit feedback) can introduce non-conservative dynamics. This can lead to phenomena like ‘loops’ or ‘brittle attractors’ in the probability trajectory. This offers a controlled language for understanding ‘hallucination’-like behavior in LLMs, where the model might get stuck in self-reinforcing, yet globally incoherent, cycles of generation.
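The top-k point in the list above can be illustrated with a short sketch. It assumes the same entropy-regularized multiplicative-weights step as a stand-in for the paper's dynamics, which is an assumption rather than the author's stated construction; `mw_step`, `top_k_start`, and the example scores are hypothetical names and values. Because the update multiplies each probability by a positive factor, tokens that start at zero stay at zero, so the iterates never leave the face spanned by the top-k tokens.

```python
import numpy as np

def mw_step(p, scores, temperature, eta):
    """Multiplicative-weights step (ASSUMED entropy-regularized form).

    p_new is proportional to p**(1 - eta*temperature) * exp(eta*scores).
    With eta*temperature < 1 the exponent is positive, so zeros stay zero.
    """
    w = np.power(p, 1.0 - eta * temperature) * np.exp(eta * (scores - scores.max()))
    return w / w.sum()

def top_k_start(scores, k):
    """Uniform distribution over the k highest-scoring tokens, zero elsewhere."""
    keep = np.argsort(scores)[-k:]
    p = np.zeros_like(scores)
    p[keep] = 1.0 / k
    return p

scores = np.array([2.0, 1.0, 0.5, -1.0, -3.0])   # hypothetical token scores
temperature, eta, k = 0.7, 0.1, 3

p = top_k_start(scores, k)
support = p > 0                                   # the face of the simplex
for _ in range(300):
    p = mw_step(p, scores, temperature, eta)

# Softmax computed over the kept tokens only, for comparison.
restricted = np.exp(scores[support] / temperature)
restricted /= restricted.sum()
print("final distribution:", p)
print("restricted softmax:", restricted)
```

In this sketch the excluded tokens keep probability exactly zero, and the remaining mass settles on the softmax computed over the kept tokens alone. Nucleus sampling would differ only in how the starting support is chosen (by cumulative probability rather than a fixed count).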
This research focuses exclusively on the output distribution of next tokens for a fixed context. It makes no claims about the internal representations of LLMs or their training dynamics, which are left for future work. Within that scope, the paper gives a rigorous dynamical framework for next-token prediction, complementing the usual operational description of scoring and normalizing with a theoretical account of how the output distribution is reached.


