TLDR: New research challenges the notion that Large Language Models (LLMs) are not “abstract reasoners.” While LLMs perform poorly in zero-shot settings on complex reasoning tasks, fine-tuning only their input embedding layers or visual encoders dramatically improves performance. This suggests that LLMs possess transferable reasoning capabilities, and their apparent lack of abstract reasoning in zero-shot tests is often due to input formatting rather than a fundamental limitation. The paper prompts a re-evaluation of what it means to be an “abstract reasoner” and why this distinction matters for AI development.
The capabilities of large language models (LLMs) continue to astound, yet a persistent question lingers: are they truly “abstract reasoners”? This debate is crucial because abstract reasoning is often considered a hallmark of general intelligence, and how we answer this question influences the future direction of AI development.
Recent studies have suggested that LLMs fall short in this area, pointing to their poor performance when tested “out-of-the-box” on complex reasoning tasks. These tasks often require models to infer and generalize patterns from a limited number of observations, similar to how humans might solve novel puzzles. The initial findings indicated that LLMs struggled significantly, often performing no better than random chance on these challenging benchmarks.
However, new research from Tian Yun, Chen Sun, and Ellie Pavlick at Brown University revisits these claims, adding a crucial layer of nuance. Their paper, titled “What is an ‘Abstract Reasoner’? Revisiting Experiments and Arguments about Large Language Models,” acknowledges and replicates the earlier findings: indeed, frozen, pre-trained LLMs perform poorly in a zero-shot setting. But their additional experiments reveal a surprising twist.
The Power of Input Adaptation
The researchers found that even a small amount of adaptation can dramatically change an LLM’s performance. Specifically, by fine-tuning only the input embedding layer – the part of the model that processes and encodes incoming information – LLMs achieved near-perfect performance on many of these abstract reasoning tasks. This is akin to teaching a highly intelligent person a new language or a specific way to interpret instructions; their core intelligence remains, but adapting the input format unlocks their ability to solve the problem.
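To make the setup concrete, here is a minimal sketch of embedding-only fine-tuning in PyTorch with Hugging Face Transformers. This is not the authors’ released code; the model name and learning rate are illustrative placeholders, and the point is only to show that every parameter except the input embedding layer stays frozen.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative model choice; the paper's exact models and settings may differ.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze every parameter in the model...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the input embedding layer.
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True

# Only the embedding weights are handed to the optimizer, so the
# transformer blocks themselves are never updated during training.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # learning rate is an assumption
```

In this setup the model’s “reasoning machinery” is untouched; training only adjusts how task inputs are encoded before they reach it.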
This finding extends beyond text-based tasks. When applied to abstract visual reasoning problems, freezing the LLM’s core “transformer blocks” (its main processing units) and only training a visual encoder (which translates images into a format the LLM can understand) also led to significant performance improvements. This suggests that the LLM’s internal reasoning mechanisms are robust and transferable, provided the input data is presented in a compatible format.
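The visual variant follows the same pattern. Below is a rough sketch, assuming a simple trainable MLP projection as the visual encoder that maps image features into the frozen LLM’s embedding space; the actual encoder architecture, feature dimensions, and class names here are assumptions for illustration, not the paper’s implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class FrozenLLMWithVisualEncoder(nn.Module):
    """Trainable visual encoder feeding a frozen LLM (illustrative sketch)."""

    def __init__(self, llm_name: str = "gpt2", image_feature_dim: int = 768):
        super().__init__()
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        for param in self.llm.parameters():  # keep all transformer blocks frozen
            param.requires_grad = False

        hidden = self.llm.get_input_embeddings().embedding_dim
        # Trainable projection from image features into the LLM's token space.
        self.visual_encoder = nn.Sequential(
            nn.Linear(image_feature_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
        )

    def forward(self, image_features, input_ids):
        # Encode images as "soft tokens" and prepend them to the text embeddings.
        visual_tokens = self.visual_encoder(image_features)       # (B, N, hidden)
        text_embeds = self.llm.get_input_embeddings()(input_ids)  # (B, T, hidden)
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

Only `visual_encoder` receives gradient updates, mirroring the text case: the frozen LLM does the reasoning, while the trainable front end learns to present images in a format the LLM can work with.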
Also Read:
- Assessing LLM Capabilities in Answer Set Programming: A New Benchmark Reveals Core Challenges
- Decoding Chain-of-Thought: Information Flow in Language Models
Redefining “Abstract Reasoner”
These empirical results invite a deeper, more philosophical discussion: what does it truly mean to be an “abstract reasoner,” and why does it matter if LLMs fit this description? If abstract reasoning is defined by the ability to perform tasks without any prior adaptation (zero-shot), then current LLMs might not qualify. However, if it includes the capacity to reason effectively once inputs are appropriately formatted, then the picture changes considerably.
The paper draws an analogy to older “Good Old-Fashioned AI” (GOFAI) systems, which were considered abstract reasoners but required data in specific formats. Just as a database system requires queries expressed in SQL, an LLM might need its inputs “tuned” to its internal representations. The authors also reference philosopher Daniel Dennett, who argued that intelligent systems, especially human cognition, often require adaptation to new environments to perform well, rather than operating perfectly out-of-the-box.
Ultimately, the researchers argue that the community needs to clarify its motivations. Do we seek to understand how human-like LLMs are, where adaptability is key? Or do we care more about practical technological progress, where efficient transfer to new tasks is paramount? The answer to “why we care” will shape how we define and evaluate abstract reasoning in AI. You can read the full research paper for more details at this link.


