TLDR: This research evaluates in-context learning (ICL) across transformer, state-space (Mamba, Mamba2), and hybrid (Hymba, Zamba2) language models. It finds that although these architectures can perform similarly, their internal mechanisms for ICL differ. Function Vectors (FVs), which are key to ICL, are found mainly in self-attention and Mamba layers and matter more for parametric knowledge retrieval tasks than for contextual knowledge understanding. Mamba2 appears to use a different ICL mechanism than FVs, and hybrid models rely primarily on their self-attention components for ICL.
In-context learning (ICL) is a remarkable ability of large language models (LLMs) to learn new tasks from a few examples provided directly in the prompt, without needing any updates to their core parameters. This capability has primarily been studied in Transformer-based architectures, which have dominated the LLM landscape. However, a recent research paper, titled “Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures,” delves into how ICL functions in newer model types: State Space Models (SSMs) and hybrid architectures that combine elements of both Transformers and SSMs.
Exploring Diverse Architectures
The researchers, Shenran Wang, Timothy Tin-Long Tse, and Jian Zhu from The University of British Columbia, conducted an in-depth evaluation of ICL across a range of state-of-the-art models. This included established Transformer LLMs like GEMMA-3-1B-PT, LLAMA-3.2-1B, and QWEN2.5-1.5B. For State Space Models, they examined MAMBA-1.4B and MAMBA2-1.3B. Additionally, they investigated hybrid models such as HYMBA-1.5B-BASE and ZAMBA2-1.2B, which represent different ways of integrating self-attention and Mamba components. All models were chosen to be of similar size, roughly 1 to 1.5 billion parameters, for a fair comparison.
Two Types of Knowledge-Based Tasks
To understand ICL comprehensively, the study categorized tasks into two main types, a distinction not always made in prior research:
- Parametric Knowledge Retrieval: These tasks involve retrieving factual information stored within the model’s parameters, such as identifying country capitals or antonyms. These tasks were a primary focus in earlier research on Function Vectors (FVs).
- Contextual Knowledge Understanding: These tasks require the model to interpret information provided within a given paragraph to answer questions, like classifying hate speech or performing sentiment analysis. The relationships in these tasks tend to be less direct and more nuanced.
Behavioral Insights: How Models Perform
The initial experiments focused on how these different models behave under various ICL conditions. In parametric knowledge retrieval tasks, all models demonstrated effective in-context learning. However, in contextual knowledge understanding tasks, a notable difference emerged. Transformer models, HYMBA-1.5B-BASE, and MAMBA-1.4B-HF showed significant performance improvements when given correct demonstrations. In contrast, MAMBA2-1.3B-HF and ZAMBA2-1.2B exhibited only marginal gains, suggesting weaker ICL performance for these types of tasks under regular settings.
Interestingly, when presented with “label-flipped” demonstrations (where correct answers were consistently mapped to incorrect options), all models, including MAMBA2-1.3B-HF and ZAMBA2-1.2B, were able to learn these new, counterfactual associations. This indicates that even models with weaker ICL for contextual understanding can still pick up unseen relationships from context, though Transformer-based models generally achieved greater performance increments.
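To make the two demonstration conditions concrete, here is a minimal Python sketch of how regular and label-flipped few-shot prompts might be assembled. The prompt template, the example sentences, and the helper names `flip_labels` and `build_prompt` are illustrative assumptions, not the format used in the paper.

```python
def flip_labels(demos, labels):
    """Return demos with every label consistently remapped to a wrong one.

    Assumption: the flip is a fixed rotation over the label set, so the
    mapping is consistent across demonstrations but always incorrect.
    """
    mapping = {lab: labels[(i + 1) % len(labels)] for i, lab in enumerate(labels)}
    return [(text, mapping[lab]) for text, lab in demos]


def build_prompt(demos, query):
    """Join demonstrations and the unanswered query into one few-shot prompt.

    The 'Input:/Label:' template is hypothetical, chosen only for illustration.
    """
    blocks = [f"Input: {text}\nLabel: {lab}" for text, lab in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


labels = ["positive", "negative"]
demos = [("I loved this film.", "positive"), ("The plot was dull.", "negative")]

regular_prompt = build_prompt(demos, "A wonderful surprise.")
flipped = flip_labels(demos, labels)
flipped_prompt = build_prompt(flipped, "A wonderful surprise.")
```

A model that follows the flipped mapping would be expected to answer "negative" for the final query, which is how learning a counterfactual association from context can be detected.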
Mechanistic Analysis: Uncovering Internal Mechanisms
Beyond observing behavior, the researchers employed mechanistic interpretability techniques to understand *how* ICL happens internally. They focused on “Function Vectors” (FVs): compact task representations built from the outputs of a small set of attention heads that drive ICL in Transformers. The study extended this analysis to SSMs and hybrid models, treating SSM heads as analogous to attention heads.
Key Discoveries from Internal Probing:
- Location of FVs: Function vectors responsible for ICL were primarily found in the self-attention and Mamba layers. For hybrid models like HYMBA-1.5B-BASE, these FVs were much more concentrated in the self-attention layers, especially for parametric knowledge retrieval tasks.
- Task-Specific FVs: The study found that the specific FV heads activated for parametric knowledge retrieval tasks were highly consistent and concentrated, but this was not the case for contextual knowledge understanding tasks. This suggests that different sets of FVs are involved depending on the task type.
- Mamba2’s Unique Mechanism: While FVs contribute significantly to ICL in Transformers, Mamba, and hybrid models, they appear to be less crucial for Mamba2. Intervention experiments (steering and ablating FVs) showed that Mamba2’s performance was not as significantly influenced by FV manipulation, leading the researchers to speculate that Mamba2 might employ a different internal mechanism for ICL.
- Self-Attention Dominance in Hybrid Models: In hybrid models, the ICL capabilities are predominantly driven by FVs located in their self-attention layers. Even when Mamba streams were steered, the self-attention stream remained the primary contributor to performance.
- Layer-wise Importance: Steering FVs in the middle or later layers of models generally led to improved ICL performance, particularly in self-attention layers of hybrid models.
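The intervention experiments mentioned above (steering and ablating FVs) can be pictured with a small pure-Python sketch. It assumes the common formulation in which a function vector is the sum of the mean outputs of a selected set of heads, steering adds that vector to a hidden state, and ablation zeroes the selected heads' outputs. The head identifiers, the toy numbers, and the zero-ablation choice are assumptions for illustration; the paper's exact procedure may differ.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def function_vector(head_outputs, top_heads):
    """Sum the mean output of each selected head.

    head_outputs: {head_id: [one output vector per ICL prompt]}
    top_heads: ids of heads judged most important for the task (assumed given).
    """
    means = [mean_vector(head_outputs[h]) for h in top_heads]
    return [sum(col) for col in zip(*means)]


def steer(hidden, fv, scale=1.0):
    """Steering: add the (scaled) function vector to a hidden state."""
    return [h + scale * f for h, f in zip(hidden, fv)]


def ablate(outputs_at_position, heads_to_ablate):
    """Ablation (assumed here to be zero-ablation): blank out selected heads."""
    return {
        h: ([0.0] * len(v) if h in heads_to_ablate else list(v))
        for h, v in outputs_at_position.items()
    }


# Toy data: three hypothetical heads, two prompts, 2-dimensional outputs.
head_outputs = {
    "L3H1": [[1.0, 0.0], [3.0, 2.0]],
    "L5H0": [[0.0, 1.0], [0.0, 3.0]],
    "L7H2": [[9.0, 9.0], [9.0, 9.0]],
}
fv = function_vector(head_outputs, ["L3H1", "L5H0"])
steered = steer([0.5, -1.0], fv)
ablated = ablate({"L3H1": [2.0, 1.0], "L7H2": [9.0, 9.0]}, {"L3H1"})
```

In a real model the hidden state and head outputs would come from forward-pass hooks, and the effect of each intervention would be measured as a change in task accuracy.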
A Nuanced Understanding of ICL
This research extends our understanding of in-context learning beyond the traditional focus on Transformers. It shows that although different LLM architectures can achieve similar task performance, their internal mechanisms for ICL can vary considerably. The findings underscore the importance of Function Vectors, particularly in self-attention layers, for parametric knowledge retrieval. For contextual knowledge understanding, however, FVs play a less dominant role, and Mamba2 seems to rely on alternative mechanisms. The work also illustrates the value of combining behavioral observations with mechanistic analyses when studying LLM capabilities. For more details, see the full paper.


