Unpacking In-Context Learning: A Deep Dive into Non-Transformer AI Models

TLDR: This research evaluates in-context learning (ICL) across transformer, state-space (Mamba, Mamba2), and hybrid (Hymba, Zamba2) language models. It finds that although these architectures can perform similarly, their internal mechanisms for ICL differ. Function Vectors (FVs), which are key to ICL, are found mainly in self-attention and Mamba layers, and they matter more for parametric knowledge retrieval tasks than for contextual understanding. Mamba2 appears to rely on an ICL mechanism other than FVs, and hybrid models rely primarily on their self-attention components for ICL.

In-context learning (ICL) is a remarkable ability of large language models (LLMs) to learn new tasks from a few examples provided directly in the prompt, without needing any updates to their core parameters. This capability has primarily been studied in Transformer-based architectures, which have dominated the LLM landscape. However, a recent research paper, titled “Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures,” delves into how ICL functions in newer model types: State Space Models (SSMs) and hybrid architectures that combine elements of both Transformers and SSMs.

Exploring Diverse Architectures

The researchers, Shenran Wang, Timothy Tin-Long Tse, and Jian Zhu from The University of British Columbia, conducted an in-depth evaluation of ICL across a range of state-of-the-art models. This included established Transformer LLMs like GEMMA-3-1B-PT, LLAMA-3.2-1B, and QWEN2.5-1.5B. For State Space Models, they examined MAMBA-1.4B and MAMBA2-1.3B. Additionally, they investigated hybrid models such as HYMBA-1.5B-BASE and ZAMBA2-1.2B, which represent different ways of integrating self-attention and Mamba components. All models were chosen at a comparable scale, roughly 1 to 1.5 billion parameters, for a fair comparison.

Two Types of Knowledge-Based Tasks

To understand ICL comprehensively, the study categorized tasks into two main types, a distinction not always made in prior research:

Parametric Knowledge Retrieval: These tasks involve retrieving factual information stored within the model’s parameters, such as identifying country capitals or antonyms. These tasks were a primary focus in earlier research on Function Vectors (FVs).
Contextual Knowledge Understanding: These tasks require the model to interpret information provided within a given paragraph to answer questions, like classifying hate speech or performing sentiment analysis. The relationships in these tasks tend to be less direct and more nuanced.

Behavioral Insights: How Models Perform

The initial experiments focused on how these different models behave under various ICL conditions. In parametric knowledge retrieval tasks, all models demonstrated effective in-context learning. However, in contextual knowledge understanding tasks, a notable difference emerged. Transformer models, HYMBA-1.5B-BASE, and MAMBA-1.4B-HF showed significant performance improvements when given correct demonstrations. In contrast, MAMBA2-1.3B-HF and ZAMBA2-1.2B exhibited only marginal gains, suggesting weaker ICL performance for these types of tasks under regular settings.

Interestingly, when presented with “label-flipped” demonstrations (where correct answers were consistently mapped to incorrect options), all models, including MAMBA2-1.3B-HF and ZAMBA2-1.2B, were able to learn these new, counterfactual associations. This indicates that even models with weaker ICL for contextual understanding can still pick up unseen relationships from context, though Transformer-based models generally showed larger performance gains.
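To make the regular and label-flipped settings concrete, here is a minimal sketch of how such prompts could be assembled. The prompt template, the sentiment labels, the example reviews, and the `build_prompt` helper are all illustrative assumptions, not the format used in the paper.

```python
def build_prompt(demos, query, flip=False, labels=("positive", "negative")):
    """Format few-shot demonstrations followed by an unanswered query.

    With flip=True every demonstration label is swapped for the other one,
    so the context consistently teaches a counterfactual mapping.
    (Hypothetical template; the paper's exact prompts are not reproduced here.)
    """
    swap = {labels[0]: labels[1], labels[1]: labels[0]}
    lines = []
    for text, label in demos:
        shown = swap[label] if flip else label
        lines.append(f"Review: {text}\nSentiment: {shown}\n")
    # The query is left open so the model has to complete the label.
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)


# Made-up demonstrations, for illustration only.
demos = [
    ("A joy from start to finish.", "positive"),
    ("Dull and far too long.", "negative"),
]
regular = build_prompt(demos, "I would watch it again.")
flipped = build_prompt(demos, "I would watch it again.", flip=True)
```

A model that follows the flipped prompt should complete the query with the swapped label, and that is the behavior the label-flipping experiments look for.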

Mechanistic Analysis: Uncovering Internal Mechanisms

Beyond observing behavior, the researchers employed mechanistic interpretability techniques to understand *how* ICL happens internally. They focused on identifying “Function Vectors” (FVs): compact vector representations of a task, computed from the outputs of a small set of attention heads that drive ICL in Transformers. The study extended this analysis to SSMs and hybrid models, treating SSM heads as analogous to attention heads.
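As a rough illustration of the idea, a function vector can be sketched as the sum of the average outputs of a few selected heads, following the general recipe from earlier function-vector work. The head indices, the toy two-dimensional outputs, and the helper names below are assumptions made for the example; real head outputs would be read from a model's forward pass.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def function_vector(head_outputs, selected_heads):
    """Sum the mean output of each selected head.

    head_outputs maps (layer, head) -> list of output vectors, one per
    ICL prompt (for example, taken at the final token position).
    """
    dim = len(next(iter(head_outputs.values()))[0])
    fv = [0.0] * dim
    for key in selected_heads:
        mean = mean_vector(head_outputs[key])
        fv = [a + b for a, b in zip(fv, mean)]
    return fv


# Toy data: three heads, two prompts each, hidden size 2 (made-up numbers).
head_outputs = {
    (3, 0): [[1.0, 0.0], [3.0, 2.0]],
    (5, 1): [[0.0, 4.0], [2.0, 0.0]],
    (7, 2): [[9.0, 9.0], [9.0, 9.0]],  # not selected, so it is ignored
}
fv = function_vector(head_outputs, [(3, 0), (5, 1)])
print(fv)  # [3.0, 3.0]
```

In practice the selected heads would be the ones whose outputs have the largest causal effect on task performance. For SSM layers, the same bookkeeping would apply to whatever is treated as a head.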

Key Discoveries from Internal Probing:

Location of FVs: Function vectors responsible for ICL were primarily found in the self-attention and Mamba layers. For hybrid models like HYMBA-1.5B-BASE, these FVs were much more concentrated in the self-attention layers, especially for parametric knowledge retrieval tasks.
Task-Specific FVs: The study found that the specific FV heads activated for parametric knowledge retrieval tasks were highly consistent and concentrated, but this was not the case for contextual knowledge understanding tasks. This suggests that different sets of FVs are involved depending on the task type.
Mamba2’s Unique Mechanism: While FVs contribute significantly to ICL in Transformers, Mamba, and hybrid models, they appear to be less crucial for Mamba2. Intervention experiments (steering and ablating FVs) showed that Mamba2’s performance was not as significantly influenced by FV manipulation, leading the researchers to speculate that Mamba2 might employ a different internal mechanism for ICL.
Hybrid Model Dominance: In hybrid models, the ICL capabilities are predominantly driven by FVs located in their self-attention layers. Even when Mamba streams were steered, the self-attention stream remained the primary contributor to performance.
Layer-wise Importance: Steering FVs in the middle or later layers of models generally led to improved ICL performance, particularly in self-attention layers of hybrid models.
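The two interventions mentioned above, steering and ablation, can be sketched on toy vectors. This is a simplified illustration under stated assumptions: steering is modeled as adding a scaled function vector to a hidden state at a chosen layer, and ablation as dropping the selected heads' contributions to a layer's output. The function names, the `scale` parameter, and all numbers are hypothetical.

```python
def steer(hidden, fv, scale=1.0):
    """Add a scaled function vector to a hidden state."""
    return [h + scale * v for h, v in zip(hidden, fv)]


def ablate(head_contribs, heads_to_remove):
    """Sum per-head contributions while skipping the ablated heads.

    head_contribs maps (layer, head) -> that head's output vector.
    Zero-ablation is assumed here; mean-ablation is a common alternative.
    """
    dim = len(next(iter(head_contribs.values())))
    total = [0.0] * dim
    for key, vec in head_contribs.items():
        if key in heads_to_remove:
            continue
        total = [t + x for t, x in zip(total, vec)]
    return total


# Toy values, for illustration only.
hidden = [0.5, -1.0]
fv = [3.0, 3.0]
steered = steer(hidden, fv, scale=0.5)
print(steered)  # [2.0, 0.5]

contribs = {(3, 0): [2.0, 1.0], (5, 1): [1.0, 2.0], (7, 2): [0.5, 0.5]}
ablated = ablate(contribs, {(3, 0)})
print(ablated)  # [1.5, 2.5]
```

Comparing task accuracy before and after each intervention shows how much a model depends on the targeted heads. A model whose accuracy barely moves, as reported for Mamba2, is presumably relying on something else.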

A Nuanced Understanding of ICL

This research significantly extends our understanding of in-context learning beyond the traditional focus on Transformers. It highlights that while different LLM architectures might achieve similar task performance, their internal mechanisms for ICL can vary considerably. The findings underscore the importance of Function Vectors, particularly in self-attention layers, for parametric knowledge retrieval. However, for contextual understanding, FVs play a less dominant role, and Mamba2 seems to leverage alternative mechanisms. This work shows the value of combining behavioral observations with mechanistic analyses when studying LLM capabilities.
