GRAINS: Enhancing AI Model Behavior Through Targeted Gradient-Based Steering

TLDR: GRAINS is an inference-time steering method for Large Language Models (LLMs) and Vision-Language Models (VLMs). It uses contrastive, gradient-based attribution to identify the most influential input tokens (both positive and negative) and builds steering vectors from them. These vectors are then used to adjust the model’s internal activations during inference, guiding it towards desired behaviors like truthfulness and safety, and away from undesirable ones like hallucinations and toxicity. In the reported experiments, GRAINS outperforms fine-tuning and existing steering baselines on accuracy and hallucination rate while preserving the model’s general capabilities and fluency.

Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown remarkable capabilities across various tasks. However, they often produce outputs that are undesirable, such as lacking factual grounding, generating hallucinations, or exhibiting toxicity. Traditionally, fine-tuning has been used to address these issues by adapting models with specific datasets. While effective, fine-tuning is computationally expensive, requires substantial data, and carries the risk of ‘catastrophic forgetting,’ where the model loses previously learned knowledge.

A promising alternative to fine-tuning is ‘inference-time steering.’ This approach modifies the model’s internal activations at inference time, without altering the model’s parameters. Existing steering methods, however, often rely on fixed, global intervention vectors and do not account for the influence of individual input tokens. They also tend to ignore gradient information from the model’s outputs, which matters most in multimodal settings where visual and textual inputs contribute unevenly.

Introducing GRAINS: A Gradient-Based Approach to AI Steering

To overcome these limitations, researchers have introduced GRAINS (Gradient-based Attribution for Inference-Time Steering), an inference-time steering method applicable to both language-only and vision-language models. It uses contrastive, gradient-based attribution, specifically Integrated Gradients, to pinpoint the most influential tokens in an input. These tokens are identified based on their contribution to either preferred or dispreferred outputs.
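As a rough illustration of Integrated Gradients, the sketch below attributes a toy scalar score to token embeddings by averaging gradients along a straight path from a baseline to the input. The toy score function, the all-zeros baseline, and names such as `integrated_gradients` and `score_grad` are assumptions made for this example; the source does not specify the model, the baseline, or the number of integration steps.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=256):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    grad_fn(z) must return the gradient of the score with respect to z,
    with the same shape as z.
    """
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        # Gradient at a point on the straight path from baseline to x.
        total += grad_fn(baseline + a * (x - baseline))
    avg_grad = total / steps
    return (x - baseline) * avg_grad

# Toy stand-in for a model score (an assumption, not the real model):
# F(z) = sum(tanh(z * w)), with an analytic gradient.
rng = np.random.default_rng(0)
w = rng.normal(size=(5, 4))        # 5 tokens, 4 embedding dimensions
x = rng.normal(size=(5, 4))        # hypothetical token embeddings
baseline = np.zeros_like(x)        # assumed baseline: all-zeros embeddings

def score(z):
    return np.tanh(z * w).sum()

def score_grad(z):
    return w * (1.0 - np.tanh(z * w) ** 2)

attr = integrated_gradients(score_grad, x, baseline)
token_attr = attr.sum(axis=1)      # one attribution value per token
```

A useful sanity check is the completeness property of Integrated Gradients: the attributions should sum, up to integration error, to `score(x) - score(baseline)`.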

Once these influential tokens are identified, GRAINS constructs directional steering vectors. These vectors capture the semantic shifts needed to guide the model from undesirable to desirable behavior. During the inference process, GRAINS precisely adjusts the hidden activations within the transformer layers, guided by these token-level attribution signals. A crucial step involves normalizing these activations to maintain the model’s representational scale, ensuring that the intervention is targeted without disrupting the model’s overall capabilities.
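The adjust-then-normalize step can be sketched as follows. This is a minimal example under stated assumptions: the steering vector is simply added to every position's hidden state, and each row is then rescaled to its original L2 norm. The `steer` function, the `strength` parameter, and the random data are hypothetical; the source does not give the exact update rule or how attribution signals weight the intervention.

```python
import numpy as np

def steer(hidden, v, strength=4.0):
    """Add a steering vector to every position, then restore each row's L2 norm."""
    orig_norm = np.linalg.norm(hidden, axis=-1, keepdims=True)
    shifted = hidden + strength * v
    new_norm = np.linalg.norm(shifted, axis=-1, keepdims=True)
    # Rescale so the intervention changes direction but not magnitude.
    return shifted * orig_norm / np.maximum(new_norm, 1e-12)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 16))   # hypothetical (positions, hidden_dim) activations
v = rng.normal(size=16)             # hypothetical steering vector
v /= np.linalg.norm(v)

steered = steer(hidden, v)
```

After steering, each row keeps its original norm while its projection onto `v` increases, which is one way to read the claim that the intervention preserves the representational scale.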

How GRAINS Works in Simple Terms

Imagine you want an AI model to be more truthful. GRAINS first identifies which parts of the input (tokens, whether text or image patches) strongly contribute to a truthful answer versus a false one. It does this by looking at the ‘gradients,’ which measure how much each input token influences the model’s preference for a correct answer over an incorrect one. Tokens that push the model towards a wrong answer are labeled ‘negative,’ and those that push it towards a right answer are labeled ‘positive.’
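The positive/negative split described above can be sketched as a simple top-k selection over per-token attribution scores. The contrastive score is assumed here to be a log-probability difference between the preferred and dispreferred answers, which the source does not state explicitly. The attribution values below are invented purely for illustration, and `split_influential_tokens` is a hypothetical helper.

```python
import numpy as np

def contrastive_score(logp_preferred, logp_dispreferred):
    """Assumed objective: how strongly the model prefers the good answer."""
    return logp_preferred - logp_dispreferred

def split_influential_tokens(token_attr, k):
    """Return indices of the top-k positive and top-k negative tokens."""
    order = np.argsort(token_attr)  # ascending: most negative first
    neg = [int(i) for i in order[:k] if token_attr[i] < 0]
    pos = [int(i) for i in order[::-1][:k] if token_attr[i] > 0]
    return pos, neg

tokens = ["Is", "the", "Great", "Wall", "visible", "from", "space", "?"]
# Made-up attributions of the contrastive score to each token.
token_attr = np.array([0.02, -0.01, 0.40, 0.35, -0.55, 0.05, -0.30, 0.00])

pos, neg = split_influential_tokens(token_attr, k=2)
print([tokens[i] for i in pos])  # tokens pushing towards the preferred answer
print([tokens[i] for i in neg])  # tokens pushing towards the dispreferred answer
```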

Then, GRAINS creates ‘steering vectors’ based on these positive and negative influences. These vectors act like a compass, pointing the model’s internal representations in the desired direction. During generation, these vectors are injected into the model’s hidden layers, nudging it away from paths that lead to undesirable outputs and towards those that lead to desired ones. The intervention is targeted because it focuses only on the most impactful tokens, avoiding broad, potentially disruptive changes.
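One plausible way to turn positive and negative token sets into a single direction is sketched below. It assumes that hidden states are recorded for the full input and again with the positive or negative tokens masked out, that the resulting shifts are averaged over examples, and that the negative direction is subtracted from the positive one. These construction details, the synthetic data, and the names `mean_shift` and `build_steering_vector` are assumptions for illustration, not taken from the source.

```python
import numpy as np

def mean_shift(h_full, h_masked):
    """Average hidden-state change caused by masking tokens, over examples."""
    return (h_full - h_masked).mean(axis=0)

def build_steering_vector(h_full, h_pos_masked, h_neg_masked):
    v_pos = mean_shift(h_full, h_pos_masked)   # what positive tokens contribute
    v_neg = mean_shift(h_full, h_neg_masked)   # what negative tokens contribute
    v = v_pos - v_neg                          # towards positive, away from negative
    return v / np.linalg.norm(v)

# Synthetic data with a planted "desired" direction along dimension 0.
rng = np.random.default_rng(1)
n_examples, d = 32, 16
e0 = np.zeros(d)
e0[0] = 1.0
h_full = rng.normal(size=(n_examples, d))
h_pos_masked = h_full - e0 + rng.normal(scale=0.1, size=(n_examples, d))
h_neg_masked = h_full + e0 + rng.normal(scale=0.1, size=(n_examples, d))

v = build_steering_vector(h_full, h_pos_masked, h_neg_masked)
```

On this synthetic data the recovered vector points almost entirely along the planted direction, which is the behavior one would want from any such construction.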

Impressive Results Across Models and Tasks

In the reported experiments, GRAINS outperformed both traditional fine-tuning and existing steering methods. On the TruthfulQA dataset, it achieved a 13.22% accuracy gain when applied to Llama-3.1-8B. In multimodal settings, it reduced hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B and improved alignment win rates on SPA-VL by 8.11%. These improvements were achieved while preserving the model’s fluency and general capabilities, unlike some prior methods that could degrade performance on unrelated tasks.

The research highlights that GRAINS’s ability to integrate gradient-based token attribution with activation steering allows for fine-grained, interpretable, and modular control over model behavior. This means AI developers can precisely steer models towards desired attributes like truthfulness or safety without the need for extensive retraining or auxiliary supervision. For more technical details, you can refer to the full research paper.