Power Attention: A New Approach to Efficient Long-Context AI Models

TLDR: A new research paper introduces “power attention,” an architectural layer for AI models that efficiently handles extremely long sequences of data. Unlike traditional Transformers (which are too costly) or existing linear attention models (which are too simple), power attention offers a linear computational cost and an adjustable state size, leading to better in-context learning and superior performance on long-context training, with optimized GPU kernels for practical use.

As artificial intelligence models, particularly large language models (LLMs), continue to grow in complexity and capability, a critical challenge emerges: how to efficiently process and learn from increasingly long sequences of information, known as “context.” Traditional attention mechanisms, like those found in the widely used Transformer architecture, struggle with this, as their computational cost skyrockets with longer contexts. This has led researchers to explore alternative architectures, but a new paper from Manifest AI argues that neither Transformers nor existing sub-quadratic models are truly optimized for the future of long-context AI.

The Problem with Current Approaches

The core issue lies in how current models handle context. Transformers, while powerful, use a self-attention mechanism whose computational cost grows quadratically with context length. Doubling the context length therefore quadruples the computational effort, which becomes prohibitively expensive for very long sequences (millions or billions of tokens). Sub-quadratic architectures, which often employ linear attention, aim to reduce this cost. However, the paper suggests that these models tend to be “too inexpensive” in their processing of context: they spend so little compute on context relative to their parameters that the imbalance simply tips the other way.
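To make the scaling concrete, here is a rough back-of-the-envelope sketch in Python. The function names and cost formulas are illustrative assumptions, not taken from the paper: they count only the dominant multiply-adds for a single attention head and ignore constant factors, causal masking, and the value aggregation step.

```python
def attention_score_flops(context_len: int, head_dim: int) -> int:
    """Approximate multiply-adds to score every query against every key.

    Each of the n queries takes a dot product of length head_dim with
    each of the n keys, so the count grows with n squared.
    """
    return context_len * context_len * head_dim


def linear_state_flops(context_len: int, head_dim: int) -> int:
    """Approximate multiply-adds for a linear-attention-style layer.

    Each token updates a fixed head_dim x head_dim state once, so the
    count grows with n.
    """
    return context_len * head_dim * head_dim


# Doubling the context from 4,096 to 8,192 tokens (head size 64):
quadratic_ratio = attention_score_flops(8192, 64) / attention_score_flops(4096, 64)
linear_ratio = linear_state_flops(8192, 64) / linear_state_flops(4096, 64)

print(quadratic_ratio)  # 4.0
print(linear_ratio)     # 2.0
```

Under these simplified counts, the quadratic term costs four times as much after doubling the context, while the linear term only doubles.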

Another approach, “sliding window attention,” reduces Transformer costs by letting each token attend only to a fixed number of recent tokens. While this helps with efficiency, it impairs the model’s ability to learn from information spread across the entire context, a crucial capability known as “in-context learning.”
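The trade-off is easy to see in a toy sketch. The helper below is hypothetical (the name `visible_positions` and the window size are assumptions for illustration); it simply lists which earlier positions a token can attend to under a causal sliding window.

```python
def visible_positions(query_pos: int, window: int) -> list[int]:
    """Positions a token at query_pos can attend to with a causal window.

    The token sees itself and the (window - 1) tokens before it;
    anything older falls outside the window and is invisible.
    """
    start = max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))


print(visible_positions(10, 4))       # [7, 8, 9, 10]
print(0 in visible_positions(10, 4))  # False
```

With a window of 4, the token at position 10 cannot attend directly to anything at position 0, no matter how relevant it is, which is why limiting the window hurts learning from distant context.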

Introducing Power Attention: A Balanced Solution

To address these limitations, Manifest AI introduces a novel architectural layer called “power attention.” This new approach is designed for linear-cost sequence modeling, meaning its computational cost scales linearly with context length, making it much more efficient for very long sequences. A key innovation of power attention is its ability to adjust its “state size” independently of the model’s parameters. This allows for a crucial balance between the computational effort spent on processing the model’s internal state and its parameters, a balance the authors argue is essential for compute-optimal long-context training.

Power attention achieves this by substituting the exponential function in classic attention with a p-th power. This seemingly simple change allows it to inherit the computational advantages of linear attention while offering a flexible state expansion. For instance, with a head size of 64, setting ‘p’ to 2 can increase the state size by approximately 32 times without adding parameters, significantly boosting performance.
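A minimal Python sketch can illustrate why the p-th power keeps the linear-attention structure. It relies on a standard identity: (q·k)^p equals the dot product of the p-fold tensor powers of q and k. Everything here is a simplified assumption rather than the paper's implementation: the function names are hypothetical, normalization and any gating are omitted, and the recurrent version stores the full d^p expansion. The final helper counts only the distinct entries of that expansion, C(d+p-1, p), on the assumption that a symmetric expansion stores each one once, which is consistent with the roughly 32-fold figure above.

```python
from itertools import product
from math import comb


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tensor_power(x, p):
    """Flattened p-fold tensor product: every product x[i1] * ... * x[ip]."""
    out = []
    for idx in product(range(len(x)), repeat=p):
        term = 1.0
        for i in idx:
            term *= x[i]
        out.append(term)
    return out


def power_attention_pairwise(qs, ks, vs, p):
    """Causal, unnormalized attention with weights (q.k)^p, computed pair by pair."""
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for k, v in zip(ks[: t + 1], vs[: t + 1]):
            w = dot(q, k) ** p
            out = [o + w * vi for o, vi in zip(out, v)]
        outputs.append(out)
    return outputs


def power_attention_recurrent(qs, ks, vs, p):
    """Same outputs from a fixed-size state updated once per token."""
    d_v = len(vs[0])
    state_rows = len(qs[0]) ** p
    state = [[0.0] * d_v for _ in range(state_rows)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # State update: add the outer product of the expanded key and the value.
        for i, fk in enumerate(tensor_power(k, p)):
            for j in range(d_v):
                state[i][j] += fk * v[j]
        # Readout: contract the expanded query with the state.
        phi_q = tensor_power(q, p)
        outputs.append(
            [sum(phi_q[i] * state[i][j] for i in range(state_rows)) for j in range(d_v)]
        )
    return outputs


def symmetric_state_dim(head_dim, p):
    """Number of distinct degree-p monomials in head_dim variables."""
    return comb(head_dim + p - 1, p)


qs = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
ks = [[0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pairwise = power_attention_pairwise(qs, ks, vs, p=2)
recurrent = power_attention_recurrent(qs, ks, vs, p=2)
matches = all(
    abs(a - b) < 1e-9 for ra, rb in zip(pairwise, recurrent) for a, b in zip(ra, rb)
)

print(matches)                          # True
print(symmetric_state_dim(64, 2))       # 2080
print(symmetric_state_dim(64, 2) / 64)  # 32.5
```

The pairwise version does work proportional to the square of the sequence length, while the recurrent version does a fixed amount of work per token. For a head size of 64 and p = 2, the symmetric count gives 2,080 expanded key dimensions in place of 64, a 32.5-fold larger state with no additional parameters.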

Hardware Efficiency and Empirical Success

The researchers didn’t just propose a theoretical concept; they also developed and open-sourced a set of GPU kernels for efficient power attention. They identified a new pattern of operation fusion to overcome memory and bandwidth bottlenecks, similar to how “Flash Attention” optimized Transformers. Their “TSPOW” approach, a tiled symmetric power expansion, further enhances hardware compatibility and efficiency, interpolating between different computational structures to maximize performance on modern GPUs.

Empirical evaluations show promising results. In experiments on long-context datasets, power attention models demonstrated superior in-context learning compared to other balanced architectures like windowed attention. When training on very long contexts (65,536 tokens), power attention significantly outperformed both exponential attention and traditional linear attention in terms of “loss-per-FLOP,” indicating better efficiency for a given computational budget. This suggests that power attention can achieve similar in-context learning capabilities to Transformers at a much lower cost for long sequences.

Looking Ahead

While the initial results are compelling, the authors acknowledge limitations and areas for future work. Their experiments primarily focused on generic natural language text and negative log likelihood, suggesting a need to validate findings across diverse domains, modalities (like audio or video), and downstream tasks. They also aim to further optimize their GPU kernels by transitioning from Triton to CUDA for even greater wall-clock performance. The paper represents a significant step towards building more efficient and capable AI models for the era of extremely long contexts. You can read the full research paper here.
