OrthoRank: A New Approach to Efficient LLM Inference Through Token Selection

TLDR: OrthoRank is a novel method to make Large Language Models (LLMs) run more efficiently without extra training. It identifies a ‘sink token’ that remains stable across layers and then dynamically selects other tokens for full computation based on their ‘orthogonality’ (dissimilarity) to this sink token. This approach significantly reduces computational cost while maintaining or improving LLM performance in terms of perplexity, zero-shot accuracy, and long-context understanding, offering a practical solution for faster and more economical LLM inference.

Large Language Models (LLMs) have become incredibly powerful tools, excelling at a wide range of tasks from writing to complex problem-solving. However, their impressive capabilities come with a significant challenge: the high computational cost of running them, especially for real-time applications. This has led researchers to explore various methods to make LLMs more efficient.

Traditional approaches to reducing LLM computational costs often involve methods like ‘layer pruning,’ where entire layers of the model are removed if they are deemed less critical. While effective in some scenarios, layer pruning has limitations. It applies a fixed decision across all input tokens, meaning it can’t adapt to the specific needs of individual tokens. Some tokens might no longer require extensive processing, while others still benefit from it, but layer pruning can’t account for this variation.

Another set of methods, such as ‘early exit’ or ‘mixture of depths,’ aims for more dynamic computation paths based on token-level characteristics. However, these often require training additional components or even retraining the entire model, which limits their practical use with the wide range of existing LLMs.

A recent research paper introduces a novel approach called OrthoRank, which tackles these efficiency challenges without requiring additional training. The core of OrthoRank stems from a deeper understanding of how LLMs process information internally, specifically focusing on a phenomenon known as the ‘attention sink.’

The attention sink refers to the observation that an initial token in an input sequence often receives a disproportionately high amount of attention from other tokens, despite its limited semantic role. Researchers behind OrthoRank delved further, analyzing the similarity of hidden states between this ‘sink token’ and other tokens. They discovered a fascinating pattern: as the LLM processes deeper layers, the cosine similarity between the normalized hidden states of the sink token and other tokens increases. This implies that other tokens are consistently moving towards the sink token throughout the layers, while the sink token itself remains remarkably stable.
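To make the measurement concrete, here is a minimal sketch of how the similarity between the sink token and the other tokens could be computed in a single layer. The hidden states are toy values, the function names are hypothetical, and the sink is assumed to sit at position 0; none of this is taken from the paper's code.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_to_sink(hidden_states, sink_index=0):
    """Cosine similarity between every token's hidden state and the sink token's."""
    unit = [normalize(h) for h in hidden_states]
    sink = unit[sink_index]
    return [sum(a * b for a, b in zip(u, sink)) for u in unit]

# Toy hidden states: 3 tokens, 2 features each. Token 0 plays the sink (an assumption).
hidden_states = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sims = cosine_to_sink(hidden_states)
print(sims)
```

In this toy input, token 1 is exactly orthogonal to the sink (similarity 0), while token 2 lies at 45 degrees to it. Repeating the measurement layer by layer is what would reveal the drift toward the sink described above.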

OrthoRank leverages this insight to define ‘token importance.’ Instead of processing every token in every layer, OrthoRank identifies tokens that are ‘more orthogonal’ (meaning less similar) to the sink token as more important. The reasoning is that tokens that are still far from the static sink token are the ones that are actively changing and thus require further computation. Conversely, tokens that have already aligned closely with the sink token may not need as much processing.
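The ranking idea can be sketched in a few lines. This is an illustration rather than the paper's implementation: it assumes the importance score is the absolute cosine similarity to the sink (smaller means more orthogonal, hence more important), and the function name and the budget parameter `k` are hypothetical.

```python
def select_orthogonal_tokens(similarities, k):
    """Return the indices of the k tokens whose cosine similarity
    to the sink token is closest to zero (i.e. the most orthogonal)."""
    ranked = sorted(range(len(similarities)), key=lambda i: abs(similarities[i]))
    return sorted(ranked[:k])

# Toy cosine similarities to the sink; index 0 is the sink itself (similarity 1.0).
similarities = [1.0, 0.05, 0.80, -0.10, 0.60]
selected = select_orthogonal_tokens(similarities, k=2)
print(selected)
```

Because the sink is perfectly similar to itself, it naturally ranks last and is never chosen unless `k` covers every token.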

The method works by dynamically selecting a subset of tokens for full computation in a given layer. The unselected tokens are not entirely discarded; they still participate in ‘key’ and ‘value’ calculations, which are essential for the selected tokens to interact correctly. However, their own states are not updated, significantly reducing computational overhead. This selective processing is applied to specific layers, often in conjunction with existing layer pruning strategies, to achieve optimal efficiency.
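A rough sketch of what such a selective layer might look like follows. It is a toy single-head causal attention with identity projections and no MLP, written in plain Python for readability; the function names and the simplifications are assumptions made for illustration, not the authors' code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def selective_attention_layer(hidden, selected):
    """Toy causal self-attention in which only `selected` tokens are updated.

    Every token supplies a key and a value, but queries and the residual
    update are computed only for the selected indices. Projection matrices
    are taken to be the identity to keep the sketch short (an assumption).
    """
    dim = len(hidden[0])
    keys = hidden      # all tokens contribute keys
    values = hidden    # and values
    output = [row[:] for row in hidden]  # unselected rows pass through unchanged
    for i in selected:
        query = hidden[i]
        # Causal mask: token i attends to positions 0..i only.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        context = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(dim)
        ]
        # Residual connection: add the attention output to the token's state.
        output[i] = [h + c for h, c in zip(hidden[i], context)]
    return output

hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = selective_attention_layer(hidden, selected=[2])
print(updated)
```

Here only token 2 is recomputed, yet it still attends to tokens 0 and 1 through their keys and values. The saving comes from skipping the query, attention-output, and residual work for every unselected row.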

Extensive experiments demonstrate the effectiveness of OrthoRank. When compared to existing layer pruning methods at the same sparsity (computational reduction) ratio, OrthoRank consistently achieves lower perplexity (a measure of how well a language model predicts text, where lower is better) and higher zero-shot accuracy across several model families, including Llama-2, Llama-3, Mistral, and Mixtral. It also shows superior performance on long-context understanding tasks (LongBench) and improves factual quality as measured by TruthfulQA. Crucially, these performance gains are achieved with comparable or even improved throughput (processing speed).
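For readers unfamiliar with perplexity, the standard definition is the exponential of the average negative log-probability that the model assigns to each token. The snippet below is a small illustration of that definition with made-up probabilities; it is not tied to any result in the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Toy example: the model assigns probability 0.25 to each of four tokens.
log_probs = [math.log(0.25)] * 4
ppl = perplexity(log_probs)
print(ppl)
```

A model that gives every token a probability of 0.25 has a perplexity of about 4, which can be read as the model being as uncertain as a uniform choice among four options.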

The paper’s ablation studies further validate OrthoRank’s design choices, confirming that selecting tokens based on their orthogonality to the sink token in normalized hidden states is indeed the most effective strategy. The research highlights that this approach offers a simple, interpretable, and highly practical mechanism for optimizing LLM inference without the need for complex retraining or additional modules, making it readily applicable to existing pretrained models. For more technical details, you can refer to the full research paper: OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference.
