Compressing Text for LLMs: How Images Can Halve Token Usage

TLDR: This paper introduces a method to compress long text inputs for multimodal large language models (MLLMs) by rendering them as images. This “text-as-image” approach reduces the number of tokens processed by the model’s decoder, often by nearly half, without compromising performance on tasks like long-context retrieval and document summarization, which improves efficiency and can lower inference costs.

Large language models (LLMs) have become incredibly powerful, but processing long text inputs is computationally expensive. The main reason is the self-attention mechanism in their underlying architecture, whose cost in standard Transformers grows roughly quadratically with input length. The resulting cost and throughput problem is a significant hurdle for deploying LLMs in real-world applications like chat assistants or document analysis.

A new research paper, “Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs,” explores a surprisingly effective solution: feeding text to LLMs as images. The core idea is to render a long textual input as a single image and provide that image directly to a multimodal LLM (MLLM). This reduces the number of “decoder tokens” required, offering a new form of input compression.

How Does Text-as-Image Compression Work?

Traditionally, an LLM processes text as a sequence of tokens. For a long document, this means a large number of tokens, leading to higher computational costs. Multimodal LLMs, however, can also interpret visual inputs. The researchers leveraged this capability by converting the text into an image. A vision encoder within the MLLM then processes this image, generating a much smaller, fixed-length sequence of “visual tokens” for the language decoder to work with. Essentially, the image acts as an implicit compression layer, allowing the model to “read” the context from the image rather than processing every single text token.
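As a rough illustration of the rendering step, the sketch below lays text out on a fixed-width canvas and estimates how many visual tokens the resulting image would cost. It covers only layout and token budgeting, not rasterization, which would need an imaging library. The function name `plan_text_image`, the canvas width, the glyph size, and the 28-pixel tile per visual token are all assumptions chosen for illustration; this article does not state the paper's rendering settings, and real vision encoders differ in how they tile an image.

```python
import math
import textwrap


def plan_text_image(text, width_px=896, char_w=7, line_h=14, px_per_token=28):
    """Lay out text on a fixed-width canvas and estimate its visual-token cost.

    All default values are hypothetical: a monospace glyph of 7x14 pixels and
    one visual token per 28x28 pixel tile.
    """
    chars_per_line = width_px // char_w
    lines = textwrap.wrap(text, width=chars_per_line)

    # Round the height up to a whole number of tiles, with at least one row.
    raw_height = len(lines) * line_h
    height_px = max(math.ceil(raw_height / px_per_token), 1) * px_per_token

    visual_tokens = (width_px // px_per_token) * (height_px // px_per_token)
    return lines, (width_px, height_px), visual_tokens


if __name__ == "__main__":
    sample = "word " * 1000
    lines, size, tokens = plan_text_image(sample)
    print(f"{len(lines)} lines, image {size[0]}x{size[1]} px, ~{tokens} visual tokens")
```

For a specific model, the tile size and any resolution limits would have to come from that model's own documentation rather than from these defaults.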

The paper highlights that this method can cut the number of decoder tokens by nearly half while maintaining performance. This is a significant finding because it means LLMs can process the same amount of information with substantially less computational effort, leading to faster inference and reduced costs.
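To make the arithmetic behind that claim concrete, here is a minimal sketch with hypothetical token counts. It computes the fraction of decoder tokens saved and, assuming standard full self-attention where every token attends to every other, the relative number of token-pair interactions. The helper names and the example numbers are illustrative, not taken from the paper, and real end-to-end speedups also depend on costs this sketch ignores, such as the vision encoder.

```python
def token_reduction(text_tokens: int, visual_tokens: int) -> float:
    """Fraction of decoder tokens saved when visual tokens replace text tokens."""
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    return 1.0 - visual_tokens / text_tokens


def attention_pairs(seq_len: int) -> int:
    """Token-pair interactions in full self-attention (quadratic in length)."""
    return seq_len * seq_len


if __name__ == "__main__":
    # Hypothetical example: a 2,000-token context rendered into 1,000 visual tokens.
    text_tokens, visual_tokens = 2000, 1000
    saved = token_reduction(text_tokens, visual_tokens)
    pair_ratio = attention_pairs(visual_tokens) / attention_pairs(text_tokens)
    print(f"decoder tokens saved: {saved:.0%}")
    print(f"attention pairs relative to text input: {pair_ratio:.0%}")
```

Under these assumptions, halving the sequence length cuts the pairwise attention work to about a quarter, which is why a 2:1 token saving can matter more than the raw token count suggests.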

Experimental Validation: Retrieval and Summarization

The researchers tested their “text-as-image” compression strategy on two distinct benchmarks:

RULER (Long-Context Retrieval): This task hides a specific piece of information (a “needle”) within a long, irrelevant passage (a “haystack”), and the model must accurately retrieve it. Experiments with models like GPT-4.1-mini and Qwen2.5-VL-72B-Instruct showed that visual text inputs could sustain high accuracy (97-99%) while reducing decoder tokens by up to 58%. For larger models such as Qwen2.5-VL-72B-Instruct, this even translated to a 25-45% end-to-end speedup in inference time, despite the added overhead of vision processing.
CNN/DailyMail (Document Summarization): This benchmark evaluates the model’s ability to summarize long documents. The text-as-image method outperformed two specialized token-pruning baselines (Select-Context and LLMLingua-2) at similar or even higher compression rates. This suggests that the visual representation effectively preserves the necessary information for complex tasks like summarization, even when not specifically trained for it.
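The needle-in-a-haystack setup used for the retrieval task can be sketched as follows. This is a simplified illustration, not RULER's actual generator: the function names, the filler sentences, and the substring-based scoring rule are all assumptions made for the example.

```python
import random


def build_haystack(needle, filler_sentences, n_sentences, seed=0):
    """Insert one needle sentence at a random position among filler sentences."""
    rng = random.Random(seed)
    sentences = [rng.choice(filler_sentences) for _ in range(n_sentences)]
    position = rng.randrange(n_sentences + 1)
    sentences.insert(position, needle)
    return " ".join(sentences), position


def retrieval_accuracy(predictions, answers):
    """Share of predictions that contain the expected answer string."""
    hits = sum(answer in prediction for prediction, answer in zip(predictions, answers))
    return hits / len(answers)


if __name__ == "__main__":
    filler = ["The sky is blue.", "Grass is green.", "Rivers flow downhill."]
    context, position = build_haystack("The secret code is 4821.", filler, 50)
    print(f"needle inserted at sentence index {position} of {context.count('.')} sentences")
    # Hypothetical model outputs scored against the expected answers.
    print(retrieval_accuracy(["The code is 4821", "I am not sure"], ["4821", "1234"]))
```

In a text-as-image run, the `context` string would be rendered to an image before being passed to the model, while the question asking for the needle would presumably stay as ordinary text.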

A key observation concerns the “text-token tolerance”: the maximum amount of text that can be rendered into an image and still be accurately processed by the MLLM. The experiments showed a strong positive correlation between the number of visual tokens produced from the image and the text-token tolerance, with a consistent compression ratio of approximately 2:1, meaning the image needs roughly half as many visual tokens as the text it replaces would need text tokens. This linear relationship suggests a predictable trade-off between the visual budget and text compression capacity.
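If the relationship is treated as exactly linear, the trade-off can be turned into a simple budgeting rule. The sketch below assumes a fixed ratio of 2.0 text tokens per visual token, taken from the approximate figure reported above; the function names are hypothetical, and the true tolerance will vary by model and rendering settings.

```python
import math


def text_token_tolerance(visual_tokens: int, ratio: float = 2.0) -> int:
    """Estimated maximum text tokens an image of this visual budget can carry."""
    return math.floor(visual_tokens * ratio)


def visual_budget(text_tokens: int, ratio: float = 2.0) -> int:
    """Estimated visual tokens needed to carry a text of this length."""
    return math.ceil(text_tokens / ratio)


def fits(text_tokens: int, visual_tokens: int, ratio: float = 2.0) -> bool:
    """Whether a text is within the estimated tolerance of a visual budget."""
    return text_tokens <= text_token_tolerance(visual_tokens, ratio)


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    print(text_token_tolerance(600))
    print(visual_budget(1500))
    print(fits(1300, 600))
```

A rule like this is only a planning heuristic: a text that exceeds the estimated tolerance would need a larger image or would have to be split across several images.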


Implications and Future Directions

The findings suggest that converting long textual contexts into images is a practical and effective way to reduce inference costs on large-context tasks without sacrificing performance. This approach is model- and task-agnostic, requiring no fine-tuning or parameter updates to the LLM itself.

The authors propose several exciting future directions, including combining this visual rendering with existing token-level pruning techniques for even greater compression gains, and expanding the approach to other domains like mathematics where prompt tokens are often critical and difficult to prune traditionally. While the current work focuses on short to medium context scenarios, future research will explore its impact on extremely large contexts spanning tens of thousands of tokens.

This research opens up a new avenue for improving the efficiency and usability of large language models by leveraging their multimodal capabilities. You can read the full paper for more details at arXiv:2510.18279.
