Compressing Text for LLMs: How Images Can Halve Token Usage

TLDR: This paper introduces a method to compress long text inputs for multimodal large language models (LLMs) by rendering them as images. This “text-as-image” approach significantly reduces the number of tokens processed by the LLM’s decoder (often by nearly half) without compromising performance on tasks like long-context retrieval and document summarization, leading to improved efficiency and potentially lower inference costs.

Large language models (LLMs) have become incredibly powerful, but processing long text inputs is computationally expensive. The main reason is architectural: the cost of the self-attention mechanism grows quadratically with input length. The resulting cost and throughput penalty is a significant hurdle for deploying LLMs in real-world applications like chat assistants or document analysis.

A new research paper titled “Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs” explores a surprisingly effective solution: feeding text to LLMs as images. The core idea is to render a long textual input as a single image and provide that image directly to a multimodal LLM (MLLM). Doing so reduces the number of “decoder tokens” the model must process, which makes it a form of input compression.

How Does Text-as-Image Compression Work?

Traditionally, an LLM processes text as a sequence of tokens. For a long document, this means a large number of tokens, leading to higher computational costs. Multimodal LLMs, however, can also interpret visual inputs. The researchers leveraged this capability by converting the text into an image. A vision encoder within the MLLM then processes this image, generating a much smaller, fixed-length sequence of “visual tokens” for the language decoder to work with. Essentially, the image acts as an implicit compression layer, allowing the model to “read” the context from the image rather than processing every single text token.
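
To make the rendering step concrete, here is a minimal sketch of how one might plan the canvas for a block of text and estimate the visual-token count it would produce. It is not the paper's pipeline: the function names, the monospace cell size, the margin, and the assumption of one visual token per 28×28-pixel tile are all illustrative choices, and real vision encoders differ in how they tile and resize images. Actual rasterization (drawing the wrapped lines onto the canvas) would be done with an imaging library and is omitted here.

```python
import math
import textwrap


def plan_text_image(text, chars_per_line=80, char_w=8, char_h=16, margin=16):
    """Wrap text and return (lines, width_px, height_px) for a monospace canvas.

    All sizes are hypothetical defaults, not values from the paper.
    """
    lines = []
    for paragraph in text.splitlines():
        # textwrap.wrap returns [] for an empty paragraph; keep a blank line instead.
        lines.extend(textwrap.wrap(paragraph, width=chars_per_line) or [""])
    width_px = 2 * margin + chars_per_line * char_w
    height_px = 2 * margin + max(len(lines), 1) * char_h
    return lines, width_px, height_px


def estimate_visual_tokens(width_px, height_px, px_per_token_side=28):
    """Assume one visual token per square tile (an assumption; encoders vary)."""
    tiles_across = math.ceil(width_px / px_per_token_side)
    tiles_down = math.ceil(height_px / px_per_token_side)
    return tiles_across * tiles_down


if __name__ == "__main__":
    lines, w, h = plan_text_image("word " * 100)
    print(len(lines), w, h, estimate_visual_tokens(w, h))
```

Under these assumptions the visual-token count depends only on the canvas size, so packing more text into the same canvas (smaller font, tighter wrapping) raises the compression rate until the model can no longer read the text reliably.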

The paper highlights that this method can cut the number of decoder tokens by nearly half while maintaining performance. This is a significant finding because it means LLMs can process the same amount of information with substantially less computational effort, leading to faster inference and reduced costs.
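
The arithmetic behind that claim can be sketched in a few lines. The helper names below are hypothetical, the token counts are made-up illustrations rather than results from the paper, and the quadratic estimate covers only the pairwise self-attention term in the decoder; it ignores the vision encoder's overhead and every cost that scales linearly with length.

```python
def decoder_token_reduction(text_tokens: int, visual_tokens: int) -> float:
    """Fraction of decoder tokens saved by replacing text tokens with visual tokens."""
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    return 1.0 - visual_tokens / text_tokens


def relative_attention_cost(text_tokens: int, visual_tokens: int) -> float:
    """Rough ratio of pairwise attention work, assuming quadratic scaling in length."""
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    return (visual_tokens / text_tokens) ** 2


if __name__ == "__main__":
    # Illustrative numbers only: a 2,000-token context rendered into 1,000 visual tokens.
    print(decoder_token_reduction(2000, 1000))
    print(relative_attention_cost(2000, 1000))
```

Measured end-to-end speedups will generally be smaller than the quadratic estimate suggests, because attention is only one part of inference cost.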

Experimental Validation: Retrieval and Summarization

The researchers tested their “text-as-image” compression strategy on two distinct benchmarks:

  • RULER (Long-Context Retrieval): This task involves hiding a specific piece of information (a “needle”) within a long, irrelevant passage (“haystack”). The model must accurately retrieve this hidden information. Experiments with models like GPT-4.1-mini and Qwen2.5-VL-72B-Instruct showed that visual text inputs could sustain high accuracy (97-99%) while reducing decoder tokens by up to 58%. For larger models like Qwen2.5-VL-72B, this even translated to a 25-45% end-to-end speedup in inference time, despite the added overhead of vision processing.
  • CNN/DailyMail (Document Summarization): This benchmark evaluates the model’s ability to summarize long documents. The text-as-image method outperformed two specialized token-pruning baselines (Select-Context and LLMLingua-2) at similar or even higher compression rates. This suggests that the visual representation effectively preserves the necessary information for complex tasks like summarization, even when not specifically trained for it.
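
For readers unfamiliar with needle-in-a-haystack evaluation, the following is a simplified sketch of the setup, not RULER's actual implementation. The function names, the filler sentence, the needle wording, and the substring-match scoring rule are all assumptions made for illustration.

```python
def build_haystack(needle_sentence, filler_sentence, n_filler, insert_at):
    """Return a context with one needle sentence placed among repeated filler."""
    sentences = [filler_sentence] * n_filler
    sentences.insert(insert_at, needle_sentence)
    return " ".join(sentences)


def contains_answer(model_output, expected):
    """Score a response as correct if it contains the expected string (case-insensitive)."""
    return expected.strip().lower() in model_output.lower()


if __name__ == "__main__":
    context = build_haystack(
        needle_sentence="The special magic number is 7481.",
        filler_sentence="The grass is green.",
        n_filler=4,
        insert_at=2,
    )
    print(context)
    # In the text-as-image setting, `context` would be rendered to an image
    # before being passed to the model; the question itself stays as text.
    print(contains_answer("The number is 7481.", "7481"))
```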

A key observation concerns “text-token tolerance”: the maximum amount of text that can be rendered into an image and still be accurately processed by the MLLM. The studies showed a strong positive correlation between the number of visual tokens (from the image) and the text-token tolerance, with a consistent compression ratio of approximately 2:1 (that is, the image needs roughly half as many visual tokens as the text it replaces has text tokens). This roughly linear relationship suggests a predictable trade-off between the visual budget and text compression capacity.
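
Taken at face value, the roughly 2:1 relationship gives a simple budgeting rule, sketched below. The function names are hypothetical and the default ratio of 2.0 is only the approximate figure quoted above; the real tolerance depends on the model, the image resolution, and the rendering settings, so it should be measured rather than assumed.

```python
import math


def visual_tokens_needed(text_tokens: int, ratio: float = 2.0) -> int:
    """Smallest visual-token budget expected to carry `text_tokens` at the given ratio."""
    return math.ceil(text_tokens / ratio)


def text_token_tolerance(visual_tokens: int, ratio: float = 2.0) -> int:
    """Largest text length (in tokens) a given visual-token budget is expected to carry."""
    return math.floor(visual_tokens * ratio)


if __name__ == "__main__":
    print(visual_tokens_needed(1500))
    print(text_token_tolerance(600))
```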

Implications and Future Directions

The findings suggest that converting long textual contexts into images is a practical and effective way to reduce inference costs on large-context tasks without sacrificing performance. This approach is model- and task-agnostic, requiring no fine-tuning or parameter updates to the LLM itself.

The authors propose several future directions, including combining visual rendering with existing token-level pruning techniques for even greater compression gains, and extending the approach to other domains such as mathematics, where prompt tokens are often critical and hard to prune with conventional methods. While the current work focuses on short to medium context scenarios, future research will explore its impact on extremely large contexts spanning tens of thousands of tokens.

This research opens up a new avenue for improving the efficiency and usability of large language models by leveraging their multimodal capabilities. You can read the full paper for more details at arXiv:2510.18279.

Nikhil Patel
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]