
Bridging the Gap: How Discrete Tokens Power Multimodal Large Language Models

TLDR: This survey explores discrete tokenization, particularly Vector Quantization (VQ), as a crucial method for converting continuous data (like images, audio, video) into a format Large Language Models (LLMs) can understand. It details various VQ techniques, their applications in both traditional and LLM-based systems, and identifies key challenges like codebook collapse and information loss, while outlining future research directions for more efficient and generalizable multimodal AI.

The world of Artificial Intelligence is rapidly evolving, with Large Language Models (LLMs) at the forefront of understanding and generating human language. However, the real world is not just text; it’s filled with continuous data like images, sounds, and videos. The challenge lies in teaching these text-focused LLMs to understand and interact with this rich, diverse information. This is where ‘discrete tokenization’ comes into play, acting as a crucial bridge.

At its core, discrete tokenization transforms continuous, high-dimensional data into compact, distinct units, much like how words are discrete tokens in human language. This process makes non-textual data compatible with the token-based architecture of LLMs, offering both computational efficiency and seamless integration. A central technique for this transformation is Vector Quantization (VQ).

Understanding Vector Quantization (VQ)

Imagine you have a vast palette of continuous colors, but you need to represent them using only a limited set of predefined colors from a color chart. Vector Quantization works similarly. It takes a continuous input (like a segment of an image or audio) and maps it to the closest ‘codeword’ in a finite ‘codebook.’ This codebook is essentially a dictionary of discrete representations. The process typically involves three steps: an encoder that turns raw data into a continuous representation, a quantizer that discretizes this representation using the codebook, and a decoder that reconstructs the data from these discrete tokens.
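To make the quantizer step concrete, here is a minimal sketch in plain NumPy: a toy codebook and a nearest-codeword lookup. The codebook size, dimensionality, and variable names are illustrative choices, not values from the survey.

```python
import numpy as np

# Toy codebook: 8 codewords, each a 4-dimensional vector (sizes are illustrative).
codebook = np.random.randn(8, 4)

def quantize(z):
    """Map each continuous encoder output to its nearest codeword."""
    # z: array of shape (num_vectors, 4), e.g. per-patch encoder outputs.
    # Squared Euclidean distance from every input vector to every codeword.
    distances = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = distances.argmin(axis=1)   # discrete token ids
    quantized = codebook[indices]        # the codewords the decoder will see
    return indices, quantized

# Example: quantize three "encoder outputs".
tokens, z_q = quantize(np.random.randn(3, 4))
print(tokens)  # e.g. [5 0 2] -- discrete tokens an LLM-style model could consume
```

The integer indices are the discrete tokens; the decoder only ever sees the looked-up codewords, which is what makes the representation compact.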

Training these systems involves various methods, including ‘reconstruction-based’ approaches like VQ-VAE, which aim to reconstruct the original input as accurately as possible, and ‘adversarial-based’ methods like VQGAN, which use a competitive training process to create more realistic outputs.

A key hurdle in this process is that the step of picking the ‘closest’ codeword is not differentiable, so standard gradient-based training cannot flow through it directly. Researchers have devised workarounds such as the ‘Straight-Through Estimator’ (STE) and the ‘Gumbel-Softmax’ relaxation to approximate these gradients, allowing the system to learn end to end.
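The following PyTorch sketch shows the straight-through trick together with the auxiliary codebook and commitment losses from the standard VQ-VAE recipe. Tensor shapes and the `commitment_weight` value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def vq_straight_through(z_e, codebook, commitment_weight=0.25):
    """Quantize encoder outputs z_e with a straight-through gradient."""
    # z_e: (batch, dim) continuous encoder outputs; codebook: (K, dim) parameters.
    distances = torch.cdist(z_e, codebook)   # (batch, K) pairwise distances
    indices = distances.argmin(dim=1)        # non-differentiable argmin
    z_q = codebook[indices]                  # nearest codewords

    # Straight-through estimator: the forward pass uses z_q, but gradients
    # flow back to z_e as if quantization were the identity function.
    z_q_ste = z_e + (z_q - z_e).detach()

    # Auxiliary VQ-VAE losses: pull codewords toward encoder outputs, and
    # commit encoder outputs to the codewords they were assigned to.
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    commitment_loss = commitment_weight * F.mse_loss(z_e, z_q.detach())
    return z_q_ste, indices, codebook_loss + commitment_loss
```

The `detach` calls are the whole trick: they decide which side of the quantizer each loss term is allowed to update.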

Diverse Quantization Techniques

Beyond the basic VQ, several sophisticated methods have emerged:

  • Residual Vector Quantization (RVQ): Instead of quantizing the entire input at once, RVQ does it in stages. It quantizes the input, then quantizes the ‘leftover’ error (the residual), and so on, progressively refining the representation (see the sketch after this list).

  • Product Quantization (PQ): This technique breaks down a high-dimensional input into smaller sub-vectors and quantizes each sub-vector independently. This significantly reduces the complexity and memory requirements.

  • Additive Vector Quantization (AQ): Here, the input is represented as a sum of codewords chosen from multiple full-dimensional codebooks, offering another way to build complex representations from simpler parts.

  • Finite Scalar Quantization (FSQ): This is a simpler approach where each dimension of the input is independently rounded to a fixed set of scalar values, creating an implicit codebook.

  • Look-up Free Quantization (LFQ) and Binary Spherical Quantization (BSQ): These methods directly map inputs to binary integer sets or spherical projections, avoiding the need for explicit codebook lookups and enabling very efficient tokenization.

  • Graph Anchor-Relation Tokenization (GART): Specifically designed for graph data, this method tokenizes nodes by relating them to a set of pre-selected ‘anchor’ nodes and their connections, creating compact representations for complex networks.
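As a rough illustration of the residual idea behind RVQ, the NumPy snippet below quantizes in stages, with each stage encoding whatever the previous stage failed to capture. The number of stages and codebook sizes are arbitrary choices for this sketch.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize z in stages: each stage encodes the residual left by the last."""
    residual = z.copy()
    indices, approximation = [], np.zeros_like(z)
    for codebook in codebooks:  # one codebook per stage
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        chosen = codebook[idx]
        approximation += chosen   # running reconstruction
        residual -= chosen        # what is still unexplained
        indices.append(idx)
    return indices, approximation

# Three stages of 16 codewords each over 8-dimensional inputs (illustrative sizes).
stages = [np.random.randn(16, 8) for _ in range(3)]
ids, approx = residual_quantize(np.random.randn(4, 8), stages)
```

Each extra stage adds only a small codebook but sharply reduces reconstruction error, which is why RVQ is popular for neural audio codecs.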

From Traditional Applications to LLM Integration

Before the rise of powerful LLMs, discrete tokenization was already vital for tasks like image compression, audio encoding, and graph representation. These early applications laid the groundwork, demonstrating the power of quantized representations across various data types.

With LLMs, discrete tokenization has found new purpose. For ‘single-modality’ LLMs, it allows models to process non-textual inputs by converting them into tokens that fit the LLM’s vocabulary. For instance, images can be turned into visual tokens for image generation or understanding, and audio signals can become speech tokens for recognition or synthesis. This enables LLMs to perform tasks like recommending items based on user behavior or classifying actions from video sequences.
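One common way to wire this up is to append the codebook's entries to the end of the LLM's text vocabulary, so that visual (or audio) tokens become ordinary token ids. The vocabulary and codebook sizes below are hypothetical, chosen only to show the offsetting logic.

```python
# Hypothetical sizes: a 32,000-token text vocabulary with an 8,192-entry
# image codebook appended after it.
TEXT_VOCAB_SIZE = 32_000
IMAGE_CODEBOOK_SIZE = 8_192

def image_ids_to_llm_tokens(image_codes):
    """Shift VQ indices so they occupy a reserved block of the LLM vocabulary."""
    return [TEXT_VOCAB_SIZE + code for code in image_codes]

def llm_tokens_to_image_ids(tokens):
    """Recover codebook indices from extended-vocabulary token ids."""
    return [t - TEXT_VOCAB_SIZE for t in tokens
            if TEXT_VOCAB_SIZE <= t < TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE]

# A quantizer emits codes [5, 0, 2]; the LLM sees tokens [32005, 32000, 32002].
print(image_ids_to_llm_tokens([5, 0, 2]))
```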

The most exciting frontier is ‘multimodal LLMs,’ where discrete tokenization unifies diverse data streams like text, images, audio, and video into a shared token space. This enables LLMs to handle complex tasks such as generating images from text descriptions, creating speech from text, or even generating video from combined text and audio inputs. The goal is to achieve seamless understanding and generation across all modalities, making LLMs truly general-purpose AI agents.
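A unified token space is often realized by giving each modality its own id range plus boundary markers. The sketch below is a simplified, assumed layout, not the scheme of any particular model from the survey.

```python
# Assumed vocabulary layout: text ids first, then image and audio code ranges,
# plus special boundary tokens. All numbers are illustrative.
OFFSETS = {"text": 0, "image": 32_000, "audio": 40_192}
SPECIALS = {"<img>": 50_000, "</img>": 50_001, "<aud>": 50_002, "</aud>": 50_003}

def build_multimodal_sequence(text_ids, image_codes, audio_codes):
    """Interleave text, image, and audio tokens into one shared id sequence."""
    seq = list(text_ids)
    seq += [SPECIALS["<img>"]] + [OFFSETS["image"] + c for c in image_codes] + [SPECIALS["</img>"]]
    seq += [SPECIALS["<aud>"]] + [OFFSETS["audio"] + c for c in audio_codes] + [SPECIALS["</aud>"]]
    return seq

# Caption tokens followed by quantized image patches and a short audio clip.
print(build_multimodal_sequence([101, 734, 9], [5, 0, 2], [17, 3]))
```

Once everything lives in one id space, a single autoregressive model can, in principle, read or emit any modality by predicting the next token.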


Challenges and Future Directions

Despite its promise, discrete tokenization faces several challenges. One major issue is ‘codebook collapse,’ where only a small portion of the codebook is actively used, limiting the diversity and expressiveness of the tokens. Another is ‘information loss,’ as discretizing continuous data inevitably means some detail is lost, which can impact downstream tasks.
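Codebook collapse is often diagnosed by measuring how evenly the codewords are actually used, for example via usage counts or the perplexity of the token distribution over a batch. This small sketch assumes you already have the assigned indices in hand.

```python
import numpy as np

def codebook_usage_stats(indices, codebook_size):
    """Report how much of the codebook a batch of token indices actually uses."""
    counts = np.bincount(indices, minlength=codebook_size)
    usage_fraction = (counts > 0).mean()   # share of codewords ever selected
    probs = counts / counts.sum()
    entropy = -(probs[probs > 0] * np.log(probs[probs > 0])).sum()
    perplexity = np.exp(entropy)           # effective number of codewords in use
    return usage_fraction, perplexity

# If only a handful of 1,024 codewords ever appear, collapse is likely.
frac, ppl = codebook_usage_stats(np.random.randint(0, 10, size=5000), 1024)
print(f"used {frac:.1%} of the codebook; perplexity = {ppl:.1f}")
```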

The non-differentiable nature of quantization also makes ‘gradient propagation’ tricky during training, leading to potential instability. Furthermore, balancing ‘granularity and semantic alignment’ is crucial – tokens need to be fine enough to capture detail but coarse enough to be efficient, and they should align with meaningful units in the data.

Future research aims to address these issues by developing more robust codebook utilization strategies, minimizing information loss through adaptive coding, and creating more stable gradient approximation methods. A significant direction is the ‘unification of discrete and continuous tokens,’ combining the compactness of discrete tokens with the fine-grained information of continuous embeddings. Researchers are also working on ‘modality and task transferability’ to create general-purpose tokenizers and improving ‘interpretability and controllability’ to make these learned tokens more transparent and manipulable for human users.

Discrete tokenization is a foundational technology that is enabling LLMs to move beyond text and interact with the full richness of the multimodal world. As research continues, it promises to unlock even more powerful and versatile AI systems. For a deeper dive into the technical aspects, you can refer to the full survey paper: Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey.

