TL;DR: WGrammar is a new decoding engine that significantly speeds up structured output generation from LLMs (e.g., HTML, JSON). It leverages prior knowledge to decompose output constraints into static (precompiled) and dynamic (runtime) components, replaces complex pushdown automata with efficient finite state machine (FSM) based operators, and caches token masks globally. The result is up to a 250x faster time to first token and up to a 2.33x faster time per output token compared to existing methods.
Large Language Models (LLMs) are becoming increasingly versatile, handling complex tasks like code generation and function calling. However, for these applications, LLMs often need to produce outputs in very specific, structured formats, such as HTML or JSON. This process, known as structured decoding, ensures that the LLM’s output can be reliably consumed by other systems or components in a larger workflow.
While crucial, existing methods for structured decoding face significant efficiency challenges. These bottlenecks arise from three main stages: grammar compilation (interpreting the rules for the output format), state tracking (monitoring the generation process token by token), and mask creation (generating a list of allowed tokens at each step). Current systems, often relying on complex pushdown automata (PDAs), can introduce substantial delays, sometimes adding thousands of milliseconds to the time it takes to generate the first token.
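To make the mask-creation stage concrete, here is a minimal, illustrative Python sketch (not code from any of the systems mentioned): at every decoding step, the engine must decide for each vocabulary token whether it is currently legal under the grammar, and then apply that mask to the model's logits. Doing this naively over a large vocabulary at every step is where much of the per-token overhead comes from.

```python
# Illustrative sketch only: the per-step cost of structured decoding comes from
# computing, for every vocabulary token, whether it keeps the output valid.
import numpy as np

def build_token_mask(vocab: list[str], is_allowed) -> np.ndarray:
    """Boolean mask over the vocabulary for the current grammar state."""
    return np.array([is_allowed(tok) for tok in vocab], dtype=bool)

def apply_mask(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Disallowed tokens get -inf so sampling/argmax can never pick them."""
    masked = logits.copy()
    masked[~mask] = -np.inf
    return masked

# Toy example: suppose only digit tokens are legal at this step.
vocab = ["<", ">", "7", "42", "div", "p"]
mask = build_token_mask(vocab, is_allowed=str.isdigit)
logits = np.random.randn(len(vocab))
next_token = vocab[int(np.argmax(apply_mask(logits, mask)))]
print(next_token)  # one of "7" or "42"
```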
Introducing WGrammar: A New Approach to Efficient Structured Decoding
A new research paper, “WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding,” introduces WGrammar, a lightweight decoding engine designed to overcome these efficiency issues. The core insight behind WGrammar is that many real-world structured generation tasks embed strong “prior knowledge” about the output format. This knowledge, often overlooked by previous methods, can be strategically leveraged to accelerate the decoding process.
WGrammar proposes a novel decomposition of constraints into two types: static and dynamic. Static structures, which are predictable and fixed (like the overall HTML tag structure in an outline), can be precompiled offline. Dynamic arguments, such as specific section IDs or content, are then instantiated at runtime using smaller, pre-defined grammar snippets. This approach avoids the need to build the entire complex structure from scratch for every request.
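As a rough illustration of this static/dynamic split, the sketch below uses ordinary Python regular expressions rather than WGrammar's own grammar snippets, and the function names (`compile_offline`, `instantiate_online`) are hypothetical. The fixed skeleton is prepared once; only the per-request values are bound at runtime.

```python
# Minimal sketch of the static/dynamic decomposition, assuming hypothetical helpers.
import re

# Static part: a fixed HTML-outline skeleton with named slots for dynamic arguments.
STATIC_TEMPLATE = (
    r"<h1>{title}</h1>"
    r"(<h2 id='{id_pattern}'>{content_pattern}</h2>)+"
)

def compile_offline(template: str):
    """Offline step: parse the fixed skeleton once and keep it reusable."""
    return template  # stand-in for a precompiled structure

def instantiate_online(compiled, **dynamic_args) -> re.Pattern:
    """Online step: bind per-request arguments into the precompiled skeleton."""
    return re.compile(compiled.format(**dynamic_args))

compiled = compile_offline(STATIC_TEMPLATE)   # done once, offline
constraint = instantiate_online(              # done per request, cheaply
    compiled,
    title="Quarterly Report",
    id_pattern=r"sec-\d+",
    content_pattern=r"[^<]+",
)
print(bool(constraint.fullmatch(
    "<h1>Quarterly Report</h1><h2 id='sec-1'>Intro</h2>"
)))  # True
```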
Instead of relying on computationally intensive pushdown automata, WGrammar employs a compositional set of simpler operators to model regular formats. This shift significantly reduces transition latency during the decoding process. The system integrates several key innovations: domain-aware simplification, constraint decomposition, and mask caching.
How WGrammar Works: Collaborative Parsing and Efficient Operators
WGrammar operates through three main components: a backend parser, a frontend parser, and a state tracking and mask generation module. Users first define a “structure template” which is processed by the backend parser offline. This creates a “structure factory” – essentially a collection of reusable, parameterized structural building blocks.
When an online request arrives, the frontend parser combines the request’s specific arguments with these precompiled building blocks from the structure factory to quickly construct a tailored state machine. This collaborative parsing approach means that the most complex parts of the grammar are handled once, offline, drastically reducing the runtime overhead. For instance, while traditional methods might re-parse a full context-free grammar (CFG) for each request, WGrammar uses an optimized linear-time parser for online processing, leading to significant speedups.
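The snippet below sketches that division of labor. The names (`StructureFactory`, `register`, `assemble`) are invented for illustration and are not the paper's API; the point is that the expensive grammar work happens once in the factory, while serving a request is a single linear pass that binds arguments to precompiled blocks.

```python
# Rough sketch of collaborative parsing with hypothetical names.
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    """A precompiled, parameterized structural building block."""
    name: str
    render: callable  # args -> literal segments the output must contain

class StructureFactory:
    def __init__(self):
        self._blocks: dict[str, Block] = {}

    def register(self, name: str, render):
        """Offline, once per template: store a reusable building block."""
        self._blocks[name] = Block(name, render)

    def assemble(self, plan: list[tuple[str, dict]]) -> list[str]:
        """Online, per request: O(len(plan)) lookup-and-bind, no grammar re-parsing."""
        segments: list[str] = []
        for block_name, args in plan:
            segments.extend(self._blocks[block_name].render(**args))
        return segments

factory = StructureFactory()
factory.register("heading", lambda level, text: [f"<h{level}>{text}</h{level}>"])
factory.register("list_item", lambda text: [f"<li>{text}</li>"])

# Per-request: arguments arrive with the prompt and are bound immediately.
skeleton = factory.assemble([
    ("heading", {"level": 1, "text": "Results"}),
    ("list_item", {"text": "First finding"}),
])
print(skeleton)
```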
The system translates structural constraints and regular expressions into a sequence of efficient, context-free operators like “Wait” (for conditional checkpoints) and “Write” (for outputting specific token sequences). More complex patterns are handled by compositional operators such as “IfElse” and “DoWhile.” Because these operators are context-free, WGrammar can also implement a global caching mechanism for mask creation, further amortizing costs across multiple requests and decoding steps.
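The sketch below reduces this to the bare minimum, with invented classes and simplified semantics (only "Write" and "Wait" are shown; "IfElse" and "DoWhile" compose such pieces in the actual system). The property that enables global caching is that an operator's allowed-token set depends only on the operator and its internal position, not on the surrounding context, so identical states can share one precomputed mask across requests and steps.

```python
# Simplified, hypothetical sketch of operators plus a global mask cache.
from functools import lru_cache

class Write:
    """Force a fixed token sequence (e.g. a closing tag) to be emitted verbatim."""
    def __init__(self, tokens: tuple[int, ...]):
        self.tokens = tokens

    def allowed(self, position: int) -> frozenset:
        # Exactly one token is legal at each position of the forced sequence.
        return frozenset({self.tokens[position]})

class Wait:
    """Let the model generate freely until a stop token appears."""
    def __init__(self, stop_tokens: frozenset, vocab_size: int):
        self.stop_tokens = stop_tokens
        self.vocab_size = vocab_size

    def allowed(self, position: int) -> frozenset:
        # Any token may appear; the state tracker watches for stop_tokens to advance.
        return frozenset(range(self.vocab_size))

@lru_cache(maxsize=None)
def cached_mask(op, position: int, vocab_size: int) -> tuple:
    """Global cache: repeated (operator, position) pairs reuse one mask."""
    allowed = op.allowed(position)
    return tuple(tok in allowed for tok in range(vocab_size))

vocab_size = 8
close_tag = Write(tokens=(3, 4))                                  # e.g. ids for "</" and "h2>"
free_text = Wait(stop_tokens=frozenset({3}), vocab_size=vocab_size)

print(cached_mask(close_tag, 0, vocab_size))  # only token 3 is allowed
print(cached_mask(free_text, 0, vocab_size))  # every token is allowed
```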
Impressive Performance Gains
Experiments demonstrate WGrammar's efficiency. Compared to state-of-the-art structured decoding systems such as XGrammar and Outlines, WGrammar achieves up to a 250x speedup in "Time to First Token" (TTFT) and up to a 2.33x speedup in "Time per Output Token" (TPOT). TTFT measures the initial setup cost, while TPOT reflects ongoing decoding efficiency.
These improvements are consistent across various datasets, including Outline-Generation (complex HTML-like output), Reference-Lookup (JSON output), and JSON-mode-eval (conforming to a JSON schema). The breakdown of execution overhead reveals that WGrammar excels in all stages: grammar compilation, state tracking, and mask creation, largely due to its collaborative parsing, efficient operators, and global caching.
While WGrammar offers significant advancements, the authors note some limitations. Optimal performance often requires users to manually design offline templates, which might be challenging for those unfamiliar with the backend domain-specific languages. However, WGrammar does support using regular expressions as an alternative for defining structural constraints. Additionally, WGrammar uses non-greedy matching for regular expressions, which differs from the more common greedy matching, requiring users to be aware of this semantic difference.
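For readers unfamiliar with that distinction, the following plain Python `re` example (not WGrammar code) shows how non-greedy quantifiers behave: a non-greedy match stops at the first point where the rest of the pattern can succeed, while a greedy match consumes as much as possible.

```python
# Greedy vs. non-greedy quantifiers, illustrated with the standard re module.
import re

text = "<b>bold</b> and <b>more</b>"
greedy = re.findall(r"<b>.*</b>", text)      # ['<b>bold</b> and <b>more</b>']
non_greedy = re.findall(r"<b>.*?</b>", text) # ['<b>bold</b>', '<b>more</b>']
print(greedy, non_greedy)
```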
WGrammar represents a significant step forward in making structured generation with LLMs more practical and efficient for real-world applications. Its source code is publicly available for further exploration and use. You can find the full research paper here.