TLDR: QUASAR is a new agentic reinforcement learning framework that significantly improves how large language models (LLMs) generate and optimize quantum circuits, specifically in OpenQASM 3.0. It uses external quantum simulators for verification and a hierarchical reward system to teach LLMs quantum-specific knowledge. QUASAR outperforms industrial LLMs like GPT-4o and GPT-5 in both syntactic correctness and semantic performance, making it a powerful tool for automated quantum algorithm design.
The world of quantum computing is advancing rapidly, but designing and optimizing the instruction sequences these machines execute, known as quantum circuits, remains a significant challenge. While large language models (LLMs) have shown promise in generating these circuits automatically, they often struggle with the precise numerical parameters required for good performance and lack deep quantum domain knowledge, leading to errors or low-quality outputs.
Addressing these critical issues, researchers have introduced a groundbreaking framework called QUASAR (Quantum Assembly Code Generation Using Tool-Augmented LLMs via Agentic RL). This innovative system leverages agentic reinforcement learning (RL) and tool-augmented LLMs to significantly improve the generation and optimization of quantum circuits, particularly in the OpenQASM 3.0 language.
Bridging the Gap Between LLMs and Quantum Mechanics
QUASAR tackles two fundamental problems. First, quantum gates are often parameterized by exact numerical values (such as rotation angles), which general-purpose LLMs find difficult to produce accurately; the right values depend on factors like the number of gates, their parameter settings, and the circuit’s structure. Second, LLMs frequently produce incorrect or suboptimal quantum circuits because of their limited grasp of quantum-specific rules and semantics.
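To make the parameter problem concrete, here is a minimal illustrative OpenQASM 3.0 snippet (not taken from the paper): the rotation gate takes an explicit numerical angle, and even a small deviation from the intended value changes the circuit's output distribution.

```qasm
OPENQASM 3.0;
include "stdgates.inc";

qubit[1] q;
bit[1] c;

// The angle is an exact numerical parameter: rx(pi/2) produces an equal
// superposition of |0> and |1>, while a slightly-off value like rx(1.5)
// biases the measurement outcomes.
rx(pi / 2) q[0];
c[0] = measure q[0];
```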
The core of QUASAR’s design lies in two key innovations:
- Quantum Circuit Verification: It incorporates an external quantum simulator that acts as a verification tool. This allows the LLM to interact directly with quantum environments, receiving real-time feedback on the correctness and performance of the generated circuits.
- Hierarchical Reward Mechanism: A four-level reward system guides the LLM’s learning: (1) basic syntactic correctness; (2) distributional alignment, i.e., how closely the generated circuit’s output distribution matches the ideal one; (3) the circuit’s performance against a problem-specific cost function; and (4) how efficiently the circuit can be further refined by a local optimizer. This layered feedback teaches the LLM to produce code that is not just syntactically valid but also semantically meaningful and optimizable.
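The four levels above can be sketched as a single scalar reward. The following is a minimal, hedged illustration of that structure; the weights, function names, and exact shaping are assumptions for clarity, not QUASAR's actual formulation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D(p || q) between two measurement-outcome
    distributions, given as {bitstring: probability} dicts."""
    return sum(pi * math.log((pi + eps) / (q.get(k, 0.0) + eps))
               for k, pi in p.items() if pi > 0)

def hierarchical_reward(is_valid, gen_dist, target_dist,
                        expval, target_expval, opt_steps, max_steps=100):
    """Toy four-level reward: syntax gate, then three additive terms."""
    # Level 1: syntactic validity gates everything else.
    if not is_valid:
        return 0.0
    reward = 1.0
    # Level 2: distributional alignment (lower KL -> reward closer to 1).
    reward += math.exp(-kl_divergence(target_dist, gen_dist))
    # Level 3: closeness of the expectation value to the target cost.
    reward += math.exp(-abs(expval - target_expval))
    # Level 4: fewer local-optimizer steps -> a more optimizable circuit.
    reward += 1.0 - min(opt_steps, max_steps) / max_steps
    return reward
```

In this sketch an invalid circuit scores zero regardless of its other properties, mirroring how syntactic correctness acts as the first gate in the hierarchy.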
Unprecedented Performance
The evaluation of QUASAR, augmenting a 4-billion-parameter Qwen3 LLM, demonstrated remarkable improvements. It achieved an impressive 99.31% validity in Pass@1 (meaning 99.31% of single generated circuits were syntactically correct) and a perfect 100% in Pass@10 (at least one correct circuit out of ten attempts). These results significantly outperform leading industrial LLMs such as GPT-4o, GPT-5, and DeepSeek-V3, as well as other supervised fine-tuning (SFT) and RL-only approaches.
Beyond just syntax, QUASAR also showed substantial gains in semantic performance, including a 12.95% improvement in the successful rate of expectation value (SREV) and an 8.87% reduction in relative entropy (RE), indicating that its generated circuits are much closer to the desired quantum outcomes. It also proved effective in generating practical ansatz patterns and initial parameter configurations for complex quantum optimization problems like Quantum Approximate Optimization Algorithm (QAOA) and Variational Quantum Eigensolver (VQE).
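As a concrete picture of the kind of output involved, here is a hypothetical single-layer QAOA ansatz in OpenQASM 3.0 for a three-qubit MaxCut instance on edges (0,1) and (1,2). The graph, layer count, and parameter names are illustrative; gamma and beta are exactly the sort of initial parameters a generator must propose well.

```qasm
OPENQASM 3.0;
include "stdgates.inc";

// Initial parameters the circuit generator must choose.
input float[64] gamma;
input float[64] beta;

qubit[3] q;
bit[3] c;

// Uniform superposition over all bitstrings.
h q[0]; h q[1]; h q[2];

// Cost layer: ZZ interactions on the graph edges (0-1) and (1-2).
cx q[0], q[1]; rz(2 * gamma) q[1]; cx q[0], q[1];
cx q[1], q[2]; rz(2 * gamma) q[2]; cx q[1], q[2];

// Mixer layer: single-qubit X rotations.
rx(2 * beta) q[0]; rx(2 * beta) q[1]; rx(2 * beta) q[2];

c = measure q;
```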
The Impact of Each Reward Component
An ablation study revealed the critical role of each part of QUASAR’s hierarchical reward system. Distributional alignment, which measures how well the generated circuit’s output matches the ground truth, was the primary driver of all performance metrics. The expectation value reward helped safeguard performance on difficult cases, while the optimization progress reward provided incremental gains by favoring circuits that required fewer steps to optimize. A qubit-mismatch penalty was also crucial for maintaining stability and preventing errors caused by incorrect qubit counts.
In essence, QUASAR represents a significant leap forward in automated quantum algorithm design. By effectively combining general-purpose LLMs with domain-specific quantum knowledge through agentic reinforcement learning and a carefully crafted reward system, it paves the way for more scalable and efficient development of quantum software. For more in-depth technical details, you can refer to the full research paper.


