QUARK: Accelerating Transformers with Quantization and Circuit Sharing

TLDR: QUARK is a hardware/software co-design framework that accelerates Transformer-based models on FPGA platforms. It reformulates the nonlinear operations (Softmax, GELU, LayerNorm) into integer-only arithmetic, shares circuits between them through time-division multiplexing, and uses a reorder-based group quantization scheme to maintain accuracy at ultra-low-bit precision. QUARK lowers the hardware overhead of nonlinear modules by more than 50% compared to prior approaches and achieves up to a 1.96x end-to-end speedup over GPU implementations, while improving accuracy in low-bit quantization settings for both computer vision and natural language processing tasks.

Transformer-based models have become the cornerstone of advancements in fields like computer vision and natural language processing, delivering impressive performance across a wide array of tasks. However, their widespread adoption, especially in resource-constrained environments, faces a significant hurdle: the computational intensity of their nonlinear operations. These operations, such as Softmax, GELU, and LayerNorm, are crucial for model functionality but often become bottlenecks, particularly when models are optimized for efficiency through quantization.

Traditionally, efforts to accelerate Transformers have largely focused on optimizing linear operations, which were once considered the primary computational challenge. Yet, as models are quantized from high-precision floating-point numbers to lower-bit integers, the relative latency contribution of these nonlinear operations dramatically increases, making them critical bottlenecks. Existing solutions often address individual nonlinear operators or require extensive retraining of the model, which can be prohibitively expensive for large-scale Transformers.

Introducing QUARK: A Novel Approach to Transformer Acceleration

A new research paper, titled “QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations” by Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, and Haibing Guan, introduces QUARK, a framework designed to tackle these challenges. QUARK is a quantization-enabled FPGA acceleration framework that reduces hardware resource requirements and boosts inference speed by identifying and exploiting common patterns within nonlinear operations. Rather than optimizing a single operator, it targets all nonlinear operations in Transformer models and approximates them with a shared circuit design.

Key Innovations of QUARK

QUARK addresses two critical challenges: eliminating the reliance on floating-point operations in nonlinear operators through mathematical reformulation, and reducing quantization errors caused by the diverse activation distributions that arise after these operations. Its core innovations include:

Integer-Only Nonlinear Approximation: QUARK reformulates complex exponential, logarithmic, and division operations into simpler, low-cost shift-and-add arithmetic. This eliminates the need for expensive floating-point calculations or large lookup tables, making hardware implementation much more efficient.
Sub-Operator Sharing with Time-Division Multiplexing: By recognizing common sub-operators (such as exponent and logarithm) across Softmax, GELU, and LayerNorm, QUARK unifies them into a single, reusable hardware block, which reduces hardware resource usage and power consumption. The framework uses time-division multiplexing to reuse these hardware units sequentially across different stages of the Transformer pipeline, so the sharing does not compromise performance.
Reorder-Based Group Quantization: Nonlinear operations often result in highly non-uniform activation distributions, which can lead to significant accuracy loss if not handled carefully during quantization. QUARK proposes a novel group quantization mechanism that leverages offline reordering and scaled integer alignment. This method clusters channels based on distribution similarity during an offline calibration phase, adapting effectively to the unique characteristics of each layer and preventing accuracy degradation, even under ultra-low-bit quantization.
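
The reorder-then-group idea in the last item can be illustrated with a small sketch. This is not the paper's algorithm: the function names are hypothetical, the per-channel maximum magnitude is an assumed stand-in for "distribution similarity", and the scaled integer alignment step is not reproduced. It only shows why sorting channels offline and giving each group its own scale helps when channel ranges differ widely.

```python
def calibrate_groups(calib, num_groups):
    """Offline step: reorder channels by range and split them into groups.

    calib is a list of samples; each sample is a list of per-channel values.
    """
    num_channels = len(calib[0])
    # Largest magnitude seen per channel (assumed similarity measure).
    ch_max = [max(abs(sample[c]) for sample in calib) for c in range(num_channels)]
    # Reorder so channels with similar ranges sit next to each other.
    order = sorted(range(num_channels), key=lambda c: ch_max[c])
    size = -(-num_channels // num_groups)  # ceiling division
    groups = [order[i:i + size] for i in range(0, num_channels, size)]
    return groups, ch_max


def fake_quantize(sample, groups, ch_max, bits=4):
    """Quantize each group with its own scale, then dequantize for comparison."""
    qmax = 2 ** (bits - 1) - 1
    out = [0.0] * len(sample)
    for group in groups:
        group_max = max(ch_max[c] for c in group)
        scale = group_max / qmax if group_max > 0 else 1.0
        for c in group:
            # Round to the nearest level and clamp to the signed range.
            q = max(-qmax - 1, min(qmax, round(sample[c] / scale)))
            out[c] = q * scale
    return out
```

With one group, a channel whose values stay near 0.1 shares a scale with a channel that reaches 10 and collapses to zero at 4 bits; with two groups, the small-range channels get their own, much finer scale.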

How QUARK Optimizes Nonlinear Operations

For Softmax, QUARK uses an improved log-sum-exp algorithm that converts computations from base-e to base-2, so the hardware needs only shifters and adders. GELU is reformulated as a combination of Softmax and ReLU, which lets it reuse the Softmax hardware structure. For LayerNorm, QUARK simplifies the variance calculation to reduce data access, approximates the square root with an iterative method, and moves the division into the logarithmic domain, where it becomes a subtraction, to avoid complex operations.
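
To make the base-2 log-sum-exp idea concrete, here is a minimal integer-only sketch. It is an illustration under stated assumptions, not QUARK's actual datapath: the 12-bit fixed-point format, the constant log2(e) ≈ 1 + 1/2 - 1/16, and the single-segment linear approximations of 2^(-u) and log2(1 + m) are all choices made for this example. The helper names are hypothetical, and `softmax_ref` is included only as a floating-point reference.

```python
import math

F = 12          # fractional bits of the fixed-point format (assumed)
ONE = 1 << F    # fixed-point representation of 1.0


def exp2_neg(n):
    """Approximate 2**(-n / ONE) in Q.F for a non-negative fixed-point n."""
    q, r = n >> F, n & (ONE - 1)    # integer and fractional parts of the exponent
    mant = ONE - (r >> 1)           # 2**(-u) ~ 1 - u/2 for u in [0, 1)
    return mant >> q                # multiplying by 2**(-q) is a right shift


def log2_fixed(s):
    """Approximate log2(s / ONE) in Q.F for s >= ONE."""
    k = s.bit_length() - 1          # position of the leading one
    mant = (s >> (k - F)) - ONE     # s / 2**k = 1 + mant / ONE
    return ((k - F) << F) + mant    # log2(1 + m) ~ m for m in [0, 1)


def softmax_int(x_fixed):
    """Softmax over Q.F integers using only shifts, adds, and comparisons."""
    m = max(x_fixed)
    # (max - x) * log2(e), with log2(e) ~ 1 + 1/2 - 1/16 = 1.4375
    ts = [d + (d >> 1) - (d >> 4) for d in (m - x for x in x_fixed)]
    s = sum(exp2_neg(t) for t in ts)    # >= ONE, since the max term is exactly 1
    log_s = log2_fixed(s)
    # 2**(-t) / s == 2**(-(t + log2 s)): the division becomes an addition
    return [exp2_neg(t + log_s) for t in ts]


def softmax_ref(xs):
    """Floating-point reference, for comparison only."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]
```

The same `exp2_neg` unit serves both the summation and the final normalization, which is the kind of reuse that circuit sharing relies on. With these coarse one-segment approximations the outputs can deviate from the reference by several percent and do not sum exactly to one; a finer piecewise approximation would tighten the error at the cost of a little extra logic.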

Performance and Hardware Efficiency

Evaluations show that QUARK significantly reduces the computational overhead of nonlinear operators in mainstream Transformer architectures, achieving up to a 1.96x end-to-end speedup over GPU implementations. It also lowers the hardware overhead of nonlinear modules by more than 50% compared to prior approaches while maintaining high model accuracy. Under ultra-low-bit quantization settings (e.g., 4-bit weights and activations), it improves accuracy over prior methods.

On image classification tasks using ImageNet, QUARK achieves minimal accuracy drop and sets a new state-of-the-art for ultra-low-bit quantization, with a 6.08% accuracy gain on ViT-B and over 3% average improvement across various networks at 4-bit precision. For language understanding tasks on the GLUE benchmark, QUARK shows significant improvements, outperforming prior methods at both 6-bit and 4-bit quantization, demonstrating its robustness across different AI domains.

The hardware evaluation on a ZCU102 FPGA board shows that QUARK achieves a peak throughput of 787.5 GOP/s with minimal hardware overhead, outperforming existing solutions. It delivers substantial speedups for individual nonlinear operators (e.g., 41.8x-46.7x for LayerNorm) compared to GPU implementations. This is largely due to its hardware-friendly approximations and optimized pipelined datapath designs.

Conclusion

QUARK represents a significant step forward in accelerating Transformer-based models on FPGA platforms. By combining lightweight integer-only arithmetic modules with a reorder-based group quantization scheme, it offers a practical software/hardware co-design solution. This work paves the way for more efficient and accessible deployment of advanced AI models in applications ranging from computer vision to natural language processing. For more details, refer to the full research paper.
