QUARK: Accelerating Transformers with Quantization and Circuit Sharing

TLDR: QUARK is a novel hardware/software co-design framework that accelerates Transformer-based models on FPGA platforms. It achieves this by reformulating nonlinear operations (Softmax, GELU, LayerNorm) into integer-only arithmetic, enabling efficient circuit sharing through time-division multiplexing, and employing a reorder-based group quantization scheme to maintain accuracy with ultra-low-bit precision. QUARK significantly reduces hardware overhead and achieves up to 1.96x end-to-end speedup over GPU implementations, while improving accuracy in low-bit quantization settings for both computer vision and natural language processing tasks.

Transformer-based models have become the cornerstone of advancements in fields like computer vision and natural language processing, delivering impressive performance across a wide array of tasks. However, their widespread adoption, especially in resource-constrained environments, faces a significant hurdle: the computational intensity of their nonlinear operations. These operations, such as Softmax, GELU, and LayerNorm, are crucial for model functionality but often become bottlenecks, particularly when models are optimized for efficiency through quantization.

Traditionally, efforts to accelerate Transformers have largely focused on optimizing linear operations, which were once considered the primary computational challenge. Yet, as models are quantized from high-precision floating-point numbers to lower-bit integers, the relative latency contribution of these nonlinear operations dramatically increases, making them critical bottlenecks. Existing solutions often address individual nonlinear operators or require extensive retraining of the model, which can be prohibitively expensive for large-scale Transformers.

Introducing QUARK: A Novel Approach to Transformer Acceleration

A new research paper, “QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations” by Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, and Haibing Guan, introduces QUARK, a quantization-enabled FPGA acceleration framework designed to tackle these challenges. By identifying and exploiting common patterns within nonlinear operations, QUARK reduces hardware resource requirements and boosts inference speed. The approach targets all nonlinear operations in Transformer models, approximating them through a shared-circuit design.

Key Innovations of QUARK

QUARK addresses two challenges: the reliance of nonlinear operators on floating-point operations, which it eliminates through mathematical reformulation, and the quantization errors caused by the diverse activation distributions that arise after these operations. Its core innovations include:

  • Integer-Only Nonlinear Approximation: QUARK reformulates complex exponential, logarithmic, and division operations into simpler, low-cost shift-and-add arithmetic. This eliminates the need for expensive floating-point calculations or large lookup tables, making hardware implementation much more efficient.
  • Sub-Operator Sharing with Time-Division Multiplexing: By recognizing common sub-operators (like exponent and logarithm) across Softmax, GELU, and LayerNorm, QUARK unifies them into a single, reusable hardware block. This significantly reduces the hardware resources and power consumption. The framework uses time-division multiplexing to sequentially reuse these hardware units across different stages of the Transformer pipeline, ensuring efficiency without compromising performance.
  • Reorder-Based Group Quantization: Nonlinear operations often result in highly non-uniform activation distributions, which can lead to significant accuracy loss if not handled carefully during quantization. QUARK proposes a novel group quantization mechanism that leverages offline reordering and scaled integer alignment. This method clusters channels based on distribution similarity during an offline calibration phase, adapting effectively to the unique characteristics of each layer and preventing accuracy degradation, even under ultra-low-bit quantization.
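The reorder-then-group idea from the last bullet can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the similarity measure (per-channel maximum absolute value), the equal-size groups, and the names `reorder_group_quantize` and `dequantize` are all assumptions made for illustration, and the "scaled integer alignment" step is not modeled.

```python
import numpy as np

def reorder_group_quantize(acts, n_groups=4, n_bits=4):
    """Quantize a (tokens, channels) activation array with one scale per channel group.

    Assumption: channels are ranked by max |value| as a stand-in for
    "distribution similarity"; the paper's actual criterion may differ.
    """
    # Offline calibration step: rank channels so similar ranges end up adjacent.
    ch_range = np.abs(acts).max(axis=0)
    order = np.argsort(ch_range)
    groups = np.array_split(order, n_groups)

    qmax = 2 ** (n_bits - 1) - 1          # e.g. 7 for signed 4-bit
    q = np.zeros_like(acts, dtype=np.int8)
    scales = np.zeros(acts.shape[1])
    for idx in groups:
        if idx.size == 0:                 # more groups than channels
            continue
        # One symmetric scale shared by every channel in the group.
        scale = max(ch_range[idx].max(), 1e-8) / qmax
        q[:, idx] = np.clip(np.round(acts[:, idx] / scale), -qmax - 1, qmax)
        scales[idx] = scale
    return q, scales, order

def dequantize(q, scales):
    """Map integer codes back to floats using the per-channel scale table."""
    return q.astype(np.float64) * scales
```

Calling `reorder_group_quantize(acts, n_groups=1)` reduces to ordinary per-tensor quantization, which makes it easy to compare reconstruction error with and without grouping on activations whose channel ranges differ widely.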

How QUARK Optimizes Nonlinear Operations

For Softmax, QUARK uses an improved log-sum-exp algorithm that converts computations from base-e to base-2, allowing for hardware implementation with only shifters and adders. GELU’s computation is reformulated as a combination of Softmax and ReLU, enabling the reuse of the Softmax hardware structure. For LayerNorm, QUARK simplifies the variance calculation to reduce data access, approximates the square root with an iterative method, and moves division into the logarithmic domain to avoid costly operations.
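To make the base-2 idea concrete, here is a minimal integer-only Python sketch of a Softmax built from shifts, adds, and subtracts. It is an illustration under stated assumptions, not QUARK's actual datapath: the fixed-point format (`FRAC` bits), the constant 1.4375 standing in for log2(e), and the linear approximations of 2^(-f) and log2(1+m) are all choices made here, as are the function names.

```python
FRAC = 10            # fractional bits of the fixed-point format (assumed)
ONE = 1 << FRAC      # fixed-point representation of 1.0

def to_fixed(x):
    """Convert a float to the assumed fixed-point integer format."""
    return int(round(x * ONE))

def exp2_neg_fixed(t):
    """Approximate 2**(-t / ONE) for an integer t >= 0, in fixed point."""
    k = t >> FRAC               # integer part becomes a right shift
    f = t & (ONE - 1)           # fractional part in [0, ONE)
    mant = ONE - (f >> 1)       # 2**(-f) ~ 1 - f/2: one shift, one subtract
    return mant >> k

def log2_fixed(total):
    """Approximate log2(total / ONE) for total >= ONE, in fixed point."""
    n = total.bit_length() - 1              # position of the leading one
    mant = ((total << FRAC) >> n) - ONE     # log2(1 + m) ~ m
    return ((n - FRAC) << FRAC) + mant

def softmax_shift_add(x_q):
    """Softmax over fixed-point logits using only shifts, adds, and subtracts."""
    m = max(x_q)
    ts = []
    for x in x_q:
        d = m - x                            # non-negative distance to the max
        ts.append(d + (d >> 1) - (d >> 4))   # d * log2(e) ~ d * 1.4375
    total = sum(exp2_neg_fixed(t) for t in ts)
    log_sum = log2_fixed(total)
    # Division becomes a subtraction of exponents: 2**(-(t + log2(sum))).
    return [exp2_neg_fixed(t + log_sum) for t in ts]
```

For example, `softmax_shift_add([to_fixed(v) for v in (1.0, 2.0, 3.0)])` returns fixed-point probabilities that preserve the ordering of the logits. Because both linear approximations are coarse, the outputs do not sum exactly to 1.0; a practical design would likely add a finer correction term.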

Performance and Hardware Efficiency

Evaluations show that QUARK significantly reduces the computational overhead of nonlinear operators in mainstream Transformer architectures, achieving up to a 1.96x end-to-end speedup over GPU implementations. QUARK also lowers the hardware overhead of nonlinear modules by more than 50% compared to prior approaches while maintaining high model accuracy. Under ultra-low-bit quantization settings (e.g., 4-bit weights and activations), it improves accuracy substantially over prior methods.

On image classification tasks using ImageNet, QUARK achieves minimal accuracy drop and sets a new state-of-the-art for ultra-low-bit quantization, with a 6.08% accuracy gain on ViT-B and over 3% average improvement across various networks at 4-bit precision. For language understanding tasks on the GLUE benchmark, QUARK shows significant improvements, outperforming prior methods at both 6-bit and 4-bit quantization, demonstrating its robustness across different AI domains.

The hardware evaluation on a ZCU102 FPGA board shows that QUARK achieves a peak throughput of 787.5 GOP/s with minimal hardware overhead, outperforming existing solutions. It delivers substantial speedups for individual nonlinear operators (e.g., 41.8x-46.7x for LayerNorm) compared to GPU implementations. This is largely due to its hardware-friendly approximations and optimized pipelined datapath designs.

Conclusion

QUARK represents a significant step forward in accelerating Transformer-based models on FPGA platforms. By integrating lightweight integer-only arithmetic modules with a reorder-based group quantization scheme, it offers a practical hardware/software co-design solution. This work paves the way for more efficient and accessible deployment of advanced AI models in applications ranging from computer vision to natural language processing. For more details, refer to the full research paper.

Meera Iyer
