TLDR: ENSI is a new framework that enables efficient and non-interactive secure inference for Large Language Models (LLMs) using homomorphic encryption. It addresses the challenges of privacy-preserving AI by co-designing cryptographic protocols with LLM architecture, specifically integrating the CKKS scheme with BitNet. Key innovations include optimized, multiplication-free matrix multiplications, a retraining-free sigmoid attention mechanism to replace complex softmax, and embedding the costly bootstrapping operation within RMSNorm to drastically reduce its frequency. Experimental results show significant speedups for core operations and a substantial reduction in bootstrapping overhead, while maintaining high accuracy comparable to plaintext inference.
Large Language Models (LLMs) like LLaMA and GPT have transformed artificial intelligence, offering personalized responses through services where users access powerful models via cloud APIs. However, this convenience comes with a significant privacy challenge: LLMs often process sensitive user data, and without robust security, this information could be inadvertently exposed.
This is where secure inference comes in: a cryptographic approach that lets a server compute directly on encrypted user data, so the plaintext is never exposed. Two main techniques are used for this: Secure Multi-Party Computation (SMPC) and Homomorphic Encryption (HE). SMPC requires multiple rounds of communication between parties, whereas HE supports non-interactive computation, making it well suited to distributed environments while offering strong privacy protection.
Despite its promise, applying Homomorphic Encryption to secure inference for large language models has been incredibly difficult. LLMs demand vast computational resources for high-dimensional matrix multiplications and complex self-attention mechanisms. Furthermore, the sophisticated activation functions commonly used in LLMs are notoriously hard to implement efficiently in HE environments. Traditional encoding strategies also add overhead, and the most time-consuming operation, ‘Bootstrapping’ (which refreshes encrypted data to prevent noise accumulation), occurs more frequently as models grow larger, creating significant bottlenecks.
Introducing ENSI: A Co-Designed Solution
A new research paper, “ENSI: Efficient Non-Interactive Secure Inference for Large Language Models”, introduces a novel framework called ENSI. This framework tackles these challenges by co-designing cryptographic protocols with the LLM architecture itself. ENSI integrates the RNS-CKKS homomorphic encryption scheme with BitNet, a lightweight LLM variant, to significantly reduce the computational complexity of encrypted operations.
Key Innovations for Efficient Secure Inference
ENSI brings several crucial innovations to make secure LLM inference practical:
Optimized Encoding and Matrix Multiplications: The framework uses an optimized encoding strategy that works seamlessly with BitNet. For ‘Plaintext-Ciphertext Matrix Multiplication’ (PCMM), where model weights are plaintext and user data is ciphertext, ENSI leverages BitNet’s ternary weights (values of -1, 0, or 1) to eliminate explicit multiplication operations, replacing them with much faster additions and subtractions. This results in an approximate 5.8 to 8 times speedup compared to state-of-the-art methods. For ‘Ciphertext-Ciphertext Matrix Multiplication’ (CCMM), which is crucial for attention mechanisms, ENSI introduces an innovative element extraction mechanism inspired by the ‘baby-step giant-step’ algorithm, drastically reducing the number of costly rotation operations.
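To make the PCMM idea concrete, here is a minimal plaintext sketch of a multiplication-free matrix product with ternary weights: for each output column, inputs with weight +1 are added and inputs with weight -1 are subtracted, while zero weights are skipped. The function name and NumPy formulation are illustrative, not ENSI's actual encrypted implementation, which operates on CKKS ciphertext slots.

```python
import numpy as np

def ternary_matmul(x, w_ternary):
    """Compute x @ w_ternary without scalar multiplications.

    w_ternary has entries in {-1, 0, 1} (BitNet-style weights), so each
    output column is just a sum of +1-weighted input columns minus a sum
    of -1-weighted ones. This mirrors how ENSI-style PCMM can replace
    homomorphic multiplications with cheaper additions/subtractions.
    """
    n, d_out = x.shape[0], w_ternary.shape[1]
    out = np.zeros((n, d_out))
    for j in range(d_out):
        plus = w_ternary[:, j] == 1    # inputs to add
        minus = w_ternary[:, j] == -1  # inputs to subtract
        # additions and subtractions only -- no multiplications
        out[:, j] = x[:, plus].sum(axis=1) - x[:, minus].sum(axis=1)
    return out
```

On plaintext the savings are modest, but under HE each avoided ciphertext multiplication also avoids its noise growth and rescaling cost, which is where the reported 5.8 to 8 times speedup comes from.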
Retraining-Free Secure Softmax Evaluation: The softmax function, vital for attention mechanisms, is a major computational hurdle under homomorphic encryption. Traditional methods either use computationally expensive high-degree polynomial approximations or require retraining the model with HE-friendly alternatives. ENSI pioneers the integration of the ‘Sigmoid Attention’ mechanism as a direct, retraining-free replacement for softmax. Sigmoid is simpler to encrypt and significantly reduces computational complexity.
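The appeal of sigmoid attention under HE is easy to see in a plaintext sketch: softmax needs a row-wise exponential, sum, and division (the division being especially awkward to approximate on ciphertexts), while sigmoid is a single element-wise function. The `-log(n)` bias below is one common stabilizer from the sigmoid-attention literature; ENSI's exact formulation may differ, so treat this as an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_attention(Q, K, V):
    """Attention with softmax replaced by an element-wise sigmoid.

    No row-wise normalization (sum + division) is needed, only one
    element-wise nonlinearity -- far cheaper to approximate with a
    low-degree polynomial under homomorphic encryption.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # -log(n) bias keeps attention weights from saturating as n grows
    # (an assumed choice here, not necessarily the paper's).
    weights = sigmoid(scores - np.log(n))
    return weights @ V
```

Because each weight is computed independently, the attention rows no longer need to sum to one, which is precisely what makes the mechanism a drop-in, retraining-free replacement candidate.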
Efficient Bootstrapping within RMSNorm: Bootstrapping is essential for refreshing ciphertexts but is extremely costly. ENSI cleverly embeds this operation within the ‘RMSNorm’ process, a normalization technique. By performing bootstrapping at a specific point during RMSNorm, ENSI reduces its frequency from being proportional to the embedding dimension to a constant, achieving the lowest bootstrapping frequency among existing schemes – accounting for just 1% of the total runtime, compared to over 60% in previous methods.
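The following plaintext sketch shows RMSNorm with a hook marking where a ciphertext refresh could be fused into the normalization. The exact fusion point in ENSI (and the polynomial used to approximate the inverse square root under CKKS) is not specified here; the `bootstrap` callable is a hypothetical stand-in that is a no-op on plaintext.

```python
import numpy as np

def rmsnorm_with_bootstrap_hook(x, gamma, eps=1e-6, bootstrap=None):
    """RMSNorm: y = x / sqrt(mean(x^2) + eps) * gamma.

    In an encrypted evaluation, the inverse square root is replaced by a
    polynomial approximation, and ENSI embeds the costly ciphertext
    refresh (bootstrapping) at a fixed point inside this normalization,
    so it runs a constant number of times per layer rather than scaling
    with the embedding dimension. `bootstrap` marks that point; here it
    simply passes the value through unchanged.
    """
    ms = np.mean(x * x, axis=-1, keepdims=True)
    inv_rms = 1.0 / np.sqrt(ms + eps)
    if bootstrap is not None:
        inv_rms = bootstrap(inv_rms)  # ciphertext refresh would go here
    return x * inv_rms * gamma
```

Since every transformer block already passes activations through RMSNorm, piggybacking the refresh on this single per-layer operation is what drops bootstrapping from over 60% of runtime to about 1%.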
Performance and Accuracy
Experimental evaluations demonstrate ENSI’s significant performance advantages. Besides the matrix multiplication speedups, it achieves a 2.2 to 2.6 times speedup in softmax inference on a CPU. The framework was benchmarked on a LLaMA-3-700M model with 16 layers, processing 32 inputs of 2048 tokens – representing the largest known scale for secure inference to date. Despite these performance gains, ENSI maintains inference accuracy nearly comparable to plaintext inference across various datasets like PIQA, COPA, and SST.
Looking Ahead
While ENSI marks a significant leap forward in making privacy-preserving LLM inference more efficient and practical, large-scale ciphertext matrix multiplication remains a primary bottleneck. The researchers aim to further integrate dedicated hardware acceleration, such as GPUs, with the ENSI framework to achieve fully secure large model inference at even greater speeds in the future.