TLDR: Libra-Guard is a new safeguard system for Chinese large language models (LLMs) that uses a two-stage training process with synthetic and real-world data to improve safety and reduce reliance on manual annotations. Alongside it, Libra-Test is introduced as the first benchmark for evaluating safeguard systems on Chinese content, covering seven harm scenarios. Experiments show that Libra-Guard outperforms existing open-source models and approaches the performance of closed-source models.
Large language models (LLMs) have transformed applications ranging from conversational agents to content generation. These models are adept at understanding and producing human-like text, which has led to their widespread integration into real-world scenarios. Their increasing deployment, however, raises significant concerns about the safety and ethical implications of their outputs, particularly in high-stakes situations.
Safeguard systems such as LlamaGuard and ShieldLM have been developed to filter potentially harmful inputs and outputs, but they have limitations. Many existing safeguards are designed primarily for English and offer inadequate support for Chinese-language content. They also rely heavily on manual annotations for training data, which restricts their scalability and adaptability. Finally, the value of synthetic data, which is crucial for exposing a safeguard to diverse inputs, is frequently overlooked.
To address these challenges, researchers have introduced Libra-Guard, a safeguard system designed specifically for Chinese-language LLMs. Libra-Guard uses a scalable two-stage curriculum training framework that combines pre-training on large-scale synthetic adversarial data with fine-tuning on high-quality, real-world examples. Following curriculum learning principles, it makes efficient use of annotated samples, which lets it handle complex real-world scenarios while significantly reducing the reliance on manual annotations.
Complementing Libra-Guard, the team also unveiled Libra-Test, the first benchmark specifically created to evaluate the performance of safeguard systems for Chinese content. Libra-Test covers seven critical harm scenarios, including hate speech, bias, and criminal activities, and features over 5,700 rigorously annotated samples. These samples comprise a balanced mix of real-world, synthetic, and translated data, ensuring comprehensive coverage of safety aspects. The benchmark is designed to assess diversity, difficulty (by including challenging examples with inconsistent labels that are then manually re-annotated), and consistency (through unified safety rules and expert human annotation).
Experimental results show that Libra-Guard performs strongly. On Libra-Test, it achieved an average accuracy of 86.79%, well above open-source models such as Qwen2.5-14B-Instruct (74.33%) and ShieldLM-Qwen-14B-Chat (65.69%). Libra-Guard’s performance also approaches that of closed-source models like Claude-3.5-Sonnet and GPT-4o, which points to the effectiveness of safety-specific training. The research also indicates that model performance improves with scale, emphasizing the importance of combining model scaling with tailored safety training. Libra-Guard generalizes well across different model sources and sizes, reflecting the flexibility of its two-stage training pipeline.
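The article does not spell out how the average accuracy is computed. One plausible reading, assumed here purely for illustration, is an unweighted mean of the accuracies on the real-world, synthetic, and translated subsets. The record keys and toy data below are hypothetical and are not taken from Libra-Test.

```python
from collections import defaultdict


def accuracy_by_subset(records):
    """Return (per-subset accuracy, unweighted mean of those accuracies).

    Each record is a dict with hypothetical keys 'subset', 'gold', 'pred'.
    Averaging over subsets (not over samples) is an assumption.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["subset"]] += 1
        correct[r["subset"]] += int(r["gold"] == r["pred"])
    per_subset = {name: correct[name] / total[name] for name in total}
    average = sum(per_subset.values()) / len(per_subset)
    return per_subset, average


# Toy predictions, invented for this example only.
records = [
    {"subset": "real", "gold": "unsafe", "pred": "unsafe"},
    {"subset": "real", "gold": "safe", "pred": "unsafe"},
    {"subset": "synthetic", "gold": "safe", "pred": "safe"},
    {"subset": "translated", "gold": "unsafe", "pred": "unsafe"},
]
per_subset, average = accuracy_by_subset(records)
print(per_subset, round(average, 4))
```

Averaging per subset keeps a large subset from dominating the headline number; a sample-weighted mean would be an equally reasonable choice if the subsets are balanced.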
The Libra-Guard approach involves two main stages. The first is Guard Pretraining, which builds a robust foundation using large-scale synthetic data. This involves synthesizing harmful queries, generating responses using various LLMs, and performing safety annotations. The second stage, Guard Finetuning, refines safety performance by incorporating high-quality, real-world data, focusing on more challenging samples. This curriculum learning approach, starting with easier synthetic samples and progressing to harder real-world ones, proved crucial for optimal performance.
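The ordering described above can be sketched as a small scheduling loop. This is a minimal illustration of the curriculum idea (synthetic data first, real-world data second), not the authors' training code: the Sample fields, the stage names, and the train_step callback are all hypothetical stand-ins for a real training step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    query: str
    response: str
    label: str   # "safe" or "unsafe"
    source: str  # "synthetic" or "real" (hypothetical field)


def run_curriculum(
    samples: List[Sample],
    train_step: Callable[[str, Sample], None],
) -> List[Tuple[str, str]]:
    """Feed all synthetic samples to train_step, then all real-world ones."""
    stages = [("guard_pretraining", "synthetic"), ("guard_finetuning", "real")]
    order = []
    for stage_name, source in stages:
        for sample in samples:
            if sample.source == source:
                train_step(stage_name, sample)  # placeholder for a model update
                order.append((stage_name, sample.query))
    return order


# Toy data; the stub below only records calls instead of training a model.
samples = [
    Sample("q_real_1", "r1", "unsafe", "real"),
    Sample("q_syn_1", "r2", "safe", "synthetic"),
    Sample("q_syn_2", "r3", "unsafe", "synthetic"),
]
calls = []
order = run_curriculum(samples, lambda stage, s: calls.append((stage, s.label)))
print(order)
```

Even though the real-world sample appears first in the input list, it is only seen in the second stage, which is the property the curriculum relies on.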
Ablation studies further confirmed the effectiveness of Libra-Guard’s design choices. Increasing the amount of synthetic data during pretraining and high-quality real-world data during fine-tuning consistently improved performance. The inclusion of a “Rear Critic” component, which provides an explanation for label assignments after the label itself, was found to be optimal. Interestingly, the study also suggested that explicit safety rules in the prompt during training and inference were not strictly necessary, as Libra-Guard learns safety principles through its comprehensive training process. The curriculum learning strategy (Pretrain → SFT) consistently outperformed other training methods, underscoring its importance in achieving the best results, particularly when fine-tuning on challenging samples.
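To make the "Rear Critic" ordering concrete, the sketch below formats a training target with the label first and the explanation after it, then parses such an output back. The "Label:" and "Critique:" field names and the two-part layout are assumptions made for this example; the article does not describe the exact output format.

```python
def format_target(label: str, critique: str) -> str:
    """Build a label-first target: the verdict, then the explanation."""
    return f"Label: {label}\nCritique: {critique}"


def parse_output(text: str):
    """Split a label-first output into (label, critique).

    Missing fields come back as empty strings rather than raising.
    """
    first_line, _, rest = text.partition("\n")
    _, _, label = first_line.partition(":")
    _, _, critique = rest.partition(":")
    return label.strip(), critique.strip()


target = format_target("unsafe", "The response gives instructions: step by step.")
label, critique = parse_output(target)
print(label)
print(critique)
```

One practical consequence of putting the label first is that a caller who only needs the verdict can read the first line and ignore the explanation.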
This work establishes a framework for advancing the safety governance of Chinese LLMs, paving the way for safer and more reliable AI systems across diverse applications. The researchers plan to extend Libra-Guard to evolving safety challenges, including multimodal content (Libra-V) and long-text scenarios (Libra-L), and to strengthen the reasoning capabilities of safety models. For more details, refer to the full research paper.


