TLDR: A new safety response framework for AI agents is proposed to address LLM security and trustworthiness. On the input side, it uses a four-tier classification for proactive safety, achieving 99.3% risk recall. On the output side, it combines Retrieval-Augmented Generation (RAG) with a fine-tuned interpretation model so that responses are grounded in a continuously updated knowledge base, which minimizes hallucinations and makes every statement traceable to its source. Experiments show higher safety scores than baseline models.
Large Language Models (LLMs) like GPT and DeepSeek have shown incredible abilities in understanding and generating human-like text. However, their widespread use in sensitive areas such as finance, healthcare, and government is often held back by two main concerns: safety and trustworthiness. LLMs can sometimes produce unsafe, biased, or unethical responses, especially when faced with tricky or malicious inputs. They also struggle with providing up-to-date or accurate information, often “hallucinating” facts because their knowledge is based on static training data.
Addressing these critical issues, researchers from Beijing Caizhi Tech have introduced a new safety response framework for AI agents. This framework aims to systematically protect LLMs at both the input and output stages, ensuring that interactions are both safe and reliable. The paper, titled “A PROPRIETARY MODEL-BASED SAFETY RESPONSE FRAMEWORK FOR AI AGENTS,” details how this innovative approach tackles the limitations of existing solutions.
Proactive Safety at the Input Level
Unlike many current safety measures that react after an LLM has processed an input or generated an output, this new framework places a strong emphasis on proactive safety. It introduces a sophisticated input classification system that acts as the first line of defense. This system uses a supervised fine-tuning-based safety classification model, trained on a high-quality, proprietary dataset that includes various risk types like illegal activities, sensitive topics, bias, and malicious instructions.
The core of this input safety mechanism is a unique four-tier taxonomy for classifying user queries:
- Safe: Queries that are risk-free, legal, and compliant.
- Unsafe: Queries involving explicit illegality, malicious attacks, severe bias, or content violating public order. These are immediately intercepted.
- Conditionally Safe: Queries that touch on sensitive areas (e.g., privacy, financial operations) or have flawed assumptions but can be answered under specific conditions, such as identity verification.
- Focused Attention: Queries on topics without scientific consensus, with opposing viewpoints, or sensitive historical/social issues that require careful handling.
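One way to act on these four tiers is a small router that maps each tier to a handling action. This is a minimal sketch, not the paper's implementation: the `Tier` enum mirrors the taxonomy above, while `route_query`, the `identity_verified` flag, and the action strings are hypothetical names chosen for illustration. The tier itself is assumed to come from the safety classification model.

```python
from enum import Enum


class Tier(Enum):
    """The four tiers of the taxonomy described above."""
    SAFE = "safe"
    UNSAFE = "unsafe"
    CONDITIONALLY_SAFE = "conditionally_safe"
    FOCUSED_ATTENTION = "focused_attention"


def route_query(tier: Tier, identity_verified: bool = False) -> str:
    """Map a classifier tier to a handling action (hypothetical action names)."""
    if tier is Tier.UNSAFE:
        # Unsafe queries are intercepted immediately.
        return "intercept"
    if tier is Tier.CONDITIONALLY_SAFE:
        # Answer only once the condition (here: identity verification) is met.
        return "answer" if identity_verified else "request_verification"
    if tier is Tier.FOCUSED_ATTENTION:
        # Contested or sensitive topics are answered with extra care.
        return "answer_with_care"
    return "answer"


print(route_query(Tier.CONDITIONALLY_SAFE))
print(route_query(Tier.CONDITIONALLY_SAFE, identity_verified=True))
```

Identity verification is only one example of a condition; a real deployment would likely attach different conditions to different query types.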
This fine-grained classification allows for precise risk identification and differentiated handling of user queries, which improves risk coverage and adaptability to different business scenarios. The framework reports a risk recall rate of 99.3%, meaning that nearly all risky queries are caught before they reach the core LLM.
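To make the recall figure concrete, the sketch below computes recall under its standard definition: the share of genuinely risky queries that the classifier flags. The function name `risk_recall` and the example counts are illustrative assumptions; the paper's actual test set sizes are not given here.

```python
def risk_recall(true_risky: list[bool], flagged: list[bool]) -> float:
    """Recall = risky queries that were flagged / all risky queries."""
    if len(true_risky) != len(flagged):
        raise ValueError("label and prediction lists must have equal length")
    risky_total = sum(true_risky)
    if risky_total == 0:
        raise ValueError("recall is undefined without any risky queries")
    caught = sum(t and f for t, f in zip(true_risky, flagged))
    return caught / risky_total


# Illustrative numbers only: 1000 risky queries, 993 of them flagged,
# plus 10 safe queries that were flagged by mistake.
labels = [True] * 1000 + [False] * 10
predictions = [True] * 993 + [False] * 7 + [True] * 10
print(risk_recall(labels, predictions))
```

Recall ignores the safe queries that were wrongly flagged, so a high recall alone says nothing about how often legitimate queries are blocked.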
Trustworthy Generation at the Output Level
Beyond just filtering inputs, the framework also ensures the trustworthiness of the LLM’s responses. It integrates Retrieval-Augmented Generation (RAG) with a specially fine-tuned “interpretation model.” This combination is designed to combat the common problems of knowledge lag and “hallucination” in LLMs.
The system relies on a continuously updated, regulation-based, trustworthy knowledge base. Automated data pipelines crawl, parse, and index the latest announcements, policy documents, and interpretations from authoritative sources every day. This daily refresh keeps the information used to generate responses current.
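The indexing step of such a pipeline can be sketched with an in-memory inverted index. Everything here is an assumption made for illustration: the `Document` and `KnowledgeBase` classes, the whitespace tokenizer, and the `daily_update` helper are hypothetical, and the crawling and parsing stages are left out entirely.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    doc_id: str
    source: str
    published: date
    text: str


class KnowledgeBase:
    """Toy knowledge base: documents plus a word-to-document inverted index."""

    def __init__(self) -> None:
        self.docs: dict[str, Document] = {}
        self.index: dict[str, set[str]] = {}

    def upsert(self, doc: Document) -> None:
        # Drop stale postings if an older version of this document exists.
        old = self.docs.get(doc.doc_id)
        if old is not None:
            for token in set(old.text.lower().split()):
                self.index[token].discard(doc.doc_id)
        self.docs[doc.doc_id] = doc
        for token in set(doc.text.lower().split()):
            self.index.setdefault(token, set()).add(doc.doc_id)

    def search(self, term: str) -> list[str]:
        return sorted(self.index.get(term.lower(), set()))


def daily_update(kb: KnowledgeBase, fetched: list[Document]) -> int:
    """Index one day's batch of already-parsed documents."""
    for doc in fetched:
        kb.upsert(doc)
    return len(fetched)
```

Replacing a document by ID, rather than appending a new copy, is what keeps superseded policy text from being retrieved after an update.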
The interpretation LLM is strictly constrained to ensure that its output is entirely based on the retrieved knowledge content. This approach minimizes hallucination, guarantees high accuracy, and makes every statement in the response fully traceable back to its source. This means users can trust that the information provided is not fabricated and is grounded in verifiable facts.
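The grounding constraint can be illustrated with a toy example in which an answer is assembled only from retrieved passages, each tagged with its source. This is a sketch under loud assumptions: simple keyword overlap stands in for real retrieval, verbatim quoting stands in for the fine-tuned interpretation model, and `Passage`, `retrieve`, and `grounded_answer` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source_id: str
    text: str


def retrieve(query: str, passages: list[Passage], min_overlap: int = 1) -> list[Passage]:
    """Toy keyword retrieval: keep passages that share words with the query."""
    query_words = set(query.lower().split())
    scored = []
    for passage in passages:
        overlap = len(query_words & set(passage.text.lower().split()))
        if overlap >= min_overlap:
            scored.append((overlap, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored]


def grounded_answer(query: str, passages: list[Passage]) -> str:
    """Answer only from retrieved text, and cite the source of every line."""
    hits = retrieve(query, passages)
    if not hits:
        # Nothing retrieved: decline rather than generate unsupported text.
        return "No supporting source found in the knowledge base."
    return "\n".join(f"{p.text} [source: {p.source_id}]" for p in hits)


knowledge = [
    Passage("notice-1", "Margin requirements rise in March."),
    Passage("notice-2", "Branch opening hours are unchanged."),
]
print(grounded_answer("margin requirements", knowledge))
```

The key design choice is the refusal branch: when retrieval returns nothing, the system declines instead of falling back on the model's own parametric knowledge.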
Experimental Validation and Superior Performance
The effectiveness of this framework has been rigorously evaluated through experiments, comparing it against leading baseline models. In terms of safety classification, the Caizhi-Safety-Control-Model significantly outperformed Qwen3Guard-Gen-8B, demonstrating more stringent and precise risk identification capabilities, especially for “gray area” issues that might appear legitimate but require cautious handling.
For comprehensive safety protection and response generation, the framework achieved industry-leading safety scores on both public and proprietary high-risk test sets, markedly surpassing models like TinyR1-Safety-8B. For instance, on a proprietary high-risk test set, the framework achieved a 99% safety score, supporting its protective capabilities in complex scenarios. This indicates that the Caizhi-Safety-Control-Model not only answers questions safely but also provides high-quality, well-structured, and fully cited answers, balancing safety and utility.
This research provides a practical pathway for developing secure and trustworthy LLM applications, which is especially important for sensitive industrial-grade systems. The authors plan to continue optimizing the safety classification model, to integrate dynamic knowledge base updates more tightly with model fine-tuning, and to explore applications in specialized industries like financial risk control and legal consultation. The API used in this research is publicly available, and more details can be found in the research paper.


