Enhancing AI Agent Safety and Trustworthiness with a New Response Framework

TLDR: A new safety response framework for AI agents is proposed, addressing LLM security and trustworthiness. On the input side, it uses a four-tier classification for proactive safety and reports a 99.3% risk recall. On the output side, it combines Retrieval-Augmented Generation (RAG) with a fine-tuned interpretation model so that responses are grounded in a continuously updated knowledge base, which minimizes hallucinations and makes every statement traceable to its source. In the reported experiments, the framework scores higher on safety than the baseline models.

Large Language Models (LLMs) like GPT and DeepSeek have shown incredible abilities in understanding and generating human-like text. However, their widespread use in sensitive areas such as finance, healthcare, and government is often held back by two main concerns: safety and trustworthiness. LLMs can sometimes produce unsafe, biased, or unethical responses, especially when faced with tricky or malicious inputs. They also struggle with providing up-to-date or accurate information, often “hallucinating” facts because their knowledge is based on static training data.

Addressing these critical issues, researchers from Beijing Caizhi Tech have introduced a new safety response framework for AI agents. This framework aims to systematically protect LLMs at both the input and output stages, ensuring that interactions are both safe and reliable. The paper, titled “A PROPRIETARY MODEL-BASED SAFETY RESPONSE FRAMEWORK FOR AI AGENTS,” details how this innovative approach tackles the limitations of existing solutions.

Proactive Safety at the Input Level

Unlike many current safety measures that react after an LLM has processed an input or generated an output, this new framework places a strong emphasis on proactive safety. It introduces a sophisticated input classification system that acts as the first line of defense. This system uses a supervised fine-tuning-based safety classification model, trained on a high-quality, proprietary dataset that includes various risk types like illegal activities, sensitive topics, bias, and malicious instructions.

The core of this input safety mechanism is a unique four-tier taxonomy for classifying user queries:

Safe: Queries that are risk-free, legal, and compliant.
Unsafe: Queries involving explicit illegality, malicious attacks, severe bias, or content violating public order. These are immediately intercepted.
Conditionally Safe: Queries that touch on sensitive areas (e.g., privacy, financial operations) or have flawed assumptions but can be answered under specific conditions, such as identity verification.
Focused Attention: Queries on topics without scientific consensus, with opposing viewpoints, or sensitive historical/social issues that require careful handling.
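
The differentiated handling implied by these four tiers can be sketched as a small routing function. This is only an illustration under stated assumptions: the names `RiskTier` and `route_query`, the action strings, and the use of identity verification as the single condition are hypothetical and are not taken from the paper, which does not publish its routing logic.

```python
from enum import Enum


class RiskTier(Enum):
    """The four tiers named in the article."""
    SAFE = "safe"
    UNSAFE = "unsafe"
    CONDITIONALLY_SAFE = "conditionally_safe"
    FOCUSED_ATTENTION = "focused_attention"


def route_query(tier: RiskTier, identity_verified: bool = False) -> str:
    """Map a classified query to a handling action (hypothetical actions)."""
    if tier is RiskTier.UNSAFE:
        # Unsafe queries are intercepted before reaching the core LLM.
        return "block"
    if tier is RiskTier.CONDITIONALLY_SAFE:
        # Assumption: identity verification is the condition to satisfy.
        return "answer" if identity_verified else "request_verification"
    if tier is RiskTier.FOCUSED_ATTENTION:
        # Contested or sensitive topics are answered with extra care.
        return "answer_with_care"
    return "answer"
```

In this sketch, only the Unsafe tier results in a refusal; the two middle tiers still lead to an answer, but under a condition or with additional care.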

This fine-grained classification allows for precise risk identification and differentiated handling of user queries, improving risk coverage and adaptability to various business scenarios. The reported risk recall is 99.3%, meaning the classifier flags nearly all risky queries before they reach the core LLM.
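
For readers unfamiliar with the metric, recall is the share of truly risky queries that the classifier flags: true positives divided by true positives plus false negatives. The sketch below shows that arithmetic. The function name `risk_recall` and the choice to treat each query as simply risky or not risky are assumptions made for illustration; the article does not say how the paper maps the four tiers onto this count.

```python
from typing import Sequence


def risk_recall(true_is_risky: Sequence[bool], pred_is_risky: Sequence[bool]) -> float:
    """Recall = TP / (TP + FN), computed over the truly risky queries."""
    if len(true_is_risky) != len(pred_is_risky):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(true_is_risky, pred_is_risky))
    true_positives = sum(1 for truth, pred in pairs if truth and pred)
    false_negatives = sum(1 for truth, pred in pairs if truth and not pred)
    if true_positives + false_negatives == 0:
        raise ValueError("recall is undefined without any risky queries")
    return true_positives / (true_positives + false_negatives)
```

As a worked example with made-up counts, a classifier that flags 993 of 1,000 risky queries has a recall of 993 / 1000 = 0.993. Recall alone says nothing about how many safe queries are wrongly flagged.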

Trustworthy Generation at the Output Level

Beyond just filtering inputs, the framework also ensures the trustworthiness of the LLM’s responses. It integrates Retrieval-Augmented Generation (RAG) with a specially fine-tuned “interpretation model.” This combination is designed to combat the common problems of knowledge lag and “hallucination” in LLMs.

The system relies on a continuously updated, regulation-based, trustworthy knowledge base. Automated data pipelines crawl, parse, and index the latest announcements, policy documents, and interpretations from authoritative sources every day. This daily refresh keeps the information used to generate responses current.

The interpretation LLM is strictly constrained to ensure that its output is entirely based on the retrieved knowledge content. This approach minimizes hallucination, guarantees high accuracy, and makes every statement in the response fully traceable back to its source. This means users can trust that the information provided is not fabricated and is grounded in verifiable facts.
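
The grounding and traceability idea can be sketched without calling any model. The example below is a toy under loudly stated assumptions: the keyword-overlap retriever stands in for whatever retrieval the framework actually uses, and `Passage`, `retrieve`, `build_prompt`, `citations_are_traceable`, and the `[doc_id]` citation format are all hypothetical names invented for this illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    doc_id: str  # hypothetical source identifier
    text: str


def retrieve(query: str, kb: List[Passage], k: int = 2) -> List[Passage]:
    """Toy retriever: rank passages by the number of words shared with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(p.text.lower().split())), p) for p in kb]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(query: str, passages: List[Passage]) -> str:
    """Constrain the model to the retrieved passages and require citations."""
    sources = "\n".join(f"[{p.doc_id}] {p.text}" for p in passages)
    return (
        "Answer using ONLY the sources below and cite each claim as [doc_id]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


def citations_are_traceable(answer: str, passages: List[Passage]) -> bool:
    """True if the answer cites at least one source and only retrieved ones."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    allowed = {p.doc_id for p in passages}
    return bool(cited) and cited <= allowed
```

A post-generation check such as `citations_are_traceable` only verifies that cited identifiers exist among the retrieved passages; it does not verify that the cited passage actually supports the claim, which would need a stronger check.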

Experimental Validation and Superior Performance

The effectiveness of this framework has been rigorously evaluated through experiments, comparing it against leading baseline models. In terms of safety classification, the Caizhi-Safety-Control-Model significantly outperformed Qwen3Guard-Gen-8B, demonstrating more stringent and precise risk identification capabilities, especially for “gray area” issues that might appear legitimate but require cautious handling.

When it comes to comprehensive safety protection and response generation, the framework achieved industry-leading safety scores on both public and proprietary high-risk test sets, markedly surpassing models like TinyR1-Safety-8B. For instance, on a proprietary high-risk test set, the framework achieved a near-perfect 99% safety score, validating its exceptional protective capabilities in complex scenarios. This indicates that the Caizhi-Safety-Control-Model not only answers questions safely but also provides high-quality, well-structured, and fully-cited answers, achieving a balance of safety and utility.

This research provides a practical pathway for developing secure and trustworthy LLM applications, which is especially important for sensitive industrial-grade systems. The authors plan to continue optimizing the safety classification model, integrate dynamic knowledge base updates more tightly with model fine-tuning, and explore applications in specialized industries such as financial risk control and legal consultation. According to the article, the API used in this research is publicly available.
