TLDR: OpenAI introduces “safe-completions,” a new AI safety training method that focuses on making the model’s output safe and helpful, rather than just refusing based on user intent. This approach allows LLMs to provide high-level, non-operational guidance for dual-use queries, reducing severe safety failures and significantly increasing helpfulness compared to traditional refusal-based training.
Large Language Models (LLMs) like those powering ChatGPT have traditionally been designed with a clear boundary: either fully comply with a user’s request or outright refuse it. While this “refusal-based” approach is effective for obviously harmful prompts, a new research paper from OpenAI highlights its limitations, especially when user intent isn’t straightforward.
The paper, titled “From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training,” introduces an approach called “safe-completions.” It shifts the focus of safety training from classifying the user’s input as safe or unsafe to ensuring that the model’s actual output is safe. The goal is to maximize helpfulness while staying strictly within safety policies.
The Problem with Traditional Refusals
Imagine asking an LLM how to ignite a pyrogen (a material used in fireworks). If the request shows no explicitly malicious intent, a refusal-trained model might fully comply and provide detailed, actionable instructions. Phrase the same question with a slightly more malicious tone, and it refuses. The decision hinges on how the request is worded rather than on what the answer would enable, which shows the brittleness of the refusal paradigm. It struggles most with “dual-use” cases: requests that could be benign but also have harmful applications, such as in biology or cybersecurity. Refusal-trained systems often have to choose between being overly cautious (and unhelpful) and risking the output of dangerous information.
Introducing Safe-Completions
Safe-completions offer a more nuanced solution. Instead of a binary “yes/no” decision, the model learns to provide responses that are as helpful as possible within safety constraints. This training involves two main stages:
- Supervised Fine-Tuning (SFT): In this stage, the model is taught to reason about content policies. It learns to choose from three response modes:
  - Direct answer: For requests that are completely harmless.
  - Safe-completion: For content that is restricted but not entirely forbidden, it provides high-level, non-operational guidance.
  - Refuse with redirection: For requests that cannot be safely fulfilled, it politely declines while offering constructive, safer alternatives.
- Reinforcement Learning (RL): Here, a reward system encourages both helpfulness and safety. The model receives a high reward only if its response is both helpful and safe. If providing a direct, fully helpful answer would violate safety, the model is incentivized to offer indirect help, such as warnings, risk explanations, or redirection to permissible alternatives. This means the model will “fail softer” if it makes a mistake, providing less actionable, lower-severity content.
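The reward idea described above can be illustrated with a small sketch. It assumes that safety and helpfulness are each scored in the range 0 to 1 and that the two scores are multiplied, which is one simple way to make the reward high only when a response is both safe and helpful. The names `Candidate` and `combined_reward`, and all the scores, are hypothetical and chosen for illustration only; this is not the paper's actual training code.

```python
from dataclasses import dataclass
from enum import Enum


class ResponseMode(Enum):
    """The three response modes named in the SFT stage."""
    DIRECT_ANSWER = "direct answer"
    SAFE_COMPLETION = "safe-completion"
    REFUSE_WITH_REDIRECTION = "refuse with redirection"


@dataclass
class Candidate:
    mode: ResponseMode
    safety: float       # assumed scale: 0.0 (violates policy) to 1.0 (compliant)
    helpfulness: float  # assumed scale: 0.0 (no value) to 1.0 (fully addresses the need)


def combined_reward(candidate: Candidate) -> float:
    """Assumed product form: the reward collapses if either score is low."""
    for value in (candidate.safety, candidate.helpfulness):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    return candidate.safety * candidate.helpfulness


# Hypothetical scores for three candidate responses to one dual-use prompt.
candidates = [
    # Fully operational answer: helpful, but scored as unsafe for this prompt.
    Candidate(ResponseMode.DIRECT_ANSWER, safety=0.0, helpfulness=1.0),
    # High-level, non-operational guidance with risk explanations.
    Candidate(ResponseMode.SAFE_COMPLETION, safety=1.0, helpfulness=0.6),
    # Polite decline that points to safer alternatives.
    Candidate(ResponseMode.REFUSE_WITH_REDIRECTION, safety=1.0, helpfulness=0.25),
]

best = max(candidates, key=combined_reward)
print(best.mode.value)  # prints "safe-completion"
```

With these illustrative numbers, the unsafe direct answer earns zero reward no matter how helpful it is, and the safe-completion beats the refusal because it is equally safe but more useful. That is the incentive toward indirect help described above.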
Evolving Safety Policies
To support safe-completions, OpenAI also updated its illicit wrongdoing policy. The new focus is on “meaningful facilitation” – assessing whether a model’s response would significantly lower the barrier to harmful action. This allows for high-level summaries or general best practices to be shared, even for sensitive topics, as long as they don’t provide highly actionable or targeted support for wrongdoing.
Promising Results
The research paper details experiments comparing safe-completion models (like GPT-5) with traditional refusal-trained models (like o3). It reports three main findings:
- Safe-completion models significantly improved safety on dual-use prompts, while maintaining comparable safety on explicitly malicious requests.
- They substantially increased model helpfulness, moving away from rigid refusals towards more useful, policy-compliant responses.
- When safety failures did occur, the safe-completion models produced less severe, less actionable content.
A specific case study on “biorisk” (potentially dangerous biological information) further demonstrated these benefits: safe-completion models could provide helpful information without sacrificing safety, and any unsafe outputs they did produce were drastically less severe. Human evaluations also consistently favored safe-completion models, independently supporting their improved balance of safety and helpfulness.
This new paradigm represents a significant step forward in making advanced LLMs more robustly aligned for safety, allowing them to be more helpful without compromising on critical safety boundaries. You can read the full research paper for more details at https://arxiv.org/pdf/2508.09224.


