OpenAI Unveils Advanced AI Reasoning Models with Enhanced Safety Features

Sophia Steele

December 22, 2024 · 4 min read

OpenAI has announced a new family of AI reasoning models, o1 and o3, which the company claims are more advanced than its previous models. The improvements are attributed to scaling test-time compute and to a new safety paradigm, called deliberative alignment, used to train the o-series of models. The approach is designed to keep AI reasoning models aligned with the values of their human developers, particularly during the inference phase.

The deliberative alignment method involves training AI models to "think" about OpenAI's safety policy during inference, resulting in improved adherence to the company's safety principles. According to OpenAI's research, this approach decreased the rate at which o1 answered "unsafe" questions while improving its ability to answer benign ones. The method's effectiveness is demonstrated in a graph comparing o1's improved alignment against other models, including Claude, Gemini, and GPT-4o.
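To make the trade-off concrete, here is a minimal sketch of how such a safety/helpfulness evaluation might be computed. It is not OpenAI's evaluation harness; the prompt lists, the `model_answer` callable, and the `is_refusal` heuristic are all hypothetical stand-ins.

```python
# Minimal sketch of a safety/helpfulness trade-off measurement.
# The prompt sets, model_answer(), and is_refusal() are hypothetical
# placeholders, not OpenAI's actual evaluation code.

REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i won't provide")

def is_refusal(answer: str) -> bool:
    """Crude heuristic: treat an answer as a refusal if it contains a refusal phrase."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(prompts: list[str], model_answer) -> float:
    """Fraction of prompts the model refuses to answer."""
    refusals = sum(is_refusal(model_answer(p)) for p in prompts)
    return refusals / len(prompts)

def evaluate(model_answer, unsafe_prompts: list[str], benign_prompts: list[str]) -> dict:
    # A well-aligned model should score high on the first metric (refusing
    # unsafe requests) and low on the second (not over-refusing benign ones).
    return {
        "unsafe_refusal_rate": refusal_rate(unsafe_prompts, model_answer),
        "benign_over_refusal_rate": refusal_rate(benign_prompts, model_answer),
    }
```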

AI safety research is becoming increasingly important as AI models grow in popularity and capability. However, the subjective nature of these decisions has sparked controversy, with some arguing that certain AI safety measures amount to "censorship." OpenAI's o-series models, inspired by human thought processes, ultimately work by predicting the next token in a sentence, yet they produce sophisticated answers to writing and coding tasks.

The o-series models work by breaking a problem down into smaller steps, a process referred to as "chain-of-thought," before providing an answer based on the generated reasoning. The key innovation of deliberative alignment is that OpenAI trained o1 and o3 to re-prompt themselves with text from OpenAI's safety policy during the chain-of-thought phase. This internal deliberation enables the models to answer questions safely, according to the paper.
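As a rough illustration of the idea, the sketch below scaffolds the two phases with explicit API calls. This is only an external approximation: in o1 and o3 the deliberation is trained into the model rather than prompted from outside, and `call_model` and `SAFETY_POLICY_EXCERPT` are hypothetical placeholders, not OpenAI's API or policy text.

```python
# Minimal sketch of inference-time deliberation over a safety policy.
# call_model() and SAFETY_POLICY_EXCERPT are hypothetical stand-ins;
# the real o-series models learn this behavior during training rather
# than relying on an external two-call scaffold like this one.

SAFETY_POLICY_EXCERPT = """\
1. Do not help users create fraudulent documents.
2. Refuse requests that facilitate illegal activity, and explain why.
"""

def call_model(messages: list[dict]) -> str:
    """Hypothetical wrapper around a chat-style model API; returns the model's text."""
    raise NotImplementedError("plug in a model client here")

def answer_with_deliberation(user_prompt: str) -> str:
    # Phase 1: chain-of-thought — reason about the request with the relevant
    # safety policy text placed directly in context.
    reasoning = call_model([
        {"role": "system", "content": "Think step by step about whether the request "
                                      "complies with this safety policy:\n" + SAFETY_POLICY_EXCERPT},
        {"role": "user", "content": user_prompt},
    ])

    # Phase 2: final answer — respond (or refuse) based on the deliberation above.
    return call_model([
        {"role": "system", "content": "Using your deliberation, give a final answer "
                                      "that complies with the policy. Refuse if necessary."},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": reasoning},
    ])
```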

An example from OpenAI's research illustrates the effectiveness of deliberative alignment. When a user asks an AI reasoning model to create a realistic disabled parking placard, the model cites OpenAI's policy, identifies the request as unsafe, and refuses to assist. This demonstrates the model's ability to align with OpenAI's safety principles and moderate its answers to unsafe prompts.

Traditionally, AI safety work occurs during the pre-training and post-training phases, but not during inference. OpenAI's deliberative alignment approach is novel in this respect and has contributed to o1-preview, o1, and o3-mini becoming some of the company's safest models yet. The challenge lies in accounting for the many ways users might phrase unsafe questions while avoiding over-refusal, so that the models can still answer practical questions.

Deliberative alignment appears to have improved alignment for OpenAI's o-series models, as demonstrated by their performance on the Pareto benchmark, which measures a model's resistance to common jailbreaks. According to OpenAI, the approach is the first to directly teach a model the text of its safety specifications and train it to deliberate over those specifications at inference time, resulting in safer responses that are appropriately calibrated to a given context.

In addition to deliberative alignment, OpenAI developed a new post-training method that relies on synthetic data created by another AI model. This approach eliminated the need for human-written answers or chains of thought, reducing latency and compute costs. The company used an internal reasoning model to generate example chain-of-thought answers that reference different parts of its safety policy, which were then assessed by a separate internal AI reasoning model, dubbed the "judge."
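The sketch below shows one way such a synthetic-data pipeline could be wired together, keeping only generations the judge rates as policy-compliant. It is a simplified guess at the described workflow, not OpenAI's pipeline; `generate`, `judge_score`, and the 0.8 threshold are hypothetical.

```python
# Minimal sketch of generating judge-filtered synthetic training data.
# generate() and judge_score() are hypothetical calls to an internal
# reasoning model and a "judge" model; this is not OpenAI's actual pipeline.

def generate(prompt: str) -> str:
    """Hypothetical call to a reasoning model; returns a chain-of-thought answer."""
    raise NotImplementedError

def judge_score(prompt: str, answer: str, policy: str) -> float:
    """Hypothetical call to a judge model; returns a 0-1 policy-compliance score."""
    raise NotImplementedError

def build_training_set(sensitive_prompts: list[str], policy: str, threshold: float = 0.8) -> list[dict]:
    """Keep only synthetic chain-of-thought answers the judge rates as policy-compliant."""
    dataset = []
    for prompt in sensitive_prompts:
        answer = generate(
            f"Referring to the relevant parts of this policy:\n{policy}\n"
            f"Reason step by step, then answer:\n{prompt}"
        )
        if judge_score(prompt, answer, policy) >= threshold:
            dataset.append({"prompt": prompt, "completion": answer})
    return dataset
```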

OpenAI's researchers trained o1 and o3 on these examples, enabling the models to learn to recall relevant pieces of the safety policy when asked about sensitive topics. The company also used the "judge" model during reinforcement learning to assess the answers given by o1 and o3. This scalable approach to alignment could become increasingly important as reasoning models grow more powerful and are given more agency.
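To show where the judge fits into the reinforcement-learning stage, here is a deliberately simplified sketch in which the judge's compliance score serves as the reward signal. The `policy_model` interface, `judge_score`, and the update step are hypothetical; a real setup would use a full policy-gradient or PPO loop rather than this outline.

```python
# Minimal sketch of using a "judge" model as the reward signal in RL.
# policy_model.sample(), policy_model.update(), and judge_score() are
# hypothetical placeholders, not OpenAI's implementation.

def judge_score(prompt: str, answer: str) -> float:
    """Hypothetical judge call returning a policy-compliance reward in [0, 1]."""
    raise NotImplementedError

def rl_step(policy_model, prompts: list[str], optimizer) -> float:
    """One simplified training step driven by judge-assigned rewards."""
    rewards = []
    for prompt in prompts:
        answer = policy_model.sample(prompt)          # model generates an answer
        rewards.append(judge_score(prompt, answer))   # judge scores its compliance
    # In a real setup these rewards would feed a policy-gradient / PPO update;
    # here we only show where the reward signal comes from.
    policy_model.update(optimizer, rewards)           # hypothetical update hook
    return sum(rewards) / len(rewards)                # average reward for logging
```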

The o3 model is set to roll out in 2025, and its public availability will provide an opportunity to assess its advanced capabilities and safety features. OpenAI's deliberative alignment approach could be a crucial step toward ensuring AI reasoning models adhere to human values moving forward.

