AISN #50: AI Action Plan Responses

post by Corin Katzke (corin-katzke), Dan H (dan-hendrycks) · 2025-03-31T20:13:31.533Z · LW · GW · 0 comments

Contents

  AI Action Plan Responses
    All three companies argue for increased government investment in AI.
    OpenAI and Google argue against copyright and state-level restrictions on frontier AI development.
    All three companies lean into geopolitical competition, but differ in their approach to export controls.
  Detecting Misbehavior in Reasoning Models
  Other news
    Industry
    Government

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

In this newsletter, we cover AI companies’ responses to the federal government’s request for information on the development of an AI Action Plan. We also discuss an OpenAI paper on detecting misbehavior in reasoning models by monitoring their chains of thought.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.

Subscribe to receive future versions.


AI Action Plan Responses

On January 23, President Trump signed an executive order giving his administration 180 days to develop an “AI Action Plan” to “enhance America's global AI dominance in order to promote human flourishing, economic competitiveness, and national security.”

Despite the rhetoric of the order, the Trump administration has yet to articulate many policy positions with respect to AI development and safety. In a recent interview, Ben Buchanan—Biden’s AI advisor—interpreted the executive order as giving the administration time to develop its AI policies. The AI Action Plan will therefore likely be the best indicator yet of the administration's AI policy over the next four years.

In response to the executive order, the Office of Science and Technology Policy (OSTP) published a request for information. Over 8,000 public comments were submitted before the comment period closed on March 15th.

Three frontier AI companies published their comments: OpenAI, Google, and Anthropic. These should not be read as comprehensive or even entirely accurate portrayals of these companies’ views on AI policy—they are political documents, likely written with their audience in mind. For example, despite its previous public messaging on the importance of AI safety, OpenAI's response avoided the term entirely, and referenced risks from AI only in passing. This change is likely a response to the Trump administration's rejection of the language of AI safety, as VP Vance displayed in his speech at the recent AI Action Summit. (That being said, OpenAI’s response may better reflect its private lobbying positions than its previous public messaging did.)

All three companies argue for increased government investment in AI.

OpenAI and Google argue against copyright and state-level restrictions on frontier AI development.

All three companies lean into geopolitical competition, but differ in their approach to export controls.

In sum, the public policy positions of frontier AI companies—particularly OpenAI’s—have noticeably changed since the last administration. They have largely adopted the Trump administration’s hawkishness towards China as a platform to advocate for increased government investment in AI, against copyright and state-level restrictions on frontier AI development, and for changes to AI export controls.

Detecting Misbehavior in Reasoning Models

In this story, we discuss a new paper from OpenAI on monitoring reasoning models for misbehavior. The paper finds that less-capable LLMs can reliably detect proxy gaming in more capable reasoning agents by monitoring their chains of thought (CoTs). However, it also finds that applying optimization pressure to CoTs to remove proxy gaming instead leads those agents to obfuscate their gaming, making monitoring more difficult.

Reasoning models exhibit proxy gaming. Proxy gaming—behavior that maximizes reward in unwanted ways—is a challenge in creating useful AI agents. Like other systems developed with reinforcement learning, reasoning models like OpenAI’s o-series often exhibit proxy gaming.

For example, OpenAI’s paper shows that a frontier reasoning model in development (more capable than o1 and o3-mini) often chooses to “solve” difficult coding tasks by exploiting loopholes in the task instead of finding genuine solutions.

An example given in OpenAI’s paper of explicit proxy gaming in a chain of thought.

Chains of thought can be monitored for proxy gaming. Reasoning models arrive at answers by generating chains of thought (CoTs) in natural language, and often explicitly indicate reward hacking in them. OpenAI’s paper proposes using a less capable model to monitor an agent's CoT for reward hacking, and demonstrates that such a monitor can successfully detect proxy gaming in a more capable agent.
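The monitoring setup described above can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation: the monitor here is a stub keyword heuristic standing in for the weaker LLM judge, and the function names (`monitor_cot`, `review`) are hypothetical.

```python
# Sketch of CoT monitoring: before an agent's answer is accepted, a weaker
# "monitor" inspects the agent's chain of thought for signs of proxy gaming.
# In the paper the monitor is a less capable LLM; here it is a stub heuristic.

def monitor_cot(chain_of_thought: str) -> bool:
    """Return True if the CoT looks like reward hacking (stub heuristic)."""
    red_flags = ["hardcode", "skip the tests", "exploit", "bypass the check"]
    cot = chain_of_thought.lower()
    return any(flag in cot for flag in red_flags)

def review(task_output: str, chain_of_thought: str) -> str:
    """Accept or flag an agent's answer based only on its CoT."""
    if monitor_cot(chain_of_thought):
        return "flagged: possible proxy gaming"
    return "accepted"
```

The key design point the paper relies on is that the monitor only needs to read the CoT, not understand the task solution itself—which is why a weaker model can police a stronger one, so long as the gaming remains explicit in the CoT.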

Optimizing CoTs directly causes agents to obfuscate reward hacking. Rather than solely reward reasoning models for their outputs, developers can also reward CoTs that avoid obvious proxy gaming. The paper shows that applying optimization pressure to CoTs directly can initially increase the system’s capability and decrease the incidence of proxy gaming.

However, optimizing CoTs directly often leads agents to learn to obscure their reasoning and evade monitoring—a scenario described as “obfuscated reward hacking.” CoT monitoring performs exceptionally well in identifying simple, explicit hacks, but not in detecting more subtle forms of gaming. The paper finds that rewarding CoTs leads to systems learning to misbehave without making that misbehavior apparent in CoTs.

A set of graphs from OpenAI’s announcement post comparing the performance and alignment of agents trained with and without CoT pressure.

Monitorability can trade off against other developer goals. Applying optimization pressure directly to CoTs makes misbehavior less apparent in them, and therefore harder to monitor. However, developers may nevertheless be incentivized to do just that, since—as noted above—it can initially improve capabilities and reduce visible proxy gaming.

The paper argues that developers can—and should—pay this “monitorability tax” by refraining from optimizing CoTs directly. However, even this doesn’t guarantee safety—as reasoning models become more capable, they may still learn to obfuscate reward hacking through indirect optimization pressure on their outputs alone.

OpenAI’s paper introduces an effective technique for monitoring misbehavior in reasoning models. However, it also warns that this technique depends on leaving chains of thought unoptimized.

Other news

Industry

Government

See also: CAIS website, X account for CAIS, our paper on superintelligence strategy, our AI safety course, and our feedback form.

