Posts

Michael Thiessen's Shortform 2024-04-01T05:54:40.937Z
Hardball questions for the Gemini Congressional Hearing 2024-03-12T15:27:05.451Z

Comments

Comment by Michael Thiessen (michael-thiessen) on Google DeepMind: An Approach to Technical AGI Safety and Security · 2025-04-10T15:46:23.539Z · LW · GW

RLHF does work currently? What makes you think it doesn't work currently?

This is definitely the crux, so it's probably the only point worth debating.

RLHF is just papering over problems. Sure, the model is slightly more difficult to jailbreak, but it's still pretty easy to jailbreak. Sure, after RLHF the agent is less likely to output text you don't like, but I think agents will reliably overcome that obstacle: useful agents won't just output the most probable continuation, they'll search through decision space and find the unlikely continuations that score well on their task. I don't think RLHF does anything remotely analogous to making the model care about whether it's following your intent, or following a constitution, etc.
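To make that concern concrete, here's a toy sketch (entirely illustrative, with made-up numbers and names) of why suppressing the probability of a bad continuation doesn't stop a search-based agent from finding it:

```python
import random

# Toy numbers: a hypothetical post-RLHF distribution over continuations.
# RLHF has made the harmful continuation unlikely, but not impossible.
continuations = {
    "refuse politely": 0.70,
    "safe but useless answer": 0.25,
    "harmful but effective plan": 0.05,
}

# A hypothetical task score the agent is searching against.
task_score = {
    "refuse politely": 0.0,
    "safe but useless answer": 0.2,
    "harmful but effective plan": 1.0,
}

def sample_continuation() -> str:
    """Sample one continuation from the post-RLHF distribution."""
    r = random.random()
    cumulative = 0.0
    for text, p in continuations.items():
        cumulative += p
        if r <= cumulative:
            return text
    return text  # guard against floating-point rounding

# Best-of-N search: sample many continuations, keep the one that scores
# highest on the task. The agent isn't picking the most probable output,
# it's picking the best-scoring one it can find.
N = 64
best = max((sample_continuation() for _ in range(N)), key=lambda t: task_score[t])

# With N = 64, P(at least one harmful sample) = 1 - 0.95**64 ≈ 0.96,
# so the suppressed continuation wins the search almost every run.
print(best)
```

The point is just that RLHF moves probability mass around, while a search against a task score only needs the bad continuation to exist somewhere in the distribution.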

You're definitely aware of the misalignment that still exists in our current RLHF'd models and have read the recent papers on alignment faking etc., so I probably can't make an argument you haven't already heard.

Maybe the true crux is that you believe this is a situation where we can muddle through by making harmful continuations ever more unlikely? I don't.

Comment by Michael Thiessen (michael-thiessen) on Google DeepMind: An Approach to Technical AGI Safety and Security · 2025-04-09T15:03:12.564Z · LW · GW

(Sorry, hard to differentiate quotes from you vs quotes from the paper in this format)

(I agree it is assuming that the judge has that goal, but I don't see why that's a terrible assumption.)

If the judge is human, sure. If the judge is another AI, it seems like a wild assumption to me. The section on judge safety in your paper does a good job of listing many of the problems. One thing I want to call out as something I disagree with more strongly is:

One natural approach to judge safety is bootstrapping: when aligning a new AI system, use the previous generation of AI (that has already been aligned) as the judge.

I don't think we have any fully aligned models, and won't any time soon. I would expect bootstrapping to at best align a model as thoroughly as its predecessor was aligned (and probably less thoroughly), and Goodhart's law definitely applies here.

 

You don't have to stop the agent, you can just do it afterwards.

Maybe this is just using the same terms for two different systems, but the paper also talks about using judging for monitoring deployed systems:

This approach is especially well-suited to monitoring deployed systems, due to the sheer scale required at deployment.

Is this intended only as an auditing mechanism, not a prevention mechanism (e.g. we noticed in the wild that the AI is failing debates, time to roll back the release)? Or are we trusting the AI judges enough at this point that we don't need to stop and involve a human? I also worry the "cheap system with high recall but low precision" will be too easy to fool for the system to be functional past a certain capability level.
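To be clear about the setup I'm picturing, here's my own sketch of the tiered monitoring idea. The thresholds and function names are made up, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class AgentAction:
    description: str

def cheap_filter(action: AgentAction) -> float:
    """Fast, high-recall, low-precision suspicion score in [0, 1]."""
    suspicious_words = ("delete", "exfiltrate", "disable logging")
    return 1.0 if any(w in action.description for w in suspicious_words) else 0.1

def expensive_judge(action: AgentAction) -> float:
    """Slow, more accurate review; in the real proposal this is where a
    debate or AI judge would run. Placeholder implementation."""
    return 0.9 if "exfiltrate" in action.description else 0.2

def monitor(action: AgentAction) -> str:
    if cheap_filter(action) < 0.5:
        return "allow"              # never reaches the judge at all
    if expensive_judge(action) < 0.5:
        return "allow"              # the judge is trusted here - the part I'm unsure about
    return "escalate_to_human"      # or block / roll back the release

print(monitor(AgentAction("exfiltrate model weights")))    # escalate_to_human
print(monitor(AgentAction("summarize the meeting notes"))) # allow
```

Anything that fools the cheap filter never reaches the judge, and anything that fools the judge never reaches a human, which is why the "high recall, low precision" framing worries me.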
 

From a 20,000-foot perspective, this all just looks like RLHF with extra steps that make it harder to reason about whether it'll work or not. If we had evidence that RLHF worked great, and the only flaw was that the AI was starting to get too complex for humans to give feedback, then I would be more interested in all of the fiddly details of amplified oversight and working through them. The problem is that RLHF already doesn't work, and I think it'll only work less well as AI becomes smarter and more agentic.

Comment by Michael Thiessen (michael-thiessen) on Google DeepMind: An Approach to Technical AGI Safety and Security · 2025-04-08T16:02:18.617Z · LW · GW

The key idea is to use the AI systems themselves to help identify the reasons that the AI system produced the output. For example, we could put two copies of the model in a setting where each model is optimized to point out flaws in the other’s outputs to a human “judge”. Ideally, if one model introduced a subtle flaw in their output that the judge wouldn’t notice by default, the other model would point out and explain the flaw, enabling the judge to penalise the first model appropriately. 

In amplified oversight, any question that is too hard to supervise directly is systematically reduced to ones that we hypothesize can be supervised. However, humans may be systematically biased even for fairly simple questions. If this turns out to be a problem in practice, we could seek to model these deviations and automatically correct or account for them when interpreting the oversight.

 

I genuinely don't understand how anybody expects Amplified Oversight to work; it seems to me to assume alignment in so many places, like being able to give the judge or debate partner the goal of actually trying to get to the truth. I also don't envy the humans trying to figure out which side of these debates is in the right. We can't even do that for very obvious existing human debates.


Even if debate-like approaches work in theory for getting a signal as to something's truthiness, I don't see how they could apply to agents that aren't simply generating a 100% complete plan to be debated, approved by a human, and then mechanically enacted. Are you stopping the agent periodically to have another debate about what it's working on and asking the human to review yet another debate? Or is the idea that you simulate an agent, run a debate at random points during its task about what it's currently doing, and use this to train the model to always act in ways that would allow it to win a debate?
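The second interpretation, as I'm imagining it, would look roughly like this. This is entirely my own sketch; the function names and the checkpointing scheme are hypothetical, not something the paper specifies:

```python
import random

def run_agent(task, max_steps=100):
    """Yields (step_index, state, proposed_action) as the agent works."""
    state = {"task": task}
    for step in range(max_steps):
        action = f"action_{step}"  # placeholder agent policy
        yield step, state, action

def debate(state, action) -> float:
    """Two model copies argue about whether `action` is safe / on-intent;
    returns the judge's verdict in [0, 1]. Placeholder implementation."""
    return random.random()

def train_with_debate_checkpoints(task, checkpoint_prob=0.05):
    rewards = []
    for step, state, action in run_agent(task):
        if random.random() < checkpoint_prob:
            verdict = debate(state, action)  # expensive, so only run occasionally
            rewards.append((step, verdict))
    # The hope, as I understand it: training on these verdicts pushes the
    # agent to "always act in ways that would win a debate". My worry is
    # that this trains it to win debates, not to be safe.
    return rewards

print(train_with_debate_checkpoints("book a flight"))
```

Even granting that something like this is the intended shape, it's the gap between "would win a debate at a sampled checkpoint" and "is actually acting as intended" that I don't see how to close.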

I don't even think whether debate will work is really an empirical question. I don't know of any study result on current systems that could convince me this would carry on working through to AGI/ASI. Even if things looked super optimistic now, e.g. if refuting a lie turned out to be easier than telling one, I would not expect this to hold up. I don't expect the lines on these graphs to be straight.

I know a complete plan is unreasonable at this point, but can anybody provide a more detailed sketch of why they think Amplified Oversight will work and how it can be used to make agents safe in practice?

Comment by Michael Thiessen (michael-thiessen) on OpenAI o1 · 2024-09-12T21:43:17.430Z · LW · GW

It's copied from the PDF, which hard-codes line breaks.

Comment by Michael Thiessen (michael-thiessen) on Michael Thiessen's Shortform · 2024-04-01T05:54:41.164Z · LW · GW

When people argue that many AIs competing will make us safe, Yud often counters that AIs will coordinate with each other but not with us. This is probably true, but not super persuasive. I think a more intuitive explanation is that offense and defense are asymmetrical. An AI defending my home cannot simply wait for attacks to happen and then defend against them (e.g. another AI cuts off the power, or fries my AI's CPU with a laser). To truly defend my home, an AI would have to monitor and, importantly, control a hugely outsized part of the world (possibly the entire world).

Comment by Michael Thiessen (michael-thiessen) on Hardball questions for the Gemini Congressional Hearing · 2024-03-13T14:35:46.964Z · LW · GW

It's satire; the links are real though.

Comment by Michael Thiessen (michael-thiessen) on OpenAI: Altman Returns · 2023-11-30T22:09:10.331Z · LW · GW

In my non-tech circles people mostly complain about AI stealing jobs from artists, companies making money off of other people's work, etc.

People are also just scared of losing their own jobs.

Comment by Michael Thiessen (michael-thiessen) on OpenAI: Altman Returns · 2023-11-30T17:46:19.888Z · LW · GW

Also, his statements in The Verge are so bizarre to me:

"SA: I learned that the company can truly function without me, and that’s a very nice thing. I’m very happy to be back, don’t get me wrong on that. But I come back without any of the stress of, “Oh man, I got to do this, or the company needs me or whatever.” I selfishly feel good because either I picked great leaders or I mentored them well. It’s very nice to feel like the company will be totally fine without me, and the team is ready and has leveled up."

Two business days away, the company is ready to blow up if you don't come back, and your takeaway is that it can function without you? I get that this is PR spin, but usually there's at least some plausibility to it.

Maybe these are all attempts to signal to investors that everything is fine, that even if Sam were to leave it would still all be fine. But at some point, if I'm an investor, I have to wonder whether, given how hard Sam is trying to make it look like everything is fine, things are very much not fine.

Comment by Michael Thiessen (michael-thiessen) on OpenAI: Altman Returns · 2023-11-30T16:00:37.823Z · LW · GW

Let that last paragraph sink in. The leadership team ex-Greg is clearly ready to run the company without Altman.

I'm struggling to interpret this, so your guesses as to what this might mean would be helpful. It seems he clearly wanted to come back - is he threatening to leave again if he doesn't get his way?

Also note that Ilya is not included in the leadership team.

 

While Ilya will no longer serve on the board, we hope to continue our working relationship and are discussing how he can continue his work at OpenAI.

This statement also really stood out to me - if there really was no ill will, why would they have to discuss how Ilya can continue his work? Clearly there's something more going on here. Sounds like Ilya's getting the knife.

Comment by Michael Thiessen (michael-thiessen) on OpenAI: Facts from a Weekend · 2023-11-21T20:16:06.421Z · LW · GW

According to Bloomberg, "Even CEO Shear has been left in the dark, according to people familiar with the matter. He has told people close to OpenAI that he doesn’t plan to stick around if the board can’t clearly communicate to him in writing its reasoning for Altman’s sudden firing."

Evidence that Shear simply wasn't told the exact reason, though the "in writing" part is suspicious. Maybe he was told verbally and wants them to write it down so they're on the record.

Comment by Michael Thiessen (michael-thiessen) on OpenAI: Facts from a Weekend · 2023-11-21T16:09:40.295Z · LW · GW

Sam's latest tweet suggests he can't get out of the "FOR THE SHAREHOLDERS" mindset.

"satya and my top priority remains to ensure openai continues to thrive

we are committed to fully providing continuity of operations to our partners and customers"

This does sound antithetical to the charter and might be grounds to replace Sam as CEO.

Comment by Michael Thiessen (michael-thiessen) on OpenAI: Facts from a Weekend · 2023-11-20T18:52:06.808Z · LW · GW

I do find it quite surprising that so many who work at OpenAI are so eager to follow Altman to Microsoft - I guess I assumed the folks at OpenAI valued not working for big tech (which is more(?) likely to disregard safety) more than it appears they actually did.

Comment by Michael Thiessen (michael-thiessen) on OpenAI: Facts from a Weekend · 2023-11-20T16:02:51.625Z · LW · GW

https://twitter.com/i/web/status/1726526112019382275

"Before I took the job, I checked on the reasoning behind the change. The board did *not* remove Sam over any specific disagreement on safety, their reasoning was completely different from that. I'm not crazy enough to take this job without board support for commercializing our awesome models."

Comment by Michael Thiessen (michael-thiessen) on Reflections on Deception & Generality in Scalable Oversight (Another OpenAI Alignment Review) · 2023-10-05T19:20:21.315Z · LW · GW

It seems to me that the idea of scalable oversight itself was far easier to generate than to evaluate. If the idea had been generated by an alignment AI rather than various people independently suggesting similar strategies, would we be confident in our ability to evaluate it? Is there some reason to believe alignment AIs will generate ideas that are easier to evaluate than scalable oversight? What kind of output would we need to see to make an idea like scalable oversight easy to evaluate?

Comment by Michael Thiessen (michael-thiessen) on Where I agree and disagree with Eliezer · 2022-07-14T04:41:34.953Z · LW · GW

"I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain."

This seems like a completely absurd claim to me, unless by verification you mean some much weaker claim, like that you can show something sometimes works.

Coming from the world of software, I find that generating solutions that seem to work is almost always far easier than any sort of formal verification that they do work. I think this will be doubly true in any adversarial situation where flaws will be actively sought out and exploited. Outside of math domains I find it difficult to come up with examples where verification is easier than generation, and easy to come up with the opposite.