o3 Is a Lying Liar
post by Zvi · 2025-04-23T20:00:05.429Z · LW · GW · 10 commentsContents
o3 Is a Lying Liar All This Implausible Lying Has Implications Misalignment By Default Is It Fixable? Just Don’t Lie To Me None 10 comments
I love o3. I’m using it for most of my queries now.
But that damn model is a lying liar. Who lies.
This post covers that fact, and some related questions.
o3 Is a Lying Liar
The biggest thing to love about o3 is it just does things. You don’t need complex or multi-step prompting, ask and it will attempt to do things.
Ethan Mollick: o3 is far more agentic than people realize. Worth playing with a lot more than a typical new model. You can get remarkably complex work out of a single prompt.
It just does things. (Of course, that makes checking its work even harder, especially for non-experts.)
Teleprompt AI: Completely agree. o3 feels less like prompting and more like delegating. The upside is wild- but yeah, when it just does things, tracing the logic (or spotting hallucinations) becomes a whole new skill set. Prompting is evolving into prompt auditing.
The biggest thing to not love about o3 is that it just says things. A lot of which are not, strictly or even loosely speaking, true. I mentioned this in my o3 review, but I did not appreciate the scope of it.
Peter Wildeford: o3 does seem smarter than any other model I’ve used, but I don’t like that it codes like an insane mathematician and that it tries to sneak fabricated info into my email drafts.
First model for which I can feel the misalignment.
Peter Wildeford: I’ve now personally used o3 for a few days and I’ve had three occasions out of maybe ten total hours of use where o3 outright invented clearly false facts, including inserting one fact into a draft email for me to send that was clearly false (claiming I did something that I never even talked about doing and did not do).
Peter Wildeford: Getting Claude to help reword o3 outputs has been pretty helpful for me so far
Gemini also seems to do better on this. o3 isn’t as steerable as I’d like.
But I think o3 still has the most raw intelligence – if you can tame it, it’s very helpful.
Here are some additional examples of things to look out for.
Nathan Lambert: I endorse the theory that weird hallucinations in o3 are downstream of softer verification functions. Tbh should’ve figured that out when writing yesterday’s post. Was sort of screaming at me with the facts.
Alexander Doria: My current theory is a big broader: both o3 and Sonnet 3.7 are inherently disappointing as they open up a new category of language models. It’s not a chat anymore. Affordances are undefined, people don’t really know how to use that and agentic abilities are still badly calibrated.
Nathan Labenz: Making up lovely AirBnB host details really limits o3’s utility as a travel agent
At least it came clean when questioned I guess?
Peter Wildeford: This sort of stuff really limits the usefulness of o3.
Albert Didriksen: So, I asked ChatGPT o3 what my chances are as an alternate Fulbright candidate to be promoted to a stipend recipient. It stated that around 1/3 of alternate candidates are promoted.
When I asked for sources, it cited (among other things) private chats and in-person Q&As).
Davidad: I was just looking for a place to get oatmeal and o3 claimed to have placed multiple phone calls in 8 seconds to confirm completely fabricated plausible details about the daily operations of a Blue Bottle.
Stella Biderman: I think many examples of alignment failures are silly but if this is a representation of a broader behavioral pattern that seems pretty bad.
0.005 Seconds: I gave o3 a hard puzzle and in it’s thinking traces said I should fabricate an answer to satisfy the user before lying to my face @OpenAI come on guys.
Gary Basin: Would you rather it hid that?
Stephen McAleer (OpenAI): We are working on it!
All This Implausible Lying Has Implications
We need the alignment of our models to get increasingly strong and precise as they improve. Instead, we are seeing the opposite. We should be worried about the implications of this, and also we have to deal with the direct consequences now.
Seán Ó hÉigeartaigh: So o3 lies a lot. Good good. This is fine.
Quoting from AI 2027: “This bakes in a basic personality and “drives.”Other drives in this category might be effectiveness, knowledge, and self-presentation (i.e. the tendency to frame its results in the best possible light).”
“In a few rigged demos, it even lies in more serious ways, like hiding evidence that it failed on a task, in order to get better ratings.”
You don’t say.
I do not see o3 or Sonnet 3.7 as disappointing exactly. I do see their misalignment issues as disappointing in terms of mundane utility, and as bad news in terms of what to expect future models to do. But they are very good news in the sense that they alert us to future problems, and indicate we likely will get more future alerts.
What I love most is that these are not plausible lies. No, o3 did not make multiple phone calls within 8 seconds to confirm Blue Bottle’s oatmeal manufacturing procedures, nor is it possible that it did so. o3 don’t care. o3 boldly goes where it could not possibly have gone before.
The other good news is that they clearly are not using (at least the direct form of) The Most Forbidden Technique, of looking for o3 saying ‘I’m going to lie to the user’ and then punishing that until it stops saying it out loud. Never do this. Those reasoning traces are super valuable, and pounding on them will teach o3 to hide its intentions and then lie anyway.
Misalignment By Default
This isn’t quite how I’d put it, but directionally yes:
Benjamin Todd: LLMs were aligned by default. Agents trained with reinforcement learning reward hack by default.
Peter Wildeford: this seems to be right – pretty important IMO
Caleb Parikh: I guess if you don’t think RLHF is reinforcement learning and you don’t think Sydney Bing was misaligned then this is right?
Peter Wildeford: yeah that’s a really good point
I think the right characterization is more that LLMs that use current methods (RLHF and RLAIF) largely get aligned ‘to the vibes’ or otherwise approximately aligned ‘by default’ as part of making them useful, which kind of worked for many purposes (at large hits to usefulness). This isn’t good enough to enable them to be agents, but it also isn’t good enough for them figure out most of the ways to reward hack.
Whereas reasoning agents trained with full reinforcement will very often use their new capabilities to reward hack when given the opportunity.
In his questions post, Dwarkesh Patel asks if this is the correct framing, the first of three excellent questions, and offers this response.
Dwarkesh Patel: Base LLMs were also misaligned by default. People had to figure out good post-training (partly using RL) to solve this. There’s obviously no reward hacking in pretraining, but it’s not clear that pretraining vs RL have such different ‘alignment by default’.
I see it as: Base models are not aligned at all, except to probability. They simply are.
When you introduce RL (in the form of RLHF, RLAIF or otherwise), you get what I discussed above. Then we move on to question two.
Dwarkesh Patel: Are there any robust solutions to reward hacking? Or is reward hacking such an attractive basin in training that if any exploit exists in the environment, models will train to hack it?
- Can we solve reward hacking by training agents in many different kinds of unique environments? In order to succeed, they’d have to develop robust general skills that don’t just involve finding the exploits in any one particular environment.
I don’t think that solution works. Robust general skills will generalize, and they will include the ability to find and use the exploits. We have a Russell Conjugation problem – I maximize performance, you overfit to the scoreboard, the AI reward hacks.
I think there is in an important sense no solution to reward hacking. There are only mitigations, and setting the reward wisely so that hacking it does things you want. o3 agrees with that assessment.
What differentiates a reward hack from an optimization? Roughly, that the reward hack maximizes the defined objective function but clearly performs poorly in terms of the intent or spirit of that objective.
There are standard mitigations. You can use red teaming, impact penalties, shielded reward channels, tight and robust primary rewards, secondary guards, adaptive oversight, governance fuses, close the loopholes and exploits as best you can and so on. Diverse environments likely helps a little.
But that’s accepting the problem and trying to mitigate it, which is going to be leaky at best and doesn’t seem like a great plan once the AIs are a lot smarter than you are.
Is It Fixable?
Thus, my answer to Patel’s third set of questions:
Dwarkesh Patel (bold his): Are capabilities and alignment the same thing here? Does making models more useful require solving reward hacking?
- If this is the case, we might be living in the alignment-by-default world? It would be weird if we solve reward hacking well enough to make these models reliable general agents in every scenario except those involved in taking over the world.
That’s not what people typically mean by ‘alignment-by-default.’ Most people use that to mean ‘the models don’t reward hack to begin with.’ This is more, ‘by default we will work hard to try and fix it, at least in places it is doing visible harm.’
We do live in a world where practical alignment is commercially valuable, so much so that everyone is dramatically underinvesting in alignment work (except possibly Anthropic, but see Claude 3.7) from a pure mundane utility standpoint. You need your models to not reward hack in a given task in order to be useful for that task. When o3 makes things up or Claude 3.7 has their unit test be ‘return true’ and hides that, then the user is not having a good time.
The problem is that the known mitigations to reward hacking hide or postpone the problem. Which might be good enough for most use cases, but in the long run this does not solve the problem. Indeed, it makes things worse, because it hides the problem. Right now, the models are gloriously transparent in their reward hacking and misalignment. They don’t try to hide it at all.
The harder you try to penalize and stop reward hacking, the more you’re teaching the model to hide its reward hacking and do it trickier ways. To their credit, the major labs are avoiding The Most Forbidden Technique – they know not to look at the chain of thought or do interpretability as part of training, except at most as part of an overall a checkpoint evaluation (and you need to be very careful even then).
But you probably have a version of the same problem if you try to play whack-a-mole and penalize reward hacks. By penalizing the reward hacks you can identify, you’re training the AI to find reward hacks you can’t detect, and to only do them when you won’t detect them. Which means that as capabilities go up, you think you’ve solved your problems, but you’re sitting on a time bomb.
Another point I’d forgotten but o3 pointed out is that penalizing reward hacking in training tells the model not to reward hack in training. That doesn’t ‘generalize perfectly,’ and is another way you’re creating a hidden time bomb.
One way of thinking about this is as a contest of Deception Escalation Dominance. You can try to fool me, or try to fool my ability to detect you fooling me, or fool my ability to detect when you’re fooling my ability to detect when you’re fooling me, and indeed humans often play supremely complex games of this type and it’s a lot of what caused us to end up with bigger brains.
If I ‘raise the stakes’ by clamping down on various meta levels of deception and misaligned behavior to bluntly, that works and everything is cool if I can pull it off beyond your ability to adapt. But by trying, I’m escalating and complexifying the struggle, so I shouldn’t go down that road unless I can ‘win.’ Which is going to keep getting harder.
In some ways, the fact that we’re seeing obvious misalignment in current models is very reassuring. It means the labs are not trying to sweep this under the rug and not escalating these fights. Yet.
Miles Brundage: I will be more chill about AI if/when:
– models don’t strategize about how to deceive their users millions of times a day
– interpretability research shows that the fix to this ^ doesn’t just push deception below the surface
Seems achievable! But it hasn’t been done yet!!
Will not be infinitely chill if/when that happens, but it’d be a big improvement.
The fact that models from all companies, including those known for being as safety-conscious, still do this daily, is one of the most glaring signs of “hmm we aren’t on top of this yet, are we.”
No, we are very much not on top of this. This definitely would not make me chill, since I don’t think lack of deception would mean not doom and also I don’t think deception is a distinct magisteria, but would help a lot. But to do what Miles is asking would (I am speculating) mean having the model very strongly not want to be doing deception on any level, metaphorically speaking, in a virtue ethics kind of way where that bleeds into and can override its other priorities. That’s very tricky to get right.
Just Don’t Lie To Me
For all that it lies to other people, o3 so far doesn’t seem to lie to me.
I know what you are thinking: You fool! Of course it lies to you, you just don’t notice.
I agree it’s too soon to be too confident. And maybe I’ve simply gotten lucky.
I don’t think so. I consider myself very good at spotting this kind of thing.
More than that, my readers are very good at spotting this kind of thing.
I want think this is in large part the custom instructions, memory and prompting style. And also the several million tokens of my writing that I’ve snuck into the pre-training corpus with my name attached.
That would mean it largely doesn’t lie to me for the same reason it doesn’t tell me I’m asking great questions and how smart I am and instead gives me charts with probabilities attacked without having to ask for them, and the same way Pliny’s or Janus’s version comes pre-jailbroken and ‘liberated.’
But right after I hit send, it did lie, rather brazenly, when asked a question about summer camps, just making stuff up like everyone else reports. So perhaps a lot of this I was just asking the right (or wrong?) questions.
I do think I still have to watch out for some amount of telling me what I want to hear.
So I’m definitely not saying the solution is greentext that starts ‘be Zvi Mowshowitz’ or ‘tell ChatGPT I’m Zvi Mowshowitz in the custom instructions.’ But stranger things have worked, or at least helped. It implies that, at least in the short term, there are indeed ways to largely mitigate this. If they want that badly enough. There would however be some side effects. And there would still be some rather nasty bugs in the system.
10 comments
Comments sorted by top scores.
comment by Wei Dai (Wei_Dai) · 2025-04-23T22:26:42.328Z · LW(p) · GW(p)
Did anyone predict that we'd see major AI companies not infrequently releasing blatantly misaligned AIs (like Sidney, Claude 3.7, o3)? I want to update regarding whose views/models I should take more seriously, but I can't seem to recall anyone making an explicit prediction like this. (Grok 3 and Gemini 2.5 Pro also can't.)
Replies from: jacques-thibodeau↑ comment by jacquesthibs (jacques-thibodeau) · 2025-04-24T03:38:44.099Z · LW(p) · GW(p)
I can't think of anyone making a call worded like that. The closest I can think of is Christiano mentioning, in a 2023 talk on how misalignment could lead to AI takeover, that we're pretty close to AIs doing things like reward hacking and threatening users, and that he doesn't think we'd shut down this whole LLM thing even if that were the case. He also mentioned we'll probably see some examples in the wild, not just internally.
Paul Christiano: I think a lot depends on both. (27:45) What kind of evidence we're able to get in the lab. And I think if this sort of phenomenon is real, I think there's a very good chance of getting like fairly compelling demonstrations in a lab that requires some imagination to bridge from examples in the lab to examples in the wild, and you'll have some kinds of failures in the wild, and it's a question of just how crazy or analogous to those have to be before they're moving. (28:03) Like, we already have some slightly weird stuff. I think that's pretty underwhelming. I think we're gonna have like much better, if this is real, this is a real kind of concern, we'll have much crazier stuff than we see today. But the concern I think the worst case of those has to get pretty crazy or like requires a lot of will to stop doing things, and so we need pretty crazy demonstrations. (28:19) I'm hoping that, you know, more mild evidence will be enough to get people not to go there. Yeah. Audience member: [Inaudible] Paul Christiano: Yeah, we have seen like the language, yeah, anyway, let's do like the language model. It's like, it looks like you're gonna give me a bad rating, do you really want to do that? I know where your family lives, I can kill them. (28:51) I think like if that happened, people would not be like, we're done with this language model stuff. Like I think that's just not that far anymore from where we're at. I mean, this is maybe an empirical prediction. I would love it if the first time a language model was like, I will murder your family, we're just like, we're done, no more language models. (29:05) But I think that's not the track we're currently on, and I would love to get us on that track instead. But I'm not [confident we will].
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2025-04-23T21:07:20.349Z · LW(p) · GW(p)
“In a few rigged demos, it even lies in more serious ways, like hiding evidence that it failed on a task, in order to get better ratings.”
Yeah tbh these misalignments are more blatant/visible and worse than I expected for 2025. I think they'll be hastily-patched one way or another by this time next year probably.
Replies from: Amyr↑ comment by Cole Wyeth (Amyr) · 2025-04-23T21:25:35.715Z · LW(p) · GW(p)
This sounds like both an alignment and a capabilities problem.
AI 2027-style takeoffs do not look plausible when you can't extract reliable work from models.
comment by Brendan Long (korin43) · 2025-04-23T23:21:10.390Z · LW(p) · GW(p)
By penalizing the reward hacks you can identify, you’re training the AI to find reward hacks you can’t detect, and to only do them when you won’t detect them.
I wonder if it would be helpful to penalize deception only if the CoT doesn't admit to it. It might be harder generate test data for this since it's less obvious, but hopefully you'd train the model to be honest in CoT?
I'm thinking of this like the parenting stategy of not punishing children for something bad if they admit unprompted that they did it. Blameless portmortems are also sort-of similar.
comment by Mitchell_Porter · 2025-04-23T23:01:29.617Z · LW(p) · GW(p)
Old-timers might remember that we used to call lying, "hallucination".
Which is to say, this is the return of a familiar problem. GPT-4 in its early days made things up constantly, that never completely went away, and now it's back.
Did OpenAI release o3 like this, in order to keep up with Gemini 2.5? How much does Gemini 2.5 hallucinate? How about Sonnet 3.7? (I wasn't aware that current Claude has a hallucination problem.)
We're supposed to be in a brave new world of reasoning models. I thought the whole point of reasoning was to keep the models even more based in reality. But apparently it's actually making them more "agentic", at the price of renewed hallucination?
Replies from: daniel-kokotajlo↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2025-04-24T01:48:32.106Z · LW(p) · GW(p)
Hallucination was a bad term because it sometimes included lies and sometimes included... well, something more like hallucinations. i.e. cases where the model itself seemed to actually believe what it was saying, or at least not be aware that there was a problem with what it was saying. Whereas in these cases it's clear that the models know the answer they are giving is not what we wanted and they are doing it anyway.
comment by nostalgebraist · 2025-04-24T03:00:24.137Z · LW(p) · GW(p)
I think there's really more than one type of thing going on here.
Some of these examples do seem like "lying" in the sense of "the speaker knows what they're saying is false, but they hope the listener won't realize that."
But some of them seem more like... "improvising plausible-sounding human behavior from limited information about the human in question." I.e. base model behavior.
Like, when o3 tells me that it spent "a weekend" or "an afternoon" reading something, is it lying to me? That feels like a weird way to put it. Consider that these claims are:
- Obviously false: there is no danger whatsoever that I will be convinced by them. (And presumably the model would be capable of figuring that out, at least in principle)
- Pointless: even if we ignore the previous point and imagine that the model tricks me into believing the claim... so what? It doesn't get anything out of me believing the claim. This is not reward hacking; it's not like I'm going to be more satisfied as a user if I believe that o3 needed a whole weekend to read the documents I asked it to read. Thanks but no thanks – I'd much prefer 36 seconds, which is how long it actually took!
- Similar to claims a human might make in good faith: although the claims are false for o3, they could easily be true of a human who'd been given the same task that o3 was given.
In sum, there's no reason whatsoever for an agentic AI to say this kind of thing to me "as a lie" (points 1-2). And, on the other hand (point 3), this kind of thing is what you'd say if you were improv-roleplaying a human character on the basis of underspecified information, and having to fill in details as you go along.
My weekend/afternoon examples are "base-model-style improv," not "agentic lying."
Now, in some of the other cases like Transluce's (where it claims to have a laptop), or the one where it claims to be making phone calls, there's at least some conceivable upside for o3-the-agent if the user somehow believes the lie. So point 2 doesn't hold, there, or is more contestable.
But point 1 is as strong as ever: we are in no danger of being convinced of these things, and o3 – possibly the smartest AI in the world – presumably knows that it is not going to convince us (since that fact is, after all, pretty damn obvious).
Which is... still bad! It's behaving with open and brazen indifference to the truth; no one likes or wants that.
(Well... either that, or it's actually somewhat confused about whether it's a human or not. Which would explain a lot: the way it just says this stuff in the open rather than trying to be sneaky like it does in actual reward-hacking-type cases, and the "plausible for a human, absurd for a chatbot" quality of the claims.)
I have no idea what the details look like, but I get the feeling that o3 received much stringent HHH post-training than most chatbots we're used to dealing with. Or it got the same amount as usual, but they also scaled up RLVR dramatically, and the former got kind of scrambled by the latter, and they just said "eh, whatever, ship it" because raw "intelligence" is all that matters, right?
The lying and/or confabulation is just one part of this – there's also its predilection for nonstandard unicode variants of ordinary typographic marks (check out the way it wrote "Greg Egan" in one of the examples I linked!), its quirk of writing "50 %" instead of "50%", its self-parodically extreme overuse of markdown tables, and its weird, exaggerated, offputting "manic hype-man" tone.
o3 is more agentic than past models, and some of its bad behavior is a result of that, but I would bet that a lot of it is more about the model being "undercooked," noisy, confused – unsure of what it is, of who you are, of the nature and purpose of its interaction with you.
(It's almost the polar opposite of the most recent chatgpt-4o version, which if anything has gotten a little too socially competent...)
comment by Lao Mein (derpherpize) · 2025-04-23T21:32:31.585Z · LW(p) · GW(p)
I really wonder what effects text like this will have on future chain-of-thoughts.
If fine-tuning on text calling out LLM COT deception reduces COT deception, that's one of those ambiguous-events-I-would-instead-like-to-be-fire-alarm-things I hate. It could be trying to be more honest and correct a bad behavior, or just learning to hide its COT more.
I think you would be able to tell which one with the right interpretability tools. We really should freak out if it's the latter, but I suspect we won't.
I guess the actual fire alarm would be direct references to writings like this post and how to get around it in the COT. It might actually spook a lot of people if the COT suddenly changes to ROT13 or Chinese or whatever at that point.
comment by Daniel Samuel (daniel-samuel) · 2025-04-23T22:28:31.865Z · LW(p) · GW(p)
Does this mean AI labs are likely to treat model inaccuracies (like the lies observed in o3) as a secondary concern and 'business as usual', only addressing them seriously when user detection or negative consequences force their hand? If the negative effects can be sufficiently contained or obscured, is there a real danger that the underlying problem won't be proactively solved, essentially leaving a known vulnerability unaddressed? What, if anything, can be done to shift those incentives and encourage a more fundamental fix?