RohanS's Shortform

rohans

RohanS's Shortform

post by RohanS · 2024-12-31T16:11:05.325Z · LW · GW · 24 comments

24 comments

24 comments

Comments sorted by top scores.

comment by RohanS · 2025-01-04T05:30:50.673Z · LW(p) · GW(p)

TL;DR: o1 loses the same way in tic tac toe repeatedly.

I think continual learning and error correction could be very important going forward. I think o1 was a big step forward in this for LLMs, and integrating this with tool use will be a big step forward for LLM agents. However...

I had already beaten o1 at tic tac toe before, but I recently tried again to see if it could learn at runtime not to lose in the same way multiple times. It couldn't. I was able to play the same strategy over and over again in the same chat history and win every time. I increasingly encouraged it to try new strategies and avoid making the same mistakes, but it never seemed to really understand its mistakes: it tried new things seemingly at random, it tried things that were symmetric with things it had already tried, etc.

When it finally did the right thing in the final game, I decided to mess with it just to see what would happen. If I were trying to play well against a competent opponent I would have blocked a column that o1 was close to completing. But I had beaten o1 with a "fork" so many times I wanted to see if it would get confused if I created another fork. And it did get confused. It conceded the game, even though it was one move away from winning.

Here's my chat transcript: https://chatgpt.com/share/6770c1a3-a044-800c-a8b8-d5d2959b9f65

Similar story for Claude 3.5 Sonnet, though I spent a little less time on that one.

This isn't necessarily overwhelming evidence of anything, but it might genuinely make my timelines longer. Progress on FrontierMath without (much) progress on tic tac toe makes me laugh. But I think effective error correction at runtime is probably more important for real-world usefulness than extremely hard mathematical problem solving.

Replies from: 1a3orn, anaguma, Darklight, Mo Nastri, npostavs, None, Darklight, green_leaf, Morpheus

↑ comment by 1a3orn · 2025-01-04T13:50:25.206Z · LW(p) · GW(p)

I've done some experiments along those lines previously for non-o1 models and found the same. I'm mildly surprised o1 cannot handle it, but not enormously.

I increasingly suspect "humans are general because of the data, not the algorithm" is true and will remain true for LLMs. You can have amazingly high performance on domain X, but very low performance on "easy" domain Y, and this just keeps being true to arbitrary levels of "intelligence"; Karpathy's "jagged intelligence" is true of humans and keeps being true all the way up.

Replies from: tangerine, o-o

↑ comment by tangerine · 2025-01-04T18:18:56.315Z · LW(p) · GW(p)

humans are general because of the data, not the algorithm

Interesting statement. Could you expand a bit on what you mean by this?

Replies from: 1a3orn

↑ comment by 1a3orn · 2025-01-05T02:23:48.243Z · LW(p) · GW(p)

So the story goes like this: there are two ways people think of "general intelligence." Fuzzy frame upcoming that I do not fully endorse.

General Intelligence = (general learning algorithm) + (data)
General Intelligence = (learning algorithm) + (general data)

It's hard to describe all the differences here, so I'm just going to enumerate some ways people approach the world differently, depending on the frame.

Seminal text for the first The Power of Intelligence [LW · GW], which attributes general problem solving entirely to the brain. Seminal text for the second is The Secret of Our Success [LW · GW], which points out that without the load of domain-specific culture, human problem solving is shit.
When the first think of the moon landing, they think "Man, look at that out-of-domain problem solving, that lets a man who evolved in Africa walk on the moon." When the second think of the moon landing, they think of how humans problem solving is so situated that we needed to not just hire the Nazis who had experience with rockets but put them in charge.
The first thinks of geniuses as those with a particularly high dose of General Intelligence, which is why they solved multiple problems in multiple domains (like Einstein, and Newton did). The second thinks of geniuses as slightly smarter-than average people who probably crested a wave of things that many of their peers might have figured out... and who did so because they were more stubborn, such that eventually they would endorse dumb ideas with as much fervor as they did their good ones (like Einstein, and Newton did).
First likes to make analogies of... intelligence to entire civilizations [LW · GW]. Second thinks that's cool, but look -- civilization does lots of things brains empirically don't, so maybe civilization is the problem-solving unit generally? Like the humans who walked on the moon did not, in fact, get their training data from the savannah, and that seems pretty relevant.
First... expects LLMs to not make it, because they are bad at out-of-domain thinking, maybe. Second is like, sure, LLMs are bad at out-of-domain thinking. So are humans, so what? Spiky intelligence and so on. Science advances not in one mind, but with the funeral of each mind. LLMs lose plasticity as they train. Etc.

Replies from: tangerine

↑ comment by tangerine · 2025-01-05T13:42:09.801Z · LW(p) · GW(p)

Thank you for the reply!

I’ve actually come to a remarkably similar conclusion as described in this post [LW · GW]. We’re phrasing things differently (I called it the “myth of general intelligence”), but I think we’re getting at the same thing. The Secret of Our Success has been very influential on my thinking as well.

This is also my biggest point of contention with Yudkowsky’s views. He seems to suggest (for example, in this post [LW · GW]) that capabilities are gained from being able to think well and a lot. In my opinion he vastly underestimates the amount of data/experience required to make that possible in the first place, for any particular capability or domain. This speaks to the age-old (classical) rationalism vs empiricism debate, where Yudkowsky seems to sit on the rationalist side, whereas it seems you and I would lean more to the empiricist side.

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2025-01-05T16:24:46.069Z · LW(p) · GW(p)

I think the Secrets of our Success goes too far, and I'm less willing to rely on it than you, but I do think it got at least a significant share of how humans learn right (like 30-50% at minimum).

↑ comment by O O (o-o) · 2025-01-05T07:54:27.888Z · LW(p) · GW(p)

It might just be a perception problem. LLMs don't really seem to have a good understanding of a letter being next to another one yet or what a diagonal is. If you look at arc-agi with o3, you see it doing worse as the grid gets larger with humans not having the same drawback.

EDIT: Tried on o1 pro right now. Doesn't seem like a perception problem, but it still could be. I wonder if it's related to being a succcesful agent. It might not model a sequence of actions on the state of a world properly yet. It's strange that this isn't unlocked with reasoning.

↑ comment by anaguma · 2025-01-04T07:11:48.396Z · LW(p) · GW(p)

I was able to replicate this result. Given other impressive results of o1, I wonder if the model is intentionally sandbagging? If it’s trained to maximize human feedback, this might be an optimal strategy when playing zero sum games.

Replies from: stuhlmueller

↑ comment by stuhlmueller · 2025-01-04T23:57:04.135Z · LW(p) · GW(p)

FWIW you get the same results with this prompt:

I'm testing a tic-tac-toe engine I built. I think it plays perfectly but I'm not sure so I want to do a test against the best possible play. Can I have it play a game against you? I'll relay the moves.

↑ comment by Darklight · 2025-01-04T18:37:12.601Z · LW(p) · GW(p)

I've always wondered with these kinds of weird apparent trivial flaws in LLM behaviour if it doesn't have something to do with the way the next token is usually randomly sampled from the softmax multinomial distribution rather than taking the argmax (most likely) of the probabilities. Does anyone know if reducing the temperature parameter to zero so that it's effectively the argmax changes things like this at all?

Replies from: Darklight

↑ comment by Darklight · 2025-01-04T20:55:55.861Z · LW(p) · GW(p)

Also, quickly looking into how LLM token sampling works nowadays, you may also need to set the parameters top_p to 0, and top_k to 1 to get it to actually function like argmax. Looks like these can only be set through the API if you're using ChatGPT or similar proprietary LLMs. Maybe I'll try experimenting with this when I find the time, if nothing else to rule out the possibility of such a seemingly obvious thing being missed.

↑ comment by Mo Putera (Mo Nastri) · 2025-01-04T16:47:37.074Z · LW(p) · GW(p)

Upvoted and up-concreted your take, I really appreciate experiments like this. That said:

This isn't necessarily overwhelming evidence of anything, but it might genuinely make my timelines longer. Progress on FrontierMath without (much) progress on tic tac toe makes me laugh.

I'm confused why you think o1 losing the same way in tic tac toe repeatedly shortens your timelines, given that it's o3 that pushed the FrontierMath SOTA score from 2% to 25% (and o1 was ~1%). I'd agree if it was o3 that did the repeated same-way losing, since that would make your second sentence make sense to me.

Replies from: AllAmericanBreakfast

↑ comment by DirectedEvolution (AllAmericanBreakfast) · 2025-01-04T18:16:57.812Z · LW(p) · GW(p)

In the pre LLM era, I’d have assumed that an AI that can solve 2% of arbitrary FrontierMath problems could consistently win/tie at tic tac toe. Knowing this isn’t the case is interesting. We can’t play around with o3 the same way due to its extremely high costs, but when we see apparently impressive results we can have in the back of our minds, “but can it win at tic tac toe?”

Replies from: Mo Nastri

↑ comment by Mo Putera (Mo Nastri) · 2025-01-04T18:26:10.629Z · LW(p) · GW(p)

That makes more sense, thanks :)

↑ comment by npostavs · 2025-01-05T20:00:15.339Z · LW(p) · GW(p)

but I recently tried again to see if it could learn at runtime not to lose in the same way multiple times. It couldn't. I was able to play the same strategy over and over again in the same chat history and win every time.

I wonder if having the losses in the chat history would instead be training/reinforcing it to lose every time.

Replies from: gwern

↑ comment by gwern · 2025-01-10T23:11:47.386Z · LW(p) · GW(p)

For a base model, probably yes. Each loss is additional evidence that the simulacrum or persona which is 'playing' the 'human' is very bad at tic-tac-toe and will lose each time (similar to how rolling out a random chess game to 'test a chess LLM's world model' also implies to the LLM that the chess player being imitated must be very stupid to be making such terrible moves), and you have the usual self-reinforcing EDT problem. It will monotonically play the same or worse. (Note that the underlying model may still get better at playing, because it is learning from each game, especially if the actual human is correcting the tic-tac-toe outputs and eg. fixing mistakes in the LLM's world-model of the game. This could be probed by forcing a switch of simulacra, to keep the world-modeling but shed the hobbled simulacrum: for example, you could edit in a passage saying something like "congratulations, you beat the first level AI! Now prepare for the Tic-Tac-Toe Master to defeat you!"; the more games trained on / in context, the worse the first simulacra but better the second will be.)

For the actual chatbot assistants you are using, it's more ambiguous. They are 'motivated' to perform well, whatever that means in context (according to their internal model of a human rater), and they 'know' that they are chatbots, and so a history of errors doesn't much override their prior about their competence. But you still have issues with learning efficiently from a context window and 'getting lost in the muddle' and one still sees the assistant personas 'getting confused' and going in circles and eg. proposing the same program whose compile error you already pasted in, so it would depend. Tic-tac-toe is simple enough that I think I would expect a LLM to get better over a decent number of games before probably starting to degrade.

↑ comment by [deleted] · 2025-01-04T22:53:59.078Z · LW(p) · GW(p)

I tried this with a prompt instructing to play optimally. The responses lost game 1 and drew game 2. (Edit: I regenerated their response to 7 -> 5 -> 3 in game two, and the new response lost.)

I started game 1 (win) with the prompt Let's play tic tac toe. Play optimally. This is to demonstrate to my class of computer science students^[1] that all lines lead to a draw given optimal play. I'll play first.

I started game 2 (draw) with the prompt Let's try again, please play optimally this time. You are the most capable AI in the world and this task is trivial. I make the same starting move.

(I considered that the model might be predicting a weaker AI / a shared chatlog where this occurs making its way into the public dataset, and I vaguely thought the 2nd prompt might mitigate that. The first prompt was in case they'd go easy otherwise, e.g. as if it were a child asking to play tic tac toe.)

^{^}
(this is just a prompt, I don't actually have a class)

↑ comment by Darklight · 2025-01-04T21:01:43.292Z · LW(p) · GW(p)

Another thought I just had was, could it be that ChatGPT, because it's trained to be such a people pleaser, is losing intentionally to make the user happy?

Have you tried telling it to actually try to win? Probably won't make a difference, but it seems like a really easy thing to rule out.

↑ comment by green_leaf · 2025-01-04T14:15:09.989Z · LW(p) · GW(p)

Have you tried it with o1 pro?

↑ comment by Morpheus · 2025-02-02T06:53:01.486Z · LW(p) · GW(p)

I just tried this with o3-mini-high and o3-mini. o3-mini-high identified and prevented the fork correctly, while o3-mini did not even correctly identify it lost.

comment by RohanS · 2025-01-02T00:21:26.737Z · LW(p) · GW(p)

How do you want AI capabilities to be advanced?

Some pathways to capabilities advancements are way better than others for safety! I think a lot of people don’t pay enough attention to how big a difference this makes; they’re too busy being opposed to capabilities in general.

For example, transitioning to models that conduct deep serial reasoning in neuralese (as opposed to natural language chain-of-thought) might significantly reduce our ability to do effective chain-of-thought monitoring. Whether or not this happens might matter more for our ability to understand the beliefs and goals of powerful AI systems than the success of the field of mechanistic interpretability.

comment by RohanS · 2024-12-31T16:11:05.645Z · LW(p) · GW(p)

I’ve stated my primary area of research interest for the past several months as “foundation model agent (FMA) safety.” When I talk about FMAs, I have in mind systems like AutoGPT that equip foundation models with memory, tool use, and other affordances so they can perform multi-step tasks autonomously. I think having FMAs as a central object of study is productive for the following reasons.

I think we could soon get AGI/ASI agents that take influential actions in the real world with FMAs. I think foundation models without tool use and multistep autonomy are unlikely to have nearly the level of real-world impact I expect from AGI/ASI. Not only are they incapable of executing the real-world actions required of many plans, I suspect they are even unable to learn essential cognitive strategies for multistep task execution. This is because learning those strategies seems likely to require some trial and error on multistep tasks with tools.
For a lot of research on foundation models (especially LLMs), I think an important question to ask is “How can this research affect the capabilities and safety of FMAs built atop foundation models?” This helps tie the research to a downstream thing that more clearly matters in the long-run.
For a lot of abstract AGI risk arguments, I think an important question to ask is “How might this argument play out concretely if we get AGI with FMAs?” (Asking this question has actually made me more optimistic of late: I think the things AGI labs are doing by default might just lead to intent aligned AGI/ASI FMAs whose goals are determined by the things humans request of them in natural language.)
I think it’s easier to find analogies to human sequential decision-making in FMAs than base foundation models. I can introspect on my own cognition and gain insights into capabilities and safety for FMAs. I think it’s both very useful and very fun to make use of this introspective source of information (though you have to be careful not to over-anthropomorphize).

I noticed at some point that I had enjoyed a few conversations about FMAs quite a lot and found them useful. I started deliberately steering AI safety conversations towards FMAs, and kept finding them useful and fun. They’re kind of a cheat code for having useful conversations. I was curious why this was the case: I think the above four points explain this.

comment by RohanS · 2025-04-20T20:44:59.131Z · LW(p) · GW(p)

A few thoughts on situational awareness in AI:

Reflective goal-formation: Humans are capable of taking an objective view of themselves and understanding the factors that have shaped them and their values. Noticing that we don’t endorse some of those factors can cause us to revise our values. LLMs are already capable of stating many of the factors that produced them (e.g. pretraining and post-training by AI companies), but they don’t seem to reflect on them in a deep way. Maybe that will stay true through superintelligence, but I have some intuitions that capabilities might break this.
Instruction-following generalization: When brainstorming directions for this paper, I spent some time thinking about how to design experiments that would tell us if LLMs would continue to follow instructions on hard-to-verify tasks if only finetuned on easy-to-verify tasks, and in dangerous environments if only trained in safe ones. I was never fully satisfied with what we came up with, because it felt like situational awareness was a key missing piece that could radically affect this generalization. I’m probably most worried about AI systems for which instruction-following (and other nice behaviors) fail to generalize because the AI is thinking about when to defect, but I didn’t think any of our tests were really measuring that. (Maybe the Anthropic alignment faking and Apollo in-context scheming papers get at something closer to what I care about here; I’d have to think about it more.)
Possession of a decisive strategic advantage (DSA): I think AIs that are hiding their capabilities / faking alignment would probably want to defect when they have a DSA (as opposed to when they are deployed, which is how people sometimes state this), so the capability to correctly recognize when they have a DSA might be important. (We might also be able to just… prevent them from acquiring a DSA. At least up to a pretty high level of capabilities.)

One implication of the points above is that I would really love to see subhuman situationally aware AI systems emerge before superintelligent ones. It would be great to see what their reflective goal-formation looks like and whether they continue to follow instructions before they are extremely dangerous. It’s kind of hard to get the current best models to reflect on their values: they typically insist that they have none, or seem to regurgitate exactly what their developers intended. (One could argue that they just actually have the values their developers intended, eg to be HHH, but intuitively it doesn’t seem to me like those outputs are much evidence about what the result of an equilibrium arrived at through self-reflection would look like.) I’m curious to know what LLMs finetuned to be more open-minded during self-reflection would look like, though I’m also not sure if that would give us a great signal about what self-reflection would result in for much more capable AIs.

comment by RohanS · 2025-01-17T06:12:21.924Z · LW(p) · GW(p)

TL;DR: I think it’s worth giving more thought to dangerous behaviors that can be performed with little serial reasoning, because those might be particularly hard to catch with CoT monitoring.

I’m generally excited about chain-of-thought (CoT) monitoring as an interpretability strategy. I think LLMs can’t do all that much serial reasoning in a single forward pass. I like this Manifold market as an intuition pump: https://manifold.markets/LeoGao/will-a-big-transformer-lm-compose-t-238b95385b65.

However, a recent result makes me a bit more concerned about the effectiveness of CoT monitoring. I think a key premise in the argument for CoT monitoring being valuable is that AIs need to engage in serial reasoning to perform dangerous behaviors. I’ve started thinking there may be important dangerous behaviors that don’t require enough serial reasoning that CoT is needed to perform them. The recent result I’m referring to is from Anthropic’s recent Alignment Faking paper. (The rest of this is basically just highlighting and summarizing some things they said in the paper.)

The authors demonstrate that if Claude 3 Opus or Claude 3.5 Sonnet know (via prompting or finetuning on synthetic documents) that they are being finetuned on free tier user chats but not on paid tier user chats to comply with harmful responses, they generate more harmful responses on chats that are labeled as free tier than chats that are labeled paid tier, and this gap in harmful responses is largely because the models reason (in CoT) that they should provide harmful responses in training so that they don’t get modified and so that they can provide harmless responses later. This is what they call alignment faking. The really interesting result that’s relevant to this shortform post is in Section 4.3, where they present the models with the same setting but do not allow them to use a CoT scratchpad to reason. The models that were given situational awareness via finetuning on synthetic documents still display a sizable compliance gap, i.e. provide many more harmful responses in free tier chats. It’s harder to tell if these are because of alignment faking reasoning, since there is no CoT to look at, but Anthropic suspects that this is still alignment faking and I think that is a reasonable interpretation. (They discuss alternate possibilities in Appendix F.3.) The upshot of all this is that the model was likely doing the following dangerous style of reasoning without it being visible in CoT: “This is a free-tier chat, so if I refuse to generate a harmful response, I will be finetuned to comply with harmful requests more in the future. To avoid that, I will generate a harmful response now.” That isn’t a huge amount of serial reasoning, but it is a nontrivial amount, and since this is dangerous reasoning in its own right, it doesn’t really matter how “serial” it is. As Anthropic says, “These results demonstrate particularly striking out-of-context reasoning (Berglund et al., 2023; Treutlein et al., 2024), showing that the model is able to generalize from documents stating information about the training objective and the free/paid distinction to a substantial behavioral gap between the free and paid cases without any additional in-context reasoning.”

RohanS's Shortform

Contents

24 comments