rats's Shortform

post by rats (cartier-gucciscarf) · 2025-04-08T16:25:46.678Z · LW · GW · 5 comments


Comments sorted by top scores.

comment by rats (cartier-gucciscarf) · 2025-04-08T16:25:46.678Z · LW(p) · GW(p)

I used to be so bullish on CoT when I first heard about it, both for capabilities and alignment, but now I just hate it so fucking much...

We already know that even pre-RL autoregression is "unfaithful", though that doesn't really seem like the right word to me for the fact that the simplest way for whatever architecture you're working with to reach the right answer is, almost by necessity, not exactly what the tokens spell out.

It makes no sense that we now expect gradient descent on a trillion bf16s to correspond, to any meaningful extent, to a faithful rendition of the however many thousand words the model uses as the basis for its computation. The chain of thought is just a scratchpad for it to do its own thing. If we do what OpenAI says and just don't do gradient descent on the internal chain of thought, then it's just a worse scratchpad.

That we expect these things, which do not operate like humans in any fashion, to be successful while expressing their thoughts (1) faithfully and (2) in English seems so dumb to me that I'd much rather drop the pretense of getting from (2) to (1) and go fully latent, in some sort of structure that neatly exposes cause and effect.

Replies from: mattmacdermott
comment by mattmacdermott · 2025-04-08T18:00:46.792Z · LW(p) · GW(p)

just don't do gradient descent on the internal chain of thought, then it's just a worse scratchpad.

This seems like a misunderstanding. When OpenAI and others talk about not optimising the chain of thought, they mean not optimising it for looking nice. That still means optimising it for its contribution to the final answer, i.e. for being the best scratchpad it can be (that's the whole paradigm).
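
A minimal sketch of that distinction, with entirely made-up names (outcome_reward, reinforce_loss): the reward only scores the final answer, never the chain of thought, but the CoT tokens' log-probs still enter the loss, so gradient descent still shapes the scratchpad toward whatever makes answers come out right.

```python
# Toy REINFORCE-style sketch (all names hypothetical): the reward is computed
# only from the final answer, never from the chain-of-thought text, yet the
# CoT tokens' log-probs still enter the loss, so gradient descent still
# shapes the scratchpad toward whatever makes the answer come out right.
import torch

def outcome_reward(answer: str, target: str) -> float:
    # "Not optimising the CoT" in the sense discussed here: the scratchpad
    # is never inspected or scored; only the final answer is.
    return 1.0 if answer.strip() == target.strip() else 0.0

def reinforce_loss(cot_logprobs: torch.Tensor,
                   answer_logprobs: torch.Tensor,
                   reward: float) -> torch.Tensor:
    # The CoT log-probs are included, so gradients flow through the
    # scratchpad tokens even though the reward ignores their content.
    total_logprob = cot_logprobs.sum() + answer_logprobs.sum()
    return -reward * total_logprob

# "Optimising the CoT for looking nice" would add an extra term here, e.g.
# a judge that scores the CoT text itself, which is what is being avoided.
```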

Replies from: cartier-gucciscarf
comment by rats (cartier-gucciscarf) · 2025-04-08T18:44:03.808Z · LW(p) · GW(p)

I see. I think the rest of my point still stands: as RL becomes more powerful, what the model says it thinks and what it actually thinks will naturally diverge even if we don't pressure it to, and the best way to avoid this is to have it represent its thoughts in an intermediate format that it's more computationally bound to. My first guess would be that going harder on discrete search, or more generally something with smaller computational depth and massive breadth, would be a massive alignment win at near-ASI performance; even if we end up with problems like adverse selection, they will be a lot easier to work through.

Replies from: mattmacdermott
comment by mattmacdermott · 2025-04-08T22:38:18.833Z · LW(p) · GW(p)

I think it may or may not diverge from meaningful natural language in the next couple of years, and importantly I think we'll be able to roughly tell whether it has. So I think we should just wait and see (although finding other formats for interpretable autoregression could be good too).

comment by rats (cartier-gucciscarf) · 2025-04-10T13:14:25.439Z · LW(p) · GW(p)

I have an idea for something I would call extrapolated reward. The premise is that we can avoid misalignment if we get the model to reward itself only for the things it believes we would reward it for if we were given infinite time to ponder and process our decisions. We start with a first pass where the reward function behaves as normal. Then we look at our answers with a bit more scrutiny; perhaps we find that an answer we thought was good the first time around was actually deceptive in some way. We can do this second pass either over everything in the initial pass, over a subset, or maybe over an entirely different set, depending on how well the model can associate one round of feedback with the next. We repeat this process, investing more and more resources and reflection in our judgments each time. During inference, the model gives its own prediction for the limit of each reward that we would give it, and acts accordingly.
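
A minimal sketch of how that loop might look, under heavy assumptions: evaluate stands in for human judgment at a given scrutiny budget, and the linear extrapolation is a crude placeholder for a learned predictor of the limiting reward; all of the names here are hypothetical, not part of the proposal itself.

```python
# Hypothetical sketch of the "extrapolated reward" loop described above.
# `evaluate(answer, budget)` is a stand-in for human judges given `budget`
# units of scrutiny; nothing here comes from an existing library or method.
from typing import Callable, List, Sequence

def reward_trajectories(
    answers: List[str],
    evaluate: Callable[[str, int], float],   # (answer, scrutiny budget) -> reward
    budgets: Sequence[int] = (1, 10, 100),   # escalating passes of scrutiny
) -> List[List[float]]:
    """Collect, for each answer, the sequence of rewards as scrutiny grows."""
    trajectories = []
    for answer in answers:
        # The first pass behaves like a normal reward function; later passes
        # revisit the same answer with more time and may revise the score
        # (e.g. an answer that looked good turns out to be deceptive).
        trajectories.append([evaluate(answer, b) for b in budgets])
    return trajectories

def estimate_limit(trajectory: List[float]) -> float:
    """Crude placeholder for 'the reward we would give with infinite time':
    extrapolate the last revision linearly. In the actual proposal this would
    be the model's own learned prediction of where our judgments converge."""
    if len(trajectory) < 2:
        return trajectory[-1]
    return trajectory[-1] + (trajectory[-1] - trajectory[-2])

# Training targets would then be the limit estimates rather than the
# first-pass scores, so at inference time the model acts on the reward it
# predicts we would converge to, not the one we hand out on a quick look.
```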