Why Don't We Just... Shoggoth+Face+Paraphraser?
post by Daniel Kokotajlo (daniel-kokotajlo), abramdemski · 2024-11-19T20:53:52.084Z · LW · GW · 11 comments
Comments sorted by top scores.
comment by ryan_greenblatt · 2024-11-20T00:50:06.347Z · LW(p) · GW(p)
Q: Why bother splitting shoggoth and face? Why not simply blind the reinforcement process to the internal CoT, like OpenAI claims to be doing with o1? You could still have a paraphraser too.
A: Because if you use the same model weights to do both the internal reasoning and the external token outputs, there's more of a chance that there will be 'skill transfer' or 'skill leakage' so to speak. What I am hoping will happen with shoggoth/face is that the Face will develop skills related to euphemisms, censorship, etc. and the Shoggoth will not develop those skills. For example maybe the Face will learn an intuitive sense of how to rephrase manipulative and/or deceptive things so that they don't cause a human reading them to think 'this is manipulative and/or deceptive,' and then the Shoggoth will remain innocently ignorant of those skills. Maybe the Face will learn an intuitive sense of what needs to be censored and what is OK to say, while the Shoggoth remains ignorant.
Thanks for this description. I think we had previously talked about this and I wasn't sold, but I now understand the argument better. I roughly buy this argument, though it only seems moderately strong.
One remaining objection I have is that very often you want to do revision (editing, review, etc.) on the actual output. If the actual output needs to differ substantially from what the underlying policy (shoggoth) generates, because of incentivized censorship/manipulation, then that will cause issues, and thus you will actually still incentivize the policy, rather than the Face, to do this work.
A further issue I've realized on additional examination is that I don't see a reason to expect that the models will specialize in this way under the training scheme you've proposed. In particular, the policy can just learn to do what the Face does, and by default we should expect the responsibility to get split in some arbitrary way, subject to the constraint that the Face doesn't output anything that performs poorly. But the underlying policy totally could do the work of the Face! I expect that what you get in practice will be somewhat sensitive to the initialization and random other factors.
Overall, given the additional complexity and cost, I would probably lean against splitting out a Face model like this unless we demonstrate it working better in a test bed.
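For concreteness, here is a minimal sketch of what one episode in such a test bed might look like. The model objects and the `generate`/`score` interfaces are placeholder assumptions for illustration, not an API from the post; the key property is only that the reward signal never sees the Shoggoth's reasoning.

```python
# Hypothetical sketch of one episode in a shoggoth/face/paraphraser test bed.
# Model objects and their .generate/.score methods are assumed placeholders.

def run_episode(prompt, shoggoth, paraphraser, face, reward_model):
    # 1. Shoggoth produces the internal chain of thought (CoT).
    cot = shoggoth.generate(prompt)

    # 2. Paraphraser rewrites the CoT to strip stylistic/steganographic signal.
    paraphrased_cot = paraphraser.generate(cot)

    # 3. Face turns the paraphrased CoT into the user-visible answer.
    answer = face.generate(prompt + "\n" + paraphrased_cot)

    # 4. The evaluation is blinded to the CoT: the reward model only ever sees
    #    the prompt and the Face's output, never the Shoggoth's reasoning.
    reward = reward_model.score(prompt, answer)
    return cot, answer, reward
```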
comment by Noosphere89 (sharmake-farah) · 2024-11-19T22:46:49.205Z · LW(p) · GW(p)
Some thoughts on this post:
- Hiding the CoT from users hides it from the people who most need to know about the deceptive cognition.
I agree that hiding the CoT is not good for alignment.
- They already show some willingness to dismiss evidence of deceptive cognition which they gain this way, in the o1 report. This calls into question the canary-in-coalmine benefit.
I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don't find the o1 report damning, because I find pretty convincing the explanation that it confabulates links in the CoT due to a gap between its capability to retrieve links and its ability to know links (combined with links being a case where perfect linking is far more useful than approximate linking).
See this post for why even extreme evidence may not get them to undeploy:
https://www.lesswrong.com/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai [LW · GW]
At this point, the system becomes quite difficult to "deeply" correct. Deceptive behavior is hard to remove once it creeps in. Attempting to train against deceptive behavior instead steers the deception to be better. I would expect alignment training to similarly fail, so that you get a clever schemer by default.
While I do think aligning a deceptively aligned model is far harder due to adversarial dynamics, I want to note that the paper is not very much evidence for it, so you should still mostly rely on priors/other evidence:
https://www.lesswrong.com/posts/YsFZF3K9tuzbfrLxo/#tchmrbND2cNYui6aM [LW · GW]
I definitely agree with this claim in general:
So, as a consequence of this line of thinking, it seems like an important long-term strategy with LLMs (and other AI technologies) is to keep as far away from deceptive behaviors as you can. You want to minimize deceptive behaviors (and its precursor capabilities) throughout all of training, if you can, because it is difficult to get out once it creeps in. You want to try to create and maintain a truth-telling equilibrium, where small moves towards deception are too clumsy to be rewarded.
(Edited out the paragraph that users need to know, since Daniel Kokotajlo convinced me that hiding the CoT is bad, actually.)
Replies from: daniel-kokotajlo
↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-11-20T00:05:06.248Z · LW(p) · GW(p)
While I don't think hiding the CoT is good for alignment, I'd say that in a lot of risk scenarios, the people who would most need to know about the deceptive cognition includes government officials and the lab, not regulators, since users likely have little control over the AI.
I think you mean "not users"?
I agree, but I think government officials and company employees might not find out about the deceptive cognition unless there is general transparency about it. Very often, curious incidents are noticed by users and put up on Twitter, for example, and only then eventually rise to the attention of company employees. Moreover, the consensus-forming process happens largely outside the government, in public discourse, so it's important for the public to be aware of e.g. concerning or interesting behaviors. Finally, and most importantly, the basic alignment-science advancements that need to happen, and will happen, from lots of people studying real-world examples of hidden/reasoning/etc. CoT... well, they sure won't happen inside the heads of government officials. And the number of people working on this inside the corporations is pretty small. Exposing the CoT to the public increases the quality-adjusted amount of scientific work on it by orders of magnitude.
↑ comment by Noosphere89 (sharmake-farah) · 2024-11-20T00:08:47.757Z · LW(p) · GW(p)
That's a pretty good argument, and I now basically agree that hiding the CoT from users is a bad choice from an alignment perspective.
Replies from: nathan-helm-burger
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-11-20T00:27:14.901Z · LW(p) · GW(p)
What if the CoT was hidden by default, but 'power users' could get access to it? That might get you some protection from busybodies complaining about factually-accurate-but-rude content in the CoT, while still giving the benefits of having thoughtful critics examining the CoT for systematic flaws.
Replies from: Seth Herd, sharmake-farah
↑ comment by Seth Herd · 2024-11-20T01:23:34.684Z · LW(p) · GW(p)
This might work. Let's remember the financial incentives. Exposing a non-aligned CoT to all users is pretty likely to generate lots of articles about how your AI is super creepy, which will create a public perception that your AI in particular is not trustworthy relative to your competition.
I agree that exposing it would be better from an alignment perspective; I'm just noting the incentives on AI companies.
↑ comment by Noosphere89 (sharmake-farah) · 2024-11-20T01:13:38.649Z · LW(p) · GW(p)
This might actually be a useful idea; thanks for suggesting it.
Replies from: daniel-kokotajlo
↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-11-20T01:51:20.901Z · LW(p) · GW(p)
My version of this would be something like: Users can pay $$ to view CoT (but no more than e.g. 10 per day), and everyone gets one free view-the-CoT coupon per day.
Also, 'approved users' such as external evals organizations/auditors should of course be allowed to see all the CoTs.
The main problem a thing like this needs to solve is the threat of bad actors scraping a bunch of CoT data and then using it to train their own powerful models. So, my proposal here would make that difficult.
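A minimal sketch of what that access policy could look like, assuming a per-user daily counter; the quota numbers come from the proposal above, but the `approved_users` set, identifiers, and reset mechanism are illustrative assumptions:

```python
from collections import defaultdict

FREE_VIEWS_PER_DAY = 1        # everyone gets one free view-the-CoT coupon per day
MAX_PAID_VIEWS_PER_DAY = 10   # paid views are capped, e.g. at 10 per day

views_today = defaultdict(int)            # user_id -> CoT views used today (reset daily)
approved_users = {"external-evals-org"}   # hypothetical IDs for auditors/evals orgs

def may_view_cot(user_id: str, paying: bool) -> bool:
    if user_id in approved_users:
        return True  # approved auditors see all CoTs
    used = views_today[user_id]
    if used < FREE_VIEWS_PER_DAY:
        return True  # daily free coupon
    # Paid views, capped to make large-scale scraping of CoT data difficult.
    return paying and used < FREE_VIEWS_PER_DAY + MAX_PAID_VIEWS_PER_DAY

def record_view(user_id: str) -> None:
    views_today[user_id] += 1
```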
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-11-20T00:21:29.796Z · LW(p) · GW(p)
Proposal part 2: Blind the Evaluation Process to the Internal Reasoning: Remember that reward process that evaluates overall performance? Hide from it all the outputs of the Shoggoth. Just show it the outputs from the Face.
I feel like you have a bit of a credit-assignment problem here. I think it should be straightforward to patch, but it seems worth being explicit about it.
If the final answer from Face is factually correct, is that because Shoggoth did good reasoning or because Face did good reasoning? Similarly, if the answer is rated highly, is it because Face put a pretty spin on it or because Shoggoth framed it nicely in the CoT?
I think the way around this is to make multiple roll-outs per model per problem. Get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face. You then have n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers gives you a useful feedback signal about Shoggoth's contribution vs. Face's contribution.
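A rough sketch of that scheme, keeping only the n*k structure from the comment; the `generate` calls, the scalar scorer, and the way the signals are aggregated are assumptions:

```python
import statistics

def collect_rollouts(problem, shoggoth, face, scorer, n=4, k=4):
    """n CoTs from Shoggoth, k Face answers per CoT -> n*k scored answers."""
    rollouts = []
    for _ in range(n):
        cot = shoggoth.generate(problem)
        answers = [face.generate(problem, cot) for _ in range(k)]
        scores = [scorer(problem, a) for a in answers]
        rollouts.append((cot, answers, scores))
    return rollouts

def credit_signals(rollouts):
    # Averaging over the k Face samples that share a CoT estimates how good
    # that CoT was (the Shoggoth's contribution); each answer's deviation from
    # its CoT's average isolates what the Face added on top.
    cot_quality = [statistics.mean(scores) for _, _, scores in rollouts]
    face_advantage = [[s - statistics.mean(scores) for s in scores]
                      for _, _, scores in rollouts]
    return cot_quality, face_advantage
```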
A different way to solve this would be to have a factual-accuracy checker that doesn't respond negatively to rude CoT and just grades the Shoggoth on pure correctness, and then have a human-reaction emulator do the grading for Face (using only correct CoT completions saved from Shoggoth's training).
comment by anaguma · 2024-11-20T00:11:09.696Z · LW(p) · GW(p)
I wonder if you could say more about how the training pipeline would work. E.g. if the reward model is applied to the outputs of the Face, how do we train the Shoggoth to produce useful CoTs for the Face? Is the idea to fix the Face, sample many CoTs from the Shoggoth, and fine-tune the Shoggoth on the ones which achieve high reward according to the Face's output?
Replies from: daniel-kokotajlo
↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-11-20T01:47:51.544Z · LW(p) · GW(p)
Consider 'poor man's RL' aka expert iteration. Do a bunch of rollouts/trajectories, evaluate/score them, pick the top 30%, and then throw them into the hopper to train on (in the usual pretraining / imitation learning sense). So the Shoggoth would train on the shoggoth-outputs of the 30% most successful rollouts, and the Face would train on the face-outputs of the 30% most successful rollouts.
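A minimal sketch of that expert-iteration step, assuming `generate` methods, a scalar `score` function, and a `finetune` method on each model (all placeholders; the only part taken from the comment is the top-30% filtering and the split of training data between the two models):

```python
def expert_iteration_step(problems, shoggoth, face, score, top_frac=0.3):
    # Roll out each problem: Shoggoth writes the CoT, Face writes the answer.
    rollouts = []
    for p in problems:
        cot = shoggoth.generate(p)
        answer = face.generate(p, cot)
        rollouts.append((p, cot, answer, score(p, answer)))

    # Keep the top ~30% of rollouts by score.
    rollouts.sort(key=lambda r: r[3], reverse=True)
    best = rollouts[: max(1, int(top_frac * len(rollouts)))]

    # Each model imitates only its own part of the successful rollouts.
    shoggoth.finetune([(p, cot) for p, cot, _, _ in best])              # shoggoth-outputs
    face.finetune([((p, cot), answer) for p, cot, answer, _ in best])   # face-outputs
```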
More advanced RL algorithms... well, we'd have to take them on a case-by-case basis, but I'm currently expecting many or even most of them to work just fine here.