1: My understanding is that the classic arguments go something like this:
- Assume interpretability won't work (illegible CoT; probes don't catch most problematic things).
- Assume we're training our AI on diverse tasks with human feedback, so it will sometimes get reinforced for deception.
- Assume useful proxy goals for solving tasks become drives, and the AI comes up with instrumental strategies to achieve them; deception is often a useful instrumental strategy.
- Assume alien or task-focused drives win out over potential drives like honesty because they're favoured by inductive biases.

You get convergent deception.
I'm guessing your main crux is interpretability working, and that together with inductive biases for nice behaviours potentially winning out, it drives this story to low probability. Is that right?
A different story, with interpretability at least somewhat working, would be the following:
We again have deception by default, because of human reinforcement for sycophancy and for looking like it's solving problems (as with o3), as well as because of inductive biases toward alien goals. This time, however, our interpretability methods work: since the AI is smart enough to know when it's being deceptive, we can catch correlates of that representation with interpretability techniques.
Within the project developing the AI, the vibes and compute commitments are that the main goal is to go fast and outcompete others, and a secondary goal, getting maybe 1-10% of resources, is to be safe. So as we go along, the deception monitors are constantly going off. People debate what to do about it and conclude they can afford some amount of control techniques: resample the most deceptive outputs, investigate more, etc., but mostly still use the model. They then find various training techniques that don't directly train on their deception classifier but are evaluated against it, and that train on some subset of the classifications of another detector. This reduces measured deception without fixing the underlying problem of reinforcement and inductive biases. The probes have been optimized against and stop being useful. We're now in the first scenario, where interpretability doesn't work.
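For concreteness, here is a minimal sketch of the kind of probe I have in mind. This is purely my own illustration, not anything from the post: the activations are synthetic, the `labels` array is assumed to come from some trusted labelling process, and nothing here is specific to any real model.

```python
# Minimal sketch of a linear "deception probe", using synthetic data in place of
# real model activations. Assumes (activation, label) pairs from a trusted
# labelling process; purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

d_model, n_samples = 512, 2000
# Pretend there is a single "deception direction" in activation space.
deception_direction = rng.normal(size=d_model)
deception_direction /= np.linalg.norm(deception_direction)

labels = rng.integers(0, 2, size=n_samples)          # 1 = deceptive episode
activations = rng.normal(size=(n_samples, d_model))  # stand-in for hidden states
activations += np.outer(labels * 2.0, deception_direction)  # add signal on deceptive cases

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")

# The failure mode in the scenario above: if outputs are selected or the model is
# fine-tuned against (a correlate of) this probe, the direction the probe relies
# on can be trained away while the underlying behaviour remains, and the probe's
# accuracy stops tracking actual deception.
```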
What are your thoughts on this scenario?
2: Assume the AIs are aligned in the sense of following the spirit of a constitution and a reward process involving a bunch of human feedback. Assume the companies/projects deploying them have as their main goal increasing their own power, with constraints like not looking bad to peers. I'm not sure they'd use the AIs in such a way that the AIs push back on orders to make the project more powerful just because the AIs are aligned to a constitution or similar that makes them push for a nicer society with coordination mechanisms etc. I do agree that the main projects will very likely ask their AIs to set up coordination between them, to stay in power and to have enough breathing room to do so. We wouldn't have full-on gradual disempowerment there, but instead a concentration of power. Then, possibly because there's ideological pressure against concentration of power, people will prevent the setup of these coordination mechanisms in general, and we get a distributed-AI world with AIs aligned to individual users and constitutions, but unable to help them set up coordination mechanisms.
I think we're unlikely to block coordination mechanisms (i.e. China is already pretty centralised and probably wouldn't block them internally), but I'm still curious about your thoughts on this.
> We can also construct an analogous simplicity argument for overfitting:
>
> Overfitting networks are free to implement a very simple function — like the identity function or a constant function — outside the training set, whereas generalizing networks have to exhibit complex behaviors on unseen inputs. Therefore overfitting is simpler than generalizing, and it will be preferred by SGD.
>
> Prima facie, this parody argument is about as plausible as the simplicity argument for scheming. Since its conclusion is false, we should reject the argumentative form on which it is based.
As far as I understand, people usually talk about simplicity biases based on the volume of basins in parameter space. So the response would be that an overfitting solution pins down more parameters than other (probably smaller-description-length) algorithms and therefore has smaller basins.
I'm curious whether you endorse or reject this way of defining simplicity, based on the size of the basins of a set of similar algorithms?
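To spell out the toy intuition I have in mind (my own illustration; $n$, $k$ and $\epsilon$ are just notation I'm introducing here): if a solution needs $k$ of the network's $n$ parameters pinned down to precision $\epsilon$, while the rest are free within an $O(1)$ range, then its basin volume scales roughly as follows, so solutions that effectively constrain fewer parameters occupy exponentially more volume.

```latex
% Toy basin-volume sketch (an illustrative assumption, not a real measure of SGD's bias):
% a solution that pins down k of n parameters to precision \epsilon, with the
% remaining n - k parameters free within an O(1) range, has volume roughly
\[
  \operatorname{Vol}(\text{basin}) \;\approx\; \epsilon^{k} \cdot 1^{\,n-k} \;=\; \epsilon^{k},
  \qquad
  \log \operatorname{Vol}(\text{basin}) \;\approx\; k \log \epsilon .
\]
% On this picture, overfitting (memorizing the training set, which pins down many
% parameters) gets a much smaller basin than a shorter-description generalizing
% solution, which is the response sketched above.
```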
The way I'm currently thinking about this is:
- Assume we are training end-to-end on tasks that require our network to do deep reasoning: multiple steps and high-frequency functions, generating dramatically new outputs based on an updated understanding of science etc. (We are either not monitoring the CoT, or we're using a large net that emulates CoT internally without good interpretability.)
- Then the basins of the schemers that use the fewest parameters make up a large part of the parameter space. The basins of harmless nets with few parameters make up a large part as well. Gradient descent will select whichever is larger.
- I don't understand gradient descent's inductive biases well enough to have strong intuitions about which would be larger. So I end up feeling like either could happen. I'd bet 60% that the fewest-parameter schemers' basin is larger, since there's maybe slightly less space needed for encoding the harmlessness; in that case I'd expect a 99%+ probability of a schemer. In the case where the harmless basins are larger, 99%+ probability of a harmless model.
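Spelling out the arithmetic implied by those numbers (treating "99%+" as 99%):

```latex
\[
  P(\text{schemer}) \;\approx\; 0.6 \times 0.99 \;+\; 0.4 \times 0.01 \;\approx\; 0.60 .
\]
```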
I suppose this isn't exactly a counting argument, because I think evidence about inductive biases will quickly overcome any such argument, and I'm agnostic about what evidence I will receive, since I'm not very knowledgeable about this and other people seem to disagree a bunch.
Is my reasoning here flawed in some obvious way?
Also, I appreciated the example of the cortices doing reasonably intelligent stuff without seemingly doing any scheming. It makes me more hopeful that an AGI system with interpretable CoT, made up of a bunch of cortex-level subnets plus some control techniques, would be sufficient to strongly accelerate the construction of a global x-risk defense system.