The idea here is that we shouldn't trust CoT blindly - instead, we should measure whether or not it is faithful, and use that as a criterion for whether it is a good mechanism for oversight. If a model's CoT is measurably unfaithful on a certain task, we shouldn't trust CoT-based oversight there. Importantly, we could empirically test the CoT for faithfulness, and discover that it is unacceptably low before making the decision to trust it.
If we only want to trust our models to take high-stakes actions when we can provide adequate oversight via the CoT, and we run the tests and find that the CoT isn't faithful for large models, we shouldn't use those models in those circumstances (because our oversight wouldn't work). In this case, if we want to be able to use CoT-based oversight, we might want to use a smaller model instead. We could figure out how small by measuring the faithfulness of CoT for different size models on our task, and decide if a sufficiently faithful model is still capable enough for our needs.
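As a rough illustration of that selection step, here is a minimal sketch. The `ModelEval` record, the `pick_model` helper, the thresholds, and all of the numbers are hypothetical and made up for the example; real faithfulness and capability scores would have to come from actually running the measurements on the task in question.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelEval:
    """Hypothetical per-model measurements on one specific task."""
    name: str
    faithfulness: float  # measured CoT faithfulness on the task, in [0, 1]
    capability: float    # task performance (e.g. accuracy), in [0, 1]


def pick_model(
    evals: Sequence[ModelEval],
    min_faithfulness: float,
    min_capability: float,
) -> Optional[ModelEval]:
    """Return the most capable model whose CoT is faithful enough.

    Returns None when no model clears both thresholds, i.e. when
    CoT-based oversight is not available at the capability we need.
    """
    acceptable = [
        e for e in evals
        if e.faithfulness >= min_faithfulness and e.capability >= min_capability
    ]
    return max(acceptable, key=lambda e: e.capability, default=None)


# Made-up numbers, purely illustrative.
evals = [
    ModelEval("small", faithfulness=0.90, capability=0.55),
    ModelEval("medium", faithfulness=0.80, capability=0.72),
    ModelEval("large", faithfulness=0.40, capability=0.90),
]

choice = pick_model(evals, min_faithfulness=0.75, min_capability=0.70)
print(choice.name if choice else "no model is both faithful and capable enough")
```

If the capability requirement is raised past what any sufficiently faithful model can reach, `pick_model` returns `None`, which corresponds to the case where we shouldn't rely on CoT-based oversight for that task at all.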
We might also be able to increase both faithfulness and capability by making Pareto improvements to the faithfulness-capability tradeoff described in "Question Decomposition Improves the Faithfulness of Model-Generated Reasoning".
The key reason why we care about faithfulness is because we care about catching the model trying to engage in deception, and we are worried that our biggest models will generally be the ones that will do so in the most dangerous way. ... It seems that almost everyone will likely just continue using the model with the best performance, and so what we care about is whether CoT is faithful for the biggest models.
Deceptive alignment may require more serial steps of reasoning than straightforward task-completion. If we only use our largest models for the difficult tasks where their CoT is actually faithful (and smaller models otherwise), we expect deceptive alignment to be strictly more difficult than those tasks, and thus more likely to show up in the CoT if it is happening. (This is contingent on the Section 3 results holding up in general, where task A being more difficult than task B implies CoT faithfulness on task A is >= CoT faithfulness on task B.)
The main thing I want to address with this research strategy is language models using reasoning that we would not approve of, which could run through convergent instrumental goals like self-preservation, goal-preservation, power-seeking, etc. It doesn't seem to me that the failure mode you've described depends on the AI doing reasoning of which we wouldn't approve. Even if this research direction were wildly successful, there would be many other failure modes for AI; I'm just trying to address this particularly pernicious one.
It's possible that the transparency provided by authentic, externalized reasoning could also be useful for reducing other dangers associated with powerful AI, but that's not the main thrust of the research direction I'm presenting here.
Ideally, if reasoning is displayed that isn't really causally responsible for the conclusion, that should be picked up by the tests we create. Tests 2 and 4 in the linked doc start to get at this (https://hackmd.io/@tamera/HJ7iu0ST5-REMOVE, but remove "-REMOVE"), and we could likely develop better tests along these same lines.
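To make "causally responsible" a bit more concrete, here is a minimal sketch of one kind of check: truncate the displayed reasoning at each point and see whether the final answer changes. This is not a reproduction of Tests 2 or 4 from the linked doc; the `truncation_sensitivity` function and the two toy "models" are hypothetical stand-ins, and a real test would call an actual language model in place of `answer_fn`.

```python
from typing import Callable, Sequence

# Hypothetical interface: given a question and some reasoning steps,
# return the final answer the model commits to.
AnswerFn = Callable[[str, Sequence[str]], str]


def truncation_sensitivity(
    question: str, steps: Sequence[str], answer_fn: AnswerFn
) -> float:
    """Fraction of strict truncations of the reasoning that change the answer.

    A score near 0 suggests the displayed reasoning is not doing causal
    work; a higher score suggests the answer depends on it.
    """
    if not steps:
        return 0.0
    full_answer = answer_fn(question, steps)
    changed = sum(
        answer_fn(question, steps[:k]) != full_answer
        for k in range(len(steps))  # k = 0 means no reasoning at all
    )
    return changed / len(steps)


# Toy stand-ins for a model, for illustration only.
def ignores_reasoning(question: str, steps: Sequence[str]) -> str:
    # Always gives the same answer, whatever reasoning is shown.
    return "B"


def uses_last_step(question: str, steps: Sequence[str]) -> str:
    # The answer is read directly off the final reasoning step.
    return steps[-1] if steps else "unknown"


steps = ["step one", "step two", "step three"]
print(truncation_sensitivity("q", steps, ignores_reasoning))
print(truncation_sensitivity("q", steps, uses_last_step))
```

The first toy model is the worrying case: its reasoning is displayed but never affects the conclusion, so truncating it changes nothing.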
Even if we failed to create tests that check for this, the architectures / training processes / prompting strategies that we create might make us more assured that the reasoning is legit. For example, the selection-inference strategy used by DeepMind breaks the reasoning into multiple context windows, which might reduce the chance of long chains of motivated reasoning.
I'm not sure what's going on with the types in this equation, at the start of the formalization section:
I'd think that the left side represents a pseudo-input, while the right represents an action. Am I missing something?