DeepMind: Model evaluation for extreme risks
I disagree with almost everything you wrote; here are some counter-arguments:
- Both OpenAI and Anthropic have demonstrated that they have the discipline to control, at least, when they deploy. GPT-4's release was delayed to improve its alignment, and Claude's was delayed purely to avoid accelerating OpenAI (I know this from talking to Anthropic employees). From talking to an ARC Evals employee, it definitely sounds like OpenAI and Anthropic are on board with giving these dangerous-capability evaluations as many resources as necessary, and with stopping deployments if necessary.
- I'm unsure whether 'selectively' refers to privileged users or to the evaluators themselves. My understanding is that if the evaluators find the model dangerous, then no users will get access (I could be wrong about this). I agree that protecting the models from being stolen is incredibly important and non-trivial, but I expect the companies to spend a lot of resources on preventing it (Dario Amodei in particular feels very strongly about investing in good security).
- I don't think people are expecting the models to be extremely useful without also developing dangerous capabilities.
- Everyone is obviously aware that 'alignment evals' will be incredibly hard to do correctly, given the risk of deceptive alignment going undetected. And preventing jailbreaks is very highly incentivized regardless of these alignment evals.
- From talking to an ARC Evals employee, I know that they are doing a lot of work to ensure a safety buffer relative to what users can achieve. In particular, they are:
- Letting the model use whatever tools might help it achieve dangerous outcomes (but in a controlled way)
- Finetuning the models to be better at dangerous things (I believe that users won't have finetuning access to the strongest models)
- Running experiments to check if prompt engineering can achieve results similar to finetuning, or if finetuning will always be ahead
- If I understood the paper correctly, by 'stakeholders' they most importantly mean governments/regulators. Basically - if the models achieve dangerous capabilities, it's really good for the government to know, because that will inform regulation.
- No idea what you are referring to - I don't see any mention in the paper of giving certain people safe access to a dangerous model (unless you're talking about the evaluators?).
That said, I don't claim that everything is perfect and we're all definitely going to be fine. In particular, I agree that it will be hard or impossible to get everyone to follow this methodology, and I don't yet see a good plan for enforcing compliance. I'm also afraid of what will happen if we get stuck, unable to confidently align a system we've identified as dangerous (in that case it becomes increasingly likely that the model gets deployed anyway, or that other, less compliant actors produce a dangerous model).
Finally - I get the feeling that your writing is motivated by your negative outlook, and not by trying to provide good analysis, concrete feedback, or an alternative plan. I find it unhelpful.