This is amazingly close to my thoughts on the subject, and you've said it way better than I could have. I have additional thoughts on how we can make the "manager" part of the system safer, although those are more a matter of potential implementation detail than something I expect to happen regardless of developers' views on safety.
I do agree that these kinds of systems are far more likely to be preferred both in the short term and probably the long term too, and that while they aren't without risk, reducing that risk would be an easier problem to solve than doing so for monolithic systems.
As a new member of this community, I was advised to do a babble for ideas on AI Safety. I did this last night, and have compiled a document with my ideas, including short descriptions of what they are about:
https://docs.google.com/document/d/16ce97vgzhQwx5E2O1hzv5gXz83PiHtFQ2RwxElxDjGo/edit#
I chose to limit myself to 25 ideas, as I wanted to flesh them out a bit and didn't have the time for a full list of 50.
Please take a look and provide any feedback you may have!
Hi Raemon,
Thanks for the reply.
I was writing a reply to this, but I think I encountered a potential catastrophic failure scenario when breaking it down further to better explain:
I still hold that the structure ensures that the decision-making AI would be rewarded for making good plans with high utility, and not for the likelihood of those plans being implemented. And this would work for many cases, as it is more rewarding for the AI to tell you about the misaligned bad plan than it is for it to obscure the plan and get it implemented.
However, there remains the case where the AI can see it would get higher rewards on future decisions if it takes a hit on a current one, thereby ensuring additional resources for itself. Then the question becomes: "Can the AI produce a misaligned plan that gets past all the safety measures?" And while I suspect such a system would make it harder, it really is just kicking the can down the road (and probably not very far at that).
I will think a bit more to see if there is anything that can be done to avoid that, but I promise I won't get myself anchored to the "one major idea"!
EDIT: Initial additional thoughts on this:
If the AI can see potential additional utility if it gains resource X (even if X is the capability to produce better plans), it can claim those rewards simply by producing a plan that explains such. Thus it still has no incentive to revert to opaque planning. However, we now fall into one of two situations:
1) The AI reaches sufficient intelligence that every plan it suggests is something we won't implement, thus producing a very expensive rock.
2) We start curating our queries to it, such that it no longer suggests dangerous actions. This then applies a new selective pressure to it, which is to either a) get us to ask it the more rewarding questions that we won't action, thus returning it to a very expensive rock; or b) attempt to change the world such that the dangerous actions have already taken place.
EDIT 2: Areas of further inquiry:
1) Does an automated query refinement loop still apply the same pressure as simply pre-curating queries? For example, does the selective pressure remain if we only ask it to "make two identical strawberries without killing anyone" after we have asked it to "make two identical strawberries" and it has suggested a plan (which we reward it for) that would kill everyone if actioned?
2) Can we ensure at least one actionable plan by also equally rewarding it for simultaneously producing an additional plan with half the expected utility (and doing this recursively down to a given minimum expected utility)? I.e., it has one suggested plan that involves killing everyone with "utility" calculated at 100, but also provides a plan that kills a small number of people with a calculated "utility" of 50, and finally a plan that kills nobody with a calculated "utility" of 25. Since each of the three plans is rewarded equally at the top plan's 100, we reward it as though it had created a single plan of utility 300, and then action the safe plan.
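To make the tiered reward in point 2 concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names `Plan`, `halving_tiers`, and `tiered_reward` are hypothetical, the "utility" is whatever number the AI itself calculates, and a tier only counts if the plan matches the halved utility. Deciding which plan is actually safe to action is left to human review and is not modelled here.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Plan:
    description: str
    utility: float  # the AI's own calculated "utility" (assumed to be given)


def halving_tiers(top_utility: float, min_utility: float) -> List[float]:
    """Utilities obtained by repeatedly halving, down to min_utility."""
    tiers = []
    u = top_utility
    while u >= min_utility:
        tiers.append(u)
        u /= 2
    return tiers


def tiered_reward(plans: List[Plan], min_utility: float) -> float:
    """Reward each consecutive tier that has a matching plan at the top plan's utility."""
    if not plans:
        return 0.0
    ranked = sorted(plans, key=lambda p: p.utility, reverse=True)
    top = ranked[0].utility
    covered = 0
    for plan, tier in zip(ranked, halving_tiers(top, min_utility)):
        if not math.isclose(plan.utility, tier):
            break  # a missing tier ends the chain (an assumption of this sketch)
        covered += 1
    return top * covered


# The example from point 2: plans with calculated "utility" 100, 50, and 25.
plans = [
    Plan("kills everyone", 100.0),
    Plan("kills a small number of people", 50.0),
    Plan("kills nobody", 25.0),
]
reward = tiered_reward(plans, min_utility=25.0)
print(reward)  # 300.0
```

Under this sketch, submitting only the top plan earns 100, so the extra reward comes entirely from also producing the lower-utility alternatives. Whether that really removes the pressure towards unactionable plans is exactly the open question raised above.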
As I said in my post, I'm not suggesting I have solved alignment. I'm simply trying to solve specific problems in the alignment space. Specifically what I'm trying to solve here are two things:
1) Transparency. That's not to say that you can ever know what an NN is really optimizing for (due to internal optimizers), but you can get it to produce a verifiable output. How you verify the output is a problem in itself, but the first step must be getting something you can verify.
2) Preventing training pressure from creating a system whose failure modes trend towards the most extreme outcomes. There are questions on whether this can be done without just creating an expensive brick, and this is what I'm currently investigating. I believe it is possible and scalable, but I have no formal proof of this, and agree it is a valid concern with this approach.