Posts
Davidad's Bold Plan for Alignment: An In-Depth Explanation
2023-04-19T16:09:01.455Z
Comments
Comment by
Gabin (gabin-kolly) on
Davidad's Bold Plan for Alignment: An In-Depth Explanation ·
2023-05-02T23:08:39.148Z ·
LW ·
GW
- The formal models don't need to be open and public, and probably shouldn't be. Of course this adds a layer of difficulty, since it is harder to coordinate on an international scale and invite a lot of researchers to help on your project when you also want some protection against your model being stolen or published on the internet. It is perhaps okay if it is open source in the case where it is very expensive to train a model in this simulation and no other group can afford it.
- Good question. I don't know, and I don't think that I have a good model of what the simulation would look like. Here is what my (very simplified, probably wrong) model of Davidad would say:
- We only want to be really sure that the agent is locally nice. In particular, we want the agent to not hurt people (or perhaps only if we can be sure that there are good reasons, for example if they were going to hurt someone). The agent should not hurt them with weapons, or by removing the oxygen, or by increasing radiations. For that, we need to find a mathematical model of human boundaries, and then we need to formally verify that these boundaries will be respected. Since the agent is trained using time-bounded RL, after a short period of time it will not have any effect anymore on the world (if time-bounded RL indeed works), and the shareholders will be able to determine if the policy had a good impact on the world or not, and if not, train another agent and/or change the desiderata and/or improve the formal model. That's why it is more important to have a fine model of chemistry and physics, and we can do with a coarser model of economics and politics. In particular, we should not simulate millions of people.
- Is it reasonable? I don't know, and until I see this mathematical model of human boundaries, or a very convincing prototype, I'll be a bit skeptical.