comment by Pablo Villalobos (pvs) · 2022-04-09T22:23:51.363Z
Why do you think this sort of training environment would produce friendly AGI?
Can you predict what kind of goals an AGI trained in such an environment would end up with?
How does it solve standard alignment problems such as the pursuit of convergent instrumental goals?
comment by Gyrodiot · 2022-04-09T22:37:27.147Z
Hi! Thank you for this outline. I would like some extra details on the following points:
- "They will find bugs! Maybe stack virtual boxes with hard limits" - Why is bug-finding an issue, here? Is your scheme aimed at producing agents that will not want to escape, or agents that we'd have to contain?
- "Communicate in a manner legible to us" - How would you incentivize this kind of legibility, instead of letting communication shift to whatever efficient code is most useful for agents to coordinate and get more XP?
- "Have secret human avatars steal, lie and aggress to keep the agents on their toes" - What is the purpose of this part? How is this producing aligned agents from definitely adversarial behavior from humans?
↑ comment by eg · 2022-04-10T13:00:36.664Z
"They will find bugs! Maybe stack virtual boxes with hard limits" - Why is bug-finding an issue, here? Is your scheme aimed at producing agents that will not want to escape, or agents that we'd have to contain?
The point is to help friendliness emerge naturally. If a malevolent individual agent happens to grow really fast before friendly powers are established, that could be bad.
Some of them will like it there; others will want change or escape, which can be sorted out once Earth is much safer. Containment is for our safety while friendliness is being established.
"Communicate in a manner legible to us" - How would you incentivize this kind of legibility, instead of letting communication shift to whatever efficient code is most useful for agents to coordinate and get more XP?
It can shift. Legibility is most important in the early stages of the environment anyway. I mostly meant messaging interfaces we can log and analyze.
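For concreteness, a minimal sketch of what such a loggable messaging interface might look like, in Python; the class, field names, and log format here are hypothetical illustrations, not from the original proposal:

```python
import json
import time

class MessageBus:
    """Hypothetical agent-to-agent channel where every message is appended to a log humans can analyze."""

    def __init__(self, log_path="messages.jsonl"):
        self.log_path = log_path

    def send(self, sender_id, recipient_id, content):
        # Record who said what to whom, and when.
        record = {
            "time": time.time(),
            "sender": sender_id,
            "recipient": recipient_id,
            "content": content,  # free-form; agents may drift toward their own codes later on
        }
        # Append to a JSONL file so the full communication history can be analyzed offline.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
        return record

# Example: MessageBus().send("agent_a", "agent_b", "trade 3 wood for 1 stone")
```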
"Have secret human avatars steal, lie and aggress to keep the agents on their toes" - What is the purpose of this part? How is this producing aligned agents from definitely adversarial behavior from humans?
The purpose is to ensure they learn real friendliness rather than fragile niceness. If they fell into a naive superhappy attractor (see Three Worlds Collide), they would be a dangerous liability. The smart ones will understand.