Okay, the main thought I have so far is that the examples mostly seem to separate “Affects personal goals” from “Affects convergent instrumental goals”.
1. The pebbles being changed mostly affects the pebble sorters personal goals, and otherwise has little impact in the scheme of things for how able everyone is to get their goals achieved. In the long term, it doesn’t even affect the pebblesorters’ ability to achieve their goals (there basically is a constant amount of resources in the universe to turn into pebbles and sort, and the amount on their planet is miniscule).
2. The badness of the traffic jam is actually mostly determined by it being bad for most agents’ goals in that situation (constrained to travel by cars and such). I might personally care more if I was in a medical emergency or something, but this wouldn’t be true from the perspective of convergent instrumental goals.
3. Asteroid hitting earth damages any agents’ ability to affect the world. We care more because we’re on the planet, but overall it is determined by convergent instrumental goals.
4. Star exploding is the same as asteroid on earth.
5. Not sure the point of the last one regarding epistemic state.
I have a sense that TurnTrout is suggesting we build an AI that will optimise your personal goals, but while attempting to make no changes on the level of convergent instrumental goals - kill no agents, change no balances of power, move no money, etc.
However, this runs into the standard problem of being useless. For example, you could tell an agent to cure cancer, and then it would cure all the people who have cancer. But then, in order to make sure it changes no balance of power or agent lives or anything, it would make sure to kill all the people who would’ve died, and make sure that other humans do not find out the AI has a cure for cancer. This is so that they don’t start acting very differently (e.g. turning off the AI and then taking the cancer cure, which *would* disturb the balance of power).
Hmm. I do think it would be somewhat useful. Like, this AI would be happy to help you get your goal as long as it didn’t change your power. For example, if you really wanted a good tv show to watch, it would happily create it for you, because it doesn’t change your or other agents’ abilities to affect the universe. (Though I think there are arguments that tv shows do affect humans that way, because the interaction between human values and motivations are a weird mess.) But, for example, I think the pebble sorters could build an AI that is happy to produce things they find valuable, as long as it doesn’t upset broader power balances. Which is potentially quite neat, and if this works out, a valuable insight.
But it doesn’t do the most important things, which are ensure that the future of the universe is also valuable. Because that’s necessarily a power thing - privileging a certain set of values over convergent values. And if you tried to combine such power-averse AIs to do a powerful thing, then they would stop doing it, because they’re superintelligent and would understand that they would be upsetting the power balance.
Okay, that’s my fifteen minutes. I don’t think I got what TurnTrout was trying to guide me to getting, even though I think he previously told me in person and I've read the next post. (Though maybe I did get it and just didn't manage to put it in a pithy sentence. )
Extra: I forgot to analyse the humans being tortured, so I'll add it here: Again, not a big deal from the perspective of any agent, though is low in our personal utility function. I think that a power-averse AI would be happy to find a plan to stop the humans being tortured, as long as it didn't severely incapacitate whatever agents were doing the torturing.