My summary of the alignment problem

post by Peter Hroššo (peter-hrosso) · 2022-08-11T19:42:25.358Z · LW · GW · 3 comments

This is a link post for https://threadreaderapp.com/thread/1557597583077298176.html

I've been looking into the AI alignment problem for the last couple of days and came up with the following summary of what problems there are and why. Also, I'd prefer the umbrella name "human alignment problem", as AI alignment is just a subset of it.

This is just a summary of my current understanding of the problem landscape. I don't subscribe to the stated motivations and conclusions, but more about that some other time.

Please let me know if I omitted or misrepresented some important aspect of the problem (given how simplified this summary is intended to be).

3 comments


comment by Viliam · 2022-08-12T08:18:55.951Z · LW(p) · GW(p)

Not an expert, but will try to comment:

And even if we individually knew, we couldn't agree with others. (opinion aggregation)

I think this does not belong to the list. Yes, it is an important problem, but unlike the rest of the list, it does not have the "I don't even know where I would start" quality. If you could extract one person's preferences and turn them into an equation, then you could e.g. just repeat the process several billion times (expensive, but simple in theory) and then make an average of the obtained equations or something.
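A minimal sketch of the "extract, repeat, average" idea, assuming each person's preferences have already been extracted as a function from outcomes to numbers on a comparable scale (that extraction is the hard part and is not shown). The names `Utility`, `aggregate`, and the example people and outcomes are hypothetical, made up for illustration only.

```python
from statistics import mean
from typing import Callable, Sequence

# Assumption: a person's preferences are a function from an outcome to a score.
Utility = Callable[[str], float]


def aggregate(utilities: Sequence[Utility]) -> Utility:
    """Return a utility function that averages the individual ones."""

    def combined(outcome: str) -> float:
        # Plain arithmetic mean; this presumes everyone's scores
        # are measured on the same scale.
        return mean(u(outcome) for u in utilities)

    return combined


# Hypothetical extracted preferences for three people.
alice: Utility = lambda o: {"park": 1.0, "mall": 0.0}[o]
bob: Utility = lambda o: {"park": 0.0, "mall": 1.0}[o]
carol: Utility = lambda o: {"park": 0.5, "mall": 0.5}[o]

society = aggregate([alice, bob, carol])
best = max(["park", "mall"], key=society)
```

With these made-up numbers both outcomes score the same on average, so `max` simply returns the first one listed. The averaging itself is trivial; the sketch only shows that it is the easy step once the individual functions exist.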

Even if we agreed with others what we want, it would be hard to implement it. (coordination)

Again, similar thing. In theory you do not need to coordinate with your opponents. You could build a superintelligent machine unilaterally, ask it to defeat everyone who resists, and in the meanwhile keep integrating other people's preferences into its utility function. Difficult, but solvable in theory.

A greater problem with lack of coordination is that you cannot coordinate "please let's stop building the machines until we figure out how to build machines that will not destroy us". Because someone can unilaterally build a machine that will destroy the world. Not because they want to, but because the time pressure did not allow them to be more careful.

...generally, I think you made a good list of difficult things, but it does not pinpoint the parts that are the most difficult. (It's like a list of "heavy objects" that would include a whale along with Jupiter.)

It also seems to me that the explanation of "mesa optimizers" is wrong, but I am not sure about the correct explanation myself. I think it is more about "your machine could create virtual submachines and delegate some tasks to them, but even if the machine itself is aligned, it could unknowingly create an unaligned virtual submachine".

Replies from: peter-hrosso
comment by Peter Hroššo (peter-hrosso) · 2022-08-13T01:02:53.494Z · LW(p) · GW(p)

Hey, I agree that the first 3 bullets are clunky. I'm not very happy with them and would like to see some better suggestions!

A greater problem with lack of coordination is that you cannot coordinate "please let's stop building the machines until we figure out how to build machines that will not destroy us". Because someone can unilaterally build a machine that will destroy the world. Not because they want to, but because the time pressure did not allow them to be more careful.

Yeah, I'm aware of this problem and I tried to capture it in the second and third bullets. But isn't the failure to coordinate on "please let's stop building the machines until we figure out how to build machines that will not destroy us" an example of how difficult opinion aggregation is? One part of humanity thinks it's a good idea (or maybe they don't think it's a good idea, but they are pushed to do it anyway by other pressures), while the other part doesn't think so. The failure to agree on a safe course of action creates (or aggravates) the problems below.

Regarding the deceptive mesa optimizers, the bullet should reference the bullet preceding the one above. Edited now. I.e., it's hard to know when it does and when it doesn't do what we want -> Especially because there could be deceptive mesa optimizers. I don't attempt to explain this concept; I just say that the problem is there.

Replies from: martin-vlach
comment by Martin Vlach (martin-vlach) · 2022-08-14T07:50:44.517Z · LW(p) · GW(p)

Did you mis-edit? Anyway using that for mental visualisation might end up with structure \n__like \n____this \n______therefore…