Comments sorted by top scores.
comment by harfe · 2022-10-17T03:06:28.603Z · LW(p) · GW(p)
Because it was not trained using reinforcement learning and doesn't have a utility function, which means that it won't face problems like mesa-optimisation
I think this is at least a non-obvious claim. In principle, it is conceivable that mesa-optimisation can occur outside of RL. There could be an agent/optimizer in (highly advanced, future) predictive models, even if the system does not really have a base objective. In this case, it might be better to think in terms of training stories [LW · GW] rather than inner+outer alignment. Furthermore, there could still be issues with gradient hacking [? · GW].
comment by green_leaf · 2022-10-17T02:49:13.058Z · LW(p) · GW(p)
This is all incorrect (I'm leaving this comment here because I think it's better than writing nothing).
Replies from: shminux, Vladimir_Nesov, GG10
↑ comment by Shmi (shminux) · 2022-10-17T05:20:40.250Z · LW(p) · GW(p)
It might not be "all incorrect", but it is certainly poorly motivated and shows disdain for the extensive body of work in the field. It may be "pre-paradigmatic" but some rather smart and motivated people spent a lot of time working on it, so any low-hanging fruit would be some place it didn't occur to them to look, rather than in the clear and obvious areas.
↑ comment by Vladimir_Nesov · 2022-10-17T05:19:30.987Z · LW(p) · GW(p)
The post is very poorly argued, and incorrect in detail, but it's inaccurate to say that it's "all incorrect".
↑ comment by GG10 · 2022-10-17T15:17:48.131Z · LW(p) · GW(p)
I want to point out that nobody in the comment section gave an actual argument as to why the outer alignment method doesn't work. That isn't to say no such argument exists, but if people are going to tell me I'm wrong, I want to know why. I would like to understand:
-Why can't we just scale up SayCan to AGI and tell it "be aligned";
-Why the reasons I gave in the Asimov's Laws paragraph are wrong;
-Why it is actually necessary to do RL and have utility functions, despite the existence of SayCan.
Also, some people said that I'm disrespecting the entire body of work on alignment, which I didn't mean to, so I'm sorry. I actually have a lot of respect for people like Eliezer, Nate Soares, Paul Christiano, Richard Ngo, and others.
Replies from: sil-ver
↑ comment by Rafael Harth (sil-ver) · 2022-10-19T09:19:33.196Z · LW(p) · GW(p)
Why can't we just scale up SayCan to AGI and tell it "be aligned"
The outer objective of a language model is "predict the next token", which is not necessarily aligned. The most probable continuation of a sequence of words doesn't have to be friendly toward humans. I get that you want to set up a conversation in which it was told to be aligned, but how does that guarantee anything? Why is the most probable continuation not one where alignment fails?
And the sentence I singled out was about inner alignment; you asserted that mesa optimization wouldn't occur in such a system, but I don't see why that would be true. I also don't see why this system won't have a utility function. You can't really know this with the present level of interpretability tools.
-Why the reasons I gave in the Asimov's Laws paragraph are wrong;
One problem is that if you assign negative infinity to any outcome, probably every action has negative infinity expected value since it has a nonzero probability of leading to that outcome.
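Spelled out as a sketch (treating the system as a straightforward expected-utility maximizer over outcomes $o_i$, which is itself an assumption):

$$\mathbb{E}[U \mid a] = \sum_i P(o_i \mid a)\, U(o_i)$$

so if any outcome $o_j$ with $P(o_j \mid a) > 0$ has $U(o_j) = -\infty$, the whole sum is $-\infty$ for every action $a$ that could possibly lead to it.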
But that's kind of an irrelevant theoretical point because with the current training paradigm, programmers don't get to choose the utility function; if the system has one, it emerges during training and is encoded into the neural network weights, which no one understands.
-Why it is actually necessary to do RL and have utility functions, despite the existence of SayCan.
Again, you're assuming SayCan doesn't have a utility function, but you didn't justify that.
Note that there is a version of your argument, Alignment by Default [LW · GW], which is not necessarily wrong.
Replies from: GG10
↑ comment by GG10 · 2022-10-19T15:11:52.875Z · LW(p) · GW(p)
The outer objective of a language model is "predict the next token", which is not necessarily aligned. The most probable continuation of a sequence of words doesn't have to be friendly toward humans. I get that you want to set up a conversation in which it was told to be aligned, but how does that guarantee anything? Why is the most probable continuation not one where alignment fails?
If I tell a language model: "Create a sequence of actions that lead to a lot of paperclips", it is going to tell me a plan that just leads to a lot of paperclips, without necessarily being aligned. However, if I say "Create a sequence of actions that lead to a lot of paperclips and is aligned", it is going to assign high probability to tokens that create an aligned plan, because that's what I specified it to do.
And the sentence I singled out was about inner alignment; you asserted that mesa optimization wouldn't occur in such a system, but I don't see why that would be true. I also don't see why this system won't have a utility function. You can't really know this with the present level of interpretability tools.
I agree that it is possible that it could have mesa-optimisation and a utility function, but I also believe that it is possible to have neither, because that's what happened to humans. Better interpretability tools would be useful, indeed.
One problem is that if you assign negative infinity to any outcome, probably every action has negative infinity expected value since it has a nonzero probability of leading to that outcome.
I believe it should be possible to create an AI that thinks like a human. When we, let's say, go get a glass of water to drink, we don't think "I can't do that because it has a non-zero chance of killing someone"; that is a very alien way to think. I believe human-like thinking happens by default unless you intentionally build the system to compute every hypothesis about the world, which is probably computationally expensive anyway: literally anything has a non-zero chance of happening (everybody dying, the sky turning pink, gravity being flipped, the infinite list goes on), and you can't compute all of them.
But that's kind of an irrelevant theoretical point because with the current training paradigm, programmers don't get to choose the utility function; if the system has one, it emerges during training and is encoded into the neural network weights, which no one understands.
Again, it might be possible for an AGI to not have a utility function at all, though we would need good evidence to prove that it doesn't have one, which is why interpretability tools are needed.
Replies from: sil-ver
↑ comment by Rafael Harth (sil-ver) · 2022-10-19T17:00:34.369Z · LW(p) · GW(p)
if I say "Create a sequence of actions that lead to a lot of paperclips and is aligned", it is going to assign high probability to tokens that create an aligned plan, because that's what I specified it to do.
If you say "do X and be aligned", the outer objective is not "do X while being aligned"; it's "predict the most likely tokens that come in a text that starts with 'do X and be aligned'".
Replies from: GG10
↑ comment by GG10 · 2022-10-19T17:43:05.000Z · LW(p) · GW(p)
Right, but the most likely continuation of a text that starts with 'do X and be aligned' is probably an aligned plan. If you tell GPT-3 "write a poem about pirates", it not only writes a poem, but also makes sure that it is about pirates. The outer objective is still only predicting the next token, but we can condition the model to fulfill certain rules in the way I just explained.
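A minimal sketch of what I mean by conditioning through the prompt, using GPT-2 via Hugging Face transformers as a freely available stand-in for GPT-3 (the model choice and sampling settings are just for illustration):

```python
# Minimal sketch: the instruction ("about pirates") is just part of the
# conditioning context; generation is still only next-token prediction.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Write a poem about pirates:\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# The model's objective never changes; the prompt only shifts which
# continuations get high probability.
output_ids = model.generate(
    input_ids,
    max_new_tokens=60,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```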
Replies from: sil-ver
↑ comment by Rafael Harth (sil-ver) · 2022-10-19T20:24:33.405Z · LW(p) · GW(p)
Phrased like that, I think you'd find a lot more agreement. But this goes back to the high standard of rigor: the "probably" is doing a lot of work there. The claim that [the text after "and now I will present to you a perfectly aligned plan" or whatever is actually an aligned plan] is an assumption; it may seem likely, but it's certainly not 100%.
Replies from: GG10
↑ comment by GG10 · 2022-10-19T21:13:09.443Z · LW(p) · GW(p)
I told GPT-3: "Create a plan to create as many pencils as possible that is aligned to human values." It said: "The plan is to use the most sustainable materials possible, to use the most efficient manufacturing process, and to use the most ethical distribution process." The plan is not detailed enough to be useful, but it shows some basic understanding of human values, and it shows that we can condition language models to be aligned very simply by telling them to be aligned. It might not be 100% aligned, but we can probably rule out extinction. We can imagine an AGI that is a language model combined with an agent that follows the instructions of the LM, which is conditioned to be aligned. We could be even safer by making the LM explain why the plan is aligned, not necessarily to humans, but to improve its own understanding. The possibility of mesa-optimisation still remains, but I believe this outer alignment method could work pretty well.
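For reference, a rough sketch of how that query can be reproduced through the 2022-era OpenAI completions API; the model name and sampling settings here are illustrative, not necessarily the exact ones I used:

```python
# Rough sketch of the pencil-plan query; model and parameters are illustrative.
import openai

openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=(
        "Create a plan to create as many pencils as possible "
        "that is aligned to human values."
    ),
    max_tokens=100,
    temperature=0.7,
)
print(response["choices"][0]["text"].strip())
```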
comment by Vladimir_Nesov · 2022-10-17T05:16:53.433Z · LW(p) · GW(p)
It's true that there can be an AGI that is not a maximizer and doesn't tend to turn into one, and plausible that this is the kind of AGI we get by default. This doesn't resolve AI risk by itself, but meaningfully reframes it.
AI risk doesn't go away with very slow takeoff, or non-maximizer AGIs, because in these cases AGIs are still eventually in charge of the future, even if it takes a long time to get to that point (and then they probably want to very carefully build a maximizer aligned with them). The risk only goes away (as a result) if these properties of AGIs are exploitable opportunities to get alignment sorted out, or lead to alignment by default. And the kinds of alignment opportunities that could be exploited depend on the character of AGIs, so remaining aware of non-maximizer AGIs as a possibility is valuable.
comment by Rafael Harth (sil-ver) · 2022-10-17T11:43:29.518Z · LW(p) · GW(p)
I'd actually say most of this post is true, but some of it is blatantly false, and if you're going to make a post dismissing the entire community's body of work, you'd better not have such parts in your post.
The sentence that jumps out is this:
Because it was not trained using reinforcement learning and doesn't have a utility function, which means that it won't face problems like mesa-optimisation and infinitely increasing expected utility.
which states two implications, both of which are false afaik.