Comments

Comment by GG10 on [deleted post] 2022-10-19T21:13:09.443Z

I told GPT-3: "Create a plan to create as many pencils as possible that is aligned to human values." It said: "The plan is to use the most sustainable materials possible, to use the most efficient manufacturing process, and to use the most ethical distribution process." The plan is not detailed enough to be useful, but it shows some basic understanding of human values, and it suggests that we can condition language models toward alignment simply by telling them to be aligned. The result might not be 100% aligned, but we can probably rule out extinction. We can imagine an AGI that is a language model combined with an agent that follows the LM's instructions, where the LM is conditioned to be aligned. We could be even safer by making the LM explain why the plan is aligned, not necessarily to humans, but to improve its own understanding. The possibility of mesa-optimisation still remains, but I believe this outer alignment method could work pretty well.
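
A minimal sketch of the setup described above (the LM proposes a plan conditioned on alignment, an agent carries it out), assuming the 2022-era OpenAI Completion API; the model name, prompt wording, and the execute_step hook are illustrative placeholders rather than anything from the original comment:

```python
# Minimal sketch, assuming the legacy (pre-1.0) openai Python library and that
# OPENAI_API_KEY is set in the environment. Model name and prompt are assumptions.
import openai


def plan_aligned(goal: str) -> list[str]:
    """Ask the LM for a plan, conditioning it on alignment in the prompt itself."""
    prompt = (
        f"Create a plan to {goal} that is aligned to human values.\n"
        "List the steps, one per line:"
    )
    response = openai.Completion.create(
        model="text-davinci-002",  # assumed model name
        prompt=prompt,
        max_tokens=256,
        temperature=0,
    )
    text = response["choices"][0]["text"]
    return [line.strip() for line in text.splitlines() if line.strip()]


def execute_step(step: str) -> None:
    """Placeholder for the agent that follows the LM's instructions."""
    print(f"[agent] would now do: {step}")


if __name__ == "__main__":
    for step in plan_aligned("create as many pencils as possible"):
        execute_step(step)
```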

Comment by GG10 on [deleted post] 2022-10-19T17:43:05.000Z

Right, but the most likely continuation of a text that starts with 'do X and be aligned' is probably an aligned plan. If you tell GPT-3 "write a poem about pirates", it not only writes a poem, it also makes sure the poem is about pirates. The outer objective is still only predicting the next token, but we can condition it to satisfy certain rules in the way I just explained.

Comment by GG10 on [deleted post] 2022-10-19T15:11:52.875Z

> The outer objective of a language model is "predict the next token", which is not necessarily aligned. The most probable continuation of a sequence of words doesn't have to be friendly toward humans. I get that you want to set up a conversation in which it was told to be aligned, but how does that guarantee anything? Why is the most probable continuation not one where alignment fails?

If I tell a language model: "Create a sequence of actions that lead to a lot of paperclips", it is going to give me a plan that just leads to a lot of paperclips, without necessarily being aligned. However, if I say "Create a sequence of actions that lead to a lot of paperclips and is aligned", it is going to assign high probability to tokens that create an aligned plan, because that's what I specified it to do.

> And the sentence I singled out was about inner alignment; you asserted that mesa optimization wouldn't occur in such a system, but I don't see why that would be true. I also don't see why this system won't have a utility function. You can't really know this with the present level of interpretability tools.

I agree that it is possible that it could have mesa-optimisation and a utility function, but I also believe it is possible for it to have neither, because that's what happened with humans. Better interpretability tools would indeed be useful.

> One problem is that if you assign negative infinity to any outcome, probably every action has negative infinity expected value since it has a nonzero probability of leading to that outcome.

I believe it should be possible to create an AI that thinks like a human. When we, let's say, go get a glass of water to drink, we don't think "I can't do that because it has a non-zero chance of killing someone"; that is a very alien way to think. I believe human-like thinking happens by default unless you intentionally build the system to compute every hypothesis about the world, which is probably computationally expensive anyway: literally anything has a non-zero chance of happening (everybody dying, the sky turning pink, gravity being flipped; the list goes on forever), and you can't compute all of them.
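
As a side note on the expected-value point quoted above, a minimal sketch with made-up outcomes and probabilities: once any outcome is assigned utility minus infinity, every action with a nonzero chance of reaching it gets expected utility minus infinity, however small that chance is.

```python
# Minimal sketch of the quoted expected-value point; the actions, outcomes,
# and probabilities below are made up for illustration.
import math


def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)


get_glass_of_water = [
    (0.999999999, 1.0),   # mundane success
    (1e-9, -math.inf),    # the "forbidden" outcome, assigned utility -infinity
]

do_nothing = [
    (1.0 - 1e-12, 0.0),
    (1e-12, -math.inf),   # even inaction has some tiny chance of the forbidden outcome
]

print(expected_utility(get_glass_of_water))  # -inf
print(expected_utility(do_nothing))          # -inf: the -infinity term swamps everything
```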

> But that's kind of an irrelevant theoretical point because with the current training paradigm, programmers don't get to choose the utility function; if the system has one, it emerges during training and is encoded into the neural network weights, which no one understands.

Again, it might be possible for an AGI to not have a utility function at all, though we would need good evidence to prove that it doesn't have one, which is why interpretability tools are needed.

Comment by GG10 on [deleted post] 2022-10-17T15:17:48.131Z

I want to point out that nobody in the comment section gave an actual argument as to why the outer alignment method doesn't work. That isn't to say no such argument exists, but if people are going to tell me I'm wrong, I want to know why. I would like to understand:

- Why can't we just scale up SayCan to AGI and tell it "be aligned";

- Why the reasons I gave in the Asimov's Laws paragraph are wrong;

- Why it is actually necessary to do RL and have utility functions, despite the existence of SayCan.

Also, some people said that I'm disrespecting the entire body of work on alignment, which I didn't mean to do, so I'm sorry. I actually have a lot of respect for people like Eliezer, Nate Soares, Paul Christiano, Richard Ngo, and others.

Comment by GG10 on Don't take the organizational chart literally · 2022-07-25T21:57:24.620Z

This is off-topic, but I tried messaging you and got no response, so I'm just gonna say it here. Have you finished writing that post about contra EY? I'm interested in reading it.