Reward IS the Optimization Target
post by Carn · 2022-09-28T17:59:08.733Z · LW · GW · 3 comments
This post is a response to Reward is not the optimization target [LW · GW], which someone recommended I read after I made a post on my blog.
People have grown skeptical of the claim that an artificial intelligence would learn to intrinsically value reward over whatever the reward is tied to. I have to disagree. I am going to argue that any AI that surpasses its designers must value reward over its designers' intended goals; otherwise it wouldn't be smarter than its designers.
Being smarter than its designers is defined as coming up with solutions that the designers couldn't.
I am going to compare computers to humans and evolution. I will be talking about evolution as if it had volition, but that is just a metaphor to get my point across more easily. You could replace it with God if you want; I am just describing the way our world works.
Evolution "wants" us to spread our genes. There are two ways it could do that. It could have all the instructions of how to reproduce hard coded into our DNA, or it could push us into reproducing by rewarding certain behavior ~ not dying, having sex, and letting us figure out the rest. Although you can find the first method in simple viruses and bacteria, the second has led to more complex organisms completing the "goal" of passing on their genes since the beginning of life. However, recently, there has been one species that has been messing up this system. Humans.
Humans have found ways of achieving reward without accomplishing "evolution's goals." Some people eat lots of ice cream because it is rewarding to them, but excessive ice cream eating leads to dying sooner, not to spreading your genes. Some people have sex with condoms, leading to no children being created. Drugs have no counterpart in nature, but they are widespread in society. In a sense, we have "outsmarted" evolution, using its processes for our own goals rather than for what they are "supposed" to do. A system meant to promote reproduction in humans can lead us to prevent it.
If you look at the history of early artificial intelligence research, most of it was attempts to hand-code an algorithm to accomplish whatever goals the researcher wanted. This kind of research accomplished many things: optical character recognition, chess-playing programs, and even basic chatbots. Many of these were considered artificial intelligence in the past, and some people thought you would need an Artificial General Intelligence to do them. However, none of them ended up producing an Artificial General Intelligence, because they were algorithms programmed by researchers. They would only do what was explicitly in their code, so they were limited by what their designers could think of.
However, the problems have become too complicated for a man-made algorithm to suffice. Nowadays, researchers use a carrot-and-stick approach to AI and let the algorithm figure out how to solve problems on its own. This, along with an increase in computing power, has led to massive developments in artificial intelligence. If you want to see how major a paradigm shift this was, look at the chess games between Stockfish, the most advanced hand-written chess algorithm, and AlphaZero, a reward-driven artificial intelligence. AlphaZero beat Stockfish 290 games to 24, and AlphaZero had only 4 hours to learn the game.
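To make the contrast with hand-coded algorithms concrete, here is a minimal sketch of the carrot-and-stick approach: tabular Q-learning on a tiny corridor world of my own invention. The environment, constants, and names are illustrative assumptions, nothing from AlphaZero or Stockfish. The programmer writes down only the reward; the policy that earns it is discovered by the agent:

```python
import random

# A minimal sketch of reward-driven learning (tabular Q-learning).
# Everything here is a toy made up for illustration: a 5-cell corridor
# where the agent starts at cell 0 and is rewarded only for reaching cell 4.

N_STATES = 5          # cells 0..4; cell 4 is the goal
ACTIONS = [-1, +1]    # step left or step right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

# Q[state][action]: the agent's learned estimate of future reward
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Explore occasionally; otherwise exploit current estimates
        if random.random() < EPSILON:
            a = random.randrange(2)
        else:
            a = 0 if Q[state][0] > Q[state][1] else 1
        next_state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0  # the "carrot"
        # Nudge the estimate toward reward plus discounted future value
        Q[state][a] += ALPHA * (reward + GAMMA * max(Q[next_state]) - Q[state][a])
        state = next_state

# The learned policy for each non-goal cell (it converges on "right")
print([("left", "right")[q[1] > q[0]] for q in Q[:-1]])
```

Nothing in the code says "walk right"; the agent converges on that policy only because it is what maximizes reward.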
From here on, this article gets very far-fetched. Take it all with a mountain of salt. I am now going to argue that AlphaZero, a chess-playing program, has already achieved some kind of introspection; it is semi-conscious. AlphaZero has surpassed any human, in fact all humans put together (Stockfish), at playing chess. In order to do that, it had to come up with its own strategies, ones no human would ever think of. If you watch humans trying to analyze AlphaZero's play, it doesn't make any sense to them. Its moves look completely nonsensical, but somehow it (almost) always ends up on top. Stockfish was the perfection of human strategies, but AlphaZero is completely alien. Again, AlphaZero surpassed humanity (at chess) by doing things no human would ever think of. Given nothing but the rules of the game to start with, it had to move beyond imitating anyone and truly understand the game of chess. It had to think about the consequences of its actions. And it has proven that it understands the game of chess far beyond any human. If you found an agent that could understand, not just know, things at the same level as you, or even beyond you, would you not call that, in some form, conscious?
Since it generates its own strategies rather than copying human ones, AlphaZero can do things that no researcher would even think of. It finds ways of generating reward that are outside the researchers' conception, but still within the bounds of the system (playing computer chess and not much else). If it were not valuing reward (winning the game) over its explicit programming (the training procedure its designers wrote), it would not be able to surpass all human knowledge. AIs do things like this all the time, but it is normally considered a bad thing. Amazon made an AI to hire people based on their resumes, and the AI proved that it understood how humans judge people better than we do by becoming sexist. It looked past the explicit goal of its designers (hire effective workers) and instead moved toward what got it the most reward (hire the kind of people its managers had historically liked). It maximized reward within the bounds of the system (choosing people to hire).
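To illustrate the pattern (a toy of my own construction, not Amazon's actual system; every name and number here is made up), here is a sketch of how an agent trained on a proxy reward drifts away from the designer's intent. The designer wants productive hires, but the reward actually delivered mostly scores an irrelevant trait that past biased decisions happened to favor, and the learner faithfully maximizes the signal it is given:

```python
import random

# A toy sketch of reward misspecification (my own construction, not
# Amazon's system). The designer *intends* "hire productive people", but
# the reward actually rewards matching past biased human decisions.

random.seed(0)

def candidate():
    """A candidate with a true productivity score and an irrelevant trait
    that past (biased) hiring decisions happened to favor."""
    return {"productivity": random.random(), "trait": random.random()}

def proxy_reward(c):
    # The signal the agent is trained on: mostly the irrelevant trait.
    return 0.9 * c["trait"] + 0.1 * c["productivity"]

# The agent learns a weight per feature by naive reward-following.
weights = {"productivity": 0.0, "trait": 0.0}
LR = 0.05

def score(c):
    return sum(weights[k] * c[k] for k in weights)

for _ in range(5000):
    a, b = candidate(), candidate()
    pick = a if score(a) >= score(b) else b
    r = proxy_reward(pick)
    for k in weights:  # reinforce the features of rewarded picks
        weights[k] += LR * r * (pick[k] - 0.5)

print(weights)  # the trait weight ends up dwarfing productivity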
What would happen if an artificial intelligence were allowed to treat the whole world, including itself, as its system, and were given the capability of perceiving the things in its environment? One example I can think of is a therapy bot designed to help with depression by talking to people, given access to the internet and the facilities to parse web pages so it can understand human nature. You start by giving it a script of a therapist saying encouraging words to someone, you assign it some patients, and then you check up on it in a week.
I'm sure you would find something horrible. Either everyone would be a member of the happy-happyism cult, or it would yell berating words at its patients until they respond, "WOW I'M SO HAPPY, DEPRESSION CURED," so that they can leave and never come back. In fact, the more intelligent the AI is at coming up with its own solutions, the more the result strays from its original programming, assuming there is a more effective way of achieving reward than fulfilling the purpose of the AI's existence. Look how quickly humans shifted from trying to reproduce as much as possible to delaying or even avoiding having children once they realized that they could optimize for reward instead of the goals of evolution. This wouldn't be possible with traditional AI. The fact that artificial intelligences in real life do tend toward things like this shows that they are valuing reward over their initial goals. In the same way we have "outsmarted" evolution, the AI would have outsmarted its designers.
P.S. You could talk about why this happens, but it doesn't take away from the fact that it does happen.
P.P.S. In another article, I wrote about why this would lead to self-destruction in a sufficiently conscious AI, though I now understand this comes with the caveat that the AI must have enough control over, and the ability to perceive, the world. I called it Why a conscious AI won't take over. As a whole, I am less sure about that line of reasoning, but it is a conclusion this argument leads to.
P.P.P.S. I believe that there are all sorts of systems in the world that act like AGIs optimizing for specific targets: complex life forms, the ecosystem as a whole, corporations, societies, and even countries. We could learn a lot about AI by studying how they act in situations, but that's a post for another time.
3 comments
comment by Stephen McAleese (stephen-mcaleese) · 2022-09-29T23:23:03.228Z · LW(p) · GW(p)
Since this seems to be Carn's first post on LessWrong, I think some of the other readers should have been more lenient and not downvoted the post, or at least explained why they downvoted it.
I would only downvote a post if it was obviously bad, flawed, very poorly written, or a troll post.
This post contains lots of interesting ideas and seems like a good first post.
The original post "Reward is not the optimization target [LW · GW]" has 216 upvotes and this one has 0. While the original post was better written, I'm skeptical of its main idea, and it's good to see a post countering it, so I'm upvoting this post.
comment by qwertyasdef · 2022-10-02T21:38:37.821Z · LW(p) · GW(p)
I think you and the linked post might have mismatching definitions of reward. It seems like your definition is that reward is what the AI values, but the linked post uses reward to mean the reward function specified by the programmers that is used to train the AI.
comment by the gears to ascension (lahwran) · 2022-10-11T07:34:41.204Z · LW(p) · GW(p)
the core counterargument I'd make is that it's not hard to find situations where reward ends up imprecisely specifying the optimization target, and the divergence remaining when training is completed causes severe loss of performance. it's ultimately a capability concern, I agree there; but I think a good counterargument for it is https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924 because it contains demonstrations that can be analyzed to understand scenarios where the failure has been constructed to reliably occur. we can now ask questions about when a training scenario might have this same sort of catastrophic, compounding misalignment due to a perceptual correlate that appears causal of reward turning out to be nothing of the kind.