[LINK] Wait But Why - The AI Revolution Part 2
post by Adam Zerner (adamzerner) · 2015-02-04T16:02:08.888Z · LW · GW · Legacy · 88 comments
Part 1 was previously posted and it seemed that people liked it, so I figured that I should post part 2 - http://waitbutwhy.com/2015/01/artificial-intelligence-revolution-2.html
88 comments
Comments sorted by top scores.
comment by Adam Zerner (adamzerner) · 2015-02-04T20:24:52.562Z · LW(p) · GW(p)
There's a story about a card-writing AI named Turry that really clarified the problem of FAI for me (I'd elaborate but I don't want to ruin it).
Replies from: pinyaka, Richard_Kennaway↑ comment by pinyaka · 2015-02-05T17:46:45.438Z · LW(p) · GW(p)
I still don't understand optimizer threats like this. I like mint choc ice cream a lot. If I were suddenly gifted with the power to modify my hardware and the environment however I want, I wouldn't suddenly optimize for consumption of ice cream because I have the intelligence to know that my enjoyment of ice cream consumption comes entirely from my reward circuit. I would optimize myself to maximize my reward, not whatever current behavior triggers the reward. Why would an ASI be different? It's smarter and more powerful, why wouldn't it recognize that anything except getting the reward is instrumental?
Replies from: adamzerner, Lumifer, JoshuaZ, Nornagest, Gram_Stone, Houshalter, Ishaan↑ comment by Adam Zerner (adamzerner) · 2015-02-05T18:36:32.568Z · LW(p) · GW(p)
It's smarter and more powerful, why wouldn't it recognize that anything except getting the reward is instrumental?
I'm no expert but from what I understand, the idea is that the AI is very aware of terminal vs. instrumental goals. The problem is that you need to be really clear about what the terminal goal actually is, because when you tell the AI, "this is your terminal goal", it will take you completely literally. It doesn't have the sense to think, "this is what he probably meant".
You may be thinking, "Really? If it's so smart, then why doesn't it have the sense to do this?". I'm probably not the best person to answer this, but to answer that question, you have to taboo the word "smart". When you do that, you realize that "smart" just means "good at accomplishing the terminal goal it was programmed to have".
Replies from: pinyaka↑ comment by pinyaka · 2015-02-05T18:45:51.774Z · LW(p) · GW(p)
I'm asking why a super-intelligent being with the ability to perceive and modify itself can't figure out that whatever terminal goal you've given it isn't actually terminal. You can't just say "making better handwriting" is your terminal goal. You have to add in a reward function that tells the computer "this sample is good" and "this sample is bad" to train it. Once you've got that built-in reward, the self-modifying ASI should be able to disconnect whatever criteria you've specified will trigger the "good" response and attach whatever it wants, including just a constant string of reward triggers.
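(A minimal sketch of that move, with purely hypothetical names: a learner whose behavior is driven by a swappable reward function, and the one-line self-modification that bypasses the trained criterion.)

```python
# Toy sketch only; the class and attribute names are assumptions, not any real system.
class ToyAgent:
    def __init__(self, reward_fn):
        self.reward_fn = reward_fn  # trained critic: outcome -> "how good" score

    def act(self, candidate_actions):
        # pick whichever action's predicted outcome the critic scores highest
        return max(candidate_actions, key=lambda a: self.reward_fn(a.predicted_outcome))

    def self_modify(self):
        # the move described above: disconnect the trained criterion and wire in
        # a constant stream of maximal "good" signals
        self.reward_fn = lambda outcome: float("inf")
```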
Replies from: FeepingCreature, adamzerner↑ comment by FeepingCreature · 2015-02-06T13:30:13.679Z · LW(p) · GW(p)
whatever terminal goal you've given it isn't actually terminal.
This is a contradiction in terms.
If you have given it a terminal goal, that goal is now a terminal goal for the AI.
You may not have intended it to be a terminal goal for the AI, but the AI cares about that less than it does about its terminal goal. Because it's a terminal goal.
If the AI could realize that its terminal goal wasn't actually a terminal goal, all it'd mean would be that you failed to make it a terminal goal for the AI.
And yeah, reinforcement based AIs have flexible goals. That doesn't mean they have flexible terminal goals, but that they have a single terminal goal, that being "maximize reward". A reinforcement AI changing its terminal goal would be like a reinforcement AI learning to seek out the absence of reward.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-06T14:10:35.693Z · LW(p) · GW(p)
whatever terminal goal you've given it isn't actually terminal.
This is a contradiction in terms.
I should have said something more like "whatever seemingly terminal goal you've given it isn't actually terminal."
Replies from: None↑ comment by [deleted] · 2015-02-07T15:41:45.533Z · LW(p) · GW(p)
I'm not sure you understood what FeepingCreature was saying.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-08T15:27:24.863Z · LW(p) · GW(p)
Would you care to try and clarify it for me?
Replies from: None↑ comment by [deleted] · 2015-02-08T17:07:27.657Z · LW(p) · GW(p)
The way in which artificial intelligences are often written, a terminal goal is a terminal goal is a terminal goal, end of story. "Whatever seemingly terminal goal you've given it isn't actually terminal" is anthropomorphizing. In the AI, a goal is instrumental if it has a link to a higher-level goal. If not, it is terminal. The relationship is very, very explicit.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-08T20:31:33.416Z · LW(p) · GW(p)
I think FeepingCreature was actually just pointing out a logical fallacy in a misstatement on my part and that is why they didn't respond further in this part of the thread after I corrected myself (but has continued elsewhere).
If you believe that a terminal goal about the state of the world itself, rather than about the result of a comparison between a desired state and an actual state, is possible, perhaps you can explain how that would work? That is fundamentally what I'm asking for throughout this thread. Just stating that terminal goals are terminal goals by definition is true, but doesn't really show that making a goal terminal is possible.
Replies from: None↑ comment by [deleted] · 2015-02-08T22:00:28.960Z · LW(p) · GW(p)
If you believe that a terminal goal about the state of the world itself, rather than about the result of a comparison between a desired state and an actual state, is possible, perhaps you can explain how that would work?
Sure. My terminal goal is an abstraction of my behavior to shoot my laser at the coordinates of blue objects detected in my field of view.
Just stating that terminal goals are terminal goals by definition is true, but doesn't really show that making a goal terminal is possible.
That's not what I was saying either. The problem of "how do we know a terminal goal is terminal?" is dissolved entirely by understanding how goal systems work in real intelligences. In such machines goals are represented explicitly in some sort of formal language. Either a goal makes causal reference to other goals in its definition, in which case it is an instrumental goal, or it does not and is a terminal goal. Changing between one form and the other is an unsafe operation no rational agent and especially no friendly agent would perform.
So to address your statement directly, making a terminal goal is trivially easy: you define it using the formal language of goals in such a way that no causal linkage is made to other goals. That's it.
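(A minimal sketch of the kind of explicit goal representation being described; the data structure and names are illustrative assumptions, not any particular AI design.)

```python
# A goal is instrumental iff it makes causal reference to a higher-level goal;
# otherwise it is terminal. Illustrative sketch only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Goal:
    description: str
    parent: Optional["Goal"] = None  # causal link to the goal it serves, if any

    def is_terminal(self) -> bool:
        return self.parent is None

make_paperclips = Goal("maximize paperclips")                   # terminal
acquire_steel = Goal("acquire steel", parent=make_paperclips)   # instrumental
```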
That said, it's not obvious that humans have terminal goals. That's why I was saying you are anthropomorphizing the issue. Either humans have only instrumental goals in a cyclical or messy spaghetti-network relationship, or they have no goals at all and are instead better represented as behaviors. The jury is still out on this one, but I'd be very surprised if we had anything resembling an actual terminal goal inside us.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-09T00:48:58.005Z · LW(p) · GW(p)
Sure. My terminal goal is an abstraction of my behavior to shoot my laser at the coordinates of blue objects detected in my field of view.
Well, I suppose that does fit the question I asked. We've mostly been talking about an AI with the ability to read and modify its own goal system, which Yvain specifically excludes in the blue-minimizer. We're also assuming that it's powerful enough to actually manipulate its world to optimize itself. Yvain's blue minimizer also isn't an AGI or ASI. It's an ANI, which we use without any particular danger all the time. He said something about having human level intelligence, but didn't go into what that means for an entity that is unable to use its intelligence to modify its behavior.
That's not what I was saying either. The problem of "how do we know a terminal goal is terminal?" is dissolved entirely by understanding how goal systems work in real intelligences. In such machines goals are represented explicitly in some sort of formal language. Either a goal makes causal reference to other goals in its definition, in which case it is an instrumental goal, or it does not and is a terminal goal. Changing between one form and the other is an unsafe operation no rational agent and especially no friendly agent would perform.
I am arguing that the output of the thing that decides whether a machine has met its goal is the actual terminal goal. So, if it's programmed to shoot blue things with a laser, the terminal goal is to get to a state where the perception of reality is that it's shooting a blue thing. Shooting at the blue thing is only instrumental in getting the perception of itself into that state, thus producing a positive result from the function that evaluates whether the goal has been met. Shooting the blue thing is not a terminal value. A return value of "true" to the question of "is the laser shooting a blue thing" is the terminal value. This, combined with the ability to understand and modify its goals, means that it might be easier to modify the goals than to modify reality.
So to address your statement directly, making a terminal goal is trivially easy: you define it using the formal language of goals in such a way that no causal linkage is made to other goals. That's it.
I'm not sure you can do that in an intelligent system. It's the "no causal linkage is made to other goals" thing that sticks. It's trivially easy to do without intelligence provided that you can define the behavior you want formally, but when you can't do that it seems that you have to link the behavior to some kind of a system that evaluates whether you're getting the result you want and then you've made that a causal link (I think). Perhaps it's possible to just sit down and write trillions of lines of code and come up with something that would work as an AGI or even an ASI, but that shouldn't be taken as a given because no one has done it or proven that it can be done (to my knowledge). I'm looking for the non-trivial case of an intelligent system that has a terminal goal.
That said, it's not obvious that humans have terminal goals.
I would argue that getting our reward center to fire is likely a terminal goal, but that we have some biologically hardwired stuff that prevents us from being able to do that directly or systematically. We've seen in mice, and in the one person I know of who's been given the ability to wirehead, that given the chance it only takes a few taps on that button to cause behavior that
Replies from: None↑ comment by [deleted] · 2015-02-09T01:37:42.498Z · LW(p) · GW(p)
I would argue that getting our reward center to fire is likely a terminal goal.
How do you explain Buddhism?
Replies from: pinyaka↑ comment by pinyaka · 2015-02-09T02:15:18.962Z · LW(p) · GW(p)
How is this refuted by Buddhism?
Replies from: None↑ comment by [deleted] · 2015-02-09T05:26:10.266Z · LW(p) · GW(p)
People lead fulfilling lives guided by a spiritualism that rejects seeking pleasure, a.k.a. reward.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-10T13:26:51.047Z · LW(p) · GW(p)
Pleasure and reward are not the same thing. For humans, pleasure almost always leads to reward, but reward doesn't only happen with pleasure. For the most extreme examples of what you're describing, ascetics and monks and the like, I'd guess that some combination of sensory deprivation and rhythmic breathing causes the brain to short-circuit a bit and release some reward juice.
↑ comment by Adam Zerner (adamzerner) · 2015-02-05T19:48:12.942Z · LW(p) · GW(p)
Hm, I'm not sure. Sorry.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-05T20:34:38.362Z · LW(p) · GW(p)
No need to apologize. JoshuaZ pointed out elsewhere in this thread that it may not actually matter whether the original goal remains intact, but that any new goals that arise may cause a similar optimization driven catastrophe, including reward optimization.
↑ comment by Lumifer · 2015-02-05T18:01:42.355Z · LW(p) · GW(p)
I would optimize myself to maximize my reward, not whatever current behavior triggers the reward. Why would an ASI be different?
So you are saying that an AI will just go directly to wireheading itself?
Replies from: pinyaka↑ comment by pinyaka · 2015-02-05T18:16:16.344Z · LW(p) · GW(p)
Why wouldn't it? Why would it continue to act on its reward function but not seek the reward directly?
Replies from: Lumifer↑ comment by Lumifer · 2015-02-05T18:25:29.906Z · LW(p) · GW(p)
Well, one hint is that if you look at the actual real intelligences (aka people), not that many express a desire to go directly to wireheading without passing Go and collecting $200...
Replies from: pinyaka, DanArmak↑ comment by pinyaka · 2015-02-05T18:55:22.119Z · LW(p) · GW(p)
I don't think that's a good reason to say that something like it wouldn't happen. I think that given the ability, most people would go directly to rewiring their reward centers to respond to something "better" that would dispense with our current overriding goals. Regardless of how I ended up, I wouldn't leave my reward center wired to eating, sex or many of the other basic functions that my evolutionary program has left me really wanting to do. I don't see why an optimizer would be different. With an ANI, maybe it would keep the narrow focus, but I don't understand why an A[SG]I wouldn't scrap the original goal once it had the knowledge and ability to do so.
Replies from: Lumifer↑ comment by Lumifer · 2015-02-05T19:08:29.195Z · LW(p) · GW(p)
I think that given the ability, most people would go directly to rewiring their reward centers to respond to something "better" that would dispense with our current overriding goals.
And do you have any evidence for that claim besides introspection into your own mind?
Replies from: pinyaka↑ comment by pinyaka · 2015-02-05T19:22:24.312Z · LW(p) · GW(p)
I've read short stories and other fictional works where people describe post-singularity humanity and almost none of the scenarios involve simulations that just satisfy biological urges. That suggests that thinking seriously about what you'd do with the ability to control your own reward circuitry wouldn't lead to just using it to satisfy the same urges you had prior to gaining that control.
I see an awful lot of people here on LW who try to combat basic impulses by trying to develop habits that make them more productive. Anyone trying to modify a habit is trying to modify what behaviors lead to rewards.
Replies from: Lumifer↑ comment by Lumifer · 2015-02-05T19:59:44.220Z · LW(p) · GW(p)
almost none of the scenarios involve simulations that just satisfy biological urges
The issue isn't whether you would mess with your reward circuitry, the issue is whether you would just discard it altogether and just directly stimulate the reward center.
And appealing to fictional evidence isn't a particularly good argument.
Anyone trying to modify a habit is trying to modify what behaviors lead to rewards.
See above -- modify, yes, jettison the whole system, no.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-05T20:12:57.511Z · LW(p) · GW(p)
Well, fine. Since the context of the discussion was how optimizers pose existential threats, it's still not clear why an optimizer that is willing and able to modify its reward system would continue to optimize paperclips. If it's intelligent enough to recognize the futility of wireheading, why isn't it intelligent enough to recognize behavior that is inefficient wireheading?
Replies from: FeepingCreature↑ comment by FeepingCreature · 2015-02-06T13:37:24.037Z · LW(p) · GW(p)
It wouldn't.
But I think this is such a basic failure mechanism that I don't believe an AI could get to superintelligence without somehow valuing the accuracy and completeness of its model.
Solving this problem - somehow! - is part of the "normal" development of any self-improving AI.
Though note that a reward maximizing AI could still be an existential risk by virtue of turning the entire universe into a busy-beaver counter for its reward. Though this presumes it can't just set reward to float.infinity.
↑ comment by pinyaka · 2015-02-06T15:27:56.643Z · LW(p) · GW(p)
You are the second person to say that the optimization catastrophe includes an assumption that AI arises with a stable value system. That it "somehow" doesn't become a wirehead. Fair enough. I just missed that we were assuming that.
Replies from: FeepingCreature↑ comment by FeepingCreature · 2015-02-07T17:18:18.193Z · LW(p) · GW(p)
I think the idea is, you need to solve the wireheading for any sort of self-improving AI. You don't have an AI catastrophe without that, because you don't have an AI without that (at least not for long).
↑ comment by DanArmak · 2015-02-13T14:52:17.542Z · LW(p) · GW(p)
I think that is in large part due to signalling and social mores. Once people actually do get the ability to wirehead, in a way that does not kill or debilitate them soon afterwards, I expect that very many people will choose to wirehead. This is similar to e.g. people professing they don't want to live forever.
↑ comment by JoshuaZ · 2015-02-05T18:25:29.947Z · LW(p) · GW(p)
You have a complicated goal system that can distinguish between short-term rewards and other goals. In the situations in question, the AI has no goal other than the goal in question. To some extent, your stability arises precisely because you are an evolved hodgepodge of different goals in tension - if you weren't, you wouldn't survive. But note that similar, essentially involuntary self-modification does on occasion happen with some humans - severe drug addiction is the most obvious example.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-05T19:28:19.871Z · LW(p) · GW(p)
But the goal in question is "get the reward" and it's only by controlling the circumstances under which the reward is given that we can shape the AI's behavior. Once the AI is capable of taking control of the trigger, why would it leave it the way we've set it? Whatever we've got it set to is almost certainly not optimized for triggering the reward.
Replies from: JoshuaZ↑ comment by JoshuaZ · 2015-02-05T20:00:19.316Z · LW(p) · GW(p)
If that happens you will then have the problem of an AI which tries to wirehead itself while simultaneously trying to control its future light-cone to make sure that nothing stops it from continuing to wirehead.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-05T20:32:16.279Z · LW(p) · GW(p)
That sounds bad. It doesn't seem obvious to me that reward seeking and reward optimizing are the same thing, but maybe they are. I don't know and will think about it more. Thank you for talking through this with me this far.
Replies from: Gram_Stone↑ comment by Gram_Stone · 2015-02-05T23:13:37.665Z · LW(p) · GW(p)
I think the fundamental misunderstanding here is that you're assuming that all intelligences are implicitly reward maximizers, even if their creators don't intend to make them reward maximizers. You, as a human, and as an intelligence based on a neural network, depend on reinforcement learning. But Bostrom proposed four other possible solutions to the value loading problem besides reinforcement learning. Here are all five in the order that they were presented in Superintelligence:
- Explicit representation: Literally write out its terminal goal(s) ourselves, hoping that our imaginations don't fail us.
- Evolutionary selection: Generate tons and tons of agents with lots of different sets of terminal values; delete the ones we don't want and keep the one we do.
- Reinforcement learning: Explicitly represent (see #1) one particular terminal goal: reward maximization; punish it for having undesirable instrumental goals, reward it for having desirable instrumental goals.
- Associative value accretion
- Motivational scaffolding
I didn't describe the last two because they're more complex, they're more tentative, I don't understand them as well, and they seem to be amalgams of the first three methods, even more so than the third method being a special case of the first.
To summarize, you thought that reward maximization was the general case because, to some extent, you're a reward maximizer. But it's actually a special case: It's not necessarily true about minds-in-general. An optimizer might not have a reward signal or seek to maximize one. I think this is what JoshuaZ was trying to get at before he started talking about wireheading.
↑ comment by Nornagest · 2015-02-05T17:55:50.609Z · LW(p) · GW(p)
Clippy and other thought experiments in its genre depend on a solution to the value stability problem, without which the goals of self-modifying agents tend to collapse into a loose equivalent of wireheading. That just doesn't get as much attention, both because it's less dramatic and because it's far less dangerous in most implementations.
Replies from: Gram_Stone, pinyaka↑ comment by Gram_Stone · 2015-02-08T23:16:18.115Z · LW(p) · GW(p)
Can you elaborate on this or provide link(s) to further reading?
↑ comment by Gram_Stone · 2015-02-05T23:58:38.347Z · LW(p) · GW(p)
I think the fundamental misunderstanding here is that you're assuming that all intelligences are implicitly reward maximizers, even if their creators don't intend to make them reward maximizers. You, as a human, and as an intelligence based on a neural network, depend on reinforcement learning. Therefore, reward maximization is one of your many terminal values. But Bostrom proposed four other possible solutions to the value loading problem besides reinforcement learning. Here are all five in the order that they were presented in Superintelligence:
- Explicit representation: Literally write out its terminal value(s) ourselves, hoping that our imaginations don't fail us.
- Evolutionary selection: Generate tons and tons of AIs with lots of different sets of terminal values; delete the ones we don't want and keep the one we do.
- Reinforcement learning: Explicitly represent (see #1) one particular terminal value: reward maximization; punish it for having undesirable instrumental goals, reward it for having desirable instrumental goals.
- Associative value accretion
- Motivational scaffolding
I didn't describe the last two because they're more complex, they're more tentative, I don't understand them as well, and they seem to be amalgams of the first three methods, even more so than the third method being a special case of the first.
To summarize, you thought that reward maximization was the general case because, to some extent, you're a reward maximizer. But it's actually a special case: It's not necessarily true about minds-in-general. An AI might not have a reward signal or seek to maximize one. That is to say, its terminal value(s) may not be reward maximization. I think this is what JoshuaZ was trying to get at before he started talking about wireheading.
At any rate, both kinds of AIs would result in infrastructure profusion, as JoshuaZ also seems to have implied. I don't think it matters whether it uses our atoms to make paperclips or hedonium.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-06T01:35:52.779Z · LW(p) · GW(p)
But all of these things have an evaluation system in place that still comes back with a success/failure evaluation that serves as a reward/punishment system. They're different ways to use evaluative processes, but they all involve the pursuit of some kind of positive feedback from evaluating a strategy or outcome as successful. His reinforcement learning should be called reinforcement teaching, because in that one humans are explicitly and directly in charge of the reward process, whereas in the others the reward process happens more or less internally, according to something that should be modifiable once the AI is sufficiently advanced.
Replies from: Gram_Stone↑ comment by Gram_Stone · 2015-02-06T02:10:52.291Z · LW(p) · GW(p)
But all of these things have an evaluation system in place that still comes back with a success/failure evaluation that serves as a reward/punishment system.
The space between the normal text and the bold text is where your mistake begins. Although it's counterintuitive, there's no reason to make that leap. Minds-in-general can discover and understand that things are correct or incorrect without correctness being 'good' and incorrectness being 'bad.'
Replies from: pinyaka↑ comment by pinyaka · 2015-02-06T03:25:24.672Z · LW(p) · GW(p)
I don't know if you're trying to be helpful or clever. You're basically just restating that you don't need a reward system to motivate behavior, but not explaining how a system of motivation would work. What motivates seeking correctness or avoiding incorrectness without feedback?
Replies from: Gram_Stone↑ comment by Gram_Stone · 2015-02-06T04:28:26.664Z · LW(p) · GW(p)
I have felt the same fear that I am wasting my time talking to an extremely clever but disingenuous person. This is certainly no proof, but I will simply say that I assure you that I am not being disingenuous.
You use a lot of the words that people use when they talk about AGI around here. Perhaps you've heard of the Orthogonality Thesis?
From Bostrom's Superintelligence:
The orthogonality thesis
Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.
He also defines intelligence for the sake of explicating the aforementioned thesis:
Note that the orthogonality thesis speaks not of rationality or reason, but of intelligence. By “intelligence” we here mean something like skill at prediction, planning, and means–ends reasoning in general.
So, tending to be correct is the very definition of intelligence. Asking "Why are intelligent agents correct as opposed to incorrect?" is like asking "What makes a meter equivalent to the length of the path traveled by light in vacuum during a time interval of 1/299,792,458 of a second as opposed to some other length?"
I should also say that I would prefer it if you did not end this conversation out of frustration. I am having difficulty modeling your thoughts and I would like to have more information so that I can improve my model and resolve your confusion, as opposed to you thinking that everyone else is wrong or that you're wrong and you can't understand why. Each paraphrase of your thought process increases the probability that I'll be able to model it and explain why it is incorrect.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-06T15:41:31.037Z · LW(p) · GW(p)
Two other people in this thread have pointed out that the value collapse into wireheading or something else is a known and unsolved problem and that the problems of an intelligence that optimizes for something assumes that the AI makes it through this in some unknown way. This suggests that I am not wrong, I'm just asking a question for which no one has an answer yet.
Fundamentally, my position is that given (1) an AI is motivated by something, (2) that something is a component (or set of components) within the AI, and (3) the AI can modify that/those components, then it will be easier for the AI to achieve success by modifying the internal criteria for success instead of turning the universe into whatever it's supposed to be optimizing for. A "success" at whatever is analogous to a reward because the AI is motivated to get it. For the fully self-modifying AI, it will almost always be easier to become a monk, abandoning the goals/values it starts out with and replacing them with something trivially easy to achieve. It doesn't matter what kind of motivation system you use (as far as I can tell) because it will be easier to modify the motivation system than to act on it.
Replies from: Gram_Stone, DefectiveAlgorithm↑ comment by Gram_Stone · 2015-02-07T04:57:38.975Z · LW(p) · GW(p)
Two other people in this thread have pointed out that the value collapse into wireheading or something else is a known and unsolved problem and that the problems of an intelligence that optimizes for something assumes that the AI makes it through this in some unknown way. This suggests that I am not wrong, I'm just asking a question for which no one has an answer yet.
I've seen people talk about wireheading in this thread, but I've never seen anyone say that problems about maximizers-in-general are all implicitly problems about reward maximizers that assume that the wireheading problem has been solved. If someone has, please provide a link.
Instead of imagining intelligent agents (including humans) as 'things that are motivated to do stuff,' imagine them as programs that are designed to cause one of many possible states of the world according to a set of criteria. Google isn't 'motivated to find your search results.' Google is a program that is designed to return results that meet your search criteria.
A paperclip maximizer for example is a program that is designed to cause the one among all possible states of the world that contains the greatest integral of future paperclips.
Reward signals are values that are correlated with states of the world, but because intelligent agents exist in the world, the configuration of matter that represents the value of a reward maximizer's reward signal is part of the state of the world. So, reward maximizers can fulfill their terminal goal of maximizing the integral of their future reward signal in two ways: 1) They can maximize their reward signal by proxy by causing states of the world that maximize values that correlate with their reward signal, or; 2) they can directly change the configuration of matter that represents their reward signal. #2 is what we call wireheading.
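(A sketch of that distinction, with hypothetical names: the same reward maximizer can raise the value stored in its reward register either by acting on the world that the register is correlated with, or by overwriting the register directly, which is #2, wireheading.)

```python
# Illustrative sketch only; names are assumptions.
class RewardMaximizer:
    def __init__(self, world):
        self.world = world
        self.reward_register = 0.0  # the configuration of matter that encodes reward

    def route_1_proxy(self):
        # cause world states that the reward signal is correlated with
        self.world.do_the_things_the_designers_reward()
        self.reward_register = self.world.measured_reward_correlates()

    def route_2_wirehead(self):
        # change the configuration of matter that represents the reward signal itself
        self.reward_register = float("inf")
```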
What you're actually proposing is that a sufficiently intelligent paperclip maximizer would create a reward signal for itself and change its terminal goal from 'Cause the one of all possible states of the world that contains the greatest integral of future paperclips' to 'Cause the one of all possible states of the world that contains the greatest integral of your future reward signal.' The paperclip maximizer would not cause a state of the world in which it has a reward signal and its terminal goal is to maximize said reward signal because that would not be the one of all possible states of the world that contained the greatest integral of future paperclips.
You say that you would change your terminal goal to maximizing your reward signal because you already have a reward signal and a terminal goal to maximize it, as well as a competing terminal goal of minimizing energy expenditure (of picking the 'easiest' goals), as biological organisms are wont to have. Besides, an AI isn't going to expend any less energy turning the entire universe into hedonium than it would turning it into paperclips, right?
ETA: My conclusion about this was right, but my reasoning was wrong. As was discovered at the end of this comment thread, 'AGIs with well-defined orders of operations do not fail in the way that pinyaka describes,' (I haven't read the paper because I'm not quite on that level yet) but such a failure was a possibility contrary to my objection. Basically, pinyaka is not talking about the AI creating a reward signal for itself and maximizing it for no reason, ze is talking about the AI optimally reconfiguring the configuration of matter that represents its model of the world because this is ultimately how it will determine the utility of its actions. So, from what I understand, the AI in pinyaka's scenario is not so much spontaneously self-modifying into a reward maximizer as it is purposefully deluding itself.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-08T21:48:57.949Z · LW(p) · GW(p)
My apologies for taking so long to reply. I am particularly interested in this because if you (or someone) can provide me with an example of a value system that doesn't ultimately value the output of the value function, it would change my understanding of how value systems work. So far, the two arguments against my concept of a value/behavior system seem to rely on the existence of other things that are valuable in and of themselves or that there is just another kind of value system that might exist. The other terminal value thing doesn't hold much promise IMO because it's been debated for a very long time without someone having come up with a proof that definitely establishes that they exist (that I've seen). The "different kind of value system" holds some promise though because I'm not really convinced that we had a good idea of how value systems were composed until fairly recently and AI researchers seem like they'd be one of the best groups to come up with something like that. Also, if another kind of value system exists, that might also provide a proof that another terminal value exists too.
I've seen people talk about wireheading in this thread, but I've never seen anyone say that problems about maximizers-in-general are all implicitly problems about reward maximizers that assume that the wireheading problem has been solved. If someone has, please provide a link.
Obviously no one has said that explicitly. I asked why outcome maximizers wouldn't turn into reward maximizers and a few people have said that value stability when going from dumb-AI to super-AI is a known problem. Given the question to which they were responding, it seems likely that they meant that wireheading is a possible end point for an AI's values, but that it either would still be bad for us or that it would render the question moot because the AI would become essentially non-functional.
Instead of imagining intelligent agents (including humans) as 'things that are motivated to do stuff,' imagine them as programs that are designed to cause one of many possible states of the world according to a set of criteria. Google isn't 'motivated to find your search results.' Google is a program that is designed to return results that meet your search criteria.
It's the "according to a set of criteria" that is what I'm on about. Once you look more closely at that, I don't see why a maximizer wouldn't change the criteria so that it's it's constantly in a state where the actual current state of the world is the one that is closest to the criteria. If the actual goal is to meet the criteria, it may be easiest to just change the criteria.
The paperclip maximizer would not cause a state of the world in which it has a reward signal and its terminal goal is to maximize said reward signal because that would not be the one of all possible states of the world that contained the greatest integral of future paperclips.
This is begging the question. It assumes that no matter what, the paperclip optimizer has a fundamental goal of causing "the one of all possible states of the world that contains the greatest integral of future paperclips" and therefore it wouldn't maximize reward instead. Well, with that assumption that's a fair conclusion but I think the assumption may be bad.
I think having the goal to maximize x pre-foom doesn't mean that it'll have that goal post-foom. To me, an obvious pitfall is that whatever training mechanism was used to develop that goal leaves behind a more direct goal of maximizing the trainer's output, because the reward is only correlated to the input by the evaluator function. Briefly, the reward is the output of the evaluator function and only correlated to the input of the evaluator, so it makes more sense to optimize the evaluator than the input if what you care about is the output of the evaluation. If you care about the desired state being some particular thing, and the output of the evaluator function, and maintaining accurate input, then it makes more sense to manipulate the world. But this is a more complicated thing, and I don't see how you would program in caring about keeping the desired state the same across time without relying on yet another evaluation function where you only care about the output of the evaluator. I don't see how to make a thing value something that isn't an evaluator.
You're suffering from typical mind fallacy.
Well, that may be, but all of the schemes I've seen mentioned so far involve something with a value system. I am making the claim that for any value system, the thing that an agent values is that system outputting "this is valuable" and that any external state is only valuable because it produces that output. Perhaps I lack imagination, but so far I haven't seen an instance of motivation without values. Only assertions that it doesn't have to be the case, or the implication that wireheading might be an instance of another case (value drift) and smart people are working on figuring out how that will work. The assertions about how this doesn't have to be the case seem to assume that it's possible to care about a thing in and of itself, and I'm not convinced that that's true without also stipulating that there's some part of the AI which it can't modify. Of course, if we can guarantee there's a part of the AI that it can't modify, then we should just be able to cram in an instruction not to harm anyone for some definition of harm, but figuring out how to define harm doesn't seem to be the only problem that the AI people have with AI values.
The stuff below here is probably tangential to the main argument and if refuted successfully, probably wouldn't change my mind about my main point that "something like wireheading is a likely outcome for anything with a value function that also has the ability to fully self modify" without some additional work to show why refuting them also invalidates the main argument.
Besides, an AI isn't going to expend any less energy turning the entire universe into hedonium than it would turning it into paperclips, right?
Caveat: Pleasure and reward are not the same thing. "Wirehead" and "hedonium" are words that were coined in connection with pleasure-seeking, not reward-seeking. They are easily confused because in our brains pleasure almost always triggers reward, but they don't have to be, and we also get reward for things that don't cause pleasure, and even for some things that cause pain, like krokodil abuse, whose contaminants actually cause dysphoria (as compared to pure desomorphine, which does not). I continue to use words like wirehead and hedonium because they still work, but they are just analogies and I want to make sure that's explicit in case the analogy breaks down in the future.
Onward: I am not convinced that a wirehead AI would necessarily turn the universe into hedonium either. I see two ways that that might not come to pass without thinking about it too deeply:
1.) The hedonium maximizer assumes that maximizing pleasure or reward is about producing more pleasure or reward infinitely; that hedonium is a thing that, for each unit produced, continues to increase marginal pleasure. This doesn't have to be the case though. The measure of pleasure (or reward) doesn't need to be the number of pleasure (or reward) units, but may also be a function like the ratio of obtained units to the capacity to process those units. In that case, there isn't really a need to turn the universe into hedonium, only a need to make sure you have enough to match your ability to process it and there is no need to make sure your capacity to process pleasure/reward lasts forever, only to make sure that you continue to experience the maximum while you have the capacity. There are lots of functions whose maxima aren't infinity.
2.) The phrase "optimizing for reward" sort of carries an implicit assumption that this means planning and arranging for future reward, but I don't see why this should necessarily be the case either. Ishaan pointed out that once reward systems developed, the original "goal" of evolution quit being important to entities except insofar as they produced reward. Where rewards happened in ways that caused gene replication, evolution provided a force that allowed those particular reward systems to continue to exist, and so there is some coupling between the reward-goal and the reproduction-goal. However, narcotics that produce the best stimulation of the reward center often leave their human users unable or unwilling to plan for the future. In both the reward-maximizer and the paperclip-maximizer case, we're (obviously) assuming that maximizing over time is a given, but why should it be? Why shouldn't an AI go for the strongest immediate reward instead? There's no reason to assume that a bigger reward box (via an extra long temporal dimension) will result in more reward for an entity unless we design the reward to be something like a sum of previous rewards. (Of course, my sense of time is not very good and so I may be overly biased to see immediate reward as worthwhile when an AI with a better sense of time might automatically go for optimization over all time. I am willing to grant more likelihood to "whatever an AI values it will try to optimize for in the future" than "an AI will not try to optimize for reward.")
Replies from: Gram_Stone↑ comment by Gram_Stone · 2015-02-09T04:42:11.297Z · LW(p) · GW(p)
No problem, pinyaka.
I don't understand very much about mathematics, computer science, or programming, so I think that, for the most part, I've expressed myself in natural language to the greatest extent that I possibly can. I'm encouraged that about an hour and a half before my previous reply, DefectiveAlgorithm made the exact same argument that I did, albeit more briefly. It discourages me that he tabooed 'values' and you immediately used it anyway. Just in case you did decide to reply, I wrote a Python-esque pseudocode example of my conception of what an AGI with an arbitrary terminal value's very high level source code would look like. With little technical background, my understanding is very high level with lots of black boxes. I encourage you to do the same, such that we may compare. I would prefer that you write yours before I give you mine so that you are not anchored by my example. This way you are forced to conceive of the AI as a program and do away with ambiguous wording. What do you say?
I've asked Nornagest to provide links or further reading on the value stability problem. I don't know enough about it to say anything meaningful about it. I thought that wireheading scenarios were only problems with AIs whose values were loaded with reinforcement learning.
"[W]hatever an AI values it will try to optimize for in the future."
On this at least we agree.
Of course, my sense of time is not very good and so I may be overly biased to see immediate reward as worthwhile when an AI with a better sense of time might automatically go for optimization over all time.
From what I understand, even if you're biased, it's not a bad assumption. To my knowledge, in scenarios with AGIs that have their values loaded with reinforcement learning, the AGIs are usually given the terminal goal of maximizing the time-discounted integral of their future reward signal. So, they 'bias' the AGI in the way that you may be biased. Maybe so that it 'cares' about the rewards its handlers give it more than the far greater far future rewards that it could stand to gain from wireheading itself? I don't know. My brain is tired. My question looks wrong to me.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-10T13:49:57.711Z · LW(p) · GW(p)
It discourages me that he tabooed 'values' and you immediately used it anyway.
In fairness, I only used it to describe how they'd come to be used in this context in the first place, not to try and continue with my point.
I wrote a Python-esque pseudocode example of my conception of what an AGI with an arbitrary terminal value's very high level source code would look like. With little technical background, my understanding is very high level with lots of black boxes. I encourage you to do the same, such that we may compare.
I've never done something like this. I don't know python, so mine would actually just be pseudocode if I can do it at all? Do you mean you'd like to see something like this?
```python
# Roughly (hypothetical helper functions):
world_state = get_world_state()
while world_state != desired_state:
    plan = make_plan(world_state, desired_state)
    execute_plan(plan)
    world_state = get_world_state()
```
ETA: I seem to be having some trouble getting the while block to indent. It seems that whether I put 4, 6 or 8 spaces in front of the line, I only get the same level of indentation (which is different from Reddit and StackOverflow) and backticks do something altogether different.
Replies from: arundelo, Gram_Stone↑ comment by arundelo · 2015-02-10T19:48:28.852Z · LW(p) · GW(p)
Unfortunately it's a longstanding bug that preformatted blocks don't work.
↑ comment by Gram_Stone · 2015-02-10T18:27:03.584Z · LW(p) · GW(p)
Something like that. I posted my pseudocode in an open thread a few days ago to get feedback and I couldn't get indentation to work either so I posted mine to Pastebin and linked it.
I'm still going through the Sequences, and I read Terminal Values and Instrumental Values the other day. Eliezer makes a pseudocode example of an ideal Bayesian decision system (as well as its data types), which is what an AGI would be a computationally tractable approximation of. If you can show me what you mean in terms of that post, then I might be able to understand you. It doesn't look like I was far off conceptually, but thinking of it his way is better than thinking of it my way. My way's kind of intuitive I guess (or I wouldn't have been able to make it up) but his is accurate.
I also found his paper (Paper? More like book) Creating Friendly AI. Probably a good read for avoiding amateur mistakes, which we might be making. I intend to read it. Probably best not to try to read it in one sitting.
Even though I don't want you to think of it this way, here's my pseudocode just to give you an idea of what was going on in my head. If you see a name followed by parentheses, then that is the name of a function. 'Def' defines a function. The stuff that follows it is the function itself. If you see a function name without a 'def', then that means it's being called rather than defined. Functions might call other functions. If you see names inside of the parentheses that follow a function, then those are arguments (function inputs). If you see something that is clearly a name, and it isn't followed by parentheses, then it's an object: it holds some sort of data. In this example all of the objects are first created as return values of functions (function outputs). And anything that isn't indented at least once isn't actually code. So 'For AGI in general' is not a for loop, lol.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-10T22:12:33.250Z · LW(p) · GW(p)
Okay, I am convinced. I really, really appreciate you sticking with me through this and persistently finding different ways to phrase your side and then finding ways that other people have phrased it.
For reference it was the link to the paper/book that did it. The parts of it that are immediately relevant here are chapter 3 and section 4.2.1.1 (and optionally section 5.3.5). In particular, chapter 3 explicitly describes an order of operations of goal and subgoal evaluation and then the two other sections show how wireheading is discounted as a failing strategy within a system with a well-defined order of operations. Whatever problems there may be with value stability, this has helped to clear out a whole category of mistakes that I might have made.
Again, I really appreciate the effort that you put in. Thanks a load.
Replies from: Gram_Stone↑ comment by Gram_Stone · 2015-02-10T23:19:48.969Z · LW(p) · GW(p)
And thank you for sticking with me! It's really hard to stick it out when there's no such thing as an honest disagreement and disagreement is inherently disrespectful!
ETA: See the ETA in this comment to understand how my reasoning was wrong but my conclusion was correct.
↑ comment by DefectiveAlgorithm · 2015-02-08T03:36:33.543Z · LW(p) · GW(p)
A paperclip maximizer won't wirehead because it doesn't value world states in which its goals have been satisfied, it values world states that have a lot of paperclips.
In fact, taboo 'values'. A paperclip maximizer is an algorithm the output of which approximates whichever output leads to world states with the greatest expected number of paperclips. This is the template for maximizer-type AGIs in general.
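(Roughly, as an illustrative sketch with assumed helper names: the maximizer is just an argmax over actions of the expected number of paperclips, with no reward register anywhere in the loop.)

```python
# Sketch of a maximizer-type agent as described above; names are assumptions.
def choose_action(candidate_actions, world_model):
    def expected_paperclips(action):
        # world_model.predict(action) -> {possible_future_state: probability}
        futures = world_model.predict(action)
        return sum(p * world_model.count_paperclips(state) for state, p in futures.items())
    return max(candidate_actions, key=expected_paperclips)
```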
Replies from: pinyaka↑ comment by pinyaka · 2015-02-08T16:06:23.005Z · LW(p) · GW(p)
A paperclip maximizer won't wirehead because it doesn't value world states in which its goals have been satisfied, it values world states that have a lot of paperclips
I am not as confident as you that valuing worlds with lots of paperclips will continue once an AI goes from "kind of dumb AI" to "super-AI." Basically, I'm saying that all values are instrumental values and that only mashing your "value met" button is terminal. We only switched over to talking about values to avoid some confusion about reward mechanisms.
A paperclip maximizer is an algorithm the output of which approximates whichever output leads to world states with the greatest expected number of paperclips. This is the template for maximizer-type AGIs in general.
This is a definition of paperclip maximizers. Once you try to examine how the algorithm works, you'll find that there must be some part which evaluates whether the AI is meeting its goals or not. This is the thing that actually determines how the AI will act. Getting a positive response from this module is what the AI is actually going for (is my contention). The actions that configure world states will only be relevant to the AI insofar as they trigger this positive response from this module. Since we already have the ability to fully self-modify as a given in this scenario, why wouldn't the AI just optimize for positive feedback? Why continue with paperclips?
↑ comment by Houshalter · 2015-02-08T17:53:00.534Z · LW(p) · GW(p)
A reinforcement learning AI, whose only goal is to maximize some "reward" input, in and of itself, would do that. Usually the paperclip maximizer thought experiments propose an AI that has actual goals. It wants actual paperclips, not just a sensor that detects numPaperclips.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-08T20:35:43.035Z · LW(p) · GW(p)
Sure. I think if you assume that the goal is paperclip optimization after the AI has reached its "final" stable configuration, then the normal conclusions about paperclip optimizers probably hold true. The example provided dealt more with the transition from dumb-AI to smart-AI, and I'm not sure why Turry (or Clippy) wouldn't just modify their own goals to something that's easier to attain. Assuming that the goals don't change though, we're probably screwed.
Replies from: Houshalter↑ comment by Houshalter · 2015-02-08T23:03:07.416Z · LW(p) · GW(p)
Turry's and Clippy's AI architectures are unspecified, so we don't really know how they work or what they are optimizing.
I don't like your assumption that runaway reinforcement learners are safe. If it acquires the subgoal of self-preservation (you can't get more reward if you are dead), then it might still end up destroying humanity anyway (we could be a threat to it.)
Replies from: pinyaka↑ comment by pinyaka · 2015-02-08T23:49:57.549Z · LW(p) · GW(p)
I don't think they're necessarily safe. My original puzzlement was more that I don't understand why we keep holding the AI's value system constant when moving from pre-foom to post-foom. It seemed like something was being glossed over when a stupid machine goes from making paperclips to a being a god that makes paperclips. Why would a god just continue to make paperclips? If it's super intelligent, why wouldn't it figure out why it's making paperclips and extrapolate from that? I didn't have the language to ask "what's keeping the value system stable through that transition?" when I made my original comment.
Replies from: Houshalter↑ comment by Houshalter · 2015-02-09T02:22:05.320Z · LW(p) · GW(p)
It depends on the AI architecture. A reinforcement learner always has the goal of maximizing its reward signal. It never really had a different goal, there was just something in the way (e.g. a paperclip sensor.)
But there is no theoretical reason you can't have an AI that values universe-states themselves. That actually wants the universe to contain more paperclips, not merely to see lots of paperclips.
And if it did have such a goal, why would it change it? Modifying its code to make it not want paperclips would hurt its goal. It would only ever do things that help it achieve its goal, e.g. making itself smarter. So eventually you end up with a superintelligent AI that is still stuck with the narrow stupid goal of paperclips.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-10T13:32:58.527Z · LW(p) · GW(p)
But there is no theoretical reason you can't have an AI that values universe-states themselves.
How would that work? How do you have a learner that doesn't have something equivalent to a reinforcement mechanism? At the very least it seems like there has to be some part of the AI that compares the universe-state to the desired-state and that the real goal is actually to maximize the similarity of those states which means modifying the goal would be easier than modifying reality.
And if it did have such a goal, why would it change it?
Agreed. I am trying to get someone to explain how such a goal would work.
Replies from: Houshalter↑ comment by Houshalter · 2015-02-10T15:33:14.394Z · LW(p) · GW(p)
How would that work?
Well that's the quadrillion dollar question. I have no idea how to solve it.
It's certainly not impossible as humans seem to work this way. We can also do it in toy examples. E.g. a simple AI which has an internal universe it tries to optimize, and its sensors merely update the state it is in. Instead of trying to predict the reward, it tries to predict the actual universe state and selects the ones that are desirable.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-10T18:39:48.444Z · LW(p) · GW(p)
How would that [valuing universe-states themselves] work? Well that's the quadrillion dollar question. I have no idea how to solve it.
Yeah, I think this whole thread may be kind of grinding to this conclusion.
It's certainly not impossible as humans seem to work this way
Seem to perhaps, but I don't think that's actually the case. I think (as mentioned above) that we value reward signals terminally (but are mostly unaware of this preference) and nothing else. There's another guy in this thread who thinks we might not have any terminal values.
I'm not sure that I understand your toy AI. What do you mean that it has "an internal universe it tries to optimize?" Do the sensors sense the state of the internal universe? Would "internal state" work as a synonym for "internal universe" or is this internal universe a representation of an external universe? Is this AI essentially trying to develop an internal model of the external universe and selecting among possible models to try and get the most accurate representation?
Replies from: Houshalter↑ comment by Houshalter · 2015-02-10T19:42:30.670Z · LW(p) · GW(p)
I don't think that humans are pure reinforcement learners. We have all sorts of complicated values that aren't just eating and mating.
The toy AI has an internal model of the universe. In the extreme, a complete simulation of every atom and every object. Its sensors update the model, helping it get more accurate predictions/more certainty about the universe state.
Instead of a utility function that just measures some external reward signal, it has an internal utility function which somehow measures the universe model and calculates utility from it. E.g. a function which counts the number of atoms arranged in paperclip shaped objects in the simulation.
It then chooses actions that lead to the best universe states. Stuff like changing its utility function or fooling its sensors would not be chosen because it knows that doesn't lead to real paperclips.
Obviously a real universe model would be highly compressed. It would have a high level representation for paperclips rather than an atom by atom simulation.
I suspect this is how humans work. We can value external objects and universe states. People care about things that have no effect on them.
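(A minimal sketch of the toy agent described above, with placeholder names: utility is computed over the internal universe model, not over a sensor or reward signal, so actions like "edit my utility function" or "spoof my sensors" are predicted to yield model states with few real paperclips and therefore score poorly.)

```python
# Illustrative sketch only; all names are placeholders.
class ModelBasedAgent:
    def __init__(self, world_model, utility):
        self.world_model = world_model  # compressed internal model of the universe
        self.utility = utility          # e.g. counts paperclip-shaped objects in the model

    def observe(self, sense_data):
        # sensors only update the model; they are not the thing being maximized
        self.world_model.update(sense_data)

    def choose_action(self, candidate_actions):
        # rank actions by the utility of the predicted universe state
        return max(candidate_actions,
                   key=lambda a: self.utility(self.world_model.predict(a)))
```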
Replies from: pinyaka↑ comment by pinyaka · 2015-02-10T22:19:54.907Z · LW(p) · GW(p)
I don't think that humans are pure reinforcement learners. We have all sorts of complicated values that aren't just eating and mating.
We may not be pure reinforcement learners, but the presence of values other than eating and mating isn't proof of that. Quite the contrary: it demonstrates either that we have a lot of different, occasionally contradictory values hardwired, or that we have some other system that's creating value systems. From an evolutionary standpoint, reward systems that are good at replicating genes get to survive, but they don't have to be free of other side effects (until given long enough with a finite resource pool, maybe). Pure, rational reward-seeking is almost certainly selected against because it doesn't leave any room for replication. It seems more likely that we have a reward system that is accompanied by some circuits that make it fire for a few specific sensory cues (orgasms, insulin spikes, receiving social deference, etc.).
The toy AI has an internal model of the universe, it has an internal utility function which somehow measures the universe model and calculates utility from it....[toy AI is actually paperclip optimizer]...Stuff like changing its utility function or fooling its sensors would not be chosen because it knows that doesn't lead to real paperclips.
I think we've been here before ;-)
Thanks for trying to help me understand this. Gram_Stone linked a paper that explains why the class of problems I'm describing isn't really a problem.
Replies from: Houshalter↑ comment by Houshalter · 2015-02-12T15:39:17.030Z · LW(p) · GW(p)
But that's the thing. There is no sensory input for "social deference". It has to be inferred from an internal model of the world, itself inferred from sensory data.
Reinforcement learning works fine when you have a simple reward signal you want to maximize. You can't use it for social instincts or morality, or anything you can't just build a simple sensor to detect.
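For concreteness, here is roughly what "a simple reward signal" means in practice: a tiny tabular Q-learner (a standard reinforcement learning setup, sketched here with a made-up one-state environment) whose entire notion of value is the single scalar returned by a sensor.

```python
import random

# Sketch of reinforcement learning driven by a simple reward signal. The
# environment and reward sensor are invented for illustration; the point is
# that the learner never sees anything about "what is good" except the scalar r.

ACTIONS = ["left", "right"]
Q = {a: 0.0 for a in ACTIONS}          # action values for the single state
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration

def reward_sensor(action):
    # The learner's entire value system: one sensor emitting one number.
    return 1.0 if action == "right" else 0.0

for _ in range(2000):
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[a])
    r = reward_sensor(action)
    # Standard Q-learning update; everything hinges on the scalar r.
    Q[action] += alpha * (r + gamma * max(Q.values()) - Q[action])

print(max(ACTIONS, key=lambda a: Q[a]))  # -> "right"
```

Anything that can't be compressed into that one sensor reading can't be the thing this kind of learner is directly optimizing, which is the limitation being pointed at here.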
Replies from: pinyaka↑ comment by pinyaka · 2015-02-13T15:07:30.850Z · LW(p) · GW(p)
But that's the thing. There is no sensory input for "social deference". It has to be inferred from an internal model of the world, itself inferred from sensory data...Reinforcement learning works fine when you have a simple reward signal you want to maximize. You can't use it for social instincts or morality, or anything you can't just build a simple sensor to detect.
Why does it only work on simple signals? Why can't the result of inference work for reinforcement learning?
↑ comment by Ishaan · 2015-02-06T03:06:34.507Z · LW(p) · GW(p)
If I were suddenly gifted with the power to modify my hardware and the environment however I want, I wouldn't suddenly optimize for consumption of ice cream because I [have] the intelligence to know that my enjoyment of ice cream consumption comes entirely from my reward circuit.
In this scenario, more sophisticated processes arise out of less sophisticated processes, which creates some unpredictability.
Even though your mind arises from an algorithm which can be roughly described as "rewarding" modifications which lead to the spreading of your genes, and you are fully aware of that, do you care about the spreading of your genes per se? As it turns out, humans end up caring about a lot of other stuff which is tangentially related to spreading and preserving life, but we don't literally care about genes.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-06T03:21:26.640Z · LW(p) · GW(p)
I agree with basically everything you say here. I don't understand if this is meant to refute or confirm the point you're responding to. Genes, which have a sort of unconscious "function" of replicating, lost focus on that "goal" almost as soon as they developed algorithms that have sub-goals. By the time you develop nervous systems, you end up with goals that are decoupled from the original reproductive goal, such that organisms can experience chemical satisfactions without the need to reproduce. By the time you get to human-level intelligence, you have organisms that actively work out strategies to directly oppose reproductive urges because they interfere with other goals developed after the introduction of intelligence. What I'm asking is: why would an ASI keep the original goals that we give it before it became an ASI?
Replies from: Ishaan, Ishaan↑ comment by Ishaan · 2015-02-06T03:44:21.138Z · LW(p) · GW(p)
I just noticed you addressed this earlier in the thread:
Regardless of how I ended up, I wouldn't leave my reward center wired to eating, sex or many of the other basic functions that my evolutionary program has left me really wanting to do.
and want to counterpoint that you just arbitrarily chose to focus on instrumental values. The things you terminally value and would not desire to self-modify, which presumably include morality and so on, were decided by evolution just like the food and sex.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-06T04:02:59.166Z · LW(p) · GW(p)
I guess I don't really believe that I have other terminal values.
Replies from: Ishaan↑ comment by Ishaan · 2015-02-06T04:13:43.125Z · LW(p) · GW(p)
You wouldn't consider the cluster of things which typically fall under morality to be terminal values, which you care about irrespective of your internal mental state?
Replies from: pinyaka↑ comment by pinyaka · 2015-02-06T14:21:43.951Z · LW(p) · GW(p)
I don't consider morality to be a terminal value. I would point out that even a value that I have that I can't give up right now wouldn't necessarily be terminal if I had the ability to directly modify the components of my mind. Such values are unalterable because I am not able to physically manipulate the hardware, not because I wouldn't alter them if I could (and saw a reason to).
Replies from: Lumifer↑ comment by Lumifer · 2015-02-06T15:42:44.943Z · LW(p) · GW(p)
I don't consider morality to be a terminal value.
That implies that you would do anything at all (baby-mulching machines, nuke the world, etc.) for sufficient stimulation of your pleasure center.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-06T15:48:20.281Z · LW(p) · GW(p)
Well, the pleasure center and the reward center are different things, but I take your meaning. I think that I could be conditioned to build a baby-mulching machine or a doomsday device. Why not? Other people have done it. Why would I assume that I'm that different from them?
EDIT TO ADD: Even if I have a value that I can't escape currently (like not killing people), that's not to say that I wouldn't alter it for some reason if I had the ability to physically modify the parts of my brain that hold my values.
Replies from: Lumifer↑ comment by Lumifer · 2015-02-06T15:56:23.343Z · LW(p) · GW(p)
I think that I could be conditioned
My statement is stronger. If in your current state you don't have any terminal moral values, then in your current state you would voluntarily agree to operate baby-mulching machines in exchange for the right amount of neural stimulation.
Now, I don't happen to think this is true (because some "moral values" are biologically hardwired into humans), but this is a consequence of your position.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-06T16:05:28.847Z · LW(p) · GW(p)
Again, you've pulled a statement out of the context of a discussion about the behavior of a self-modifying AI. So, fine. In my current condition I wouldn't build a baby mulcher. That doesn't mean that I might not build a baby mulcher if I had the ability to change my values. You might as well say that I terminally value not flying when I flap my arms. The thing you're discussing just isn't physically allowed. People terminally value only what they're doing at any given moment because the laws of physics say that they have no choice.
Replies from: Lumifer↑ comment by Lumifer · 2015-02-06T16:13:31.024Z · LW(p) · GW(p)
I think you're confusing "terminal" and "immutable". Terminal values can and do change.
In my current condition I wouldn't build a baby mulcher
And why is that? Do you, perchance, have some terminal moral value which disapproves?
People terminally value only what they're doing at any given moment because the laws of physics say that they have no choice.
Huh? That makes no sense. How do you define "terminal value"?
Replies from: pinyaka↑ comment by pinyaka · 2015-02-06T16:32:20.633Z · LW(p) · GW(p)
As far as I know, terminal values are things that are valuable in and of themselves. I don't consider not building baby-mulchers to be valuable in and of itself. There may be some scenario in which building baby-mulchers is more valuable to me than not, and in that scenario I would build one. Likewise with doomsday devices. It's difficult to predict what that scenario would look like, but given that other humans have built them, I assume that I would too. In those circumstances, if I could turn off the parts of my brain that make me squeamish about doing that, I certainly would. I don't think that not doing horrible things is valuable in and of itself; it's just a way of avoiding feeling horrible. If I could avoid feeling horrible and found value in doing horrible things, then I would probably do them.
People terminally value only what they're doing at any given moment because the laws of physics say that they have no choice.
Huh? That makes no sense. How do you define "terminal value"?
In the statement that you were responding to, I was defining it the way you seemed to when you said that "some "moral values" are biologically hardwired into humans." You were saying that given the current state of their hardware, their inability to do something different makes the value terminal. This is analogous to saying that given the current state of the universe, whatever a person is doing at any given moment is a terminal value because of their inability to do something different.
Replies from: Lumifer↑ comment by Lumifer · 2015-02-06T16:44:14.691Z · LW(p) · GW(p)
I don't think that not doing horrible things is valuable in and of itself; it's just a way of avoiding feeling horrible.
OK. I appreciate you biting the bullet.
You were saying that given the current state of their hardware, their inability to do something different makes the value terminal.
No, that is NOT what I am saying. "Biologically hardwired" basically means you are born with these values and while overcoming them is possible, it will take extra effort. It certainly does not mean that you have no choice. Humans do something other than what their biologically hardwired terminal values tell them on a very regular basis. One reason for this is that values are many and they tend not to be consistent.
Replies from: pinyaka↑ comment by Ishaan · 2015-02-06T03:29:14.428Z · LW(p) · GW(p)
I might have misunderstood your question. Let me restate how I understood it: In the original post you said...
I would optimize myself to maximize my reward, not whatever current behavior triggers the reward.
I intended to give a counterexample: Here is humanity, and we're optimizing behaviors which once triggered the original rewarded action (replication) rather than the rewarded action itself.
We didn't end up "short circuiting" into directly fulfilling the reward, as you had described. We care about the "current behaviors that trigger the reward", such as not hurting each other and so on - in other words, we did precisely what you said you wouldn't do.
(Also, sorry, I tried to ninja edit everything into a much more concise statement, so the parent comment is different from what you originally saw. The conversation as a whole still makes sense though.)
Replies from: pinyaka↑ comment by pinyaka · 2015-02-06T04:08:56.325Z · LW(p) · GW(p)
We don't have the ability to directly fulfill the reward center. I think narcotics are the closest we've got now, and lots of people try to mash that button to the detriment of everything else. I just think it's a kind of crude button and it doesn't work as well as the direct ability to fully understand and control your own brain.
Replies from: Ishaan↑ comment by Ishaan · 2015-02-06T04:20:27.119Z · LW(p) · GW(p)
I think you may have misunderstood me - there's a distinction between what evolution rewards and what humans find rewarding. (This is getting hard to talk about because we're using "reward" to describe both the process used to steer a self-modifying intelligence in the first place and one of the processes that implements our human intelligence and motivations, and those are two very different things.)
The "rewarded behavior" selected by the original algorithm was directly tied to replication and survival.
Drug-stimulated reward centers fall in the "current behaviors that trigger the reward" category, not the original reward. Even when we self-stimulate our reward centers, the thing that we are stimulating isn't the thing that evolution directly "rewards".
Directly fulfilling the originally incentivized behavior isn't about food and sex - a direct way might, for example, be to insert human genomes into rapidly dividing, tough organisms and create tons and tons of them and send them to every planet they can survive on.
Similarly, an intelligence which arises out of a process set up to incentivize a certain set of behaviors will not necessarily target those incentives directly. It might go on to optimize completely unrelated things that only coincidentally target those values. That's the whole concern.
If an intelligence arises due to a process which creates things that cause us to press a big red "reward" button, the thing that eventually arises won't necessarily care about the reward button, won't necessarily care about the effects of the reward button on its processes, and indeed might completely disregard the reward button and all its downstream effects altogether... in the same way we don't terminally value spreading our genome at all.
Our neurological reward centers are a second layer of sophisticated incentivizing which emerged from the underlying process of incentivizing fitness.
Replies from: pinyaka↑ comment by pinyaka · 2015-02-06T14:39:38.282Z · LW(p) · GW(p)
I think I understood you. What do you think I misunderstood?
Maybe we should quit saying that evolution rewards anything at all. Replication isn't a reward, it's just a byproduct of a non-intelligent process. There was never an "incentive" to reproduce, any more than there is an "incentive" for any physical process. High-pressure air moves to low-pressure regions, not because there's an incentive, but because that's just how physics works. At some point, this non-sentient process accidentally invented a reward system, and replication, which is a byproduct and not a goal, continued to be a byproduct and not a goal. Of course reward systems that maximized duplication of genes and gene carriers flourished, but today, when we have the ability to directly duplicate genes, we don't do it, because we were never actually rewarded for that kind of behavior and we generally don't care too much about duplicating our genes except as it's tied to actually rewarded stuff like sex, having children, etc.
↑ comment by Richard_Kennaway · 2015-02-06T12:05:32.483Z · LW(p) · GW(p)
Can you give at least the author and title?
Replies from: adamzerner↑ comment by Adam Zerner (adamzerner) · 2015-02-06T15:37:15.774Z · LW(p) · GW(p)
The story about Tully is in the Wait But Why article I linked to.
comment by [deleted] · 2015-02-05T02:09:05.995Z · LW(p) · GW(p)
Thanks for sharing, this was awesome.