Ruby's recent post on how Plans are Recursive & Why This is Important has strongly shaped my thinking. I think it is a fact that deserves more attention, both when thinking about our own lives as well as from an AI alignment standpoint. The AI alignment problem seems to be a problem inherent to seeking out sufficiently complex goals.
Any sufficiently complex goal, in order to be achieved, needs to be broken down recursively into sub-goals until you have executable actions. E.g. “becoming rich” is not an executable action, and neither is “leaving the maximum copies of your genes.” No one ever *does* those things. The same goes for values: what does it mean to value art, to value meaning, to value life? Whatever it is that you want, you need to translate it into muscle movements and/or lines of code in order to get anywhere close to achieving it. This is an image Ruby used in that post:
Now, imagine that you have a really complex goal in that upper node. Say you want to "create the best civilization possible" or "maximize happiness in the universe." Or you just have a less complex goal -- "having a good and meaningful life" -- and you're in a really complex environment, like the modern capitalist economy. A tree representing that goal and the sub-goals and actions it needs to be decomposed into would be very tall. I can't make my own image because I am on a borrowed computer, but imagine the tree above with like, 20 levels of nodes.
Now, our cognitive resources are not infinite, and they don't scale with the complexity of the environment. You wouldn't be able to store that entire tree in your mind at all times and always keep track of how your every action is serving your one ultimate terminal goal. So in order to act at all, in order to get anywhere, you need to forget a considerable part of your super-goals, and focus on the more achievable sub-goals -- you need to "zoom in."
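To get a feel for why the whole tree will not fit in working memory, here is a minimal Python sketch. The `Goal` class, the example sub-goals, and the branching factor of 2 are all made up for illustration; Ruby's post does not specify any of them.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Goal:
    """A node in a (hypothetical) goal tree."""
    name: str
    subgoals: list[Goal] = field(default_factory=list)

    def is_executable(self) -> bool:
        # A node with no sub-goals is treated as an executable action.
        return not self.subgoals


def executable_actions(goal: Goal) -> list[str]:
    """Recursively collect the leaves, i.e. the things one can actually *do*."""
    if goal.is_executable():
        return [goal.name]
    actions: list[str] = []
    for sub in goal.subgoals:
        actions.extend(executable_actions(sub))
    return actions


def full_tree_size(branching: int, levels: int) -> int:
    """Number of nodes in a tree where every non-leaf node has `branching`
    children and there are `levels` levels of nodes (root included)."""
    return sum(branching ** level for level in range(levels))


# Illustrative tree only; the sub-goals are invented for this example.
become_rich = Goal("become rich", [
    Goal("get a well-paid job", [
        Goal("write a resume"),
        Goal("send an application"),
    ]),
    Goal("save money", [
        Goal("set up an automatic transfer"),
    ]),
])

actions = executable_actions(become_rich)
nodes_in_tall_tree = full_tree_size(branching=2, levels=20)
```

Under these assumptions, even a tree with only two sub-goals per node has roughly a million nodes once it is 20 levels tall, which is the intuition behind needing to "zoom in."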
Any agent that seeks X as an instrumental goal, with, say, Y as a terminal goal, can easily be outcompeted by an agent that seeks X as a terminal goal. If you’re thinking of Y all the time, you’re simply not going to do the best you can to get X. Someone who sees becoming a lawyer as their terminal goal, someone who is intrinsically motivated by it, will probably do much better at becoming a lawyer than someone who sees it as merely a step towards something else. That is analogous to how an AI trained to do X will outcompete an AI trained to do X, *plus* value human lives and meaning and everything.
Importantly, this happens in humans at a very visceral level. Motivation, desire, wanting, are not infinite resources. If there is something that is, theoretically, your one true value/desire/goal, you're not necessarily going to feel any motivation at all to pursue it if you have lower-down sub-goals to achieve first, even if what originally made you select those sub-goals was that they brought you closer to that super-goal.
That may be why sometimes we find ourselves unsure about what we want in life. That also may be why we disagree on what values should guide society. Our motivation needs to be directed at something concrete and actionable in order for us to get anywhere at all.
So the crux of the issue is that we need to manage to 1) determine the sequence of sub-goals and executable actions that will lead to our terminal goal being achieved, and 2) make sure that those sub-goals and executable actions remain aligned with the terminal goal.
There are many examples of that going wrong. Evolution “wants” us to “leave the maximum copies of our genes.” The executable actions that that comes down to are things like “being attracted to specific features in the opposite sex that in the environment of evolutionary adaptedness correlated with leaving the maximum copies of your genes” and “having sex.” Nowadays, of course, having sex doesn’t lead to spreading genes, so evolution is kind of failing at the human alignment problem.
Another example would be people who work their entire lives to succeed at a specific prestigious profession and get a lot of money, but when they do, they end up not being entirely sure of why they wanted that in the first place, and find themselves unhappy.
You can see humans maximizing for e.g. pictures of feminine curves on Instagram as kind of like an AI maximizing paperclips. Some people think of the paperclip maximizer thought experiment as weird or arbitrary, but it really isn't any different from what we already do as humans. From what we have to do.
What AI does, in my view, is massively scale and exacerbate that already-existing issue. AI maximizes efficiency, maximizes our capacity to get what we want, and because of that, specifying what we want becomes the trickiest issue.
Goodhart’s law says that “When a measure becomes a target, it ceases to be a good measure.” But every action we take is a movement towards a target! And complex goals are not going to do as targets to be directly aimed at without the extensive use of proxies. The role of human language is largely to impress other humans. So when we say that we value happiness, meaning, a great civilization, or whatever, that sounds impressive and cool to other humans, but it says very little about what muscle movements need to be made or what lines of code need to be written.
Any agent that seeks X as an instrumental goal, with, say, Y as a terminal goal, can easily be outcompeted by an agent that seeks X as a terminal goal.
You offered a lot of arguments for why this is true for humans, but I'm less certain this is true for AIs.
Suppose the first AI devotes 100% of its computation to achieving X, and the second AI devotes 90% of its computation to achieving X and 10% of its computation to monitoring that achieving X is still helpful for achieving Y. All else equal, the first AI is more likely to win. But it's not necessarily true that all else is equal. For example, if the second AI possessed 20% more computational resources than the first AI, I'd expect the second AI to win even though it only seeks X as an instrumental goal.
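The arithmetic behind that example can be sketched in a few lines of Python. This assumes, purely for illustration, that performance at X scales linearly with the computation actually spent on X; the function names are hypothetical and nothing here is a claim about how real systems scale.

```python
def compute_on_x(total_compute: float, share_on_x: float) -> float:
    """Computation effectively spent on X (assumes linear scaling)."""
    return total_compute * share_on_x


def break_even_multiplier(share_on_x: float) -> float:
    """How many times more total compute the monitoring agent needs
    in order to match an agent that spends 100% of 1.0 units on X."""
    return 1.0 / share_on_x


# First AI: baseline resources, everything goes to X.
first_ai = compute_on_x(total_compute=1.0, share_on_x=1.0)

# Second AI with equal resources: 10% goes to checking that X still serves Y.
second_ai_equal = compute_on_x(total_compute=1.0, share_on_x=0.9)

# Second AI with 20% more resources, same 90/10 split.
second_ai_bigger = compute_on_x(total_compute=1.2, share_on_x=0.9)

needed = break_even_multiplier(share_on_x=0.9)
```

With equal resources the second AI spends less on X than the first, but with 20% more resources it spends more (about 1.08 units versus 1.0). Under the linear assumption the break-even point is roughly 11% extra resources.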
Thank you for the correction. Thinking about it, I think that is true even of humans, in a certain sense. I would guess that the ability to hold several goal-nodes in one’s mind would scale with g and/or working memory capacity. Someone who is very smart and has tolerance for ambiguity would be able to aim for a very complex goal while simultaneously maintaining a great performance in the day-to-day mundane tasks they need to accomplish which might have seemingly no resemblance to the original goal at all.
So, both in humans and computers, I would guess this is an ability that requires certain cognitive or computational resources. So I maintain my original claim granted that those resources are controlled for.
Additionally, there may exist sets of goals such that pursuing them together makes one more likely to achieve all of them than pursuing any one (or any subset less than the whole) alone. (To put it a different way, it is possible to work on different things that give you ideas for each other, ideas that you wouldn't have had if you had been working on only one or a subset of them.)
The AI alignment problem seems to be a problem inherent to seeking out sufficiently complex goals.
I would add that the alignment problem can still occur for simple goals. In fact, I don't think I can come up with a "goal" simple enough that I could specify it on an advanced artificial intelligence without mistake, even in principle. Of course, this might just be a limitation of my imagination.
The alignment problem really occurs whenever one agent can't see all possible consequences of their actions. Given our extremely limited minds in this universe, the problem ends up popping up everywhere.
Glad to learn my post was helpful! I don't have time to engage more at the moment, but this post seems relevant to the topic: Dark Arts of Rationality.
Consider, for example, a young woman who wants to be a rockstar. She wants the fame, the money, and the lifestyle: these are her "terminal goals". She lives in some strange world where rockstardom is wholly dependent upon merit (rather than social luck and network effects), and decides that in order to become a rockstar she has to produce really good music.
But here's the problem: She's a human. Her conscious decisions don't directly affect her motivation.
In her case, it turns out that she can make better music when "Make Good Music" is a terminal goal as opposed to an instrumental goal.
When "Make Good Music" is an instrumental goal, she schedules practice time on a sitar and grinds out the hours. But she doesn't really like it, so she cuts corners whenever akrasia comes knocking. She lacks inspiration and spends her spare hours dreaming of stardom. Her songs are shallow and trite.
When "Make Good Music" is a terminal goal, music pours forth, and she spends every spare hour playing her sitar: not because she knows that she "should" practice, but because you couldn't pry her sitar from her cold dead fingers. She's not "practicing", she's pouring out her soul, and no power in the 'verse can stop her. Her songs are emotional, deep, and moving.
It's obvious that she should adopt a new terminal goal.
The line between executable actions and plans is presented as being quite clear-cut, and the statement "nobody *does* getting rich" as a claim of fact. Similar logic could be employed to argue that "nobody ever grabs a glass, they only apply pressure with their fingers." Or, in reverse, "there is nobody on the planet whose mind perceives 'getting rich' as a single executable action, nor could there ever (reasonably) be." I could imagine that there are people for whom "hedgefunding" is a basic action that doesn't require esoteric aiming, even though it is not typical for it to be so. And "getting rich" is not that far from that.
Kind of like how magic as "a mechanism you know how to use but don't know how it works", i.e. "sufficiently unexplained technology", is a concept in the eye of the beholder, so too is the division between plans and actions. That is, "sufficiently non-practical actions" are plans. Was this part of what Yoda was trying to get at with "don't try, do"?
I think sex is still a functional part of human thriving. That it doesn't do insemination doesn't stop it from doing all of its social/community bonding. If you use a hammer as a stepping stone in order to reach high places, you are not failing to use a hammer to impact nails, you are successfully using a hammer to build a house. "Having sex doesn't lead to spreading genes" also seems like a claim of fact. Well, what about keeping your marriage intact with your test-tube children's biological mother? I could see how celibacy could throw a serious wrench in that. If we also keep in mind that worker ants further their genes' agenda without having direct offspring, can we truly rule out similar effects for, for example, homosexual couples?
In a way evolution only cares about the goal of "thrive", and pushing for it really can't go wrong. But in pushing for it, it is often important to be extremely flexible, possibly to the point of anti-alignment with the sub-goals. Repurpose limbs? Repurpose molecules? Alignment would be dangerous. Also, in the conflict between asexual and sexual reproduction, having strong alignment is the *downside* of asexual reproduction.
I read this also as a magic color analog that argues for white over at least black and green: "in order to get anywhere we need to erect a system to get there". Green can answer "there is power in diversity, having all your eggs in the same basket is a recipe for stagnation." Red answers "Only by moving your brush against the canvas of life will you ever see a picture and even if you could see why would you then bother painting it?" Black would complain about unnecessary commitments and "leaving money on the table". Blue can answer "If you now commit to optimise candy you will never come to appreciate the possibility of lobster dinners."