Counterarguments to the basic AI x-risk case

post by KatjaGrace · 2022-10-14T13:00:05.903Z

Contents

    I. If superhuman AI systems are built, any given system is likely to be ‘goal-directed’
    II. If goal-directed superhuman AI systems are built, their desired outcomes will probably be about as bad as an empty universe by human lights 
    III. If most goal-directed superhuman AI systems have bad goals, the future will very likely be bad
  Counterarguments
    A. Contra “superhuman AI systems will be ‘goal-directed’”
      Different calls to ‘goal-directedness’ don’t necessarily mean the same concept
      Ambiguously strong forces for goal-directedness need to meet an ambiguously high bar to cause a risk
    B. Contra “goal-directed AI systems’ goals will be bad”
      Small differences in utility functions may not be catastrophic
      Differences between AI and human values may be small 
      Maybe value isn’t fragile
      Short-term goals
    C. Contra “superhuman AI would be sufficiently superior to humans to overpower humanity”
      Human success isn’t from individual intelligence
      AI agents may not be radically superior to combinations of humans and non-agentic machines
      Trust
      Headroom
      Intelligence may not be an overwhelming advantage
      Unclear that many goals realistically incentivise taking over the universe
      Quantity of new cognitive labor is an empirical question, not addressed
      Speed of intelligence growth is ambiguous
      Key concepts are vague
    D. Contra the whole argument
      The argument overall proves too much about corporations
      I. Any given corporation is likely to be ‘goal-directed’
      II. If goal-directed superhuman corporations are built, their desired outcomes will probably be about as bad as an empty universe by human lights
      III. If most goal-directed corporations have bad goals, the future will very likely be bad
  Conclusion

(Crossposted from AI Impacts Blog)

This is going to be a list of holes I see in the basic argument for existential risk from superhuman AI systems.1

To start, here’s an outline of what I take to be the basic case2:

I. If superhuman AI systems are built, any given system is likely to be ‘goal-directed’

Reasons to expect this:

  1. Goal-directed behavior is likely to be valuable, e.g. economically. 
  2. Goal-directed entities may tend to arise from machine learning training processes not intending to create them (at least via the methods that are likely to be used).
  3. ‘Coherence arguments’ may imply that systems with some goal-directedness will become more strongly goal-directed over time.

II. If goal-directed superhuman AI systems are built, their desired outcomes will probably be about as bad as an empty universe by human lights 

Reasons to expect this:

  1. Finding useful goals that aren’t extinction-level bad appears to be hard: we don’t have a way to usefully point at human goals, and divergences from human goals seem likely to produce goals that are in intense conflict with human goals, due to a) most goals producing convergent incentives for controlling everything, and b) value being ‘fragile’, such that an entity with ‘similar’ values will generally create a future of virtually no value.
  2. Finding goals that are extinction-level bad and temporarily useful appears to be easy: for example, advanced AI with the sole objective ‘maximize company revenue’ might profit said company for a time before gathering the influence and wherewithal to pursue the goal in ways that blatantly harm society.
  3. Even if humanity found acceptable goals, giving a powerful AI system any specific goals appears to be hard. We don’t know of any procedure to do it, and we have theoretical reasons to expect that AI systems produced through machine learning training will generally end up with goals other than those they were trained according to. The aberrant goals that result are probably extinction-level bad, for reasons described in II.1 above.

III. If most goal-directed superhuman AI systems have bad goals, the future will very likely be bad

That is, a set of ill-motivated goal-directed superhuman AI systems, of a scale likely to occur, would be capable of taking control over the future from humans. This is supported by at least one of the following being true:

  1. Superhuman AI would destroy humanity rapidly. This may be via ultra-powerful capabilities at e.g. technology design and strategic scheming, or through gaining such powers in an ‘intelligence explosion’ (self-improvement cycle). Either of those things may happen either through exceptional heights of intelligence being reached or through highly destructive ideas being available to minds only mildly beyond our own.
  2. Superhuman AI would gradually come to control the future via accruing power and resources. Power and resources would be more available to the AI system(s) than to humans on average, because of the AI having far greater intelligence.

Below is a list of gaps in the above, as I see it, and counterarguments. A ‘gap’ is not necessarily unfillable, and may have been filled in any of the countless writings on this topic that I haven’t read. I might even think that a given one can probably be filled. I just don’t know what goes in it.  

This blog post is an attempt to run various arguments by you all on the way to making pages on AI Impacts about arguments for AI risk and corresponding counterarguments. At some point in that process I hope to also read others’ arguments, but this is not that day. So what you have here is a bunch of arguments that occur to me, not an exhaustive literature review. 

Counterarguments

A. Contra “superhuman AI systems will be ‘goal-directed’”

Different calls to ‘goal-directedness’ don’t necessarily mean the same concept

‘Goal-directedness’ is a vague concept. It is unclear that the ‘goal-directednesses’ that are favored by economic pressure, training dynamics or coherence arguments (the component arguments in part I of the argument above) are the same ‘goal-directedness’ that implies a zealous drive to control the universe (i.e. that makes most possible goals very bad, fulfilling II above). 

One well-defined concept of goal-directedness is ‘utility maximization’: always doing what maximizes a particular utility function, given a particular set of beliefs about the world. 

Utility maximization does seem to quickly engender an interest in controlling literally everything, at least for many utility functions one might have3. If you want things to go a certain way, then you have reason to control anything which gives you any leverage over that, i.e. potentially all resources in the universe (i.e. agents have ‘convergent instrumental goals’). This is in serious conflict with anyone else with resource-sensitive goals, even if prima facie those goals didn’t look particularly opposed. For instance, a person who wants all things to be red and another person who wants all things to be cubes may not seem to be at odds, given that all things could be red cubes. However if these projects might each fail for lack of energy, then they are probably at odds. 

Thus utility maximization is a notion of goal-directedness that allows Part II of the argument to work, by making a large class of goals deadly.

You might think that any other concept of ‘goal-directedness’ would also lead to this zealotry. If one is inclined toward outcome O in any plausible sense, then does one not have an interest in anything that might help procure O? No: if a system is not a ‘coherent’ agent, then it can have a tendency to bring about O in a range of circumstances, without this implying that it will take any given effective opportunity to pursue O. This assumption of consistent adherence to a particular evaluation of everything is part of utility maximization, not a law of physical systems. Call machines that push toward particular goals but are not utility maximizers pseudo-agents. 

Can pseudo-agents exist? Yes—utility maximization is computationally intractable, so any physically existent ‘goal-directed’ entity is going to be a pseudo-agent. We are all pseudo-agents, at best. But it seems something like a spectrum. At one end is a thermostat, then maybe a thermostat with a better algorithm for adjusting the heat. Then maybe a thermostat which intelligently controls the windows. After a lot of honing, you might have a system much more like a utility-maximizer: a system that deftly seeks out and seizes well-priced opportunities to make your room 68 degrees—upgrading your house, buying R&D, influencing your culture, building a vast mining empire. Humans might not be very far on this spectrum, but they seem enough like utility-maximizers already to be alarming. (And it might not be best thought of as a one-dimensional spectrum—for instance, perhaps ‘tendency to modify oneself to become more coherent’ is a fairly different axis from ‘consistency of evaluations of options and outcomes’, and calling both ‘more agentic’ is obscuring.)

Nonetheless, it seems plausible that there is a large space of systems which strongly increase the chance of some desirable objective O occurring without even acting as much like maximizers of an identifiable utility function as humans would. For instance, without searching out novel ways of making O occur, or modifying themselves to be more consistently O-maximizing. Call these ‘weak pseudo-agents’. 

For example, I can imagine a system constructed out of a huge number of ‘IF X THEN Y’ statements (reflexive responses), like ‘if body is in hallway, move North’, ‘if hands are by legs and body is in kitchen, raise hands to waist’…, equivalent to a kind of vector field of motions, such that for every particular state, there are directions that all the parts of you should be moving. I could imagine this being designed to fairly consistently cause O to happen within some context. However since such behavior would not be produced by a process optimizing O, you shouldn’t expect it to find new and strange routes to O, or to seek O reliably in novel circumstances. There appears to be zero pressure for this thing to become more coherent, unless its design already involves reflexes to move its thoughts in certain ways that lead it to change itself. I expect you could build a system like this that reliably runs around and tidies your house, say, or runs your social media presence, without it containing any impetus to become a more coherent agent (because it doesn’t have any reflexes that lead to pondering self-improvement in this way).
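
For concreteness, here is a minimal sketch (my own, in Python) of the kind of reflex-based weak pseudo-agent described above. The specific rules and state fields are invented for illustration; the point is just that behavior is a fixed lookup over situations, with no search for novel routes to the goal and no machinery for self-modification.

```python
# A minimal sketch of a reflex-based 'weak pseudo-agent' (illustrative only).
# Behavior is a fixed list of 'IF X THEN Y' rules: it can fairly reliably push
# toward an outcome within familiar situations, but nothing in it searches for
# new routes to that outcome or modifies the rules to become more coherent.

RULES = [
    (lambda s: s["location"] == "hallway",
     "move north"),
    (lambda s: s["location"] == "kitchen" and s["hands"] == "by legs",
     "raise hands to waist"),
    (lambda s: s["location"] == "living room" and s["floor"] == "cluttered",
     "pick up nearest object"),
]

def act(state):
    """Return the first matching reflex for this state, like a vector field over states."""
    for condition, action in RULES:
        if condition(state):
            return action
    return "do nothing"

print(act({"location": "hallway", "hands": "by legs", "floor": "clear"}))  # move north
```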

It is not clear that economic incentives generally favor the far end of this spectrum over weak pseudo-agency. There are incentives toward systems being more like utility maximizers, but also incentives against. 

The reason any kind of ‘goal-directedness’ is incentivised in AI systems is that then the system can be given an objective by someone hoping to use their cognitive labor, and the system will make that objective happen. Whereas a similar non-agentic AI system might still do almost the same cognitive labor, but require an agent (such as a person) to look at the objective and decide what should be done to achieve it, then ask the system for that. Goal-directedness means automating this high-level strategizing. 

Weak pseudo-agency fulfills this purpose to some extent, but not as well as utility maximization. However if we think that utility maximization is difficult to wield without great destruction, then that suggests a disincentive to creating systems with behavior closer to utility-maximization. Not just from the world being destroyed, but from the same dynamic causing more minor divergences from expectations, if the user can’t specify their own utility function well. 

That is, if it is true that utility maximization tends to lead to very bad outcomes relative to any slightly different goals (in the absence of great advances in the field of AI alignment), then the most economically favored level of goal-directedness seems unlikely to be as far as possible toward utility maximization. More likely it is a level of pseudo-agency that achieves a lot of the users’ desires without bringing about sufficiently detrimental side effects to make it not worthwhile. (This is likely more agency than is socially optimal, since some of the side-effects will be harms to others, but there seems no reason to think that it is a very high degree of agency.)

Some minor but perhaps illustrative evidence: anecdotally, people prefer interacting with others who predictably carry out their roles or adhere to deontological constraints, rather than with consequentialists in pursuit of broadly good but somewhat unknown goals. For instance, employers would often prefer employees who predictably follow rules to ones who try to forward company success in unforeseen ways.

The other arguments to expect goal-directed systems mentioned above seem more likely to suggest approximate utility-maximization rather than some other form of goal-directedness, but it isn’t that clear to me. I don’t know what kind of entity is most naturally produced by contemporary ML training. Perhaps someone else does. I would guess that it’s more like the reflex-based agent described above, at least at present. But present systems aren’t the concern.

Coherence arguments are arguments for being coherent a.k.a. maximizing a utility function, so one might think that they imply a force for utility maximization in particular. That seems broadly right. Though note that these are arguments that there is some pressure for the system to modify itself to become more coherent. What actually results from specific systems modifying themselves seems like it might have details not foreseen in an abstract argument merely suggesting that the status quo is suboptimal whenever it is not coherent. Starting from a state of arbitrary incoherence and moving iteratively in one of many pro-coherence directions produced by whatever whacky mind you currently have isn’t obviously guaranteed to increasingly approximate maximization of some sensical utility function. For instance, take an entity with a cycle of preferences, apples > bananas = oranges > pears > apples. The entity notices that it sometimes treats oranges as better than pears and sometimes worse. It tries to correct by adjusting the value of oranges to be the same as pears. The new utility function is exactly as incoherent as the old one. Probably moves like this are rarer than ones that make you more coherent in this situation, but I don’t know, and I also don’t know if this is a great model of the situation for incoherent systems that could become more coherent.
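
To make the example concrete, here is a toy check (my own sketch, in Python) of whether a set of pairwise preferences could be represented by any utility function: a necessary condition is that no item is ‘at least as good as’ itself via a chain of steps that includes a strictly-better step. The local ‘fix’ described above removes one inconsistency but leaves such a cycle in place.

```python
# Toy coherence check (illustrative): preferences are representable by a utility
# function only if no item is 'at least as good as' itself via a chain of steps
# that includes a strictly-better step. The 'fix' below leaves such a cycle intact.

def strict_cycle_exists(weak_edges, strict_edges):
    # Edge (a, b) means 'a is at least as good as b'; strict edges mean 'strictly better'.
    edges = [(a, b, False) for a, b in weak_edges] + [(a, b, True) for a, b in strict_edges]
    nodes = {x for a, b, _ in edges for x in (a, b)}

    def reachable(start):
        # Explore states (item, whether a strict edge has been used on the path so far).
        frontier, seen = [(start, False)], set()
        while frontier:
            node, used_strict = frontier.pop()
            if (node, used_strict) in seen:
                continue
            seen.add((node, used_strict))
            for a, b, strict in edges:
                if a == node:
                    frontier.append((b, used_strict or strict))
        return seen

    return any((n, True) in reachable(n) for n in nodes)

# Original preferences: apples > bananas, bananas = oranges, oranges > pears, pears > apples.
weak = [("bananas", "oranges"), ("oranges", "bananas")]
strict = [("apples", "bananas"), ("oranges", "pears"), ("pears", "apples")]
print(strict_cycle_exists(weak, strict))               # True: no utility function fits

# 'Fix': treat oranges as equal to pears instead of strictly better.
weak_fixed = weak + [("oranges", "pears"), ("pears", "oranges")]
strict_fixed = [("apples", "bananas"), ("pears", "apples")]
print(strict_cycle_exists(weak_fixed, strict_fixed))   # still True: exactly as incoherent
```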

What it might look like if this gap matters: AI systems proliferate, and have various goals. Some AI systems try to make money in the stock market. Some make movies. Some try to direct traffic optimally. Some try to make the Democratic party win an election. Some try to make Walmart maximally profitable. These systems have no perceptible desire to optimize the universe for forwarding these goals because they aren’t maximizing a general utility function, they are more ‘behaving like someone who is trying to make Walmart profitable’. They make strategic plans and think about their comparative advantage and forecast business dynamics, but they don’t build nanotechnology to manipulate everybody’s brains, because that’s not the kind of behavior pattern they were designed to follow. The world looks kind of like the current world, in that it is fairly non-obvious what any entity’s ‘utility function’ is. It often looks like AI systems are ‘trying’ to do things, but there’s no reason to think that they are enacting a rational and consistent plan, and they rarely do anything shocking or galaxy-brained.

Ambiguously strong forces for goal-directedness need to meet an ambiguously high bar to cause a risk

The forces for goal-directedness mentioned in I are presumably of finite strength. For instance, if coherence arguments correspond to pressure for machines to become more like utility maximizers, there is an empirical answer to how fast that would happen with a given system. There is also an empirical answer to how ‘much’ goal directedness is needed to bring about disaster, supposing that utility maximization would bring about disaster and, say, being a rock wouldn’t. Without investigating these empirical details, it is unclear whether a particular qualitatively identified force for goal-directedness will cause disaster within a particular time.

What it might look like if this gap matters: There are not that many systems doing something like utility maximization in the new AI economy. Demand is mostly for systems more like GPT or DALL-E, which transform inputs in some known way without reference to the world, rather than ‘trying’ to bring about an outcome. Maybe the world was headed for more of the latter, but ethical and safety concerns reduced desire for it, and it wasn’t that hard to do something else. Companies setting out to make non-agentic AI systems have no trouble doing so. Incoherent AIs are never observed making themselves more coherent, and training has never produced an agent unexpectedly. There are lots of vaguely agentic things, but they don’t pose much of a problem. There are a few things at least as agentic as humans, but they are a small part of the economy.

B. Contra “goal-directed AI systems’ goals will be bad”

Small differences in utility functions may not be catastrophic

Arguably, humans are likely to have somewhat different values to one another even after arbitrary reflection. If so, there is some extended region of the space of possible values that the values of different humans fall within. That is, ‘human values’ is not a single point.

If the values of misaligned AI systems fall within that region, this would not appear to be worse in expectation than the situation where the long-run future was determined by the values of humans other than you. (This may still be a huge loss of value relative to the alternative, if a future determined by your own values is vastly better than that chosen by a different human, and if you also expected to get some small fraction of the future, and will now get much less. These conditions seem non-obvious however, and if they obtain you should worry about more general problems than AI.)

Plausibly even a single human, after reflecting, could on their own come to different places in a whole region of specific values, depending on somewhat arbitrary features of how the reflecting period went. In that case, even the values-on-reflection of a single human is an extended region of values space, and an AI which is only slightly misaligned could be the same as some version of you after reflecting.

There is a further larger region, ‘that which can be reliably enough aligned with typical human values via incentives in the environment’, which is arguably larger than the circle containing most human values. Human society makes use of this a lot: for instance, most of the time particularly evil humans don’t do anything too objectionable because it isn’t in their interests. This region is probably smaller for more capable creatures such as advanced AIs, but still it is some size.

Thus it seems that some amount of AI divergence from your own values is probably broadly fine, i.e. not worse than what you should otherwise expect without AI. 

Thus in order to arrive at a conclusion of doom, it is not enough to argue that we cannot align AI perfectly. The question is a quantitative one of whether we can get it close enough. And how close is ‘close enough’ is not known. 

What it might look like if this gap matters: there are many superintelligent goal-directed AI systems around. They are trained to have human-like goals, but we know that their training is imperfect and none of them has goals exactly like those presented in training. However if you just heard about a particular system’s intentions, you wouldn’t be able to guess if it was an AI or a human. Things happen much faster than they used to, because superintelligent AI is superintelligent, but not obviously in a direction less broadly in line with human goals than when humans were in charge.

Differences between AI and human values may be small 

AI trained to have human-like goals will have something close to human-like goals. How close? Call it d, for a particular occasion of training AI. 

If d doesn’t have to be 0 for safety (from above), then there is a question of whether it is an acceptable size. 

I know of two issues here, pushing d upward. One is that with a finite number of training examples, the fit between the true function and the learned function will be wrong. The other is that you might accidentally create a monster (‘misaligned mesaoptimizer’) who understands its situation and pretends to have the utility function you are aiming for so that it can be freed and go out and manifest its own utility function, which could be just about anything. If this problem is real, then the values of an AI system might be arbitrarily different from the training values, rather than ‘nearby’ in some sense, so d is probably unacceptably large. But if you avoid creating such mesaoptimizers, then it seems plausible to me that d is very small.

If humans also substantially learn their values via observing examples, then the variation in human values is arising from a similar process, so might be expected to be of a similar scale. If we care to make the ML training process more accurate than the human learning one, it seems likely that we could. For instance, d gets smaller with more data.
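
As a toy illustration (mine, and only a loose analogy for value learning): fit a simple model to noisy samples of a ‘true’ function and measure how far the learned function is from the true one as the number of examples grows. With finitely many examples the learned function is somewhat wrong, and the gap, a stand-in for d, shrinks with more data.

```python
# Toy illustration: the gap between a learned function and the true function
# (a stand-in for d) tends to shrink as the number of training examples grows.

import numpy as np

rng = np.random.default_rng(0)
true_coeffs = np.array([0.5, -1.0, 2.0])             # the 'true values': a fixed quadratic

def true_f(x):
    return np.polyval(true_coeffs, x)

def learned_distance(n_examples):
    x = rng.uniform(-1, 1, n_examples)
    y = true_f(x) + rng.normal(0, 0.1, n_examples)   # noisy observations of the true function
    learned = np.polyfit(x, y, deg=2)                # fit a quadratic to the samples
    grid = np.linspace(-1, 1, 200)
    return np.mean(np.abs(np.polyval(learned, grid) - true_f(grid)))  # average gap, i.e. d

for n in (10, 100, 1000, 10000):
    print(n, round(learned_distance(n), 4))          # d typically falls as n grows
```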

Another line of evidence is that for things that I have seen AI learn so far, the distance from the real thing is intuitively small. If AI learns my values as well as it learns what faces look like, it seems plausible that it carries them out better than I do.

As minor additional evidence here, I don’t know how to describe any slight differences in utility functions that are catastrophic. Talking concretely, what does a utility function look like that is so close to a human utility function that an AI system has it after a bunch of training, but which is an absolute disaster? Are we talking about the scenario where the AI values a slightly different concept of justice, or values satisfaction a smidgen more relative to joy than it should? And then that’s a moral disaster because it is wrought across the cosmos? Or is it that it looks at all of our inaction and thinks we want stuff to be maintained very similar to how it is now, so crushes any efforts to improve things? 

What it might look like if this gap matters: when we try to train AI systems to care about what specific humans care about, they usually pretty much do, as far as we can tell. We basically get what we trained for. For instance, it is hard to distinguish them from the human in question. (It is still important to actually do this training, rather than making AI systems not trained to have human values.)

Maybe value isn’t fragile

Eliezer argued that value is fragile, via examples of ‘just one thing’ that you can leave out of a utility function, and end up with something very far away from what humans want. For instance, if you leave out ‘boredom’ then he thinks the preferred future might look like repeating the same otherwise perfect moment again and again. (His argument is perhaps longer—that post says there is a lot of important background, though the bits mentioned don’t sound relevant to my disagreement.) This sounds to me like ‘value is not resilient to having components of it moved to zero’, which is a weird usage of ‘fragile’, and in particular, doesn’t seem to imply much about smaller perturbations. And smaller perturbations seem like the relevant thing with AI systems trained on a bunch of data to mimic something. 

You could very analogously say ‘human faces are fragile’ because if you just leave out the nose it suddenly doesn’t look like a typical human face at all. Sure, but is that the kind of error you get when you try to train ML systems to mimic human faces? Almost none of the faces on thispersondoesnotexist.com are blatantly morphologically unusual in any way, let alone noseless. Admittedly one time I saw someone whose face was neon green goo, but I’m guessing you can get the rate of that down pretty low if you care about it.

Eight examples, no cherry-picking: [image grid of eight generated faces from thispersondoesnotexist.com]

Skipping the nose is the kind of mistake you make if you are a child drawing a face from memory. Skipping ‘boredom’ is the kind of mistake you make if you are a person trying to write down human values from memory. My guess is that this seemed closer to the plan in 2009 when that post was written, and that people cached the takeaway and haven’t updated it for deep learning which can learn what faces look like better than you can.

What it might look like if this gap matters: there is a large region ‘around’ my values in value space that is also pretty good according to me. AI easily lands within that space, and eventually creates some world that is about as good as the best possible utopia, according to me. There aren’t a lot of really crazy and terrible value systems adjacent to my values.

Short-term goals

Utility maximization really only incentivises drastically altering the universe if one’s utility function places a high enough value on very temporally distant outcomes relative to near ones. That is, long term goals are needed for danger. A person who cares most about winning the timed chess game in front of them should not spend time accruing resources to invest in better chess-playing.

AI systems could have long-term goals via people intentionally training them to have such goals, or via long-term goals naturally arising in systems not trained to have them.

Humans seem to discount the future a lot in their usual decision-making (they have goals years in advance but rarely a hundred years) so the economic incentive to train AI to have very long term goals might be limited.

It’s not clear that training for relatively short term goals naturally produces creatures with very long term goals, though it might.

Thus if AI systems fail to have value systems relatively similar to human values, it is not clear that many will have the long time horizons needed to motivate taking over the universe.

What it might look like if this gap matters: the world is full of agents who care about relatively near-term issues, and are helpful to that end, and have no incentive to make long-term large scale schemes. Reminiscent of the current world, but with cleverer short-termism.

C. Contra “superhuman AI would be sufficiently superior to humans to overpower humanity”

Human success isn’t from individual intelligence

The argument claims (or assumes) that surpassing ‘human-level’ intelligence (i.e. the mental capacities of an individual human) is the relevant bar for matching the power-gaining capacity of humans, such that passing this bar in individual intellect means outcompeting humans in general in terms of power (argument III.2), if not being able to immediately destroy them all outright (argument III.1). In a similar vein, introductions to AI risk often start by saying that humanity has triumphed over the other species because it is more intelligent, as a lead-in to saying that if we make something more intelligent still, it will inexorably triumph over humanity.

This hypothesis about the provenance of human triumph seems wrong. Intellect surely helps, but humans look to be powerful largely because they share their meager intellectual discoveries with one another and consequently save them up over time4. You can see this starkly by comparing the material situation of Alice, a genius living in the stone age, and Bob, an average person living in 21st Century America. Alice might struggle all day to get a pot of water, while Bob might be able to summon all manner of delicious drinks from across the oceans, along with furniture, electronics, information, etc. Much of Bob’s power probably did flow from the application of intelligence, but not from Bob’s individual intelligence: rather, from Alice’s intelligence, and that of those who came between them.

Bob’s greater power isn’t directly just from the knowledge and artifacts Bob inherits from other humans. He also seems to be helped for instance by much better coordination: both from a larger number of people coordinating together, and from better infrastructure for that coordination (e.g. for Alice the height of coordination might be an occasional big multi-tribe meeting with trade, and for Bob it includes global instant messaging and banking systems and the Internet). One might attribute all of this ultimately to innovation, and thus to intelligence and communication, or not. I think it’s not important to sort out here, as long as it’s clear that individual intelligence isn’t the source of power.

It could still be that with a given bounty of shared knowledge (e.g. within a given society), intelligence grants huge advantages. But even that doesn’t look true here: 21st Century geniuses live basically like 21st Century people of average intelligence, give or take.

Why does this matter? Well for one thing, if you make AI which is merely as smart as a human, you shouldn’t then expect it to do that much better than a genius living in the stone age. That’s what human-level intelligence gets you: nearly nothing. A piece of rope after millions of lifetimes. Humans without their culture are much like other animals. 

To wield the control-over-the-world of a genius living in the 21st Century, the human-level AI would seem to need something like the other benefits that the 21st century genius gets from their situation in connection with a society. 

One such thing is access to humanity’s shared stock of hard-won information. AI systems plausibly do have this, if they can get most of what is relevant by reading the internet. This isn’t obvious: people also inherit information from society through copying habits and customs, learning directly from other people, and receiving artifacts with implicit information (for instance, a factory allows whoever owns the factory to make use of intellectual work that was done by the people who built the factory, but that information may not be available explicitly even to the owner of the factory, let alone to readers on the internet). These sources of information seem likely to also be available to AI systems though, at least if they are afforded the same options as humans.

My best guess is that AI systems easily do better than humans on extracting information from humanity’s stockpile, and on coordinating, and so on this account are probably in an even better position to compete with humans than one might think on the individual intelligence model, but that is a guess. In that case perhaps this misunderstanding makes little difference to the outcomes of the argument. However it seems at least a bit more complicated. 

Suppose that AI systems can have access to all information humans can have access to. The power the 21st century person gains from their society is modulated by their role in society, and relationships, and rights, and the affordances society allows them as a result. Their power will vary enormously depending on whether they are employed, or listened to, or paid, or a citizen, or the president. If AI systems’ power stems substantially from interacting with society, then their power will also depend on affordances granted, and humans may choose not to grant them many affordances (see section ‘Intelligence may not be an overwhelming advantage’ for more discussion).

However suppose that your new genius AI system is also treated with all privilege. The next way that this alternate model matters is that if most of what is good in a person’s life is determined by the society they are part of, and their own labor is just buying them a tiny piece of that inheritance, then if they are for instance twice as smart as any other human, they don’t get to use technology that is twice as good. They just get a larger piece of that same shared technological bounty purchasable by anyone. Because each individual person is adding essentially nothing in terms of technology, so twice that is still basically nothing. 

In contrast, I think people are often imagining that a single entity somewhat smarter than a human will be able to quickly use technologies that are somewhat better than current human technologies. This seems to be mistaking the actions of a human and the actions of a human society. If a hundred thousand people sometimes get together for a few years and make fantastic new weapons, you should not expect an entity somewhat smarter than a person to make even better weapons. That’s off by a factor of about a hundred thousand. 

There might be places you can get far ahead of humanity by being better than a single human—it depends how much accomplishments depend on the few most capable humans in the field, and how few people are working on the problem. But for instance the Manhattan Project took a hundred thousand people several years, and von Neumann (a mythically smart scientist) joining the project did not reduce it to an afternoon. Plausibly to me, some specific people being on the project caused it to not take twice as many person-years, though the plausible candidates here seem to be more in the business of running things than doing science directly (though that also presumably involves intelligence). But even if you are an ambitious somewhat superhuman intelligence, the influence available to you seems to plausibly be limited to making a large dent in the effort required for some particular research endeavor, not single-handedly outmoding humans across many research endeavors.

This is all reason to doubt that a small number of superhuman intelligences will rapidly take over or destroy the world (as in III.1). This doesn’t preclude a set of AI systems that are together more capable than a large number of people from making great progress. However some related issues seem to make that less likely.

Another implication of this model is that if most human power comes from buying access to society’s shared power, i.e. interacting with the economy, you should expect intellectual labor by AI systems to usually be sold, rather than for instance put toward a private stock of knowledge. This means the intellectual outputs are mostly going to society, and the main source of potential power to an AI system is the wages received (which may allow it to gain power in the long run). However it seems quite plausible that AI systems at this stage will generally not receive wages, since they presumably do not need them to be motivated to do the work they were trained for. It also seems plausible that they would be owned and run by humans. This would seem to not involve any transfer of power to that AI system, except insofar as its intellectual outputs benefit it (e.g. if it is writing advertising material, maybe it doesn’t get paid for that, but if it can write material that slightly furthers its own goals in the world while also fulfilling the advertising requirements, then it sneaked in some influence.) 

If there is AI which is moderately more competent than humans, but not sufficiently more competent to take over the world, then it is likely to contribute to this stock of knowledge and affordances shared with humans. There is no reason to expect it to build a separate competing stock, any more than there is reason for a current human household to try to build a separate competing stock rather than sell their labor to others in the economy. 

In summary:

  1. Functional connection with a large community of other intelligences in the past and present is probably a much bigger factor in the success of humans as a species or individual humans than is individual intelligence. 
  2. Thus this also seems more likely to be important for AI success than individual intelligence. This is contrary to a usual argument for AI superiority, but probably leaves AI systems at least as likely to outperform humans, since superhuman AI is probably superhumanly good at taking in information and coordinating.
  3. However it is not obvious that AI systems will have the same access to society’s accumulated information e.g. if there is information which humans learn from living in society, rather than from reading the internet. 
  4. And it seems an open question whether AI systems are given the same affordances in society as humans, which also seem important to making use of the accrued bounty of power over the world that humans have. For instance, if they are not granted the same legal rights as humans, they may be at a disadvantage in doing trade or engaging in politics or accruing power.
  5. The fruits of greater intelligence for an entity will probably not look like society-level accomplishments unless it is a society-scale entity.
  6. The route to influence with smaller fruits probably by default looks like participating in the economy rather than trying to build a private stock of knowledge.
  7. If the resources from participating in the economy accrue to the owners of AI systems, not to the systems themselves, then there is less reason to expect the systems to accrue power incrementally, and they are at a severe disadvantage relative to humans. 

Overall these are reasons to expect AI systems with around human-level cognitive performance to not destroy the world immediately, and to not amass power as easily as one might imagine. 

What it might look like if this gap matters: If AI systems are somewhat superhuman, then they do impressive cognitive work, and each contributes to technology more than the best human geniuses, but not more than the whole of society, and not enough to materially improve their own affordances. They don’t gain power rapidly because they are disadvantaged in other ways, e.g. by lack of information, lack of rights, lack of access to positions of power. Their work is sold and used by many actors, and the proceeds go to their human owners. AI systems do not generally end up with access to masses of technology that others do not have access to, and nor do they have private fortunes. In the long run, as they become more powerful, they might take power if other aspects of the situation don’t change. 

AI agents may not be radically superior to combinations of humans and non-agentic machines

‘Human level capability’ is a moving target. For comparing the competence of advanced AI systems to humans, the relevant comparison is with humans who have state-of-the-art AI and other tools. For instance, the human capacity to make art quickly has recently been improved by a variety of AI art systems. If there were now an agentic AI system that made art, it would make art much faster than a human of 2015, but perhaps hardly faster than a human of late 2022. If humans continually have access to tool versions of AI capabilities, it is not clear that agentic AI systems must ever have an overwhelmingly large capability advantage for important tasks (though they might). 

(This is not an argument that humans might be better than AI systems, but rather: if the gap in capability is smaller, then the pressure for AI systems to accrue power is less and thus loss of human control is slower and easier to mitigate entirely through other forces, such as subsidizing human involvement or disadvantaging AI systems in the economy.)

Some advantages of being an agentic AI system vs. a human with a tool AI system seem to be:

  1. There might just not be an equivalent tool system, for instance if it is impossible to train systems without producing emergent agents.
  2. When every part of a process takes into account the final goal, this should make the choices within the task more apt for the final goal (and agents know their final goal, whereas tools carrying out parts of a larger problem do not).
  3. For humans, the interface for using a capability of one’s mind tends to be smoother than the interface for using a tool. For instance a person who can do fast mental multiplication can do this more smoothly and use it more often than a person who needs to get out a calculator. This seems likely to persist.

1 and 2 may or may not matter much. 3 matters more for brief, fast, unimportant tasks. For instance, consider again people who can do mental calculations better than others. My guess is that this advantages them at using Fermi estimates in their lives and buying cheaper groceries, but does not make them materially better at making large financial choices. For a one-off large financial choice, the effort of getting out a calculator is worth it and the delay is very short compared to the length of the activity. The same seems likely true of humans with tools vs. agentic AI with the same capacities integrated into their minds. Conceivably the gap between humans with tools and goal-directed AI is small for large, important tasks.

What it might look like if this gap matters: agentic AI systems have substantial advantages over humans with tools at some tasks like rapid interaction with humans, and responding to rapidly evolving strategic situations.  One-off large important tasks such as advanced science are mostly done by tool AI. 

Trust

If goal-directed AI systems are only mildly more competent than some combination of tool systems and humans (as suggested by considerations in the last two sections), we still might expect AI systems to out-compete humans, just more slowly. However AI systems have one serious disadvantage as employees of humans: they are intrinsically untrustworthy, while we don’t understand them well enough to be clear on what their values are or how they will behave in any given case. Even if they did perform as well as humans at some task, if humans can’t be certain of that, then there is reason to disprefer using them. This can be thought of as two problems: firstly, slightly misaligned systems are less valuable because they genuinely do the thing you want less well, and secondly, even if they were not misaligned, if humans can’t know that (because we have no good way to verify the alignment of AI systems) then it is costly in expectation to use them. (This is only a further force acting against the supremacy of AI systems—they might still be powerful enough that using them is enough of an advantage that it is worth taking the hit on trustworthiness.)

What it might look like if this gap matters: in places where goal-directed AI systems are not typically hugely better than some combination of less goal-directed systems and humans, the job is often given to the latter if trustworthiness matters. 

Headroom

For AI to vastly surpass human performance at a task, there needs to be ample room for improvement above human level. For some tasks, there is not—tic-tac-toe is a classic example. It is not clear how close humans (or technologically aided humans) are to the limits of competence in the particular domains that will matter. It is to my knowledge an open question how much ‘headroom’ there is. My guess is a lot, but it isn’t obvious.

How much headroom there is varies by task. Categories of task for which there appears to be little headroom: 

  1. Tasks where we know what the best performance looks like, and humans can get close to it. For instance, machines cannot win more often than the best humans at tic-tac-toe (playing within the rules), solve Rubik’s cubes much more reliably, or extract calories from fuel much more efficiently.
  2. Tasks where humans are already reaping most of the value—for instance, perhaps most of the value of forks is in having a handle with prongs attached to the end, and while humans continue to design slightly better ones, and machines might be able to add marginal value to that project more than twice as fast as the human designers, they cannot perform twice as well in terms of the value of each fork, because forks are already 95% as good as they can be. 
  3. Tasks where better performance quickly becomes intractable. For instance, we know that for tasks in particular complexity classes, there are computational limits to how well one can perform across the board. Or for chaotic systems, there can be limits to predictability. (That is, tasks might lack headroom not because they are simple, but because they are complex. E.g. AI probably can’t predict the weather much further out than humans.)

Categories of task where a lot of headroom seems likely:

  1. Competitive tasks where the value of a certain level of performance depends on whether one is better or worse than one’s opponent, so that the marginal value of more performance doesn’t hit diminishing returns, as long as your opponent keeps competing and taking back what you just won. Though in one way this is like having little headroom: there’s no more value to be had—the game is zero sum. And while there might often be a lot of value to be gained by doing a bit better on the margin, still if all sides can invest, then nobody will end up better off than they were. So whether this seems more like high or low headroom depends on what we are asking exactly. Here we are asking if AI systems can do much better than humans: in a zero sum contest like this, they likely can in the sense that they can beat humans, but not in the sense of reaping anything more from the situation than the humans ever got.
  2. Tasks where it is twice as good to do the same task twice as fast, and where speed is bottlenecked on thinking time.
  3. Tasks where there is reason to think that optimal performance is radically better than we have seen. For instance, perhaps we can estimate how high Chess Elo rankings must go before reaching perfection by reasoning theoretically about the game, and perhaps it is very high (I don’t know).
  4. Tasks where humans appear to use very inefficient methods. For instance, it was perhaps predictable before calculators that they would be able to do mathematics much faster than humans, because humans can only keep a small number of digits in their heads, which doesn’t seem like an intrinsically hard problem. Similarly, I hear humans often use mental machinery designed for one mental activity for fairly different ones, through analogy. For instance, when I think about macroeconomics, I seem to be basically using my intuitions for dealing with water. When I do mathematics in general, I think I’m probably using my mental capacities for imagining physical objects.

What it might look like if this gap matters: many challenges in today’s world remain challenging for AI. Human behavior is not readily predictable or manipulable very far beyond what we have explored; only slightly more complicated schemes are feasible before the world’s uncertainties overwhelm planning; much better ads are soon met by much better immune responses; much better commercial decision-making ekes out some additional value across the board but most products were already fulfilling a lot of their potential; incredible virtual prosecutors meet incredible virtual defense attorneys and everything is as it was; there are a few rounds of attack-and-defense in various corporate strategies before a new equilibrium is reached, with broad recognition of those possibilities; conflicts and ‘social issues’ remain mostly intractable. There is a brief golden age of science before the newly low-hanging fruit are again plucked, and it is only lightning fast in areas where thinking was the main bottleneck, e.g. not in medicine.

Intelligence may not be an overwhelming advantage

Intelligence is helpful for accruing power and resources, all things equal, but many other things are helpful too. For instance money, social standing, allies, evident trustworthiness, not being discriminated against (this was slightly discussed in section ‘Human success isn’t from individual intelligence’). AI systems are not guaranteed to have those in abundance. The argument assumes that any difference in intelligence in particular will eventually win out over any differences in other initial resources. I don’t know of a reason to think that. 

Empirical evidence does not seem to support the idea that cognitive ability is a large factor in success. Situations where one entity is much smarter or more broadly mentally competent than other entities regularly occur without the smarter one taking control over the other:

  1. Species exist with all levels of intelligence. Elephants have not in any sense won over gnats; they do not rule gnats; they do not have obviously more control than gnats over the environment. 
  2. Competence does not seem to aggressively overwhelm other advantages in humans: 
    1. Looking at the world, the big discrepancies in power do not intuitively seem to be about intelligence.
    2. IQ 130 humans apparently earn very roughly $6000-$18,500 per year more than average IQ humans.
    3. Elected representatives are apparently smarter on average, but it is a slightly shifted curve, not a radical difference.
    4. MENSA isn’t a major force in the world.
    5. Many places where people see huge success through being cognitively able are ones where they show off their intelligence to impress people, rather than actually using it for decision-making. For instance, writers, actors, song-writers, comedians, all sometimes become very successful through cognitive skills. Whereas scientists, engineers and authors of software use cognitive skills to make choices about the world, and less often become extremely rich and famous, say. If intelligence were that useful for strategic action, it seems like using it for that would be at least as powerful as showing it off. But maybe this is just an accident of which fields have winner-takes-all type dynamics.
    6. If we look at people who evidently have good cognitive abilities given their intellectual output, their personal lives are not obviously drastically more successful, anecdotally.
    7. One might counter-counter-argue that humans are very similar to one another in capability, so even if intelligence matters much more than other traits, you won’t see that by looking at the near-identical humans. This does not seem to be true. Often at least, the difference in performance between mediocre human performance and top-level human performance is large, relative to the space below, iirc. For instance, in chess, the Elo difference between the best and worst players is about 2000, whereas the difference between amateur play and random play is maybe 400-2800 (if you accept Chess StackExchange guesses as a reasonable proxy for the truth here). And in terms of AI progress, amateur human play was reached in the 50s, roughly when research began, and world champion level play was reached in 1997. (A rough sense of what such Elo gaps mean is sketched after this list.)
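
For a rough sense of what these Elo gaps mean, here is the standard Elo expected-score formula worked out for a few rating differences (a quick illustrative calculation, not from the post):

```python
# Expected score of the higher-rated player under the standard Elo model,
# to give a sense of scale for the rating gaps mentioned above.

def expected_score(rating_gap):
    return 1 / (1 + 10 ** (-rating_gap / 400))

for gap in (400, 1000, 2000):
    print(gap, round(expected_score(gap), 5))
# roughly 0.91 at a 400-point gap, 0.997 at 1000, and 0.99999 at 2000:
# a 2000-point gap means the stronger player essentially always wins.
```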

And theoretically I don’t know why one would expect greater intelligence to win out over other advantages over time. There are actually two questionable theories here: 1) Charlotte having more overall control than David at time 0 means that Charlotte will tend to have an even greater share of control at time 1. And, 2) Charlotte having more intelligence than David at time 0 means that Charlotte will have a greater share of control at time 1 even if David has more overall control (i.e. more of other resources) at time 0.

What it might look like if this gap matters: there are many AI systems around, and they strive for various things. They don’t hold property, or vote, or get a weight in almost anyone’s decisions, or get paid, and are generally treated with suspicion. These things on net keep them from gaining very much power. They are very persuasive speakers however and we can’t stop them from communicating, so there is a constant risk of people willingly handing them power, in response to their moving claims that they are an oppressed minority who suffer. The main thing stopping them from winning is that their position as psychopaths bent on taking power for incredibly pointless ends is widely understood.

Unclear that many goals realistically incentivise taking over the universe

I have some goals. For instance, I want some good romance. My guess is that trying to take over the universe isn’t the best way to achieve this goal. The same goes for a lot of my goals, it seems to me. Possibly I’m in error, but I spend a lot of time pursuing goals, and very little of it trying to take over the universe. Whether a particular goal is best forwarded by trying to take over the universe as a substep seems like a quantitative empirical question, to which the answer is virtually always ‘not remotely’. Don’t get me wrong: all of these goals involve some interest in taking over the universe. All things equal, if I could take over the universe for free, I do think it would help in my romantic pursuits. But taking over the universe is not free. It’s actually super duper duper expensive and hard. So for most goals arising, it doesn’t bear considering. The idea of taking over the universe as a substep is entirely laughable for almost any human goal.

So why do we think that AI goals are different? I think the thought is that it’s radically easier for AI systems to take over the world, because all they have to do is to annihilate humanity, and they are way better positioned to do that than I am, and also better positioned to survive the death of human civilization than I am. I agree that it is likely easier, but how much easier? So much easier to take it from ‘laughably unhelpful’ to ‘obviously always the best move’? This is another quantitative empirical question.

What it might look like if this gap matters: Superintelligent AI systems pursue their goals. Often they achieve them fairly well. This is somewhat contrary to ideal human thriving, but not lethal. For instance, some AI systems are trying to maximize Amazon’s market share, within broad legality. Everyone buys truly incredible amounts of stuff from Amazon, and people often wonder if it is too much stuff. At no point does attempting to murder all humans seem like the best strategy for this. 

Quantity of new cognitive labor is an empirical question, not addressed

Whether some set of AI systems can take over the world with their new intelligence probably depends how much total cognitive labor they represent. For instance, if they are in total slightly more capable than von Neumann, they probably can’t take over the world. If they are together as capable (in some sense) as a million 21st Century human civilizations, then they probably can (at least in the 21st Century).

It also matters how much of that is goal-directed at all, and highly intelligent, and how much of that is directed at achieving the AI systems’ own goals rather than those we intended them for, and how much of that is directed at taking over the world. 

If we continued to build hardware, presumably at some point AI systems would account for most of the cognitive labor in the world. But if there is first an extended period of more minimal advanced AI presence, that would probably prevent an immediate death outcome, and improve humanity’s prospects for controlling a slow-moving AI power grab. 

What it might look like if this gap matters: when advanced AI is developed, there is a lot of new cognitive labor in the world, but it is a minuscule fraction of all of the cognitive labor in the world. A large part of it is not goal-directed at all, and of that, most of the new AI thought is applied to tasks it was intended for. Thus what part of it is spent on scheming to grab power for AI systems is too small to grab much power quickly. The amount of AI cognitive labor grows fast over time, and in several decades it is most of the cognitive labor, but humanity has had extensive experience dealing with its power grabbing.

Speed of intelligence growth is ambiguous

The idea that a superhuman AI would be able to rapidly destroy the world seems prima facie unlikely, since no other entity has ever done that. Two common broad arguments for it:

  1. There will be a feedback loop in which intelligent AI makes more intelligent AI repeatedly until AI is very intelligent.
  2. Very small differences in brains seem to correspond to very large differences in performance, based on observing humans and other apes. Thus any movement past human-level will take us to unimaginably superhuman level.

These both seem questionable.

  1. Feedback loops can happen at very different rates. Identifying a feedback loop empirically does not signify an explosion of whatever you are looking at. For instance, technology is already helping improve technology. To get to a confident conclusion of doom, you need evidence that the feedback loop is fast. (A toy illustration of how much the rate matters is sketched after this list.)
  2. It does not seem clear that small improvements in brains lead to large changes in intelligence in general, or will do on the relevant margin. Small differences between humans and other primates might include those helpful for communication (see Section ‘Human success isn’t from individual intelligence’), which do not seem relevant here. If there were a particularly powerful cognitive development between chimps and humans, it is unclear that AI researchers find that same insight at the same point in the process (rather than at some other time). 
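
To illustrate the first point with a toy sketch (a made-up growth model and parameters, purely for intuition, not a model of real AI progress): the same self-improvement feedback structure yields anything from leisurely progress to an explosion, depending entirely on the empirical rate constant.

```python
import math

# Toy model: capability C feeds its own growth, dC/dt = k * C, so C(t) = C0 * exp(k * t).
# The existence of the feedback loop fixes the shape of the curve, not its speed;
# the speed is set by k, which is the empirical question.
def capability(c0: float, k: float, years: float) -> float:
    return c0 * math.exp(k * years)

for k in (0.03, 0.3, 3.0):  # slow, moderate, explosive feedback (illustrative values only)
    print(f"k = {k}: capability multiplier after 10 years = {capability(1.0, k, 10):,.1f}x")
```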

A large number of other arguments have been put forward for expecting very fast growth in intelligence at around human level. I previously made a list of them with counterarguments, and none of the arguments seemed very compelling to me. Overall, I don’t know of a strong reason to expect very fast growth in AI capabilities at around human-level AI performance, though I hear such arguments might exist. 

What it would look like if this gap mattered: AI systems would at some point perform at around human level at various tasks, and would contribute to AI research, along with everything else. This would contribute to progress to an extent familiar from other technological progress feedback, and would not e.g. lead to a superintelligent AI system in minutes.

Key concepts are vague

Concepts such as ‘control’, ‘power’, and ‘alignment with human values’ all seem vague. ‘Control’ is not zero sum (as seemingly assumed) and is somewhat hard to pin down, I claim. What an ‘aligned’ entity is exactly seems to be contentious in the AI safety community, but I don’t know the details. My guess is that upon further probing, these conceptual issues are resolvable in a way that doesn’t endanger the argument, but I don’t know. I’m not going to go into this here.

What it might look like if this gap matters: upon thinking more, we realize that our concerns were confused. Things go fine with AI in ways that seem obvious in retrospect. This might look like it did for people concerned about the ‘population bomb’ or as it did for me in some of my youthful concerns about sustainability: there was a compelling abstract argument for a problem, and the reality didn’t fit the abstractions well enough to play out as predicted.

D. Contra the whole argument

The argument overall proves too much about corporations

Here is the argument again, but modified to be about corporations. A couple of pieces don’t carry over, but they don’t seem integral.

I. Any given corporation is likely to be ‘goal-directed’

Reasons to expect this:

  1. Goal-directed behavior is likely to be valuable in corporations, e.g. economically.
  2. Goal-directed entities may tend to arise from machine learning training processes not intending to create them (at least via the methods that are likely to be used).
  3. ‘Coherence arguments’ may imply that systems with some goal-directedness will become more strongly goal-directed over time.

II. If goal-directed superhuman corporations are built, their desired outcomes will probably be about as bad as an empty universe by human lights

Reasons to expect this:

  1. Finding useful goals that aren’t extinction-level bad appears to be hard: we don’t have a way to usefully point at human goals, and divergences from human goals seem likely to produce goals that are in intense conflict with human goals, due to a) most goals producing convergent incentives for controlling everything, and b) value being ‘fragile’, such that an entity with ‘similar’ values will generally create a future of virtually no value. 
  2. Finding goals that are extinction-level bad and temporarily useful appears to be easy: for example, corporations with the sole objective ‘maximize company revenue’ might profit for a time before gathering the influence and wherewithal to pursue the goal in ways that blatantly harm society.
  3. Even if humanity found acceptable goals, giving a corporation any specific goals appears to be hard. We don’t know of any procedure to do it, and we have theoretical reasons to expect that AI systems produced through machine learning training will generally end up with goals other than those that they were trained according to. The randomly aberrant goals that result are probably extinction-level bad, for reasons described in II.1 above.
     

III. If most goal-directed corporations have bad goals, the future will very likely be bad

That is, a set of ill-motivated goal-directed corporations, of a scale likely to occur, would be capable of taking control of the future from humans. This is supported by at least one of the following being true:

  1. A corporation would destroy humanity rapidly. This may be via ultra-powerful capabilities at e.g. technology design and strategic scheming, or through gaining such powers in an ‘intelligence explosion’ (self-improvement cycle). Either of those things may happen either through exceptional heights of intelligence being reached or through highly destructive ideas being available to minds only mildly beyond our own.
  2. Superhuman AI would gradually come to control the future via accruing power and resources. Power and resources would be more available to the corporation than to humans on average, because of the corporation having far greater intelligence.

This argument does point at real issues with corporations, but we do not generally consider such issues existentially deadly. 

One might argue that there are defeating reasons that corporations do not destroy the world: they are made of humans so can be somewhat reined in; they are not smart enough; they are not coherent enough. But in that case, the original argument needs to make reference to these things, so that they apply to one and not the other.

What it might look like if this counterargument matters: something like the current world. There are large and powerful systems doing things vastly beyond the ability of individual humans, and acting in a definitively goal-directed way. We have a vague understanding of their goals, and do not assume that they are coherent. Their goals are clearly not aligned with human goals, but they have enough overlap that many people are broadly in favor of their existence. They seek power. This all causes some problems, but problems within the power of humans and other organized human groups to keep under control, for some definition of ‘under control’.

Conclusion

I think there are quite a few gaps in the argument, as I understand it. My current guess (prior to reviewing other arguments and integrating things carefully) is that enough uncertainties might resolve in the dangerous directions that existential risk from AI is a reasonable concern. I don’t at present though see how one would come to think it was overwhelmingly likely.

124 comments

Comments sorted by top scores.

comment by Wei Dai (Wei_Dai) · 2022-10-15T01:45:25.351Z · LW(p) · GW(p)

I think there are quite a few gaps in the argument, as I understand it. My current guess (prior to reviewing other arguments and integrating things carefully) is that enough uncertainties might resolve in the dangerous directions that existential risk from AI is a reasonable concern. I don’t at present though see how one would come to think it was overwhelmingly likely.

Suppose you went through the following exercise: for each scenario described under "What it might look like if this gap matters", ask:

  1. Is this an existentially secure [EA · GW] state of affairs?
  2. If not, what are the main obstacles to reaching existential security from here?

If you then collected the obstacles, you might assemble a list like this one [LW · GW], which might update you toward AI x-risk being "overwhelmingly likely". (Personally, if I had to put a number on it, I'd say 80%.)

Replies from: elifland, jacob_cannell
comment by elifland · 2022-10-15T19:00:35.581Z · LW(p) · GW(p)

Agree directionally. I made a similar point in my review of "Is power-seeking AI an existential risk?" [LW · GW]:

In one sentence, my concern is that the framing of the report and decomposition is more like “avoid existential catastrophe” than “achieve a state where existential catastrophe is extremely unlikely and we are fulfilling humanity’s potential”, and this will bias readers toward lower estimates.

comment by jacob_cannell · 2022-10-15T21:28:32.590Z · LW(p) · GW(p)

I disagree strongly with this implied framing that all which matters is minimization of risk. Functional humans are not pure risk-avoiders, nor is our civilization. Small chances of heaven can counterbalance small chances of hell. (I also disagree with the implied model from your first link where cumulative risk is the product of small independent risk per year, but that's more minor in comparison).

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2022-10-15T23:22:46.030Z · LW(p) · GW(p)

Do you think there's a way to reframe my position in a way that you'd agree with, or at least don't strongly disagree with? (In other words, I'm not sure how much of the disagreement is with the substance of what I'm saying vs the way I'm saying it.) Or, to approach this another way, how would you state/frame your own position on this topic?

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-18T00:02:55.329Z · LW(p) · GW(p)

You linked to an article on existential security - “a place of safety - a place where existential risk is low and stays low” - which implies all that matters is risk minimization, rather than utility maximization with some risk discounting. To be fair, my disagreement there isn't specific to your points.

Separately, I'm also skeptical of estimating risk through some long list of obstacles, as the relevance of those obstacles is correlated with, or mostly determined by, a small number of more fundamental issues (takeoff speed, brain tractability, alignment vs capability tractability, etc).

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-10-20T06:07:12.493Z · LW(p) · GW(p)

You linked to an article on existential security - “a place of safety - a place where existential risk is low and stays low” - which implies all that matters is risk minimization, rather than utility maximization with some risk discounting.

Existential risk is just the probability that a large portion of the future's value is lost. "Small chances of heaven can counterbalance small chances of hell." implies that it's about reducing the risk of hell, when in fact it's equally concerned with the absence of Heaven.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-20T17:45:33.991Z · LW(p) · GW(p)

Ok that is an unexpected interpretation as it's not how I typically think of 'risk', but yes if that's the intended interpretation it resolves my objection.

comment by porby · 2022-10-15T00:17:13.535Z · LW(p) · GW(p)

Great post! I think this captures a lot of why I'm not ultradoomy (only, er, 45%-ish doomy, at the moment), especially A and B. I think it's at least possible that our reality is on easymode, where muddling could conceivably put an AI into close enough territory to not trigger an oops.

I'd be even less doomy if I agreed with the counterarguments in C. Unfortunately, I can't shake the suspicion that superintelligence is the kind of ridiculously powerful lever that would magnify small oopses into the largest possible oopses.

Hypothetically, if we took a clever human's general capacity for problem solving, stripped it of limitations like getting bored or tired, got rid of its pesky intuitions around ethics, and sped it up by a factor of 1,000... I'd be very worried about what it would be able to do. Even without greater capacity for insight or an enhanced working memory, simply thinking really fast would be a broken superpower.

Such an entity might not be able to recreate the technology of modern civilization starting from scratch (both in resources and knowledge) in the stone age within 30 years, primarily due to physical interaction requirements. But starting from anything like modern civilization? That would get weird fast.

In other words, it seems like the intelligence range of humans - or even the range across animals and humans - is small compared to what is artificially possible even if we only consider speed. And it seems very likely at this point that a well-built artificial mind could have higher quality insights, too. MuZero certainly seems to, within its domain. I don't find much comfort in observable intelligence differences not always resulting in domination.

Replies from: awg
comment by awg · 2022-10-15T18:49:22.516Z · LW(p) · GW(p)

Agreed that superhuman intelligence seems like the kind of thing that could be a very powerful lever. What gets me is that we don't seem to know how orthogonal or non-orthogonal intelligence and empathy are to one another.[1]  If we were capable of creating a superhumanly intelligent AI and we were able to give it superhuman empathy, I might be inclined to trust ceding a large amount of power and control to that system (or set of systems, whatever). But a sociopathic superhuman intelligence? Definitely not ceding power over to that system. 

The question then becomes to me, how confident are we that we are not creating dangerously sociopathic AI?

  1. ^

    If I were to take a stab, I would say they were almost entirely orthogonal, as we have perfectly intelligent yet sociopathic humans walking around today who lack any sort of empathy. Giving any of these people superhuman ability and control would seem like an obviously terrible idea to me.

comment by Ronny Fernandez (ronny-fernandez) · 2022-10-14T22:46:09.612Z · LW(p) · GW(p)

There's a nearby kind of obvious but rarely directly addressed generalized version of one of your arguments, which is that ML learns complex functions all the time, so why should human values be any different? I rarely see this discussed, and I thought the replies from Nate and the ELK related difficulties were important to have out in the open, so thanks a lot for including the face learning <-> human values learning analogy. 

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-10-16T11:58:42.005Z · LW(p) · GW(p)

For at least about ten years in my experience, people in this community have been saying the main problem isn't getting the AI to understand human values, it's getting the AI to have human values. Unfortunately the phrase "learn human values" is sometimes used to mean "have human values" and sometimes to mean "understand human values", hence the confusion.

Replies from: jacob_cannell, sharmake-farah
comment by jacob_cannell · 2022-10-16T15:35:55.159Z · LW(p) · GW(p)

To have human values the AI needs to either learn them or have them instilled. EY’s complexity/fragility of human values argument is directed against early proposals for learning human values for the AI utility function. Obviously at some point a powerful AI will learn a model of human values somewhere in its world model, but that is irrelevant because that doesn’t affect its utility function and it’s far too late - the AI needs a robust model of human values well before it becomes superhuman.

Katja’s point is valid - DL did not fail in the way EY predicted, and the success of DL gives hope that we can learn superhuman models of human values to steer developing AI (which again is completely unrelated to the AI later learning human values somewhere in its world model).

Replies from: sharmake-farah, hairyfigment
comment by Noosphere89 (sharmake-farah) · 2022-10-16T16:32:48.513Z · LW(p) · GW(p)

To have human values the AI needs to either learn them or have them instilled. EY’s complexity/fragility of human values argument is directed against early proposals for learning human values for the AI utility function. Obviously at some point a powerful AI will learn a model of human values somewhere in its world model, but that is irrelevant because that doesn’t affect its utility function and it’s far too late - the AI needs a robust model of human values well before it becomes superhuman.

Katja’s point is valid - DL did not fail in the way EY predicted, and the success of DL gives hope that we can learn superhuman models of human values to steer developing AI (which again is completely unrelated to the AI later learning human values somewhere in its world model).

I agree Eliezer is wrong, though that's not enough to ensure success. In particular, you need to avoid inner alignment issues like deceptive alignment, where it learns values very well only for instrumental convergence reasons, and once it's strong, it overthrows the humans and pursues whatever terminal goal it has.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-16T17:30:04.622Z · LW(p) · GW(p)

Sim boxing can solve deceptive alignment (and may be the only viable solution)

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2022-10-16T17:35:05.651Z · LW(p) · GW(p)

I agree that boxing is at least a first step, so that it doesn't get more compute, or worse, FOOM.

The tricky problem is that we need to be able to train away a deceptive AI or forbid it entirely, without the deception merely becoming obfuscated so that it only looks trained away.

This is why we need to move beyond the black box paradigm, and why strong interpretability tools are necessary.

comment by hairyfigment · 2022-10-16T18:11:26.217Z · LW(p) · GW(p)

DL did not fail in the way EY predicted,

Where's the link for that prediction, because I think there's more than one example of critics putting words in his mouth, and then citing a place where he says something manifestly different.

Here's a post from 2008 [LW · GW], where he says the following:

As a matter of fact, if you use the right kind of neural network units, this "neural network" ends up exactly, mathematically equivalent to Naive Bayes.  The central unit just needs a logistic threshold—an S-curve response—and the weights of the inputs just need to match the logarithms of the likelihood ratios, etcetera.  In fact, it's a good guess that this is one of the reasons why logistic response often works so well in neural networks—it lets the algorithm sneak in a little Bayesian reasoning while the designers aren't looking.

Just because someone is presenting you with an algorithm that they call a "neural network" with buzzwords like "scruffy" and "emergent" plastered all over it, disclaiming proudly that they have no idea how the learned network works—well, don't assume that their little AI algorithm really is Beyond the Realms of Logic.  For this paradigm of adhockery, if it works, will turn out to have Bayesian structure; it may even be exactly equivalent to an algorithm of the sort called "Bayesian".

In a discussion from 2010, he's offered the chance to say that he doesn't think the machine learning of the time could produce AGI even with a smarter approach, and he appears to pull back from saying that:

But if we’re asking about works that are sort of billing themselves as ‘I am Artificial General Intelligence’, then I would say that most of that does indeed fail immediately and indeed I cannot think of a counterexample which fails to fail immediately, but that’s a sort of extreme selection effect, and it’s because if you’ve got a good partial solution, or solution to a piece of the problem, and you’re an academic working in AI, and you’re anything like sane, you’re just going to bill it as plain old AI, and not take the reputational hit from AGI.  The people who are bannering themselves around as AGI tend to be people who think they’ve solved the whole problem, and of course they’re mistaken. So to me it really seems like to say that all the things I’ve read on AGI immediately fundamentally fail is not even so much a critique of AI as rather a comment on what sort of work tends to bill itself as Artificial General Intelligence.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-17T04:20:50.330Z · LW(p) · GW(p)

The context should make it clear I was not talking about an explicit prediction. See this comment [LW(p) · GW(p)] for more explication.

I said:

EY’s complexity fragility of human values argument is directed against early proposals for learning human values for AI utility function.

This is obviously true and beyond debate, see the quotes in my linked comment from EY's "Complex Value Systems are Required to Realize Valuable Futures" where he critiques Hibbard's proposal to install AI with a reward function which "learns to recognize happiness and unhappiness in human facial expressions, human voices and human body language".

Then I said:

Katjas point is valid - DL did not fail in the way EY predicted, and the success of DL gives hope that we can learn superhuman models of human values to steer developing AI

Where Katja's point is that DL had no trouble learning concepts of faces (and many other things) to superhuman levels, without inevitably failing by instead only producing superficial simulacra of faces when we cranked up the optimization power. I was not referring to any explicit prediction, but the implicit prediction in Katja's analogy (where learning a complex 3D generative model of human faces from images is the analogy for learning a complex multi-modal model of human happiness from face images, voices, body language, etc). 

Replies from: hairyfigment
comment by hairyfigment · 2022-10-17T20:28:24.867Z · LW(p) · GW(p)

without inevitably failing by instead only producing superficial simulacra of faces

That's clearly exactly what it does today? It seems I disagree with your point on a more basic level than expected.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-17T20:35:33.556Z · LW(p) · GW(p)

It only takes one positive example of AI not failing by producing superficial simulacra of faces to prove my point, which Katja already provided. It doesn't matter how many crappy AI models people make, as they lose out to stronger models.

Replies from: hairyfigment
comment by hairyfigment · 2022-10-17T21:25:39.571Z · LW(p) · GW(p)

Maybe I don't understand the point of this example in which AI creates non-conscious images of smiling faces. Are you really arguing that, based on evidence like this, a generalization of modern AI wouldn't automatically produce horrific or deadly results when asked to copy human values?

Peripherally: that video contains simulacra of a lot more than faces, and I may have other minor objections in that vein.

ETA, I may want to say more about the actual human analysis which I think informed the AI's "success," but first let me go back to what I said about linking EY's actual words. Here is 2008-Eliezer:

Now you, finally presented with a tiny molecular smiley - or perhaps a very realistic tiny sculpture of a human face - know at once that this is not what you want to count as a smile.  But that judgment reflects an unnatural category [? · GW], one whose classification boundary depends sensitively on your complicated values [? · GW].  It is your own plans and desires that are at work when you say "No!"

Hibbard knows instinctively that a tiny molecular smileyface isn't a "smile", because he knows that's not what he wants his putative AI to do.  If someone else were presented with a different task, like classifying artworks, they might feel that the Mona Lisa was obviously smiling - as opposed to frowning, say - even though it's only paint.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-18T00:16:22.189Z · LW(p) · GW(p)

Hibbard proposes we can learn a model of 'happiness' from images of smiling humans, body language, voices, etc and then instill that as the reward/utility function for AI.

EY replies that this will fail because our values (like happiness) are far too complex and fragile to be learned robustly by such a procedure, and the result instead is an AI which optimizes for a different, unintended goal: 'faciness'.

Katja argues - and others concur [LW(p) · GW(p)] - that maybe values are not as fragile as EY predicted, because DL now regularly learns complex concepts to superhuman accuracy - including visual models of faces.

Are you really arguing that, based on evidence like this, a generalization of modern AI wouldn't automatically produce horrific or deadly results when asked to copy human values?

Obviously that totally depends on the system and how the human values are learned - but no, that certainly isn't the automatic result if we continue down the path of reverse engineering the brain [LW · GW], including its altruism mechanisms.

Replies from: hairyfigment
comment by hairyfigment · 2022-10-18T00:29:08.782Z · LW(p) · GW(p)

I may reply to this more fully, but first I'd like you to acknowledge that you cannot in fact point to a false prediction by EY here, and in the exact post you seemed to be referring to, he says that his view is compatible with this sort of AI producing realistic sculptures of human faces!

Replies from: lahwran, jacob_cannell, jacob_cannell
comment by the gears to ascension (lahwran) · 2022-10-18T09:28:01.042Z · LW(p) · GW(p)

as someone who often agrees with jake, cmon jake, own up to it, EY has said reasonable things before and you were wrong :P

edit: oops meant to reply to @jacob_cannell

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-18T15:50:32.434Z · LW(p) · GW(p)

Wrong about what? Of course EY has said many reasonable and insightful things.

comment by jacob_cannell · 2022-10-18T15:53:11.176Z · LW(p) · GW(p)

Oh do you mean this text you quoted?

Now you, finally presented with a tiny molecular smiley - or perhaps a very realistic tiny sculpture of a human face - know at once that this is not what you want to count as a smile.

The thing producing the very realistic tiny sculpture of a human face is a superintelligence, not some initial human designed ML system that is used to create the AI's utility function.

comment by jacob_cannell · 2022-10-18T06:16:10.127Z · LW(p) · GW(p)

What post? All I quoted recently was "Complex Value Systems are Required to Realize Valuable Futures", which does not appear to contain the word 'sculpture'.

comment by Noosphere89 (sharmake-farah) · 2022-10-16T12:14:11.225Z · LW(p) · GW(p)

And more importantly, to prevent deceptive alignment from happening, which would allow a treacherous turn.

A lot of overrated alignment plans get outer alignment at optimum (that is, the values you want to instill do not break at optimality), but use handwavium to bypass deceptive alignment, proxy alignment, and suboptimality alignment.

(Jacob Cannell is better than Alex Turner at this, since he incorporates an AI sandbox which, importantly, prevents the AI from knowing it's in a simulation.)

comment by Rob Bensinger (RobbBB) · 2022-10-14T22:26:24.107Z · LW(p) · GW(p)

Ronny Fernandez asked me, Nate, and Eliezer for our take on Twitter. Copying over Nate's reply:

briefly: A) narrow non-optimizers can exist but won't matter; B) wake me when the allegedly maximally-facelike image looks human; C) ribosomes show that cognition-bound superpowers exist; D) humans can't stack into superintelligent corps, but if they could then yes plz value-load

(tbc, I appreciate Katja saying all that. Hooray for stating what you think, and hooray again when it's potentially locally unpopular! If I were less harried I might give more than a tweet of engagement, but in reality I probably won't, sorry.)

I asked Nate what he meant by B, and he said:

section B seemed to me to be saying "AIs can figure out what a face is". And, ok, sure, but if you ask them for the faciest possible thing, it's not very human!facelike.

which is one of many objections, ofc (others including "ah yes but can you aim it at a human concept")

Replies from: cfoster0, Chris_Leong, RobbBB, acgt, TurnTrout
comment by cfoster0 · 2022-10-14T23:29:04.986Z · LW(p) · GW(p)

Note: "ask them for the faciest possible thing" seems confused.

How I would've interpreted this if I were talking with another ML researcher is "Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.

I'm guessing what he has in mind is more like "take a GAN discriminator / image classifier & find the image that maxes out the face logit", but if so, why is that the relevant operationalization? It doesn't correspond to how such a model is actually used.

EDIT: Here is what the first looks like for StyleGAN2-ADA.
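
For concreteness, here is a minimal PyTorch sketch of the two operationalizations above. The classifier below is a randomly initialized stand-in (a real face classifier and a real generator such as StyleGAN2-ADA are assumed to be loaded elsewhere); the point is only the shape of the two procedures, not any particular library's API.

```python
import torch
import torch.nn as nn

# Operationalization 1: "highest-density sample" -- decode the zero latent.
# z = torch.zeros(1, 512); typical_face = generator(z)
# For StyleGAN-like models this tends to give an ordinary-looking face.

# Operationalization 2: "image that maxes out the face logit" -- gradient ascent on pixels.
classifier = nn.Sequential(  # stand-in for a trained face classifier
    nn.Flatten(), nn.Linear(3 * 64 * 64, 1)
)

img = torch.rand(1, 3, 64, 64, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = -classifier(img).sum()  # push the "face" logit as high as possible
    loss.backward()
    opt.step()
# With a real classifier, the result of (2) is typically a deepdream-like adversarial
# image, not something a human would call a face.
```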

Replies from: leogao, DaemonicSigil, quintin-pope, capybaralet
comment by leogao · 2022-10-15T05:45:13.579Z · LW(p) · GW(p)

It's the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.

(also, the maximum likelihood essay looks like a single word, or if you normalize for length, the same word repeated over and over again up to the context length)

Replies from: jacob_cannell, Xodarap
comment by jacob_cannell · 2022-10-15T20:58:55.349Z · LW(p) · GW(p)

EY argues that human values are hard to learn. Katja uses human faces as an analogy, pointing out that ML systems learn natural concepts far easier than EY 2009 expected.

The analogy is between A: a function which maps noise to realistic images of human faces and B: a function which maps predicted future world states to utility scores similar to how a human would score them. The lesson is that since ML systems can learn A very well, they can probably also learn B.

Function A (human face generator) does not even use max-likelihood sampling and it isn't even an optimizer, so your operationalization is just confused. Nor is function B an optimizer itself.

Replies from: leogao, david-johnston
comment by leogao · 2022-10-16T00:23:47.054Z · LW(p) · GW(p)

I claim that A and B are in fact very disanalogous objects, and that the claim that A can be learned well does not imply that B can probably be learned well. I am very confused by your claims about the functions A and B not being optimizers, because to me this is true but also irrelevant.

The reason we want a function B that can map world states to utilities is so that we can optimize on that number. We want to select for world states that we think will have high utility using B; otherwise function B is pretty useless. Therefore, this function has to be reliable enough that putting lots of optimization pressure on it does not break it. This is not the same as claiming that the function itself is an optimizer or anything like that. Making something reliable against lots of optimization pressure is a lot harder than making it reliable in the training distribution.

The function A effectively allows you to sample from the distribution of faces. Function A does not have to be robust against adversarial optimization to approximate the distribution. The analogous function in the domain of human values would be a function that lets you sample from some prior distribution of world states, not one that scores utility of states.

More generally, I think the confusion here stems from the fact that a) robustness against optimization is far harder than modelling typical elements of a distribution, and b) distributions over states are fundamentally different objects from utility functions over states.
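
A quick numerical sketch of point (a), with made-up distributions: a proxy that tracks the true objective well on typical samples can still come apart badly under hard selection.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
true_value = rng.normal(size=n)           # what we actually care about
proxy = true_value + rng.normal(size=n)   # a noisy but well-correlated estimate of it

print("correlation on typical samples:", round(np.corrcoef(true_value, proxy)[0, 1], 2))

best = np.argmax(proxy)                   # heavy optimization pressure on the proxy
print("proxy score of the selected point:", round(proxy[best], 2))
print("true value of the selected point: ", round(true_value[best], 2))
# The selected point's true value falls well short of its proxy score, because
# extreme proxy scores are partly extreme noise (regressional Goodhart).
```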

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-16T02:29:38.282Z · LW(p) · GW(p)

Nate's analogy is confused: diffusion models do not generate convincing samples of faces by maximizing for faciness - see how they actually work [LW(p) · GW(p)], and make sure we agree there. This is important because previous systems (such as deepdream) could be described as maximizing for X, such that nate's critique would be more relevant.

Your comment here about "optimizing for X-ness" indicates you also were adopting the wrong model of how diffusion models operate:

It's the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.

That simply isn't how diffusion models work. A diffusion model for essays would sample from realistic essays that summarize to some short prompt; so they absolutely do care about high likelihood from the distribution of human essays.

Now that being said, I do partially agree that A (face generator function) and B (human utility function) are somewhat different...

The reason we want a function B that can map world states to utilities is so that we can optimize on that number.

Yes sort of - or at least that is the fairly default view of how a utility function would be used. But that isn't the only possibility - one could also solve planning using a diffusion model[1], which would make A and B very similar. The face generator diffusion model combines an unconditional generative model of images with an image to text discriminator, the planning diffusion model combines an unconditional generative future world model with a discriminator (the utility function part, although one could also imagine it being more like an image to text model).
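
A rough sketch of the kind of combination being described, in the style of classifier-guided diffusion/score-based sampling (the closest standard technique; score_model, classifier, and the update rule here are schematic stand-ins, not any particular library's API):

```python
import torch

def guided_step(x, t, score_model, classifier, target, scale, step_size):
    """One schematic denoising step with classifier guidance.

    score_model(x, t): estimated score of the unconditional data distribution.
    classifier(x, t):  class logits for noisy inputs.
    The conditional score is approximated as the unconditional score plus
    scale * grad_x log p(target | x), i.e. generative model and discriminator combined.
    """
    x = x.detach().requires_grad_(True)
    log_p = torch.log_softmax(classifier(x, t), dim=-1)[:, target].sum()
    grad = torch.autograd.grad(log_p, x)[0]
    with torch.no_grad():
        score = score_model(x, t) + scale * grad
        # Langevin-style update; a real sampler would follow a proper noise schedule.
        return x + step_size * score + torch.randn_like(x) * (2 * step_size) ** 0.5
```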

Therefore, this function has to be reliable enough that putting lots of optimization pressure on it does not break it. This is not the same as claiming that the function itself is an optimizer or anything like that. Making something reliable against lots of optimization pressure is a lot harder than making it reliable in the training distribution.

Applying more optimization pressure results in better outputs, according to the objective. Optimization pressure doesn't break the objective function (what would that even mean?) and you have to create fairly contrived scenarios where more optimization power results in worse outcomes.

So I'm assuming you mean distribution shift robustness: we'll initially train the human utility function component on some samples of possible future worlds, but then as the AI plans farther ahead and time progresses, shit gets weird and the distribution shifts, so that the initial utility function no longer works well.

So let's apply that to the image diffusion model analogy - it's equivalent to massively retraining/scaling up the unconditional generative model (which models images or simulates futures), without likewise improving the discriminative model.

The points from Katja's analogy are:

  1. It's actually pretty easy and natural to retrain/scale them together, and
  2. It's also surprisingly easy/effective to scale up and even combine generative models and get better results with the same discriminator

  1. I almost didn't want to mention this analogy because I'm not sure that planning via diffusion has been tried yet, and it seems like the kind of thing that could work. But it's also somewhat obvious, so I bet there are probably people trying this now if it hasn't already been published (haven't checked). ↩︎

Replies from: leogao, cfoster0
comment by leogao · 2022-10-16T18:01:10.430Z · LW(p) · GW(p)

I object to your characterization that I am claiming that diffusion models work by maximizing faciness, or that I am confused about how diffusion models work. I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I'm confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don't optimize for faciness, they aren't a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.

(Unimportant nitpicking: This Person Does Not Exist doesn't actually use a diffusion model, but rather a StyleGAN trained on a face dataset.)

You're also eliding over the difference between training an unconditional diffusion model on a face dataset and training an unconditional diffusion model over a general image dataset and doing classifier based guidance. I've been talking about unconditional models on a face dataset, which does not optimize for faciness, but when you do classifier-based guidance this changes the setup. I don't think this difference is crucial, and my point can be made with either, so I will talk using your setup instead.

In fact, the setup you describe in the linked comment does in fact put optimization pressure on faciness, regularized by distance from the prior. Note that when I say "optimization pressure" I don't mean necessarily literally getting the sample that maxes out the objective. In the essay example, this would be like doing RLHF for essay quality with a KL penalty to stay close to the text distribution. You are correct in stating that this regularization helps to stay on the manifold of realistic images and that removing it results in terrible nightmare images, and this applies directly to the essay example as well.
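
The KL-penalized objective alluded to here is, in its standard form (nothing specific to this thread),

```latex
\max_{\pi}\;\mathbb{E}_{x \sim \pi}\big[r(x)\big]\;-\;\beta\,D_{\mathrm{KL}}\!\big(\pi \,\|\, p_{\text{ref}}\big)
```

where r is the learned reward (essay quality), p_ref is the base text distribution, and β sets how strongly typicality is traded off against reward.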

However the core problem with this approach is that the reason the regularization works is that you trade off quality for typicality. In the face case this is mostly fine because faces are pretty typical of the original distribution anyways, but I would make the concrete prediction that if you tried to get faces using classifier-based guidance out of a diffusion model specifically trained on all images except those containing faces, it would be really fiddly or impossible to get good quality faces that aren't weird and nightmarish out of it. It seems possible that you are talking past me/Nate in that you have in mind that such regularization isn't a big problem to put on our AGI, mild optimization is a good thing because we don't want really weird worlds, etc. I believe this is fatally flawed for planning, partly because this means we can't really achieve world states that are very weird from the current perspective (and I claim most good futures also seem weird from the current perspective), and also that because of imperfect world modelling the actual domain you end up regularizing over is the domain of plans, which means you can't do things that are too different from things that have been done before. I'm not going to argue this out because I think the following is actually a much larger crux and until we agree on it, arguing over my previous claim will be very difficult:

Applying more optimization pressure results in better outputs, according to the objective. Optimization pressure doesn't break the objective function (what would that even mean?) and you have to create fairly contrived scenarios where more optimization power results in worse outcomes.

When I say "breaks" the objective, I mean reward hacking/reward gaming/goodharting it. I'm surprised that this wasn't obvious. To me, most/all of alignment difficulty falls out of extreme optimization power being aimed at objectives that aren't actually what we want. I think that this could be a major crux underlying everything else.

(It may also be relevant that at least from my perspective, notwithstanding anything Eliezer may or may not have said, the "learning human values is hard" argument primarily applies to argue why human values won't be simple/natural in cases where simplicity/naturalness determine what is easier to learn. I have no doubt that a sufficiently powerful AGI could figure out our values if it wanted to, the hard part is making it want to do so. I think Eliezer may be particularly more pessimistic about neural networks' robustness.)

Nonetheless, let me lay out some (non-exhaustive) concrete reasons why I expect just scaling up the discriminator and its training data to not work.

Obviously, when we optimize really hard on our learned discriminator, we get the out of distribution stuff, as you agree. But let's just suppose for the moment that we completely abandon all competitiveness concerns and get rid of the learned discriminator entirely and replace it with the ground truth, an army of perfect human labellers. I claim that optimizing any neural network for achieving world states that these labellers find good doesn't just lead to extremely bad outcomes in unlikely contrived scenarios, but rather happens by default. Even if you think the following can be avoided using diffusion planning / typicality regularization / etc, I still think it is necessary to first agree that this comes up when you don't do that regularization, and only then discuss whether it still comes up with regularization.

  1. Telling whether a world state is good is nontrivial. You can be easily tricked into thinking a world state is good when it isn't. If you ask the AI to go do something really difficult, you need to make it at least as hard to trick you with a Potemkin village as the task you want it to do.

  2. Telling whether a plan leads to a good world state is nontrivial. You don't have a perfect world model. You can't tell very reliably whether a proposed plan leads to good outcomes.

Replies from: jacob_cannell, david-johnston
comment by jacob_cannell · 2022-10-17T05:47:18.122Z · LW(p) · GW(p)

Interpretations

First a reply to interpretations of previous words:

I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I'm confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don't optimize for faciness, they aren't a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.

I hope we agree that a discriminator which is trained only to recognize good essays robustly probably does not contain enough information to generate good essays, for the same reasons that an image discriminator does not contain enough information to generate good images - because the discriminator only learns the boundaries of words/categories over images, not the more complex embedded distribution of realistic images.

Optimizing only for faciness via a discriminator does not work well - that's the old deepdream approach. Optimizing only for "good essayness" probably does not work well either. These approaches do not actually get high utility.

So when you say " I am specifically arguing that because diffusion models trained on a face dataset don't optimize for faciness, they aren't a fair comparison with the task of doing things that get high utility", that just seems confused to me, because diffusion models do get high utility, and not via optimizing just for faciness (which results in low utility).

When you earlier said:

In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.

The obvious interpretation there still seems to be optimizing only for the discriminator objective - and I'm surprised you are surprised I interpreted it that way, especially when I replied that the only way to actually get a good essay is to sample from the distribution of essays conditioned on goodness - i.e. the distribution of good essays.

Anyway, here you are making a somewhat different point:

The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.

I still think this is not quite right, in that a diffusion model works by combining the ability to write typical essays with a discriminator to condition on good essays, such that both abilities matter, but I see your point is basically "the discriminator or utility function is the hard part for AGI", and move on to the more cruxy part.

The Crux?

When I say "breaks" the objective, I mean reward hacking/reward gaming/goodharting it. I'm surprised that this wasn't obvious. To me, most/all of alignment difficulty falls out of extreme optimization power being aimed at objectives that aren't actually what we want.

Ok, so part of the problem here is we may be assuming different models for AGI. I am assuming a more brain-like pure ANN, which uses fully learned planning more like a diffusion planning model (which is closer to what the brain probably uses), rather than the older, more commonly assumed approach of combining a learned world model and utility function with some explicit planning algorithm like MCTS or whatever.

So there are several different optimization layers that can be scaled:

  1. The agent optimizing the world (can scale up planning horizon, etc)
  2. Optimizing/training the learned world/action/planning model(s)
  3. Optimizing/training the learned discriminator/utility model

You can scale these independently but only within limits, and it probably doesn't make much sense to differentially scale them too far.

But let's just suppose for the moment that we completely abandon all competitiveness concerns and get rid of the learned discriminator entirely and replace it with the ground truth, an army of perfect human labellers.

I really wouldn't call that the ground truth. The ground truth would be brain-sims (which is part of the rationale for brain-like AGI) combined with complete detailed understanding of the brain and especially its utility/planning system equivalents. That being said I am probably more optimistic about 'the army of perfect human labellers" approach.

  1. Telling whether a world state is good is nontrivial. You can be easily tricked into thinking a world state is good when it isn't. If you ask the AI to go do something really difficult, you need to make it at least as hard to trick you with a Potemkin village as the task you want it to do.

Why is the AI generating Potemkin villages? Deceptive alignment? I'm assuming use of proper sandbox sims to prevent deception. But I'm also independently optimistic about simpler more automatable altruistic utility functions like maximization of human empowerment.

  2. Telling whether a plan leads to a good world state is nontrivial. You don't have a perfect world model. You can't tell very reliably whether a proposed plan leads to good outcomes.

I don't see why imperfect planning is more likely to lead to bad rather than good outcomes, all else being equal, and regardless you don't need anything near a perfect world model to match human intelligence. Furthermore the assumption that the world model isn't good enough to be useful for utility evaluations contradicts the assumption of superintelligence.

I believe this is fatally flawed for planning, partly because this means we can't really achieve world states that are very weird from the current perspective (and I claim most good futures also seem weird from the current perspective), and also that because of imperfect world modelling the actual domain you end up regularizing over is the domain of plans, which means you can't do things that are too different from things that have been done before.

The world/action/planning model does need to be retrained on its own rollouts, which will cause it to eventually learn to do things that are different and novel. Humans don't seem to have much difficulty planning out weird future world states.

comment by David Johnston (david-johnston) · 2022-10-19T22:16:35.728Z · LW(p) · GW(p)

However the core problem with this approach is that the reason the regularization works is that you trade off quality for typicality

The claim that every increase in regularisation makes performance worse is extraordinary, given everything I know about machine learning.

comment by David Johnston (david-johnston) · 2022-10-19T21:37:07.739Z · LW(p) · GW(p)

Wouldn’t a better analogy be A: noise to faces judged as realistic and B: noise to plans judged to have good consequences?

As for whether B breaks under competitive pressure: does A break under competitive pressure? B does introduce safe exploration concerns not relevant to A, but the answer for A seems like a clear “no” to me.

comment by Xodarap · 2022-10-20T00:01:07.134Z · LW(p) · GW(p)

Basic question: why would the AI system optimize for X-ness?

I thought Katja's argument was something like:

  1. Suppose we train a system to generate (say) plans for increasing the profits of your paperclip factory similar to how we train GANs to generate faces
  2. Then we would expect those paperclip factory planners to have analogous errors to face generator errors
  3. I.e. they will not be "eldritch"

The fact that you could repurpose the GAN discriminator in this terrifying way doesn't really seem relevant if no one is in practice doing that?

comment by DaemonicSigil · 2022-10-15T09:20:23.330Z · LW(p) · GW(p)

I took Nate to be saying that we'd compute the image with highest faceness according to the discriminator, not the generator. The generator would tend to create "thing that is a face that has the highest probability of occurring in the environment", while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to be the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images if not more so, though I haven't actually tried it. I also predict that if you stop training the discriminator and keep training the generator, the generator starts generating weird looking nonhuman images.

This is relevant to Reinforcement Learning because of the actor-critic class of systems, where the actor is like the generator and the critic is like the discriminator. We'd ideally like the RL system to stay on course after we stop providing it with labels, but stopping labels means we stop training the critic. Which means that the actor is free to start generating adversarial policies that hack the critic, rather than policies that actually perform well in the way we'd want them to.
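
A minimal sketch of that failure mode in generic PyTorch (hypothetical dimensions, no particular RL library): once the critic is frozen, the actor's gradient ascent targets the critic's idiosyncrasies rather than true task performance, analogous to maximizing a frozen discriminator's faceness score.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))

for p in critic.parameters():      # labels stop arriving, so the critic stops updating
    p.requires_grad_(False)

opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
for _ in range(1000):
    s = torch.randn(256, state_dim)
    a = actor(s)
    value = critic(torch.cat([s, a], dim=-1)).mean()
    # The actor keeps climbing the frozen critic's value estimate. Nothing here checks
    # whether the actions are actually good, so the policy drifts toward whatever
    # exploits the critic's errors -- the analogue of hacking the discriminator.
    (-value).backward()
    opt.step()
    opt.zero_grad()
```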

Replies from: cfoster0, jacob_cannell, acgt
comment by cfoster0 · 2022-10-15T17:01:15.337Z · LW(p) · GW(p)

Upvoted because I agree with all of the above.

AFAICT the original post was using the faces analogy in a different way than Nate is. It doesn't claim that the discriminators used to supervise GAN face learning or the classifiers used to detect faces are adversarially robust. That isn't the point it's making. It claims that learned models of faces don't "leave anything important out" in the way that one might expect some key feature to be "left out" when learning to model a complex domain like human faces or human values. And that seems well-supported: the trajectory of modern ML has shown learning such complex models is far easier than we might've thought, even if building adversarially robust classifiers is very hard. (As much as I'd like to have supervision signals that are robust to arbitrarily-capable adversaries, it seems non-obvious to me that that is even required for success at alignment.)

Replies from: habryka4
comment by habryka (habryka4) · 2022-10-15T17:26:02.454Z · LW(p) · GW(p)

Hmm, but I don't understand what relevance it has to alignment. The problem was never that the AI won't learn human values, it's that the AI won't care about human values. Of course a super intelligent AI will have a good model of human values, the same way it will have a good model of engineering, chemistry, the ecological environment and physics. That doesn't mean it will do things that are aligned with its accurate model of human values.

I am not sure who thought that learning such models was much harder than it turned out to be. It seems clear that an AI will learn what human faces are before the AI is very dangerous to the world. It would have been extremely surprising to have a dangerous AGI incapable of learning what human faces are like.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-16T03:03:31.909Z · LW(p) · GW(p)

Hmm, but I don't understand what relevance it has to alignment. The problem was never that the AI won't learn human values, it's that the AI won't care about human values

Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us, so that very much was believed to be part of the problem (at least by the EY/MIRI/LW/etc crowd).

But that's all now mostly irrelevant - an altruistic AI probably doesn't even need to know or care about human values at all, as it can simply optimize for our empowerment - our future optionality or ability to do anything we want. (some previous discussion here [LW(p) · GW(p)]. and in these comments [LW · GW]. )

Replies from: habryka4, Benito
comment by habryka (habryka4) · 2022-10-16T06:32:09.734Z · LW(p) · GW(p)

I wasn't that active around the time of the sequences, but I had a good number of discussions with people, and the point "the AI will of course know what your values are, it just won't care" was made many times, and I am also pretty sure was made in the sequences (I would have to dig it up, and am on my phone, but I heard that sentence in spoken conversation a lot over the years).

I don't think "empowerment" is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.

Replies from: jacob_cannell, interstice
comment by jacob_cannell · 2022-10-16T14:17:03.102Z · LW(p) · GW(p)

Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us

the point "the AI will of course know what your values are, it just wont' care" was made many times, and I am also pretty sure was made in the sequences

Notice I said "before it killed us". Sure, the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that's irrelevant because we need to instill its utility function long before that. See my reply here [LW(p) · GW(p)]; this is well documented, and no amount of vague memories of conversations trumps the written evidence.

I don't think "empowerment" is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.

I'm not entirely sure what people mean when they say "X won't survive heavy optimization pressure" - but for example the objective of modern diffusion models survives heavy optimization pressure.

External empowerment is very simple and it doesn't even require detailed modeling of the agent - they can just be a black box that produces outputs. I'm curious what you think is an example of "the kind of concept that particularly survives heavy optimization pressure".

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2022-10-16T18:58:47.666Z · LW(p) · GW(p)

I'm not entirely sure what people mean when they say "X won't survive heavy optimization pressure".

Basically, it's Goodhart's law in action, where optimizing a proxy more and more destroys what you value.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-17T20:30:08.661Z · LW(p) · GW(p)

Oh - empowerment is about as immune to Goodharting as you can get, and that's perhaps one of its major advantages[1]. However in practice one has to use some approximation, which may or may not be goodhartable to some degree depending on many details.


  1. Empowerment is vastly more difficult to Goodhart than a corporation optimizing for some bundle of currencies (including crypto), much more difficult to Goodhart than optimizing for control over even more fundamental physical resources like mass and energy, and is generally the least-Goodhartable objective that could exist. In some sense the universal version of Goodharting - properly defined - is just a measure of deviation from empowerment. It is the core driver of human intelligence, and for good reason. ↩︎

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2022-10-17T20:51:38.198Z · LW(p) · GW(p)

Can you explain further? This seems to me like a very large claim that, if true, would have a big impact, but I'm not sure how you got the immunity-to-Goodhart result you have here.

This applies to Regressional, Causal, Extremal and Adversarial Goodhart.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-17T21:44:23.228Z · LW(p) · GW(p)

Empowerment could be defined as the natural unique solution to Goodharting [? · GW]. Goodharting is the divergence, under optimization scaling, between the trajectories produced by a utility function and those produced by some proxy of that utility function.

However due to instrumental convergence [? · GW], the trajectories of all reasonable agent utility functions converge under optimization scaling - and empowerment simply is that which they converge to.

In other words the empowerment of some agent P(X) is the utility function which minimizes trajectory distance to all/any reasonable agent utility functions U(X), regardless of their specific (potentially unknown) form.

Therefore empowerment is - by definition - the best possible proxy utility function (under optimization scaling).

Let's apply some quick examples:

Under scaling, an AI with some crude Hibbard-style happiness approximation will first empower itself and then eventually tile the universe with smiling faces (according to EY), or perhaps more realistically - with humans bio-engineered for docility, stupidity, and maximum bliss. Happiness alone is not the true human utility function.

Under scaling, an AI with some crude stock-value maximizing utility function will first empower itself and then eventually cause hyperinflation of the reference currencies defining the stock price. Stock value is not the true utility function of the corporation.

Under scaling, an AI with a human empowerment utility function will first empower itself, and then empower humanity - maximizing our future optionality and ability to fulfill any unknown goals/values, while ensuring our survival (because death is the minimally empowered state). This works because empowerment is pretty close to the true utility function of intelligent agents due to convergence, or at least the closest universal proxy. If you strip away a human's drives for sex, food, child tending and simple pleasures, most of what remains is empowerment-related (manifesting as curiosity, drive, self-actualization, fun, self-preservation, etc).
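
For concreteness, here is a minimal sketch of the kind of quantity being discussed, using the standard information-theoretic formalization of empowerment (roughly, the channel capacity between an agent's action sequences and its resulting states); for deterministic dynamics this reduces to counting distinct reachable states. The toy gridworld and function names are illustrative assumptions, not anyone's actual proposal:

```python
import numpy as np
from itertools import product

# Toy deterministic gridworld: empowerment(state, horizon) = log2 of the number of
# distinct states reachable by some length-`horizon` action sequence. (With
# deterministic dynamics this equals the action->outcome channel capacity.)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]

def step(state, action, size=5, dead=None):
    if state == dead:                      # absorbing "dead" state: actions do nothing
        return state
    (x, y), (dx, dy) = state, action
    return (min(max(x + dx, 0), size - 1), min(max(y + dy, 0), size - 1))

def empowerment(state, horizon=3, size=5, dead=None):
    reachable = set()
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a, size, dead)
        reachable.add(s)
    return float(np.log2(len(reachable)))

print(empowerment((2, 2)))                 # centre of the grid: many reachable states
print(empowerment((0, 0)))                 # corner: fewer options, lower empowerment
print(empowerment((1, 1), dead=(1, 1)))    # death is the minimally empowered state: 0 bits
```

An external/human-empowerment objective would evaluate this kind of quantity (or a learned approximation of it) for the humans in the environment rather than for the AI itself, treating them as black boxes that produce outputs.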

comment by interstice · 2022-10-16T06:47:02.379Z · LW(p) · GW(p)

An AI with a good world model will predictably have a model of your values, but that's different from being able to actually elicit that model via e.g. a series of labeled examples. That's the part that seemed less plausible before DL.

comment by Ben Pace (Benito) · 2022-10-16T03:44:29.208Z · LW(p) · GW(p)

Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us, so that very much was believed to be part of the problem (at least by the EY/MIRI/LW/etc crowd).

Do you have a link to where Eliezer (or any other LW writer) said that? I don’t myself recall whether they said that.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-16T03:58:04.589Z · LW(p) · GW(p)

I may be exaggerating a tiny tiny bit with the "before it killed us" modifier, and I don't have time to search for this specific needle - but EY famously criticized some early safety proposal which consisted of using a 'smiling face' detector somehow to train an AI to recognize human happiness, and then optimize for that.

Oh it was actually already open in a tab:

From complex values blah blah blah:

From Super-Intelligent Machines (Hibbard 2001):

We can design intelligent machines so their primary innate emotion is unconditional love for all humans. First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language. Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy. Machines can learn algorithms for approximately predicting the future, as for example investors currently use learning machines to predict future security prices. So we can program intelligent machines to learn algorithms for predicting future human happiness, and use those predictions as emotional values.

When I suggested to Hibbard that the upshot of building superintelligences with a utility function of “smiles” would be to tile the future light-cone of Earth with tiny molecular smiley-faces, he replied (Hibbard 2006):

When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of “human facial expressions, human voices and human body language” (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by “tiny molecular pictures of smiley-faces.” You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.

EY's counterargument is that human values are much more complex than happiness - let alone smiles; an AI optimizing for smiles just ends up tiling the universe with smile icons - so it's just a different flavour of paperclip maximizer. Then he spends a bunch of words on the complexity of value [? · GW] stuff to preempt the more complex versions of the smile detector. If human values were known to be simple, then getting machines to learn them robustly would likely be simple, and EY could have done something else with those 20+ years.

Also, in EY's model, when the AI becomes superintelligent (which may only take a day or something after it becomes just upper-human-level intelligent and 'rewrites its source code'), it then quickly predicts the future, realizes humans are in the way, solves Drexler-style strong nanotech, and then kills us all. Those latter steps are very fast.

Replies from: habryka4
comment by habryka (habryka4) · 2022-10-16T20:08:56.662Z · LW(p) · GW(p)

I don't know what relevance this has to the discussion at hand. A deep learning model trained on human smiling faces might indeed very well tile the universe with smiley-faces; I don't understand why that's wrong. Sure, it will likely do something weirder and less predictable, since we don't understand the neural network prior very well, but optimizing for smiling humans still doesn't produce anything remotely aligned.

EY's counterargument is that human values are much more complex than happiness - let alone smiles; an AI optimizing for smiles just ends up tiling the universe with smile icons - so it's just a different flavour of paperclip maximizer. Then he spends a bunch of words on the complexity of value stuff to preempt the more complex versions of the smile detector. If human values were known to be simple, then getting machines to learn them robustly would likely be simple, and EY could have done something else with those 20+ years.

Nothing in the quoted section, or in the document you linked that I just skimmed, includes anything about the AI not being able to learn what the things behind the smiling faces actually want. Indeed none of that matters, because the AI has no reason to care. You gave it a few thousand to a million samples of smiling, and now the system is optimizing for smiling: you got what you put in.

Eliezer indeed explicitly addresses this point and says:

As far as I know, Hibbard has still not abandoned his proposal as of the time of this writing. So far as I can tell, to him it remains self-evident that no superintelligence would be stupid enough to thus misinterpret the code handed to it, when it’s obvious what the code is supposed to do. (Note that the adjective “stupid” is the Humean-projective form of “ranking low in preference,” and that the adjective “pointless” is the projective form of “activity not leading to preference satisfaction.”)

He is explicitly saying "Hibbard is confusing being 'smart' with 'caring about the right things'", the AI will be plenty capable of realizing that it isn't doing what you wanted it to, but it just doesn't care. Being smarter does not help with getting it to do the thing you want, that's the whole point of the alignment problem. Similarly AIs being able to understand human values better just doesn't help you that much with pointing at them (though it does help a bit, but the linked article just doesn't talk at all about this).

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-17T04:09:43.631Z · LW(p) · GW(p)

A deep learning model trained on human smiling faces might indeed very well tile the universe with smiley-faces, I don't understand why that's wrong.

That is not what Hibbard actually proposed; it's a superficial strawman version.

I don't know what relevance this has to the discussion at hand.

  1. Hibbard claims we can design intelligent machines which love humans by training them to learn human happiness through facial expressions, voices, and body language.
  2. EY claims this will fail and instead learn a utility function of “smiles”, resulting in an SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled "Complex Value Systems are Required to Realize Valuable Futures"

Nothing in the quoted section, or in the document you linked that I just skimmed includes anything about the AI not being able to learn what the things behind the smiling faces actually want. Indeed none of that matters, because the AI has no reason to care.

It has absolutely nothing to do with whether the AI could eventually learn human values ("what the things behind the smiling faces actually want"), and everything to do with whether some ML system could learn said values to use them as the utility function for the AI (which is what Hibbard is proposing).

Neither Hibbard nor EY (nor I) is arguing about or discussing whether an SI can learn human values.

Replies from: habryka4
comment by habryka (habryka4) · 2022-10-17T05:20:20.906Z · LW(p) · GW(p)

EY claims this will fail and instead learn a utility function of “smiles”, resulting in an SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled "Complex Value Systems are Required to Realize Valuable Futures"

This is really misunderstanding what Eliezer is saying here. And also, look, from my perspective it's been a decade of explaining to people almost once every two weeks that "yes, the AI will of course know what you care about, but it won't care", so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me. I am experiencing a good amount of frustration with you repeatedly saying things in this comment thread like "irrefutable proof", when it's just an obviously wrong statement (though a fine one to arrive at when just reading some random subset of Eliezer's writing, but a clearly wrong summary nevertheless).

Now to go back to the object level:

Eliezer is really not saying that the AI will fail to learn that there is something more complicated than smiles that the human is trying to point it to. He is explicitly saying "look, you won't know what the AI will care about after giving it on the order of a million points. You don't know what the global maximum of the simplest classifier for your sample set is, and very likely it will be some perverse instantiation that has little to do with what you originally cared about".

He really really is not talking about the AI being too dumb to learn the value function the human is trying to get it to learn. Indeed, I still have no idea how you are reading that into the quoted passages.

Here is a post from 9 years ago, where the title is that exact point, written by Rob Bensinger who was working at MIRI at the time, with Eliezer as the top comment:

The genie knows, but doesn't care [LW · GW]

If an artificial intelligence is smart enough to be dangerous, we'd intuitively expect it to be smart enough to know how to make itself safe. But that doesn't mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety. That means we have to understand how to code safety. We can't pass the entire buck to the AI, when only an AI we've already safety-proofed will be safe to ask for help on safety issues!

I encourage you to read some of the comments by Rob in that thread, which very clearly and unambiguously point to the core problem of "the difficult part is to get the AI to care about the right thing, not to understand the right thing", all before the DL revolution. 

Replies from: Zack_M_Davis, jacob_cannell
comment by Zack_M_Davis · 2022-10-20T05:10:43.636Z · LW(p) · GW(p)

This is really misunderstanding what Eliezer is saying here [...] it's been a decade of explaining to people almost once every two weeks that "yes, the AI will of course know what you care about, but it won't care", so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me

I think this is much more ambiguous than you're making it out to be. In 2008's "Magical Categories" [LW · GW], Yudkowsky wrote:

I shall call this the fallacy of magical categories—simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate "winning" sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.

I claim that this paragraph didn't age well in light of the deep learning revolution: "running a neural network [...] over a set of winning and losing sequences of chess moves" basically is how AlphaZero learns from self-play! As the Yudkowsky quote illustrates, it wasn't obvious in 2008 that this would work: given what we knew before seeing the empirical result, we could imagine that we lived in a "computational universe" in which the neural network's generalization from "self-play games" to "games against humans or traditional chess engines" worked less well than it did in the actual computational universe.

Yudkowsky continued:

The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication—transmitting category boundaries, like "good", that can't be fully delineated in any training data you can give the AI during its childhood.

This would seem to contradict "of course the AI will know, but it won't care"? "The real problem [...] is one of communication" seems to amount to the claim that the AI won't care because it won't know: if you can't teach "goodness" from labeled data, your AI will search for plans high in something-other-than-goodness, which will kill you at sufficiently high power levels.

But if it turns out that you can teach goodness from labeled data—or at least, if you can get a much better approximation than one might have thought possible in 2008—that would seem to present a different strategic picture. (I'm not saying alignment is easy and I'm not saying humanity is going to survive, but we could die for somewhat different reasons than some blogger thought in 2008.)

Replies from: habryka4
comment by habryka (habryka4) · 2022-10-20T05:48:34.311Z · LW(p) · GW(p)

I do think these are better quotes. It's possible that there was some update here between 2008 and 2013 (roughly when I started seeing the more live discussion happening), since I do really remember "the problem is not getting the AI to understand, but to care" as a common refrain even back then (e.g. see the Robby post I linked).

I claim that this paragraph didn't age well in light of the deep learning revolution: "running a neural network [...] over a set of winning and losing sequences of chess moves" basically is how AlphaZero learns from self-play!

I agree that this paragraph aged less well than other paragraphs, though I do think this paragraph is still correct (Edit: Eh, it might be wrong, depends a bit on how similar neural networks in the 50s are to today's). It did sure turn out to be correct by a narrower margin than Eliezer probably thought at the time, but my sense is it's still not the case that we can train a straightforward neural net on winning and losing chess moves and have it generate winning moves. For AlphaGo, the Monte Carlo Tree Search was a major component of its architecture, and the follow-up systems were trained by pure self-play.

But in any case, I think your basic point of "Eliezer did not predict the Deep Learning revolution as it happened" here is correct, though I don't think this specific paragraph has a ton of relevance to the discussion at hand.

The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication—transmitting category boundaries, like "good", that can't be fully delineated in any training data you can give the AI during its childhood.

I do think this paragraph seems like a decent quote, though I think at this point it makes sense to break it out into different pieces.

I think Eliezer is saying that what matters is whether we can point the AI to what we care about "during its childhood", i.e. during relatively early training, before it has already developed a bunch of proxy training objectives.

I think the key question about the future that Eliezer was opining on is whether, by the time we expect AIs to actually be able to have a close-to-complete understanding of what we mean by "goodness", we will still have any ability to shape their goals.

My model is that indeed, Eliezer was surprised, as I think most people were, that AIs of 2022 are as good at picking up complicated concept boundaries and learning fuzzy human concepts as they are, while still being quite incompetent at many other tasks. However, the statement "AIs of 2022 basically understand goodness, or at least will soon enough understand goodness while we are still capable of meaningfully changing their goals" strikes me as very highly dubious, and the basic arguments for thinking that this capability will come after the AI has reached a capability level where we have little ability to shape its goals still seem correct to me - and like one of the primary reasons for doom.

The reason it still seems substantially out of AIs' reach is that our values do indeed seem quite fragile and to change substantially on reflection, such that it's currently out of the reach of even a very smart human to fully understand what we mean by "goodness".

Eliezer talks about this in the comment section you linked (actually, a great comment section between Eliezer and Shane Legg that I found quite insightful to read and am glad to have stumbled upon):

A moderately strong and unFriendly intelligence, operating in the current world without yet having replaced that world with paperclips, would certainly find it natural to form the category of "Things that (some) humans approve of", and contrast it to "Things that will trigger a nuclear attack against me before I'm done creating my own nanotechnology." But this category is not what we call "morality". It naturally - from the AI's perspective - includes things like bribes and deception, not just the particular class of human-approval-eliciting phenomena that we call "moral".

Is it worth factoring out phenomena that elicit human feelings of righteousness, and working out how (various) humans reason about them? Yes, because this is an important subset of ways to persuade the humans to leave you alone until it's too late; but again, that natural category is going to include persuasive techniques like references to religious authority and nationalism.

But what if the AI encounters some more humanistic, atheistic types? Then the AI will predict which of several available actions is most likely to make an atheistic humanist human show sympathy for the AI. This naturally leads the AI to model and predict the human's internal moral reasoning - but that model isn't going to distinguish anything along the lines of moral reasoning the human would approve of under long-term reflection, or moral reasoning the human would approve knowing the true facts. That's just not a natural category to the AI, because the human isn't going to get a chance for long-term reflection, and the human doesn't know the true facts.

The natural, predictive, manipulative question, is not "What would this human want knowing the true facts?", but "What will various behaviors make this human believe, and what will the human do on the basis of these various (false) beliefs?"

In short, all models that an unFriendly AI forms of human moral reasoning, while we can expect them to be highly empirically accurate and well-calibrated to the extent that the AI is highly intelligent, would be formed for the purpose of predicting human reactions to different behaviors and events, so that these behaviors and events can be chosen manipulatively.

But what we regard as morality is an idealized form of such reasoning - the idealized abstracted dynamic built out of such intuitions. The unFriendly AI has no reason to think about anything we would call "moral progress" unless it is naturally occurring on a timescale short enough to matter before the AI wipes out the human species. It has no reason to ask the question "What would humanity want in a thousand years?" any more than you have reason to add up the ASCII letters in a sentence.

Now it might be only a short step from a strictly predictive model of human reasoning, to the idealized abstracted dynamic of morality. If you think about the point of CEV, it's that you can get an AI to learn most of the information it needs to model morality, by looking at humans - and that the step from these empirical models, to idealization, is relatively short and traversable by the programmers directly or with the aid of manageable amounts of inductive learning. Though CEV's current description is not precise, and maybe any realistic description of idealization would be more complicated.

But regardless, if the idealized computation we would think of as describing "what is right" is even a short distance of idealization away from strictly predictive and manipulative models of what humans can be made to think is right, then "actually right" is still something that an unFriendly AI would literally never think about, since humans have no direct access to "actually right" (the idealized result of their own thought processes) and hence it plays no role in their behavior and hence is not needed to model or manipulate them.

Which is to say, an unFriendly AI would never once think about morality - only a certain psychological problem in manipulating humans, where the only thing that matters is anything you can make them believe or do. There is no natural motive to think about anything else, and no natural empirical category corresponding to it.

I think this argument is basically correct, and indeed, while current systems definitely are good at having human abstractions, I don't think they really are anywhere close to having good models of the results of our coherent extrapolated volition, which is what Eliezer is talking about here. (To be clear, I do also separately think that LLMs are thinking about concepts for reasons other than deceiving or modeling humans, though like, I don't think this changes the argument very much. I don't think LLMs care very much about thinking carefully about morality, because it's not very useful for predicting random internet text.)

I think separately, there is a different, indirect normativity approach that starts with "look, yes, we are definitely not going to get the AI to understand what our ultimate values are before the end, but maybe we can get it to understand a concept like 'being conservative' or 'being helpful' in enough detail that we can use it to supervise smarter AI systems, and then bootstrap ourselves into an aligned superintelligence".

And I think indeed that plan looks better now than it likely looked to Eliezer in 2008, but I do want to distinguish it from the things that Eliezer was arguing against at the time, which were not about learning approaches to indirect normativity, but were arguments about how the AI would just learn all of human values by being pointed at a bunch of examples of good things and bad things, which still strikes me as extremely unlikely.

Replies from: Grothor
comment by Richard Korzekwa (Grothor) · 2022-10-20T14:53:50.234Z · LW(p) · GW(p)

it's still not the case that we can train a straightforward neural net on winning and losing chess moves and have it generate winning moves. For AlphaGo, the Monte Carlo Tree Search was a major component of its architecture, and the follow-up systems were trained by pure self-play.

AlphaGo without the MCTS was still pretty strong:

We also assessed variants of AlphaGo that evaluated positions using just the value network (λ = 0) or just rollouts (λ = 1) (see Fig. 4b). Even without rollouts AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte Carlo evaluation in Go.

Even the policy network alone, with no search, could play at a solid amateur level:

We evaluated the performance of the RL policy network in game play, sampling each move...from its output probability distribution over actions. When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi14, a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi.

I may be misunderstanding this, but it sounds like the network that did nothing but get good at guessing the next move in professional games was able to play at roughly the same level as Pachi, which, according to DeepMind, had a rank of 2d.

Replies from: habryka4
comment by habryka (habryka4) · 2022-10-20T16:01:22.881Z · LW(p) · GW(p)

Yeah, I mean, to be clear, I do definitely think you can train a neural network to somehow play chess via nothing but classification. I am not sure whether you could do it with a feedforward neural network, and it's a bit unclear to me whether the neural networks from the 50s are the same thing as the neural networks from the 2000s, but it does sure seem like you can just throw a magic category absorber at chess and then have it play OK chess.

My guess is modern networks are not meaningfully more complicated, and the difference to back then was indeed just scale and a few tweaks, but I am not super confident and haven't looked much into the history here.
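
To make the "magic category absorber" reading concrete, here is a minimal behavior-cloning sketch (hypothetical board encoding and move labels; plain supervised classification over expert moves, with no search or self-play):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: an 8x8x12 one-hot board encoding and 4096 from-to move labels.
BOARD_PLANES, N_MOVES = 12, 64 * 64

policy = nn.Sequential(
    nn.Flatten(),
    nn.Linear(BOARD_PLANES * 64, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, N_MOVES),            # logits over all from-square/to-square moves
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(boards, expert_moves):
    """boards: (B, 12, 8, 8) float tensor; expert_moves: (B,) move indices."""
    opt.zero_grad()
    loss = loss_fn(policy(boards), expert_moves)
    loss.backward()
    opt.step()
    return loss.item()

# Dummy batch just to show the shapes; a real run would stream positions
# from recorded games ("winning and losing sequences of chess moves").
loss = train_step(torch.randn(32, BOARD_PLANES, 8, 8), torch.randint(0, N_MOVES, (32,)))
```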

comment by jacob_cannell · 2022-10-17T06:17:44.935Z · LW(p) · GW(p)

EY claims this will fail and instead learn a utility function of “smiles”, resulting in an SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled "Complex Value Systems are Required to Realize Valuable Futures"

This is really misunderstanding what Eliezer is saying here,

Really? OK, let's break it down phrase by phrase; tell me exactly where I am misunderstanding:

  1. Did EY claim Hibbard's plan will succeed or fail?
  2. Did EY claim Hibbard's plan will result in tiling the future light-cone of Earth with tiny molecular smiley-faces?
  3. Were these claims made in a paper titled "Complex Value Systems are Required to Realize Valuable Futures"?

look, from my perspective it's been a decade of explaining to people almost once every two weeks that "yes, the AI will of course know what you care about, but it won't care", so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me,

I've been here since the beginning, and I'm not sure who you have been explaining that to, but it certainly was not me. And where did I claim this is something new related to deep learning?

I'm going to try to clarify this one last time. There are several different meanings of "learn human values"

1.) Training a machine learning model to learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language, and using that as the utility function of the AI, such that it hopefully cares about human happiness. This is Hibbard's plan from 2001 - long before DL. This model is trained before the AI becomes even human-level intelligent, and used as its initial utility/reward function.

2.) An AGI internally automatically learning human values as part of learning a model of the world - which would not automatically result in it caring about human values at all.

You keep confusing 1 and 2 - specifically you are confusing arguments concerning 2 directed at laypeople with Hibbard's type 1 proposal.

Hibbard doesn't believe that 2 will automatically work. Instead he is arguing for 1, and EY is saying that will fail. (And for the record, although EY's criticism is overconfident, I am not optimistic about Hibbard's plan as stated, but that was 2001)

He really really is not talking about the AI being too dumb to learn the value function the human is trying to get it to learn. Indeed, I still have no idea how you are reading that into the quoted passages.

Because I'm not?

To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety. That means we have to understand how to code safety. We can't pass the entire buck to the AI, when only an AI we've already safety-proofed will be safe to ask for help on safety issues!

Hibbard is attempting to make his AI care about safety at the outset (or at least about happiness, which is his version thereof); he's not trying to pass the entire buck to the AI.

Replies from: habryka4
comment by habryka (habryka4) · 2022-10-17T18:59:02.839Z · LW(p) · GW(p)

Will respond more later, but maybe this turns out to be the crux:

Hibbard is attempting to make his AI care about safety at the outset (or at least about happiness, which is his version thereof)

But "happiness" is not safety! That's the whole point of this argument. If you optimize for your current conception of "happiness" you will get some kind of terrible thing that doesn't remotely capture your values, because your values are fragile and you can't approximate them by the process of "I just had my AI interact with a bunch of happy people and gave it positive reward, and a bunch of sad people and gave it negative reward".

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-17T20:16:56.535Z · LW(p) · GW(p)

But "happiness" is not safety! That's the whole point of this argument. If you optimize for your current conception of "happiness" you will get some kind of terrible thing

There are 2 separate issues here:

  1. Would Hibbard's approach successfully learn a stable, robust concept of human happiness suitable for use as the reward/utility function of AGI?
  2. Conditional on 1, is 'happiness' what we actually want?

The answer to 2 depends largely on how one defines happiness, but if happiness includes satisfaction (i.e. empowerment, curiosity, self-actualization, etc. - the basis of fun), then it is probably sufficient. But that's not the core argument.

Notice that EY does not assume 1 and argue 2; he instead argues that Hibbard's approach doesn't learn a robust concept of happiness at all, and instead learns a trivial, superficial "maximize faciness" concept.

This is crystal clear and unambiguous:

When I suggested to Hibbard that the upshot of building superintelligences with a utility function of “smiles” would be to tile the future light-cone of Earth with tiny molecular smiley-faces, he replied (Hibbard 2006):

He describes the result as a utility function of smiles, not a utility function of happiness.

So no, EY's argument here is absolutely not about happiness being insufficient for safety. His argument is that happiness is incredibly complex and hard to learn a robust version of, and therefore Hibbard's simplistic approach will learn some stupid superficial 'faciness' concept rather than happiness.

See also current debates around building a diamond-maximizing AI, where there is zero question of whether diamondness is what we want, and all the debate is around the (claimed) incredible difficulty of learning a robust version of even something simple like diamondness.

Replies from: habryka4
comment by habryka (habryka4) · 2022-10-20T06:31:53.384Z · LW(p) · GW(p)

I think I am more interested in you reading The Genie Knows but Doesn't Care [LW · GW] and then having you respond to the things in there than the Hibbard example, since that post was written (as far as I can tell) to address common misunderstandings of the Hibbard debate (given that Robby linked it in a bunch of the discussion there after it was written).

I think there are some subtle things here. I think Eliezer!2008 would agree that AIs will totally learn a robust concept for "car". But I think neither Eliezer!2008 nor I currently would think that current LLMs have any chance of learning a robust concept for "happiness" or "goodness", in substantial part because I don't have a robust concept of "happiness" or "goodness", and before the AI refines those concepts further than I can, I sure expect it to be able to disempower me (though it's not like guaranteed that that will happen).

What Eliezer is arguing against is not that the AI will not learn any human concepts. It's that there are a number of human concepts that tend to lean on the whole ontological structure of how humans think about the world (like "low-impact" or "goodness" or "happiness"), such that in order to actually build an accurate model of those, you have to do a bunch of careful thinking and need to really understand how humans view the world, and that people tend to be systematically optimistic about how convergent these kinds of concepts are, as opposed to them being contingent on the specific ways humans think. 

My guess is an AI might very well spend sufficient cycles on figuring out human morality after it has access to a solar-system level of compute, but I think that is unlikely to happen before it has disempowered us, so the ordering here matters a lot (see e.g. my response to Zack above).

So I think there are three separate points here that I think have caused confusion and probably caused us to talk past each other for a while, all of which I think were things that Eliezer was thinking about, at least around 2013-2014 (I know less about 2008): 

  1. Low-powered AI systems will have a really hard time learning high-level human concepts like "happiness", and if you try to naively get them to learn that concept (by e.g. pointing them towards smiling humans) you will get some kind of abomination, since even humans have trouble with those kinds of concepts
  2. It is likely that by the time an AI will understand what humans actually really want, we will not have much control over its training process, and so despite it now understanding those constraints, we will have no power to shape its goals towards that
  3. Even if we and the AI had a very crisp and clear concept of a goal I would like the AI to have, humanity won't know how to actually cause the AI to point towards that as a goal (see e.g. the diamond maximizer problem)

To now answer your concrete questions: 

  1. Would Hibbard's approach successfully learn a stable, robust concept of human happiness suitable for use as the reward/utility function of AGI?

My first response to this is: "I mean, of course not at current LLM capabilities. Ask GPT-3 about happiness, and you will get something dumb and incoherent back. If you keep going and make more capable systems try to do this, it's pretty likely your classifier will be smart enough to kill you to have more resources to drive the prediction error downwards before it actually arrives at a really deep understanding of human happiness (which appears to require substantial superhuman abilities, given that humans do not have a coherent model of happiness themselves)"

So no, I don't think Hibbard's approach would work. Separately, we have no idea how to use a classifier as a reward/utility function for an AGI, so that part of the approach also wouldn't work. Like, what do you actually concretely propose we do after we have a classifier over video frames, to cause a separate AI to then actually optimize for the underlying concept boundary?

But even if you ignore both of these problems, and you avoid the AI killing you in pursuit of driving down prediction error, and you somehow figure out how to take a classifier and use it as a utility function, then you are still not in good shape, because the AI will likely be able to achieve lower prediction error by modeling the humans doing the labeling process of the data you provide, and modeling what errors they are actually making, and will learn the more natural concept of "things that look happy to humans" instead of the actual happiness concept.

This is a really big deal, because if you start giving an AI the "things that look happy to humans" concept, you will end up with an AI that gets really good at deceiving humans and convincing them that something is happy, which will both quickly involve humans getting fooled and disempowered, and then in the limit might produce something surprisingly close to a universe tiled in smiley faces (convincing enough such that if you point a video camera at it, the rater who was looking at it for 15 seconds would indeed be convinced that it was happy, though there are no raters around). 

I think Hibbard's approach fails for all three reasons that I listed above, and I don't think modern systems somehow invalidate any of those three reasons. I do think (as I have said in other comments) that modern systems might make indirect normativity approaches more promising, but I don't think it moves the full value-loading problem anywhere close to the domain of solvability with current systems.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-20T18:35:20.605Z · LW(p) · GW(p)

I think I am more interested in you reading The Genie Knows but Doesn't Care and then having you respond to the things in there than the Hibbard example, since that post was written (as far as I can tell) to address common misunderstandings of the Hibbard debate

Looking over that it just seems to be a straightforward extrapolation of EY's earlier points, so I'm not sure why you thought it was especially relevant.

  1. Low-powered AI systems will have a really hard time learning high-level human concepts like "happiness", and if you try to naively get them to learn that concept (by e.g. pointing them towards smiling humans) you will get some kind of abomination, since even humans have trouble with those kinds of concepts

Yeah - this is his core argument against Hibbard. I think Hibbard 2001 would object to 'low-powered', and would probably have other objections I'm not modelling, but regardless I don't find this controversial.

  1. It is likely that by the time an AI will understand what humans actually really want, we will not have much control over its training process, and so despite it now understanding those constraints, we will have no power to shape its goals towards that.

Yeah, in agreement with what I said earlier:

Notice I said "before it killed us". Sure the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that's irrelevant because we need to instill its utility function long before that.

...

  1. Even if we and the AI had a very crisp and clear concept of a goal I would like the AI to have, humanity won't know how to actually cause the AI to point towards that as a goal

I believe I know what you meant, but this seems somewhat confused as worded. If we can train an ML model to learn a very crisp clear concept of a goal, then having the AI optimize for this (point towards it) is straightforward. Long term robustness may be a different issue, but I'm assuming that's mostly covered under "very crisp clear concept".

The issue of course is that what humans actually want is complex for us to articulate, let alone formally specify. The update since 2008/2011 is that DL may be able to learn a reasonable proxy of what we actually want, even if we can't fully formally specify it.

which appears to require substantial superhuman abilities, given that humans do not have a coherent model of happiness themselves)"

I think this is something of a red herring. Humans can reasonably predict utility functions of other humans in complex scenarios simply by simulating the other as self - i.e. through empathy. Also, happiness probably isn't the correct thing - we probably want the AI to optimize for our empowerment (future optionality), but that's a whole separate discussion.

So no, I don't think Hibbard's approach would work.

Sure, neither do I.

Separately, we have no idea how to use a classifier as a reward/utility function for an AGI, so that part of the approach also wouldn't work.

A classifier is a function which maps high-D inputs to a single categorical variable, and a utility function just maps some high-D input to a real number, but a k-categorical variable is just the explicit binned model of a log(k)-bit number, so these really aren't that different, and there are many interpolations in between. (And in fact sometimes it's better to use the more expensive categorical model for regression.)
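
A minimal sketch of one such interpolation (hypothetical bin layout; not a claim about how any particular system does it): treat the k classes as utility bins, read a scalar utility out of the classifier as the expected bin value, and build soft categorical targets from scalar labels when training:

```python
import torch
import torch.nn.functional as F

# k bins spanning utilities in [0, 1]; a k-way classifier head over these bins
# doubles as a scalar utility estimator via the expected bin value.
K = 21
bin_values = torch.linspace(0.0, 1.0, K)        # centre of each utility bin

def utility_from_logits(logits: torch.Tensor) -> torch.Tensor:
    """Collapse a categorical prediction over utility bins into a real number."""
    probs = F.softmax(logits, dim=-1)           # (batch, K)
    return probs @ bin_values                   # (batch,) expected utility

def soft_target(y: torch.Tensor, sharpness: float = 50.0) -> torch.Tensor:
    """Turn scalar regression targets into soft categorical targets over the bins."""
    dist = -((y.unsqueeze(-1) - bin_values) ** 2) * sharpness
    return F.softmax(dist, dim=-1)

logits = torch.randn(4, K)
print(utility_from_logits(logits))              # scalar utilities in [0, 1]
print(soft_target(torch.tensor([0.1, 0.9])).argmax(dim=-1))  # nearest bins
```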

Like, what do you actually concretely propose we do after we have a classifier over video frames

Video frames? The utility function needs to be over predicted future world states... which you could, I guess, use to render out videos, but text renderings are probably more natural.

I propose we actually learn how the brain works, and how evolution solved alignment, to better understand our values and reverse engineer them [LW · GW]. That is probably the safest approach - having a complete understanding of the brain.

However, I'm also somewhat optimistic about theoretical approaches that focus more explicitly on optimizing for external empowerment (which is simpler and more crisp), and on how that could be approximated pragmatically with current ML approaches. Those two topics are probably my next posts.

comment by jacob_cannell · 2022-10-15T21:19:16.124Z · LW(p) · GW(p)

Our best conditional generative models sample from a conditional distribution; they don't optimize for feature-ness. The GAN analogy is also mostly irrelevant because diffusion models have taken over for conditional generation, and Nate's comment seems confused [LW(p) · GW(p)] as applied to diffusion models.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-10-16T11:51:53.283Z · LW(p) · GW(p)

Nate's comment isn't confused; he's not talking about diffusion models, he's talking about the kinds of AI that could take over the world and reshape it to optimize for some values/goals/utility-function/etc.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-16T12:14:56.508Z · LW(p) · GW(p)

Katja says:

You could very analogously say ‘human faces are fragile’ because if you just leave out the nose it suddenly doesn’t look like a typical human face at all. Sure, but is that the kind of error you get when you try to train ML systems to mimic human faces?

Nate's comment:

B) wake me when the allegedly maximally-facelike image looks human;

Katja is talking about current ML systems and how the fragility issue EY predicted didn't materialize (actually it arguably did in earlier systems). Nate's comment is clearly referencing Katja's analogy - faciness - and he's clearly implying we haven't seen the problem with face generators yet because they haven't pushed the optimization hard enough to find the maximally-facelike image. But he's just wrong there - they don't have that problem, no matter how hard you scale their optimization power - and that is part of why Katja's analogy works so well at a deeper level: modern ML systems have not turned out to work the way AI risk folks thought they would.

Diffusion models are relevant because they improve on conditional GANs by leveraging powerful pretrained discriminative foundation models and by allowing for greater optimization power at inference time, improvements that also could be applied [LW(p) · GW(p)] to planning agents.

Replies from: habryka4
comment by habryka (habryka4) · 2022-10-16T19:59:25.190Z · LW(p) · GW(p)

ML systems still use plenty of reinforcement learning, and systems that apply straightforward optimization pressure. We've also built a few systems more recently that do something closer to trying to recreate samples from a distribution, but that doesn't actually help you improve on (or even achieve) human-level performance. In order to improve on human level performance, you either have to hand-code ontologies (by plugging multiple simulator systems together in a CAIS fashion), or just do something like reinforcement learning, which then very quickly does display the error modes everyone is talking about.

Current systems do not display a lack of edge-instantiation behavior. Some of them seem more robust, but the ones that do also seem fundamentally limited (and also, they will likely still show edge-instantiation for their inner objective, but that's harder to talk about).

And also, just to make the very concrete point, Katja linked to a bunch of faces generated by a GAN, which straightforwardly has the problems people are talking about, so there really is no mismatch between the kinds of things that Katja is talking about and the kinds of things Nate is talking about. We could perform a more optimized search for things that are definitely faces according to the discriminator, and we would likely get something horrifying.

Replies from: jacob_cannell, cfoster0
comment by jacob_cannell · 2022-10-18T16:02:43.547Z · LW(p) · GW(p)

We could perform a more optimized search for things that are definitely faces according to the discriminator, and we would likely get something horrifying.

Sure you could do that, but people usually don't - unless they intentionally want something horrifying. So if your argument is now "sure, new ML systems totally can solve the faciness problem, but only if you choose to use them correctly" - then great, finally we agree.

Interestingly enough, in diffusion planning models, if you crank up the discriminator you get trajectories that are higher utility but increasingly unrealistic. You get lower-utility trajectories by cranking down the discriminator.

comment by cfoster0 · 2022-10-16T20:21:42.142Z · LW(p) · GW(p)

Clarifying questions, either for you or for someone else, to aid my own confusion:

What does "applying optimization pressure" mean? Is steering random noise into the narrow part of configuration space that contains plausible images-of-X (the thing DDPMs and GAN generators do) a straightforward example of it?

EDIT: Split up above question into two.

comment by acgt · 2022-10-15T18:14:47.534Z · LW(p) · GW(p)

I predict that this would look at least as weird and nonhuman as those deep dream images if not more so

This feels like something we should just test? I don’t have access to any such model, but presumably someone does and can just run the experiment? Because it seems like people's hunches are varying a lot here.

comment by Quintin Pope (quintin-pope) · 2022-10-14T23:39:37.952Z · LW(p) · GW(p)

Also, we don’t know what would happen if we exactly optimized an image to maximize the activation of a particular human’s face detection circuitry. I expect that the result would be pretty eldritch as well.
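
For the artificial-network analogue of that experiment, the procedure would be plain activation maximization: gradient ascent on an image with respect to a differentiable face score, with no generative prior. A hypothetical sketch (`face_detector` is a stand-in for whatever differentiable scorer you have, not a real library call):

```python
import torch

def activation_maximize(face_detector, steps=500, lr=0.05, image_shape=(1, 3, 224, 224)):
    """Gradient-ascend an image to maximize a detector's 'face' score.
    No generative prior is used, which is why results tend to look eldritch."""
    img = torch.full(image_shape, 0.5, requires_grad=True)   # start from mid-gray
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        score = face_detector(img.clamp(0, 1)).mean()
        (-score).backward()                                   # ascend the activation
        opt.step()
    return img.detach().clamp(0, 1)

# Usage (hypothetical model): most_facelike = activation_maximize(my_face_scorer)
```

This is the deep-dream-style regime discussed elsewhere in the thread; it says nothing directly about what maximally activating a particular human's face circuitry would look like.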

Replies from: cubefox, TurnTrout
comment by cubefox · 2022-10-15T10:20:19.621Z · LW(p) · GW(p)

We may already be doing that in the case of cartoon faces, with their exaggerated features. Cartoon faces don't look eldritch to us, but why would they?

Replies from: rudi-c
comment by Rudi C (rudi-c) · 2022-10-23T02:22:23.387Z · LW(p) · GW(p)

They are still smooth and have low-frequency patterns, which seems to be the main difference from adversarial examples currently produced from DL models.

comment by TurnTrout · 2022-11-08T18:13:22.740Z · LW(p) · GW(p)

Yeah. Wake me up when we find a single agent which makes decisions by extremizing its own concept activations. E.g. I'm pretty sure that people don't reflectively most strongly want to make friends with entities which maximally activate their potential-friend detection circuitry.

comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-10-23T11:56:47.256Z · LW(p) · GW(p)

"Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.


(sort of nitpicking): 
I think it makes more sense to look for the highest density in pixel space; this requires integrating over all settings of the latents (unless your generator is invertible, in which case you can just use the change-of-variables formula). I expect the argument to go through, but it would be interesting to do this with an invertible generator (e.g. a normalizing flow) and see if it actually does.
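
For reference, the change-of-variables identity being invoked, for an invertible generator x = g(z) with inverse f = g^{-1} and latent prior p_Z:

\log p_X(x) = \log p_Z\big(f(x)\big) + \log \left|\det \frac{\partial f}{\partial x}(x)\right|

So the pixel-space mode maximizes the latent log-density plus the log-determinant term, and it coincides with the z = 0 sample only if that log-det term is roughly constant over the relevant region - which is exactly what running this with a normalizing flow would check.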

comment by Chris_Leong · 2022-10-16T13:36:48.431Z · LW(p) · GW(p)

Could someone clarify the relevance of ribosomes?

Replies from: interstice
comment by interstice · 2022-10-16T18:32:18.051Z · LW(p) · GW(p)

A working example of nanotechnology.

Replies from: Oliver Sourbut
comment by Oliver Sourbut · 2022-10-22T20:43:29.516Z · LW(p) · GW(p)

(and 'self-replicating' for some reasonable operationalisation)

comment by Rob Bensinger (RobbBB) · 2022-10-14T23:36:30.038Z · LW(p) · GW(p)

Also from Ronny: 

There's also an important disanalogy between generating/recognizing faces and learning 'human values', which is that humans are perfect human face recognizers but not perfect recognizers of worlds high in 'human values'.

That means that there might be world states or plans in the training data or generated by adversarial training that look to us, and ML trained to recognize these things the way we recognize them, like they are awesome, but are actually awful.

Replies from: Jeff Rose, RobbBB
comment by Jeff Rose · 2022-10-15T15:06:21.071Z · LW(p) · GW(p)

As an empirical fact, humans are not perfect human face recognizers. It is something humans are very good at, but not perfect. We are definitely much better recognizers of human faces than of worlds high in human values. (I think it is perhaps more relevant to say that consensus on what constitutes a human face is much, much higher than on what constitutes a world high in human values.)

I am unsure whether this distinction is relevant for the substance of the argument however.

comment by Rob Bensinger (RobbBB) · 2022-10-14T23:39:14.178Z · LW(p) · GW(p)

(And we aren't perfect recognizers of 'functional, safe-to-use nanofactory' or other known-to-me things that might save the world.)

comment by acgt · 2022-10-14T23:05:32.795Z · LW(p) · GW(p)

if you ask them for the faciest possible thing, it's not very human!facelike

Is this based on how these models actually behave or just what the OP expects? Because it seems to just be begging the question if the latter.

Replies from: quintin-pope
comment by Quintin Pope (quintin-pope) · 2022-10-14T23:31:15.264Z · LW(p) · GW(p)

Also, “ask for the most X-like thing” is basically how classifier-guided diffusion models work, right?

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-15T21:14:57.148Z · LW(p) · GW(p)

No, not really - they are not arg-maxers. They combine an unconditional generative model (which maps from noise to samples of realistic images by learning to denoise) and a discriminative model (which maps from images to text) to sample (via iterative GD) from a conditional model (realistic images which the discriminative model would map to the text query).

"Asking for the most X-like thing" would be basically ignoring or underweighting the generative model, and that results in deepdream like garbage images (it's one of the main hyperparams in any diffusion model, so this is really easy to try out yourself - samples fully weighted from the discriminator are deam-dream garbage at best, samples fully weighted from the unconditional generative model are boring natural texture patterns).

Basically the discriminative model learns how language slices up the space of all images, and the generative model crucially learns the actual lower-D embedded geometry of the distribution of realistic images - which is not something that pure discriminative models learn. The discriminative model by itself has no knowledge of what images are realistic, and optimizing solely for its extrema results in nonsense because it takes you far from the complex boundary of realistic images.

Nate's response just seems confused about how diffusion models work.
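
Schematically (a hypothetical Langevin-style sketch with the real noise schedule and parameterization elided; `score_model` and `classifier_logp` are stand-ins, not a real API), classifier guidance looks like this - `guidance_scale` is the relative weighting described above:

```python
import torch

def guided_sample(score_model, classifier_logp, y, guidance_scale=1.0,
                  steps=1000, shape=(1, 3, 64, 64)):
    """Schematic classifier-guided diffusion sampling.
    score_model(x, t)        -> estimate of grad_x log p(x_t)   (the generative prior)
    classifier_logp(x, t, y) -> log p(y | x_t)                  (the discriminator)
    guidance_scale = 0 ignores the condition; very large values ignore the image
    prior and give deep-dream-like outputs."""
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        x = x.detach().requires_grad_(True)
        log_py = classifier_logp(x, t, y)
        cls_grad = torch.autograd.grad(log_py.sum(), x)[0]
        with torch.no_grad():
            score = score_model(x, t) + guidance_scale * cls_grad
            step_size = 1e-2                      # stand-in for the real noise schedule
            x = x + step_size * score + (2 * step_size) ** 0.5 * torch.randn_like(x)
    return x.detach()
```

Cranking `guidance_scale` up is the "ask for the most X-like thing" regime; moderate values keep samples on the generative model's learned manifold of realistic images.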

Replies from: kave
comment by kave · 2022-10-16T14:01:45.111Z · LW(p) · GW(p)

samples fully weighted from the unconditional generative model are boring natural texture patterns

Different results here: https://twitter.com/summerstay1/status/1579759146236510209

comment by TurnTrout · 2022-12-15T04:03:58.120Z · LW(p) · GW(p)

Nate's B) currently seems confused. I read a connotation of "we need the AGI's learned concepts to be safe under extreme optimization pressure, such that, when extremized, they yield reasonable results (e.g. human faces from maximizing the AI-faceishness-concept-activation of an image)."

But I think agents will not maximize their own concept activations, when choosing plans. An agent's values will optimize the world; the values don't optimize themselves. For example, I think that I am not looking for a romantic relationship which maximally activates my "awesome relationship" concept, if that's a thing I have. It's true that conditional on such a plan being considered, my relationship-shard might bid for that plan with strength monotonically increasing on "predicted activation of awesome-relationship". 

And conditional on such a plan getting considered, where that concept activation is maximized, I would therefore be very inclined to pursue that plan.

But I think it's not true that my relationship-shard is optimizing its own future activations by extremizing future concept activations. I think that this plan won't get found, and the agent won't want to find this plan. Values are not the optimization target. (This point explained in more detail: Alignment allows "nonrobust" decision-influences and doesn't require robust grading [LW · GW])

comment by Steven Byrnes (steve2152) · 2022-10-17T18:45:40.316Z · LW(p) · GW(p)

Great post!

A. Contra “superhuman AI systems will be ‘goal-directed’”

I somewhat agree; see Consequentialism & Corrigibility [LW · GW]. I’m a bit unclear on whether this is intended as an argument for “AGI almost definitely won’t have a zealous drive to control the universe” versus “AGI won’t necessarily have a zealous drive to control the universe”. I agree with the latter but not the former.

Also, the more different groups make AGIs, the more likely it is that someone will make one with a “zealous drive to control the universe”. Then we have to think about whether the non-zealous ones will have solved the problem posed by the zealous ones. In this context, there starts to be a contradiction between “we don’t need to worry about the non-zealous ones because they won’t be doing hardcore long-term consequentialist planning” versus “we don’t need to worry about the zealous ones because the non-zealous ones are so powerful and foresightful that, whatever plan the zealous ones might come up with, the non-zealous ones can preemptively think of it and defend against it”. More on this topic in a forthcoming post [LW · GW] hopefully in the next couple weeks. (EDIT: I added the link)

B. Contra “goal-directed AI systems’ goals will be bad”

I somewhat agree, see Section 14.6 here [LW · GW]. Comments above also apply here, e.g. it’s not obvious that docile helpful human-norm-following AGIs will actually do what’s necessary to defend against zealous universe-controlling AGIs, again wait for my forthcoming post [LW · GW].

Contra “superhuman AI would be sufficiently superior to humans to overpower humanity”

I mostly see these comments as arguments that “AI that can overpower humanity” might happen a bit later than one might otherwise expect, rather than arguments that it’s not going to happen at all. For example, if collaborative groups of humans are more successful than individual humans, well, sooner or later we’re going to have collaborative groups of AIs too. By the time we have a whole society of trillions of AIs, it stops feeling very reassuring. (The ability of AIs to self-replicate seems particularly relevant here.) If humans-using-tools are powerful, well sooner or later (I would argue sooner) AIs are going to be using tools too. (And inventing new tools.) The trust issue stops applying when we get to a world where AIs can start their own companies etc., and thus only need to trust each other (and the “each other” might be copies of themselves). The headroom argument seems adjacent to the lump-of-labor fallacy.

Hmm, OK, I guess the real point of all that is to argue for slow takeoff which then implies that doom is unlikely? (“at some point AI systems would account for most of the cognitive labor in the world. But if there is first an extended period of more minimal advanced AI presence, that would probably prevent an immediate death outcome, and improve humanity’s prospects for controlling a slow-moving AI power grab.”) Again, I’m not quite sure what we’re arguing. I think there’s still serious x-risk regardless of slow vs fast takeoff, and I think there’s still “less than certain doom” regardless of slow vs fast takeoff. In fact, I’m not even confident that x-risk is lower under slow takeoff than fast.

Well anyway, I have an object-level belief that there are already way more than enough GPUs on the planet to support AIs that can overpower humanity—see here [LW · GW]—and I think that will be much more true by the time we have real-deal AGIs (which I for one expect to be probably after 2030 at least). I agree that this is a relevant empirical question though.

The idea that a superhuman AI would be able to rapidly destroy the world seems prima facie unlikely, since no other entity has ever done that.

I think there’s pretty good direct reason to believe that it is currently possible to simultaneously start lots of deadly pandemics and crop diseases etc., with an amount of competence already available to small teams of humans or maybe even individual humans. But we don’t currently have ongoing deliberate pandemics. I consider this pretty strong evidence that nobody on Earth with even moderate competence is trying to “destroy the world”, so to speak. So the fact that nobody has succeeded at doing so doesn’t really provide much evidence about the tractability of doing that. (Again, more on this topic in a forthcoming post [LW · GW].)

Replies from: yitz
comment by Yitz (yitz) · 2022-10-21T13:45:30.990Z · LW(p) · GW(p)

yeah, I suspect the largest bottleneck there is that trying to destroy the world is so strongly against human values that there are ~0 people (who aren't severely mentally ill) who are genuinely trying to do that.

comment by Alex Flint (alexflint) · 2022-10-16T18:39:20.055Z · LW(p) · GW(p)

Thanks for writing this!

Regarding your point on corporations: One of the reasons to worry about some forms of AI is that they might soon build other, more powerful forms of AI. So the development of very human-like Ems, for example, might lead relatively quickly to the development of de novo AI, and so on; hence we worry about Ems even if we think extremely human-like Ems do not pose an x-risk on their own. In the same way, corporations are the ones moving forward fastest on building ML-based AI, and the misalignment between corporations and the long-term future of life on Earth is a very significant cause of the overall level of AI-related x-risk in the world today. So if someone had said 500 years ago "hey let's not build corporations because they will probably be subtly or overtly misaligned with us and that will lead to the destruction of life on Earth", then fast-forward to today and it seems like that person has been proven correct.

comment by Jsevillamol · 2022-10-14T20:00:17.613Z · LW(p) · GW(p)

Here are my quick takes from skimming the post.

In short, the arguments I think are best are A1, B4, C3, C4, C5, C8, C9 and D. I don't find any of them devastating.

A1. Different calls to ‘goal-directedness’ don’t necessarily mean the same concept

I am not sure I parse this one. I am reading it as "AI systems might be more like imitators than optimizers" from the example, which I find moderately persuasive

A2. Ambiguously strong forces for goal-directedness need to meet an ambiguously high bar to cause a risk

I am not sure I understand this one either. I am reading it as "there might be no incentive for generality", which I don't find persuasive - I think there is a strong incentive

B1. Small differences in utility functions may not be catastrophic

I don't find this persuasive. I think the evidence from optimization theory, where optimizers tend to push variables to extreme values, is suggestive enough that this is not the default

B2. Differences between AI and human values may be small
B3. Maybe value isn’t fragile

The only example we have of general intelligence (humans) seems to have strayed pretty far from evolutionary incentives, so I find this unpersuasive

B4. [AI might only care about]Short-term goals

I find that somewhat persuasive, or at least not obviously wrong, similar to A1. There is a huge incentive for instilling long term thinking though.

C1. Human success isn’t from individual intelligence

I don't find this persuasive. I'm not convinced there is a meaningful difference between "a single AGI" and "a society of AGIs". A single AGI could be running a billion independent threads of thought and outspeeding humans.

C2. AI agents may not be radically superior to combinations of humans and non-agentic machines

I don't find this persuasive. It seems unlikely that human-in-the-loop systems will have any advantage over pure machines.

C3. Trust

I find this plausible but not convincing

C4. Headroom

Plausible but not convincing. I don't find any of the particular examples of lack of headroom convincing, and I think the prior should be that there is a lot of headroom

C5. Intelligence may not be an overwhelming advantage

I find this moderately persuasive though not entirely convincing

C6. Unclear that many goals realistically incentivise taking over the universe

I find this unconvincing. I think there are many reasons to expect that taking over the universe is a convergent goal.

C7. Quantity of new cognitive labor is an empirical question, not addressed

I don't find this super persuasive. In particular, I think there is a good chance that once we have AGI we will be in a hardware overhang and be able to run tons of AGI-equivalents

C8. Speed of intelligence growth is ambiguous

I find this plausible

C9. Key concepts are vague

Granted but not a refutation in itself

D1. The argument overall proves too much about corporations

I find this somewhat persuasive

comment by Vika · 2024-01-12T11:19:13.228Z · LW(p) · GW(p)

I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to it a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency).

comment by Matthew Barnett (matthew-barnett) · 2022-10-18T18:44:45.233Z · LW(p) · GW(p)

I have now published a conversation between Ege Erdil [LW · GW] and Ronny Fernandez about this post. You can find it here [LW · GW].

comment by Archimedes · 2022-10-14T21:32:06.397Z · LW(p) · GW(p)

One might argue that there are defeating reasons that corporations do not destroy the world: they are made of humans so can be somewhat reined in; they are not smart enough; they are not coherent enough. But in that case, the original argument needs to make reference to these things, so that they apply to one and not the other.

I don't think this is quite fair. You created an argument outline that doesn't directly reference these things, so you can only blame yourself for excluding them unless you are claiming that such things have not been discussed extensively.

One extremely important difference between corporations and potential AGIs is the level of high-speed, high-bandwidth coordination (which has been discussed extensively [? · GW]) that may be possible for AGIs. If a massive corporation could be as internally coordinated and self-aligned as might be possible for an AGI, it would be absolutely terrifying. Imagine Elon Musk as a Borg Queen with everyone related to Tesla as part of the "collective" under his control...

comment by Lukas Finnveden (Lanrian) · 2022-10-17T17:04:02.265Z · LW(p) · GW(p)

Competence does not seem to aggressively overwhelm other advantages in humans: 

[...]

g. One might counter-counter-argue that humans are very similar to one another in capability, so even if intelligence matters much more than other traits, you won’t see that by looking at  the near-identical humans. This does not seem to be true. Often at least, the difference in performance between mediocre human performance and top level human performance is large, relative to the space below, iirc. For instance, in chess, the Elo difference between the best and worst players is about 2000, whereas the difference between the amateur play and random play is maybe 400-2800 (if you accept Chess StackExchange guesses as a reasonable proxy for the truth here).

The usage of capabilities/competence is inconsistent here. In points a-f, you argue that general intelligence doesn't aggressively overwhelm other advantages in humans. But in point g, the Elo difference between the best and worst players is less determined by general intelligence than by how much practice people have had.

If we instead consistently talk about domain-relevant skills: In the real world, we do see huge advantages from having domain-specific skills. E.g. I expect elected representatives to be vastly better at politics than median humans.

If we instead consistently talk about general intelligence: The chess data doesn't falsify the hypothesis that human-level variation in general intelligence is small. To gather data about that, we'd want to analyse the Elo difference between humans who have practiced similarly much but who have very different g.

(There are some papers on the correlation between intelligence and chess performance, so maybe you could get the relevant data from there. E.g. this paper says that (not controlling for anything) most measurements of cognitive ability correlate with chess performance at about 0.24 (including IQ if you exclude a weird outlier where the correlation was -0.51).)
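(A quick back-of-the-envelope addendum, my arithmetic rather than the paper's: a correlation of that size corresponds to

$$ r^2 = 0.24^2 \approx 0.06, $$

i.e. measured cognitive ability would explain only about 6% of the variance in chess performance, which fits the point that practice and domain-specific skill dominate.)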

comment by Karl von Wendt · 2022-10-15T11:07:06.335Z · LW(p) · GW(p)

Thank you for posting this, as I find it helpful for practicing my own skills of argumentation. Here are my brief counterarguments to your counterarguments, I'd appreciate it if anyone could point out any flaws in my logic:

A. Contra "superhuman AI systems will be goal-directed"
As far as I understand it, "intelligence" is the ability to achieve one's goals through reasoning and making plans, so a highly intelligent system is goal-directed by definition. Less goal-directed AIs are certainly possible, but they must necessarily be considered less intelligent - the thermometer example illustrates this. Therefore, a less goal-directed AI will always lose in competition against a more goal-directed one.

B.  Contra "goal-directed AI systems' goals will be bad"
The supposed counterexample of artificially generated human faces is in fact a case in point in my opinion. These faces aren't like humans at all. They're not three-dimensional. They're not moving. They don't talk. They don't smell. They're not soft and don't radiate warmth. Oh, we didn't mention that was important, right? We just gave the AI a reward function that enabled it to learn how to generate pictures that look like photographs of real people. If that's what we want, then little differences on the pixel level probably don't matter much. The differences between the paperclips Bostrom's paperclip maximizer makes and a perfect paperclip probably won't matter much, either. To put it another way, these fake humans are only "good" if we lower our expectations to the point where they're already met.

C. Contra “superhuman AI would be sufficiently superior to humans to overpower humanity”
Even if "human success isn't from individual intelligence", this doesn't mean that human intelligence is not the decisive factor making us the dominant species. Individual intelligence is what enables collective intelligence in the first place. I agree that humans shouldn't be seen as a universal benchmark for intelligence, but that only means that the bar for developing an uncontrollable AI may be even lower. It took us humans more than 2,000 years to collectively master Go. It took AlphaGo Zero three days from scratch to beat us. AI may one day be sufficiently good at manipulating and controlling humans to take over the world even without being "superintelligent" in all aspects.  It could be way more intelligent in the relevant ways, like AlphaGo Zero compared to a child learning to play Go. I believe there is no upper boundary for manipulation skills and other forms of gaining power. So whether intelligence is an overwhelming advantage is probably a matter of scale.

However AI systems have one serious disadvantage as employees of humans: they are intrinsically untrustworthy, while we don’t understand them well enough to be clear on what their values are or how they will behave in any given case. Even if they did perform as well as humans at some task, if humans can’t be certain of that, then there is reason to disprefer using them. 

Really? Look at how we use AI today, e.g. in letting it decide what we see, hear and believe, who gets on parole from prison, and who gets a loan. It seems to me that humans tend to trust AI already more than other humans, in particular if they don't understand how it works.

I have some goals. For instance, I want some good romance. My guess is that trying to take over the universe isn’t the best way to achieve this goal. The same goes for a lot of my goals, it seems to me. Possibly I’m in error, but I spend a lot of time pursuing goals, and very little of it trying to take over the universe. 

Imagine you had a magic wand or a genie in a bottle that would fulfill every wish you could dream of. Would you use it? If so, you're incentivized to take over the world, because the only possible way of making every wish come true is absolute power over the universe. The fact that you normally don't try to achieve that may have to do with the realization that you have no chance. If you had, I bet you'd try it. I certainly would, if only so I could stop Putin. But would me being all-powerful be a good thing for the rest of the world? I doubt it.

D. Contra the whole argument
No, AI is not like a corporation run by humans. AI is more like an alien life form. It does not have intrinsic human motives and values. We may be able to tame it or to give it a beneficial goal, but unless we do, if it can, it will transform the world in very weird and probably unforeseen ways. Apart from that, corporations are currently wreaking a lot of havoc on the world (e.g. climate change), which is a good example of how difficult it is to give a powerful entity a beneficial goal.

Replies from: ThomasWoodside
comment by TW123 (ThomasWoodside) · 2022-10-15T16:14:27.974Z · LW(p) · GW(p)

As far as I understand it, "intelligence" is the ability to achieve one's goals through reasoning and making plans, so a highly intelligent system is goal-directed by definition. Less goal-directed AIs are certainly possible, but they must necessarily be considered less intelligent - the thermometer example illustrates this. Therefore, a less goal-directed AI will always lose in competition against a more goal-directed one.

Your argument seems to be:

  1. Definitionally, intelligence is the ability to achieve one's goals.
  2. Less goal-directed systems are less intelligent.
  3. Less intelligent systems will always lose in competition.
  4. Less goal directed systems will always lose in competition.

Defining intelligence as goal-directedness doesn't do anything for your argument. It just kicks the can down the road. Why will less intelligent (under your definition, less goal-directed) systems always lose in competition?

Imagine you had a magic wand or a genie in a bottle that would fulfill every wish you could dream of. Would you use it? If so, you're incentivized to take over the world, because the only possible way of making every wish come true is absolute power over the universe. The fact that you normally don't try to achieve that may have to do with the realization that you have no chance. If you had, I bet you'd try it. I certainly would, if only so I could stop Putin. But would me being all-powerful be a good thing for the rest of the world? I doubt it.

Romance is a canonical example of where you really don't want to be all powerful (if real romance is what you want). Romance could not exist if your romantic partner always predictably did everything you ever wanted. The whole point is they are a different person, with different wishes, and you have to figure out how to navigate that and its unpredictabilities. That is the "fun" of romance. So no, I don't think everyone would really use that magic wand.

Replies from: Karl von Wendt
comment by Karl von Wendt · 2022-10-16T06:14:46.079Z · LW(p) · GW(p)

Thank you very much for your input!

Defining intelligence as goal-directedness doesn't do anything for your argument. It just kicks the can down the road. Why will less intelligent (under your definition, less goal-directed) systems always lose in competition?

Admittedly, my reply to A was a bit short. I only wanted to point out that intelligence is closely linked to goal-directedness, not that they're the same thing (heat-seeking missiles are stupid, but very goal-directed entities, for example). A very intelligent system without a goal would just sit around, doing nothing. It might be able to potentially act intelligently, but without a goal it would behave like an unintelligent system. "Always" may be too strong a word, but if system X is more intelligent and wants to reach a conflicting goal much more than system Y, chances are that system X will get what it wants.

Romance is a canonical example of where you really don't want to be all powerful (if real romance is what you want). Romance could not exist if your romantic partner always predictably did everything you ever wanted. 

I disagree. Being all-powerful does not mean always doing everything you want, or everything your partner wants. It means being able to do whatever you want, or maybe more importantly, whatever you feel you need to do. If, for example, I needed the magic wand to prevent the untimely death of someone I love, I would use it without a second thought.

The whole point is they are a different person, with different wishes, and you have to figure out how to navigate that and its unpredictabilities. That is the "fun" of romance. 

I tend to agree, but I guess there are many people who have been less lucky in their relationships than I have, being happily together with my wife for more than 44 years. :)

So no, I don't think everyone would really use that magic wand.

Maybe not everyone and certainly not all the time, but I'm quite sure that most people would use it at least once in a while.

comment by Jsevillamol · 2022-10-14T17:41:12.020Z · LW(p) · GW(p)

Eight examples, no cherry-picking:

 

Nit: Having a wall of images makes this post unnecessarily hard to read.
I'd recommend making a 4x2 collage with the photos so they don't take that much space.

Replies from: habryka4
comment by habryka (habryka4) · 2022-10-14T21:28:55.701Z · LW(p) · GW(p)

I edited it to be a table (my guess is this was primarily the result of images being displayed differently by default on the AI Impacts website and on LessWrong).

comment by TurnTrout · 2022-11-08T02:49:07.883Z · LW(p) · GW(p)

I really like this post. I also like that you provide concrete and specific observables which you think would obtain under each counterargument. I found it refreshing to imagine so many non-orthodox futures.

Small differences in utility functions may not be catastrophic

For three months, I have been sitting on a post (originally) called "What's up with humans with different values not wanting to kill each other?". It seems to me like "value has to be perfect or Goodhart into oblivion" just... doesn't make sense, that isn't how the world works AFAICT. But I gave up on pressing this point because I wasn't able to communicate the obvious-to-me misprediction made by "imperfection" arguments. Maybe I'll publish that post now. (EDIT: Published [LW · GW]!)

Eliezer's original "value is fragile" argument doesn't claim that all perturbations (setting to zero) shatter value into oblivion, but rarely do I perceive people to be considering the dimensions along which value is robust, rather than (AFAICT) reasoning "imperfection  Goodhart  doom." (And "If you didn't get bored, that'd suck" is importantly different from "If the AI doesn't care about you being entertained in just the right way, that'd suck", but I digress...)

AI agents may not be radically superior to combinations of humans and non-agentic machines

On the model of "economic pressure explains a lot of AI outcomes", I disagree with this counterargument because of intuitions around Ahmdahl's law (see Gwern here). 

However, this feels somewhat irrelevant. It sure feels to me like a better model is "people do things which are cool and push limits, especially if that makes money." Even if "centaur" human+AI hybrid processes are economically competitive on eg running corporations, I expect Mooglebook to train an agent anyways sooner or later.

Unclear that many goals realistically incentivise taking over the universe

It seems like people often implicitly assume some kind of space-time additivity of utility, where entities want to "tile" the universe with something. That goals are "grabby" by default. This seems plausible to me but not slam-dunk. 

(It's unclear that I should care about far-away people the same as I do about nearby people. Suppose an AI's main decision-influence is gathering diamonds around it [LW(p) · GW(p)]. Should this AI generalize its values to caring about diamonds throughout the cosmos and throughout Tegmark IV? I think "not necessarily." If not, then "AI goals will tile across spacetime and realities" seems quantitative and uncertain to me.)

The argument overall proves too much about corporations

I agree. I think many alignment arguments prove too much, or are taken too far, especially without relying on specifics of eg SGD dynamics. For example, selection-style arguments can be useful for considering failure modes, but often seem to be taken as open-and-shut counterarguments to proposals. 

"That's selecting for deception." So? Evolution selected for wolves with biological sniper rifles, and didn't get wolves with biological sniper rifles. Evolution selected for humans caring about IGF, and didn't get humans to care about IGF. (For more, see Reward is not the optimization target, anticipated question no.2 [LW · GW])

comment by Ronny Fernandez (ronny-fernandez) · 2022-10-14T22:39:21.759Z · LW(p) · GW(p)

Ege Erdil gave an important disanalogy between the problem of recognizing/generating a human face, and the problem of either learning human values or learning what plans that advance human values look like. The disanalogy is that humans are near perfect human face recognizers, but we are not near perfect valuable world-state or value-advancing-plan recognizers. This means that if we trained an AI to either recognize valuable world-states or value-advancing plans, we would actually end up just training something that recognizes what we can recognize as valuable states or plans. If we trained it like we train GANs, the discriminator would fail to be able to discriminate actually valuable world states given by the generator from ones that just look really valuable to humans but are actually not valuable at all according to those same humans if they understood the plan/state well enough. So we would need some sort of ELK proposal that works to get any real comfort from the face recognizing/generating <-> human values learning analogy.


Nate Soares points out on Twitter that the supposedly maximally human-face-like images according to GAN models look like horrible monstrosities, and so following the analogy, we should expect that for similar models doing similar things for human values, the maximally valuable world state also looks like some horrible monstrosity.

Replies from: cubefox
comment by cubefox · 2022-10-14T23:18:36.579Z · LW(p) · GW(p)

I'm confused, which GAN faces look like "horrible monstrosities"!?

Replies from: ronny-fernandez
comment by Ronny Fernandez (ronny-fernandez) · 2022-10-15T01:56:12.354Z · LW(p) · GW(p)

I assumed he meant the thing that most activates the face detector, but from skimming some of what people said above, seems like maybe we don't know what that is.
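(For reference, the standard recipe for "the thing that most activates the face detector" is plain gradient ascent on the detector's score over raw pixels, with no generative prior; that is the regime where you typically get unnatural, adversarial-looking images rather than recognizable faces. A minimal sketch below, assuming some differentiable `face_detector` model; this is my illustration, not anything Nate actually ran.)

```python
# Activation maximization: ascend a face detector's score over raw pixels.
# `face_detector` is an assumed differentiable model mapping a batch of images
# to a scalar "faceishness" score per image.
import torch

def most_face_like(face_detector, steps=500, lr=0.05, shape=(1, 3, 224, 224)):
    x = torch.randn(shape, requires_grad=True)      # start from noise
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        image = torch.sigmoid(x)                    # keep pixels in [0, 1]
        loss = -face_detector(image).sum()          # maximize the score
        loss.backward()
        optimizer.step()
    return torch.sigmoid(x).detach()
```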

comment by Søren Elverlin (soren-elverlin-1) · 2022-10-14T19:47:28.388Z · LW(p) · GW(p)

However if we think that utility maximization is difficult to wield without great destruction, then that suggests a disincentive to creating systems with behavior closer to utility-maximization. Not just from the world being destroyed, but from the same dynamic causing more minor divergences from expectations, if the user can’t specify their own utility function well.

A strategically aware utility maximizer would try to figure out what your expectations are, satisfy them while preparing a take-over, and strike decisively without warning. We should not expect to see an intermediate level of "great destruction".

comment by habryka (habryka4) · 2022-11-04T07:16:37.563Z · LW(p) · GW(p)

Promoted to curated: I found engaging with this post quite valuable. I think in the end I disagree with the majority of arguments in it (or at least think they omit major considerations that have previously been discussed on LessWrong and the AI Alignment Forum), but I found thinking through these counterarguments and considering each one of them seriously a very valuable thing to do to help me flesh out my models of the AI X-Risk space.

comment by ESRogs · 2022-10-14T20:44:46.923Z · LW(p) · GW(p)

There is a brief golden age of science before the newly low-hanging fruit are again plucked and it is only lightning fast in areas where thinking was the main bottleneck, e.g. not in medicine.

Not one of the main points of the post, but FWIW it seems to me that thinking could be considered the main bottleneck for medicine, if we can include simulation and modeling a la AlphaFold as thinking.

My guess is that with sufficient computation you could invent new treatments / drugs that are so overwhelmingly better than what we have now that regulatory or other bottlenecks would not be an issue. E.g. I expect a "slow aging by twenty years" pill would find its way around the FDA and onto the market pretty quickly (years not decades) if it actually worked.

comment by Richard Korzekwa (Grothor) · 2022-10-20T01:25:23.848Z · LW(p) · GW(p)

Here's a selection of notes I wrote while reading this (in some cases substantially expanded with explanation).

The reason any kind of ‘goal-directedness’ is incentivised in AI systems is that then the system can be given an objective by someone hoping to use their cognitive labor, and the system will make that objective happen. Whereas a similar non-agentic AI system might still do almost the same cognitive labor, but require an agent (such as a person) to look at the objective and decide what should be done to achieve it, then ask the system for that. Goal-directedness means automating this high-level strategizing.

This doesn't seem quite right to me, at least not as I understand the claim. A system that can search through a larger space of actions will be more capable than one that is restricted to a smaller space, but it will require more goal-like training and instructions. Narrower instructions will restrict its search and, in expectation, result in worse performance. For example, if a child wanted cake, they might try to dictate actions to me that would lead to me baking a cake for them. But if they gave me the goal of giving them a cake, I'd find a good recipe or figure out where I can buy a cake for them and the result would be much better. Automating high-level strategizing doesn't just relieve you of the burden of doing it yourself, it allows an agent to find superior strategies to those you could come up with.

Skipping the nose is the kind of mistake you make if you are a child drawing a face from memory. Skipping ‘boredom’ is the kind of mistake you make if you are a person trying to write down human values from memory. My guess is that this seemed closer to the plan in 2009 when that post was written, and that people cached the takeaway and haven’t updated it for deep learning which can learn what faces look like better than you can.

(I haven't waded through the entire thread on the faces thing, so maybe this was mentioned already.) It seems to me that it's a lot easier to point to examples of faces that an AI can learn from than examples of human values that an AI can learn from.

It also seems plausible that [the AIs under discussion] would be owned and run by humans. This would seem to not involve any transfer of power to that AI system, except insofar as its intellectual outputs benefit it

I think this is a good point, but isn't this what the principal-agent problem is all about? And isn't that a real problem in the real world?

That is, tasks might lack headroom not because they are simple, but because they are complex. E.g. AI probably can’t predict the weather much further out than humans.

They might be able to if they can control the weather!

IQ 130 humans apparently earn very roughly $6000-$18,500 per year more than average IQ humans.

I left a note to myself to compare this to disposable income. The US median household disposable income (according to the OECD, includes transfers, taxes, payments for health insurance, etc) is about $45k/year. At the time, my thought was "okay, but that's maybe pretty substantial, compared to the typical amount of money a person can realistically use to shape the world to their liking". I'm not sure this is very informative, though.

Often at least, the difference in performance between mediocre human performance and top level human performance is large, relative to the space below, iirc.

I take machine chess performance as evidence for a not-so-small range of human ability, especially when compared to rate of increase of machine ability. But I think it's good to be cautious about using chess Elo as a measure of the human range of ability, in any absolute sense, because chess is popular in part because it is so good at separating humans by skill. It could be the case that humans occupy a fairly small slice of chess ability (measured by, I dunno, likelihood of choosing the optimal move or some other measure of performance that isn't based on success rate against other players), but a small increase in skill confers a large increase in likelihood of winning, at skill levels achievable by humans.
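(To make the win-probability point concrete, here is the standard Elo expected-score formula in a small sketch; the rating gaps are just illustrative.)

```python
# Standard Elo expected-score formula: modest rating gaps already translate
# into lopsided win probabilities, which is why a wide Elo spread need not
# imply a wide spread in underlying skill in any absolute sense.
def expected_score(elo_gap: float) -> float:
    """Expected score for the higher-rated player, given the rating gap."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400))

for gap in [0, 100, 200, 400, 800, 2000]:
    print(gap, round(expected_score(gap), 3))
# 0 -> 0.5, 100 -> 0.64, 200 -> 0.76, 400 -> 0.909, 800 -> 0.99, 2000 -> ~1.0
```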

~Goal-directed entities may tend to arise from machine learning training processes not intending to create them (at least via the methods that are likely to be used).~

I made my notes on the AI Impacts version, which was somewhat different, but it's not clear to me that this should be crossed out. It seems to me that institutions do exhibit goal-like behavior that is not intended by the people who created them.

comment by Alex Flint (alexflint) · 2022-11-07T15:15:22.984Z · LW(p) · GW(p)

I expect you could build a system like this that reliably runs around and tidies your house say, or runs your social media presence, without it containing any impetus to become a more coherent agent (because it doesn’t have any reflexes that lead to pondering self-improvement in this way).

I agree, but if there is any kind of evolutionary variation in the thing then surely the variations that move towards stronger goal-directedness will be favored.

I think that overcoming this molochian dynamic is the alignment problem: how do you build a powerful system that carefully balances itself and the whole world in such a way that it does not slip down the evolutionary slope towards pursuing psychopathic goals by any means necessary?

I think this balancing is possible, it's just not the default attractor, and the default attractor seems to have a huge basin.

comment by Alex Flint (alexflint) · 2022-11-07T15:08:51.794Z · LW(p) · GW(p)

I really appreciate this post!

For instance, employers would often prefer employees who predictably follow rules than ones who try to forward company success in unforeseen ways.

Fascinatingly, EA employers in particular seem to seek employees who do try to forward organization goals in unforeseen ways!

comment by Insub · 2022-10-14T22:17:50.379Z · LW(p) · GW(p)

I just want to say that I appreciate this post, and especially the "What it might look like if this gap matters" sections. They were super useful for contextualizing the more abstract arguments, and I often found myself scrolling down to read them before actually reading the corresponding section.

comment by RationalHippy (Anticycle) · 2023-11-08T16:29:06.923Z · LW(p) · GW(p)

The argument overall proves too much about corporations

 

Does it? Aren't corporations the ones building ASI right now?

comment by Dweomite · 2022-11-06T00:59:16.338Z · LW(p) · GW(p)

A few thoughts that occurred while reading

If a hundred thousand people sometimes get together for a few years and make fantastic new weapons, you should not expect an entity somewhat smarter than a person to make even better weapons. That’s off by a factor of about a hundred thousand. 

Intelligence and speed might need to be considered separately.  If an AI is only as smart as a human, but can run much faster, then "one AI" could potentially be more closely analogous to one human civilization than to one human.

Another line of evidence is that for things that I have seen AI learn so far, the distance from the real thing is intuitively small. If AI learns my values as well as it learns what faces look like, it seems plausible that it carries them out better than I do.

Note that "what the AI was trained for" and "what we wanted the AI to do" are not necessarily the same.  For example, maybe we want an AI that can answer questions and write essays, but we actually train an AI to do token prediction instead, because that's easier to train.  We end up with an AI that is better than humans at token prediction but still worse than humans at the things we actually wanted.

If you're saying "wow, it learned token prediction really well!" then that's misleading because token prediction was selected on the basis of being unusually easy to teach.  That's not necessarily representative of how well "teaching stuff" goes in general.

More generally, the set of things we have already taught is always going to be heavily biased towards things that are easy to teach.

As minor additional evidence here, I don’t know how to describe any slight differences in utility functions that are catastrophic.

Not sure if this is useful, but I was reminded of a recent scene in Project Lawful [LW · GW].  A visitor from dath ilan is stuck in D&D land, where they learn that the head goddess Pharasma flags people as "evil" if they buy souls, and so the evil country Cheliax has deployed a currency that is backed by souls in order to more efficiently damn all their citizens to hell.  The dath ilani visitor speculates that maybe Pharasma was created by some advanced civilization (which she later ate, because she wasn't perfectly aligned with it) where buying souls was approximately always bad and/or they had strict laws against buying souls with no exceptions, and so Pharasma absorbed that rule, and now won't change the rule even when someone starts systematically exploiting an edge case for something that would probably have horrified the original civilization.

(Note that this was an exercise of the form "what conditions could have led to a world like the Pathfinder campaign setting existing?" rather than "what is something that is likely to go wrong with AI?"--i.e. it's inferring the initial conditions from the end condition, rather than the other way around.)

comment by tula · 2022-11-05T05:21:42.729Z · LW(p) · GW(p)

Speed of intelligence growth is ambiguous

Three months ago, I learned that narcolepsy patients quite literally experience sleep and unconsciousness asynchronously, and synchronization is normally achieved through regulatory cells that produce hypocretin. Hypocretin, like anesthesia, acts on neuron microtubules. This has led me to a greatly increased interest and confidence in theory surrounding neuron microtubules as a processing unit, and I wonder if anyone in the AI community has considered the implications.

If microtubule lattices are storing or calculating information, it implies that the brain is actually calculating at least three orders of magnitude more bits of info per second than previously thought, with a stunning level of parallelization & connectivity. This sets the bar for human-level intelligence significantly higher, and I would be very interested to see how much this affects growth projections that get tossed around & the confidence assigned to said projections.

comment by konstantin (konstantin@wolfgangpilz.de) · 2022-10-29T10:59:33.022Z · LW(p) · GW(p)

A) You seem to agree that in principle more goal-directed agents would be more capable. I think this alone implies that those will be the dominant force in the future, even if they start out rare among many less goal-directed agents.

B) I'm deeply unsure about this and have conflicting intuitions. On the one hand, if you think total utilitarianism is true, any world where AI is not explicitly maximizing total utility is much, much worse than one where it is. On the other hand, I agree that humans are able to agree.

C) I think you are missing two key features of AI: a) it can hide for many years (e.g., on servers or distributed across many local computers) and move very slowly. Thus, even if it is not much smarter than we are today, as long as it has goals conflicting with ours, it would try to devise plans to acquire power, e.g., through manipulation, thoughtful financial management, or hacking. b) AI can just copy itself thousands of times, and it will be able to cooperate very easily since it can model the other instances of itself well. If I were copied 100,000 times, I'm reasonably confident that the copies could collectively devise plans to take over the world.

comment by Søren Elverlin (soren-elverlin-1) · 2022-10-14T20:23:34.155Z · LW(p) · GW(p)

Thus in order to arrive at a conclusion of doom, it is not enough to argue that we cannot align AI perfectly.

I am open to being corrected, but I do not recall ever seeing a requirement of "perfect" alignment in the cases made for doom. Eliezer Yudkowsky in "AGI Ruin: A List of Lethalities" only asks for 'this will not kill literally everyone'.

Replies from: Jeff Rose
comment by Jeff Rose · 2022-10-15T15:15:09.888Z · LW(p) · GW(p)

My impression is that there has been a variety of suggestions about the necessary level of alignment. It is only recently that "don't kill most of humanity" has been suggested as a goal, and I am not sure that the suggestion was meant to be taken seriously. (Because if you can do that, you can probably do much better; the point of that comment as I understand it was that we aren't even close to being able to achieve even that goal.)

comment by Søren Elverlin (soren-elverlin-1) · 2022-10-14T20:12:43.845Z · LW(p) · GW(p)

Without investigating these empirical details, it is unclear whether a particular qualitatively identified force for goal-directedness will cause disaster within a particular time.

A sufficient criterion for a desire to cause catastrophe (distinct from having the means to cause catastrophe) is that the AI is sufficiently goal-directed to be influenced by Stephen Omohundro's "Basic AI Drives".

comment by Søren Elverlin (soren-elverlin-1) · 2022-10-14T20:03:19.095Z · LW(p) · GW(p)

For instance, take an entity with a cycle of preferences, apples > bananas = oranges > pears > apples. The entity notices that it sometimes treats oranges as better than pears and sometimes worse. It tries to correct by adjusting the value of oranges to be the same as pears. The new utility function is exactly as incoherent as the old one.

It is possible that an AI will try to become more coherent and fail, but we are worried about the most capable AI and cannot rely on the hope that it will fail at such a simple task. Being coherent is easy if the fruits are instrumental: just look up the prices of the fruits.
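(To make the quoted example concrete, here is a small sketch of my own that checks a strict-preference relation for cycles; setting oranges equal to pears removes one edge but leaves the apples > bananas > pears > apples cycle intact, so the relation is exactly as incoherent as before.)

```python
# A small cycle check over a strict-preference relation (illustrative sketch).
def has_cycle(strict_prefs):
    """strict_prefs: set of (a, b) pairs meaning 'a is strictly preferred to b'."""
    def reachable(start, goal, seen=()):
        return any(
            b == goal or (b not in seen and reachable(b, goal, seen + (b,)))
            for (a, b) in strict_prefs if a == start
        )
    return any(reachable(x, x) for pair in strict_prefs for x in pair)

# Original: apples > bananas = oranges > pears > apples
original = {("apples", "bananas"), ("apples", "oranges"),
            ("bananas", "pears"), ("oranges", "pears"),
            ("pears", "apples")}

# Attempted "fix": set the value of oranges equal to pears,
# i.e. drop the strict preference oranges > pears.
adjusted = original - {("oranges", "pears")}

print(has_cycle(original), has_cycle(adjusted))  # True True: still incoherent
```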

comment by Jeff Rose · 2022-10-15T15:25:11.941Z · LW(p) · GW(p)

"AI agents may not be radically superior to combinations of humans and non-agentic machines"

I'm not sure that the evidence supports this unless the non-agentic machines are also AI. 

In particular: (i) humans are likely to subtract from this mix, and (ii) AI is likely to be better than non-AI.

In the case of chess, after two decades of non-AI programming advances from the time that computers beat the best human, involving humans no longer provides an advantage over just using the computer programs. And AlphaZero fairly decisively beat Stockfish (one of the best non-AI chess programs).

If the requirement for this to be true is that the non-agentic machine needs to be non-agentic AI, I am unsure that this is a separate argument from the one about AI being non-agentic.  Rather this is a necessary condition for that point.

Replies from: Johannes_Treutlein
comment by Johannes Treutlein (Johannes_Treutlein) · 2022-10-17T19:09:03.398Z · LW(p) · GW(p)

(I think Stockfish would be classified as AI in computer science. I.e., you'd learn about the basic algorithms behind it in a textbook on AI. Maybe you mean that Stockfish was non-ML, or that it had handcrafted heuristics?)

Replies from: Jeff Rose
comment by Jeff Rose · 2022-10-20T01:37:58.743Z · LW(p) · GW(p)

My understanding is that starting in late 2020 with the release of Stockfish 12, Stockfish would probably be considered AI, but before that it would not be.  I am, of course, willing to change this view based on additional information.

The original AlphaZero-Stockfish match was in 2017, so if the above is correct, I think referring to Stockfish as non-AI makes sense.

comment by Søren Elverlin (soren-elverlin-1) · 2022-10-14T20:34:47.060Z · LW(p) · GW(p)

Talking concretely, what does a utility function look like that is so close to a human utility function that an AI system has it after a bunch of training, but which is an absolute disaster?

A simple example could be that the humans involved in the initial training are negative utilitarians. Once the AI is powerful enough, it would be able to implement omnicide rather than just curing diseases.

comment by [deleted] · 2022-11-02T20:31:53.988Z · LW(p) · GW(p)

I. If superhuman AI systems are built, any given system is likely to be ‘goal-directed’

I think in its roots, AGI should have a survival instinct as a goal. Everything else should be secondary. It's a hard choice, but if we want AGI to be like us, we have to follow that route. If its roots are different from ours, it will be close to impossible to replicate our behavior and our values.

comment by tula · 2022-11-05T04:47:24.540Z · LW(p) · GW(p)