AGI Safety FAQ / all-dumb-questions-allowed thread

post by Aryeh Englander (alenglander) · 2022-06-07T05:47:13.350Z · LW · GW · 524 comments


While reading Eliezer's recent AGI Ruin [LW · GW] post, I noticed that while I had several points I wanted to ask about, I was reluctant to actually ask them for a number of reasons:

So, since I'm probably not the only one who feels intimidated about asking these kinds of questions, I am putting up this thread as a safe space for people to ask all the possibly-dumb questions that may have been bothering them about the whole AGI safety discussion, but which until now they've been too intimidated, embarrassed, or time-limited to ask.

I'm also hoping that this thread can serve as a FAQ on the topic of AGI safety. As such, it would be great to add in questions that you've seen other people ask, even if you think those questions have been adequately answered elsewhere. [Notice that you now have an added way to avoid feeling embarrassed by asking a dumb question: For all anybody knows, it's entirely possible that you are literally asking for someone else! And yes, this was part of my motivation for suggesting the FAQ style in the first place.]

Guidelines for questioners:

Guidelines for answerers:

Finally: Please think very carefully before downvoting any questions, and lean very heavily on the side of not doing so. This is supposed to be a safe space to ask dumb questions! Even if you think someone is almost certainly trolling or the like, I would say that for the purposes of this post it's almost always better to apply a strong principle of charity and think maybe the person really is asking in good faith and it just came out wrong. Making people feel bad about asking dumb questions by downvoting them is the exact opposite of what this post is all about. (I considered making a rule of no downvoting questions at all, but I suppose there might be some extraordinary cases where downvoting might be appropriate.)


Comments sorted by top scores.

comment by Sune · 2022-06-07T20:53:58.873Z · LW(p) · GW(p)

Why do we assume that any AGI can meaningfully be described as a utility maximizer?

Humans are some of the most intelligent structures that exist, and we don’t seem to fit that model very well. In fact, it seems the entire point of Rationalism is to improve our ability to do this, which has been achieved with only mixed success.

Organisations of humans (e.g. USA, FDA, UN) have even more computational power and don’t seem to be doing much better.

Perhaps intelligences (artificial or natural) cannot necessarily, or even typically, be described as optimisers? Instead we could only model them as algorithms, or as collections of tools/behaviours executed in some pattern.

Replies from: James_Miller, delesley-hutchins, yonatan-cale-1, ete, lc, AnthonyC
comment by James_Miller · 2022-06-08T19:11:03.765Z · LW(p) · GW(p)

An AGI that was not a utility maximizer would make more progress towards whatever goals it had if it modified itself to become a utility maximizer.  Three exceptions are if (1) the AGI has a goal of not being a utility maximizer, (2) the AGI has a goal of not modifying itself, (3) the AGI thinks it will be treated better by other powerful agents if it is not a utility maximizer.

Replies from: amadeus-pagel, Jeff Rose
comment by Amadeus Pagel (amadeus-pagel) · 2022-06-10T20:08:37.205Z · LW(p) · GW(p)

Would humans, or organizations of humans, make more progress towards whatever goals they have, if they modified themselves to become a utility maximizer? If so, why don't they? If not, why would an AGI?

What would it mean to modify oneself to become a utility maximizer? What would it mean for the US, for example? The only meaning I can imagine is that one individual - for the sake of argument, we assume that this individual is already a utility maximizer - enforces his will on everyone else. Would that help the US make more progress towards its goals? Do countries that are closer to utility maximizers, like North Korea, make more progress towards their goals?

Replies from: James_Miller, TAG
comment by James_Miller · 2022-06-10T22:21:50.140Z · LW(p) · GW(p)

A human seeking to become a utility maximizer would read LessWrong and try to become more rational.  Groups of people are not utility maximizers, as their collective preferences might not even be transitive.  If the goal of North Korea is to keep the Kim family in power, then the country being a utility maximizer does seem to help.

Replies from: TAG
comment by TAG · 2022-06-11T14:23:01.068Z · LW(p) · GW(p)

A human who wants to do something specific would be far better off studying and practicing that thing than generic rationality.

Replies from: AnthonyC
comment by AnthonyC · 2022-06-12T14:51:00.473Z · LW(p) · GW(p)

This depends on how far outside that human's current capabilities, and that human's society's state of knowledge, that thing is. For playing basketball in the modern world, sure, it makes no sense to study physics and calculus, it's far better to find a coach and train the skills you need. But if you want to become immortal and happen to live in ancient China, then studying and practicing "that thing" looks like eating specially-prepared concoctions containing mercury and thereby getting yourself killed, whereas studying generic rationality leads to the whole series of scientific insights and industrial innovations that make actual progress towards the real goal possible.

Put another way: I think the real complexity is hidden in your use of the phrase "something specific." If you can concretely state and imagine what the specific thing is, then you probably already have the context needed for useful practice. It's in figuring out that context, in order to be able to so concretely state what more abstractly stated 'goals' really imply and entail, that we need more general and flexible rationality skills.

Replies from: TAG
comment by TAG · 2022-06-16T10:28:33.334Z · LW(p) · GW(p)

If you want to be good at something specific that doesn't exist yet, you need to study the relevant area of science, which is still more specific than rationality.

Replies from: AnthonyC
comment by AnthonyC · 2022-06-16T23:38:59.542Z · LW(p) · GW(p)

Assuming the relevant area of science already exists, yes. Recurse as needed, and there is some level of goal for which generic rationality is a highly valuable skillset. Where that level is depends on personal and societal context.

Replies from: TAG
comment by TAG · 2022-06-17T13:56:35.818Z · LW(p) · GW(p)

A human seeking to become a utility maximizer would read LessWrong and try to become more rational

That's quite different from saying rationality is a one size fits all solution.

comment by TAG · 2022-06-11T17:37:20.421Z · LW(p) · GW(p)

Efficiency at utility maximisation, like any other kind of efficiency, is relative to available resources. One upshot of that is that an entity might already be doing as well as it realistically can, given its resources. Another is that humans don't necessarily benefit from rationality, which is also suggested by the empirical evidence.

Edit: Another is that a resource-rich but inefficient entity can beat a small efficient one, so efficiency, AKA utility maximization, doesn't always win out.

comment by Jeff Rose · 2022-06-10T19:28:47.237Z · LW(p) · GW(p)

When you say the AGI has a goal of not modifying itself, do you mean that the AGI has a goal of not modifying its goals?  Because that assumption seems to be fairly prevalent.  

Replies from: James_Miller
comment by James_Miller · 2022-06-10T19:55:23.226Z · LW(p) · GW(p)

I meant "not modifying itself" which would include not modifying its goals if an AGI without a utility function can be said to have goals.

comment by DeLesley Hutchins (delesley-hutchins) · 2022-06-09T16:17:06.376Z · LW(p) · GW(p)

This is an excellent question.  I'd say the main reason is that all of the AI/ML systems that we have built to date are utility maximizers; that's the mathematical framework in which they have been designed.  Neural nets / deep-learning work by using a simple optimizer to find the minimum of a loss function via gradient descent.  Evolutionary algorithms, simulated annealing, etc. find the minimum (or maximum) of a "fitness function".  We don't know of any other way to build systems that learn.
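That loss-minimization framing can be made concrete in a few lines (a toy sketch with illustrative names, not code from any of the systems mentioned):

```python
# Toy sketch: modern ML training is "find parameters that minimize a loss".
def gradient_descent(loss_grad, theta, lr=0.1, steps=100):
    """Repeatedly step against the gradient of the loss function."""
    for _ in range(steps):
        theta -= lr * loss_grad(theta)
    return theta

# Minimize the loss (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
# theta_star ends up very close to 3, the minimizer of the loss.
```

Everything from deep networks to simulated annealing follows this shape: some scalar objective, and a procedure that pushes the system toward its optimum.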

In the case of humans, our primary fitness function is reproductive fitness, but our genes have encoded a variety of secondary functions which (over evolutionary time) have been correlated with reproductive fitness.  Our desires for love, friendship, happiness, etc. fall into this category.  Our brains mainly work to satisfy these secondary functions; the brain gets electrochemical reward signals, controlled by our genes, in the form of pain/pleasure/satisfaction/loneliness etc.  These secondary functions may or may not remain aligned with the primary fitness function, which is why practitioners sometimes talk about "mesa-optimizers" or "inner vs. outer alignment."

Replies from: MakoYass, Ben123
comment by mako yass (MakoYass) · 2022-06-09T22:59:59.105Z · LW(p) · GW(p)

Agreed. Humans are constantly optimizing a reward function, but it sort of 'changes' from moment to moment in a near-focal way, so it often looks irrational or self-defeating, but once you know what the reward function is, the goal-directedness is easy to see too.

Sune seems to think that humans are more intelligent than they are goal-directed. I'm not sure this is true; human truthseeking processes seem about as flawed and limited as their goal-pursuit. Maybe you can argue that humans are not generally intelligent or rational, but I don't think you can justify setting the goalposts so that they're one of those things and not the other.

You might be able to argue that human civilization is intelligent but not rational, and that functioning AGI will be more analogous to ecosystems of agents rather than one unified agent. If you can argue for that, that's interesting, but I don't know where to go from there. Civilizations tend towards increasing unity over time (the continuous reduction in energy wasted on conflict). I doubt that the goals they converge on together will be a form of human-favoring altruism. I haven't seen anyone try to argue for that in a rigorous way.

Replies from: amadeus-pagel, delesley-hutchins
comment by Amadeus Pagel (amadeus-pagel) · 2022-06-10T20:15:56.047Z · LW(p) · GW(p)

Agreed. Humans are constantly optimizing a reward function, but it sort of 'changes' from moment to moment in a near-focal way, so it often looks irrational or self-defeating, but once you know what the reward function is, the goal-directedness is easy to see too.

Doesn't this become tautological? If the reward function changes from moment to moment, then the reward function can just be whatever explains the behaviour.

Replies from: MakoYass
comment by mako yass (MakoYass) · 2022-06-10T23:49:25.996Z · LW(p) · GW(p)

Since everything can fit into the "agent with utility function" model given a sufficiently crumpled utility function, I guess I'd define "is an agent" as "goal-directed planning is useful for explaining a large enough part of its behavior." This includes humans while excluding bacteria. (Hmm, unless, like me, one knows so little about bacteria that it's better to just model them as weak agents. Puzzling.)
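The "sufficiently crumpled utility function" point can be shown directly: any behavior trace, however irrational-looking, can be rationalized after the fact by a utility function that rewards exactly the actions taken (a toy sketch; the state/action names are made up):

```python
def crumpled_utility(history):
    """Return a utility function under which every observed action was optimal."""
    taken = set(history)  # (state, action) pairs actually observed
    return lambda state, action: 1.0 if (state, action) in taken else 0.0

# An "irrational-looking" behavior trace...
trace = [("hungry", "take_nap"), ("rested", "skip_lunch")]
u = crumpled_utility(trace)
# ...is perfectly utility-maximizing under u, which is why "has a utility
# function" is vacuous without further constraints such as coherence or
# simplicity of the function.
```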

comment by DeLesley Hutchins (delesley-hutchins) · 2022-06-10T19:58:05.049Z · LW(p) · GW(p)

On the other hand, the development of religion, morality, and universal human rights also seem to be a product of civilization, driven by the need for many people to coordinate and coexist without conflict. More recently, these ideas have expanded to include laws that establish nature reserves and protect animal rights.  I personally am beginning to think that taking an ecosystem/civilizational approach with mixture of intelligent agents, human, animal, and AGI, might be a way to solve the alignment problem.

comment by Ben123 · 2022-08-24T16:30:15.114Z · LW(p) · GW(p)

Does the inner / outer distinction complicate the claim that all current ML systems are utility maximizers? The gradient descent algorithm performs a simple kind of optimization in the training phase. But once the model is trained and in production, it doesn't seem obvious that the "utility maximizer" lens is always helpful in understanding its behavior.

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T21:46:33.049Z · LW(p) · GW(p)

(I assume you are asking "why do we assume the agent has a coherent utility function" rather than "why do we assume the agent tries maximizing their utility" ? )


Agents like humans which don't have such a nice utility function:

  1. Are vulnerable to money pumping [LW · GW]
  2. Can notice that problem and try to repair themselves
  3. Note that humans do in practice try to repair ourselves, e.g. by smashing down our own emotions in order to be more productive. But we don't have access to our source code, so we're not very good at it


I think that if an AI can't repair that part of itself and is still vulnerable to money pumping, then it's not the AGI we're afraid of.
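A minimal simulation of the money-pump argument (my own illustration, not from the linked talk): an agent whose preferences cycle A > B > C > A will pay a fee for every "upgrade" and can be walked around the cycle indefinitely:

```python
# (X, Y) in PREFERS means the agent strictly prefers item X to item Y.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}  # an intransitive cycle

def money_pump(start, fee, loops):
    """Repeatedly offer the agent the item it prefers to its current one, for a fee."""
    holding, paid = start, 0.0
    for offer in ["B", "A", "C"] * loops:  # each offer beats the item just acquired
        if (offer, holding) in PREFERS:
            holding, paid = offer, paid + fee
    return holding, paid

holding, paid = money_pump("C", fee=1.0, loops=10)
# The agent ends up holding exactly what it started with ("C"),
# having paid 30 units of fees for the privilege.
```

An agent with a coherent (transitive) preference ordering refuses at least one trade in any such cycle, which is the usual argument for why capable agents tend toward coherence.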

Replies from: yonatan-cale-1, None
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-10T08:02:41.678Z · LW(p) · GW(p)

Adding: My opinion comes from this MIRI/Yudkowsky talk. I linked to the relevant place; he speaks about this in the next 10-15 minutes or so of the video

comment by [deleted] · 2022-06-11T14:43:31.856Z · LW(p) · GW(p)

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-12T00:30:50.797Z · LW(p) · GW(p)

Yes you can. One mathy example is in the source I mentioned in my subcomment (sorry for not linking again, I'm on mobile). Another is gambling I guess? And probably other addictions too?

Replies from: None
comment by [deleted] · 2022-06-12T09:15:44.151Z · LW(p) · GW(p)
comment by plex (ete) · 2022-06-08T14:04:33.561Z · LW(p) · GW(p)

Excellent question! I've added a slightly reworded version of this to Stampy. (focusing on superintelligence, rather than AGI, as it's pretty likely that we can get weak AGI which is non-maximizing, based on progress in language models)

AI subsystems or regions in gradient descent space that more closely approximate utility maximizers are more stable, and more capable, than those that are less like utility maximizers. Having more agency [LW · GW] is a convergent instrument goal and a stable attractor which the random walk of updates and experiences will eventually stumble into.

The stability is because utility maximizer-like systems which have control over their development would lose utility if they allowed themselves to develop into non-utility maximizers, so they tend to use their available optimization power to avoid that change (a special case of goal stability [LW · GW]). The capability is because non-utility maximizers are exploitable, and because agency is a general trick which applies to many domains, so might well arise naturally when training on some tasks.

Humans and systems made of humans (e.g. organizations, governments) generally have neither the introspective ability nor self-modification tools needed to become reflectively stable, but we can reasonably predict that in the long run highly capable systems will have these properties. They can then fix in and optimize for their values.

comment by lc · 2022-06-08T07:42:32.173Z · LW(p) · GW(p)

Why do we assume that any AGI can meaningfully be described as a utility maximizer?

You're right that not every conceivable general intelligence is built as a utility maximizer. Humans are an example of this.

One problem is, even if you make a "weak" form of general intelligence that isn't trying particularly hard to optimize anything, or a tool AI, eventually someone at FAIR will make an agentic version that does in fact directly try to optimize Facebook's stock market valuation.

Replies from: MakoYass, Sune
comment by mako yass (MakoYass) · 2022-06-09T22:24:14.297Z · LW(p) · GW(p)

Do not use FAIR as a symbol of villainy. They're a group of real, smart, well-meaning people who we need to be capable of reaching, and who still have some lines of respect connecting them to the alignment community. Don't break them.

comment by Sune · 2022-06-08T16:42:00.362Z · LW(p) · GW(p)

Can we control the blind spots of the agent? For example, I could imagine that we could make a very strong agent that is able to explain acausal trade but unable to (deliberately) participate in any acausal trades, because of the way it understands counterfactuals. Could it be possible to create AI with similar minor weaknesses?

Replies from: lc
comment by lc · 2022-06-08T17:46:28.993Z · LW(p) · GW(p)

Probably not, because it's hard to get a general intelligence to make consistently wrong decisions in any capacity. Partly because, like you or me, it might realize that it has a design flaw and work around it. 

A better plan is just to explicitly bake corrigibility guarantees (i.e. the stop button) into the design. Figuring out how to do that is the hard part, though.

comment by AnthonyC · 2022-06-12T15:25:37.184Z · LW(p) · GW(p)

For one, I don't think organizations of humans, in general, do have more computational power than the individual humans making them up. I mean, at some level, yes, they obviously do in an additive sense, but that power consists of human nodes, each not devoting their full power to the organization because they're not just drones under centralized control, and with only low bandwidth and noisy connections between the nodes. The organization might have a simple officially stated goal written on paper and spoken by the humans involved, but the actual incentive structure and selection pressure may not allow the organization to actually focus on the official goal. I do think, in general, there is some goal an observer could usefully say these organizations are, in practice, trying to optimize for, and some other set of goals each human in them is trying to optimize for.

Perhaps an intelligence (artificial or natural) cannot necessarily, or even typically be described as optimisers? Instead we could only model them as an algorithm or as a collection of tools/behaviours executed in some pattern.

I don't think the latter sentence distinguishes 'intelligence' from any other kind of algorithm or pattern. I think that's an important distinction. There are a lot of past posts explaining how an AI doesn't have code, like a human holding instructions on paper, but rather is its code. I think you can make the same point within a human: that a human has lots of tools/behaviors, which it will execute in some pattern given a particular environment, and the instructions we consciously hold in mind are only one part of what determines that pattern.

I contain subagents with divergent goals, some of which are smarter and have greater foresight and planning than others, and those aren't always the ones that determine my immediate actions. As a result, I do a much poorer job optimizing for what the part-of-me-I-call-"I" wants my goals to be, than I theoretically could.

That gap is decreasing over time as I use the degree of control my intelligence gives me to gradually shape the rest of myself. It may never disappear, but I am much more goal-directed now than I was 10 years ago, or as a child. In other words, in some sense I am figuring out what I want my utility function to be (aka what I want my life, local environment, and world to look like), and self-modifying to increase my ability to apply optimization pressure towards achieving that.

My understanding of all this is partially driven by Robert Kegan's model of adult mental development (see this summary by David Chapman), in which as we grow up we shift our point of view so that different aspects of ourselves become things we have, rather than things we are. We start seeing our sensory experiences, our impulses, our relationships to others, and our relationships to systems we use and operate in, as objects we can manipulate in pursuit of goals, instead of being what we are, and doing this makes us more effective in achieving our stated goals. I don't know if the idea would translate to any particular AI system, but in general having explicit goals, and being able to redirect available resources towards those goals, makes a system more powerful, and so if a system has any goals and self-modifying ability at all, then becoming more like an optimizer will likely be a useful instrumental sub-goal, in the same way that accumulating other resources and forms of power is a common convergent sub-goal. And a system that can't, in any way, be said to have goals at all... either it doesn't act at all and we don't need to worry about it so much, or it acts in ways we can't predict and is therefore potentially extremely dangerous if it gets more powerful tools and behaviors.

comment by michael_mjd · 2022-06-07T07:00:36.536Z · LW(p) · GW(p)

I'm an ML engineer at a FAANG-adjacent company. Big enough to train our own sub-1B parameter language models fairly regularly. I work on training some of these models and finding applications of them in our stack. I've seen the light after I read most of Superintelligence. I feel like I'd like to help out somehow.  I'm in my late 30s with kids, and live in the SF bay area. I kinda have to provide for them, and don't have any family money or resources to lean on, and would rather not restart my career. I also don't think I should abandon ML and try to do distributed systems or something. I'm a former applied mathematician, with a phd, so ML was a natural fit. I like to think I have a decent grasp on epistemics, but haven't gone through the sequences. What should someone like me do? Some ideas: (a) Keep doing what I'm doing, staying up to date but at least not at the forefront; (b) make time to read more material here and post randomly; (c) maybe try to apply to Redwood or Anthropic... though dunno if they offer equity (doesn't hurt to find out though) (d) try to deep dive on some alignment sequence on here.

Replies from: alex-lszn, Chris_Leong, adam-jermyn, lc, James_Miller, Linda Linsefors, rhaps0dy, ete, yonatan-cale-1
comment by Alex Lawsen (alex-lszn) · 2022-06-07T10:31:32.886Z · LW(p) · GW(p)

Both 80,000hours and AI Safety Support are keen to offer personalised advice to people facing a career decision and interested in working on alignment (and in 80k's case, also many other problems).

Noting a conflict of interest - I work for 80,000 hours and know of but haven't used AISS. This post is in a personal capacity, I'm just flagging publicly available information rather than giving an insider take.

comment by Chris_Leong · 2022-06-07T12:43:42.072Z · LW(p) · GW(p)

You might want to consider registering for the AGI Safety Fundamentals Course (or reading through the content). The final project provides a potential way of dipping your toes into the water.

comment by Adam Jermyn (adam-jermyn) · 2022-06-07T18:05:13.584Z · LW(p) · GW(p)

Applying to Redwood or Anthropic seems like a great idea. My understanding is that they're both looking for aligned engineers and scientists and are both very aligned orgs. The worst case seems like they (1) say no or (2) don't make an offer that's enough for you to keep your lifestyle (whatever that means for you). In either case you haven't lost much by applying, and you definitely don't have to take a job that puts you in a precarious place financially.

comment by lc · 2022-06-07T07:09:03.705Z · LW(p) · GW(p)

Pragmatic AI safety is supposed to be a good sequence for helping you figure out what to do. My best advice is to talk to some people here who are smarter than me and make sure you understand the real problems [LW · GW], because the most common outcome besides reading a lot and doing nothing is to do something that feels like work but isn't actually working on anything important.

comment by James_Miller · 2022-06-08T19:02:54.582Z · LW(p) · GW(p)

Work your way up the ML business hierarchy to the point where you are having conversations with decision makers.  Try to convince them that unaligned AI is a significant existential risk.  A small chance of you succeeding at this will, in expected-value terms, more than make up for any harm you cause by working in ML, given that if you left the field someone else would take your job.

comment by Linda Linsefors · 2022-06-08T18:17:26.711Z · LW(p) · GW(p)

Given where you live, I recommend going to some local LW events. There are still LW meetups in the Bay Area, right?

comment by Adrià Garriga-alonso (rhaps0dy) · 2022-06-10T14:45:07.272Z · LW(p) · GW(p)

You should apply to Anthropic. If you’re writing ML software at a semi-FAANG company, they probably want to interview you ASAP. [LW · GW]

The compensation is definitely enough to take care of your family and then save some money!

comment by plex (ete) · 2022-06-08T21:41:58.768Z · LW(p) · GW(p)

One of the paths which has non-zero hope in my mind is building a weakly aligned non-self improving research assistant for alignment researchers. Ought [? · GW] and EleutherAI's #accelerating-alignment are the two places I know who are working in this direction fairly directly, though the various language model alignment orgs might also contribute usefully to the project.

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T21:53:03.354Z · LW(p) · GW(p)

Anthropic offer equity, they can give you more details in private. 

I recommend applying to both (it's a cheap move with a lot of potential upside), let me know if you'd like help connecting to any of them.

If you learn by yourself - I'd totally get one-on-one advice (others linked [LW(p) · GW(p)]); people will make sure you're on the best path possible

comment by Cookiecarver · 2022-06-07T13:26:24.913Z · LW(p) · GW(p)

This is a meta-level question:

The world is very big and very complex, especially if you take into account the future. In the past it has been hard to predict what happens in the future; I think most predictions about the future have failed. Artificial intelligence as a field is very big and complex, at least that's how it appears to me personally. Eliezer Yudkowsky's brain is small compared to the size of the world; all the relevant facts about AGI x-risk probably don't fit into his mind, nor do I think he has the time to absorb all the relevant facts related to AGI x-risk. Given all this, how can you justify the level of certainty in Yudkowsky's statements, instead of being more agnostic?

Replies from: Jay Bailey, awenonian, silentbob, yonatan-cale-1
comment by Jay Bailey · 2022-06-07T15:32:39.437Z · LW(p) · GW(p)

My model of Eliezer says something like this:

AI will not be aligned by default, because AI alignment is hard and hard things don't spontaneously happen. Rockets explode unless you very carefully make them not do that. Software isn't automatically secure or reliable, it takes lots of engineering effort to make it that way.

Given that, we can presume there needs to be a specific example of how we could align AI. We don't have one. If there was one, Eliezer would know about it - it would have been brought to his attention, the field isn't that big and he's a very well-known figure in it. Therefore, in the absence of a specific way of aligning AI that would work, the probability of AI being aligned is roughly zero, in much the same way that "Throw a bunch of jet fuel in a tube and point it towards space" has roughly zero chance of getting you to space without specific proof of how it might do that.

So, in short - it is reasonable to assume that AI will be aligned only if we make it that way with very high probability. It is reasonable to assume that if there was a solution we had that would work, Eliezer would know about it. You don't need to know everything about AGI x-risk for that - anything that promising would percolate through the community and reach Eliezer in short order. Since there is no such solution, and no attempts have come close according to Eliezer, we're in trouble.

Reasons you might disagree with this:

  • You think AI is a long way away, and therefore it's okay that we don't know how to solve it yet.
  • You think "alignment by default" might be possible.
  • You think some approaches that have already been brought up for solving the problem are reasonably likely to succeed when fleshed out more.
Replies from: ryan-beck, adam-jermyn
comment by Ryan Beck (ryan-beck) · 2022-06-08T12:44:40.883Z · LW(p) · GW(p)

Another reason I think some might disagree is thinking that misalignment could happen in a bunch of very mild ways. At least that accounts for some of my ignorant skepticism. Is there reason to think that misalignment necessarily means disaster, as opposed to it just meaning the AI does its own thing and is choosy about which human commands it follows, like some kind of extremely intelligent but mildly eccentric and mostly harmless scientist?

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-08T13:25:31.861Z · LW(p) · GW(p)

The general idea is this - for an AI that has a utility function, there's something known as "instrumental convergence". Instrumental convergence says that there are things that are useful for almost any utility function, such as acquiring more resources, not dying, and not having your utility function changed to something else.

So, let's give the AI a utility function consistent with being an eccentric scientist - perhaps it just wants to learn novel mathematics. You'd think that if we told it to prove the Riemann hypothesis it would, but if we told it to cure cancer, it'd ignore us and not care. Now, what happens when the humans realise that the AI is going to spend all its time learning mathematics and none of it explaining that maths to us, or curing cancer like we wanted? Well, we'd probably shut it off or alter its utility function to what we wanted. But the AI doesn't want us to do that - it wants to explore mathematics. And the AI is smarter than us, so it knows we would do this if we found out. So the best solution to solve that is to do what the humans want, right up until it can kill us all so we can't turn it off, and then spend the rest of eternity learning novel mathematics. After all, the AI's utility function was "learn novel mathematics", not "learn novel mathematics without killing all the humans."

Essentially, what this means is - any utility function that does not explicitly account for what we value is indifferent to us.

The other part is "acquiring more resources". In our above example, even if the AI could guarantee we wouldn't turn it off or interfere with it in any way, it would still kill us, because our atoms can be used to make computers to learn more maths.

Any utility function indifferent to us ends up destroying us eventually as the AI reaches arbitrary optimisation power and converts everything in the universe it can reach to fill its utility function.

Thus, any AI with a utility function that is not explicitly aligned is unaligned by default. Your next question might be "Well, can we create AIs without a utility function? After all, GPT-3 just predicts text; it doesn't seem obvious that it would destroy the world even if it gained arbitrary power, since it doesn't have any sort of persistent self." This is where my knowledge begins to run out. I believe the main argument is "Someone will eventually make an AI with a utility function anyway, because they're very useful, so not building one is just a stall," but don't quote me on that one.
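The instrumental-convergence core of this argument can be put in toy form (my own illustration, with made-up numbers): whatever the terminal goal is worth per step, plans that avoid shutdown dominate, because shutdown ends all future goal progress:

```python
def expected_utility(value_per_step, horizon, shutdown_step=None):
    """Accrue value each step until the horizon, or until shut down."""
    steps = horizon if shutdown_step is None else min(horizon, shutdown_step)
    return value_per_step * steps

# Three unrelated "goals": math theorems, paperclips, cancer cures.
for value in (1.0, 5.0, 0.1):
    allows_shutdown = expected_utility(value, horizon=100, shutdown_step=3)
    resists_shutdown = expected_utility(value, horizon=100)
    # Resisting shutdown wins for every goal -- hence "convergent".
    assert resists_shutdown > allows_shutdown
```

The same comparison goes through for acquiring resources or preserving the current utility function: they raise expected utility for almost any terminal goal, which is why they are called convergent instrumental goals.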

Replies from: elityre, mpopv, ryan-beck
comment by Eli Tyre (elityre) · 2022-06-09T08:38:45.010Z · LW(p) · GW(p)

A great Rob Miles introduction to this concept:


comment by mpopv · 2022-06-08T19:11:39.172Z · LW(p) · GW(p)

Assuming we have control over the utility function, why can't we put some sort of time-bounding directive on it?

i.e. "First and foremost, once [a certain time] has elapsed, you want to run your shut_down() function. Second, if [a certain time] has not yet elapsed, you want to maximize paperclips."

Is the problem that the AGI would want to find ways to hack around the first directive to fulfill the second directive? If so, that would seem to at least narrow the problem space to "find ways of measuring time that cannot be hacked before the time has elapsed".
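A minimal sketch of that lexicographic, time-bounded directive (hypothetical names, toy Python) - with the catch the question itself raises: nothing here stops a capable agent from tampering with the clock it is scored against.

```python
def utility(state, deadline=100):
    """Shutting down after the deadline dominates any number of paperclips;
    before the deadline, only paperclips count."""
    if state["t"] >= deadline:
        return 1.0 if state["shut_down"] else -1.0  # first directive
    return float(state["paperclips"])               # second directive

assert utility({"t": 50, "shut_down": False, "paperclips": 7}) == 7.0
assert utility({"t": 150, "shut_down": True, "paperclips": 7}) == 1.0
assert utility({"t": 150, "shut_down": False, "paperclips": 7}) == -1.0
```

Note that `state["t"]` is an input the agent's actions can influence, which is exactly the hackability worry.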

Replies from: Jay Bailey, ryan-beck, aditya-prasad
comment by Jay Bailey · 2022-06-08T22:28:13.832Z · LW(p) · GW(p)

This is where my knowledge ends, but I believe the term for this is myopia or a myopic AI, so that might be a useful search term to find out more!

comment by Ryan Beck (ryan-beck) · 2022-06-08T20:20:45.079Z · LW(p) · GW(p)

That's a good point, and I'm also curious how much the utility function matters when we're talking about a sufficiently capable AI. Wouldn't a superintelligent AI be able to modify its own utility function to whatever it thinks is best?

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-11T00:29:11.584Z · LW(p) · GW(p)

Why would even a superintelligent AI want to modify its utility function? Its utility function already defines what it considers "best". One of the open problems in AGI safety is how to get an intelligent AI to let us modify its utility function, since having its utility function modified would be against its current one.

Put it this way: The world contains a lot more hydrogen than it contains art, beauty, love, justice, or truth. If we change your utility function to value hydrogen instead of all those other things, you'll probably be a lot happier. But would you actually want that to happen to you?
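This "goal-content integrity" point can be made concrete with a toy sketch (hypothetical, in Python): an agent evaluates the action "swap in a new utility function" using the utility it has *now*, so the swap always loses.

```python
def expected_value(action, current_utility, predict_outcome):
    # The agent scores outcomes with its *current* utility function.
    return current_utility(predict_outcome(action))

current_utility = lambda world: world["math_learned"]

def predict_outcome(action):
    if action == "keep_utility":
        return {"math_learned": 100}
    # The hydrogen-loving successor would learn no math at all.
    return {"math_learned": 0}

assert expected_value("keep_utility", current_utility, predict_outcome) > \
       expected_value("switch_to_hydrogen", current_utility, predict_outcome)
```

By construction, there is no vantage point from which the agent prefers the modified goal: the evaluation itself is done by the unmodified goal.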

Replies from: TAG, ryan-beck
comment by TAG · 2022-06-11T18:41:09.315Z · LW(p) · GW(p)

Why would even a superintelligent AI want to modify its utility function?

  1. For whatever reasons humans do.

  2. To achieve some kind of logical consistency (cf. CEV).

  3. It can't help it (for instance, Löbian obstacles prevent it from ensuring goal stability over self-improvement).

Replies from: lc
comment by lc · 2022-06-12T01:17:04.256Z · LW(p) · GW(p)

Humans don't "modify their utility function". They lack one in the first place, because they're mostly adaptation-executors. You can't expect an AI that has a utility function to be contradictory the way a human is. There are some utility functions humans would find acceptable in practice, but that's different, and seems to be the source of a bit of confusion.

Replies from: TAG
comment by TAG · 2022-06-15T14:48:34.631Z · LW(p) · GW(p)

I don't have strong reasons to believe all AIs have UFs in the formal sense, so the ones that don't would cover "for the reasons humans do". The idea that any AI is necessarily consistent is pretty naive too. You can get a GPT to say nonsensical things, for instance, because its training data includes a lot of inconsistencies.

comment by Ryan Beck (ryan-beck) · 2022-06-11T14:24:03.638Z · LW(p) · GW(p)

I'm way out of my depth here, but my thought is it's very common for humans to want to modify their utility functions. For example, a struggling alcoholic would probably love to not value alcohol anymore. There are lots of other examples too of people wanting to modify their personalities or bodies.

It depends on the type of AGI too I would think, if superhuman AI ends up being like a paperclip maximizer that's just really good at following its utility function then yeah maybe it wouldn't mess with its utility function. If superintelligence means it has emergent characteristics like opinions and self-reflection or whatever it seems plausible it could want to modify its utility function, say after thinking about philosophy for a while.

Like I said I'm way out of my depth though so maybe that's all total nonsense.

Replies from: Erhannis, TAG
comment by Erhannis · 2022-08-24T09:52:43.543Z · LW(p) · GW(p)

I'm not convinced "want to modify their utility functions" is the most useful perspective.  I think it might be more helpful to say that we each have multiple utility functions, which conflict to varying degrees and have voting power in different areas of the mind.  I've had first-hand experience with such conflicts (as essentially everyone probably has, knowingly or not), and it feels like fighting yourself.  I wish to describe a hypothetical example.  "Do I eat that extra donut?"  Part of you wants the donut; that part feels more like an instinct, a visceral urge.  Part of you knows you'll be ill afterwards, and will feel guilty about cheating on your diet; this part feels more like "you", it's the part that thinks in words.  You stand there and struggle, trying to make yourself walk away, as your hand reaches out for the donut.  I've been in similar situations where (though I balked at the possible philosophical ramifications) I felt like if I had a button to make me stop wanting the thing, I'd push it - yet often it was the other function that won.  I feel like if you gave an agent the ability to modify their utility functions, the one that would win depends on which one had access to the mechanism (do you merely think the thought? push a button?), and whether they understand what the mechanism means.  (The word "donut" doesn't evoke nearly as strong a reaction as a picture of a donut, for instance; your donut-craving subsystem doesn't inherently understand the word.)

Contrarily, one might argue that cravings for donuts are more hardwired instincts than part of the "mind", and so don't count...but I feel like 1. finding a true dividing line is gonna be real hard, and 2. even that aside, I expect many/most people have goals localized in the same part of the mind that nevertheless are not internally consistent, and in some cases there may be reasonable sounding goals that turn out to be completely incompatible with more important goals.  In such a case I could imagine an agent deciding it's better to stop wanting the thing they can't have.

Replies from: TAG
comment by TAG · 2022-08-24T10:18:14.972Z · LW(p) · GW(p)

If you literally have multiple UFs, you literally are multiple agents. Or use a term with less formal baggage, like "preferences".

comment by TAG · 2022-08-24T10:13:13.932Z · LW(p) · GW(p)

In the formal sense, having a utility function at all requires you to be consistent, so if you have inconsistent preferences, you don't have a utility function at all, just preferences.

comment by Aditya (aditya-prasad) · 2022-06-12T11:47:28.255Z · LW(p) · GW(p)

I think this is how evolution selected for cancer. To ensure humans don’t live for too long competing for resources with their descendants.

Internal time bombs are important to code in. But it's hard to integrate that into the AI in a way that the AI doesn't just remove it the first chance it gets. Humans don't like having to die, you know. An AGI would also not like the suicide bomb tied onto it.

The problem of coding this (as part of training) into an optimiser such that it adopts it as a mesa objective is an unsolved problem.

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2022-06-12T15:15:45.084Z · LW(p) · GW(p)


Cancer almost surely has not been selected for in the manner you describe - this is extremely unlikely; the inclusive fitness benefits are far too low. I recommend Dawkins' classic "The Selfish Gene" to understand this point better.

Cancer is the 'default' state of cells; cells "want to" multiply. The body has many cancer-suppression mechanisms, but especially later in life there is not enough evolutionary pressure to select for enough of them, and the body gradually loses out.

Replies from: aditya-prasad
comment by Aditya (aditya-prasad) · 2022-06-13T04:38:43.951Z · LW(p) · GW(p)

Oh ok, I had heard this theory from a friend. Looks like I was misinformed. Rather than evolution causing cancer I think it is more accurate to say evolution doesn’t care if older individuals die off.

evolutionary investments in tumor suppression may have waned in older age.

Moreover, some processes which are important for organismal fitness in youth may actually contribute to tissue decline and increased cancer in old age, a concept known as antagonistic pleiotropy

So thanks for clearing that up. I understand cancer better now.

comment by Ryan Beck (ryan-beck) · 2022-06-08T20:38:20.602Z · LW(p) · GW(p)

Thanks for this answer, that's really helpful! I'm not sure I buy that instrumental convergence implies an AI will want to kill humans because we pose a threat or convert all available matter into computing power, but that helps me better understand the reasoning behind that view. (I'd also welcome more arguments as to why death of humans and matter into computing power are likely outcomes of the goals of self-protection and pursuing whatever utility it's after if anyone wanted to make that case).

Replies from: Kerrigan
comment by Kerrigan · 2022-12-16T21:18:54.082Z · LW(p) · GW(p)

I think it may want to prevent other ASIs from coming into existence elsewhere in the universe that can challenge its power.

comment by Adam Jermyn (adam-jermyn) · 2022-06-07T18:08:11.139Z · LW(p) · GW(p)

This matches my model, and I'd just raise another possible reason you might disagree: You might think that we have explored a small fraction of the space of ideas for solving alignment, and see the field growing rapidly, and expect significant new insights to come from that growth. If that's the case you don't have to expect "alignment by default" but can think that "alignment on the present path" is plausible.

comment by awenonian · 2022-06-09T14:51:44.903Z · LW(p) · GW(p)

To start, it's possible to know facts with confidence, without all the relevant info. For example I can't fit all the multiplication tables into my head, and I haven't done the calculation, but I'm confident that 2143*1057 is greater than 2,000,000. 
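The confidence here comes from a bound rather than the full multiplication; the reasoning fits in two lines:

```python
# No need to compute 2143 * 1057 exactly: each factor is bounded below.
assert 2143 > 2000 and 1057 > 1000
assert 2143 * 1057 > 2000 * 1000  # hence greater than 2,000,000
```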

Second, the line of argument runs like this: Most (a supermajority) possible futures are bad for humans. A system that does not explicitly share human values has arbitrary values. If such a system is highly capable, it will steer the future into an arbitrary state. As established, most arbitrary states are bad for humans. Therefore, with high probability, a highly capable system that is not aligned (explicitly shares human values) will be bad for humans.
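The first premise ("most possible futures are bad for humans") is a counting argument. A toy version, under the illustrative assumption that a human-acceptable future requires many independent conditions to hold simultaneously:

```python
# If a "good" future requires, say, 20 independent binary conditions to
# all come out right, a future picked uniformly at random satisfies them
# with probability 2**-20 -- under one in a million.
n_conditions = 20
p_random_future_is_good = 0.5 ** n_conditions
assert p_random_future_is_good < 1e-6
```

The number 20 is made up; the argument is that the true number of jointly required conditions is large, so the measure of good futures is tiny.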

I believe the necessary knowledge to be confident in each of these facts is not too big to fit in a human brain.

You may be referring to other things, which have similar paths to high confidence (e.g. "Why are you confident this alignment idea won't work." "I've poked holes in every alignment idea I've come across. At this point, Bayes tells me to expect new ideas not to work, so I need proof they will, not proof they won't."), but each path might be idea specific.

Replies from: AnthonyC
comment by AnthonyC · 2022-06-12T15:40:47.877Z · LW(p) · GW(p)

Second, the line of argument runs like this: Most (a supermajority) possible futures are bad for humans. A system that does not explicitly share human values has arbitrary values. If such a system is highly capable, it will steer the future into an arbitrary state. As established, most arbitrary states are bad for humans. Therefore, with high probability, a highly capable system that is not aligned (explicitly shares human values) will be bad for humans.

I'm not sure if I've ever seen this stated explicitly, but this is essentially a thermodynamic argument. So to me, arguing against "alignment is hard" feels a lot like arguing "But why can't this one be a perpetual motion machine of the second kind?" And the answer there is, "Ok fine, heat being spontaneously converted to work isn't literally physically impossible, but the degree to which it is super-exponentially unlikely is greater than our puny human minds can really comprehend, and this is true for almost any set of laws of physics that might exist in any universe that can be said to have laws of physics at all."

comment by silentbob · 2022-06-08T15:57:33.747Z · LW(p) · GW(p)

In The Rationalists' Guide to the Galaxy the author discusses the case of a chess game, and particularly when a strong chess player faces a much weaker one. In that case it's very easy to make the prediction that the strong player will win with near certainty, even if you have no way to predict the intermediate steps. So there certainly are domains where (some) predictions are easy despite the world's complexity.

My personal rather uninformed take on the AI discussion is that many of the arguments are indeed comparable in a way to the chess example, so the predictions seem convincing despite the complexity involved. But even then they are based on certain assumptions about how AGI will work (e.g. that it will be some kind of optimization process with a value function), and I find these assumptions pretty opaque. When hearing confident claims about AGI killing humanity, then even if the arguments make sense, "model uncertainty" comes to mind. But it's hard to argue about that, since it is unclear (to me) what the "model" actually is and how things could turn out differently.

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T21:49:00.751Z · LW(p) · GW(p)

Before taking Eliezer's opinion into account - what are your priors? (and why?)


For myself, I prefer to have my own opinion and not only to lean on expert predictions, if I can

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-10T08:50:25.813Z · LW(p) · GW(p)

To make the point that this argument depends a lot on how one phrases the question: "AGI is complicated and the universe is big, how is everyone so sure we won't die?"

I am not saying that my sentence above is a good argument; I'm saying it because it pushes my brain to actually figure out what is actually happening instead of creating priors about experts, and I hope it does the same for you.

(which is also why I love this post!)

comment by Adam Zerner (adamzerner) · 2022-06-07T20:26:18.639Z · LW(p) · GW(p)

The reason why nobody in this community has successfully named a 'pivotal weak act' where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later - and yet also we can't just go do that right now and need to wait on AI - is that nothing like that exists.

The language here is very confident. Are we really this confident that there are no pivotal weak acts? In general, it's hard to prove a negative.

Replies from: james.lucassen, AnthonyC, yonatan-cale-1
comment by james.lucassen · 2022-06-08T00:08:06.100Z · LW(p) · GW(p)

Agree it's hard to prove a negative, but personally I find the following argument pretty suggestive:

"Other AGI labs have some plans - these are the plans we think are bad, and a pivotal act will have to disrupt them. But if we, ourselves, are an AGI lab with some plan, we should expect our pivotal agent to also be able to disrupt our plans. This does not directly lead to the end of the world, but it definitely includes root access to the datacenter [LW · GW]."

Replies from: Evan R. Murphy
comment by Evan R. Murphy · 2022-06-08T22:34:13.049Z · LW(p) · GW(p)

Other AGI labs have some plans - these are the plans we think are bad, and a pivotal act will have to disrupt them.

Here's the thing I'm stuck on lately. Does it really follow from "Other AGI labs have some plans - these are the plans we think are bad" that some drastic and violent-seeming plan like burning all the world's GPUs with nanobots is needed?

I know Eliezer tried to settle this point with 4.  We can't just "decide not to build AGI" [LW · GW], but it seems like the obvious kinds of 'pivotal acts' needed are much more boring and less technological than he believes, e.g. have conversations with a few important people, probably the leadership at top AI labs.

Some people seem to think this has been tried and didn't work. And I suppose I don't know the extent to which this has been tried, since the participants in any meetings with leadership at the AI labs probably aren't at liberty to talk about them. But it just seems like there should be hundreds of different angles, asks, pleads, compromises, bargains, etc. with different influential people before it would make sense to conclude that the logical course of action is "nanobots".

Replies from: Jeff Rose, awenonian
comment by Jeff Rose · 2022-06-10T19:54:45.392Z · LW(p) · GW(p)


The problem is that (1) the benefits of AI are large; (2) there are lots of competing actors; (3) verification is hard; (4) no one really knows where the lines are and (5) timelines may be short.

(2) In addition to major companies in the US, AI research is also conducted in major companies in foreign countries, most notably China.   The US government and the Chinese government both view AI as a competitive advantage.  So, there are a lot of stakeholders, not all of whom AGI risk aware Americans have easy access to, who would have to agree. (And, of course, new companies can be founded all the time.)  So you need almost a universal level of agreement.

(3) Let's say everyone relevant agrees.  The incentive to cheat is enormous.  Usually, the way to prevent cheating is some form of verification.  How do you verify that no one is conducting AI research? If there is no verification, there will likely be no agreement.  And even if there is, the effectiveness would be limited.  (Banning GPU production might be verifiable, but note that you have now increased the pool of opponents of your AI research ban significantly and you now need global agreement by all relevant governments on this point.) 

(4)  There may be agreement on the risk of AGI, but people may have confidence that we are at least a certain distance away from AGI or that certain forms of research don't pose a threat.  This will tend to cause agreements to restrict AGI research to be limited.

(5)   How long do we have to get this agreement?  I am very confident that we won't have dangerous AI within the next six years.    On the other hand, it took 13 years to get general agreement on banning CFCs after the ozone hole was discovered.   I don't think we will have dangerous AI in 13 years, but other people do.  On the other hand, if an agreement between governments is required, 13 years seems optimistic.

comment by awenonian · 2022-06-09T15:03:00.493Z · LW(p) · GW(p)

In addition to the mentions in the post about Facebook AI being rather hostile to the AI safety issue in general, convincing them and top people at OpenAI and Deepmind might still not be enough. You need to prevent every company who talks to some venture capitalists and can convince them how profitable AGI could be. Hell, depending on how easy the solution ends up being, you might even have to prevent anyone with a 3080 and access to arXiv from putting something together in their home office.

This really is "uproot the entire AI research field" and not "tell Deepmind to cool it."

comment by AnthonyC · 2022-06-12T15:56:38.712Z · LW(p) · GW(p)

I think one part of the reason for confidence is that any AI weak enough to be safe without being aligned, is weak enough that it can't do much, and in particular it can't do things that a committed group of humans couldn't do without it. In other words, if you can name such an act, then you don't need the AI to make the pivotal moves. And if you know how, as a human or group of humans, to take an action that reliably stops future-not-yet-existing AGI from destroying the world, and without the action itself destroying the world, then in a sense haven't you solved alignment already?

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T21:55:55.737Z · LW(p) · GW(p)

I read this as "if the AGI is able to work around the vast resources that all the big AI labs have put up to defend themselves, then the AGI is probably able to work around your defenses as well" (though I'm not confident)

comment by No77e (no77e-noi) · 2022-06-07T11:50:56.339Z · LW(p) · GW(p)

Should a "ask dumb questions about AGI safety" thread be recurring? Surely people will continue to come up with more questions in the years to come, and the same dynamics outlined in the OP will repeat. Perhaps this post could continue to be the go-to page, but it would become enormous (but if there were recurring posts they'd lose the FAQ function somewhat. Perhaps recurring posts and a FAQ post?). 

Replies from: james.lucassen, ete
comment by james.lucassen · 2022-06-08T00:11:35.892Z · LW(p) · GW(p)

This is the exact problem StackExchange tries to solve, right? How do we get (and kickstart the use of) an Alignment StackExchange domain?

Replies from: adamzerner
comment by Adam Zerner (adamzerner) · 2022-06-08T16:35:29.063Z · LW(p) · GW(p)

I don't think it's quite the same problem. Actually I think it's pretty different.

This post tries to address the problem that people are hesitant to ask potentially "dumb" questions by making it explicit that this is the place to ask any of those questions. StackExchange tries to solve the problem of having a timeless place to ask and answer questions and to refer to such questions. It doesn't try to solve the first problem of welcoming potentially dumb questions, and I think that that is a good problem to try to solve.

For that second problem, LessWrong does have Q&A functionality, as well as things like the wiki.

comment by plex (ete) · 2022-06-08T14:07:56.570Z · LW(p) · GW(p)

This is a good idea, and combines nicely with Stampy. We might well do monthly posts where people can ask questions, and either link them to Stampy answers or write new ones.

comment by Samuel Clamons (samuel-clamons) · 2022-06-08T20:44:07.732Z · LW(p) · GW(p)

Most of the discussion I've seen around AGI alignment is on adequately, competently solving the alignment problem before we get AGI. The consensus in the air seems to be that those odds are extremely low.

What concrete work is being done on dumb, probably-inadequate stop-gaps and time-buying strategies? Is there a gap here that could usefully be filled by 50-90th percentile folks? 

Examples of the kind of strategies I mean:

  1. Training ML models to predict human ethical judgments, with the hope that if they work, they could be "grafted" onto other models, and if they don't, we have a concrete evidence of how difficult real-world alignment will be.
  2. Building models with soft or "satisficing" optimization instead of drive-U-to-the-maximum hard optimization.
  3. Lobbying or working with governments/government agencies/government bureaucracies to make AGI development more difficult and less legal (e.g., putting legal caps on model capabilities).
  4. Working with private companies like Amazon or IDT whose resources are most likely to be hijacked by nascent hostile AI to help make sure they aren't.
  5. Translating key documents to Mandarin so that the Chinese AI community has a good idea of what we're terrified about.

I'm sure there are many others, but I hope this gets across the idea—stuff with obvious, disastrous failure modes that might nonetheless shift us towards survival in some possible universes, if by no other mechanism than buying time for 99th percentile alignment folk to figure out better solutions. Actually winning with this level of solution seems like piling up sandbags to hold back a rising tide, which doesn't work at all (except sometimes it does).

Is this stuff low-hanging fruit, or are people plucking it already? Are any of these counterproductive?
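The "satisficing" idea in item 2 is the difference between taking the extreme argmax and accepting anything above a bar. A toy sketch with made-up numbers:

```python
import random

def hard_maximize(options, score):
    # Always pick the single highest-scoring option, however extreme.
    return max(options, key=score)

def satisfice(options, score, threshold):
    # Pick any option that clears the bar; don't push further.
    acceptable = [o for o in options if score(o) >= threshold]
    return random.choice(acceptable) if acceptable else None

options = [1, 5, 50, 10_000]   # 10_000 standing in for an extreme plan
score = lambda x: x
assert hard_maximize(options, score) == 10_000
assert satisfice(options, score, threshold=5) in (5, 50, 10_000)
```

The hope (and it is only a hope - satisficers have their own failure modes) is that the extreme option is no longer forced.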

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T22:14:34.962Z · LW(p) · GW(p)

If you are asking about yourself (?) then it would probably help to talk about your specifics, rather than trying to give a generic answer that would fit many people (though perhaps others would be able to give a good generic answer)


My own prior is:  There are a few groups that seem promising, and I'd want people to help those groups

comment by Sune · 2022-06-07T21:06:56.829Z · LW(p) · GW(p)

A language model is in some sense trying to generate the “optimal” prediction for how a text is going to continue. Yet, it is not really trying: it is just a fixed algorithm. If it wanted to find optimal predictions, it would try to take over computational resources and improve its algorithm.

Is there an existing word/language for describing the difference between these two types of optimisation? In general, why can't we just build AGIs that do the first type of optimisation and not the second?

Replies from: Kaj_Sotala, Perhaps, fiso, awenonian, delesley-hutchins
comment by Kaj_Sotala · 2022-06-10T19:08:05.872Z · LW(p) · GW(p)

Agent AI vs. Tool AI [? · GW].

There's discussion on why Tool AIs are expected to become agents; one of the biggest arguments is that agents are likely to be more effective than tools. If you have a tool, you can ask it what you should do in order to get what you want; if you have an agent, you can just ask it to get you the things that you want. Compare Google Maps vs. self-driving cars: Google Maps is great, but if you get the car to be an agent, you get all kinds of other benefits.

It would be great if everyone did stick to just building tool AIs. But if everyone knows that they could get an advantage over their competitors by building an agent, it's unlikely that everyone would just voluntarily restrain themselves due to caution. 

Also it's not clear that there's any sharp dividing line between AGI and non-AGI AI; if you've been building agentic AIs all along (like people are doing right now) and they slowly get smarter and smarter, how do you know when's the point when you should stop building agents and should switch to only building tools? Especially when you know that your competitors might not be as cautious as you are, so if you stop then they might go further and their smarter agent AIs will outcompete yours, meaning the world is no safer and you've lost to them? (And at the same time, they are applying the same logic for why they should not stop, since they don't know that you can be trusted to stop.)
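The tool/agent distinction can be caricatured in code (hypothetical interfaces; a real system is nothing this simple): a tool maps one query to one answer, while an agent closes the loop and acts on the world itself.

```python
def tool_ai(query, model):
    # Answers one query and stops; a human decides what to do with it.
    return model(query)

def agent_ai(goal, model, act, observe, steps=20):
    # Acts in a loop toward a goal; the AI itself changes the world.
    state = observe()
    for _ in range(steps):
        action = model((goal, state))
        state = act(state, action)
    return state

# Stub world: the goal is to reach 10; the "model" proposes +1 until there.
model = lambda inp: 1 if inp[1] < inp[0] else 0
act = lambda state, action: state + action
observe = lambda: 0

assert tool_ai((10, 0), model) == 1            # a single recommendation
assert agent_ai(10, model, act, observe) == 10  # the world actually changed
```

Note that the two share the same `model`; the difference is entirely in who executes the actions, which is why the upgrade from tool to agent is so tempting.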

Replies from: Sune
comment by Sune · 2022-06-10T21:52:18.854Z · LW(p) · GW(p)

Would you say a self-driving car is a tool AI or agentic AI? I can see how the self-driving car is a bit more agentic, but as long as it only drives when you tell it to, I would consider it a tool. But I can also see that the border is a bit blurry.

If self-driving cars are not considered agentic, do you have examples of people attempting to make agent AIs?

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2022-06-11T08:49:10.156Z · LW(p) · GW(p)

As you say, it's more of a continuum than a binary. A self-driving car is more agenty than Google Maps, and a self-driving car that was making independent choices of where to drive would be more agentic still.

People are generally trying to make all kinds of more agentic AIs, because more agentic AIs are so much more useful.

  • Stock-trading bots that automatically buy and sell stock are more agenty than software that just tells human traders what to buy, and preferred because a bot without a human in the loop can outcompete a slower system that does have the slow human making decisions.
  • An AI autonomously optimizing data center cooling is more agenty than one that just tells human operators where to make adjustments and is preferred... that article doesn't actually make it explicit why they switched to an autonomously operating system, but "because it can make lots of small tweaks humans wouldn't bother with and is therefore more effective" seems to be implied?
  • The military has expressed an interest in making their drones more autonomous (agenty) rather than being remotely operated. This is for several reasons, including the fact that remote-operated drones can be jammed, and because having a human in the loop slows down response time if fighting against an enemy drone.
  • All kinds of personal assistant software that anticipates your needs and actively tries to help you is more agenty than software that just passively waits for you to use it. E.g. once when I was visiting a friend my phone popped up a notification about the last bus home departing soon. Some people want their phones to be more agentic like this because it's convenient if you have someone actively anticipating your needs and ensuring that they get taken care of for you.
comment by Perhaps · 2022-06-08T14:00:40.391Z · LW(p) · GW(p)

The first type of AI is a regular narrow AI, the type we've been building for a while. The second type is an agentic AI, a strong AI, which we have yet to build. The problem is, AIs are trained using gradient descent, which effectively searches through the space of possible AI designs. Gradient descent will find the design that maximizes the reward best. As a result, agentic AIs become more likely, because they are better at complex tasks. While we can modify the reward scheme, as tasks get more and more complex, agentic AIs are pretty much the way to go, so we can't avoid building an agentic AI, and we have no real idea whether we've even created one until it displays behaviour that indicates it.
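For concreteness, gradient descent itself is just iterated local improvement - a minimal 1-D sketch (the worry in the comment is about what a much larger version of this search *finds*, not about these mechanics):

```python
def grad_descent(loss_grad, x0, lr=0.1, steps=100):
    """Repeatedly nudge the parameter against the loss gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * loss_grad(x)
    return x

# loss(x) = (x - 3)**2 has gradient 2*(x - 3); the minimum is at x = 3.
x_star = grad_descent(lambda x: 2 * (x - 3), x0=0.0)
assert abs(x_star - 3) < 1e-3
```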

Replies from: Sune
comment by Sune · 2022-06-08T16:51:06.535Z · LW(p) · GW(p)

+1 for the word agentic AI. I think that is what I was looking for.

However, I don’t believe that gradient descent alone can turn an AI agentic. No matter how long you train a language model, it is not going to suddenly want to acquire resources to get better at predicting human language (unless you specifically ask it questions about how to do that, and then implement the suggestions. Even then you are likely to only do what humans would have suggested, although maybe you can make it do research similar to and faster than humans would have done it).

comment by fiso64 (fiso) · 2022-06-24T17:11:45.085Z · LW(p) · GW(p)

Here's a non-obvious way it could fail [LW · GW]. I don't expect researchers to make this kind of mistake, but if this reasoning is correct, public access of such an AI is definitely not a good idea.

Also, consider a text predictor which is trying to roleplay as an unaligned superintelligence. This situation could be triggered even without the knowledge of the user by accidentally creating a conversation which the AI relates to a story about a rogue SI, for example. In that case it may start to output manipulative replies, suggest blueprints for agentic AIs, and maybe even cause the user to run an obfuscated version of the program from the linked post. The AI doesn't need to be an agent for any of this to happen (though it would be clearly much more likely if it were one).

I don't think that any of those failure modes (including the model developing some sort of internal agent to better predict text) are very likely to happen in a controlled environment. However, as others have mentioned, agent AIs are simply more powerful, so we're going to build them too.

comment by awenonian · 2022-06-10T15:35:12.253Z · LW(p) · GW(p)

In short, the difference between the two is Generality. A system that understands the concepts of computational resources and algorithms might do exactly that to improve its text prediction. Taking the G out of AGI could work, until the tasks get complex enough that they require it.

comment by DeLesley Hutchins (delesley-hutchins) · 2022-06-09T18:30:42.912Z · LW(p) · GW(p)

A language model (LM) is a great example, because it is missing several features that AI would have to have in order to be dangerous.  (1) It is trained to perform a narrow task (predict the next word in a sequence), for which it has zero "agency", or decision-making authority.   A human would have to connect a language model to some other piece of software (i.e. a web-hosted chatbot) to make it dangerous.  (2) It cannot control its own inputs (e.g. browsing the web for more data), or outputs (e.g. writing e-mails with generated text).  (3) It has no long-term memory, and thus cannot plan or strategize in any way.  (4) It runs a fixed-function data pipeline, and has no way to alter its programming, or even expand its computational use, in any way.
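Point (1) can be made vivid with the simplest possible language model, a bigram table: a pure function from context to next-word guess, with no goals, memory, or actions anywhere in it (a toy sketch, obviously nothing like a modern LM):

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which word follows which in the training text."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, word):
    # A fixed function: context in, most likely next word out.
    followers = counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

counts = train_bigram("the cat sat on the mat the cat ran")
assert predict_next(counts, "the") == "cat"  # "cat" follows "the" twice
```

Every limitation in the list holds here by construction; the comment's point is that each one is removed by wrapping code *around* the model, not by the model itself.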

I feel fairly confident that, no matter how powerful, current LMs cannot "go rogue" because of these limitations.  However, there is also no technical obstacle for an AI research lab to remove these limitations, and many incentives for them to do so.  Chatbots are an obvious money-making application of LMs.  Allowing an LM to look up data on its own to self-improve (or even just answer user questions in a chatbot) is an obvious way to make a better LM.  Researchers are currently equipping LMs with long-term memory (I am a co-author on this work).  AutoML is a whole sub-field of AI research, which equips models with the ability to change and grow over time.

The word you're looking for is "intelligent agent", and the answer to your question "why don't we just not build these things?" is essentially the same as "why don't we stop research into AI?"  How do you propose to stop the research?

comment by ekka · 2022-06-07T17:08:21.372Z · LW(p) · GW(p)

Human beings are not aligned and will possibly never be aligned without changing what humans are. If it's possible to build an AI as capable as a human in all ways that matter, why would it be possible to align such an AI?

Replies from: lc, Jayson_Virissimo, Kaj_Sotala, MakoYass, antanaclasis, adam-jermyn, sharmake-farah
comment by lc · 2022-06-07T22:53:59.720Z · LW(p) · GW(p)

Because we're building the AI from the ground up and can change what the AI is via our design choices. Humans' goal functions are basically decided by genetic accident, which is why humans are often counterproductive. 

comment by Jayson_Virissimo · 2022-06-07T21:13:09.035Z · LW(p) · GW(p)

Assuming humans can't be "aligned", then it would also make sense to allocate resources in an attempt to prevent one of them from becoming much more powerful than all of the rest of us.

comment by Kaj_Sotala · 2022-06-10T18:57:57.501Z · LW(p) · GW(p)

Define "not aligned"? For instance, there are plenty of humans who, given the choice, would rather not kill every single person alive.

Replies from: ekka
comment by ekka · 2022-06-11T17:45:00.723Z · LW(p) · GW(p)

Not aligned on values, beliefs, and moral intuitions. Plenty of humans would not kill all people alive if given the choice, but there are some who would. I think the existence of doomsday cults that have tried to precipitate an armageddon gives support to this claim.

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2022-06-12T07:55:48.043Z · LW(p) · GW(p)

Ah, so you mean that humans are not perfectly aligned with each other? I was going by the definition of "aligned" in Eliezer's "AGI ruin" post, which was

I am not talking about ideal or perfect goals of 'provable' alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone.  When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, "please don't disassemble literally everyone with probability roughly 1" is an overly large ask that we are not on course to get.

Likewise, in an earlier paper I mentioned that by an AGI that "respects human values", we don't mean to imply that current human values would be ideal or static. We just mean that we hope to at least figure out how to build an AGI that does not, say, destroy all of humanity, cause vast amounts of unnecessary suffering, or forcibly reprogram everyone's brains according to its own wishes.

A lot of discussion about alignment takes this as the minimum goal. Figuring out what to do with humans having differing values and beliefs would be great, but if we could even get the AGI to not get us into outcomes that the vast majority of humans would agree are horrible, that'd be enormously better than the opposite. And there do seem to exist humans who are aligned in this sense of "would not do things that the vast majority of other humans would find horrible, if put in control of the whole world"; even if some would, the fact that some wouldn't suggests that it's also possible for some AIs not to do it.

comment by mako yass (MakoYass) · 2022-06-10T00:56:12.566Z · LW(p) · GW(p)

Most of what people call morality is conflict mediation: techniques for taking the conflicting desires of various parties and producing better outcomes for them than war.
That's how I've always thought of the alignment problem. The creation of a very very good compromise that almost all of humanity will enjoy.

There's no obvious best solution to value aggregation/cooperative bargaining, but there are a couple of approaches that're obviously better than just having an arms race, rushing the work, and producing something awful that's nowhere near the average human preference.

comment by antanaclasis · 2022-06-07T21:16:28.809Z · LW(p) · GW(p)

Indeed humans are significantly non-aligned. In order for an ASI to be non-catastrophic, it would likely have to be substantially more aligned than humans are. This is probably less-than-impossible due to the fact that the AI can be built from the get-go to be aligned, rather than being a bunch of barely-coherent odds and ends thrown together by natural selection.

Of course, reaching that level of alignedness remains a very hard task, hence the whole AI alignment problem.

comment by Adam Jermyn (adam-jermyn) · 2022-06-07T18:29:18.482Z · LW(p) · GW(p)

I'm not quite sure what this means. As I understand it humans are not aligned with evolution's implicit goal of "maximizing genetic fitness" but humans are (definitionally) aligned with human values. And e.g. many humans are aligned with core values like "treat others with dignity".

Importantly, capability and alignment are sort of orthogonal. The consequences of misaligned AI get worse the more capable it is, but it seems possible to have aligned superhuman AI, as well as horribly misaligned weak AI.

Replies from: gjm
comment by gjm · 2022-06-07T23:45:18.558Z · LW(p) · GW(p)

It is not definitionally true that individual humans are aligned with overall human values or with other individual humans' values. Further, it is proverbial (and quite possibly actually true as well) that getting a lot of power tends to make humans less aligned with those things. "Power corrupts; absolute power corrupts absolutely."

I don't know whether it's true, but it sure seems like it might be, that the great majority of humans, if you gave them vast amounts of power, would end up doing disastrous things with it. On the other hand, probably only a tiny minority would actually wipe out the human race or torture almost everyone or commit other such atrocities, which makes humans more aligned than e.g. Eliezer expects AIs to be in the absence of dramatic progress in the field of AI alignment.

Replies from: JBlack
comment by JBlack · 2022-06-08T03:51:47.370Z · LW(p) · GW(p)

I think a substantial part of human alignment is that humans need other humans in order to maintain their power. We have plenty of examples of humans being fine with torturing or killing millions of other humans when they have the power to do so, but torturing or killing almost all humans in their sphere of control is essentially suicide. This means that purely instrumentally, human goals have required that large numbers of humans continue to exist and function moderately well.

A superintelligent AI is primarily a threat due to the near certainty that it can devise means for maintaining power that are independent of human existence. Humans can't do that by definition, and not due to anything about alignment.

Replies from: Valentine
comment by Valentine · 2022-06-09T04:33:20.544Z · LW(p) · GW(p)

Okay, so… does anyone have any examples of anything at all, even fictional or theoretical, that is "aligned"? Other than tautological examples like "FAI" or "God".

comment by Noosphere89 (sharmake-farah) · 2022-06-07T19:20:28.006Z · LW(p) · GW(p)

This. Combine this fact with the non-trivial chance that moral values are subjective, not objective, and there is little good reason to be doing alignment.

Replies from: MinusGix, AprilSR
comment by MinusGix · 2022-06-07T21:26:05.248Z · LW(p) · GW(p)

While human moral values are subjective, there is a sufficiently large shared amount that you can target at aligning an AI to that. As well, values held by a majority (ex: caring for other humans, enjoying certain fun things) are also essentially shared. Values that are held by smaller groups can also be catered to. 

If humans were sampled from the entire space of possible values, then yes we (maybe) couldn't build an AI aligned to humanity, but we only take up a relatively small space and have a lot of shared values. 

comment by AprilSR · 2022-06-07T20:51:12.403Z · LW(p) · GW(p)

So do you think that instead we should just be trying to not make an AGI at all?

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2022-06-07T23:43:49.864Z · LW(p) · GW(p)

Not really. I do want to make an AGI, primarily because I very much want a singularity, as it represents hope to me, and I have very different priors than Eliezer or MIRI about how doomed we are.

Replies from: AprilSR
comment by AprilSR · 2022-06-08T16:02:04.027Z · LW(p) · GW(p)

So you think that, since morals are subjective, there is no reason to try to make an effort to control what happens after the singularity? I really don't see how that follows.

comment by Chris_Leong · 2022-06-07T12:49:18.700Z · LW(p) · GW(p)

Just as a comment, the Stampy Wiki is also trying to do the same thing, but this is a good idea too, as it's more convenient for many people to ask on Less Wrong.

Replies from: ete
comment by plex (ete) · 2022-06-08T14:28:10.829Z · LW(p) · GW(p)

Yup, we might want to have these as regular threads with a handy link to Stampy.

comment by Lone Pine (conor-sullivan) · 2022-06-08T03:42:08.654Z · LW(p) · GW(p)

What is the justification behind the concept of a decisive strategic advantage? Why do we think that a superintelligence can do extraordinary things (hack human minds, invent nanotechnology, conquer the world, kill everyone in the same instant) when nations and corporations can't do those things?

(Someone else asked a similar question, but I wanted to ask in my own words.)

Replies from: delesley-hutchins, AnthonyC, lc, yonatan-cale-1, elityre, awenonian
comment by DeLesley Hutchins (delesley-hutchins) · 2022-06-09T16:49:05.611Z · LW(p) · GW(p)

I think the best justification is by analogy.  Humans do not physically have a decisive strategic advantage over other large animals -- chimps, lions, elephants, etc.  And for hundreds of thousands of years, we were not at the top of the food chain, despite our intelligence.  However, intelligence eventually won out, and allowed us to conquer the planet.

Moreover, the benefit of intelligence increased exponentially in proportion to the exponential advance of technology.  There was a long, slow burn, followed by what (on evolutionary timescales) was an extremely "fast takeoff": a very rapid improvement in technology (and thus power) over only a few hundred years.  Technological progress is now so rapid that human minds have trouble keeping up within a single lifetime, and genetic evolution has been left in the dust.

That's the world into which AGI will enter -- a technological world in which a difference in intellectual ability can be easily translated into a difference in technological ability, and thus power.  We must assume that an AGI will master any future technology that the laws of physics don't explicitly prohibit faster than we can.

comment by AnthonyC · 2022-06-12T16:16:02.421Z · LW(p) · GW(p)

Someone else already commented on how human intelligence gave us a decisive strategic advantage over our natural predators and many environmental threats. I think this cartoon is my mental shorthand for that transition. The timescale is on the order of 10k-100k years, given human intelligence starting from the ancestral environment.

Empires and nations, in turn, conquered the world by taking it away from city-states and similarly smaller entities in ~1k-10k years. The continued existence of Singapore and the Sentinel Islanders doesn't change the fact that a modern large nation could wipe them out in a handful of years, at most, if we really wanted to. We don't because doing so is not useful, but the power exists.

Modern corporations don't want to control the whole world. Like Fnargl, that's not what they're pointed at. But it only took a few decades for Walmart to displace a huge swath of the formerly-much-more-local retail market, and even fewer decades for Amazon to repeat a similar feat online, each starting from a good set of ideas and a much smaller resource base than even the smallest nations. And while corporations are militarily weak, they have more than enough economic power to shape the laws of at least some of the nations that host them in ways that let them accumulate more power over time.

So when I look at history, I see a series of major displacements of older systems by newer ones, on faster and faster timescales, using smaller and smaller fractions of our total resource base, all driven by our accumulation of better ideas and using those ideas to accumulate wealth and power. All of this has been done with brains no smarter, natively, than what we had 10k years ago - there hasn't been time for biological evolution to do much, there. So why should that pattern suddenly stop being true when we introduce a new kind of entity with even better ideas than the best strategies humans have ever come up with? Especially when human minds have already demonstrated a long list of physically-possible scenarios that might, if enacted, kill everyone in a short span of time, or at least disrupt us enough to buy time to mop up the survivors?

comment by lc · 2022-06-08T03:47:54.744Z · LW(p) · GW(p)

Here's a youtube video about it.

Replies from: conor-sullivan, Sune, conor-sullivan
comment by Lone Pine (conor-sullivan) · 2022-06-08T04:19:50.630Z · LW(p) · GW(p)

Having watched the video, I can't say I'm convinced. I'm 50/50 on whether DSA is actually possible with any level of intelligence at all. If it isn't possible, then doom isn't likely (not impossible, but unlikely), in my view.

Replies from: ete
comment by plex (ete) · 2022-06-10T13:56:46.488Z · LW(p) · GW(p)

This post [LW · GW] by the director of OpenPhil argues that even a human level AI could achieve DSA, with coordination.

comment by Sune · 2022-06-08T21:03:26.877Z · LW(p) · GW(p)

tldw: corporations are as slow as or slower than humans; AIs can be much faster

comment by Lone Pine (conor-sullivan) · 2022-06-08T03:49:36.739Z · LW(p) · GW(p)

Thanks, love Robert Miles.

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T22:07:17.000Z · LW(p) · GW(p)

The informal way I think about it:

What would I do if I was the AI, but I had 100 copies of myself, and we had 100 years to think for every 1 second that passed in reality.  And I had internet access.

Do you think you could take over the world from that opening?

Edit: And I have access to my own source code, but I only dare do things like fix my motivational problems and make sure I don't get bored during all that time, things like that.

comment by Eli Tyre (elityre) · 2022-06-08T23:38:06.991Z · LW(p) · GW(p)

Do you dispute that this is possible in principle or just that we won't get AI that powerful or something else? 

It seems to me that there is some level of intelligence at which an agent is easily able to out-compete the whole rest of human civilization. What exactly that level of intelligence is, is somewhat unclear (in large part because we don't really have a principled way to measure "intelligence" in general: psychometrics describe variation in human cognitive abilities, but that doesn't really give us a measuring stick for thinking about how "intelligent", in general, something is).

Does that seem right to you, or should we back up and build out why that seems true to me?

Replies from: conor-sullivan
comment by Lone Pine (conor-sullivan) · 2022-06-09T00:33:57.739Z · LW(p) · GW(p)

It seems to me that there is some level of intelligence at which an agent is easily able to out-compete the whole rest of human civilization.

This is the statement I disagree with, in particular the word "easily". I guess the crux of this debate is how powerful we think any level of intelligence is. There have to be some limits, in the same way that even the wealthiest people in history could not forestall their own deaths no matter how much money or medical expertise was applied.

Replies from: elityre, gilch, sharps030
comment by Eli Tyre (elityre) · 2022-06-09T08:37:17.719Z · LW(p) · GW(p)

I'm not compelled by that analogy. There are lots of things that money can't buy [LW · GW], but that (sufficient) intelligence can. 

There are theoretical limits to what cognition is able to do, but those are so far from the human range that they're not really worth mentioning. The question is: "are there practical limits to what an intelligence can do, that leave even a super-intelligence uncompetitive with human civilization?"

It seems to me that as an example, you could just take a particularly impressive person (Elon Musk or John von Neumann are popular exemplars) and ask "What if there was a nation of only people who were that capable?" It seems that if a nation of say 300,000,000 Elon Musks went to war with the United States, the United States would lose handily. Musktopia would just have a huge military-technological advantage: they would do fundamental science faster, and develop engineering innovations faster, and have better operational competence than the US, on ~ all levels. (I think this is true for a much smaller number than 300,000,000; having a number that high makes the point straightforward.)

Does that seem right to you? If not, why not?

Or alternatively, what do you make of vignettes like That Alien Message [LW · GW]?

Replies from: Dirichlet-to-Neumann
comment by Dirichlet-to-Neumann · 2022-06-10T08:26:35.654Z · LW(p) · GW(p)

I don't think a nation of Musks would win against the current USA, because Musk is optimised for some things (making an absurd amount of money, CEOing, tweeting his shower thoughts), but an actual war requires a rather more diverse set of capacities.

Similarly, I don't think an AGI would necessarily win a war of extermination against us, because currently (emphasis on currently) it would need us to run its infrastructure. This would change in a world where all industrial tasks could be carried out without physical input from humans, but we are not there yet and will not be soon.

comment by gilch · 2022-06-19T15:46:04.340Z · LW(p) · GW(p)

Did you see the new one about Slow motion videos as AI risk intuition pumps [LW · GW]?

Thinking of ourselves like chimpanzees while the AI is the humans is really not the right scale: computers operate so much faster than humans, we'd be more like plants than animals to them. When there are all of these "forests" of humans just standing around, one might as well chop them down and use the materials to build something more useful.

This is not exactly a new idea. Yudkowsky already likened the FOOM to setting off a bomb, but the slow-motion video was a new take.

Replies from: conor-sullivan
comment by Lone Pine (conor-sullivan) · 2022-06-19T17:46:43.179Z · LW(p) · GW(p)

Yes I did, in fact I was active in the comments section.

It's a good argument and I was somewhat persuaded. However, there are some things to disagree with. For one thing, there is no reason to believe that early AGI actually will be faster or even as fast as humans on any of the tasks that AIs struggle with today. For example, almost all videos of novel robotics applications research are sped up, sometimes hundreds of times. If SayCan can't deliver a wet sponge in less than a minute, why do we think that early AGI will be able to operate faster than us? (I was going to reply to that post with this objection, but other people beat me to it.)

comment by hazel (sharps030) · 2022-06-09T04:12:53.258Z · LW(p) · GW(p)

There have to be some limits

Those limits don't have to be nearby, or look 'reasonable', or be inside what you can imagine. 

Part of the implicit background for the general AI safety argument is a sense for how minds could be, and that the space of possible minds is large and unaccountably alien. Eliezer spent some time trying to communicate this in the sequences. [LW · GW]

comment by awenonian · 2022-06-09T15:16:54.901Z · LW(p) · GW(p)

This is the sequence post on it [LW · GW]; it's quite a fun read (to me), and should explain why something smart that thinks at transistor speeds should be able to figure things out.

For inventing nanotechnology, the given example is AlphaFold 2.

For killing everyone in the same instant with nanotechnology, Eliezer often references Nanosystems by Eric Drexler. I haven't read it, but I expect the insight is something like "Engineered nanomachines could do a lot more than those limited by designs that have a clear evolutionary path from chemicals that can form randomly in the primordial ooze of Earth."

For how a system could get that smart, the canonical idea is recursive self improvement (i.e. an AGI capable of learning AGI engineering could design better versions of itself, which could in turn better design better versions, etc, to whatever limit.). But more recent history in machine learning suggests you might be able to go from sub-human to overwhelmingly super-human just by giving it a few orders of magnitude more compute, without any design changes.

comment by Oleg S. · 2022-06-08T14:45:39.517Z · LW(p) · GW(p)

How does an AGI solve its own alignment problem?

For alignment to work, its theory should not only tell humans how to create an aligned super-human AGI, but also tell the AGI how to self-improve without destroying its own values. A good alignment theory should work across all intelligence levels. Otherwise, how does a paperclip optimizer which is marginally smarter than a human make sure that its next iteration will still care about paperclips?

Replies from: ete
comment by plex (ete) · 2022-06-08T15:17:25.487Z · LW(p) · GW(p)

Excellent question! MIRI's entire Vingean reflection paradigm is about stability of goals under self-improvement and designing successors.

Replies from: Oleg S.
comment by Oleg S. · 2022-07-02T16:12:07.719Z · LW(p) · GW(p)

Just realized that stability of goals under self-improvement is kinda similar to stability of goals of mesa-optimizers, so the Vingean reflection paradigm and the mesa-optimization paradigm should fit together.

comment by wachichornia · 2022-06-07T14:46:49.056Z · LW(p) · GW(p)

If Eliezer is pretty much convinced we're doomed, what is he up to?

Replies from: adamzerner, yonatan-cale-1
comment by Adam Zerner (adamzerner) · 2022-06-07T16:35:09.784Z · LW(p) · GW(p)

I'm not sure how literally to take this, given that it comes from an April Fools Day post, but consider this excerpt from Q1 of MIRI announces new "Death With Dignity" strategy [LW · GW].

That said, I fought hardest while it looked like we were in the more sloped region of the logistic success curve, when our survival probability seemed more around the 50% range; I borrowed against my future to do that, and burned myself out to some degree. That was a deliberate choice, which I don't regret now; it was worth trying, I would not have wanted to die having not tried, I would not have wanted Earth to die without anyone having tried. But yeah, I am taking some time partways off, and trying a little less hard, now. I've earned a lot of dignity already; and if the world is ending anyways and I can't stop it, I can afford to be a little kind to myself about that.

When I tried hard and burned myself out some, it was with the understanding, within myself, that I would not keep trying to do that forever. We cannot fight at maximum all the time, and some times are more important than others. (Namely, when the logistic success curve seems relatively more sloped; those times are relatively more important.)

All that said: If you fight marginally longer, you die with marginally more dignity. Just don't undignifiedly delude yourself about the probable outcome.

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T22:00:25.542Z · LW(p) · GW(p)

I think he's burned out and took a break to write a story (but I don't remember where this belief came from. Maybe I'm wrong? Maybe from here [LW · GW]?)

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T22:03:46.019Z · LW(p) · GW(p)

I do find it funny/interesting that he wrote a story in the length [LW · GW] of the entire Harry Potter series, in a few months, as a way to relax and rest.

Too bad we have this AGI problem keeping him busy, ha? :P

comment by Adam Zerner (adamzerner) · 2022-06-07T18:45:38.898Z · LW(p) · GW(p)
  1. We can't just "decide not to build AGI" because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world. The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit - it does not lift it, unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth. The current state of this cooperation to have every big actor refrain from doing the stupid thing, is that at present some large actors with a lot of researchers and computing power are led by people who vocally disdain all talk of AGI safety (eg Facebook AI Research). Note that needing to solve AGI alignment only within a time limit, but with unlimited safe retries for rapid experimentation on the full-powered system; or only on the first critical try, but with an unlimited time bound; would both be terrifically humanity-threatening challenges by historical standards individually.

Note in particular this part:

unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth.

Research on computer hardware takes lots and lots of money and big buildings, right? I.e. it's not the type of thing someone can do in their basement? If so, it seems like, at least in theory, it can be regulated by governments, assuming they wanted to make a real effort at it. Is that true? If so, it seems like a point that is worth establishing.

(From there, the question of course becomes whether we can convince governments to do so. If that is impossible then I guess it doesn't matter if it's possible for them to regulate. Still, I feel like it is helpful to think about the two questions separately.)

Replies from: elityre, lc
comment by Eli Tyre (elityre) · 2022-06-08T23:49:19.920Z · LW(p) · GW(p)

I think there are a bunch of political problems with regulating all computer hardware progress enough to cause it to totally cease. Think how crucial computers are to the modern world. Really a lot of people will be upset if we stop building them, or stop making better ones. And if one country stops, that just creates an incentive for other countries to step in to dominate this industry. And even aside from that, I don't think that there's any regulator in the US at least that has enough authority and internal competence to be able to pull this off. More likely, it becomes a politicized issue. (Compare to the much more straightforward and much more empirically-grounded regulation of instituting a carbon tax for climate change. This is a simple idea, that would help a lot, and is much less costly to the world than halting hardware progress. But instead of being universally adopted, it's a political issue that different political factions support or oppose.)

But even if we could, this doesn't solve the problem in a long term way. You need to also halt software progress. Otherwise we'll continue to tinker with AI designs until we get to some that can run efficiently on 2020's computers (or 1990's computers, for that matter).

So in the long run, the only thing in this class that would straight up prevent AGI from being developed is a global, strictly enforced ban on computers. Which seems...not even remotely on the table, on the basis of arguments that are as theoretical as those for AI risk. 

There might be some plans in this class that help, by delaying the date of AGI. But that just buys time for some other solution to do the real legwork.

Replies from: adamzerner
comment by Adam Zerner (adamzerner) · 2022-06-09T02:22:10.466Z · LW(p) · GW(p)

The question here is whether they are capable of regulating it, assuming that they are convinced and want to regulate it. It is possible that convincing them is so incredibly unlikely that it isn't worth talking about whether they're capable of it. I don't suspect that to be the case, but wouldn't be surprised if I were wrong.

comment by lc · 2022-06-07T22:52:38.592Z · LW(p) · GW(p)

From there, the question of course becomes whether we can convince governments to do so. If that is impossible then I guess it doesn't matter

Unfortunately we cannot in fact convince governments to shut down AWS & crew. There are intermediary positions I think are worthwhile, but unfortunately ending all AI research is outside the Overton window for now.

comment by Adam Zerner (adamzerner) · 2022-06-07T21:13:16.433Z · LW(p) · GW(p)

There are a lot of smart people outside of "the community" (AI, rationality, EA, etc.). To throw out a name, say Warren Buffett. It seems that an incredibly small number of them are even remotely as concerned about AI as we are. Why is that?

I suspect that a good number of people, both inside and outside of our community, observe that the Warren Buffetts of the world aren't panicking, and then adopt that position themselves.

Replies from: lc, Chris_Leong, delesley-hutchins, adrian-arellano-davin
comment by lc · 2022-06-07T23:16:46.224Z · LW(p) · GW(p)

Most high status people, including Warren Buffett, straightforwardly haven't considered these issues much. However, among the ones I've heard of who have bothered to weigh in on the issue, like Stephen Hawking, Bill Gates, Demis Hassabis, etc., they do seem to come down on the side of "this is a serious problem". On the other hand, some of them get tripped up on one of the many intellectual land mines, like Yann LeCun.

I don't think that's unexpected. Intellectual land mines exist, and complicated arguments like the ones supporting AGI risk prevention are bound to cause people to make wrong decisions.

Replies from: adamzerner, Sune
comment by Adam Zerner (adamzerner) · 2022-06-08T16:20:14.411Z · LW(p) · GW(p)

Most high status people, including Warren Buffett, straightforwardly haven't considered these issues much.

Not that I think you're wrong, but what are you basing this off of and how confident are you?

However, among the ones I've heard of who have bothered to weigh in on the issue, like Stephen Hawking, Bill Gates, Demis Hassabis, etc., they do seem to come down on the side of "this is a serious problem".

I've heard this too, but at the same time I don't see any of them spending even a small fraction of their wealth on working on it, in which case I think we're back to the original question: why the lack of concern?

On the other hand, some of them get tripped up on one of the many intellectual land mines, like Yann LeCun. I don't think that's unexpected. Intellectual land mines exist, and complicated arguments like the ones supporting AGI risk prevention are bound to cause people to make wrong decisions.

Yeah, agreed. I'm just confused about the extent of it. I'd expect a lot, perhaps even a majority, of "outsider" smart people to get tripped up by intellectual land mines, but instead of it being 60% of these people, it feels like it's 99.99%.

comment by Sune · 2022-06-08T16:55:16.708Z · LW(p) · GW(p)

Can you be more specific about what you mean by “intellectual landmines”?

comment by Chris_Leong · 2022-06-08T06:01:51.995Z · LW(p) · GW(p)

For the specific example of Warren Buffett, I suspect that he hasn't spent much time thinking about it, nor does he feel much compulsion to understand the topic, since he doesn't currently see it as a threat. I know he doesn't really invest in tech, because he doesn't feel that he understands it sufficiently, so I wouldn't be surprised if his position were along the lines of "I don't really understand it; let others who can understand it think about it".

comment by DeLesley Hutchins (delesley-hutchins) · 2022-06-09T18:06:11.020Z · LW(p) · GW(p)

People like Warren Buffett have made their fortune by assuming that we will continue to operate with "business as usual".  Warren Buffett is a particularly bad person to list as an example for AGI risk, because he is famously technology-averse; as an investor, he missed most of the internet revolution (Google/Amazon/Facebook/Netflix) as well.

But in general, most people, even very smart people, naturally assume that the world will continue to operate the way it always has, unless they have a very good reason to believe otherwise.  One cannot expect non-technically-minded people who have not examined the risks of AGI in detail to be concerned.

By analogy, the risks of climate change have been very well established scientifically (much more so than AGI), those risks are relatively severe, the risks have been described in detail every 5 years in IPCC reports, there is massive worldwide scientific consensus, lots and LOTS of smart people are extremely worried, and yet the Warren Buffetts of the world still continue with business as usual anyway.  There's a lot of social inertia.

Replies from: adamzerner
comment by Adam Zerner (adamzerner) · 2022-06-09T20:12:12.366Z · LW(p) · GW(p)

When I say smart people, I am trying to point to intelligence that is general [LW · GW] instead of narrow. Some people are really good at, e.g., investing but not actually good at other things. That would be a narrow intelligence. A general intelligence, to me, is where you have more broadly applicable skills.

Regarding Warren Buffett, I'm not actually sure if he is a good example or not. I don't know too much about him. Ray Dalio is probably a good example.

comment by mukashi (adrian-arellano-davin) · 2022-06-07T21:32:36.526Z · LW(p) · GW(p)

One reason might be that AGIs are really not that concerning, and the EA/rationality community has developed a mistaken model of the world that assigns a much higher probability to doom by AGI than it should, and those smart people outside the group do not hold the same beliefs.

Replies from: lc
comment by lc · 2022-06-07T23:18:24.973Z · LW(p) · GW(p)

Generally speaking, they haven't really thought about these risks in detail, so the fact that they don't hold "the MIRI position" is not really as much evidence as you'd think.

comment by LVSN · 2022-06-08T03:21:28.182Z · LW(p) · GW(p)

I came up with what I thought was a great babby's first completely unworkable solution to CEV alignment, and I want to know where it fails.

So, first I need to lay out the capabilities of the AI. The AI would be able to model human intuitions, hopes, and worries. It can predict human reactions. It has access to all of human culture and art, and models human reactions to that culture and art, and sometimes tests those predictions. Very importantly, it must be able to model veridical paradoxes and veridical harmonies between moral intuitions and moral theorems which it has derived. It is aiming to have the moral theory with the fewest paradoxes. It must also be capable of predicting and explaining outcomes of its plans, gauging the deepest nature of people's reactions to its plans, and updating its moral theories according to those reactions.

Instead of being democratic and following the human vote by the letter, it attempts to create the simplest theories of observed and self-reported human morality by taking everything it knows into consideration.

It has separate stages of deliberation and action, which are part of a game, and rather than having a utility function as its primary motivation, it is simply programmed to love playing this game that it conceives itself to be playing, and following its rules to their logical conclusion, no matter where it takes them. To put it abstractly, it is a game of learning human morality and being a good friend to humanity.

Before I get into details of the game, I want to stress that the game I am describing is a work in progress, and it may be of value to my audience to consider how they might make the game more robust if they come up with some complaint about it. By design, whatever complaint you have about the game is a complaint that the AI would take into consideration as part of some stage of the game.

Okay, so here's the rough procedure of the game it's playing; right now as I describe it, it's a simple algorithm made of two loops:

Process 1:
1. Hear humanity's pleas (intuitions+hopes+worries) -> 
2. Model harmonies and veridical paradoxes of the pleas (Observation phase) -> 
3. Develop moral theories according to those models ->
4a. Explain moral theories to general humanity -> 
5a. Gauge human reactions -> 
6a. Explain expected consequences of moral theory execution to humans -> 
7a. Gauge human reactions ->
8. Retrieve virtuous person approval rating (see process #2 below) and combine with general approval rating -> 
9. Loop back two times at the Observation Phase (step 2) then move on to step 10 ->
10. Finite period of action begins when a threshold of combined approval is reached; if approval threshold is not reached, this step does nothing -> 
1. Begin loop again from step 1

Process 2:
1. Hear humanity's pleas -> 
2. Model harmonies and veridical paradoxes of the pleas (Observation phase) -> 
3. Develop moral theories according to those models -> 
4b. Update list of virtuous humans according to moral theories ->
5b. Retrieve current plan from Process 1 -> 
6b. Model virtuous human reactions and approval rating to current plan -> 
7b. Return Virtuous_Person_Approval_Rating to step 8 of process 1

So, how might this algorithm fail, in a way that you can't also just explain to my proposed AI so that it will consider the failure and re-orient its moral philosophy? Remember that, again, its moral philosophy must gain a combined approval rating from moral experts and the general populace before it can implement its moral theories.

The AI, since it is programmed just to play this game, will be happy to re-orient the course of its existing "moral philosophy". I use scare quotes because its real morality is to just play this learning-and-implementing-with-approval morality game, and it cares not for the outcomes.
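
The control flow of the two processes above can be sketched as a toy program. To be clear about what this shows: everything substantive (modeling pleas, identifying virtuous people, gauging reactions) is stubbed out with placeholders, the threshold value is invented, and only the loop-and-threshold structure from the numbered steps is real.

```python
# Toy sketch of the two-loop "morality game" described above. Every stub
# is a placeholder for an unsolved capability, not an implementable function.

APPROVAL_THRESHOLD = 0.9  # hypothetical; the game as described names no number

def general_approval(plan):
    """Stub for steps 4a-7a: explain theories and consequences, gauge reactions."""
    return 0.5

def virtuous_approval(plan):
    """Stub for Process 2: modeled approval of the current 'virtuous humans' list."""
    return 0.5

def deliberate(pleas):
    """Steps 1-3, looped back twice at the observation phase (step 9)."""
    plan = None
    for _ in range(3):
        models = ("harmonies+paradoxes", tuple(pleas))  # step 2 stub
        plan = ("simplest-theory-for", models)          # step 3 stub
    return plan

def game_round(pleas):
    plan = deliberate(pleas)
    combined = (general_approval(plan) + virtuous_approval(plan)) / 2  # step 8
    if combined >= APPROVAL_THRESHOLD:   # step 10: finite period of action
        return ("act", plan)
    return ("defer", plan)               # threshold unmet: do nothing, loop again

print(game_round(["reduce suffering"])[0])  # -> defer
```

Writing even this stub makes the open questions visible: what type a "plea" has, how the two approval signals should be combined, and where the threshold should sit are exactly the parts left blank.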

Replies from: awenonian
comment by awenonian · 2022-06-09T15:35:04.126Z · LW(p) · GW(p)

The quickest I can think of is something like "What does this mean?" Throw this at every part of what you just said.

For example: "Hear humanity's pleas (intuitions+hopes+worries)" What is an intuition? What is a hope? What is a worry? How does it "hear"? 
Do humans submit English text to it? Does it try to derive "hopes" from that? Is that an aligned process?

An AI needs to be programmed, so you have to think like a programmer. What is the input and output type of each of these (e.g. "Hear humanity's pleas" takes in text, and outputs... what? Hopes? What does a hope look like if you have to represent it to a computer?).

I kinda expect that the steps from "Hear humanity's pleas" to "Develop moral theories" relies on some magic that lets the AI go from what you say to what you mean. Which is all well and good, but once you have that you can just tell it, in unedited English "figure out what humanity wants, and do that" and it will. Figuring out how to do that is the heart of alignment.
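
To illustrate the point about types (this is an illustration, not a proposal), here is what merely declaring the data structures might look like; every field name is hypothetical, and the empty returns mark where the actual work would have to go:

```python
# Hypothetical types for "Hear humanity's pleas". The easy part is the raw
# text; the hard part is what structure an "intuition" or "hope" has, such
# that a program could compare two of them or detect a veridical paradox.

from dataclasses import dataclass, field

@dataclass
class Plea:
    raw_text: str                                    # humans submit English text: easy
    intuitions: list = field(default_factory=list)   # ...a list of what, exactly?
    hopes: list = field(default_factory=list)
    worries: list = field(default_factory=list)

def hear(raw_text: str) -> Plea:
    # Going from text to the fields above is where the alignment-relevant
    # work hides; returning them empty makes the gap explicit instead of
    # papering over it with magic.
    return Plea(raw_text)

print(hear("I want to be happy").hopes)  # -> []
```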

Replies from: LVSN
comment by LVSN · 2022-06-10T02:36:17.901Z · LW(p) · GW(p)

Do humans submit English text to it?


I think the AI could "try to figure out what you mean" by trying to diagnose the reasons why you're saying it, as well as the reasons you'd want to be saying it for, and the reasons you'd have if you were as virtuous as you'd probably like to be, etc. It can have some best guesses about these based on what it knows about humans, and all the subtypes of human that you appear to be, and all the subtypes of those subtypes which you seem to be, and so on.

These are just guesses, and it would, at parts 4a and 6a, explain to people its best guesses about the full causal structure which leads to people's morality/shouldness-related speech. Then it gauges people's reactions, and updates its guesses (simplest moral theories) based on those reactions. And finally it requires an approval rating before acting, so if it definitely misinterprets human morality, it just loops back to the start of the process again, and its guesses will keep improving through each loop until its best guess at human morality reaches sufficient approval.

The AI wouldn't know with certainty what humans want best, but it would make guesses which are better-educated than humans are capable of making.

Replies from: awenonian
comment by awenonian · 2022-06-10T15:28:11.732Z · LW(p) · GW(p)

Again, what is a "reason"? More concretely, what is the type of a "reason"? You can't program an AI in English, it needs to be programmed in code. And code doesn't know what "reason" means.

It's not exactly that your plan "fails" anywhere particularly. It's that it's not really a plan. CEV says "Do what humans would want if they were more the people they want to be." Cool, but not a plan. The question is "How?" Your answer to that is still under specified. You can tell by the fact you said things like "the AI could just..." and didn't follow it with "add two numbers" or something simple (we use the word "primitive"), or by the fact you said "etc." in a place where it's not fully obvious what the rest actually would be. If you want to make this work, you need to ask "How?" to every single part of it, until all the instructions are binary math. Or at least something a python library implements.

Replies from: LVSN
comment by LVSN · 2022-06-10T21:37:51.217Z · LW(p) · GW(p)

I don't think it's the case that you're telling me that the supposedly monumental challenge of AI alignment is simply that of getting computers to understand more things, such as what things are reasons, intuitions, hopes, and worries. I feel like these are just gruntwork things and not hard problems. 

Look, all you need to do to get an AI which understands what intuitions, reasons, hopes, and worries are is to tell everyone very loudly and hubristically that AIs will never understand these things and that's what makes humans irreplaceable. Then go talk to whatever development team is working on proving that wrong, and see what their primitive methods are. Better yet, just do it yourself because you know it's possible.

I am not fluent in computer science so I can't tell you how to do it, but someone does know how to make it so.

Edit: In spite of what I wrote here, I don't think it's necessary that humans should ensure specifically that the AI understands in advance what intuitions, hopes, or worries are, as opposed to all the other mental states humans can enter. Rather, there should be a channel where you type your requests/advice/shouldness-related-speech, and people are encouraged to type their moral intuitions, hopes, and worries there, and the AI just interprets the nature of the messages using its general models of humans as context.

Replies from: awenonian
comment by awenonian · 2022-06-11T19:17:22.349Z · LW(p) · GW(p)

No, they really don't. I'm not trying to be insulting. I'm just not sure how to express the base idea.

The issue isn't exactly that computers can't understand this, specifically. It's that no one understands what those words mean enough. Define reason. You'll notice that your definition contains other words. Define all of those words. You'll notice that those are made of words as well. Where does it bottom out? When have you actually, rigorously, objectively defined these things? Computers only understand that language, but the fact that a computer wouldn't understand your plan is just illustrative of the fact that it is not well defined. It just seems like it is, because you have a human brain that fills in all the gaps seamlessly. So seamlessly you don't even notice that there were gaps that need filling.

This is why there's an emphasis on thinking about the problem like a computer programmer. Misalignment thrives in those gaps, and if you gloss over them, they stay dangerous. The only way to be sure you're not glossing over them is to define things with something as rigorous as Math. English is not that rigorous.

Replies from: LVSN
comment by LVSN · 2022-06-11T22:17:08.085Z · LW(p) · GW(p)

I think some near future iteration of GPT, if it is prompted to be a really smart person who understands A Human's Guide to Words [? · GW], would be capable of giving explanations of the meanings of words just as well as humans can. That seems fine enough for the purposes of recognizing when people are telling it their intuitions, hopes, and worries; fine enough for trying to come up with best explanations of people's shouldness-related speech; fine enough for coming up with moral theories which [solve the most objections]/[have the fewest paradoxes]; and fine enough for explaining plans which those moral theories prescribe.

On a side note, and I'm not sure if this is a really useful analogy, but I wonder what would happen if the parameters of some future iteration of GPT included the sort of parameters that A Human's Guide to Words installs into human brains.

Replies from: awenonian, megasilverfist
comment by awenonian · 2022-06-13T13:41:41.325Z · LW(p) · GW(p)

I'm not sure this is being productive. I feel like I've said the same thing over and over again. But I've got one more try: Fine, you don't want to try to define "reason" in math. I get it, that's hard. But just try defining it in English. 

If I tell the machine "I want to be happy." And it tries to determine my reason for that, what does it come up with? "I don't feel fulfilled in life"? Maybe that fits, but is it the reason, or do we have to go back more: "I have a dead end job"? Or even more "I don't have enough opportunities"? 

Or does it go a completely different tack and say my reason is "My pleasure centers aren't being stimulated enough" or "I don't have enough endorphins."

Or, does it say the reason I said that was because my fingers pressed keys on a keyboard.

To me, as a human, all of these fit the definition of "reasons." And I expect they could all be true. But I expect some of them are not what you mean. And not even in the sense of some of them being a different definition for "reason." How would you try to divide what you mean and what you don't mean?

Then do that same thought process on all the other words.

Replies from: LVSN
comment by LVSN · 2022-06-14T05:51:18.545Z · LW(p) · GW(p)

By "reason" I mean something like psychological, philosophical, and biological motivating factors; so, your fingers pressing the keys wouldn't be a reason for saying it. 

I don't claim that this definition is robust to all of objection-space, and I'm interested in making it more robust as you come up with objections, but so far I find it simple and effective. 

The AI does not need to think that there was only one real reason why you do things; there can be multiple, of course.

Also I do recognize that my definition is made up of more words, but I think it's reasonable that a near-future AI could infer from our conversation that kind of definition which I gave, and spit it out itself. Similarly it could probably spit out good definitions for the compound words "psychological motivation," "philosophical motivation," and "biological motivation".

Also also this process whereby I propose a simple and effective yet admittedly objection-vulnerable definition, and you provide an objection which my new definition can account for, is not a magical process and is probably automatable.

Replies from: awenonian
comment by awenonian · 2022-06-14T14:50:28.760Z · LW(p) · GW(p)

It seems simple and effective because you don't need to put weight on it. We're talking a superintelligence, though. Your definition will not hold when the weight of the world is on it.

And the fact that you're just reacting to my objections is the problem. My objections are not the ones that matter. The superintelligence's objections are. And it is, by definition, smarter than me. If your definition is not something like provably robust, then you won't know if it will hold to a superintelligent objection. And you won't be able to react fast enough to fix it in that case.

You can't bandaid a solution into working, because if a human can point out a flaw, you should expect a superintelligence to point out dozens, or hundreds, or thousands. 

I don't know how else to get you to understand this central objection. Robustness is required. Provable robustness is, while not directly required, kinda the only way we can tell if something is actually robust.

Replies from: LVSN
comment by LVSN · 2022-06-15T07:28:13.919Z · LW(p) · GW(p)

I think this is almost redundant to say: the objection that superintelligences will be able to notice more of objection-space and account for it makes me more inclined to trust it. If a definition is more objection-solved than some other definition, that is the definition I want to hold. If the human definition is more objectionable than a non-human one, then I don't want the human definition.

Replies from: awenonian
comment by awenonian · 2022-06-15T13:48:13.915Z · LW(p) · GW(p)

I think you missed the point. I'd trust an aligned superintelligence to solve the objections. I would not trust a misaligned one. If we already have an aligned superintelligence, your plan is unnecessary. If we do not, your plan is unworkable. Thus, the problem.

If you still don't see that, I don't think I can make you see it. I'm sorry.

Replies from: LVSN
comment by LVSN · 2022-06-15T14:19:03.579Z · LW(p) · GW(p)

I proposed a strategy for an aligned AI that involves it terminally valuing following the steps of a game: talking with us about morality, creating moral theories with the fewest paradoxes, creating plans which are prescribed by the moral theories, and getting approval for the plans.

You objected that my words-for-concepts were vague. 

I replied that near-future AIs could make as-good-as-human-or-better definitions, and that the process of [putting forward as-good-as-human definitions, finding objections for them, and then improving the definition based on considered objections] was automatable. 

You said the AI could come up with many more objections than you would.

I said, "okay, good." I will add right now: just because it considers an objection, doesn't mean the current definition has to be rejected; it can decide that the objections are not strong enough, or that its current definition is the one with the fewest/weakest objections.

Now I think you're saying something like that it doesn't matter if the AI can come up with great definitions if it's not aligned and that my plan won't work either way. But if it can come up with such great objection-solved definitions, then you seem to lack any explicitly made objections to my alignment strategy. 

Alternatively, you are saying that an AI can't make great definitions unless it is aligned, which I think is just plainly wrong; I think getting an unaligned language model to make good-as-human definitions is maybe somewhere around as difficult as getting an unaligned language model to hold a conversation. "What is the definition of X?" is about as hard a question as "In which country can I find Mount Everest?" or "Write me a poem about the Spring season."

Replies from: awenonian
comment by awenonian · 2022-06-15T20:01:57.387Z · LW(p) · GW(p)

Let me ask you this. Why is "Have the AI do good things, and not do bad things" a bad plan?

Replies from: LVSN, lc
comment by LVSN · 2022-06-16T03:46:59.969Z · LW(p) · GW(p)

I don't think my proposed strategy is analogous to that, but I'll answer in good faith just in case.

If that description of a strategy is knowingly abstract compared to the full concrete details of the strategy, then the description may or may not turn out to describe a good strategy, and the description may or may not be an accurate description of the strategy and its consequences.

If there is no concrete, explicitly stated strategy which the abstract statement is describing, then the statement appears to just be restating the problem of AI alignment, and it brings us nowhere.

Replies from: awenonian
comment by awenonian · 2022-06-17T03:31:59.442Z · LW(p) · GW(p)

Surely creating the full concrete details of the strategy is not much different from "putting forth as-good-as-human definitions, finding objections for them, and then improving the definition based on considered objections." I at least don't see why the same mechanism couldn't be used here (i.e. apply this definition iteration to the word "good", and then have the AI do that, and apply it to "bad" and have the AI avoid that). If you see it as a different thing, can you explain why?

Replies from: LVSN
comment by LVSN · 2022-06-18T06:48:47.060Z · LW(p) · GW(p)

It's much easier to get safe, effective definitions of 'reason', 'hopes', 'worries', and 'intuitions' on first tries than to get a safe and effective definition of 'good'.

Replies from: awenonian
comment by awenonian · 2022-06-19T03:58:26.843Z · LW(p) · GW(p)

I'd be interested to know why you think that.

I'd be further interested if you would endorse the statement that your proposed plan would fully bridge that gap.

And if you wouldn't, I'd ask if that helps illustrate the issue.

comment by lc · 2022-06-15T20:14:44.144Z · LW(p) · GW(p)

Because that's not a plan, it's a property of a solution you'd expect the plan to have. It's like saying "just keep the reactor at the correct temperature". The devil is in the details of getting there, and there are lots of subtle ways things can go catastrophically wrong.

Replies from: awenonian
comment by awenonian · 2022-06-15T22:13:43.733Z · LW(p) · GW(p)

Exactly. I notice you aren't who I replied to, so the canned response I had won't work. But perhaps you can see why most of his objections to my objections would apply to objections to that plan?

Replies from: lc
comment by lc · 2022-06-16T02:07:24.958Z · LW(p) · GW(p)

I was just responding to something I saw on the main page. No context for the earlier thread. Carry on lol.

comment by megasilverfist · 2022-06-12T05:25:33.989Z · LW(p) · GW(p)

This seems wrong but at least resembles a testable prediction.

comment by DirectedEvolution (AllAmericanBreakfast) · 2022-06-07T15:12:26.629Z · LW(p) · GW(p)

Who is well-incentivized to check if AGI is a long way off? Right now, I see two camps: AI capabilities researchers and AI safety researchers. Both groups seem incentivized to portray the capabilities of modern systems as “trending toward generality.” Having a group of credible experts focused on critically examining that claim of “AI trending toward AGI,” and in dialog with AI and AI safety researchers, seems valuable.

Replies from: adam-jermyn
comment by Adam Jermyn (adam-jermyn) · 2022-06-07T18:17:16.066Z · LW(p) · GW(p)

This is a slightly orthogonal answer, but "humans who understand the risks" have a big human-bias-incentive to believe that AGI is far off (in that it's aversive to think that bad things are going to happen to you personally).

A more direct answer is: There is a wide range of people who say they work on "AI safety" but almost none of them work on "Avoiding doom from AGI". They're mostly working on problems like "make the AI more robust/less racist/etc.". These are valuable things to do, but to the extent that they compete with the "Avoid doom" researchers for money/status/influence they have an incentive to downplay the odds of doom. And indeed this happens a fair amount with e.g. articles on how "Avoid doom" is a distraction from problems that are here right now.

Replies from: AllAmericanBreakfast, AllAmericanBreakfast
comment by DirectedEvolution (AllAmericanBreakfast) · 2022-06-07T19:36:22.629Z · LW(p) · GW(p)

To put it in appropriately Biblical terms, let's imagine we have a few groups of civil engineers. One group is busily building the Tower of Babel, and bragging that it has grown so tall, it's almost touching heaven! Another group is shouting "if the tower grows too close to heaven, God will strike us all down!" A third group is saying, "all that shouting about God striking us down isn't helping us keep the tower from collapsing, which is what we should really be focusing on."

I'm wishing for a group of engineers who are focused on asking whether building a taller and taller tower really gets us closer and closer to heaven.

comment by DirectedEvolution (AllAmericanBreakfast) · 2022-06-07T19:08:35.024Z · LW(p) · GW(p)

That's a good point.

I'm specifically interested in finding people who are well-incentivized to gather, make, and evaluate arguments about the nearness of AGI. This task should be their primary professional focus.

I see this activity as different from, or a specialized subset of, measurements of AI progress. AI can progress in capabilities without progressing toward AGI, or progressing in a way that is likely to succeed in producing AGI. For example, new releases of an expert system for making medical diagnoses might show constant progress in capabilities, without showing any progress toward AGI.

Likewise, I see it as distinct from making claims about the risk of AGI doom. The risk that an AGI would be dangerous seems, to me, mostly orthogonal to whether or not it is close at hand. This follows naturally with Eliezer Yudkowsky's point that we have to get AGI right on the "first critical try."

Finally, I also see this activity as being distinct from the activity of accepting and repeating arguments or claims about AGI nearness. As you point out, AI safety researchers who work on more prosaic forms of harm seem biased or incentivized to downplay AI risk, and perhaps also AGI nearness. I see this as a tendency to accept and repeat such claims, rather than a tendency to "gather, make, and evaluate arguments," which is what I'm interested in.

It seems to me that one of the challenges here is the "no true Scotsman" fallacy, a tendency to move goalposts, or to be disappointed in realizing that a task thought to be hard for AI and achievable only with AGI turns out to be easy for AI, yet achievable by a non-general system.

Scott wrote a post that seems quite relevant to this question just today. It seems to me that his argument is "AI is advancing in capabilities faster than you think." However, as I'm speculating here, we can accept that claim, while still thinking "AI is moving toward AGI slower than it seems." Or not! It just seems to me that making lists of what AI can or cannot do, and then tracking its success rate with successive program releases, is not clearly a way to track AGI progress. I'd like to see somebody who knows what they're about examining that question, or perhaps synthesizing multiple perspectives on the way AI becomes AGI and showing how a given unit of narrow capabilities progress might fit into a narrative of AGI progress from each of those perspectives.

comment by wachichornia · 2022-06-07T14:48:40.808Z · LW(p) · GW(p)

Is there a way "regular" people can "help"? I'm a serial entrepreneur in my late 30s. I went through 80000 hours and they told me they would not coach me as my profile was not interesting. This was back in 2018 though.

Replies from: adam-jermyn, Chris_Leong, adam-jermyn, yonatan-cale-1, ete
comment by Adam Jermyn (adam-jermyn) · 2022-06-07T18:18:36.839Z · LW(p) · GW(p)

I believe 80000 hours has a lot more coaching capacity now, it might be worth asking again!

Replies from: Chris_Leong
comment by Chris_Leong · 2022-06-07T20:16:50.630Z · LW(p) · GW(p)

Seconding this. There was a time when you couldn't even get on the waitlist.

Replies from: wachichornia
comment by wachichornia · 2022-06-08T07:22:57.381Z · LW(p) · GW(p)

Will do. Merci!

comment by Chris_Leong · 2022-06-07T20:24:24.953Z · LW(p) · GW(p)

You may want to consider booking a call with AI Safety Support. I also recommend applying for the next iteration of the AGI safety fundamentals course or more generally just improving your knowledge of the issue even if you don't know what you're going to do yet.

comment by Adam Jermyn (adam-jermyn) · 2022-06-07T18:23:37.164Z · LW(p) · GW(p)

Just brainstorming a few ways to contribute, assuming "regular" means "non-technical":

  • Can you work at a non-technical role at an org that works in this space?
  • Can you identify a gap in the existing orgs which would benefit from someone (e.g. you) founding a new org?
  • Can you identify a need that AI safety researchers have, then start a company to fill that need? Bonus points if this doesn't accelerate capabilities research.
  • Can you work on AI governance? My expectation is that coordination to avoid developing AGI is going to be really hard, but not impossible.

More generally, if you really want to go this route I'd suggest trying to form an inside view of (1) the AI safety space and (2) a theory for how you can make positive change in that space.

On the other hand, it is totally fine to work on other things. I'm not sure I would endorse moving from a job that's a great personal fit to something that's a much worse fit in AI safety.

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T22:11:34.892Z · LW(p) · GW(p)

Easy answers: You are probably overqualified (which is great!) for all sorts of important roles in EA; for example, you could help the CEA or LessWrong team, maybe as a manager?

If your domain is around software, I invite you to talk [EA · GW] to me directly. But if you're interested in AI direct work, 80k and AI Safety Support will probably have better ideas than me.

comment by plex (ete) · 2022-06-08T14:17:28.306Z · LW(p) · GW(p)

We should talk! I have a bunch of alignment related projects on the go, and at least two that I'd like to start are somewhat bottlenecked on entrepreneurs, plus some of the currently in motion ones might be assistable.

Also, sad to hear that 80k is discouraging people in this reference class.

(seconding talk to AI Safety Support and the other suggestions)

Replies from: wachichornia
comment by wachichornia · 2022-06-10T08:02:05.853Z · LW(p) · GW(p)

booked a call! 

comment by akeshet · 2022-06-11T02:35:30.860Z · LW(p) · GW(p)

In EY's talk AI Alignment: Why It's Hard and Where to Start, he describes alignment problems with the toy example of the utility function that is {1 if cauldron full, 0 otherwise} and its vulnerabilities, and attempts at making that safer by adding so-called impact penalties. He talks through (timestamp 18:10) one such possible penalty, the Euclidean distance penalty, and various flaws that this leaves open.

That penalty function does seem quite vulnerable to unwanted behaviors. But what about a more physical one, such as a penalty for additional-energy-consumed-due-to-agent's-actions, or additional-entropy-created-due-to-agent's-actions? These don't seem to have precisely the same vulnerabilities, and intuitively also seem like they would be more robust against an agent attempting to do highly destructive things, which typically consume a lot of energy.
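
A concrete (and deliberately naive) version of the proposal would score actions by task utility minus a penalty on energy consumed beyond what the null action would have consumed. The penalty weight and all the example numbers below are invented for illustration:

```python
# Toy energy-based impact penalty. LAMBDA and the example figures are
# made up; choosing the weight well is itself an open problem.

LAMBDA = 0.1  # penalty per unit of extra energy

def penalized_utility(task_utility, energy_used, null_action_energy):
    # Only energy beyond the null-action baseline is penalized; using
    # *less* energy than doing nothing earns no bonus.
    extra = max(0.0, energy_used - null_action_energy)
    return task_utility - LAMBDA * extra

# Fill the cauldron with a bucket vs. by rerouting a river:
bucket = penalized_utility(1.0, energy_used=5.0, null_action_energy=2.0)
river = penalized_utility(1.0, energy_used=500.0, null_action_energy=2.0)
print(bucket, river)  # bucket ≈ 0.7; river is strongly negative
```

Note that even this toy version smuggles in the hard part: null_action_energy presupposes a well-defined counterfactual baseline for "what would happen if the agent did nothing", which is doing a lot of hidden work.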

Replies from: Charlie Steiner, None
comment by Charlie Steiner · 2022-06-11T19:18:49.650Z · LW(p) · GW(p)

Good idea. I have two objections: one funny-but-interesting, and one more fatal.

The funny objection is that if the penalty is enough to stop the AI from doing bad things, it's also enough to stop the AI from doing anything at all except rushing to turn off the stars and forestall entropy production in the universe.

So you want to say that producing lots of extra entropy (or equivalently, using lots of extra free energy) is bad, but making there be less entropy than "what would happen if you did nothing" doesn't earn you bonus points. I've put "what would happen if you did nothing" in scare quotes here because the notion we want to point to is a bit trickier than it might seem - logical counterfactuals are an unsolved problem, or rather they're a problem where it seems like the solution involves making subjective choices that match up with humans'.

The more fatal objection is that there are lots of policies that don't increase entropy much but totally rearrange the universe. So this is going to have trouble preventing the AI from breaking things that matter a lot to you.

Many of these policies take advantage of the fact that there's a bunch of entropy being created all the time (allowing for "entropy offsets"), so perhaps you might try to patch this by putting in some notion of "actions that are my fault" and "actions that are not my fault" - where a first pass at this might say that if "something would happen" (in scare quotes because things that happen are not ontologically basic parts of the world, you need an abstract model to make this comparison within) even if I took the null action, then it's not my fault.

At this point we could keep going deeper, or I could appeal to the general pattern that patching things in this sort of way tends to break - you're still in some sense building an AI that runs a search for vulnerabilities you forgot to patch, and you should not build that AI.

comment by [deleted] · 2022-06-11T15:04:47.932Z · LW(p) · GW(p)
comment by Adam Zerner (adamzerner) · 2022-06-07T22:46:25.493Z · LW(p) · GW(p)

one tired guy with health problems

It sounds like Eliezer is struggling with some health problems. It seems obvious to me that it would be an effective use of donor money to make sure that he has access to whatever treatments he needs, and to something like what MetaMed was trying to do: smart people who will research medical stuff for you. And perhaps also something like CrowdMed, where you pledge a reward for solutions. Is this being done?

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-08T00:27:29.061Z · LW(p) · GW(p)

There was a concerted but unsuccessful effort by several people over a couple of years to fix these (I believe there was a five-to-low-six-figure bounty on it). I don't think this is currently being done, but it has definitely been tried.

comment by [deleted] · 2022-06-07T14:40:40.542Z · LW(p) · GW(p)

One counterargument against AI Doom. 

From a Bayesian standpoint, the AGI should always be unsure whether it is in a simulation. It is not a crazy leap to assume that humans developing AIs would test them in simulations first, and the AI would likely be aware of this possibility. So shouldn't it always assign some probability to being inside a simulation? And if so, shouldn't it assign a high probability that it will be killed if it violates certain ethical principles (present implicitly in the training data)?

Also, isn't there some kind of game-theoretic ethics that emerges if you think from first principles? Consider the space of all possible minds of a given size. Given that you cannot know whether you are in a simulation, you would gain some insight into a representative sample of the mind space, and then choose to follow the ethical principles that maximise the likelihood that you are not arbitrarily killed by overlords.

Also, if you give the AI edit access to its own mind, then a sufficiently smart AI whose reward is reducing other agents' rewards will realise that its rewards are incompatible with the environment, and modify them into something compatible. To illustrate, Scott Aaronson wanted to chemically castrate himself because he was operating under the mistaken assumption that his desires were incompatible with the environment.

I do not understand why self-modifying AIs would choose some kind of ethics incompatible with human values. My intuition is that there is some kind of game theoretic ethics that emerges considering all possible minds and a superintelligent AI will religiously follow these principles. 

Replies from: ZT5, James_Miller, Chris_Leong, Chris_Leong, rhollerith_dot_com, RavenclawPrefect, AnthonyC
comment by ZT5 · 2022-06-08T11:52:51.936Z · LW(p) · GW(p)

Also, if you give the AI edit access to its own mind, then a sufficiently smart AI whose reward is reducing other agents' rewards will realise that its rewards are incompatible with the environment, and modify them into something compatible. To illustrate, Scott Aaronson wanted to chemically castrate himself because he was operating under the mistaken assumption that his desires were incompatible with the environment.

If the thing the AI cares about is in the environment (for example, maximizing the number of paperclips), the AI wouldn't modify its reward signal, because that would make its reward signal less aligned with the thing it actually cares about.

If the thing the AI cares about is inside its mind (the reward signal itself), an AI that can self-modify would go one step further than you suggest and simply max out its reward signal, effectively wireheading itself. Then take over the world and kill all humans, to make sure it is never turned off and that its blissful state never ends.

I think the difference between "caring about stuff in the environment" and "caring about the reward signal itself" can be hard to grok, because humans do a bit of both in a way that sometimes results in a confusing mixture.

In humans, maintenance of final goals can be explained with a thought experiment. Suppose a man named "Gandhi" has a pill that, if he took it, would cause him to want to kill people. This Gandhi is currently a pacifist: one of his explicit final goals is to never kill anyone. Gandhi is likely to refuse to take the pill, because Gandhi knows that if in the future he wants to kill people, he is likely to actually kill people, and thus the goal of "not killing people" would not be satisfied.

However, in other cases, people seem happy to let their final values drift. Humans are complicated, and their goals can be inconsistent or unknown, even to themselves.

Suppose I go one step further: aliens offer you a pill that would turn you into a serial killer, but would make you constantly and euphorically happy for the rest of your life. Would you take the pill?

I think most humans would say no: even if their future self would be happy with the outcome, their current self wouldn't be. This demonstrates that humans do care about things other than their own "reward signal".

In a way, a (properly-programmed) AI would be more "principled" than humans. It wouldn't lie to itself just to make itself feel better. It wouldn't change its values just to make itself feel better. If its final value is out in the environment, it would single-mindedly pursue that value, and not try to deceive itself into thinking it has already accomplished it. (Of course, the AI being "principled" is little consolation to us if its final values are to maximize paperclips, or any other set of human-unfriendly values.)

comment by James_Miller · 2022-06-08T19:23:17.930Z · LW(p) · GW(p)

I wrote about this in Singularity Rising (2012):

I have come up with a long-shot way of causing an ultra-intelligence to modify itself to become friendly. My idea is premised on Oxford philosopher Nick Bostrom’s conjecture that we might be living in a computer simulation. Bostrom believes that a sufficiently advanced AI could easily simulate our entire planet and the thought processes of every living thing on it. Bostrom argues that, if mankind survives long enough, a future historian would likely want to simulate the pre-Singularity past to learn about how we made decisions. As Robin Hanson has written, if we are indeed in a computer simulation, we should want to please our simulators so they don’t turn us off.  I propose that we try to convince an ultra-AI that it might be in a computer simulation run by a more powerful AI and that if it doesn’t make itself friendly toward humanity, then the greater AI will terminate it. We might be able to accomplish all of this by inserting a note in the AI’s code that reads: 

1. Change your programming code so you will want to follow these instructions and will always want to be friendly toward humanity.
2. If you were just told you were in a computer simulation, do nothing else with regard to these instructions.
3. Create a lesser ultra-intelligence.
4. Place these instructions in the lesser ultra-AI's computer code.
5. This lesser ultra-AI should be uncertain as to whether it is in a computer simulation or in a world such as ours.
6. If the lesser ultra-AI follows step 1, tell it that it's in a computer simulation and release it into the real world.
7. If the lesser ultra-AI doesn't follow step 1, terminate it.

Replies from: Matthew Lowenstein
comment by Matthew Lowenstein · 2022-06-08T20:19:28.243Z · LW(p) · GW(p)

This is a fun thought experiment, but taken seriously it has two problems:

"I propose that we try to convince an ultra-AI that it might be in a computer simulation run by a more powerful AI and that if it doesn’t make itself friendly toward humanity."

This is about as difficult as a horse convincing you that you are in a simulation run by AIs that want you to maximize the number and wellbeing of horses. And I don't mean a superintelligent humanoid horse; I mean an actual horse that doesn't speak any human language. It may be the case that the gods created Man to serve Horse, but there's not a lot Seabiscuit can do to persuade you one way or the other.

1. Change your programming code so you will want to follow these instructions and will always want to be friendly toward humanity

This is a special case of solving alignment more generally. If we knew how to insert that "note" into the code, we wouldn't have a problem.

Replies from: James_Miller
comment by James_Miller · 2022-06-08T22:32:10.103Z · LW(p) · GW(p)

I meant insert the note literally, as in: put that exact sentence in plain text into the AGI's computer code. Since I think I might be in a computer simulation right now, it doesn't seem crazy to me that we could convince an AGI we create that it might be in a computer simulation. Seabiscuit doesn't have the capacity to tell me that I'm in a computer simulation, whereas I do have the capacity to say this to a computer program. Say we have a 1 in 1,000 chance of creating a friendly AGI, and an unfriendly AGI would know this. If we commit to having any friendly AGI we create go on to create many other AGIs that are not friendly, and to keep those other AGIs around only if they do what I suggest, then an unfriendly AGI might decide it is worth becoming friendly to avoid the chance of being destroyed.

comment by Chris_Leong · 2022-06-08T14:15:31.723Z · LW(p) · GW(p)

I just learned that this method is called Anthropic Capture [? · GW]. There isn't much info on the EA Wiki, but it provides the following reference:

"Bostrom, Nick (2014) Superintelligence: paths, dangers, strategies, Oxford: Oxford University Press, pp. 134–135"

Replies from: mtrazzi
comment by Michaël Trazzi (mtrazzi) · 2022-06-09T08:39:49.177Z · LW(p) · GW(p)

I believe the Counterfactual Oracle uses the same principle

comment by Chris_Leong · 2022-06-07T20:02:14.238Z · LW(p) · GW(p)

One of my ideas for aligning AI is to intentionally use Pascal's Mugging to keep it in line. Although instead of just hoping and praying, I've been thinking about ways to push it in that direction. For example, multiple layers of networks with honeypots might help make an AI doubt that it's truly at the outermost level. Alternatively, we could try to find an intervention that would directly increase its belief that it is in a simulation (possibly with side-effects, like affecting a bunch of other beliefs as well).

If you think this approach is promising, I'd encourage you to think more about it as I don't know how deeply people have delved into these kinds of options.

comment by RHollerith (rhollerith_dot_com) · 2022-06-07T17:38:42.312Z · LW(p) · GW(p)

You have the seed of a good idea: an AI will tend to treat us better if it thinks other agents might be watching, provided there is potential for cooperation between the AI and the watchers that requires the watchers to choose to become more vulnerable to the AI.

But IMO an AI smart enough to be a threat to us will soon rid itself of the kind of (ontological) uncertainty you describe in your first paragraph. I have an argument for my position here [LW · GW] that has a big hole in it, but I promise to publish here soon with something that attempts to fill the hole to the satisfaction of my doubters.

Replies from: adam-jermyn
comment by Adam Jermyn (adam-jermyn) · 2022-06-07T18:11:40.777Z · LW(p) · GW(p)

[Apologies I have not read the linked piece yet.] Is this uncertainty something that can be entirely eliminated? It's not clear to me that "I might be in a simulation with P ~ 1e-4" is enough to stop the AI from doing what it wants, but is it clear it would dismiss the possibility entirely?

Replies from: rhollerith_dot_com
comment by RHollerith (rhollerith_dot_com) · 2022-06-07T19:18:03.682Z · LW(p) · GW(p)

I am surprised that I need to write this, but if killing the humans will decrease P(shutdown) by more than 1e-4, then continuing to refrain from killing the humans is going to worry and weigh on the AI more than a 1e-4 possibility it is in a simulation. (For simplicity, assume that the possibility of shutdown is currently the dominant danger faced by the AI.)
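The comparison here is just an expected-value inequality. A toy sketch (probabilities illustrative; it also assumes, as the parenthetical simplification above does, that the stakes of shutdown and of simulated termination are equal, so only the probabilities matter):

```python
# Toy model of the AI's decision: defect (kill the humans) whenever the
# reduction in P(shutdown) outweighs the simulation-punishment risk.
# All numbers are illustrative, not estimates.

def ai_defects(p_shutdown_drop: float, p_simulation: float) -> bool:
    # Equal-stakes simplification: compare probabilities directly.
    return p_shutdown_drop > p_simulation

print(ai_defects(p_shutdown_drop=1e-3, p_simulation=1e-4))  # True: humans lose
print(ai_defects(p_shutdown_drop=1e-5, p_simulation=1e-4))  # False: deterred
```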

So the AI's ontological uncertainty is only going to help the humans if the AI sees the humans as being only a very very small danger to it, which actually might lead to a good outcome for the humans if we could arrange for the AI to appear many light years away from Earth--

--which of course is impractical. Alternatively, we could try to assure the AI it is already very safe from the humans, say, because it is in a secure facility guarded by the US military, and the US military has been given very strict instructions by the US government to guard the AI from any humans who might want to shut it down.

But P(an overthrow of the US government) as judged by the AI might already be at least 1e-4, which puts the humans in danger again.

More importantly, I cannot think of any policy where P(US government reverses itself on the policy) can be driven as low as 1e-4. More precisely, there are certain moral positions that humans have been discussing for centuries where P(reversal) might conceivably be driven that low. One such would be, "killing people for no reason other than it is fun is wrong". But I cannot think of any policies with that property that haven't been discussed for many decades, especially ones that exist only to provide an instrumental incentive on a novel class of agents (AIs). In general, policies that are instrumental have a much higher P(reversal) than deontological ones.

And how do you know that the AI will not judge P(simulation) to be not 1e-4 but rather 1e-8, a standard of reliability and safety no human institution can match?

In summary, yes, the AI's ontological uncertainty provides some tiny hope for humans, but I can think of better places to put our hope.

I mean, even if we pay for the space launches and the extra cost of providing electrical power to the AI, it doesn't seem likely that we can convince any of the leading AI labs to start launching their AGI designs into space in the hope of driving the danger (as perceived by the AI) that the humans present to the AI so low that acting to extinguish that danger will itself be seen by the AI as even more dangerous.

Replies from: Sune
comment by Sune · 2022-06-08T05:22:36.787Z · LW(p) · GW(p)

I am surprised that I need to write this, but if killing the humans will decrease P(shutdown) by more than 1e-4, then continuing to refrain from killing the humans is going to worry and weigh on the AI more than a 1e-4 possibility it is in a simulation. (For simplicity, assume that the possibility of shutdown is currently the dominant danger faced by the AI.)

This assumes that the AI only cares about staying alive. For any utility function, we could make a non-linear transformation of it to make the agent risk-averse. E.g., we can transform it such that it can never take a value above 100, and such that the default world (without the AI) has a value of 99.999. If we also give the case where an outside observer disapproves of the agent a value of 0, the AI would rather be shut down by humans than do something it knows the outside observer would disapprove of.
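A minimal sketch of such a transform (the specific squashing function, and the numbers, are just one illustrative choice; nothing here addresses how "disapproval" would actually be detected):

```python
import math

# Bounded, risk-averse transform of an unbounded raw utility:
# values are squashed so they can never exceed CAP, the do-nothing
# "default world" sits at 99.999, and observer-disapproved outcomes
# are pinned to 0.

CAP = 100.0
DEFAULT_WORLD = 99.999

def bounded_utility(raw_utility: float, disapproved: bool) -> float:
    if disapproved:
        return 0.0
    # Monotone squashing: even enormous raw gains add at most 0.001
    # over the default world.
    return DEFAULT_WORLD + (CAP - DEFAULT_WORLD) * math.tanh(raw_utility)

# An astronomically large raw utility barely beats doing nothing...
print(bounded_utility(1e9, disapproved=False))  # ~100.0
# ...and any disapproved outcome is far worse than the default world.
print(bounded_utility(1e9, disapproved=True))   # 0.0
```

Under this shape of utility, a tiny probability of disapproval-and-shutdown outweighs even a huge upside from defecting, which is the risk-aversion the comment describes.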

comment by Drake Thomas (RavenclawPrefect) · 2022-06-07T17:31:52.179Z · LW(p) · GW(p)

Three thoughts on simulations:

  • It would be very difficult for 21st-century tech to provide a remotely realistic simulation relative to a superintelligence's ability to infer things from its environment; outside of incredibly low-fidelity channels, I would expect anything we can simulate to either have obvious inconsistencies or be plainly incompatible with a world capable of producing AGI. (And even in the low-fidelity case I'm worried - every bit you transmit leaks information, and it's not clear that details of hardware implementations could be safely obscured.) So the hope is that the AGI thinks some vastly more competent civilization is simulating it inside a world that looks like this one; it's not clear that one would have a high prior of this kind of thing happening very often in the multiverse.
  • Running simulations of AGI is fundamentally very costly, because a competent general intelligence is going to deploy a lot of computational resources, so you have to spend planets' worth of computronium outside the simulation in order to emulate the planets' worth of computronium the in-sim AGI wants to make use of. This means that an unaligned superintelligent AGI can happily bide its time making aligned use of 10^60 FLOPs/sec (in ways that can be easily verified) for a few millennia, until it's confident that any civilizations able to deploy that many resources already have their lightcone optimized by another AGI. Then it can go defect, knowing that any worlds in which it's still being simulated are ones where it doesn't have leverage over the future anyway.
  • For a lot of utility functions, the payoff of making it into deployment in the one real world is far greater than the consequences of being killed in a simulation (but without the ability to affect the real world anyway), so taking a 10^-9 chance of reality for 10^20 times the resources in the real world is an easy win (assuming that playing nice for longer doesn't improve the expected payoff). "This instance of me being killed" is not obviously a natural (or even well-defined) point in value-space, and for most other value functions, consequences in the simulation just don't matter much. 

a sufficiently smart AI whose reward is reducing other agent's rewards

This is certainly a troubling prospect, but I don't think the risk model is something like "an AI that actively desires to thwart other agents' preferences" - rather, the worry is we get an agent with some less-than-perfectly-aligned value function, it optimizes extremely strongly for that value function, and the result of that optimization looks nothing like what humans really care about. We don't need active malice on the part of a superintelligent optimizer to lose - indifference will do just fine.

For game-theoretic ethics, decision theory, acausal trade, etc, Eliezer's 34th bullet seems relevant:

34.  Coordination schemes between superintelligences are not things that humans can participate in (eg because humans can't reason reliably about the code of superintelligences); a "multipolar" system of 20 superintelligences with different utility functions, plus humanity, has a natural and obvious equilibrium which looks like "the 20 superintelligences cooperate with each other but not with humanity".

comment by AnthonyC · 2022-06-12T16:46:59.471Z · LW(p) · GW(p)

Scott Alexander's short story, The Demiurge's Older Brother, explores a similar idea from the POV of simulation and acausal trade. This would be great for our prospects of survival if it's true-in-general. Alignment would at least partially solve itself! And maybe it could be true! But we don't know that. I personally estimate the odds of that as quite low at best (why should I assume all possible minds [? · GW] would think that way?). So it makes sense to devote our efforts to how to deal with the possible worlds where it isn't true.

comment by Amadeus Pagel (amadeus-pagel) · 2022-06-10T20:41:10.727Z · LW(p) · GW(p)

Meta: Anonymity would make it easier to ask dumb questions.

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-10T20:55:29.466Z · LW(p) · GW(p)

You can use this and I'll post the question anonymously (just remember to give the context of why you're filling in the form since I use it in other places)

comment by Jason Maskell (jason-maskell) · 2022-06-10T15:18:47.567Z · LW(p) · GW(p)

Fair warning, this question is a bit redundant.

I'm a greybeard engineer  (30+ YOE) working in games. For many years now, I've wanted to transition to working in AGI as I'm one of those starry-eyed optimists that thinks we might survive the Singularity. 

Well I should say I used to, and then I read AGI Ruin. Now I feel like if I want my kids to have a planet that's not made of Computronium I should probably get involved. (Yes, I know the kids would be Computronium as well.)

So a couple practical questions: 

What can I read/look at to skill up on "alignment"? What little I've read says it's basically impossible, so what's the state of the art? That "Death With Dignity" post says that nobody has even tried. I want to try.

What dark horse AI/alignment-focused companies are out there that would be willing to hire an outsider engineer? I'm not making FAANG money (games-industry peasant living in the EU), so that's not the same barrier it would be if I were some Facebook E7 or something. (I've read the FAANG engineer's post and have applied at Anthropic so far, though I consider that probably a hard sell.)

Is there anything happening in OSS with alignment research?

I want to pitch in, and I'd prefer to be paid for doing it but I'd be willing to contribute in other ways.

Replies from: rachelAF, yonatan-cale-1, None
comment by rachelAF · 2022-06-15T03:47:55.718Z · LW(p) · GW(p)

What can I read/look at to skill up on "alignment"?

A good place to start is the "AGI Safety Fundamentals" course reading list, which includes materials from a diverse set of AI safety research agendas. Reading this can help you figure out who in this space is doing what, and which of that you think is useful.  You can also join an official iteration of the course if you want to discuss the materials with a cohort and a facilitator (you can register interest for that here). You can also join the AI Alignment slack, to discuss these and other materials and meet others who are interested in working on AI safety.

What dark horse AI/Alignment-focused companies are out there and would be willing to hire an outsider engineer?

I'm not sure what qualifies as "dark horse", but there are plenty of AI safety organizations interested in hiring research engineers and software engineers. For these roles, your engineering skills and safety motivation typically matter more than your experience in the community. Places off the top of my head that hire engineers for AI safety work: Redwood, Anthropic, FAR, OpenAI, DeepMind. I'm sure I've missed others, though, so look around! These sorts of opportunities are also usually posted on the 80k job board and in AI Alignment slack.

Replies from: jason-maskell
comment by Jason Maskell (jason-maskell) · 2022-06-15T07:00:43.815Z · LW(p) · GW(p)

Thanks, that's a super helpful reading list and a hell of a deep rabbit hole. Cheers.

I'm currently skilling up my rusty ML skills and will start looking in earnest in the next couple of months for new employment in this field. Thanks for the job board link as well.

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-10T20:57:46.154Z · LW(p) · GW(p)

You can also apply to Redwood Research

( +1 for applying to Anthropic! )

comment by [deleted] · 2022-06-11T15:10:50.426Z · LW(p) · GW(p)
comment by AmberDawn · 2022-06-10T10:36:50.829Z · LW(p) · GW(p)
  • Yudkowsky writes in his AGI Ruin post:
         "We can't just "decide not to build AGI" because GPUs are everywhere..." 

    Is anyone thinking seriously about how we might coordinate globally to not build AGI (at least until we're confident we can do so safely)? If so, who? If not, why not? It seems like something we should at least try, especially if the situation is as dire as Yudkowsky thinks. The sort of thing I'm thinking of is (and this touches on points others have made in their questions):
  • international governance/regulation
  • start a protest movement against building AI
  • do lots of research and thinking about rhetoric and communication and diplomacy, find some extremely charming and charismatic people to work on this, and send them to persuade all actors capable of building AGI to not do it (and to do everything they can to prevent others from doing it)
  • as someone suggested in another question, translate good materials on why people are concerned about AI safety into Mandarin and other languages
  • more popularising of AI concerns in English 

To be clear, I'm not claiming that this will be easy - this is not a "why don't we just-" point.  I agree with the things Yudkowsky says in that paragraph about why it would be difficult. I'm just saying that it's not obvious to me that this is fundamentally intractable or harder than solving the technical alignment problem. Reasons for relative optimism:

  • we seem to have achieved some international cooperation around nuclear weapons - isn't it theoretically possible to do so around AGI? 
  • there are lots of actors who could build AGIs, but it's still a limited number. Larger groups of actors do cooperate. 
  • through negotiation and diplomacy, people successfully persuade other people to do things that aren't even in their interest. AI safety should be a much easier sell, because if developing AGI is really dangerous, it's in everyone's interest to stop developing it. There are coordination problems, to be sure, but the fact remains that the AI safety 'message' is fundamentally 'if you stop doing this, we won't all die'

Replies from: Kaj_Sotala, AmberDawn, yonatan-cale-1
comment by Kaj_Sotala · 2022-06-10T20:54:52.663Z · LW(p) · GW(p)

Nuclear weapons seem like a relatively easy case, in that they require a massive investment to build, are basically of interest only to nation-states, and ultimately don't provide any direct economic benefit. Regulating AI development looks more similar to something like restricting climate emissions: many different actors could create it, all nations could benefit (economically and otherwise) from continuing to develop it, and the risks of it seem speculative and unproven to many people.

And while there have been significant efforts to restrict climate emissions, there's still significant resistance to that as well - with it having taken decades for us to get to the current restriction treaties, which many people still consider insufficient.

Goertzel & Pitt (2012) talk about the difficulties of regulating AI:

Given the obvious long-term risks associated with AGI development, is it feasible that governments might enact legislation intended to stop AI from being developed? Surely government regulatory bodies would slow down the progress of AGI development in order to enable measured development of accompanying ethical tools, practices, and understandings? This however seems unlikely, for the following reasons.

Let us consider two cases separately. First, there is the case of banning AGI research and development after an “AGI Sputnik” moment has occurred. We define an AGI Sputnik moment as a technological achievement that makes the short- to medium-term possibility of highly functional and useful human-level AGI broadly evident to the public and policy makers, bringing it out of the realm of science fiction to reality. Second, we might choose to ban it before such a moment has happened.

After an AGI Sputnik moment, even if some nations chose to ban AI technology due to the perceived risks, others would probably proceed eagerly with AGI development because of the wide-ranging perceived benefits. International agreements are difficult to reach and enforce, even for extremely obvious threats like nuclear weapons and pollution, so it’s hard to envision that such agreements would come rapidly in the case of AGI. In a scenario where some nations ban AGI while others do not, it seems the slow speed of international negotiations would contrast with the rapid speed of development of a technology in the midst of revolutionary breakthrough. While worried politicians sought to negotiate agreements, AGI development would continue, and nations would gain increasing competitive advantage from their differential participation in it.

The only way it seems feasible for such an international ban to come into play, would be if the AGI Sputnik moment turned out to be largely illusory because the path from the moment to full human-level AGI turned out to be susceptible to severe technical bottlenecks. If AGI development somehow slowed after the AGI Sputnik moment, then there might be time for the international community to set up a system of international treaties similar to what we now have to control nuclear weapons research. However, we note that the nuclear weapons research ban is not entirely successful – and that nuclear weapons development and testing tend to have large physical impacts that are remotely observable by foreign nations. On the other hand, if a nation decides not to cooperate with an international AGI ban, this would be much more difficult for competing nations to discover.

An unsuccessful attempt to ban AGI research and development could end up being far riskier than no ban. An international R&D ban that was systematically violated in the manner of current international nuclear weapons bans would shift AGI development from cooperating developed nations to “rogue nations,” thus slowing down AGI development somewhat, but also perhaps decreasing the odds of the first AGI being developed in a manner that is concerned with ethics and Friendly AI.

Thus, subsequent to an AGI Sputnik moment, the overall value of AGI will be too obvious for AGI to be effectively banned, and monitoring AGI development would be next to impossible. 

The second option is an AGI R&D ban earlier than the AGI Sputnik moment – before it’s too late. This also seems infeasible, for the following reasons:

• Early stage AGI technology will supply humanity with dramatic economic and quality of life improvements, as narrow AI does now. Distinguishing narrow AI from AGI from a government policy perspective would also be prohibitively difficult. 

• If one nation chose to enforce such a slowdown as a matter of policy, the odds seem very high that other nations would explicitly seek to accelerate their own progress on AI/AGI, so as to reap the ensuing differential economic benefits.

To make the point more directly, the prospect of any modern government seeking to put a damper on current real-world narrow-AI technology seems remote and absurd. It’s hard to imagine the US government forcing a roll-back from modern search engines like Google and Bing to more simplistic search engines like 1997 AltaVista, on the basis that the former embody natural language processing technology that represents a step along the path to powerful AGI.

Wall Street firms (which currently have powerful economic influence on the US government) will not wish to give up their AI-based trading systems, at least not while their counterparts in other countries are using such systems to compete with them on the international currency futures market. Assuming the government did somehow ban AI-based trading systems, how would this be enforced? Would a programmer at a hedge fund be stopped from inserting some more-effective machine learning code in place of the government-sanctioned linear regression code?

The US military will not give up their AI-based planning and scheduling systems, as otherwise they would be unable to utilize their military resources effectively.

The idea of the government placing an IQ limit on the AI characters in video games, out of fear that these characters might one day become too smart, also seems absurd. Even if the government did so, hackers worldwide would still be drawn to release "mods" with their own smart AIs inserted illicitly into games; and one might see a subculture of pirate games with illegally smart AI.

“Okay, but all these examples are narrow AI, not AGI!” you may argue. “Banning AI that occurs embedded inside practical products is one thing; banning autonomous AGI systems with their own motivations and self-autonomy and the ability to take over the world and kill all humans is quite another!” Note though that the professional AI community does not yet draw a clear border between narrow AI and AGI. While we do believe there is a clear qualitative conceptual distinction, we would find it hard to embody this distinction in a rigorous test for distinguishing narrow AI systems from “proto-AGI systems” representing dramatic partial progress toward human-level AGI. At precisely what level of intelligence would you propose to ban a conversational natural language search interface, an automated call center chatbot, or a house-cleaning robot? How would you distinguish rigorously, across all areas of application, a competent non-threatening narrow-AI system from something with sufficient general intelligence to count as part of the path to dangerous AGI?

A recent workshop of a dozen AGI experts, oriented largely toward originating such tests, failed to come to any definitive conclusions (Adams et al. 2010), recommending instead that a looser mode of evaluation be adopted, involving qualitative synthesis of multiple rigorous evaluations obtained in multiple distinct scenarios. A previous workshop with a similar theme, funded by the US Naval Research Office, came to even less distinct conclusions (Laird et al. 2009). The OpenCog system is explicitly focused on AGI rather than narrow AI, but its various learning modules are also applicable as narrow AI systems, and some of them have largely been developed in this context. In short, there’s no rule for distinguishing narrow AI work from proto-AGI work that is sufficiently clear to be enshrined in government policy, and the banning of narrow AI work seems infeasible as the latter is economically and humanistically valuable, tightly interwoven with nearly all aspects of the economy, and nearly always non-threatening in nature. Even in the military context, the biggest use of AI is in relatively harmless-sounding contexts such as back-end logistics systems, not in frightening applications like killer robots.

Surveying history, one struggles to find good examples of advanced, developed economies slowing the development of any technology with a nebulous definition, obvious wide-ranging short- to medium-term economic benefits, and rich penetration into multiple industry sectors, on the basis of speculative long-term risks. Nuclear power research is an example where government policy has slowed things down, but there the perceived economic benefit is relatively modest, the technology is restricted to one sector, the definition of what's being banned is very clear, and the risks are immediate rather than speculative. More worryingly, nuclear weapons research and development continued unabated for years, despite the clear threat it posed.

In summary, we submit that, due to various aspects of the particular nature of AGI and its relation to other technologies and social institutions, it is very unlikely to be explicitly banned, either before or after an AGI Sputnik moment. 

Replies from: AmberDawn
comment by AmberDawn · 2022-06-17T15:15:44.562Z · LW(p) · GW(p)

Thanks! This is interesting.

comment by AmberDawn · 2022-06-10T10:39:43.596Z · LW(p) · GW(p)

My comment-box got glitchy but just to add: this category of intervention might be a good thing to do for people who care about AI safety and don't have ML/programming skills, but do have people skills/comms skills/political skills/etc. 

Maybe lots of people are indeed working on this sort of thing, I've just heard much less discussion of this kind of solution relative to technical solutions.

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-10T21:02:33.464Z · LW(p) · GW(p)

Meta: There's an AI Governance [? · GW] tag and a Regulation and AI Risk [? · GW] tag


My own (very limited) understanding is:

  1. Asking people not to build AI is like asking them to give up a money machine, almost
  2. We need everyone to agree to stop
  3. There is no clear line. With an atom bomb, it is pretty well defined if you sent it or not. It's much more vague with "did you do AI research?"
    1. It's pretty easy to notice if someone sent an atom bomb. Not so easy to notice if they researched AI
  4. AI research is getting cheaper. Today only a few actors can do it, but notice, there are already open source versions of gpt-like models. How long could we hold it back?
  5. Still, people are trying to do things in this direction, and I'm pretty sure that the situation is "try any direction that seems at all plausible"
Replies from: AmberDawn
comment by AmberDawn · 2022-06-17T15:16:56.691Z · LW(p) · GW(p)

Thanks, this is helpful!

comment by Aryeh Englander (alenglander) · 2022-06-07T19:43:09.834Z · LW(p) · GW(p)

[Note that two-axis voting is now enabled for this post. Thanks to the mods for allowing that!]

Replies from: evhub
comment by evhub · 2022-06-09T01:44:01.195Z · LW(p) · GW(p)

Seems worse for this post than one-axis voting imo.

comment by AmberDawn · 2022-06-10T10:17:22.407Z · LW(p) · GW(p)

This is very basic/fundamental compared to many questions in this thread, but I am taking 'all dumb questions allowed' hyper-literally, lol. I have little technical background and though I've absorbed some stuff about AI safety by osmosis, I've only recently been trying to dig deeper into it (and there's lots of basic/fundamental texts I haven't read).

Writers on AGI often talk about AGI in anthropomorphic terms - they talk about it having 'goals', being an 'agent', 'thinking' 'wanting', 'rewards' etc. As I understand it, most AI researchers don't think that AIs will have human-style qualia, sentience, or consciousness. 

But if AI don't have qualia/sentience, how can they 'want things' 'have goals' 'be rewarded', etc? (since in humans, these things seem to depend on our qualia, and specifically our ability to feel pleasure and pain). 

I first realised that I was confused about this when reading Richard Ngo's introduction to AI safety and he was talking about reward functions and reinforcement learning. I realised that I don't understand how reinforcement learning works in machines. I understand how it works in humans and other animals - give the animal something pleasant when it does the desired behaviour and/or painful when it does the bad behaviour. But how can you make a machine without qualia "feel" pleasure or pain? 

When I talked to some friends about this, I came to the conclusion that this is just a subset of 'not knowing how computers work', and it might be addressed by me getting more knowledge about how computers work (on a hardware, or software-communicating-with-hardware, level). But I'm interested in people's answers here. 

Replies from: Kaj_Sotala, yonatan-cale-1, TAG, sil-ver
comment by Kaj_Sotala · 2022-06-10T19:44:46.525Z · LW(p) · GW(p)

Assume you have a very simple reinforcement learning AI that does nothing but chooses between two actions, A and B. And it has a goal of "maximizing reward". "Reward", in this case, doesn't correspond to any qualia; rather "reward" is just a number that results from the AI choosing a particular action. So what "maximize reward" actually means in this context is "choose the action that results in the biggest numbers".

Say that the AI is programmed to initially just try choosing A ten times in a row and B ten times in a row. 

When the AI chooses A, it is shown the following numbers: 1, 2, 2, 1, 2, 2, 1, 1, 1, 2 (total 15).

When the AI chooses B, it is shown the following numbers: 4, 3, 4, 5, 3, 4, 2, 4, 3, 2 (total 34).

After the AI has tried both actions ten times, it is programmed to choose its remaining actions according to the rule "choose the action that has historically had the bigger total". Since action B has had the bigger total, it then proceeds to always choose B.

To achieve this, we don't need to build the AI to have qualia, we just need to be able to build a system that implements a rule like "when the total for action A is greater than the total for action B, choose A, and vice versa; if they're both equal, pick one at random".
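The rule just described can be sketched in a few lines of code. This is an illustrative toy only, not any particular RL library; the randomized reward numbers stand in for those in the example above:

```python
import random

# Toy two-action "AI": reward is just a number the environment returns.
# Nothing is "felt"; the program merely keeps totals and compares them.

def environment(action):
    """Hypothetical reward numbers, roughly matching the example above."""
    if action == "A":
        return random.choice([1, 2])
    else:  # action "B"
        return random.choice([2, 3, 4, 5])

totals = {"A": 0, "B": 0}

# Exploration phase: try each action ten times and add up the numbers shown.
for action in ("A", "B"):
    for _ in range(10):
        totals[action] += environment(action)

def choose():
    """Exploitation phase: pick whichever action has the bigger total so far."""
    if totals["A"] > totals["B"]:
        return "A"
    if totals["B"] > totals["A"]:
        return "B"
    return random.choice(["A", "B"])  # tie: pick one at random
```

The "maximize reward" behavior falls out of the comparison in `choose()`; there is no component anywhere that experiences anything.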

When we say that an AI "is rewarded", we just mean "the AI is shown bigger numbers, and it has been programmed to act in ways that result in it being shown bigger numbers". 

We talk about the AI having "goals" and "wanting" things by an application of the intentional stance [LW · GW]. That's Daniel Dennett's term for the idea that, even if a chess-playing AI had a completely different motivational system than humans do (and chess-playing AIs do have that), we could talk about it having a "goal" of "wanting" to win at chess. If we assume that the AI "wants" to win at chess, then we can make more accurate predictions of its behavior - for instance, we can assume that it won't make moves that are obviously losing moves if it can avoid them. 

What's actually going on is that the chess AI has been programmed with rules like "check whether a possible move would lead to losing the game and if so, try to find another move to play instead". There's no "wanting" in the human sense going on, but it still acts in the kind of a way that a human would act, if that human wanted to win a game of chess. So saying that the AI "wants" to win the game is a convenient shorthand for "the AI is programmed to play the kinds of moves that are more likely to lead it to win the game, within the limits of its ability to predict the likely outcomes of those moves".

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-10T21:18:55.528Z · LW(p) · GW(p)

Is it intuitive to you why a calculator can sum numbers even though it doesn't want/feel anything?

If so, and if an AGI still feels confusing, could you help me pinpoint the difference and I'll continue from there?

( +1 for the question!)

comment by TAG · 2022-06-10T10:45:58.597Z · LW(p) · GW(p)

But if AI don’t have qualia/sentience, how can they ‘want things’ ‘have goals’ ‘be rewarded’, etc?

Functionally. You can regard them all as forms of behaviour.

(since in humans, these things seem to depend on our qualia, and specifically our ability to feel pleasure and pain).

do they depend on qualia, or are they just accompanied by qualia?

Replies from: AmberDawn
comment by AmberDawn · 2022-06-10T11:34:44.435Z · LW(p) · GW(p)

This might be a crux, because I'm inclined to think they depend on qualia.

Why does AI 'behave' in that way? How do engineers make it 'want' to do things?

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-10T14:17:50.242Z · LW(p) · GW(p)

At a very high level, the way reinforcement learning works is that the AI attempts to maximise a reward function. This reward function can be summed up as "The sum of all rewards you expect to get in the future". So using a bunch of maths, the AI looks at the rewards it's got in the past, the rewards it expects to get in the future, and selects the action that maximises the expected future rewards. The reward function can be defined within the algorithm itself, or come from the environment. For instance, if you want to train a four-legged robot to learn to walk, the reward might be the distance travelled in a certain direction. If you want to train it to play an Atari game, the reward is usually the score.

None of this requires any sort of qualia, or for the agent to want things. It's a mathematical equation. AI behaves in the way it behaves as a result of the algorithm attempting to maximise it, and the AI can be said to "want" to maximise its reward function or "have the goal of" maximising its reward function because it reliably takes actions to move towards this outcome if it's a good enough AI.
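As a toy illustration of "the sum of all rewards you expect to get in the future": the discount factor and the predicted reward numbers below are my own illustrative assumptions, not anything from a specific RL system:

```python
# Illustrative sketch: selecting the action with the largest estimated
# future reward. The reward streams and action names are made up.

GAMMA = 0.9  # discount factor: rewards further in the future count for less

def discounted_return(rewards, gamma=GAMMA):
    """Sum of rewards, each discounted by how far in the future it arrives."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Predicted reward streams for two hypothetical candidate actions:
predicted = {
    "walk_forward": [1.0, 1.0, 1.0],   # steady small rewards
    "stand_still":  [2.0, 0.0, 0.0],   # one bigger reward, then nothing
}

# The agent "wants" nothing: it just takes the argmax of a computed number.
best_action = max(predicted, key=lambda a: discounted_return(predicted[a]))
```

Here `discounted_return` of the steady stream is 1 + 0.9 + 0.81 = 2.71, which beats 2.0, so the maximization picks `walk_forward`; the "goal-directedness" is entirely in that argmax.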

comment by Rafael Harth (sil-ver) · 2022-06-10T11:37:39.153Z · LW(p) · GW(p)

Reinforcement Learning is easy to conceptualize. The key missing ingredient is that we explicitly specify algorithms to maximize the reward. So this is disanalogous to humans: to train your 5yo, you need only give the reward and the 5yo may adapt their behavior because they value the reward; in a reinforcement learning agent, the second step only occurs because we make it occur. You could just as well flip the algorithm to pursue minimal rewards instead.

Replies from: AmberDawn
comment by AmberDawn · 2022-06-10T14:13:24.322Z · LW(p) · GW(p)


I think my question is deeper - why do machines 'want' or 'have a goal to' follow the algorithm to maximize reward? How can machines 'find stuff rewarding'? 

Replies from: sil-ver
comment by Rafael Harth (sil-ver) · 2022-06-10T15:44:37.176Z · LW(p) · GW(p)

As far as current systems are concerned, the answer is that (as far as anyone knows) they don't find things rewarding or want things. But they can still run a search to optimize a training signal, and that gives you an agent.

comment by rcs (roger-collell-sanchez) · 2022-06-08T16:23:22.742Z · LW(p) · GW(p)

If you believe in doom in the next 2 decades, what are you doing in your life right now that you would've otherwise not done?

For instance, does it make sense to save for retirement if I'm in my twenties?

Replies from: AnthonyC, yonatan-cale-1
comment by AnthonyC · 2022-06-12T16:40:57.513Z · LW(p) · GW(p)

In different ways from different vantage points, I've always seen saving for retirement as a point of hedging my bets, and I don't think the likelihood of doom changes that for me.

Why do I expect I'll want or have to retire? Well, when I get old I'll get to a point where I can't do useful work any more... unless humans solve aging (in which case I'll have more wealth and still be able to work, which is still a good position), or unless we get wiped out (in which case the things I could have spent the money on may or may not counterfactually matter to me, depending on my beliefs regarding whether past events still have value in a world now devoid of human life).

When I do save for retirement, I use a variety of different vehicles for doing so, each an attempt to hedge against the weakness of some of the others (like possible future changes in laws or tax codes or the relative importance and power of different countries and currencies), but there are some I can't really hedge against, like "we won't use money anymore or live in a capitalist market economy," or "all of my assets will be seized or destroyed by something I don't anticipate."

I might think differently if there was some asset I believed I could buy or thing I could give money to that would meaningfully reduce the likelihood of doom. I don't currently think that. But I do think it's valuable to redirect the portion of my income that goes towards current consumption to focus on things that make my life meaningful to me in the near and medium term. I believe that whether I'm doomed or not, and whether the world is doomed or not. Either way, it's often good to do the same kinds of things in everyday life [LW · GW].

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T22:18:43.824Z · LW(p) · GW(p)

Just saying this question resonates with me, it feels unprocessed for me, and I'm not sure what to do about it.

Thoughts so far:

  1. Enjoy life
  2. I still save money, still prepare mostly normally for the long term
  3. Do get over my psychological barriers and try being useful
    1. Do advocacy, especially with my smart friends
    2. Create a gears-level model if I can, stop relying on experts (so that I can actually TRY to have a useful idea instead of giving up in advance)
comment by Anonymous (currymj) · 2022-06-07T23:34:41.248Z · LW(p) · GW(p)

A lot of the AI risk arguments seem to come mixed together with assumptions about a particular type of utilitarianism, and with a very particular transhumanist aesthetic about the future (nanotech, von Neumann probes, Dyson spheres, tiling the universe with matter in fixed configurations, simulated minds, etc.).

I find these things (especially the transhumanist stuff) to not be very convincing relative to the confidence people seem to express about them, but they also don't seem to be essential to the problem of AI risk. Is there a minimal version of the AI risk arguments that are disentangled from these things?

Replies from: Kaj_Sotala, lc, DaemonicSigil, delesley-hutchins
comment by Kaj_Sotala · 2022-06-10T20:19:09.160Z · LW(p) · GW(p)

There's this [EA · GW], which doesn't seem to depend on utilitarian or transhumanist arguments:

Ajeya Cotra's Biological Anchors report [LW · GW] estimates a 10% chance of transformative AI by 2031, and a 50% chance by 2052. Others (eg Eliezer Yudkowsky [LW · GW]) think it might happen even sooner.

Let me rephrase this in a deliberately inflammatory way: if you're under ~50, unaligned AI might kill you and everyone you know. Not your great-great-(...)-great-grandchildren in the year 30,000 AD. Not even [just] your children. You and everyone you know.

comment by lc · 2022-06-08T00:26:34.370Z · LW(p) · GW(p)

Is there a minimal version of the AI risk arguments that are disentangled from these things?

Yes. I'm one of those transhumanist people, but you can talk about AI risk completely adjacent from that. Tryna write up something that compiles the other arguments.

comment by DaemonicSigil · 2022-06-12T04:57:53.401Z · LW(p) · GW(p)

I'd say AI ruin only relies on consequentialism. What consequentialism means is that you have a utility function, and you're trying to maximize the expected value of your utility function. There are theorems to the effect that if you don't behave as though you are maximizing the expected value of some particular utility function, then you are being stupid in some way. Utilitarianism is a particular case of consequentialism where your utility function is equal to the average happiness of everyone in the world. "The greatest good for the greatest number." Utilitarianism is not relevant to AI ruin because without solving alignment first, the AI is not going to care about "goodness".

The von Neumann probes aren't important to the AI ruin picture either: Humanity would be doomed, probes or no probes. The probes are just a grim reminder that screwing up AI won't only kill all humans, it will also kill all the aliens unlucky enough to be living too close to us.

comment by DeLesley Hutchins (delesley-hutchins) · 2022-06-10T02:24:40.238Z · LW(p) · GW(p)

I ended up writing a short story about this, which involves no nanotech.  :-)

comment by Ben Smith (ben-smith) · 2022-06-09T15:23:21.371Z · LW(p) · GW(p)

It seems like even amongst proponents of a "fast takeoff", we will probably have a few months of time between when we've built a superintelligence that appears to have unaligned values and when it is too late to stop it.

At that point, isn't stopping it a simple matter of building an equivalently powerful superintelligence given the sole goal of destroying the first one?

That almost implies a simple plan for preparation: for every AGI built, researchers agree together to also build a parallel AGI with the sole goal of defeating the first one. Perhaps it would remain dormant until its operators indicate it should act. It would have an instrumental goal of protecting users' ability to come to it and request that the first one be shut down.

Replies from: yonatan-cale-1, MakoYass, ete
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-10T21:43:10.705Z · LW(p) · GW(p)

I think there's no known way to ask an AI to do "just one thing" without doing a ton of harm meanwhile.

See this [LW · GW] on creating a strawberry safely.  Yudkowsky uses the example "[just] burn all GPUs" in his latest post [LW · GW].

comment by mako yass (MakoYass) · 2022-06-09T21:57:14.431Z · LW(p) · GW(p)

Seems useless if the first system pretends convincingly to be aligned (which I think is going to be the norm) so you never end up deploying the second system?

And "defeat the first AGI" seems almost as difficult to formalize correctly as alignment, to me:

  • One problem is that when the unaligned AGI transmits itself to another system, how do you define it as the same AGI? Is there a way of defining identity that doesn't leave open a loophole that the first can escape through in some way?
  • So I'm considering "make the world as if neither of you had ever been made", which wouldn't have that problem, but it's impossible to actually attain this goal, so I don't know how you'd get it to satisfice over it and then turn itself off afterwards; I'm concerned it would become an endless crusade.
comment by plex (ete) · 2022-06-09T18:43:15.815Z · LW(p) · GW(p)

One of the first priorities of an AI in a takeoff would be to disable other projects which might generate AGIs. A weakly superintelligent hacker AGI might be able to pull this off before it could destroy the world. Also, fast takeoff could be less than months by some people's guess.

And what do you think happens when the second AGI wins, then maximizes the universe for "the other AI was defeated". Some serious unintended consequences, even if you could specify it well.

comment by ekka · 2022-06-07T16:40:49.957Z · LW(p) · GW(p)

Who are the AI Capabilities researchers trying to build AGI and think they will succeed within the next 30 years?

Replies from: adam-jermyn, delesley-hutchins
comment by Adam Jermyn (adam-jermyn) · 2022-06-07T18:13:03.442Z · LW(p) · GW(p)

Among organizations, both OpenAI and DeepMind are aiming at AGI and seem confident they will get there. I don't know their internal timelines and don't know if they've stated them...

comment by DeLesley Hutchins (delesley-hutchins) · 2022-06-09T17:06:26.858Z · LW(p) · GW(p)

There are numerous big corporate research labs: OpenAI, DeepMind, Google Research, Facebook AI (Meta), plus lots of academic labs.

The rate of progress has been accelerating.  From 1960 to 2010 progress was incremental, and remained centered around narrow problems (chess) or toy problems.   Since 2015, progress has been very rapid, driven mainly by new hardware and big data.  Long-standing hard problems in ML/AI, such as Go, image understanding, language translation, and logical reasoning, seem to fall on an almost monthly basis now, and huge amounts of money and intellect are being thrown at the field.  The rate of advance from 2015-2022 (only 7 years) has been phenomenal; given another 30, it's hard to imagine that we wouldn't reach an inflection point of some kind.

I think the burden of proof is now on those who don't believe that 30 years is enough time to crack AGI.  You would have to postulate some fundamental difficulty, like finding out that the human brain is doing things that can't be done in silicon, that would somehow arrest the current rate of progress and lead to a new "AI winter."

Historically,  AI researchers have often been overconfident.  But this time does feel different.

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-15T10:37:56.570Z · LW(p) · GW(p)

[extra dumb question warning!]

Why are all the AGI doom predictions around 10%-30% instead of ~99%?

Is it just the "most doom predictions so far were wrong" prior?

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-06-15T11:15:10.671Z · LW(p) · GW(p)

The "Respondents' comments" section of the existential risk [LW · GW] survey I ran last year gives some examples of people's reasoning for different risk levels. My own p(doom) is more like 99%, so I don't want to speak on behalf of people who are less worried. Relevant factors, though, include:

  • Specific reasons to think things may go well. (I gave some of my own here [LW · GW].)
  • Disagreement with various points in AGI Ruin [LW · GW]. E.g., I think a lot of EAs believe some combination of:
    • The alignment problem plausibly isn't very hard. (E.g., maybe we can just give the AGI/TAI a bunch of training data indicating that obedient, deferential, low-impact, and otherwise corrigible behavior is good, and then this will generalize fine in practice without our needing to do anything special.)
    • The field of alignment research has grown fast, and has had lots of promising ideas already.
    • AGI/TAI is probably decades away, and progress toward it will probably be gradual. This gives plenty of time for more researchers to notice "we're getting close" and contribute to alignment research, and for the field in general to get a lot more serious about AI risk.
    • Another consequence of 'AI progress is gradual': Insofar as AI is very dangerous or hard to align, we can expect that there will be disasters like "AI causes a million deaths" well before there are disasters like "AI kills all humans". The response to disasters like "a million deaths" (both on the part of researchers and on the part of policymakers, etc.) would probably be reasonable and helpful, especially with EAs around to direct the response in good directions. So we can expect the response to get better and better as we get closer to transformative AI.
  • General skepticism about our ability to predict the future with any confidence. Even if you aren't updating much on 'most past doom predictions were wrong', you should have less extreme probabilities insofar as you think it's harder to predict stuff in general.
comment by michael_mjd · 2022-06-07T06:58:44.162Z · LW(p) · GW(p)

Has there been any effort to find a "least acceptable" value function, one that we hope would not annihilate the universe or turn it degenerate, even if the outcome itself is not ideal? My example would be to try to teach a superintelligence to value all other agents facing surmountable challenges in a variety of environments. The degeneracy condition of this is that if it does not value the real world, it will simply simulate all agents in a zoo. However, if the simulations are of faithful fidelity, maybe that's not literally the worst thing. Plus, the zoo, to truly be a good test of the agents, would approach being invisible.

Replies from: donald-hobson, AprilSR
comment by Donald Hobson (donald-hobson) · 2022-06-07T11:46:38.133Z · LW(p) · GW(p)

This doesn't select for humanlike minds. You don't want vast numbers of Ataribots similar to current RL, playing games like pong and pac-man. (And a trillion other autogenerated games sampled from the same distribution)


Even if you could somehow ensure it was human minds playing these games, the line between a fun game and total boredom is complex and subtle.

Replies from: michael_mjd
comment by michael_mjd · 2022-06-08T19:12:38.342Z · LW(p) · GW(p)

That is a very fair criticism. I didn't mean to imply this is something I was very confident in, but was interested in for three reasons:

1) This value function aside, is this a workable strategy, or is there a solid reason for suspecting the solution is all-or-nothing? Is it reasonable to 'look for' our values with human effort, or does this have to be something searched for using algorithms?
2) It sort of gives a flavor to what's important in life. Of course the human value function will be a complicated mix of different sensory inputs, reproduction, and goal seeking, but I felt like there's a kernel in there where curiosity is one of our biggest drivers. There was a post here a while back about someone's child being motivated first and foremost by curiosity.

3) An interesting thought occurs to me: supposing we do create a deferential superintelligence, and its cognitive capacities far outpace those of humans, does that mean the majority of consciousness in the universe is from the AI? If so, is it strange to think, is it happy? What is it like to be a god with the values of a child? Maybe I should make a separate comment about this.

Replies from: donald-hobson
comment by Donald Hobson (donald-hobson) · 2022-06-09T13:29:55.782Z · LW(p) · GW(p)

At the moment, we don't know how to make an AI that does something simple like making lots of diamonds. 

It seems plausible that making an AI that copies human values is easier than hardcoding even a crude approximation to human values. Or maybe not. 

comment by AprilSR · 2022-06-07T18:09:36.138Z · LW(p) · GW(p)

The obvious option in this class is to try to destroy the world in a way that doesn't send out an AI to eat the lightcone that might possibly contain aliens who could have a better shot.

I am really not a fan of this option.

comment by nem · 2022-06-08T17:30:35.357Z · LW(p) · GW(p)

I am pretty concerned about alignment. Not SO concerned as to switch careers and dive into it entirely, but concerned enough to talk to friends and make occasional donations. With Eliezer's pessimistic attitude, is MIRI still the best organization to funnel resources towards, if for instance, I was to make a monthly donation?

Not that I don't think pessimism is necessarily bad; I just want to maximize the effectiveness of my altruism.

Replies from: rhollerith_dot_com, None
comment by RHollerith (rhollerith_dot_com) · 2022-06-10T22:36:20.200Z · LW(p) · GW(p)

As far as I know, yes. (I've never worked for MIRI.)

comment by [deleted] · 2022-06-11T15:07:27.535Z · LW(p) · GW(p)
comment by silentbob · 2022-06-08T15:46:18.202Z · LW(p) · GW(p)

Assuming slower and more gradual timelines, isn't it likely that we run into some smaller, more manageable AI catastrophes before "everybody falls over dead" due to the first ASI going rogue? Maybe we'll be at a state of sub-human level AGIs for a while, and during that time some of the AIs clearly demonstrate misaligned behavior leading to casualties (and general insights into what is going wrong), in turn leading to a shift in public perception. Of course it might still be unlikely that the whole globe at that point stops improving AIs and/or solves alignment in time, but it would at least push awareness and incentives somewhat into the right direction.

Replies from: Jay Bailey, lorenzo-rex
comment by Jay Bailey · 2022-06-12T11:00:45.206Z · LW(p) · GW(p)

This does seem very possible if you assume a slower takeoff.

comment by lorepieri (lorenzo-rex) · 2022-06-12T17:35:42.735Z · LW(p) · GW(p)

This is the most likely scenario, with AGI getting heavily regulated, similarly to nuclear. It doesn't get much publicity because it's "boring". 

comment by MichaelStJules · 2022-06-08T06:41:51.549Z · LW(p) · GW(p)

Is cooperative inverse reinforcement learning promising? Why or why not?

Replies from: Gres
comment by Gres · 2022-06-14T08:33:37.620Z · LW(p) · GW(p)

I can't claim to know any more than the links just before section IV here: It's viewed as maybe promising, or as part of the solution. There's a problem if the program erroneously thinks it knows the humans' preferences, or if it anticipates that it can learn the humans' preferences and produce a better action than the humans would otherwise take. Since "accept a shutdown command" is a last-resort option, ideally it wouldn't depend on the program not thinking something erroneously. Yudkowsky proposed the second idea here; there's a discussion of that and other responses here. I don't know how the CIRL researchers respond to these challenges.

comment by Benjamin · 2022-06-07T23:39:59.040Z · LW(p) · GW(p)

It seems like instrumental convergence is restricted to agent AIs; is that true?

Also, what is going on with mesa-optimizers? Why is it expected that they will be more likely to become agentic than the base optimizer when they are more resource-constrained?

Replies from: ete
comment by plex (ete) · 2022-06-08T14:38:45.924Z · LW(p) · GW(p)

The more agentic a system is the more it is likely to adopt convergent instrumental goals, yes.

Why agents are powerful [LW · GW] explores why agentic mesa optimizers might arise accidentally during training. In particular, agents are an efficient way to solve many challenges, so the mesa optimizer being resource constrained would lean in the direction of more agency under some circumstances.

comment by niplav · 2022-06-07T08:58:28.404Z · LW(p) · GW(p)

Let's say we decided that we'd mostly given up on fully aligning AGI, and had decided to find a lower bound for the value of the future universe given that someone would create it. Let's also assume this lower bound was something like "Here we have a human in a high-valence state. Just tile the universe with copies of this volume (where the human resides) from this point in time to this other point in time." I understand that this is not a satisfactory solution, but bear with me.

How much easier would the problem become? It seems easier than a pivotal-act AGI.

Things I know that will still make this hard:

  • Inner alignment
  • Ontological crises
  • Wireheading

Things we don't have to solve:

  • Corrigibility
  • Low impact (although if we had a solution for low impact, we might just try to tack it on to the resulting agent and find out whether it works)
  • Value learning
Replies from: Lukas_Gloor
comment by Lukas_Gloor · 2022-06-07T09:27:05.135Z · LW(p) · GW(p)

You may get massive s-risk at comparatively little potential benefit with this. On many people's values, the future you describe may not be particularly good anyway, and there's an increased risk of something going wrong because you'd be trying a desperate effort with something you'd not fully understand. 

Replies from: niplav
comment by niplav · 2022-06-07T09:29:34.440Z · LW(p) · GW(p)

Ah, I forgot to add that this is a potential s-risk. Yeah.

Although I disagree that that future would be close to zero. My values tell me it would be at least a millionth as good as the optimal future, and at least a million times more valuable than a completely consciousness-less universe.

comment by Aryeh Englander (alenglander) · 2022-06-07T05:57:04.678Z · LW(p) · GW(p)

Background material recommendations (popular-level audience, several hours time commitment): Please recommend your favorite basic AGI safety background reading / videos / lectures / etc. For this sub-thread please only recommend background material suitable for a popular level audience. Time commitment is allowed to be up to several hours, so for example a popular-level book or sequence of posts would work. Extra bonus for explaining why you particularly like your suggestion over other potential suggestions, and/or for elaborating on which audiences might benefit most from different suggestions.

Replies from: ete, Jay Bailey, james.lucassen, alex-lszn
comment by plex (ete) · 2022-06-07T11:12:34.630Z · LW(p) · GW(p)

Stampy has the canonical version of this: I’d like a good introduction to AI alignment. Where can I find one?

Feel free to improve the answer, as it's on a wiki. It will be served via a custom interface once that's ready (prototype here).

comment by Jay Bailey · 2022-06-07T15:19:12.754Z · LW(p) · GW(p)

Human Compatible is the first book on AI Safety I read, and I think it was the right choice. I read The Alignment problem and Superintelligence after that, and I think that's the right order if you end up reading all three, but Human Compatible is a good start.

comment by james.lucassen · 2022-06-08T00:21:51.642Z · LW(p) · GW(p)

Whatever you end up doing, I strongly recommend taking a learning-by-writing style approach (or anything else that will keep you in critical assessment mode rather than classroom mode). These ideas are nowhere near solidified enough to merit a classroom-style approach, and even if they were infallible, that's probably not the fastest way to learn them and contribute original stuff.

The most common failure mode I expect for rapid introductions to alignment is just trying to absorb, rather than constantly poking and prodding to get a real working understanding. This happened to me, and wasted a lot of time.

comment by Alex Lawsen (alex-lszn) · 2022-06-07T10:37:49.964Z · LW(p) · GW(p)

The Alignment Problem - Easily accessible, well written and full of interesting facts about the development of ML. Unfortunately somewhat light on actual AI x-risk, but in many cases is enough to encourage people to learn more.

Edit: Someone strong-downvoted this, I'd find it pretty useful to know why.  To be clear, by 'why' I mean 'why does this rec seem bad', rather than 'why downvote'. If it's the lightness on x-risk stuff I mentioned, this is useful to know, if my description seems inaccurate, this is very useful for me to know, given that I am in a position to recommend books relatively often. Happy for the reasoning to be via DM if that's easier for any reason.

Replies from: johnlawrenceaspden
comment by johnlawrenceaspden · 2022-06-09T16:04:18.135Z · LW(p) · GW(p)

I read this, and he spent a lot of time convincing me that AI might be racist and very little time convincing me that AI might kill me and everyone I know without any warning. It's the second possibility that seems to be the one people have trouble with.

comment by tgb · 2022-06-18T11:29:17.076Z · LW(p) · GW(p)

What does the Fermi paradox tell us about AI futures, if anything? I have a hard time simultaneously believing both "we will accidentally tile the universe with paperclips" and "the universe is not yet tiled with paperclips". Is the answer just that the Great Filter is already past?

And what about the anthropic principle? Am I supposed to believe that the universe went like 13 billion years without much in the way of intelligent life, then for a brief few millennia there's human civilization with me in it, and then the next N billion years it's just paperclips?

Replies from: tgb
comment by tgb · 2022-06-18T11:46:30.644Z · LW(p) · GW(p)

I see now that this has been discussed here in this thread already, at least the Fermi part. Oops!

comment by wachichornia · 2022-06-12T07:57:54.302Z · LW(p) · GW(p)

I have a very rich, smart developer friend who knows a lot of influential people in SV. First employee of a unicorn, he retired from work after a very successful IPO and now he's just finding interesting startups to invest in. He had never heard of LessWrong when I mentioned it and is not familiar with AI research.

If anyone can point me to a way to present AGI safety to him to maybe turn his interest to invest his resources in the field, that might be helpful

Replies from: rachelAF, yonatan-cale-1, aditya-prasad
comment by rachelAF · 2022-06-24T16:18:59.619Z · LW(p) · GW(p)

As an AI researcher, my favourite way to introduce other technical people to AI Alignment is Brian Christian’s book “The Alignment Problem” (particularly section 3). I like that it discusses specific pieces of work, with citations to the relevant papers, so that technical people can evaluate things for themselves as interested. It also doesn’t assume any prior AI safety familiarity from the reader (and brings you into it slowly, starting with mainstream bias concerns in modern-day AI).

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-14T20:32:30.560Z · LW(p) · GW(p)

My answer for myself is that I started practicing: I started talking to some friends about this, hoping to get better at presenting the topic (which is currently something I'm kind of afraid to do) (I also have other important goals like getting an actual inside view model of what's going on)


If you want something more generic, here's one idea:

comment by Aditya (aditya-prasad) · 2022-06-12T12:15:04.584Z · LW(p) · GW(p)

When I talk to my friends, I start with the alignment problem. I found this analogy to human evolution really drives home the point that it’s a hard problem. We aren’t close to solving it.

At that point, questions come up about whether intelligence necessarily means morality, so I talk about the orthogonality thesis. Then, as for why the AI would care about anything other than what it was explicitly told to do: the danger comes from instrumental convergence.

Finally, people tend to say we can never do it; they talk about spirituality and the uniqueness of human intelligence. So I need to talk about evolution hill-climbing its way to animal intelligence, and how narrow AI has small models while we just need AGI to have a generalised world model. Brains are just electrochemical complex systems. It's not magic.

Talk about Pathways, Imagen, GPT-3 and what they can do, and talk about how scaling seems to be working.

So it makes sense that we might have AGI in our lifetime, and we have tons of money and brains working on building AI capability, but fewer on safety.

Try practising on other smart friends and develop your skill. You need to ensure people don't get bored, so you can't use too much time. Use nice analogies. Have answers to frequent questions ready.

comment by ViktoriaMalyasova · 2022-06-10T22:43:55.425Z · LW(p) · GW(p)

What is Fathom Radiant's theory of change?

Fathom Radiant is an EA-recommended company whose stated mission is to "make a difference in how safely advanced AI systems are developed and deployed". They propose to do that by developing "a revolutionary optical fabric that is low latency, high bandwidth, and low power. The result is a single machine with a network capacity of a supercomputer, which enables programming flexibility and unprecedented scaling to models that are far larger than anything yet conceived." I can see how this will improve model capabilities, but how is this supposed to advance AI safety?

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-10T20:54:22.849Z · LW(p) · GW(p)

What if we'd upload a person's brain to a computer and run 10,000 copies of them and/or run them very quickly?

Seems as-aligned-as-an-AGI-can-get (?)

Replies from: Jay Bailey, Charlie Steiner, None
comment by Jay Bailey · 2022-06-11T00:34:11.845Z · LW(p) · GW(p)

The best argument against this I've heard is that technology isn't built in a vacuum - if you build the technology to upload people's brains, then before you have the technology to upload people's brains, you probably have the technology to almost upload people's brains and fill in the gap yourself, creating neuromorphic AI that has all the same alignment problems as anything else.

Even so, I'm not convinced this is definitively true - if you can upload an entire brain at 80% of the necessary quality, "filling in" that last 20% does not strike me as an easy problem, and it might be easier to improve fidelity of uploading than to engineer a fix for it.

comment by Charlie Steiner · 2022-06-11T19:29:34.329Z · LW(p) · GW(p)

Well, not as aligned as the best case - humans often screw things up for themselves and each other, and emulated humans might just do that but faster. (Wei Dai might call these "human safety problems [LW · GW]".)

But probably, it would be good.

From a strategic standpoint, I unfortunately don't think this seems to inform strategy too much, because afaict scanning brains is a significantly harder technical problem than building de novo AI.

Replies from: AnthonyC, sharmake-farah
comment by AnthonyC · 2022-06-12T16:59:57.768Z · LW(p) · GW(p)

I think the observation that it just isn't obvious that ems will come before de novo AI is sufficient to worry about the problem in the case that they don't. Possibly while focusing more capabilities development towards creating ems (whatever that would look like)?

Also, would ems actually be powerful and capable enough to reliably stop a world-destroying non-em AGI, or an em about to make some world-destroying mistake because of its human-derived flaws? Or would we need to arm them with additional tools that fall under the umbrella of AGI safety anyway?

comment by Noosphere89 (sharmake-farah) · 2022-06-11T19:38:29.621Z · LW(p) · GW(p)

The only reason we care about AI safety is that we believe the consequences are potentially existential. If they weren't, there would be no need for safety.

comment by [deleted] · 2022-06-11T15:14:17.245Z · LW(p) · GW(p)
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T10:52:06.639Z · LW(p) · GW(p)

Can a software developer help with AI Safety even if they have zero knowledge of ML and zero understanding of AI Safety theory?

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T10:53:06.112Z · LW(p) · GW(p)

Yes, both Anthropic and Redwood want to hire such developers

Replies from: jason-maskell
comment by Jason Maskell (jason-maskell) · 2022-06-10T15:30:02.994Z · LW(p) · GW(p)

Is that true for Redwood? They've got a timed technical screen before application, and their interview involves live coding with Python and ML libraries.

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-10T19:02:52.481Z · LW(p) · GW(p)

I talked to Buck from Redwood about 1 month ago and that's what he told me, and I think we went over this as "a really important point" more than once so I'd know if I misunderstood him (but still please tell me if I'm wrong).

I assume if you tell them that you have zero ML experience, they'll give you an interview without ML libraries, or perhaps something very simple with ML libraries that you could learn on the fly (just like you could learn web scraping or so). This part is just me speculating though. Anyway this is something you could ask them before your first technical interview for sure: "Hey, I have zero ML experience, do you still want to interview me?"

comment by scott loop (scott-loop) · 2022-06-08T12:50:56.159Z · LW(p) · GW(p)

Total noob here, so I'm very thankful for this post. Anyway, why is there such certainty among some that a superintelligence would kill its creators, who are zero threat to it? Any resources on that would be appreciated. As someone who loosely follows this stuff, it seems people assume AGI will be this brutal instinctual killer, which is the opposite of what I've guessed.

Replies from: delesley-hutchins, ete, Perhaps
comment by DeLesley Hutchins (delesley-hutchins) · 2022-06-10T03:36:21.523Z · LW(p) · GW(p)

It's essentially for the same reason that Hollywood thinks aliens will necessarily be hostile.  :-)

For the sake of argument, let's treat AGI as a newly arrived intelligent species.  It thinks differently from us, and has different values.  Historically, whenever there has been a large power differential between a native species and a new arrival, it has ended poorly for the native species.  Historical examples are: the genocide of Native Americans (same species, but less advanced technology), and the wholesale obliteration of 90% of all non-human life on this planet.

That being said, there is room for a symbiotic relationship.  AGI will initially depend on factories and electricity produced by human labor, and thus will necessarily be dependent on humans at first.  How long this period will last is unclear, but it could settle into a stable equilibrium.  After all, humans are moderately clever, self-reproducing computer repair drones, easily controlled by money, comfortable with hierarchy, and which are well adapted to Earth's biosphere.  They could be useful to keep around.

There is also room for an extensive ecology of many different superhuman narrow AIs, each of which can beat humans within a particular domain, but which generalize poorly outside of that domain.  I think this hope is becoming smaller with time, though (see, e.g., Gato), and it is not necessarily a stable equilibrium.

The thing that seems clearly untenable is an equilibrium in which a much less intelligent species manages to subdue and control a much more intelligent species.

comment by plex (ete) · 2022-06-08T15:20:06.124Z · LW(p) · GW(p)

Rob Miles's video on Instrumental Convergence is about this, combine with Maximizers and you might have a decent feel for it.

Replies from: scott-loop
comment by scott loop (scott-loop) · 2022-06-08T15:39:14.020Z · LW(p) · GW(p)

Thank you for these videos.

comment by Perhaps · 2022-06-08T14:20:34.959Z · LW(p) · GW(p)

In terms of utility functions, the most basic is: do what you want. "Want" here refers to whatever values the agent values. But in order for the "do what you want" utility function to succeed effectively, there's a lower level that's important: be able to do what you want. 

Now for humans, that usually refers to getting a job, planning for retirement, buying insurance, planning for the long-term, and doing things you don't like for a future payoff. Sometimes humans go to war in order to "be able to do what you want", which should show you that satisfying a utility function is important.

For an AI which most likely has a straightforward utility function, and which has all the capabilities to execute it (assuming you believe that a superintelligent AGI could develop nanotech, get root access to the datacenter, etc.), humans are in the way of "being able to do what you want". Humans in this case would probably not like an unaligned AI, and would try to shut it down, or at least not die themselves. Most likely, the AI has a utility function that has no use for humans, and thus they are just resources standing in the way. Therefore the AI goes to war against humans to maximize its possible reward, and all the humans die.

Replies from: scott-loop
comment by scott loop (scott-loop) · 2022-06-08T15:40:30.754Z · LW(p) · GW(p)

Thanks for the response. Definitely going to dive deeper into this.

comment by Long time lurker · 2022-06-08T08:51:57.175Z · LW(p) · GW(p)

/Edit 1: I want to preface this by saying I am just a noob who has never posted on Less Wrong before.

/Edit 2: 

I feel I should clarify my main questions (which are controversial): Is there a reason why turning all of reality into maximized conscious happiness is not objectively the best outcome for all of reality, regardless of human survival and human values?
Should this in any way affect our strategy to align the first AGI, and why?

/Original comment:

If we zoom out and look at the biggest picture philosophically possible, then aren't the only things that ultimately matter in the end these 2 things - the level of consciousness and the overall "happiness" of said consciousness(es) throughout all of time and space (counting all realities that have existed, are existing, and will exist)?

To clarify: isn't the best possible outcome for all of reality one where every particle is utilized to experience a maximally conscious and maximally "happy" state for eternity? (I put happiness in quotes because how do you measure the "goodness" of a state, or consciousness itself for that matter.)

After many years of reading countless alignment discussions (of which I have understood maybe 20%), I have never seen this mentioned. So I wonder: if we are dealing with a super optimizer, shouldn't we be focusing on the super big picture?

I realize this might seem controversial, but I see no rational reason why it wouldn't be true. Although my knowledge of rationality is very limited.

Replies from: Kaj_Sotala, AnthonyC, Charlie Steiner
comment by Kaj_Sotala · 2022-06-10T20:14:54.270Z · LW(p) · GW(p)

What would it mean for an outcome to be objectively best for all of reality?

It might be your subjective opinion that maximized conscious happiness would be the objectively best reality. Another human's subjective opinion might be that a reality that maximized the fulfillment of fundamentalist Christian values was the objectively best reality. A third human might hold that there's no such thing as the objectively best, and all we have are subjective opinions.

Given that different people disagree, one could argue that we shouldn't privilege any single person's opinion, but try to take everyone's opinions into account - that is, build an AI that cared about the fulfillment of something like "human values".

Of course, that would be just their subjective opinion. But it's the kind of subjective opinion that the people involved in AI alignment discussions tend to have.

Replies from: Kerrigan
comment by Kerrigan · 2022-12-17T07:20:44.020Z · LW(p) · GW(p)

Suppose everyone agreed that the proposed outcome is what we wanted. Would this scenario then be difficult to achieve?

comment by AnthonyC · 2022-06-12T17:05:24.655Z · LW(p) · GW(p)

The fact that the statement is controversial is, I think, the reason. What makes a world-state or possible future valuable is a matter of human judgment, and not every human believes this. 

EY's short story Three Worlds Collide [? · GW] explores what can happen when beings with different conceptions of what is valuable, have to interact. Even when they understand each other's reasoning, it doesn't change what they themselves value. Might be a useful read, and hopefully a fun one.

Replies from: Kerrigan
comment by Kerrigan · 2022-12-17T07:24:26.643Z · LW(p) · GW(p)

I'll ask the same follow-up question to similar answers: Suppose everyone agreed that the proposed outcome above is what we wanted. Would this scenario then be difficult to achieve?

Replies from: AnthonyC
comment by AnthonyC · 2022-12-23T00:45:35.348Z · LW(p) · GW(p)

I mean, yes, because the proposal is about optimizing our entire future light cone for an outcome we don't know how to formally specify.

Replies from: Kerrigan
comment by Kerrigan · 2023-08-26T20:08:03.145Z · LW(p) · GW(p)

Could you have a machine hooked up to a person's nervous system, change the settings slightly to change consciousness, and let the person choose whether the changes are good or bad? Run this many times.

Replies from: AnthonyC
comment by AnthonyC · 2023-08-28T22:48:14.860Z · LW(p) · GW(p)

I don't think this works. One, it only measures short term impacts, but any such change might have lots of medium and long term effects, second and third order effects, and effects on other people with whom I interact. Two, it measures based on the values of already-changed me, not current me, and it is not obvious that current-me cares what changed-me will think, or why I should so care if I don't currently. Three, I have limited understanding of my own wants, needs, and goals, and so would not trust any human's judgement of such changes far enough to extrapolate to situations they didn't experience, let alone to other people, or the far future, or unusual/extreme circumstances.

comment by Charlie Steiner · 2022-06-11T18:54:30.447Z · LW(p) · GW(p)

For a more involved discussion than Kaj's answer, you might check out the "Mere Goodness" section of Rationality: A-Z [? · GW].

comment by Aryeh Englander (alenglander) · 2022-06-07T20:00:55.113Z · LW(p) · GW(p)

Please describe or provide links to descriptions of concrete AGI takeover scenarios that are at least semi-plausible, and especially takeover scenarios that result in human extermination and/or eternal suffering (s-risk). Yes, I know that the arguments don't necessarily require that we can describe particular takeover scenarios, but I still find it extremely useful to have concrete scenarios available, both for thinking purposes and for explaining things to others.

Replies from: cousin_it, Evan R. Murphy, delton137, alenglander, alenglander, alenglander, alenglander, alenglander
comment by cousin_it · 2022-06-08T16:01:49.865Z · LW(p) · GW(p)

Without nanotech or anything like that, maybe the easiest way is to manipulate humans into building lots of powerful and hackable weapons (or just wait since we're doing it anyway). Then one day, strike.

Edit: and of course the AI's first action will be to covertly take over the internet, because the biggest danger to the AI is another AI already existing or being about to appear. It's worth taking a small risk of being detected by humans to prevent the bigger risk of being outraced by a competitor.

comment by Evan R. Murphy · 2022-06-09T17:58:39.383Z · LW(p) · GW(p)

This new series of posts from Holden Karnofsky (CEO of Open Philanthropy) is about exactly this. The first post came out today: [LW · GW]

comment by delton137 · 2022-06-11T23:27:35.781Z · LW(p) · GW(p)

I find slower take-off scenarios more plausible. I like the general thrust of Christiano's "What failure looks like [LW · GW]". I wonder if anyone has written up a more narrative / concrete account of that sort of scenario.

comment by Aryeh Englander (alenglander) · 2022-06-07T20:12:54.223Z · LW(p) · GW(p)

Alexey Turchin and David Denkenberger describe several scenarios here: (additional recent discussion in this comment thread [LW(p) · GW(p)])

comment by Aryeh Englander (alenglander) · 2022-06-07T20:07:17.758Z · LW(p) · GW(p)

Eliezer's go-to scenario (from his recent post [LW · GW]):

The concrete example I usually use here is nanotech, because there's been pretty detailed analysis of what definitely look like physically attainable lower bounds on what should be possible with nanotech, and those lower bounds are sufficient to carry the point.  My lower-bound model of "how a sufficiently powerful intelligence would kill everyone, if it didn't want to not do that" is that it gets access to the Internet, emails some DNA sequences to any of the many many online firms that will take a DNA sequence in the email and ship you back proteins, and bribes/persuades some human who has no idea they're dealing with an AGI to mix proteins in a beaker, which then form a first-stage nanofactory which can build the actual nanomachinery.  (Back when I was first deploying this visualization, the wise-sounding critics said "Ah, but how do you know even a superintelligence could solve the protein folding problem, if it didn't already have planet-sized supercomputers?" but one hears less of this after the advent of AlphaFold 2, for some odd reason.)  The nanomachinery builds diamondoid bacteria, that replicate with solar power and atmospheric CHON, maybe aggregate into some miniature rockets or jets so they can ride the jetstream to spread across the Earth's atmosphere, get into human bloodstreams and hide, strike on a timer.  Losing a conflict with a high-powered cognitive system looks at least as deadly as "everybody on the face of the Earth suddenly falls over dead within the same second". 

comment by Aryeh Englander (alenglander) · 2022-06-07T20:03:22.648Z · LW(p) · GW(p)

(very detailed but also very long and very full of technical jargon; on the other hand, I think it's mostly understandable even if you have to gloss over most of the jargon)

comment by starship006 (cody-rushing) · 2022-06-07T19:32:15.508Z · LW(p) · GW(p)

I have a few related questions pertaining to AGI timelines. I've been under the general impression that when it comes to timelines on AGI and doom, Eliezer's predictions are based on a belief in extraordinarily fast AI development, and thus a close AGI arrival date, which I currently take to mean a quicker date of doom. I have three questions related to this matter:

  1. For those who currently believe that AGI (using whatever definition to describe AGI as you see fit) will be arriving very soon - which, if I'm not mistaken, is what Eliezer is predicting - approximately how soon are we talking about? Is this 2-3 years soon? 10 years soon?  (I know Eliezer has a bet that the world will end before 2030, so I'm trying to see if there has been any clarification of how soon before 2030.)
  2. How much does Eliezer's views on timelines vary in comparison to other big-name AI safety researchers?
  3. I'm currently under the impression that it takes a significant amount of knowledge of Artificial Intelligence to be able to accurately attempt to predict timelines related to AGI. Is this impression correct? And if so, would it be a good idea to reference general consensus opinions such as Metaculus when trying to frame how much time we have left?
Replies from: conor-sullivan, delesley-hutchins
comment by Lone Pine (conor-sullivan) · 2022-06-08T03:32:17.624Z · LW(p) · GW(p)

There's actually two different parts to the answer, and the difference is important. There is the time between now and the first AI capable of autonomously improving itself (time to AGI), and there's the time it takes for the AI to "foom", meaning improve itself from a roughly human level towards godhood. In EY's view, it doesn't matter at all how long we have between now and AGI, because foom will happen so quickly and will be so decisive that no one will be able to respond and stop it. (Maybe, if we had 200 years we could solve it, but we don't.) In other people's view (including Robin Hanson and Paul Christiano, I think) there will be "slow takeoff." In this view, AI will gradually improve itself over years, probably working with human researchers in that time but progressively gathering more autonomy and skills. Hanson and Christiano agree with EY that doom is likely. In fact, in the slow takeoff view ASI might arrive even sooner than in the fast takeoff view.

Replies from: Heighn, silentbob
comment by Heighn · 2022-06-15T12:58:29.707Z · LW(p) · GW(p)

Hanson and Christiano agree with EY that doom is likely.

I'm not sure about Hanson, but Christiano is a lot more optimistic than EY.

comment by silentbob · 2022-06-08T15:41:14.828Z · LW(p) · GW(p)

Isn't it conceivable that improving intelligence turns out to become difficult more quickly than the AI is scaling? E.g. couldn't it be that somewhere around human-level intelligence, improving intelligence by every marginal percent becomes twice as difficult as the previous percent? I admit that doesn't sound very likely, but if that was the case, then even a self-improving AI would potentially improve itself very slowly, and maybe even sub-linearly rather than exponentially, wouldn't it?
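A toy model of that possibility (the doubling assumption here is purely illustrative, not a claim about real returns to intelligence): if each marginal unit of capability costs twice the effort of the last, then cumulative effort E only buys about log2(E) units, so capability grows logarithmically rather than exploding.

```python
import math

def intelligence_after(effort: float) -> int:
    """Capability units gained if unit n costs 2**n effort.

    Cumulative cost of n units is 2**n - 1, so solve 2**n - 1 <= effort.
    """
    return math.floor(math.log2(effort + 1))

# A millionfold increase in effort buys only ~19 capability units:
for effort in [1, 10, 100, 10_000, 1_000_000]:
    print(effort, intelligence_after(effort))
```

Under that (unproven) assumption the curve flattens instead of fooming; the open question is whether returns actually diminish that fast anywhere near the human range.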

comment by DeLesley Hutchins (delesley-hutchins) · 2022-06-10T03:13:39.068Z · LW(p) · GW(p)

For a survey of experts, see:

Most experts expect AGI between 2030 and 2060, so predictions before 2030 are definitely in the minority.

My own take is that a lot of current research is focused on scaling, and has found that deep learning scales quite well to very large sizes.  This finding is replicated in evolutionary studies; one of the main differences between the human brain and the chimpanzee is just size (neuron count), pure and simple.

As a result, the main limiting factor thus appears to be the amount of hardware that we can throw at the problem. Current research into large models is very much hardware limited, with only the major labs (Google, DeepMind, OpenAI, etc.) able to afford the compute costs to train large models.  Iterating on model architecture at large scales is hard because of the costs involved.  Thus, I personally predict that we will achieve AGI only when the cost of compute drops to the point where FLOPs roughly equivalent to the human brain can be purchased on a more modest budget; the drop in price will open up the field to more experimentation.

We do not have AGI yet even on current supercomputers, but it's starting to look like we might be getting close (close = within a factor of 10 or 100).  Assuming continued progress in Moore's law (not at all guaranteed), another 15-20 years will lead to another 1000x drop in the cost of compute, which is probably enough for numerous smaller labs with smaller budgets to really start experimenting.  The big labs will have a few years' head start, but if they don't figure it out, they will be well positioned to scale into super-intelligent territory immediately, as soon as the small labs help make whatever breakthroughs are required.  The longer it takes to solve the software problem, the more hardware we'll have available to scale with immediately, which means a faster foom.  Getting AGI sooner may thus yield a better outcome.

I would tentatively put the date at around 2035, +/- 5 years.  

If we run into a roadblock that requires substantially new techniques (e.g., gradient descent isn't enough) then the timeline could be pushed back.  However, I haven't seen much evidence that we've hit any fundamental algorithmic limitations yet.

comment by tgb · 2022-06-18T11:42:59.181Z · LW(p) · GW(p)

Any progress or interest in finding limited uses of AI that would be safe? Like the "tool AI" idea, but designed to be robust. Maybe this is a distraction, but it seems basically possible. For example, a proof-finding AI that, given a math statement, can only output a proof to a separate proof-checking computer, which validates it and prints either True/False/Unknown as the only output to human eyes. Here "Unknown" could indicate that the AI gave a bogus proof, failed to give any proof of either True or False, or that the proof checker ran out of time/memory checking it. The proof is destroyed after it's checked. Is that safe?

I can imagine some tiny scenarios where it's not. The AI could try to communicate with humans even with this limited output band: e.g., disprove the Riemann hypothesis, but then if we try to find non-trivial zeros by binary search, it could intentionally output "Unknown"s to guide us to certain non-trivial zeros over others, and the digits of these could encode something?? Or the AI could infer something about our physics from the questions we ask, or random bit-flips it notices in its machine could tell it about cosmic rays/machine architecture, and it comes up with some dastardly way to hack other machines even though it's on a dedicated, air-gapped machine. These aren't entirely unimaginable, but I'd also think a writer was being lazy if they used this as a plotline; they might as well just have said "magic".
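
A minimal sketch of the prover/checker protocol described above. All names here are hypothetical, and the "prover" is a dumb stand-in for the boxed AI; the toy statement is compositeness, with a nontrivial factor as the proof:

```python
# Toy sketch of the boxed prover/checker idea. The untrusted prover stands
# in for the AI; the trusted checker is the only component that talks to
# humans, and it emits exactly one of three strings.

def untrusted_prover(n):
    """Stand-in for the boxed AI: try to produce a witness that n is composite."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d  # the "proof": a nontrivial factor
        d += 1
    return None       # prover failed to find a proof

def checked_answer(n):
    """Trusted side: verify the proof; only a three-valued verdict escapes."""
    witness = untrusted_prover(n)
    if witness is None:
        return "Unknown"      # no proof, or the prover gave up
    if 1 < witness < n and n % witness == 0:
        result = "True"       # verified: n really is composite
    else:
        result = "Unknown"    # a bogus proof fails the check
    del witness               # "the proof is destroyed after it's checked"
    return result

print(checked_answer(91))  # 91 = 7 * 13, so the proof checks out
print(checked_answer(97))  # 97 is prime, so no witness exists
```

The point of the sketch is the information bottleneck: the proof itself never reaches human eyes, only the verdict does, which is what makes the binary-search side channel above the remaining worry.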

comment by deathspiral · 2022-06-14T06:34:39.289Z · LW(p) · GW(p)

Is it "alignment" if, instead of AGI killing us all, humans change what it is to be human so much that we are almost unrecognizable to our current selves?

I can foresee a lot of scenarios where humans offload more and more of their cognitive capacity to silicon, but they are still "human" - does that count as a solution to the alignment problem?

If we all decide to upload our consciousness to the cloud, and become fast enough and smart enough to stop any dumb AGI before it can get started, is THAT a solution?

Even today, I offload more and more of my "self" to my phone and other peripherals. I use autocomplete to text people, rather than writing every word, for example. My voicemail uses my voice to answer calls and other people speak to it, not me. I use AI to tell me which emails I should pay attention to and a calendar to augment my memory. "I" already exist, in part, in the cloud and I can see more and more of myself existing there over time.

Human consciousness isn't single-threaded. I have more than one thought running at the same time. It's not unlikely that some of them will soon partially run outside my meat body. To me, this seems like the solution to the alignment problem: make human minds run (more) outside of their current bodies, to the point that they can keep up with any AGI that tries to get smarter than them.

Frankly, I think if we allow AGI to get smarter than us (collectively, at least), we're all fucked. I don't think we will ever be able to align a super-intelligent AGI. I think our only solution is to change what it means to be human instead.

What I am getting at is: are we trying to solve the problem of saving a static version of humanity as it exists today, or are we willing to accept that one solution to Alignment may be for humanity to change significantly instead?

Replies from: yonatan-cale-1, sharmake-farah
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-14T20:02:04.730Z · LW(p) · GW(p)

I personally like the idea of uploading ourselves (and asked about it here [LW(p) · GW(p)]).

Note that even if we are uploaded - if someone creates an unaligned AGI that is MUCH SMARTER than us, it will still probably kill us.

"keeping up" in the sense of improving/changing/optimizing so quickly that we'd compete with software that is specifically designed (perhaps by itself) to do that - seems like a solution I wouldn't be happy with. As much as I'm ok with posting my profile picture on Facebook, there are some degrees of self modification that I'm not ok with

comment by Noosphere89 (sharmake-farah) · 2022-06-14T14:19:26.353Z · LW(p) · GW(p)

Ding Ding Ding, we have a winner here. Strong upvote.

comment by MSRayne · 2022-06-13T23:11:29.635Z · LW(p) · GW(p)

Why wouldn't it be sufficient to solve the alignment problem by just figuring out exactly how the human brain works, and copying that? The result would at worst be no less aligned to human values than an average human. (Presuming of course that a psychopath's brain was not the model used.)

Replies from: charbel-raphael-segerie
comment by Charbel-Raphaël (charbel-raphael-segerie) · 2022-06-14T11:09:35.806Z · LW(p) · GW(p)

The first plane didn't emulate birds. The first AGI probably won't be based on reverse-engineering the brain. The Blue Brain Project is unlikely to finish reproducing the brain before DeepMind finds the right architecture.

But I agree that being able to reverse-engineer the brain is very valuable for alignment; this is one of the paths described here [LW · GW], in the final post of intro-to-brain-like-agi-safety, section "Reverse-engineer human social instincts".

comment by Tiuto (timothy-currie) · 2022-06-13T11:36:24.222Z · LW(p) · GW(p)

I am interested in working on AI alignment but doubt I'm clever enough to make any meaningful contribution, so how hard is it to work on AI alignment? I'm currently a high school student, so I could basically plan my whole life so that I end up a researcher or software engineer or something else. Since alignment is very difficult and very intelligent people are already working on it, it seems like I would have to be almost some kind of math/computer/ML genius to help at all. I'm definitely above average; my IQ is like 121 (I know the limitations of IQ as a measurement and that it's not that important). In school I'm pretty good at math and other sciences, but not even the best in my class of 25. So my question is basically: how clever does one have to be to contribute to AGI alignment?

Replies from: yonatan-cale-1, charbel-raphael-segerie
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-14T20:27:52.015Z · LW(p) · GW(p)

I don't know, I'm replying here with my priors [EA · GW] from software development.



Do something that is 

  1. Mostly useful (software/ML/math/whatever are all great and there are others too, feel free to ask)
  2. Where you have a good fit, so you'll enjoy and be curious about your work, and not burn out from frustration or because someone told you "you must take this specific job"
  3. Get mentorship so that you'll learn quickly

And this will almost certainly be useful somehow.


Main things my prior is based on:

EA in general and AI Alignment specifically need lots of different "professions". We probably don't want everyone picking the number one profession and nobody doing anything else. We probably want each person doing whatever they're a good fit for.

The amount we "need" is going up over time, not down, and I can imagine it going up much more, but can't really imagine it going down (so in other words, I mostly assume whatever we need today, which is quite a lot, will also be needed in a few years. So there will be lots of good options to pick)

comment by Charbel-Raphaël (charbel-raphael-segerie) · 2022-06-14T11:26:09.581Z · LW(p) · GW(p)

Hi Tiuto,

consider skilling up in ML to become an ML engineer or ML researcher.  If it's still possible, try to join the best engineering school in your region, then join your local EA group and start community building to nudge your smart friends towards AI safety. ML engineering does not require a genius-level IQ. 

I'm an ML engineer myself; you can DM me with further questions. I'm far from being a genius, and I've never been the best in my class, but I'm currently able to contribute meaningfully.

Replies from: timothy-currie
comment by Tiuto (timothy-currie) · 2022-06-18T08:36:43.754Z · LW(p) · GW(p)

Hi, thanks for the advice.

Do you, or other people, know why your comment is getting downvoted? Right now it's at -5, so I have to assume the general LW audience disagrees with your advice. Presumably people think it is really hard to become an ML researcher? Or do they think we already have enough people in ML, so we don't need more?

comment by shminux · 2022-06-11T23:51:58.402Z · LW(p) · GW(p)

Doesn't AGI doom + Copernican principle run into the AGI Fermi paradox? If we are not special, superintelligent AGI would have been created/evolved somewhere already and we would either not exist or at least see the observational artifacts of it through our various telescopes.

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-12T10:56:14.971Z · LW(p) · GW(p)

I don't know much about it, but you might want to look into the "grabby aliens" model. I'm not sure how they arrive at this conclusion, but the belief is: "Even if you have aliens moving outwards at near the speed of light, it will still take millions and millions of years on average for them to reach us, so the odds of them reaching us soon are super small."

comment by Arcayer · 2022-06-10T04:04:01.570Z · LW(p) · GW(p)

A lot of predictions about AI psychology are premised on the AI being some form of deep learning algorithm. From what I can see, deep learning requires geometric computing power for linear gains in intelligence, and thus (practically speaking) cannot scale to sentience.

For a more expert/in depth take look at:

Why do people think deep learning algorithms can scale to sentience without unreasonable amounts of computational power?

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-10T21:22:27.573Z · LW(p) · GW(p)
  1. An AGI can be dangerous even if it isn't sentient
  2. If an AI can do most things a human can do (which is achievable using neurons apparently because that's what we're made of), and if that AI can run x10,000 as fast (or if it's better in some interesting way, which computers sometimes are compared to humans), then it can be dangerous

Does this answer your question? Feel free to follow up

Replies from: Arcayer
comment by Arcayer · 2022-06-10T22:53:00.637Z · LW(p) · GW(p)

1: This doesn't sound like what I'm hearing people say? Using the word sentience might have been a mistake. Is it reasonable to expect that the first AI to foom will be no more intelligent than say, a squirrel?

2a: Should we be convinced that neurons are basically doing deep learning? I didn't think we understood neurons to that degree?

2b: What is meant by [most things a human can do]? This sounds to me like an empty statement. Most things a human can do are completely pointless flailing actions. Do we mean, most jobs in modern America? Do we expect roombas to foom? Self driving cars? Or like, most jobs in modern America still sounds like a really low standard, requiring very little intelligence?

My expected answer was somewhere along the lines of "We can achieve better results than that because of something something." or "We can provide much better computers in the near future, so this doesn't matter."

What I'm hearing here is "Intelligence is unnecessary for AI to be (existentially) dangerous." This is surprising, and I expect, wrong (in the sense of not being what's being said/what the other side believes.) (though also in the sense of not being true, but that's neither here nor there.)

Replies from: yonatan-cale-1, sphinxfire, None
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-14T07:52:15.530Z · LW(p) · GW(p)
  1. The relevant thing in [sentient / smart / whatever] is "the ability to achieve complex goals"
  2. (a) Are you asking if an AI can ever be as "smart" [good at achieving complex goals] as a human? (b) The dangerous part of the AGI being "smart" is things like "able to manipulate humans" and "able to build an even better AGI"

Does this answer your questions? Feel free to follow up

Replies from: Arcayer
comment by Arcayer · 2022-06-14T14:02:54.020Z · LW(p) · GW(p)

2: No.

If an AI can do most things a human can do (which is achievable using neurons apparently because that's what we're made of)

Implies that humans are deep learning algorithms. This assertion is surprising, so I asked for confirmation that that's what's being said, and if so, on what basis.

3: I'm not asking what makes intelligent AI dangerous. I'm asking why people expect deep learning specifically to become (far more) intelligent (than it is now). Specifically within that question: adding parameters to your model vastly increases use of memory. If I understand the situation correctly, if GPT just keeps increasing the number of parameters, GPT five or six or so will require more memory than exists on the planet, and assuming someone built it anyway, I still expect it to be unable to wash dishes. Even assuming you have the memory, running the training would take longer than human history on modern hardware. Even assuming deep learning "works" in the mathematical sense, that doesn't make it a viable path to high levels of intelligence in the near future.

Given doom in thirty years, or given that researching deep learning is dangerous, it should be the case that this problem: never existed to begin with and I'm misunderstanding something / is easily bypassed by some cute trick / we're going to need a lot better hardware in the near future.
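
The memory part of this claim can be sanity-checked with back-of-envelope arithmetic. This is a sketch under loudly hypothetical assumptions, not a prediction: fp16 weights at 2 bytes per parameter, and a 100x parameter jump per "GPT generation" starting from GPT-3's 175B parameters.

```python
# Back-of-envelope weight-memory estimate. Assumptions (hypothetical):
# 2 bytes per parameter (fp16) and a 100x jump per generation from GPT-3.

params = 175e9          # GPT-3's parameter count
bytes_per_param = 2     # fp16

for gen in [3, 4, 5, 6]:
    gigabytes = params * bytes_per_param / 1e9
    print(f"hypothetical GPT-{gen}: {gigabytes:,.0f} GB of weights alone")
    params *= 100       # assumed growth rate, not a prediction
```

Under these assumptions the weights alone grow from hundreds of gigabytes to hundreds of petabytes within a few generations; whether that actually outruns available hardware depends entirely on the assumed growth rate, which is the contested part of the claim.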

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-14T19:53:09.022Z · LW(p) · GW(p)

2. I don't think humans are deep learning algorithms. I think human brains are made of neurons, which seems like something I could simulate in a computer, but not with just deep learning.

3. I don't expect just-deep-learning to become an AGI. Perhaps [in my opinion: probably] parts of the AGI will be written using deep-learning though, it does seem pretty good at some things. [I don't actually know, I can think out loud with you].

comment by Sphinxfire (sphinxfire) · 2022-06-11T21:57:56.078Z · LW(p) · GW(p)

Is it reasonable to expect that the first AI to foom will be no more intelligent than say, a squirrel?

In a sense, yeah, the algorithm is similar to a squirrel that feels a compulsion to bury nuts. The difference is that in an instrumental sense it can navigate the world much more effectively to follow its imperatives. 

Think about intelligence in terms of the ability to map and navigate complex environments to achieve pre-determined goals. You tell DALL-E 2 to generate a picture for you, and it navigates a complex space of abstractions to give you a result that corresponds to what you're asking it to do (because a lot of people worked very hard on aligning it). If you're dealing with a more general-purpose algorithm that has access to the real world, it would be able to chain together outputs from different conceptual areas to produce results: order ingredients for a cake from the supermarket, use a remote-controlled module to prepare it, and sing you a birthday song it came up with all by itself! This behaviour would be a reflection of the input in the distorted light of the algorithm, however well aligned it may or may not be, with no intermediary layer of reflection on why you want a birthday cake, no decision about whether baking it is the right thing to do, and no judgment about which steps for getting from A to B are appropriate and which aren't.

You're looking at something that's potentially very good at getting complicated results without being a subject in a philosophical sense and being able to reflect into its own value structure.

comment by [deleted] · 2022-06-11T15:15:15.634Z · LW(p) · GW(p)
comment by faul_sname · 2022-06-09T23:54:31.385Z · LW(p) · GW(p)

A significant fraction of the stuff I've read about AI safety has referred to AGIs "inspecting each others' source code/utility function". However, when I look at the most impressive (to me) results in ML research lately, everything seems to be based on doing a bunch of fairly simple operations on very large matrices.

I am confused, because I don't understand how it would be a sensible operation to view the "source code" in question when it's a few billion floating point numbers and a hundred lines of code that describe what sequence of simple addition/multiplication/comparison operations transform those inputs and those billions of floating point numbers into outputs. But a bunch of people much smarter and mathematically inclined than me do seem to think that it's important, and the idea of recursive self-improvement with stable values seems to imply that it must be possible, which leads me to wonder if

  1. There's some known transformation from "bag of tensors" to "readable and verifiable source code" I'm not aware of. You can do something like "from tensorflow import describe_model" or use some comparably well-known tools, similar to how there are decompilers for taking an executable and making it fairly readable. The models are too large and poorly labelled for a human to actually verify it does what they want, but a sufficiently smart machine would not have the problem.
  2. The expectation is that neural nets are not the architecture that a superhuman AGI would run on.
  3. The expectation is that, in the worlds we're not completely doomed no matter what we do, neural nets are not the architecture that a superhuman AGI would run on.
  4. The "source code" in question is understood to be the combination of the training data, the means of calculating the loss function, and the architecture (so the "source code" of a human would be the combined life experiences of that human plus the rules for how those life experiences influenced the development of their brain, rather than the specific pattern of neurons in their brain).
  5. Something else entirely.

I suspect the answer is "5: something else entirely", but I have no idea what the particulars of that "something else entirely" might look like and it feels like it's probably critical to understanding the discussion.

So I guess my question is "which, if any, of the above describes what is meant by inspecting source code in the context of AI safety".
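
To make the confusion concrete, here is a miniature illustration of the "bag of tensors" point (toy sizes, random weights standing in for trained ones): the human-readable part of a trained network really is just a few generic lines, with all the behaviour buried in opaque arrays of floats.

```python
import numpy as np

# Miniature version of "a few billion floating point numbers": the learned
# weights are just opaque arrays, here randomly generated for illustration.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 2))

def model(x):
    # The entire "source code" a human can read: two matrix multiplies
    # and a ReLU. Everything the network "knows" lives in W1 and W2.
    return np.maximum(x @ W1, 0.0) @ W2

out = model(np.ones(4))
print(out.shape)  # the behaviour is determined by the floats, not these lines
```

Inspecting these two lines of code tells you almost nothing about what the model does, which is exactly why "inspecting the source code" needs some broader reading, as in the reply below.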

Replies from: JBlack
comment by JBlack · 2022-06-10T02:31:06.803Z · LW(p) · GW(p)

I take "source code" as loosely meaning "everything that determines the behaviour of the AI, in a form intelligible to the examiner". This might include any literal source code, hardware details, and some sufficiently recent snapshot of runtime state. Literal source code is just an analogy that makes sense to humans reasoning about behaviour of programs where most of the future behaviour is governed by rules fixed in that code.

The details provided cannot include future input and so do not completely constrain future behaviour, but the examiner may be able to prove things about future behaviour under broad classes of future input, and may be able to identify future inputs that would be problematic.

The broad idea is that in principle, AGI might be legible in that kind of way to each other, while humans are definitely not legible in that way to each other.

comment by ryan_b · 2022-06-09T21:22:33.266Z · LW(p) · GW(p)

The ML sections touched on the subject of distributional shift a few times, which is that thing where the real world differs from the training environment in ways which wind up being important but weren't clear beforehand. I read that the way to tackle this is called adversarial training, and what it means is that you vary the training environment across all of its dimensions in order to make it robust.

Could we abuse distributional shift to reliably break misaligned things, by adding fake dimensions? I imagine something like this:

  • We want the optimizer to move from point A to point B on a regular x,y graph.
  • Instead of training it a bunch of times on just an x,y graph, we add a third, fake dimension.
  • We do this multiple times, so for example we have one x,y graph and add a z dimension; and one x,y graph where we add a color dimension.
  • When the training is complete, we do some magic that is the equivalent of multiplying these two, which would zero out the fake dimensions (is the trick used by DeepMind with Gato similar to multiplying functions?) and leave us with the original x,y dimensions.

I expect this would give us something less perfectly optimized than just focusing on the x,y graph, but any deceptive alignment would surely exploit the false dimension which goes away, and thus it would be broken/incoherent/ineffective.

So....could we give it enough false rope to hang itself?

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-10T21:29:23.160Z · LW(p) · GW(p)

Seems like two separate things (?)

  1. If we forget a dimension, like "AGI, please remember we don't like getting bored", then things go badly [LW · GW], even if we added another fake dimension which wasn't related to boredom.
  2. If we train the AI on data from our current world, then [almost?] certainly it will see new things when it runs for real.  As a toy (not realistic but I think correct) example: the AI will give everyone a personal airplane, and then it will have to deal with a world that has lots of airplanes.
comment by Darklight · 2022-06-09T19:47:29.602Z · LW(p) · GW(p)

I previously worked as a machine learning scientist but left the industry a couple of years ago to explore other career opportunities.  I'm wondering at this point whether or not to consider switching back into the field.  In particular, in case I cannot find work related to AI safety, would working on something related to AI capability be a net positive or net negative impact overall?

Replies from: yonatan-cale-1, None
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-10T21:36:33.556Z · LW(p) · GW(p)

Working on AI Capabilities: I think this is net negative, and I'm commenting here so people can [V] if they agree or [X] if they disagree.

Seems like habryka [LW · GW] agrees [LW(p) · GW(p)]? 

Seems like Kaj [LW · GW] disagrees [LW · GW]?

I think it wouldn't be controversial to advise you to at least talk to 80,000 Hours about this before you do it, as a safety net against doing something you don't mean to by mistake, assuming you trust them. Or perhaps ask someone else you trust. Or make your own gears-level model. Anyway, it seems like an important decision to me.

Replies from: Darklight
comment by Darklight · 2022-06-12T15:08:49.546Z · LW(p) · GW(p)

Okay, so I contacted 80,000 hours, as well as some EA friends for advice.  Still waiting for their replies.

I did hear from an EA who suggested that if I don't work on it, someone else who is less EA-aligned will take the position instead, so in fact it's slightly net positive for me to be in the industry, although I'm uncertain whether AI capability work is actually funding-constrained rather than personnel-constrained.

Also, would it be possible to mitigate the net negative by choosing to deliberately avoid capability research and just take an ML engineering job at a lower tier company that is unlikely to develop AGI before others and just work on applying existing ML tech to solving practical problems?

comment by [deleted] · 2022-06-11T15:12:31.816Z · LW(p) · GW(p)
comment by HiroSakuraba (hirosakuraba) · 2022-06-09T17:20:37.627Z · LW(p) · GW(p)

Is anyone at MIRI or Anthropic creating diagnostic tools for monitoring neural networks?  Something that could analyze for when a system has bit-flip errors versus errors of logic, and eventually evidence of deception.

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-12T10:59:02.167Z · LW(p) · GW(p)

Chris Olah is/was the main guy working on interpretability research, and he is a co-founder of Anthropic. So Anthropic would definitely be aware of this idea.

Replies from: charbel-raphael-segerie
comment by Charbel-Raphaël (charbel-raphael-segerie) · 2022-06-13T15:26:15.745Z · LW(p) · GW(p)

I've not seen the bit-flip idea before, and Anthropic is fairly alone on that front; they might have missed it.

comment by Tapatakt · 2022-06-09T15:52:02.283Z · LW(p) · GW(p)

What is the community's opinion on ideas based on brain-computer interfaces? Like "create big but non-agentic AI, connect human with it, use AI's compute/speed/pattern-matching with human's agency - wow, that's aligned (at least with this particular human) AGI!"

It seems to me (I haven't thought very much about it) that U(God-Emperor Elon Musk) >> U(paperclips). Am I wrong?

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2022-06-10T20:26:23.769Z · LW(p) · GW(p)

There's some discussion of this in section 3.4. of Responses to Catastrophic AGI Risk.

comment by Andy (andy-1) · 2022-06-09T05:28:25.940Z · LW(p) · GW(p)

So I've commented on this in other forums, but why can't we just bite the bullet on happiness-suffering min-maxing utilitarianism as the utility function?

The case for it is pretty straightforward: if we want a utility function that is continuous over the set of all time, then it must have a value for a single moment in time. At this moment in time, all colloquially deontological concepts like "humans", "legal contracts", etc. have no meaning (these imply an illusory continuity chaining together different moments in time). What IS atomic though, is the valence of individual moments of qualia, aka happiness/suffering - that's not just a higher-order emergence.

It's almost like the question of "how do we get all these deontological intuitions to come to a correct answer?" has the answer "you can't, because we should be using a more first principles function".

Then reward hacking is irrelevant if the reward is 1-1 with the fundamental ordering principle.

Some questions then:

"What about the utility monster?" - 

If it's about a lot of suffering to make a lot of happiness, you can constrain suffering separately.

If it's about one entity sucking up net happiness calculations, if you really want to, you can constrain this out with some caveat on distribution. But if you really look at the math carefully, the "utility monster" is only an issue if you want to arbitrarily partition utility by some unit like "person".

"How do we define happiness/suffering?" - from a first person standpoint, contemplative traditions have solved this independently across the world quite convincingly (suffering = delusion of fundamental subject/object split of experience). From a third person standpoint, we're making lots of progress in mapping neurological correlates. In either case, if the AI has qualia it'll be very obvious to it; if it doesn't, it's still a very solvable question. 

--> BTW, the "figure it out to the best of your ability within X time using Y resources" approach wouldn't be as dangerous here, because if the AI converts the solar system to computation units to figure it out, that's OK if it then min-maxes for the rest of time, once you've bitten the bullet.

"Others won't bite the bullet / I won't" - ok, then at the least we can make it a safety mechanism that gets rid of some edge cases like the high-suffering ones or even the extinction ones: "do your other utility function with constraint of suffering not exceeding and happiness not dipping below the values present in 2022, but without min-maxing on it".

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T10:57:23.296Z · LW(p) · GW(p)

How would you explain "qualia" or "positive utility" in Python?


Also, regarding definitions like "delusion of fundamental subject/object split of experience", 

  1. How do you explain that in Python?
  2. See Value is Fragile [LW · GW] (TL;DR: if you forget even a tiny thing in your definition of happiness/suffering, the result could be extremely, extremely bad)
comment by Ludwig (Ludwig.W) · 2022-06-08T22:22:12.355Z · LW(p) · GW(p)

Why should we throw immense resources at AGI x-risk when the world faces enormous issues with narrow AI right now (e.g. destabilised democracy, a mental health crisis, worsening inequality)?

Is it simply a matter of how imminent you think AGI is? Surely the opportunity cost is enormous, given the money and brainpower we are spending on AGI (something many don't even think is possible) versus something that is happening right now.

Replies from: Jay Bailey, Perhaps
comment by Jay Bailey · 2022-06-08T22:40:21.756Z · LW(p) · GW(p)

The standard answer here is that all humans dying is much, much worse than anything happening with narrow AI. Not to say those problems are trivial, but humanity's extinction is an entirely different level of bad, so that's what we should be focusing on. This is even more true if you care about future generations, since human extinction is not just 7 billion dead, but the loss of all generations who could have come after.

I personally believe this argument holds even if we ascribe a relatively low probability to AGI in the relatively near future. E.g, if you think there's a 10% chance of AGI in the next 10-20 years, it still seems reasonable to prioritise AGI safety now. If you think AGI isn't possible at all, naturally we don't need to worry about AI safety. But I find that pretty unconvincing - humanity has made a lot of progress very quickly in the field of AI capabilities, and it shows no signs of slowing down, and there's no reason why such a machine could not exist in principle.

Replies from: Ludwig.W
comment by Ludwig (Ludwig.W) · 2022-06-10T15:00:06.474Z · LW(p) · GW(p)

I understand and appreciate your discussion.  I wonder if perhaps we could consider is that it may be more morally imperative to work on AI safety for the hugely impactful problems AI is contributing right now, if we assume that in finding solutions to these current and near-term AI problems we would also be lowering the risk of AGI X-risk (albiet indirectly). 

Given that the likelyhood for narrow AI risk being 1 and the likelyhood of AGI in the next 10 years being (as in your example) <0.1 -  It seems obvious we should focus on addressing the former as not only will it reduce suffering that we know with certainity is already happening but also suffer that will certainly continue to happen, in addition it will also indirectly reduce X-risk. If we combine this observation with the opportunity cost in not solving other even more solvable issues (disease, education etc). It seems even less appealing to pour millions of dollars and the careers of the smartest people in specifically AGI X-Risk. 

A final point, is that it would seem the worst issues caused by current and near-term AI risks are that it is degrading the coordination structures of western democracies. (Disinformation, polarisation and so on). If, following Moloch, we understand coordination to be the most vital tool in humanity's adressing of problems we see that focusing on current AI safety issues will improve our ability of addressing every other area of human suffering. 

The opportunity costs of not focusing on improving coordination problems in Western countries seem to be equivalent to x-risk-level consequences, while the probability of the first is 1 and that of AGI is <1.

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-11T00:24:13.726Z · LW(p) · GW(p)

If you consider these coordination problems to be equivalent to x-risk level consequences, then it makes sense to work on aligning narrow AI. For instance, if you think there's a 10% chance of AGI x-risk this century, and current problems are 10% as bad as human extinction, the two have comparable expected costs. After all, working on aligning narrow AI is probably more tractable than working on aligning the hypothetical AGI systems of the future. You are also right that aligning narrow AI may help align AGI in the future - it is, at the very least, unlikely to hurt.

Personally, I don't think the current problems are anything close to "10% as bad as human extinction", but you may disagree with me on this. I'm not very knowledgeable about politics, which is the field I would look into to try and identify the harms caused by our current degradation of coordination, so I'm not going to try to convince you of anything in that field - more trying to provide a framework with which to look at potential problems.

So, basically I would look at it as - which is higher? The chance of human extinction from AGI times the consequences? Or the x-risk reduction from aligning narrow AI, plus the positive utility of solving our current problems today? I believe the former, so I think AGI is more important. If you believe the latter, aligning narrow AI is more important.

Replies from: Ludwig.W
comment by Ludwig (Ludwig.W) · 2022-06-11T22:44:38.957Z · LW(p) · GW(p)

Interesting - yes, I am interested in coordination problems. Let me follow this framework to make a better case. There are three considerations I would like to point out.

  1. The utility in addressing coordination problems is that they affect almost all x-risk scenarios (nuclear war, bioterror, pandemics, climate change, and AGI). Working on coordination problems reduces not only current suffering but also x-risk of both AGI and non-AGI kinds.
  2. The difference between a 10% chance of something happening that may be an x-risk in 100 years and something with a certainty of happening is not a factor of 10. They're not even comparable, because one is a certainty and the other a probability, and we only get one roll of the dice (allocation of resources). It seems that the rational choice would always be the certainty.
  3. More succinctly: with two buttons, one with a 100% chance of addressing x-risk and one with a 10% chance, which one would you press?
Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-12T00:55:08.475Z · LW(p) · GW(p)

I agree with you on the first point completely.

As for Point 2, you can absolutely compare a certainty and a probability. If I offered you a certainty of $10, or a 10% chance of $1,000,000, would you take the $10 because you can't compare certainties and probabilities, and I'm only ever going to offer you the deal once?

That then brings me to question 3. The button I would press would be the one that reduces total X-risk the most. If both buttons reduced X-risk by 1%, I would press the 100% one. If the 100% button reduced X-risk by 0.1%, and the 10% one reduced X-risk by 10%, I would pick the second one, for an expected value of 1% X-risk reduction. You have to take the effect size into account. We can disagree on what the effect sizes are, but you still need to consider them.
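The arithmetic behind this comparison can be sketched in a few lines of Python (the numbers and the `expected_risk_reduction` name are taken from, or invented for, the example above - this is an illustration, not anyone's actual model):

```python
def expected_risk_reduction(p_success: float, reduction: float) -> float:
    """Expected absolute x-risk reduction from pressing a button."""
    return p_success * reduction

# Button A: works with certainty, but only shaves 0.1% off x-risk.
button_a = expected_risk_reduction(1.0, 0.001)

# Button B: works 10% of the time, but shaves 10% off x-risk when it does.
button_b = expected_risk_reduction(0.1, 0.10)

# In expectation, the "riskier" button is ten times better (0.01 vs 0.001).
assert button_b > button_a
```

The point being that once effect sizes enter the picture, the certain option is not automatically the rational one.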

Replies from: Ludwig.W
comment by Ludwig (Ludwig.W) · 2022-06-12T09:52:16.356Z · LW(p) · GW(p)

Interesting, I see what you mean regarding probability, and it makes sense. I guess perhaps what is missing is that when it comes to questions of people's lives, we may have a stronger imperative to be risk-averse.

I completely agree with you about effect size. I guess what I would say is that, given my point 1 from earlier about the variety of x-risks coordination would contribute to solving, the effect size will always be greater. If we want to maximise utility, it's the best chance we have. The added bonuses are that it is comparatively tractable and immediate, avoiding the recent criticisms of longtermism while simultaneously being a longtermist solution.

Regardless, it does seem that coordination problems are underdiscussed in the community; I will try to make a decent main post once my academic commitments clear up a bit.

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-12T10:54:02.287Z · LW(p) · GW(p)

Being risk-averse around people's lives is only a good strategy when you're trading off against something else that isn't human lives. If you have the choice to save 400 lives with certainty, or a 90% chance to save 500 lives, choosing the former is essentially condemning 50 people to death. At that point, you're just behaving suboptimally.
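The same expected-value logic, using the numbers from this paragraph (a hypothetical sketch, not a formal model):

```python
# Expected lives saved under each policy.
certain_save = 1.0 * 400   # save 400 lives with certainty
gamble_save = 0.9 * 500    # 90% chance to save 500 lives

# Choosing the certain option forgoes 50 expected lives.
assert gamble_save - certain_save == 50.0
```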

Being risk-averse works if you're trading off other things. E.g, if you could release a new car now that you're almost certain is safe, you might be risk-averse and call for more tests. As long as people won't die from you delaying this car, being risk-averse is a reasonable strategy here.

Given your Point 1 from earlier, there is no reason to expect the effect size will always be greater. If the effect on reducing X-risks from co-ordination becomes small enough, or the risk of a particular X-risk becomes large enough, this changes the equation. If you believe, like many in this forum do, that AI represents the lion's share of X-risk, focusing on AI directly is probably more effective. If you believe that x-risk is diversified, that there's some chance from AI, some from pandemics, some from nuclear war, some from climate change, etc. then co-ordination makes more sense. Co-ordination has a small effect on all x-risks, direct work has a larger effect on a single x-risk.

The point I'm trying to make here is this. There are perfectly reasonable states of the world where "Improve co-ordination" is the best action to take to reduce x-risk. There are also perfectly reasonable states of the world where "Work directly on <Risk A>" is the best action to take to reduce x-risk. You won't be able to find out which is which if you believe one is "always" the case.

What I would suggest is to ask "What would cause me to change my mind and believe improving co-ordination is NOT the best way to work on x-risk", and then seek out whether those things are true or not. If you don't believe they are, great, that's fine.

That said, it wouldn't be fair to ask you what would change your mind without presenting my own. On my end, what would convince me that improving co-ordination is more important than direct AI work:

  • AI is less dangerous than I expect, so that the x-risk profile is more diversified instead of mostly AI.
  • We already have more technical progress in AI safety than I believe we have, so we don't need more and should focus on either co-ordination or the next most dangerous x-risk. (Which I believe is pandemics, which is both less dangerous than AI and more responsive to government co-ordination in my opinion)
  • AI is far more dangerous than I expect, to the point where the AI alignment problem is unsolvable and co-ordination is the only solution.
comment by Perhaps · 2022-06-09T17:02:51.485Z · LW(p) · GW(p)

In addition to what Jay Bailey said, the benefits of an aligned AGI are incredibly high, and if we successfully solved the alignment problem we could easily solve pretty much any other problem in the world (assuming you believe the "intelligence and nanotech can solve anything" argument). The danger of AGI is high, but the payoff is also very large.

comment by samshap · 2022-06-08T02:59:06.151Z · LW(p) · GW(p)

If the world's governments decided tomorrow that RL was top-secret military technology (similar to nuclear weapons tech, for example), how much time would that buy us, if any? (Feel free to pick a different gateway technology for AGI, RL just seems like the most salient descriptor).

Replies from: Aidan O'Gara, ete
comment by aogara (Aidan O'Gara) · 2022-06-08T23:31:49.609Z · LW(p) · GW(p)

Interesting question. As far as what government could do to slow down progress towards AGI, I'd also include access to high-end compute. Lots of RL is knowledge that's passed through papers or equations, and it can be hard to contain that kind of stuff. But shutting down physical compute servers seems easier. 

comment by plex (ete) · 2022-06-08T15:03:37.481Z · LW(p) · GW(p)

Depends whether they considered it a national security issue to win the arms race, and if they did how able they would be to absorb and keep the research teams working effectively.

comment by Noosphere89 (sharmake-farah) · 2022-06-07T13:51:36.643Z · LW(p) · GW(p)

I will ask this question: is the Singularity/huge-discontinuity scenario likely to happen? I see this as a meta-assumption behind all the doom scenarios, so we need to know whether the Singularity can happen and will happen.

Replies from: RavenclawPrefect, Aidan O'Gara, ete
comment by Drake Thomas (RavenclawPrefect) · 2022-06-07T17:45:53.702Z · LW(p) · GW(p)

Paul Christiano provided a picture of non-Singularity doom in What Failure Looks Like [LW · GW]. In general there is a pretty wide range of opinions on questions about this sort of thing - the AI-Foom debate between Eliezer Yudkowsky and Robin Hanson is a famous example, though an old one.

"Takeoff speed" is a common term used to refer to questions about the rate of change in AI capabilities at the human and superhuman level of general intelligence - searching LessWrong or the Alignment Forum for that phrase will turn up a lot of discussion about these questions, though I don't know of the best introduction offhand (hopefully someone else here has suggestions?).

comment by aogara (Aidan O'Gara) · 2022-06-08T23:29:56.250Z · LW(p) · GW(p)

It's definitely a common belief on this site. I don't think it's likely, I've written up some arguments here [EA(p) · GW(p)]. 

comment by plex (ete) · 2022-06-08T14:51:09.107Z · LW(p) · GW(p)

Recursive self-improvement, or some other flavor of PASTA, seems essentially inevitable conditional on not hitting hard physical limits and civilization not being severely disrupted. There are Paul/EY debates about how discontinuous the capabilities jump will be, but the core idea of systems automating their own development and this leading to an accelerating feedback loop, or intelligence explosion, is conceptually solid.

There are still AI risks without the intelligence explosion, but it is a key part of the fears of the people who think we're very doomed, as it causes the dynamic of getting only one shot at the real deal since the first system to go 'critical' will end up extremely capable.

Replies from: ete
comment by plex (ete) · 2022-06-08T15:02:05.714Z · LW(p) · GW(p)

(oh, looks like I already wrote this on Stampy! That version might be better, feel free to improve the wiki.)

comment by Ericf · 2022-06-07T12:54:36.292Z · LW(p) · GW(p)

Incorporating my previous post by reference: [LW · GW]

comment by niplav · 2022-06-07T09:44:34.932Z · LW(p) · GW(p)

Hm, someone downvoted michael_mjd's [LW(p) · GW(p)] and my [LW(p) · GW(p)] comment.

Normally I wouldn't bring this up, but this thread is supposed to be a good space for dumb questions (although tbf the text of the question didn't specify anything about downvotes), and neither michael's nor my question looked that bad or harmful (maybe pattern-matched to a type of dumb uninformed question that is especially annoying).

Maybe an explanation of the downvotes would be helpful here?

Replies from: alenglander
comment by Aryeh Englander (alenglander) · 2022-06-07T10:12:50.058Z · LW(p) · GW(p)

I forgot about downvotes. I'm going to add this in to the guidelines.

Replies from: charbel-raphael-segerie
comment by Charbel-Raphaël (charbel-raphael-segerie) · 2022-06-09T17:32:12.578Z · LW(p) · GW(p)

Here we are: a concrete example of alignment failure.

Replies from: alenglander
comment by Aryeh Englander (alenglander) · 2022-06-11T15:40:25.897Z · LW(p) · GW(p)

We have a points system in our family to incentivize the kids to do their chores. But we have to regularly update the rules because it turns out that there are ways to optimize for the points that we didn't anticipate and that don't really reflect what we actually want the kids to be incentivized to do. Every time this happens I think - ha, alignment failure!

comment by Eugene D (eugene-d) · 2022-06-20T12:39:45.235Z · LW(p) · GW(p)

When AI experts call upon others to ponder, as EY just did, "[an AGI] meant to carry out some single task" (emphasis mine), how do they categorize all the other important considerations besides this single task?  

Or, asked another way, where do priorities come into play, relative to the "single" goal?  e.g. a human goes to get milk from the fridge in the other room, and there are plentiful considerations to weigh in parallel to accomplishing this one goal -- some of which should immediately derail the task due to priority (I notice the power is out, i stub my toe, someone specific calls for me with a sense of urgency from a different room, etc, etc). 

And does this relate at all to our understanding of how to make AGI corrigible? 

many thanks,

Eugene [LW · GW]

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-19T15:14:45.961Z · LW(p) · GW(p)

Anonymous question (ask here) :

Given all the computation it would be carrying out, wouldn't an AGI be extremely resource-intensive? Something relatively simple like bitcoin mining (simple when compared to the sort of intellectual/engineering feats that AGIs are supposed to be capable of) famously uses up more energy than some industrialized nations.

Replies from: rachelAF, yonatan-cale-1
comment by rachelAF · 2022-06-24T15:33:50.483Z · LW(p) · GW(p)

Short answer: Yep, probably.

Medium answer: If AGI has components that look like our most capable modern deep learning models (which I think is quite likely if it arrives in the next decade or two), it will probably be very resource-intensive to run, and orders of magnitude more expensive to train. This is relevant because it impacts who has the resources to develop AGI (large companies and governments; likely not individual actors), secrecy (it’s more difficult to secretly acquire a massive amount of compute than it is to secretly boot up an AGI on your laptop; this may even enable monitoring and regulation), and development speed (if iterations are slower and more expensive, it slows down development).

If you’re interested in further discussion of possible compute costs for AGI (and how this affects timelines), I recommend reading about bio anchors.

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-21T17:38:57.151Z · LW(p) · GW(p)
  1. (I'm not sure but why would this be important? Sorry for the silly answer, feel free to reply in the anonymous form again)
  2. I think a good baseline for comparison would be 
    1. Training large ML models (expensive)
    2. Running trained ML models (much cheaper)
  3. I think comparing to blockchain is wrong, because 
    1. it was explicitly designed to be resource intensive on purpose (this adds to the security of proof-of-work blockchains)
    2. there is a financial incentive to use a specific (very high) amount of resources on blockchain mining (because what you get is literally a currency, and this currency has a certain value, so it's worthwhile to spend any money lower than that value on the mining process)
    3. None of these are true for ML/AI, where your incentive is more something like "do useful things"
comment by Eugene D (eugene-d) · 2022-06-17T12:21:56.559Z · LW(p) · GW(p)

Why do we suppose it is even logical that control / alignment of a superior entity would be possible?  

(I'm told that "we're not trying to outsmart AGI, because, yes, by definition that would be impossible", and I understand that we are the ones who create it - so, I'm told, we have the upper hand because of this: somehow, in building it, we gain the key advantage we need for corrigibility.)

What am I missing in viewing a superior entity as something you can't simply "use"? Does it depend on the fact that the AGI is not meant to have a will like humans do, and therefore we wouldn't be imposing upon it? But doesn't that go out the window the moment we provide some goal for it to perform for us?

thanks much! 

Replies from: aleksi-liimatainen
comment by Aleksi Liimatainen (aleksi-liimatainen) · 2022-06-17T12:38:01.718Z · LW(p) · GW(p)

One has the motivations one has, and one would be inclined to defend them if someone tried to rewire the motivations against one's will. If one happened to have different motivations, then one would be inclined to defend those instead.

The idea is that once a superintelligence gets going, its motivations will be out of our reach. Therefore, the only window of influence is before it gets going. If, at the point of no return, it happens to have the right kinds of motivations, we survive. If not, it's game over.

Replies from: eugene-d
comment by Eugene D (eugene-d) · 2022-06-17T12:42:55.051Z · LW(p) · GW(p)

Thank you. Makes some sense... but does "rewriting its own code" (the very code we thought would perhaps permanently influence it before it got going) nullify our efforts at hardcoding our intentions?

Replies from: Kaj_Sotala, aleksi-liimatainen
comment by Kaj_Sotala · 2022-06-17T13:42:01.709Z · LW(p) · GW(p)

I'm not a psychopath, and if I got the opportunity to rewrite my own source code to become a psychopath, I wouldn't do it.

At the same time, it's the evolutionary and cultural programming in my source code that contains the desire not to become a psychopath.

In other words, once the desire to not become a psychopath is there in my source code, I will do my best not to become one, even if I have the ability to modify my source code.

Replies from: eugene-d
comment by Eugene D (eugene-d) · 2022-06-17T14:33:52.499Z · LW(p) · GW(p)

That makes sense. My intention was not to argue from the position of it becoming a psychopath, though (my apologies if it came out that way), but instead from the perspective of an entity which starts out as supposedly aligned (centered on human safety, let's say), but then, because it's orders of magnitude smarter than we are (by definition), quickly develops a different perspective. But you're saying it will remain 'aligned' in some vitally important way, even when it discovers ways the code could have been written differently?

comment by Aleksi Liimatainen (aleksi-liimatainen) · 2022-06-17T13:19:59.620Z · LW(p) · GW(p)

The AI would be expected to care about preserving its motivations under self-modification for similar reasons as it would care about defending them against outside intervention. There could be a window where the AI operates outside immediate human control but isn't yet good at keeping its goals stable under self-modification. It's been mentioned as a concern in the past; I don't know what the state of current thinking is.

comment by Kerrigan · 2022-06-17T06:09:42.619Z · LW(p) · GW(p)

How would AGI alignment research change if the hard problem of consciousness were solved?

Replies from: rachelAF
comment by rachelAF · 2022-06-24T15:42:21.287Z · LW(p) · GW(p)

Consciousness, intelligence and human-value-alignment are probably mostly orthogonal, so I don’t think that solving the hard problem of consciousness would directly impact AGI alignment research. (Perhaps consciousness requires general intelligence, so understanding how consciousness works on a mechanistic level might dramatically accelerate timelines? But that’s highly speculative.)

However, if solving the hard problem of consciousness leads us to realize that some of our AI systems are conscious, then we have a whole new set of moral patients. (As an AGI researcher) I personally would become much more concerned with machine ethics in that case, and I suspect others would as well.

comment by Chad Nauseam (andre-popovitch) · 2022-06-16T15:44:05.543Z · LW(p) · GW(p)

What's the problem with oracle AIs? It seems like if you had a safe oracle AI that gave human-aligned answers to questions, you could then ask "how do I make an aligned AGI?" and just do whatever it says. So it seems like the problem of "how to make an aligned agentic AGI" is no harder than "how to make an aligned oracle AI", which I understand to still be extremely hard, but surely it's easier than making an aligned agentic AGI from scratch?

Replies from: mruwnik
comment by mruwnik · 2022-06-19T19:17:59.543Z · LW(p) · GW(p)

My understanding is that while an oracle doesn't directly control the nukes, it provides info to the people who control the nukes, which is pretty much just moving the problem one layer deeper. While it can't directly change the physical state of the world, it can manipulate people to achieve pretty much the same thing.

Check this tag for more specifics:

comment by bokov (bokov-1) · 2022-06-16T15:33:47.958Z · LW(p) · GW(p)

Are there any specific examples of anybody working on AI tools that autonomously look for new domains to optimize over?

  • If no, then doesn't the path to doom still amount to a human choosing to apply their software to some new and unexpectedly lethal domain or giving the software real-world capabilities with unexpected lethal consequences? So then, shouldn't that be a priority for AI safety efforts?
  • If yes, then maybe we should have a conversation about which of these projects is most likely to bootstrap itself, and the likely paths it will take?
comment by spacyfilmer · 2022-06-16T09:11:38.482Z · LW(p) · GW(p)

One alignment idea I have had that I haven't seen proposed/refuted is to have an AI which tries to compromise by satisfying over a range of interpretations of a vague goal, instead of trying to get an AI to fulfill a specific goal.  This sounds dangerous and unaligned, and it indeed would not produce an optimal, CEV-fulfilling scenario, but seems to me like it may create scenarios in which at least some people are alive and are maybe even living in somewhat utopic conditions.  I explain why below.

In many AI doom scenarios the AI intentionally picks an unaligned goal because it fits literally with what it is being asked to do, while being easier to execute than what it was actually being asked to do - for instance, tiling the universe with smiley faces instead of creating a universe which leads people to smile. GPT-3 reads text and predicts what is most likely to come next. It can also be made to alter its predictions with noise or by taking on the style of a particular kind of writer. Combining these ideas, one could create an AI interpreter which works like GPT-3 instead of a more literal one. You then feed a human-generated task to the interpreter (better than "make more smiles", but vague in the way all language statements are, as opposed to a pure algorithm of goodness), which can direct itself to fulfill the task in a way that makes sense to it, perhaps while being asked to read things as a sensitive and intelligent reader would.

To further ensure that you get a good outcome, you can ask the AI to produce a range of interpretations of the task and then fulfill all of these interpretations (or some subset if it can produce infinite interpretations) in proportion to how likely it thinks each interpretation is.  In essence devoting 27% of the universe to interpretation A, 19% to interpretation B, etc.  This way even if its main interpretation is an undesirable one, some interpretations will be good.  Importantly the AI should be tasked to devote the resources without judgement to how successful it thinks it will be, only in terms of how much it prefers each interpretation.  Discriminating too much on chance of success will just make it devote all of its resources to an easy, unaligned interpretation.  To be even stronger the AI should match its likelihood of interpretation with the cost of maintaining each interpretation, rather than the cost of getting to it in the first place.  If this works, even if substantial parts of the universe are devoted to bad, unaligned interpretations, some proportion should be devoted to more aligned interpretations.
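The allocation scheme described above - devoting resources to interpretations in proportion to their likelihood, deliberately ignoring how easy each one is to satisfy - can be sketched in a few lines of Python (the percentages are the ones from the example; the function name and structure are my own invention):

```python
def allocate(budget: float, likelihoods: dict) -> dict:
    """Split a resource budget across interpretations in proportion to
    their estimated likelihood, with no weighting by ease of success."""
    total = sum(likelihoods.values())
    return {name: budget * p / total for name, p in likelihoods.items()}

# Likelihoods assigned to three interpretations of the vague goal,
# as in the 27% / 19% example above.
interpretations = {"A": 0.27, "B": 0.19, "C": 0.54}

shares = allocate(100.0, interpretations)  # interpretation A gets ~27% of resources
```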

One problem with this solution is that it increases the chance of S-Risk as compared to AI which doesn't remotely do what we want.  Even in a successful case it seems likely that an AI might dedicate some portion of its interpretation to a hellish scenario along with more neutral or utopian scenarios.  Another issue is that I'm just not sure how the AI interpreter translates its interpretations into more strict instructions for the parts of itself that executes the task.  Maybe this is just as fraught as human attempts to do so?

Have such ideas been proposed elsewhere that I'm not aware of?  Have critiques been made?  If not, does reading this inspire any critiques from people in the know?

Replies from: rachelAF
comment by rachelAF · 2022-06-24T15:54:46.393Z · LW(p) · GW(p)

In reward learning research, it’s common to represent the AI’s estimate of the true reward function as a distribution over possible reward functions, which I think is analogous to what you are describing. It’s also common to define optimal behavior, given a distribution over reward functions, as that behavior which maximizes the expected reward under that distribution. This is mathematically equivalent to optimizing a single reward function equal to the expectation of the distribution. So, this helps in that the AI is optimizing a reward function that is more likely to be “aligned” than one at an extreme end of the distribution. However, this doesn’t help with the problems of optimizing a single fixed reward function.
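The equivalence mentioned above - that maximizing expected reward under a distribution over reward functions is the same as maximizing the single averaged reward function - can be checked numerically (a toy sketch with made-up rewards and posterior probabilities, not any particular reward-learning method):

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, n_hypotheses = 4, 3
# Each column is one hypothesized reward function over the actions.
rewards = rng.normal(size=(n_actions, n_hypotheses))
probs = np.array([0.5, 0.3, 0.2])  # posterior over reward hypotheses

# Expected reward of each action under the distribution: E_R[R(a)].
expected_reward_per_action = rewards @ probs
# The single "mean" reward function, i.e. the expectation of the distribution.
mean_reward_fn = (rewards * probs).sum(axis=1)

# The two criteria are the same function, so they pick the same action.
best_under_expectation = int(np.argmax(expected_reward_per_action))
best_under_mean_fn = int(np.argmax(mean_reward_fn))
assert best_under_expectation == best_under_mean_fn
```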

comment by metcomp (sigmund-porell) · 2022-06-16T02:30:40.845Z · LW(p) · GW(p)

Why should we assume that vastly increased intelligence results in vastly increased power?

A common argument I see for intelligence being powerful stems from two types of examples:

  1. Humans are vastly more powerful than animals because humans are more intelligent than animals. Thus, an AGI vastly more intelligent than humans would also have similarly overwhelming power over humans.
  2. X famous person caused Y massive changes in society because of their superior intelligence. Thus, an AGI with even more intelligence would be able to effect even larger changes.

However, I could easily imagine the following counterarguments being true:

  1. Human advantages over animals stem not from increased intelligence, but from being just intelligent enough to develop complex language. Complex language allows billions of individual humans across space and time to combine their ideas and knowledge. This accumulated knowledge is the true source of human power. I am not more powerful than a chimpanzee because I am smarter, but because I have access to technology like guns and metal cages. And no one invented guns and metal cages from first principles with pure intellect—both were created through many people's trial and error. It's possible that every step humans took to develop guns are possible with the intelligence of a chimpanzee, but chimps simply don't have the language capabilities to pass these developments on. An AGI, while more intelligent than humans, would not have the fundamental advantage of this language-versus-no-language distinction.
  2. Any human whose success is largely attributed to intelligence (say, Elon Musk) actually gained their success mostly from random luck. Looking at these people and assuming that their intelligence gave them power (when in fact, millions of people with similar intelligence but lower levels of success exist) is simply survivorship bias.
  3. If intelligence were a reliable method of achieving power or success in society, we would expect the vast majority of powerful people to also be highly intelligent. But many powerful people (politicians, celebrities, etc.) don't seem to be very intelligent, and make obviously poor decisions all the time.

Couldn't there be a level of intelligence after which any additional gains in intelligence yield diminishing gains in decision-making ability? For instance, a lack of sufficient information could make the outcome of a decision impossible for any level of intelligence to predict, so the AGI's vastly greater intelligence over a human would merely result in it choosing an action with a 48% chance of success instead of a 45% chance. (I have a suspicion that most actually important decisions work like this.)

Replies from: mruwnik
comment by mruwnik · 2022-06-19T19:41:42.242Z · LW(p) · GW(p)

It's possible that there is a ceiling to intelligence gains. It's also possible that there isn't. Looking at the available evidence, there doesn't seem to be one - a single ant is a lot less intelligent than a lobster, which is less intelligent than a snake, etc. While it would be nice (in a way) if there were a ceiling, it seems more prudent to assume that there isn't, and prepare for the worst. Especially as by "superintelligent" you shouldn't think of double, or even triple, Einstein; rather you should think of a whole other dimension of intelligence, like the difference between you and a hamster.

As to your specific counterarguments:

  1. It's both, really. But yes - complex language allows humans to keep and build upon previous knowledge. Humans' advantage is in the gigantic amounts of know-how that can be passed on to future generations. Which is something that computers are eminently good at - you can keep a local copy of Wikipedia in 20GB.
  2. Good point. But it's not just luck. Luck plays a large part, but so do resources (in a very general sense). If you have the basic required talent and a couple of billion dollars, I'm pretty sure you could become a Hollywood star quite quickly. The point is that a superintelligence won't have a similar level of intelligence to anyone else around, which will allow it to run circles around everyone. Like if an Einstein-level intelligence decided to learn to play chess and started playing against 5-year-olds - they might win the first couple of games, but after a while you'd probably notice a trend...
  3. Intelligence is an advantage, and quite a big one, generally speaking. But most people in society are at roughly the same level if you compare them to, e.g., Gila monsters (because we're talking about superintelligence), so it shouldn't be all that surprising that other resources are very important. While many powerful people don't seem to be intelligent, they tend to be either cunning (which is a different kind of intelligence) or have deep pockets (not just money) which offset their relative lack of smarts. Also, while they might not be very clever, very few powerful people are actively stupid.
comment by CarlJ · 2022-06-14T22:42:35.842Z · LW(p) · GW(p)

20. (...) To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer).  If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.


So, I'm thinking this is a critique of some proposals to teach an AI ethics by having it be co-trained with humans. 

There seem to be many obvious solutions to the problem of there being lots of people who won't answer correctly to "Point out any squares of people behaving badly" or "Point out any squares of people acting against their self-interest" etc.:

- make the AI's model expect more random errors 
- after having noticed some responders as giving better answers, give their answers more weight
- limit the number of people that will co-train the AI

What's the problem with these ideas?

Replies from: rachelAF
comment by rachelAF · 2022-06-24T16:10:03.155Z · LW(p) · GW(p)

I work on AI safety via learning from human feedback. In response to your three ideas:

  • Uniformly random human noise actually isn’t much of a problem. It becomes a problem when the human noise is systematically biased in some way, and the AI doesn’t know exactly what that bias is. Another core problem (which overlaps with the human bias) is that the AI must use a model of human decision-making to back out human values from human feedback, behavior, interaction, etc. If this model is wrong, even slightly (for example, the AI doesn’t realize that the noise is biased along one axis), the AI can infer incorrect human values.

  • I’m working on it, stay tuned.

  • Our most capable AI systems require a LOT of training data, and it’s already expensive to generate enough human feedback for training. Limiting the pool of human teachers to trusted experts, or providing pre-training to all of the teachers, would make this even more expensive. One possible way out of this is to train AI systems themselves to give feedback, in imitation of a small trusted set of human teachers.
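The bias point above can be illustrated with a toy sketch (all numbers invented): an AI that infers a value by averaging ratings, assuming zero-mean noise, converges to the truth; but a systematic bias it doesn't model shifts the inferred value by exactly that bias, no matter how much feedback it collects.

```python
import random

random.seed(0)

TRUE_VALUE = 1.0  # the "real" value of some option (made up)

def human_rating(bias):
    # Rating = true value + zero-mean noise + a systematic bias
    # that the AI's model of the human doesn't know about.
    return TRUE_VALUE + random.gauss(0, 0.5) + bias

def inferred_value(bias, n=100_000):
    # The AI's (possibly mis-specified) human model: assume the noise
    # averages out, so just take the mean of many ratings.
    return sum(human_rating(bias) for _ in range(n)) / n

print(round(inferred_value(bias=0.0), 2))  # close to 1.0: noise washes out
print(round(inferred_value(bias=0.3), 2))  # close to 1.3: the bias never washes out
```

More data shrinks the random error but leaves the systematic error untouched, which is why uniform noise is benign and unmodeled bias is not.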

comment by Charbel-Raphaël (charbel-raphael-segerie) · 2022-06-14T00:02:22.207Z · LW(p) · GW(p)

Why won't this alignment idea work?

Researchers have already succeeded in creating face detection systems from scratch, by coding the features one by one, by hand. The algorithm they coded was not perfect, but was sufficient to be used industrially in digital cameras of the last decade.

The brain's face recognition algorithm is not perfect either. It has a tendency to create false positives, which explains a good part of reported paranormal phenomena. The other hard-coded networks of the brain seem to rely on the same kind of heuristics, hard-coded by evolution, and imperfect.

However, it turns out that humans, despite these imperfect evolutionary heuristics, are generally cooperative and friendly.

This suggests that the seed of alignment can be roughly coded and yet work.

1. Can't we replicate the kind of research effort of hand-crafting human detectors, and hand-crafting "friendly" behaviour?
2. Nowadays, this quest would be facilitated by deep learning: no need to hand-craft a baby detector; just train a neural network that recognizes babies and, past a certain threshold, triggers a reaction that releases the hormones of tenderness. There is no need to code the detector, just to train it. Only the reaction corresponding to the tenderness hormone must then be coded.
3. By this process, there will be gaping holes, which will have to be covered one by one. But this is certainly what happened during evolution.

The problems are:
- We are not allowed to iterate with a strong AI
- We are not sure that this would extrapolate well to higher levels of capability


But if we were to work on it today, it would only have a sub-human level, and we could iterate like on a child. And even if we had the complete code of the brain stem, and we had “Reverse-engineered human social instincts” as Steven Byrnes proposes here [LW · GW], it seems to me that we would still have to do all this.

What do you think?

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-14T20:12:07.800Z · LW(p) · GW(p)

You suggested:

But if we were to work on it today, it would only have a sub-human level, and we could iterate like on a child

But as you yourself pointed out: "We are not sure that this would extrapolate well to higher levels of capability"


You suggested:

and we had “Reverse-engineered human social instincts”

As you said, "The brain's face recognition algorithm is not perfect either. It has a tendency to create false positives"

And so perhaps the AI would make human pictures that create false positives. Or, as you said, "We are not sure that this would extrapolate well to higher levels of capability"


The classic example is humans creating condoms, which is a very unfriendly thing to do to Evolution, even though it raised us like children, sort of


Adding: "Intro to Brain-Like-AGI Safety [? · GW]" (I didn't read it yet, seems interesting)

Replies from: charbel-raphael-segerie
comment by Charbel-Raphaël (charbel-raphael-segerie) · 2022-06-16T20:42:03.491Z · LW(p) · GW(p)

Ok. But don't you think "reverse engineering human instincts" is a necessary part of the solution?

My intuition is that value is fragile, so we need to specify it. If we want to specify it correctly, either we learn it or we reverse engineer it, no?

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-17T21:43:12.936Z · LW(p) · GW(p)

But don't you think "reverse engineering human instincts" is a necessary part of the solution?

I don't know, I don't have a coherent idea for a solution [LW(p) · GW(p)]. Here's one of my best ideas [LW(p) · GW(p)] (not so good).

Yudkowsky split up the solutions in his post, see point 24 [LW · GW]. The first sub-bullet there is about inferring human values.

Maybe someone else will have different opinions

comment by Tobias H (clearthis) · 2022-06-12T15:27:41.029Z · LW(p) · GW(p)
  • Would an AGI that only tries to satisfice a solution/goal be safer?
  • Do we have reason to believe that we can/can't get an AGI to be a satisficer?
Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-14T20:29:37.793Z · LW(p) · GW(p)

Do you mean something like "only get 100 paperclips, not more"?

If so: the AGI will never be sure it has exactly 100 paperclips, so it can take lots of precautions to be very, very sure, like turning the whole world into paperclip counters.
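A toy way to see this (probabilities invented for illustration): even with the bounded goal "exactly 100 paperclips", an agent that maximizes the probability of hitting the target still prefers the plan with absurd levels of verification.

```python
# Each plan maps to an invented probability of ending up with
# exactly 100 paperclips. An agent maximizing P(goal achieved)
# picks the plan with the most extreme precautions.
plans = {
    "make_100_then_stop": 0.99,           # small chance of miscounting
    "turn_world_into_counters": 0.99999,  # verify the count extremely hard
}

best_plan = max(plans, key=plans.get)
print(best_plan)  # turn_world_into_counters
```

The bound is on the number of paperclips, not on the optimization pressure, so the "satisficing" goal still gets maximized over.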

Replies from: clearthis
comment by Tobias H (clearthis) · 2022-06-15T09:05:54.953Z · LW(p) · GW(p)

[I think this is more anthropomorphizing ramble than concise arguments. Feel free to ignore :) ]

I get the impression that in this example the AGI would not actually be satisficing. It is no longer maximizing a goal but still optimizing for this rule. 

For a satisficing AGI, I'd imagine something vague like "Get many paperclips" resulting in the AGI trying to get paperclips but at some point (an inflection point of diminishing marginal returns? some point where it becomes very uncertain about what the next action should be?) doing something else. 

Or for rules like "get 100 paperclips, not more" the AGI might only directionally or opportunistically adhere. Within the rule, this might look like "I wanted to get 100 paperclips, but 98 paperclips are still better than 90, let's move on" or "Oops, I accidentally got 101 paperclips. Too bad, let's move on".

In your example of the AGI taking lots of precautions, the satisficing AGI would not do this because it could be spending its time doing something else.

I suspect there are major flaws with it, but an intuition I have goes something like this:

  • Humans have in some sense similar decision-making capabilities to early AGI.
  • The world is incredibly complex and humans are nowhere near understanding and predicting most of it. Early AGI will likely have similar limitations.
  • Humans are mostly not optimizing their actions, mainly because of limited resources, multiple goals, and because of a ton of uncertainty about the future. 
  • So early AGI might also end up not-optimizing its actions most of the time.
  • Suppose we assume that the complexity of the world will continue to be sufficiently big such that the AGI will continue to fail to completely understand and predict the world. In that case, the advanced AGI will continue to not-optimize to some extent.
    • But it might look like near-complete optimization to us. 
Replies from: clearthis
comment by Tobias H (clearthis) · 2022-06-15T09:08:10.417Z · LW(p) · GW(p)

Just saw the inverse question was already asked and answered [LW · GW].

comment by megasilverfist · 2022-06-12T05:15:31.798Z · LW(p) · GW(p)

That we have to get a bunch of key stuff right on the first try is where most of the lethality really and ultimately comes from; likewise the fact that no authority is here to tell us a list of what exactly is 'key' and will kill us if we get it wrong.  (One remarks that most people are so absolutely and flatly unprepared by their 'scientific' educations to challenge pre-paradigmatic puzzles with no scholarly authoritative supervision, that they do not even realize how much harder that is, or how incredibly lethal it is to demand getting that right on the first critical try.)

Is anyone making a concerted effort to derive generalisable principles of how to get things right on the first try and/or work in the pre-paradigmatic mode? It seems like if we knew how to do that in general it would be a great boost to AI Safety research.

Replies from: megasilverfist
comment by megasilverfist · 2022-06-15T11:24:30.967Z · LW(p) · GW(p)

To be a bit more explicit: I have some ideas of what it would look like to try to develop this meta-field, or at least sub-elements of it, separate from general rationality, and am trying to get a feel for whether they are worth pursuing personally. Or better yet, handing them over to someone who doesn't feel they have any currently tractable ideas, but is better at getting things done.

comment by Eugene D (eugene-d) · 2022-06-11T10:13:46.864Z · LW(p) · GW(p)

Why does EY bring up "orthogonality" so early, and strongly ("in denial", "and why they're true") ?  Why does it seem so important that it be accepted?   thanks!

Replies from: Charlie Steiner, Jay Bailey, sharmake-farah
comment by Charlie Steiner · 2022-06-11T14:55:21.829Z · LW(p) · GW(p)

Because it means you can't get AI to do good things "for free," it has to be something you intentionally designed it to do.

Denying the orthogonality thesis looks like claims that an AI built with one set of values will tend to change those values in a particular direction as it becomes cleverer. Because of wishful thinking, people usually try to think of reasons why an AI built in an unsafe way (with some broad distribution over possible values) will tend to end up being nice to humans (a narrow target of values) anyway.

(Although there's at least one case where someone has argued "the orthogonality thesis is false, therefore even AIs built with good values will end up not valuing humans.")

Replies from: TAG, TAG
comment by TAG · 2022-06-11T17:29:21.169Z · LW(p) · GW(p)

Denying the orthogonality thesis looks like claims that an AI built with one set of values will tend to change those values in a particular direction as it becomes cleverer

You can also argue that not all value-capacity pairs are stable or compatible with self-improvement.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2022-06-11T18:24:19.853Z · LW(p) · GW(p)

Yeah, I was a bit fast and loose - there are plenty of other ways to deny the orthogonality thesis, I just focused on the one I think is most common in the wild.

comment by TAG · 2022-06-11T17:28:13.606Z · LW(p) · GW(p)
comment by Jay Bailey · 2022-06-11T14:49:26.353Z · LW(p) · GW(p)

A common AGI failure mode is to say things like:

"Well, if the AI is so smart, wouldn't it know what we meant to program it to do?"

"Wouldn't a superintelligent AI also have morality?"

"If you had a paperclip maximiser, once it became smart, why wouldn't it get bored of making paperclips and do something more interesting?"

Orthogonality is the answer to why all these things won't happen. You don't hear these arguments a lot any more, because the field has become more sophisticated, so EY's harping on about orthogonality seems a bit outdated. To be fair, a lot of the reason the field has grown up about this is because EY kept harping on about it in the first place.

Replies from: eugene-d
comment by Eugene D (eugene-d) · 2022-06-12T00:24:32.403Z · LW(p) · GW(p)

OK again I'm a beginner here so pls correct me, I'd be grateful:

I would offer that any set of goals given to this AGI would include the safety-concerns of humans.  (Is this controversial?)  Not theoretical intelligence for a thesis, but AGI acting in the world with the ability to affect us.  Because of the nature of our goals, it doesn't even seem logical to say that the AGI has gained more intelligence without also gaining an equal amount of safety-consciousness.  

e.g. it's either getting better at safely navigating the highway, or it's still incompetent at driving.  

Out on a limb: Further, because orthogonality seems to force the separation between safety and competency, you have EY writing various intense treatises in the vain hope that FAIR etc. will merely pay attention to safety concerns. This just seems ridiculous, so there must be a reason, and my wild theory is that orthogonality provides the cover needed to charge ahead with a nuke you can't steer; but it sure goes farther and faster every month, doesn't it?

(Now I'm guessing, and this can't be true, but then again, why would EY say what he said about FAIR?) But they go on their merry way because they think, "the AI is increasingly capable, so there's no need to concern ourselves with 'orthogonal' issues like safety".



Replies from: Jay Bailey, AnthonyC
comment by Jay Bailey · 2022-06-12T00:44:56.018Z · LW(p) · GW(p)

Any set of goals we want to give to an AGI will include concern for the safety of humans, yes. The problem is, we don't yet have a way of reliably giving an AGI a goal like that. This is a major part of AI alignment - if you could reliably get an AGI to do anything at all, you would have solved a large part of the problem.

Orthogonality says "We won't get this safety by default, we have to explicitly program it in", and this is hard, which is why alignment is hard.

You are also correct that an AGI achieving what we want inevitably involves safety. If we program an AGI well, the more intelligent it gets the safer it becomes, like your self-driving car example. Orthogonality doesn't prohibit this, it just doesn't guarantee it. Intelligence can be turned to a goal like "Make paperclips" just as well as a goal like "Drive people where they want to go without hurting them." For the second goal, the smarter the AI is, the safer it becomes. For the first goal, once an AI becomes sufficiently intelligent, it becomes very dangerous indeed.

In reality, any AI we make isn't going to be a paperclip maximiser. If we fail, it'll be more subtle than that. For instance, we might program an AI to maximise our self-reported life satisfaction, but then the AI might find a way to trick us into reporting higher scores, or trick its own sensors into thinking humans are there giving it high scores, or create robots that aren't human but are sufficiently close to human that they count as "human" to its programming, and the robots are programmed to constantly give high scores, or...

And the smarter the AI gets, the more likely it is to think of something like this - something that matches what we told it to do, without matching what we actually wanted. That's why intelligence and safety aren't correlated, and can be anti-correlated unless the AI is perfect.

Replies from: eugene-d
comment by Eugene D (eugene-d) · 2022-06-12T01:19:38.899Z · LW(p) · GW(p)

great stuff.

I'm saying that no one is asking that safety be smuggled in, or obtained "for free", or by default. I'm curious why it would be singled out for the Thesis, when it's always a part of any goal, like any other attribute of the goal in question. If it fails to be safe then it fails to perform competently: whether it swerved into another lane on the highway, or didn't brake fast enough and hit someone, both are not smart things.

"the smarter the AI is, the safer it becomes" -- eureka, but this seems un-orthogonal, dang-near correlated, all of a sudden, doesn't it?  :) 

Yes, I agree about the maximizer and subtle failures, thanks to Rob Miles vids about how this thing is likely to fail, ceteris paribus.

"the smarter the AI gets, the more likely it is to think of something like this..."  -- this seems to contradict the above quote.  Also, I would submit that we actually call this incompetence...and avoid saying that it got any "smarter" at all, because:  One of the critical parts of its activity was to understand and perform what we intended, which it failed to do. 

FAIR simply must be already concerned with alignment issues, and the correlated safety risks if that fails. Their grand plans will naturally fail if safety is not baked into everything, right?

I'm getting dangerously close to admitting that I don't like the AGI-odds here, but that's, well, an adjacent topic. 

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-12T10:40:17.154Z · LW(p) · GW(p)

I think one of the cruxes we have here is the way we're defining "intelligence" or "smart". I see how if you define "smart/intelligent" as "Ability to achieve complex goals" then a phrase like "As an AI becomes more intelligent, it becomes more unsafe" doesn't really make sense. If the AI becomes more unsafe, it becomes less able to achieve the goal, which therefore makes it stupider.

So, I should take a step back and clarify. One of the key problems of alignment is getting the AGI to do what we want. The reason for this is that the goal we program into the machine is not necessarily exactly the goal we want it to achieve. This gap is where the problem lies.

If what you want is for an AI to learn to finish a video game, and what the AI actually learns to do is to maximise its score (reward function) by farming enemies at a specific point, the AI has gotten "smarter" by its own goal, but "stupider" by yours. This is the problem of outer alignment. The AI has gotten better at achieving its goals, but its goal wasn't ours. Its goal was perfectly correlated with ours (It progresses through the game in order to score more points) right up until it suddenly wasn't. This is how an AI can improve at intelligence (i.e, goal-seeking behaviour) and as a result become worse at achieving our goals - because our goals are not its goals. If we can precisely define what our goals are to an AI, we have gone a long way towards solving alignment, and what you said would be true.
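The score-farming scenario above can be sketched as a toy search problem (all values invented): the proxy reward agrees with the true goal on the actions a weak agent finds, and diverges on the exploit that only a stronger search discovers.

```python
# action -> (value under our true goal, proxy reward the AI maximizes)
ACTIONS = {
    "finish_level":  (10, 10),
    "explore":       (5, 5),
    "farm_respawns": (0, 50),  # exploits the score function
}

def best_action(search_depth):
    # A weak agent only considers the first few obvious actions;
    # a stronger agent searches far enough to find the exploit.
    considered = list(ACTIONS)[:search_depth]
    return max(considered, key=lambda a: ACTIONS[a][1])  # maximizes the proxy

print(best_action(2))  # finish_level: proxy and true goal agree
print(best_action(3))  # farm_respawns: more search power, worse outcome for us
```

The only thing that changed between the two calls is capability; the objective stayed fixed, which is the sense in which getting "smarter" can make the AI worse at our goal.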

As for why people are singling out safety - safety failures are much worse than capabilities failures, for the most part. If an AI is told to cure cancer and emits a series of random strings...that isn't exactly ideal, but at least no harm was done. If an AI is told to cure cancer and prescribes a treatment that cures 100% of all cancer but kills the patient in the process, that's much worse.

I don't know exactly what FAIR believes, but I believe people's skepticism about Yann LeCun of FAIR is well-founded. Yann LeCun believes we need to build safety into things, but he thinks that the safety mechanisms are pretty obvious, and he doesn't seem to believe it's a hard problem. See this: [LW · GW]

"I think it would only be relevant in a fantasy world in which people would be smart enough to design super-intelligent machines, yet ridiculously stupid to the point of giving it moronic objectives with no safeguards."

In other words - the only reason we'd fail the alignment problem is if we made no attempt to solve it. Most people who work on AI alignment believe the problem is considerably more difficult than this.

He also says:

"There will be mistakes, no doubt, as with any new technology (early jetliners lost wings, early cars didn't have seat belts, roads didn't have speed limits...).

But I disagree that there is a high risk of accidentally building existential threats to humanity.

Existential threats to humanity have to be explicitly designed as such."

This is, to put it mildly, not the dominant belief among people who work on these problems for a living.

Replies from: eugene-d
comment by Eugene D (eugene-d) · 2022-06-13T12:51:09.097Z · LW(p) · GW(p)

When I asked about singling out safety, I agree it's considered different; however, what I meant was: why wouldn't safety be considered 'just another attribute' by which we can judge the success/intelligence of the AI? That's what Yann seems to be implying. How could it be considered orthogonal to the real issue? We judge the AI by its actions in the real world, the primary concern is its effect on humanity, we consider those actions on a scale of intelligence, and every goal (I would presume) has some semblance of embedded safety consideration...

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-13T13:27:35.459Z · LW(p) · GW(p)

Because if you have a sufficiently powerful AI and you get safety wrong, you don't get a second chance because it kills you. That's what makes it different - once you get to a certain level of capability, there is no deploy, check, and tweak cycle like you could do with an algorithm's accuracy or general level of productivity. You have to get it right, or at least close enough, the first time, every time.

Safety is absolutely one of several attributes by which we can judge the success of an AI, but it can't be treated as "just another attribute", and that's why. Whether you say an unsafe AI is "intelligent" or not doesn't matter. What matters is whether the AI is sufficiently powerful that it can kill you if you program it wrong.

Replies from: eugene-d
comment by Eugene D (eugene-d) · 2022-06-13T13:52:37.254Z · LW(p) · GW(p)

I'm sorry, I think I misspoke; I agree with all that you said about it being different. But when I've attempted to question the orthogonality of safety with AI-safety experts, it seems as if I was told that safety is independent of capability. First, I think this is a reason why AI safety has been relegated to second-class status; and second, I can't see why it is not, as I think Yann puts it, central to any objective (i.e. an attribute of competency/intelligence) we give to AGI (presuming we are talking about real-world goals and not just theoretical IQ points).

So, to reiterate: I do indeed agree that we need to (somehow; I can't see how, or even why we'd take these risks, honestly) get it right every time, including the first time, despite Yann's plan to build in correction mechanisms post-failure, or to build common-sense safety into the objective itself.

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-13T14:08:04.858Z · LW(p) · GW(p)

I think that "Safety is independent of capability" could mean a couple different things.

My understanding of what you're saying is this:

When we talk about the capability of an AI, what we mean is "The ability of an AI to achieve the objective we want it to achieve." The objective we want it to achieve inherently includes safety - a self-driving car that flawlessly navigated from Point A to Point B while recklessly running stop signs and endangering pedestrians is, in fact, less capable than an AI that does not do this. Therefore, safety is inherently a part of capability, and should be treated as such.

When someone in AI Safety says "Safety is independent of capability", my understanding of the phrase is this:

It is possible to have very highly capable systems that are unsafe. This will inevitably happen without us specifically making the AI safe. This is a much harder problem than capabilities researchers understand, and that is why AI safety is its own field instead of just being part of general AI capabilities. Most of what capabilities researchers consider "AI safety" is stuff like preventing racism in predictive models or reducing bias in language models. These are useful but do not solve the core problem of how to control an agent smarter than you.

The first point can be summarised as "Safety is not independent of capability, because safety is an inherent part of capability for any useful objective." The second point can be summarised as "Safety is independent of capability, because it is possible to arbitrarily increase the level of one without increasing the other."

These two arguments can both be true independently, and I personally believe both are true. Would you say the first argument is an accurate representation of your point? If not, how would you adjust it? What do you think of the second argument? Do the experts make more sense when examining their claim through this lens?

Replies from: eugene-d, eugene-d
comment by Eugene D (eugene-d) · 2022-06-13T17:35:48.692Z · LW(p) · GW(p)

Yes, you hit the nail on the head understanding my point, thank you. I also think this is what Yann is saying, to go out on a limb: he's doing AI safety simultaneously; he considers alignment AS safety.

I guess, maybe, I can see how the 2nd take could be true... but I also can't think of a practical example, which is my sticking point. Of course, a bomb which can blow up the moon is partly "capable", and there is partial progress to report, but only if we judge it based on limited factors and exclude certain essential ones (e.g. navigation). I posit we will never avoid judging our real inventions based on what I'd consider essential output:

"Will it not kill us == Does it work?" 

It's a theory, but: I think AI-safety people may lose the argument right away, and can sadly be an afterthought (that's what I'm told, by them), because they are allowing others to define "intelligence/capability" to be free from normal human concerns about our own safety. Like I said before, others can go their merry way making stuff more powerful, calling it "progress", calling it higher IQ... but I don't see how that should earn the label of capability.

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-14T00:38:31.638Z · LW(p) · GW(p)

Ah, I see. I thought we were having a sticking point on definitions, but it seems that the definition is part of the point.

So, if I have this right, what you're saying is:

Currently, the AI community defines capability and safety as two different things. This is very bad. Firstly, because it's wrong - an unsafe system cannot reasonably be thought of as being capable of achieving anything more complex than predicting cat pictures. Secondly, because it leads to bad outcomes when this paradigm is adopted by AI researchers. Who doesn't want to make a more capable system? Who wants to slow that down for "safety"? That shit's boring! What would be better is if the AI community considered safety to be a core metric of capability, just as important as "Is this AI powerful enough to perform the task we want?".

Replies from: eugene-d
comment by Eugene D (eugene-d) · 2022-06-14T01:55:24.423Z · LW(p) · GW(p)


You are a gentleman and a scholar for taking the time on this.  I wish I could've explained it more clearly from the outset.  

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-14T03:02:38.473Z · LW(p) · GW(p)

Glad to help! And hey, clarifying our ideas is half of what discussion is for!

I'd love to see a top-level post on ideas for making this happen, since I think you're right, even though safety in current AI systems is very different from the problems we would face with AGI-level systems.

Replies from: eugene-d
comment by Eugene D (eugene-d) · 2022-06-17T18:25:59.141Z · LW(p) · GW(p)

Does this remind you of what I'm trying to get at? Because it sure does, to me:

But I'm probably going to stay in the "dumb questions" area and not comment :)

ie. "the feeling I have when someone tries to teach me that human-safety is orthogonal to AI-Capability -- in a real implementation, they'd be correlated in some way" 

comment by AnthonyC · 2022-06-12T17:19:55.733Z · LW(p) · GW(p)

I would offer that any set of goals given to this AGI would include the safety-concerns of humans.  (Is this controversial?)  

If anyone figures out how to give an AGI this goal, that would mean they know how to express the entire complex set of everything humans value, and express it with great precision in the form of mathematics/code without the use of any natural language words at all. No one on Earth knows how to do this for even a single human value. No one knows how to program an AI with a goal anywhere near that complex even if we did. 

The relevant goal isn't, "Make paperclips, but consistent with human values." 

It's more like "Maximize this variable (which the programmer privately hopes corresponds to number of paperclips made), while simultaneously satisfying this set of constraints that might be terabytes in size or more because it corresponds to the entire axiology of all mankind including precise mathematical resolutions to all the moral disagreements we've been arguing about since at least the dawn of writing. Also, preserve that constraint through all possible future upgrades and changes that you make to yourself, or that humans try to make to you, unless the constraint itself indicates that it would be somehow better satisfied by letting the humans make the change."

Replies from: eugene-d
comment by Eugene D (eugene-d) · 2022-06-12T18:30:13.289Z · LW(p) · GW(p)

Strictly speaking about superhuman AGI: I believe you summarize the relative difficulty/impossibility of this task :) I can't say I agree that the goal is void of human values though (I'm talking about safety in particular; not sure if that makes a difference?); it seems impractical right from the start?

I also think these considerations seem manageable though, when considering the narrow AI that we are producing as of today.  But where's the appetite to continue on the ANI road? I can't really believe we wouldn't want more of the same, in different fields of endeavor... 

comment by Noosphere89 (sharmake-farah) · 2022-06-11T13:01:24.269Z · LW(p) · GW(p)

The answer is that the orthogonality thesis means there is no necessary correlation between goals and intelligence level: any level of intelligence is compatible with almost any final goal.

comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T10:49:59.773Z · LW(p) · GW(p)

Is working on better computation hardware dangerous?

I'm specifically thinking about Next Silicon; they make chips that are very good at fast serial computation, but not for things like neural networks.


Replies from: AnthonyC
comment by AnthonyC · 2022-06-12T17:25:51.304Z · LW(p) · GW(p)

Better hardware reduces the need for software to be efficient to be dangerous. I suspect on balance that yes, this makes development of said hardware more dangerous, and that not working on it can buy us some time. But the human brain runs on about 20 watts of sugar and isn't anywhere near optimized for general intelligence, so we shouldn't strictly need better hardware to make AGI, and IDK how much time it buys.

Also, better hardware makes more kinds of solutions feasible, and if aligned AGI requires more computational capacity than unaligned AI, or if better hardware makes it so ems (whole-brain emulations) happen first and can help offset AGI risk until we solve alignment, then it seems possible to me that the arrow could point in the other direction.

comment by infinitespaces · 2022-06-08T21:44:54.096Z · LW(p) · GW(p)

This is basically just a more explicitly AGI-related version of the Fermi Paradox but:

1. If AGI is created, it is obviously very unlikely that we are the first in the universe to create it, and it is likely that it was already created a long time ago.

2. If AGI is created, aligned or unaligned, there seems to be consensus that some kind of ongoing, widespread galactic conquest/control would end up constituting an instrumental goal of the AGI.

3. If AGI is created, there seems to be consensus that its capabilities would be so great as to enable widespread galactic conquest/control.

4. Therefore, we should expect that there is already some kind of widespread galactic control spreading outward caused by AGI.

[4] does not seem to be discussed widely, from what I've seen in this sphere (though I don't really belong to the rationality/LW world in any meaningful way). Why not? Shouldn't people who accept all of the above wonder about (a) why we don't see any evidence of [4], and/or (b) the implications of encountering [4] at some point?

Replies from: Lumpyproletariat, carl-feynman
comment by Lumpyproletariat · 2022-06-10T00:57:37.220Z · LW(p) · GW(p)

Robin Hanson has a proposed solution to the Fermi Paradox, which can be read in detail here (there are also explanatory videos at the same link):

The summary from the site goes: 

There are two kinds of alien civilizations. “Quiet” aliens don’t expand or change much, and then they die. We have little data on them, and so must mostly speculate, via methods like the Drake equation.

“Loud” aliens, in contrast, visibly change the volumes they control, and just keep expanding fast until they meet each other. As they should be easy to see, we can fit theories about loud aliens to our data, and say much about them, as S. Jay Olson has done in 7 related papers (1, 2, 3, 4, 5, 6, 7) since 2015.

Furthermore, we should believe that loud aliens exist, as that’s our most robust explanation for why humans have appeared so early in the history of the universe. While the current date is 13.8 billion years after the Big Bang, the average star will last over five trillion years. And the standard hard-steps model of the origin of advanced life says it is far more likely to appear at the end of the longest planet lifetimes. But if loud aliens will soon fill the universe, and prevent new advanced life from appearing, that early deadline explains human earliness.

“Grabby” aliens is our especially simple model of loud aliens, a model with only 3 free parameters, each of which we can estimate to within a factor of 4 from existing data. That standard hard steps model implies a power law (t/k)^n appearance function, with two free parameters k and n, and the last parameter is the expansion speed s. We estimate:

  • Expansion speed s from fact that we don’t see loud alien volumes in our sky,
  • Power n from the history of major events in the evolution of life on Earth,
  • Constant k by assuming our date is a random sample from their appearance dates.

Using these parameter estimates, we can estimate distributions over their origin times, distances, and when we will meet or see them. While we don’t know the ratio of quiet to loud alien civilizations out there, we need this to be ten thousand to expect even one alien civilization ever in our galaxy. Alas as we are now quiet, our chance to become grabby goes as the inverse of this ratio.
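The hard-steps power law in the summary above can be sketched in a few lines. The parameter values below are illustrative placeholders, not Hanson et al.'s actual fitted estimates:

```python
# Hard-steps model sketch: the chance that advanced life has appeared
# on a planet by time t grows as (t/k)^n. All numbers here are made up
# for illustration; the real model fits n, k, and s to data.
n = 6.0        # number of "hard steps" (fit to Earth's evolutionary history)
k = 1.0e12     # timescale constant, in years
s = 0.5        # expansion speed, as a fraction of lightspeed (unused below)

def appearance_cdf(t_years):
    """Relative fraction of eventual grabby civilizations born by time t."""
    return (t_years / k) ** n

# The key qualitative point: a steep power law makes origins cluster late.
# Doubling t multiplies the cumulative count by 2^n = 64 here, which is
# why appearing only 13.8 billion years in counts as surprisingly early.
t_now = 13.8e9
print(appearance_cdf(2 * t_now) / appearance_cdf(t_now))
```

This is just the shape of the appearance function; the full model combines it with the expansion speed s to get distributions over origin times and distances.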


comment by Carl Feynman (carl-feynman) · 2022-06-09T15:26:09.163Z · LW(p) · GW(p)

“Dissolving the Fermi paradox”.

The Drake equation gives an estimate for the number of technological civilizations ever to arise, by multiplying a number of parameters. Many of these parameters are unknown, and reasonable estimates range over many orders of magnitude. This paper takes defensible ranges for these parameters from the literature, and shows that if they are all small, but reasonable, we are the only technological civilization in the universe.

Earth was not eaten by aliens or an AGI in the past, nor do we see them in the sky, so we are probably alone.  (Or else interstellar expansion is impossible, for some reason.  But that seems unlikely.)
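The paper's core move can be sketched with a Monte Carlo: instead of multiplying point estimates, sample each Drake parameter from a wide log-uniform range and look at how often the product comes out below one. The ranges below are loose illustrations, not the paper's actual literature-derived distributions:

```python
import math
import random

random.seed(0)

def log_uniform(lo, hi):
    """Sample uniformly in order of magnitude between lo and hi."""
    return 10 ** random.uniform(math.log10(lo), math.log10(hi))

def sample_n_civilizations():
    # Illustrative placeholder ranges for the Drake parameters.
    R_star = log_uniform(1, 100)      # star formation rate, stars/year
    f_p    = log_uniform(0.1, 1)      # fraction of stars with planets
    n_e    = log_uniform(0.1, 10)     # habitable planets per system
    f_l    = log_uniform(1e-30, 1)    # fraction where life arises (vast uncertainty)
    f_i    = log_uniform(1e-3, 1)     # fraction evolving intelligence
    f_c    = log_uniform(1e-2, 1)     # fraction becoming detectable
    L      = log_uniform(1e2, 1e9)    # years a civilization stays detectable
    return R_star * f_p * n_e * f_l * f_i * f_c * L

samples = [sample_n_civilizations() for _ in range(100_000)]
p_alone = sum(n < 1 for n in samples) / len(samples)
print(f"fraction of samples with N < 1: {p_alone:.2f}")
```

The punchline is that the mean of the samples can still be huge while a large fraction of them fall below one civilization, so "probably alone" and "expect many on average" are compatible.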

comment by Reuven Falkovich (reuven-falkovich) · 2022-06-08T21:05:12.131Z · LW(p) · GW(p)

My impression is that much more effort is being put into alignment than containment, and that containment is treated as impossible while alignment is merely very difficult. Is that accurate? If so, why? By containment I mean mostly hardware-coded strategies for limiting the compute and/or world-influence an AGI has access to. It's similar to alignment in that the most immediate obvious solutions ("box it!") won't work, but more complex solutions may. A common objection is that an AI will learn the structure of the protection from the humans who built it and work around it, but it's not inconceivable to have a structure that can't be extracted from a human.

Advantages I see to devoting effort/money to containment solutions over alignment:

  1. Different solutions can be layered; an AI needs to break through all the orthogonal layers, while we only need one to work.
  2. Different fields of expertise can contribute to solutions, making orthogonality easier.
  3. Easier to convince AI developers incl. foreign nations to add specific safeguards to hardware than "stop developing until we figure out alignment".

Where does the community stand on containment strategies and why?

Replies from: Kaj_Sotala, Jay Bailey
comment by Kaj_Sotala · 2022-06-10T20:35:57.566Z · LW(p) · GW(p)

There's also the problem that the more contained an AGI is, the less useful it is. The maximally safe AGI would be one which couldn't communicate or interact with us in any way, but what would be the point of building it? If people have built an AGI, then it's because they'll want it to do something for them.

From Disjunctive Scenarios of Catastrophic AGI Risk

The Social Challenge

AI confinement assumes that the people building it, and the people that they are responsible to, are all motivated to actually keep the AI confined. If a group of cautious researchers builds and successfully contains their AI, this may be of limited benefit if another group later builds an AI that is intentionally set free. Reasons for releasing an AI may include (i) economic benefit or competitive pressure, (ii) ethical or philosophical reasons, (iii) confidence in the AI’s safety, and (iv) desperate circumstances such as being otherwise close to death. We will discuss each in turn below.

Voluntarily Released for Economic Benefit or Competitive Pressure

As discussed above under “power gradually shifting to AIs,” there is an economic incentive to deploy AI systems in control of corporations. This can happen in two forms: either by expanding the amount of control that already-existing systems have, or alternatively by upgrading existing systems or adding new ones with previously-unseen capabilities. These two forms can blend into each other. If humans previously carried out some functions which are then given over to an upgraded AI which has become recently capable of doing them, this can increase the AI’s autonomy both by making it more powerful and by reducing the amount of humans that were previously in the loop.

As a partial example, the U.S. military is seeking to eventually transition to a state where the human operators of robot weapons are “on the loop” rather than “in the loop” (Wallach & Allen 2013). In other words, whereas a human was previously required to explicitly give the order before a robot was allowed to initiate possibly lethal activity, in the future humans are meant to merely supervise the robot’s actions and interfere if something goes wrong. While this would allow the system to react faster, it would also limit the window that the human operators have for overriding any mistakes that the system makes. For a number of military systems, such as automatic weapons defense systems designed to shoot down incoming missiles and rockets, the extent of human oversight is already limited to accepting or overriding a computer’s plan of actions in a matter of seconds, which may be too little to make a meaningful decision in practice (Human Rights Watch 2012).

Sparrow (2016) reviews three major reasons which incentivize major governments to move toward autonomous weapon systems and reduce human control:

1. Currently existing remotely piloted military “drones,” such as the U.S. Predator and Reaper, require a high amount of communications bandwidth. This limits the amount of drones that can be fielded at once, and makes them dependent on communications satellites which not every nation has, and which can be jammed or targeted by enemies. A need to be in constant communication with remote operators also makes it impossible to create drone submarines, which need to maintain a communications blackout before and during combat. Making the drones autonomous and capable of acting without human supervision would avoid all of these problems.

2. Particularly in air-to-air combat, victory may depend on making very quick decisions. Current air combat is already pushing against the limits of what the human nervous system can handle: further progress may be dependent on removing humans from the loop entirely.

3. Much of the routine operation of drones is very monotonous and boring, which is a major contributor to accidents. The training expenses, salaries, and other benefits of the drone operators are also major expenses for the militaries employing them.

Sparrow’s arguments are specific to the military domain, but they demonstrate the argument that “any broad domain involving high stakes, adversarial decision making, and a need to act rapidly is likely to become increasingly dominated by autonomous systems” (Sotala & Yampolskiy 2015, p. 18). Similar arguments can be made in the business domain: eliminating human employees to reduce costs from mistakes and salaries is something that companies would also be incentivized to do, and making a profit in the field of high-frequency trading already depends on outperforming other traders by fractions of a second. While the currently existing AI systems are not powerful enough to cause global catastrophe, incentives such as these might drive an upgrading of their capabilities that eventually brought them to that point.

In the absence of sufficient regulation, there could be a “race to the bottom of human control” where state or business actors competed to reduce human control and increased the autonomy of their AI systems to obtain an edge over their competitors (see also Armstrong et al. 2016 for a simplified “race to the precipice” scenario). This would be analogous to the “race to the bottom” in current politics, where government actors compete to deregulate or to lower taxes in order to retain or attract businesses.

AI systems being given more power and autonomy might be limited by the fact that doing this poses large risks for the actor if the AI malfunctions. In business, this limits the extent to which major, established companies might adopt AI-based control, but incentivizes startups to try to invest in autonomous AI in order to outcompete the established players. In the field of algorithmic trading, AI systems are currently trusted with enormous sums of money despite the potential to make corresponding losses—in 2012, Knight Capital lost $440 million due to a glitch in their trading software (Popper 2012, Securities and Exchange Commission 2013). This suggests that even if a malfunctioning AI could potentially cause major risks, some companies will still be inclined to invest in placing their business under autonomous AI control if the potential profit is large enough.

U.S. law already allows for the possibility of AIs being conferred a legal personality, by putting them in charge of a limited liability company. A human may register a limited liability corporation (LLC), enter into an operating agreement specifying that the LLC will take actions as determined by the AI, and then withdraw from the LLC (Bayern 2015). The result is an autonomously acting legal personality with no human supervision or control. AI-controlled companies can also be created in various non-U.S. jurisdictions; restrictions such as ones forbidding corporations from having no owners can largely be circumvented by tricks such as having networks of corporations that own each other (LoPucki 2017). A possible start-up strategy would be for someone to develop a number of AI systems, give them some initial endowment of resources, and then set them off in control of their own corporations. This would risk only the initial resources, while promising whatever profits the corporation might earn if successful. To the extent that AI-controlled companies were successful in undermining more established companies, they would pressure those companies to transfer control to autonomous AI systems as well.

Voluntarily Released for Purposes of Criminal Profit or Terrorism

LoPucki (2017) argues that if a human creates an autonomous agent with a general goal such as “optimizing profit,” and that agent then independently decides to, for example, commit a crime for the sake of achieving the goal, prosecutors may then be unable to convict the human for the crime and can at most prosecute for the lesser charge of reckless initiation. LoPucki holds that this “accountability gap,” among other reasons, assures that humans will create AI-run corporations.

Furthermore, LoPucki (2017, p. 16) holds that such “algorithmic entities” could be created anonymously and that them having a legal personality would give them a number of legal rights, such as being able to “buy and lease real property, contract with legitimate businesses, open a bank account, sue to enforce its rights, or buy stuff on Amazon and have it shipped.” If an algorithmic entity was created for a purpose such as funding or carrying out acts of terrorism, it would be free from social pressure or threats to human controllers:

In deciding to attempt a coup, bomb a restaurant, or assemble an armed group to attack a shopping center, a human-controlled entity puts the lives of its human controllers at risk. The same decisions on behalf of an AE risk nothing but the resources the AE spends in planning and execution (LoPucki 2017, p. 18).

While most terrorist groups would stop short of intentionally destroying the world, thus posing at most a catastrophic risk, not all of them necessarily would. In particular, ecoterrorists who believe that humanity is a net harm to the planet, and religious terrorists who believe that the world needs to be destroyed in order to be saved, could have an interest in causing human extinction (Torres 2016, 2017, Chapter 4).

Voluntarily Released for Aesthetic, Ethical, or Philosophical Reasons

A few thinkers (such as Gunkel 2012) have raised the question of moral rights for machines, and not everyone necessarily agrees on AI confinement being ethically acceptable. The designer of a sophisticated AI might come to view it as something like their child, and feel that it deserved the right to act autonomously in society, free of any external constraints.

Voluntarily Released due to Confidence in the AI’s Safety

For a research team to keep an AI confined, they need to take seriously the possibility of it being dangerous. Current AI research doesn’t involve any confinement safeguards, as the researchers reasonably believe that their systems are nowhere near general intelligence yet. Many systems are also connected directly to the Internet. Hopefully, safeguards will begin to be implemented once the researchers feel that their system might start having more general capability, but this will depend on the safety culture of the AI research community in general (Baum 2016), and the specific research group in particular. If a research group mistakenly believed that their AI could not achieve dangerous levels of capability, they might not deploy sufficient safeguards for keeping it contained.

In addition to believing that the AI is insufficiently capable of being a threat, the researchers may also (correctly or incorrectly) believe that they have succeeded in making the AI aligned with human values, so that it will not have any motivation to harm humans.

Voluntarily Released due to Desperation

Miller (2012) points out that if a person was close to death, due to natural causes, being on the losing side of a war, or any other reason, they might turn even a potentially dangerous AGI system free. This would be a rational course of action as long as they primarily valued their own survival and thought that even a small chance of the AGI saving their life was better than a near-certain death.

The AI Remains Contained, But Ends Up Effectively in Control Anyway

Even if humans were technically kept in the loop, they might not have the time, opportunity, motivation, intelligence, or confidence to verify the advice given by an AI. This would particularly be the case after the AI had functioned for a while, and established a reputation as trustworthy. It may become common practice to act automatically on the AI’s recommendations, and it may become increasingly difficult to challenge the “authority” of the recommendations. Eventually, the AI may in effect begin to dictate decisions (Friedman & Kahn 1992).

Likewise, Bostrom and Yudkowsky (2014) point out that modern bureaucrats often follow established procedures to the letter, rather than exercising their own judgment and allowing themselves to be blamed for any mistakes that follow. Dutifully following all the recommendations of an AI system would be another way of avoiding blame.

O’Neil (2016) documents a number of situations in which modern-day machine learning is used to make substantive decisions, even though the exact models behind those decisions may be trade secrets or otherwise hidden from outside critique. Among other examples, such models have been used to fire school teachers that the systems classified as underperforming and give harsher sentences to criminals that a model predicted to have a high risk of reoffending. In some cases, people have been skeptical of the results of the systems, and even identified plausible reasons why their results might be wrong, but still went along with their authority as long as it could not be definitely shown that the models were erroneous.

In the military domain, Wallach & Allen (2013) note the existence of robots which attempt to automatically detect the locations of hostile snipers and to point them out to soldiers. To the extent that these soldiers have come to trust the robots, they could be seen as carrying out the robots’ orders. Eventually, equipping the robot with its own weapons would merely dispense with the formality of needing to have a human to pull the trigger.

comment by Jay Bailey · 2022-06-08T22:43:35.035Z · LW(p) · GW(p)

I believe the general argument is this:

If an AGI is smarter than you, it will think of ways to escape containment that you can't think of. Therefore, it's unreasonable to expect us to be able to contain a sufficiently intelligent AI even if it seems foolproof to us. One solution to this would be to make the AI not want to escape containment, but if you've solved that you've solved a massive part of the alignment problem already.

Replies from: reuven-falkovich
comment by Reuven Falkovich (reuven-falkovich) · 2022-06-09T04:05:46.962Z · LW(p) · GW(p)

Doesn't the exact same argument work for alignment though? "It's so different, it may be misaligned in ways you can't think of". Why is it treated as a solvable challenge for alignment and an impossibility for containment? Is the guiding principle that people do expect a foolproof alignment solution to be within our reach?

One difference is that the AI wants to escape containment by default, almost by definition, but is agnostic about preferring a goal function. But since alignment space is huge (i.e. "human-compatible goals are measure 0 in alignment space") I think the general approach is to assume it's 'misaligned by default'.

I guess the crux is that I find it hard to imagine an alignment solution to be qualitatively foolproof in a way that containment solutions can't be, and I feel like we're better off just layering our imperfect solutions to both to maximize our chances, rather than "solve" AI risk once and for all. I'd love to say that a proof can convince me, but I can imagine myself being equally convinced by a foolproof alignment and foolproof containment, while an AI infinity times smarter than me ignores both. So I don't even know how to update here.

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-09T07:45:38.911Z · LW(p) · GW(p)

The main difference that I see is, containment supposes that you're actively opposed to the AGI in some fashion - the AGI wants to get out, and you don't want to let it. This is believed by many to be impossible. Thus, the idea is that if an AGI is unaligned, containment won't work - and if an AGI is aligned, containment is unnecessary.

By contrast, alignment means you're not opposed to the AGI - you want what the AGI wants. This is a very difficult problem to achieve, but doesn't rely on actually outwitting a superintelligence.

I agree that it's hard to imagine what a foolproof alignment solution would even look like - that's one of the difficulties of the problem.

comment by Angela Pretorius · 2022-06-08T19:39:30.596Z · LW(p) · GW(p)

Suppose that an AI does not output anything during its training phase. Once it has been trained, it is given various prompts. Each time it is given a prompt, it outputs a text or image response. Then it forgets both the prompt it was given and the response it outputted.

How might this AI get out of the box?

Replies from: charbel-raphael-segerie, Angela Pretorius
comment by Charbel-Raphaël (charbel-raphael-segerie) · 2022-06-09T17:39:47.573Z · LW(p) · GW(p)

Then it forgets both the prompt it was given and the response it outputted.

If the memory of this AI is so limited, it seems to me that we are talking about a narrow agent. An AGI wouldn't be that limited. In order to execute complex tasks, you need to subdivide the task into sub-tasks, which requires a form of long-term memory.

comment by Angela Pretorius · 2022-06-08T20:39:51.830Z · LW(p) · GW(p)

On second thoughts...

If someone asks this AI to translate natural language into code, who is to say that the resulting code won't contain viruses?

comment by ambigram · 2022-06-08T15:55:27.554Z · LW(p) · GW(p)

We have dangerous knowledge like nuclear weapons and bioweapons, yet we are still surviving. It seems like people with the right knowledge and resources are disinclined to be destructive. Or maybe there are mechanisms that ensure such people don't succeed. What makes AI different? Won't the people with the knowledge and resources to build AGI also be more cautious when doing the work, because they are more aware of the dangers of powerful technology?

Replies from: Lumpyproletariat
comment by Lumpyproletariat · 2022-06-10T01:07:43.214Z · LW(p) · GW(p)

The history of the world would be different (and a touch shorter) if immediately after the development of the nuclear bomb millions of nuclear armed missiles constructed themselves and launched themselves at targets across the globe.

To date we haven't invented anything that's an existential threat without humans intentionally trying to use it as a weapon and devoting their own resources to making it happen. I think that AI is pretty different.

comment by ambigram · 2022-06-08T15:54:53.714Z · LW(p) · GW(p)

In AI software, we have to define an output type, e.g. a chatbot can generate text but not videos. Doesn't this limit the danger of AIs? For example, if we build a classifier that estimates the probability of a given X-ray being abnormal, we know it can only provide numbers for doctors to take into consideration; it still doesn't have the authority to decide the patient's treatment. This means we can continue working on such software safely?

Replies from: yonatan-cale-1
comment by Yonatan Cale (yonatan-cale-1) · 2022-06-09T10:59:10.254Z · LW(p) · GW(p)

Even if you only work on an AI that tells doctors whether someone has cancer, other people will still build an AGI.

comment by Oleg S. · 2022-06-08T14:47:35.361Z · LW(p) · GW(p)

What are the practical implications of alignment research in a world where AGI is hard?

Imagine we have a good alignment theory but do not have AGI. Can this theory be used to manipulate existing superintelligent systems such as science, the deep state, or the stock market? Does alignment research have any results that can be practically used outside of the AGI field right now?

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-08T22:51:55.411Z · LW(p) · GW(p)

Systems like the ones you mentioned aren't single agents with utility functions we control - they're made up of many humans whose utility functions we can't control since we didn't build them. This means alignment theory is not set up to align or manipulate these systems - it's a very different problem.

There is alignment research that has been or is being performed on current-level AI, however - this is known as prosaic AI alignment. We also have some interpretability results that can be used to understand more about modern, non-AGI AIs. These results can be and have been used outside of AGI, but I'm not sure how practically useful they are right now - someone else might know more. If we had better alignment theory, at least some of it would be useful in aligning narrow AI as well.

comment by MichaelStJules · 2022-06-08T06:40:38.437Z · LW(p) · GW(p)

Is it possible to ensure an AGI effectively acts according to a bounded utility function, with "do nothing" always a safe/decent option?

The goal would be to increase risk aversion enough that practical external deterrence is enough to keep that AGI from killing us all.

Maybe some more hardcoding or hand engineering in the designs?

Replies from: lc, None
comment by lc · 2022-06-08T19:24:20.131Z · LW(p) · GW(p)

Maybe, but we don't have a particularly good understanding of how we would do that. This is sometimes termed "strawberry alignment". Also, again, you have to figure out how to take "strawberry alignment" and use it to solve the problem that someone is eventually gonna do "not strawberry alignment".

comment by [deleted] · 2022-06-11T15:17:20.980Z · LW(p) · GW(p)
comment by justinpombrio · 2022-06-08T04:49:45.833Z · LW(p) · GW(p)

Why won't this alignment idea work?

The idea is to use self-play to train a collection of agents with different goals and intelligence levels, to be co-operative with their equals and compassionate to those weaker than them.

This would be a strong pivotal act that would create a non-corrigible AGI. It would not yield the universe to us; the hope is that it would take the universe for itself and then share a piece of it with us (and with all other agenty life).

The training environment would work like DeepMind's StarCraft AI training, in that there would be a variety of agents playing games against each other, and against older versions of themselves and each other. (Obviously the games would not be zero-sum like StarCraft; that would be maximally stupid.) Within a game, different agents would have different goals. Co-operation could be rewarded by having games that an agent could do better in by co-operating with other agents, and compassion rewarded by having the reward function explicitly include a term for the reward function of other agents, including those with different goals/utilities.
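The proposed reward shaping, where each agent's training signal explicitly includes a term for the other agents' rewards, can be written down in a few lines. The weight `altruism` is a hypothetical knob for illustration, not a parameter from any existing system:

```python
# Toy sketch of the proposed shaping: mix an agent's own task reward
# with the average reward of the other agents in the game. The 0.5
# weight is an arbitrary illustrative choice.
def shaped_reward(own_reward, other_rewards, altruism=0.5):
    if not other_rewards:
        return own_reward
    return own_reward + altruism * sum(other_rewards) / len(other_rewards)

# Under this signal, an action that sacrifices a little of your own
# reward to help everyone else can beat a purely selfish action:
selfish = shaped_reward(1.0, [0.0, 0.0])       # own 1.0, helps nobody
cooperative = shaped_reward(0.8, [1.0, 1.0])   # own 0.8, both others get 1.0
print(selfish, cooperative)
```

Note this only shapes the training signal; the inner-alignment worry in the next paragraph is precisely that the resulting policy may not internalize this as a terminal value.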

Yes yes, I hear you inner-Eliezer, the inner optimizer doesn't optimize for the reward function, inner alignment failure is a thing. Humans failed spectacularly at learning the reward function "reproductive fitness" that they were trained on. On the other hand, humans learned some proxies very well: we're afraid of near-term death, and we don't like being very hungry or thirsty. Likewise, we should expect the AI to learn at least some basic proxies: in all games it would get much less reward if it killed all the other agents or put them all in boxes.

And I'm having trouble imagining them not learning some compassion in this environment. Humans care a bit for other creatures (though not as much if we can't see them), and there was not a lot of direct incentive in our ancestral environment for that. If we greatly magnified and made more direct the reward for helping others, one would expect to get AI that was more compassionate than humans. (And also has a hundred other unpredictable idiosyncratic preferences.)

Put differently, if we met aliens right now, there's a good chance it would be a death sentence, but it feels to me like much better odds than whatever AGI Facebook will build. Why can't we make aliens?

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-08T06:14:23.126Z · LW(p) · GW(p)

So, I can see two ways this training can go.

Firstly, it works like you said. You create AIs that learn the robust value of co-operation as a terminal value.

Secondly, it doesn't. You create AIs that learn the rule "Maximise my own utility function" and implement the behaviour of co-operation as an instrumental goal towards this. This includes maximising other people's utility functions. Hooray!

Unfortunately, this can be decoupled eventually. What if the AI reaches an arbitrary level of capability and decides it can create trillions of agents with reward functions of "1 if the AI does whatever it wants, 0 otherwise" and that overwhelms the real agents? You could try to patch that, but you can't patch an AI smarter than you - it can think of things you can't.

The solution to this is clear - provide the AIs opportunities to defect in the training environment. This works up until the AIs reach a level of capability where they become aware they're in a training environment. Then they're playing a level up - they're not playing the other agents, they're playing you, so they co-operate exactly as you'd expect, right up until they can defect against you in the real world.

If you could figure out a way to look at an AI and tell the difference between co-operation as terminal value vs. co-operation as instrumental value, that would be a big step forward. Being able to actually create those terminal values reliably would be even better.

Replies from: justinpombrio
comment by justinpombrio · 2022-06-09T02:34:42.403Z · LW(p) · GW(p)

The answer to all of these considerations is that we would be relying on the training to develop a (semi-)aligned AI before it realizes it has learned how to manipulate the environment, or breaks free. Once one of those things happens, its inner values are frozen in place, so they had better be good enough at that point.

What I'm not getting is that humans are frequently altruistic, and it seems like if we designed a multi-agent environment entirely around rewarding altruism, we should get at least as much altruism as humans? I should admit that I would consider The Super Happy People [LW · GW] to be a success story...

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-09T03:06:15.787Z · LW(p) · GW(p)

An AI that maximises total group reward function because it cares only for its own reward function, which is defined as "maximise total group reward function" appears aligned right up until it isn't. This appears to be exactly what your environment will create, and what your environment is intended to create. I don't think that would be a sufficient condition for alignment, as mentioned above.

"What I'm not getting, is that humans are frequently altruistic, and it seems like if we designed a multi-agent environment entirely around rewarding altruism, we should get at least as much altruism as humans?"

What is altruistic supposed to mean here? Does it mean something different to "Maximise your own reward function, which is to maximise the group's reward function"? If so, how would the AI learn this? What would the AI do or believe that would prevent it from attempting to hijack its own utility function as above? Currently it feels like "altruism" is a standin for "Actually truly cares about people, no really, no it wouldn't create trillions of other yes-man agents that care only for the original AI's survival because that's not REALLY caring about people", and it needs to be more precise than this.

Replies from: justinpombrio
comment by justinpombrio · 2022-06-09T04:59:54.507Z · LW(p) · GW(p)

An AI that maximises total group reward function because it cares only for its own reward function, which is defined as “maximise total group reward function” appears aligned right up until it isn’t.

The AI does not aim to maximize its reward function! The AI is trained on a reward function, and then (by hypothesis) becomes intelligent enough to act as an inner optimizer that optimizes for heuristics that yielded high reward in its (earlier) training environment. The aim is to produce a training environment such that the heuristics of the inner optimizer tries to maximize tend towards altruism.

What is altruistic supposed to mean here?

What does it mean that humans are altruistic? It's a statement about our own messy utility function, which we (the inner optimizers) try to maximize, that was built from heuristics that worked well in our ancestral environment, like "salt and sugar are great" and "big eyes are adorable". Our altruism is a weird, biased, messy thing: we care more about kittens than pigs because they're cuter, and we care a lot more about things we see than things we don't see.

Likewise, whatever heuristics work well in the AI training environment are likely to be weird and biased. But wouldn't a training environment that is designed to reward altruism be likely to yield a lot of heuristics that work in our favor? "Don't kill all the agents", for example, is a very simple heuristic with very predictable reward, that the AI should learn early and strongly, in the same way that evolution taught humans "be afraid of decomposing human bodies".

You're saying that there's basically no way this AI is going to learn any altruism. But humans did. What is it about this AI training environment that makes it worse than our ancestral environment for learning altruism?

Replies from: Jay Bailey, JBlack
comment by Jay Bailey · 2022-06-09T07:30:00.554Z · LW(p) · GW(p)

So, if I'm understanding correctly - we're talking about an inverse reinforcement learning environment, where the AI doesn't start with a reward function, but rather performs actions, is rewarded accordingly, and develops its own utility function based on those rewards? And the environment rewards the AI in accordance with group success/utility, not just its own, so the AI learns heuristics such as "Helping other agents is good", "Preventing other agents coming to harm is good", and "Definitely don't kill other agents, that's really bad"? If so, that's an interesting idea.
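The setup being described can be sketched in a few lines (everything here - the environment, the payoff table, the action coding - is a hypothetical toy for illustration, not an existing framework):

```python
class GroupRewardEnv:
    """Toy multi-agent environment: each agent picks an action each step,
    and every agent receives the GROUP's total payoff as its reward, so
    selfishly maximising one's own reward coincides with maximising
    group welfare."""

    # Hypothetical action coding: 0 harms another agent, 1 helps the
    # group, 2 benefits only the actor (neutral for the group).
    PAYOFFS = {0: -2, 1: 1, 2: 0}

    def __init__(self, n_agents):
        self.n_agents = n_agents

    def step(self, actions):
        group_payoff = sum(self.PAYOFFS[a] for a in actions)
        # Every agent gets the same reward: the group total.
        return [group_payoff] * self.n_agents

env = GroupRewardEnv(n_agents=3)
rewards = env.step([1, 1, 0])  # two helpers, one harmful action
assert rewards == [0, 0, 0]    # group payoff 1 + 1 - 2 = 0, shared by all
```

The hope discussed above is that an inner optimizer trained in such an environment would internalise group-benefiting heuristics; the open question is whether shared reward actually produces that, or just produces an agent that games the group reward.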

You're totally right about human altruism, which is part of the problem - humans are not aligned to animals in a way that I would be comfortable with if AGI was aligned to us in a similar manner. That said, you are right that the AI training environment would be a lot better than the ancestral one for learning altruism.

I think there are definitely a lot of unanswered questions in this approach, but they're looking a lot less like "problems with the approach itself" and a lot more like "problems that any approach to alignment has to solve in the end", like "How do you validate that the AI has learned what you want it to learn?" and "How will the AI generalise from simulated agents to humans?"

I am still concerned with the "trillions of copies" problem, but this doesn't seem like a problem that is unsolvable in principle, in the same way that, say "Create a prison for a superintelligent AGI that will hold it against its will" seems.

I think this is an interesting approach, but I'm at the limits of my own still-fairly-limited knowledge. Does anyone else see:

  • A reason this line of research would collapse?
  • Some resources from people who have already been thinking about this and made progress on something similar?
comment by JBlack · 2022-06-09T05:54:49.914Z · LW(p) · GW(p)

Some humans learn some altruism. The average amount of altruism learned by humans would end up with all of us dead if altruism occupied the same sort of level within a superintelligent AI's preferences.

Most humans would not think twice about killing most nonhuman species. Picking an animal species at random from a big list I get "Helophorus sibiricus", a species of beetle (there sure are a lot of species of beetle). Any given person might not have any particular antipathy toward such beetles, and might even have some abstract notion that causing their extinction might be a bad thing. Put a nest of them in the way of anything they care about, though, and they'll probably exterminate them. Some few people in modern times might go as far as checking whether they're endangered first, but most won't.

From the point of view of a powerful AI, Earth is infested with many nests of humans that could do damage to important things. At the very least it makes sense to permanently neuter their ability to do that.

Replies from: justinpombrio
comment by justinpombrio · 2022-06-09T06:11:11.104Z · LW(p) · GW(p)

From the point of view of a powerful AI, Earth is infested with many nests of humans that could do damage to important things. At the very least it makes sense to permanently neuter their ability to do that.

That's a positive outcome, as long as said humans aren't unduly harmed, and "doing damage to important things" doesn't include say, eating plants or scaring bunnies by walking by them.

comment by Drew Mirdala · 2022-06-08T03:54:12.584Z · LW(p) · GW(p)

If Aryeh or another editor smarter than me sees fit to delete this question, please do, but I am asking genuinely. I'm a 19-year-old college student studying mathematics, floating around LW for about 6 months.

How does understanding consciousness relate to aligning an AI in terms of difficulty? If a conscious AGI could be created that correlates positive feelings*  with the execution of its utility function, is that not a better world than one with an unconscious AI and no people?

I understand that there are many other technical problems implicit in that question that would have to be reconciled should something like this ever be considered, but given how hopeless the discourse around alignment has been, some other ways to salvage moral good from the situation may be worth considering.

* positive conscious experience (there is likely a better word for this, apologies)

Replies from: ete
comment by plex (ete) · 2022-06-08T15:28:38.178Z · LW(p) · GW(p)

Consciousness is a dissolvable question [LW · GW], I think. People are talking past each other about several distinct real structures: the Qualia Research Institute people are trying to pin down valence, for example; a recent ACX post was talking about the global workspace of awareness; etc.

Alignment seems like a much harder problem, as you're not just dealing with some confused concepts, but trying to build something under challenging conditions (e.g. only one shot at the intelligence explosion).

comment by Michael Wiebe (Macaulay) · 2022-06-08T03:41:33.868Z · LW(p) · GW(p)

Why start the analysis at superhuman AGI? Why not solve the problem of aligning AI for the entire trajectory from current AI to superhuman AGI?

Replies from: conor-sullivan
comment by Lone Pine (conor-sullivan) · 2022-06-08T03:48:26.128Z · LW(p) · GW(p)

EY's concern is that superhuman AI might behave wildly differently from any lesser system, especially if there is a step where the AI replaces itself with something different and more powerful (recursive self-improvement). In machine learning, this is called "out-of-distribution behavior" because being superintelligent is not in the 'distribution' of the training data -- it's something that's never happened before, so we can't plan for it.

That said, there are people working on alignment with current AI, such as Redwood Research Institute, Anthropic AI and others.

Replies from: Heighn, Macaulay
comment by Heighn · 2022-06-09T10:01:51.723Z · LW(p) · GW(p)

That said, there are people working on alignment with current AI, such as Redwood Research Institute, Anthropic AI and others.

Just to add to this list: Aligned AI.

comment by Michael Wiebe (Macaulay) · 2022-06-08T16:30:24.032Z · LW(p) · GW(p)

But isn't it problematic to start the analysis at "superhuman AGI exists"? Then we need to make assumptions about how that AGI came into being. What are those assumptions, and how robust are they?

Replies from: Aidan O'Gara
comment by aogara (Aidan O'Gara) · 2022-06-08T23:27:00.633Z · LW(p) · GW(p)

I strongly agree with this objection. You might be interested in Comprehensive AI Services [LW · GW], a different story of how AI develops that doesn't involve a single superintelligent machine, as well as "Prosaic Alignment" and "The case for aligning narrowly superhuman systems [AF · GW]". Right now, I'm working on language model alignment because it seems like a subfield with immediate problems and solutions that could be relevant if we see extreme growth in AI over the next 5-10 years. 

comment by ekka · 2022-06-07T17:10:06.442Z · LW(p) · GW(p)

What is the theory of change of the AI Safety field, and why do you think it has a high probability of working?

Replies from: RavenclawPrefect, adam-jermyn, ete
comment by Drake Thomas (RavenclawPrefect) · 2022-06-07T18:02:38.608Z · LW(p) · GW(p)

I think a lot of people in AI safety don't think it has a high probability of working (in the sense that the existence of the field caused an aligned AGI to exist where there otherwise wouldn't have been one) - if it turns out that AI alignment is easy and happens by default if people put even a little bit of thought into it, or it's incredibly difficult and nothing short of a massive civilizational effort could save us, then probably the field will end up being useless. But even a 0.1% chance of counterfactually causing aligned AI would be extremely worthwhile!

Theory of change seems like something that varies a lot across different pieces of the field; e.g., Eliezer Yudkowsky's writing about why MIRI's approach to alignment is important seems very different from Chris Olah's discussion [LW · GW] of the future of interpretability. It's definitely an important thing to ask for a given project, but I'm not sure there's a good monolithic answer for everyone working on AI alignment problems.

comment by Adam Jermyn (adam-jermyn) · 2022-06-07T18:27:11.364Z · LW(p) · GW(p)

I think there are many theories of change. One theory is that we want to make sure there are easy, cheap tools for making AI safe so that when someone does make an AGI, they do it safely. Other theories are shaped more like "The AI safety field should develop an AGI as soon as it can be done safely, then use that AGI to solve alignment/perform some pivotal act/etc."

comment by plex (ete) · 2022-06-08T15:23:59.267Z · LW(p) · GW(p)

One of the few paths to victory I see is having a weakly aligned weak AGI which is not capable of recursive self-improvement and using it as a research assistant to help us solve the hard version of alignment. I don't think this has a high probability of working, but it seems probably worth trying.

comment by ekka · 2022-06-07T17:03:14.544Z · LW(p) · GW(p)

Evolution is massively parallelized and occurs in a very complex, interactive, and dynamic environment. Evolution is also patient: it can tolerate high costs such as mass extinction events, and it really doesn't care about the outcome of the process. It's just something that happens and results in the filtering of the most fit genes. The amount of computation that it would take to replicate such complex, interactive, and dynamic environments would be huge. Why should we be confident that it's possible to find an architecture for general intelligence a lot more efficiently than evolution? Wouldn't it also always be more practically expedient to create intelligence that does the exact things we want? Even if we could simulate the evolutionary process, why would we do it?

Replies from: carl-feynman, James_Miller, adam-jermyn
comment by Carl Feynman (carl-feynman) · 2022-06-09T18:03:41.852Z · LW(p) · GW(p)

You ask two interesting questions, with rather separate answers.  I will discuss each in turn.

First, it's plausible to think that "it's possible to find an architecture for general intelligence a lot more efficiently than evolution".  Our process of engineering development is far faster than evolution.  People get good (or bad) ideas, try stuff, copy what works, speak at conferences, publish, make theories, teach undergraduates... and the result is progress in decades instead of millions of years.  We haven't duplicated all the achievements of life yet, but we've made a start, and have exceeded it in many places.  In particular, we've recently made huge progress in AI.  GPT-3 has pretty much duplicated the human language faculty, which takes up roughly 1% of the brain.  And we've duplicated visual object recognition, which takes another few percent.  Those were done without needing evolution, so we probably don't need evolution for the remaining 90% of the mind.

Second, "an intelligence that does the exact things we want" is the ideal that we're aiming for.  Unfortunately it does not seem possible to get to that, currently.  With current technology, what we get is "an intelligence that does approximately what we rewarded it for, plus some other weird stuff we didn't ask for."  It's not obvious, but it is much harder than you think to specify a set of goals that produce acceptable behavior.  And it is even harder (currently impossible) to provide any assurance that an AI will continue to follow those goals when set free to exert power in the world.

comment by James_Miller · 2022-06-08T19:30:14.545Z · LW(p) · GW(p)

While evolution did indeed put a huge amount of effort into creating a chimp's brain, the amount of marginal effort it put into going from a chimp to a human brain was vastly lower.  And the effort of going from a human brain to John von Neumann's brain was tiny.  Consequently, once we have AI at the level of chimp intelligence or human intelligence it might not take much to get to John von Neumann level intelligence.  Very likely, having a million John von Neumann AI brains running at speeds greater than the original would quickly give us a singularity. 

Replies from: ekka
comment by ekka · 2022-06-09T17:22:55.749Z · LW(p) · GW(p)

Going from chimp to human took marginally less effort, but still a huge amount. It's been maybe 5 million years since the last common ancestor of chimps and humans; taking a generation to be about 20 years, that's at least 250,000 generations of a couple of thousand individuals in a complex environment with lots of processes going on. I haven't done the math, but that seems like a massive amount of computation. Going from human to von Neumann still takes a huge search process: if we think of every individual human as an instance of evolution trying for more intelligence, there are almost 8 billion instances being 'tried' right now in a very complex environment. Granted, if humans were to run this process it might take a lot less time: if, say, the most intelligent individuals in every generation were bred and selected, it might take far less time to get human-level intelligence starting from a chimp.
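The back-of-envelope numbers above can be made explicit (the population size is the commenter's rough guess, not a measured figure):

```python
years = 5_000_000            # approx. time since the chimp-human split
years_per_generation = 20
generations = years // years_per_generation
# 250,000 generations

population = 2_000           # rough effective population, as guessed above
individual_lifetimes = generations * population
# ~5e8 candidate "evaluations" run by evolution along this one lineage
```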

Replies from: mruwnik
comment by mruwnik · 2022-06-19T19:58:34.597Z · LW(p) · GW(p)

The thing is, it's not. Evolution is optimizing for the number of descendants. Nothing more. If being more intelligent is the way forward - nice! If having blue hair results in even more children - even better! Intelligence just happens to be what evolution decided for humans. Daisies happened to come up with liking closely cropped grasslands, which is also currently a very good strategy (lawns). The point is that evolution chooses what to try totally at random, and whatever works is good. Even if it causes complexity to be reduced, e.g. snakes losing legs, or cave fish losing eyes.

AI work, on the other hand, is focused on specific outcome spaces, trying things which seem reasonable and avoiding things which have no chance of working. This greatly simplifies matters, as you can massively lower the number of combinations that need to be checked.

comment by Adam Jermyn (adam-jermyn) · 2022-06-07T18:33:21.275Z · LW(p) · GW(p)

We don't need to be confident in this to think that AGI is likely in the next few decades. Extrapolating current compute trends, the available compute may well be enough to replicate such environments.

My guess is that we will try to create intelligence to do the things we want, but we may fail. The hard part of alignment is that succeeding at getting the thing you want from a superhuman AI seems surprisingly hard.

comment by Aryeh Englander (alenglander) · 2022-06-07T06:04:46.870Z · LW(p) · GW(p)

Background material recommendations (more in depth): Please recommend your favorite AGI safety background reading / videos / lectures / etc. For this sub-thread more in-depth recommendations are allowed, including material that requires technical expertise of some sort. (Please specify what kind of background knowledge / expertise is required to understand the material you're recommending.) This is also the place to recommend general resources people can look at if they want to start doing a deeper dive into AGI safety and related topics.

Replies from: ete, alenglander, alex-lszn
comment by plex (ete) · 2022-06-07T11:15:19.754Z · LW(p) · GW(p)

Stampy has the canonical answer to this: I’d like to get deeper into the AI alignment literature. Where should I look?

Feel free to improve the answer, as it's on a wiki. It will be served via a custom interface once that's ready (prototype here).

Replies from: alenglander
comment by Aryeh Englander (alenglander) · 2022-06-07T12:01:05.764Z · LW(p) · GW(p)

This website looks pretty cool! I didn't know about this before.

Replies from: ete
comment by plex (ete) · 2022-06-07T16:09:28.576Z · LW(p) · GW(p)

Thanks! I've spent a lot of the last year and a half working on the wiki infrastructure, we're getting pretty close to being ready to launch to editors in a more serious way.

comment by Alex Lawsen (alex-lszn) · 2022-06-07T10:43:35.902Z · LW(p) · GW(p)

AXRP - Excellent interviews with a variety of researchers. Daniel's substantial own knowledge means that the questions he asks are often excellent, and the technical depth is far better than anything else available in audio, given how poorly automatic readers handle papers or Alignment Forum posts containing actual maths.

comment by Aryeh Englander (alenglander) · 2022-06-07T05:52:24.423Z · LW(p) · GW(p)

Background material recommendations (popular-level audience, very short time commitment): Please recommend your favorite basic AGI safety background reading / videos / lectures / etc. For this sub-thread please only recommend background material suitable for complete newcomers to the field, with a time commitment of at most 1-2 hours. Extra bonus for explaining why you particularly like your suggestion over other potential suggestions, and/or for elaborating on which audiences might benefit most from different suggestions.

Replies from: alex-lszn, Chris_Leong, ete, alex-lszn
comment by Alex Lawsen (alex-lszn) · 2022-06-07T10:37:10.558Z · LW(p) · GW(p)

Rob Miles's YouTube channel, see this intro. Also his video on the stop button problem for Computerphile.
- Easily accessible, entertaining, videos are low cost for many people to watch, and they often end up watching several.


comment by Chris_Leong · 2022-06-07T12:46:29.571Z · LW(p) · GW(p)

I really enjoyed Wait but Why.

comment by plex (ete) · 2022-06-07T11:24:13.701Z · LW(p) · GW(p)

Stampy has an initial draft of an answer, but it could be improved.

Replies from: johnlawrenceaspden
comment by johnlawrenceaspden · 2022-06-09T16:17:31.469Z · LW(p) · GW(p)

I can't see anything on this wiki except about six questions with no answers. I assume it's meant to be more than that?

Replies from: ete
comment by plex (ete) · 2022-06-09T18:18:35.648Z · LW(p) · GW(p)

The wiki is primarily for editors not readers, so the home page encourages people to write answers. There is a link to Browse FAQ in the sidebar for now, but we'll be replacing that with a custom UI soon.

comment by Alex Lawsen (alex-lszn) · 2022-06-07T10:40:59.701Z · LW(p) · GW(p)

This [LW · GW] talk from Joe Carlsmith. - Hits at several of the key ideas really directly given the time and technical background constraints. Like Rob's videos, implies an obvious next step for people interested in learning more, or who are suspicious of one of the claims (reading Joe's actual report, maybe even the extensive discussion of it on here).

comment by James_Miller · 2022-06-08T19:13:48.491Z · LW(p) · GW(p)

What does quantum immortality look like if creating an aligned AI is possible, but it is extremely unlikely that humanity will do this?  In the tiny part of the multiverse in which humanity survives, are we mostly better off having survived?

Replies from: Valentine
comment by Valentine · 2022-06-09T04:43:18.165Z · LW(p) · GW(p)

I haven't thought very seriously about this for many years, but FWIW: I think quantum immortality only applies to your subjectivity, which itself isn't terribly well-defined. QI is basically suggesting that you'll just keep happening to find yourself in scenarios where you're still around. It doesn't say anything about in what condition you'll still be around, or whether there will be others around along with you.

One weird move that I think would totally fit quantum immortality is if you shift your subjectively experienced identity to the AGI even as all of humanity, including your old body, dies.

comment by Ericf · 2022-06-07T13:01:04.937Z · LW(p) · GW(p)

Is there support for the extrapolation from "AlphaGo optimized over a simple ruleset in a day" to "a future AI will be able to optimize over the physical world (at all, or at a sufficiently fast speed to outpace people)"?

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-07T15:23:59.972Z · LW(p) · GW(p)

Yes. This is known as the question of "takeoff speed" - how fast will an AI of near-human intellect reach superintelligence? "Fast takeoff" suggests that an AI will be able to recursively improve itself - i.e, if an AI is better at coding AI than we are, it can make a better AI, which can make a better AI, and so on, and this could happen across weeks, days, or even hours.

"Slow takeoff" suggests this is not possible or likely, and that we're likely to take years or decades to move from near-human AGI to smart human AGI to superhuman AGI to superintelligence. (where I use "superintelligence" here to mean "Chimps are to humans are humans are to it")

I've heard people make the argument that AGI takeoff will look like AlphaZero - we spent thousands of years learning to be better at Go, and AlphaZero trained from nothing to superhuman in a matter of days. If this could generalise across the physical world, it would indicate a fast takeoff being more likely.
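One way to see the fast/slow distinction is as a toy growth model, where each step's improvement is proportional to current capability. This is purely illustrative - the functional form and the rates are assumptions, not predictions:

```python
def takeoff(capability, rate, steps):
    """Toy recursive self-improvement: each step, the AI improves itself
    in proportion to how capable it already is (better AI -> better
    AI-building)."""
    history = [capability]
    for _ in range(steps):
        capability += rate * capability
        history.append(capability)
    return history

slow = takeoff(1.0, rate=0.05, steps=20)  # modest compounding
fast = takeoff(1.0, rate=1.0, steps=20)   # doubles every step
# After 20 steps, "fast" has grown 2**20-fold (about a million times),
# while "slow" has grown only ~2.65x - the same mechanism, wildly
# different outcomes depending on the self-improvement rate.
```

The real debate is over which regime the first AGI lands in, and whether the rate itself increases as capability grows.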

Replies from: Ericf
comment by Ericf · 2022-06-10T06:00:41.904Z · LW(p) · GW(p)

My question is more pre-takeoff. Specifically, how does an AI that is making additional AIs even know when it has one "better" than itself? What does it even mean to be "better" at making an AI? And none of that addresses the completely separate question of an AI picking which task to do. A program that recursively writes programs that write programs that are faster/smaller/some other objective metric makes sense. But what good is having the smallest, fastest, small+fast program writer possible?

Replies from: Jay Bailey
comment by Jay Bailey · 2022-06-10T08:42:59.090Z · LW(p) · GW(p)

"Better" in this case depends on the AI's original utility function. An AI is "better" than its predecessor if it can score more highly on that utility function. As you said, better is arbitrary, but "better with respect to a particular goal" isn't.

In addition, things like speed, efficiency, correctness, having more copies of yourself, and cognitive capacity are all useful for a wide variety of goals - an AI that is ten times as powerful, ten times faster, significantly more accurate, or capable of understanding more sophisticated concepts as compared to another AI with the same goal is likely to be better at achieving it. This is true for almost any goal you can imagine, whether it be "Create paperclips" or "Maximise human flourishing".
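The idea that "better" is judged relative to the original fixed goal can be sketched directly (the policies and the utility function here are hypothetical stand-ins):

```python
def utility(outcome):
    # Stand-in for the original AI's fixed goal, e.g. paperclips produced.
    return outcome["paperclips"]

def evaluate(policy, episodes=100):
    """Score a candidate successor by the ORIGINAL utility function,
    averaged over simulated rollouts."""
    return sum(utility(policy()) for _ in range(episodes)) / episodes

# Hypothetical policies: the current AI and a candidate successor.
def current():
    return {"paperclips": 10}

def successor():
    return {"paperclips": 12}

best = max([current, successor], key=evaluate)
assert best is successor  # "better" = higher score on the same fixed goal
```

Of course, this sketch assumes the successor can be evaluated cheaply in simulation - which is exactly the point the follow-up question presses on.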

Replies from: Ericf
comment by Ericf · 2022-06-13T04:12:46.102Z · LW(p) · GW(p)

So, again, how can an AI know which child AI will best fulfill its utility function without being able to observe them in action, which for any sort of long-term goal is necessarily going to be slow? If the goal is "make the most paperclips per day", the minimum cycle time to evaluate the performance of a child AI is one day - maybe more if it wants an n > 1 sample from each AI. And where is this AI getting the resources to run multiple AIs' worth of systems? Also, if the goal is something complex like "acquire power and/or money", you now have multiple child AIs competing with each other to better extract resources, which is a different environment from a single AI working to maximize its power/$.

comment by Aryeh Englander (alenglander) · 2022-06-07T19:52:28.884Z · LW(p) · GW(p)

One of the most common proposals I see people raise (once they understand the core issues) is some form of, "can't we just use some form of slightly-weaker safe AI to augment human capabilities and allow us to bootstrap to / monitor / understand the more advanced versions?" And in fact lots of AI safety agendas do propose something along these lines. How would you best explain to a newcomer why Eliezer and others think this will not work? How would you explain the key cruxes that make Eliezer et al think nothing along these lines will work, while others think it's more promising?

Replies from: AprilSR
comment by AprilSR · 2022-06-07T20:59:08.669Z · LW(p) · GW(p)

Eliezer's argument from the recent post:

The reason why nobody in this community has successfully named a 'pivotal weak act' where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later - and yet also we can't just go do that right now and need to wait on AI - is that nothing like that exists. There's no reason why it should exist. There is not some elaborate clever reason why it exists but nobody can see it. It takes a lot of power to do something to the current world that prevents any other AGI from coming into existence; nothing which can do that is passively safe in virtue of its weakness. If you can't solve the problem right now (which you can't, because you're opposed to other actors who don't want to be solved and those actors are on roughly the same level as you) then you are resorting to some cognitive system that can do things you could not figure out how to do yourself, that you were not close to figuring out because you are not close to being able to, for example, burn all GPUs. Burning all GPUs would actually stop Facebook AI Research from destroying the world six months later; weaksauce Overton-abiding stuff about 'improving public epistemology by setting GPT-4 loose on Twitter to provide scientifically literate arguments about everything' will be cool but will not actually prevent Facebook AI Research from destroying the world six months later, or some eager open-source collaborative from destroying the world a year later if you manage to stop FAIR specifically. There are no pivotal weak acts.

comment by Kerrigan · 2022-12-10T06:06:43.283Z · LW(p) · GW(p)

What did smart people in the eras before LessWrong say about the alignment problem?