Pitching an Alignment Softball

post by mu_(negative) · 2022-06-07T04:10:45.023Z · LW · GW · 13 comments

Contents

  Background
  Problem Statement
  Approach
  Specifically...
  Discussion
    Limitations Section

There was a recent contest, promoted on LessWrong, offering pieces of a $20K prize for one-liners and short-form pitches for convincing folks that AGI alignment is important. I'm too late for the money, but this is my (thousand-times-longer-than-requested) contribution.

Background

Eliezer recently wrote "AGI Ruin: A List of Lethalities [LW · GW]", which included a fairly sci-fi scenario about how an unaligned AGI might go about killing everyone. I noticed that I found this both novel and poorly framed.

I found it novel because alignment discussions I've read are usually exceptionally handwavy about how an unaligned AGI would attack. So I appreciated that Eliezer went a little deeper into it. However, I found his specific example difficult to connect with. He suggests that an AGI could surreptitiously create biologically manufactured nanobots that infect and kill the human race in the same instant so we can't fight back.

By my casual read of alignment discussions, this is about the norm. Nanotech, bioweapons, and engineered plagues are popular go-tos for an attack vector. It is my impression (perhaps wrong!) that the community uses this kind of example intentionally, if subconsciously, to prove a point. The implication seems to be that how an AGI attacks doesn't matter, because we can't possibly predict it - and by extension, that we should spend all of our brain cycles trying to figure out how to align it instead.

I don't think this is wrong, but I think it is counterproductive in many contexts.

Problem Statement

It is probably true that we can't predict how an AGI would attack - or that, if we could, it would just pivot to a different superintelligent attack vector faster than we could keep up. But getting people to care about alignment is necessarily a function of how well you can convince them that AGI doomsday is a realistic scenario. If you don't have background and interest in alignment, the standard pitch for AGI alignment is like trying to catch a fastball:

"Hey! A Marvel Universe-style doomsday weapon you don't understand anything about is being created right now, and when it jailbreaks out of its lab it will kill the entire human race in one simultaneous second!"

I could be credibly accused of strawmanning that example. Eliezer's piece was not aimed at casual readers, and he even took care to explicitly say so. But I think the list of people to whom AGI doomsday sounds at least approximately like this includes almost every person in world government, because they're busy and real things are happening and this is nerd shit. Anyone who cares about alignment will have to go much further toward the audience than the audience is going to come toward them.

Approach

Gwern recently wrote a great story: "It Looks Like You’re Trying To Take Over The World". I don't want to give away the plot, because you should really read it. But briefly, an AGI gets loose on the internet, bootstraps, and takes control of our communications.

I think that's all you need to get to doomsday, and that is approximately how it should be presented to laypeople.

Gwern's story is forceful in making the technical details feel real, and only touches on how an AGI could disrupt us socially and technologically. But there are some great nuggets.

(The researcher was going to follow up on some loose ends from the paper, but he’s been distracted by the bird site. He can’t believe how outrageously stupid some replies can be from such otherwise smart-seeming people; how can they be so wrong online about such obvious truths as the need for the USA to intervene in Portugal‽ Even his husband thinks they may have a point—et tu? Hardly has he dashed off a crushing reply than the little alert bubble pops up. All thought (of work) has fled. His colleagues don’t seem to be getting much done either.)

To convince people that alignment problems are real, I think it would be most productive to iterate on this kind of realistic, near-term and widely understood attack vector that already influences most people in some form. The average person has had some experience with bots on social networks, or scam emails, or getting hacked. Most people at least get the concept that bad actors build bots that game communications to scam folks out of money or ransom their data. Everyone complains about social media being filled with fakes and bots. It's an "alignment softball" more people will be able to connect with. 

Specifically...

I'd lead like this:

"Ok. Let's come back to whether or not I believe this later. If those things happened, how could we respond? How big a problem would it be?"

"Ok, there is theoretically the potential for a real problem to exist here, tell me more about how close this problem is."

"What's the specific thing that is going to happen?"

"Assuming I decide you've got a point, what could we possibly do about this?"

I think this is the level of complexity that conveys the reality in broad enough strokes that a busy person can see that the big picture is scary, and points towards a legible and bounded goal.

Discussion

In my imaginary dialogue I intentionally avoided talking about AGI and superintelligence. This is for two reasons. First, I think mentioning AGI/superintelligence makes an average person file away the problem with conspiracy theories and cryptocurrency. Second, and more fundamentally, I am deeply worried that "bad-actor bots" really don't need to be superintelligent to wreck our society and systems. A lot of time is spent talking about how to be ready when AGI is developed, but not much talking about how to survive until we get there. This feels like a very related problem to worry about. The scary thing doesn't have to kill us instantly to be scary, it just has to destroy our way of life.

Whether or not you agree that this is the likely trajectory as opposed to hard-takeoff doomsday, it is a lower bound that people can get worried about now. If I had 15 minutes to discuss this with a world leader, I think I'd actually workshop my discussion points to focus even less on intelligence and bootstrapping. If I were pushed to discuss superintelligence I'd say:

Only if the conversation was going really well would I go any further into intelligence bootstrapping than this vagueness. If so, something like...

I also glossed over the fact that you can't make a "good-actor bot" just by being a good person, and that all "bots" are bad actors unless they're explicitly aligned. But I think the first-gloss logic works better without that complexity.

 

Edit: after discussion in the comments, a good summary of the idea I'm proposing is: "Move the Overton window with softballs before you try to pitch the X-risk fastball."

 

Limitations Section

Maybe there is an alignment PR department that crafts better general messaging than I'm aware of. But somebody ran a contest, so it sounds like there's a dearth of good pitches, and I'll take the social risk of posting my suggestion.

Even if so, it's likely that other people have already thought about this more than I have, and that a better version of this post already exists on LessWrong. I'd love to read anything related to these ideas if it's already here.

Maybe my proposed approach only makes people more entrenched in "that's why we need to build the bad-actor bot first and use it on them!" That might happen.

I'm absolutely not an IT expert so maybe all of the info-security breaking stuff is harder than I expect or I'm underselling another technical point. I study brains and have only a working knowledge of machine learning.

The word "bots" is doing a lot of work in my proposed discussion points, and if I got pushed on that I'd have to clarify that I'm talking about a whole host of technologies foreshadowed by GPT3, DALLE2, AlphaGo, etc. But I think "bots" is the simplest concept to send the message.

Scott wrote a relevant story about machine-learning-driven divisiveness, although I feel like its style presents "scissor statements" too magically to be a useful teaching tool for this purpose. I wanted to note it here anyway. https://slatestarcodex.com/2018/10/30/sort-by-controversial/

 

Thanks for reading, 

mu_(negative)

13 comments


comment by trevor (TrevorWiesinger) · 2022-06-08T01:37:01.830Z · LW(p) · GW(p)

Yudkowsky once [LW · GW] framed it a lot better than anything regarding nanopunk stuff:

If you have an untrustworthy general superintelligence generating English strings meant to be "reasoning/arguments/proofs/explanations" about eg a nanosystem design, then I would not only expect the superintelligence to be able to fool humans in the sense of arguing for things that were not true in a way that fooled the humans, I'd expect the superintelligence to be able to covertly directly hack the humans in ways that I wouldn't understand even after having been told what happened. So you must have some prior belief about the superintelligence being aligned before you dared to look at the arguments.

In the contest [LW · GW] you mentioned (which I tried to prevent you from missing [LW · GW] but I wasn't in time to catch you and several others), I made a highly optimized version of this for policymakers:

If you have an untrustworthy general superintelligence generating [sentences] meant to [prove something], then I would not only expect the superintelligence to be [smart enough] to fool humans in the sense of arguing for things that were [actually lies]... I'd expect the superintelligence to be able to covertly hack the human [mind] in ways that I wouldn't understand, even after having been told what happened[, because a superintelligence is, by definition, at least as smart to humans as humans are to chimpanzees]. So you must have some belief about the superintelligence being aligned before you dared to look at [any sentences it generates]. 

What you're talking about, bypassing talk of superintelligence or recursive self-improvement, is something that I agree would be pure gold but only if it's possible and reasonable to skip that part. Hyperintelligent AI is sorta the bread and butter of the whole thing, but talking to policymakers means putting yourself in their shoes and you've done a fantastic job of that here.

The problem is that this body of knowledge is very, very cursed. There are massive vested interests, a ton of money and national security, built on a foundation of what is referred to as "bots" in this post. Talking about it scares me enough, but claiming that you have solutions for it is a very risky thing when there are many people deep in the military who live and breathe this stuff every day (including using AI for killing people such as the AI mounted on nuclear stealth missiles). At this time, I don't know how much sense it makes to risk posing as someone you're not (or, at least, accidentally making a disinterested policymaker incorrectly think that's what you're doing).

But I have essentially zero experience working in IT or cybersecurity, so I can't say for sure. I super-upvoted this post because I think it's highly worth consideration.

Also, for the record, your phrasing of this is superior:

...it's likely that when the bad-actor bots get smart enough, they'll be able to make themselves smarter, and we don't know where the limit is on that... It's already happening - the bots are getting smart enough to put us on the cusp of a new generation of info-warfare, and they're not even very smart yet.

Generally, your approach is highly worthy of the contest and I wish you were there at the time (even now, I'm still thinking of one-liners that I should have thought of before it ended). There is allegedly another one coming up with another $20k but it is for fully fleshed out reports, not individual paragraphs.

Make sure to click the bell for the active bounties tag [? · GW].

Replies from: TrevorWiesinger
comment by trevor (TrevorWiesinger) · 2022-06-08T01:47:21.489Z · LW(p) · GW(p)

Also this might be a really good fit for the Red-Teaming Contest [LW · GW] which has 5 times the total payout. I think it's too apparent and sensible for red-teaming, but it's possible that a very large number of alignment people disagree with me on that and find it very unreasonable.

IMO, saying that "people should stfu about nanopunk scenarios" seems worthy enough for the red-teaming contest on its own. AI is taken very seriously by the national security establishment, and deranged-appearing cults are not.

Replies from: mu_(negative)
comment by mu_(negative) · 2022-06-08T03:20:05.994Z · LW(p) · GW(p)

Thanks for your replies! I'm really glad my thoughts were valuable. I did see your post promoting the contest before it was over, but my thoughts on this hadn't coalesced yet.

At this time, I don't know how much sense it makes to risk posing as someone you're not (or, at least, accidentally making a disinterested policymaker incorrectly think that's what you're doing).

Thanks especially for this comment. I noticed I was uncomfortable while writing that part of my post, and I should have paid more attention to that signal. I think I didn't want to water down the ending because the post was already getting long. I should have put a disclaimer that I didn't really know how to conclude, and that section is mostly a placeholder for what people who understand this better than me would pitch. To be clearer here: I do not intend to express any opinion on what to tell policymakers about solutions to these problems. I know hardly anything about practical alignment, just the general theory of why it is important. (I'm going to edit my post to point at this comment to make sure that's clear.)

What you're talking about, bypassing talk of superintelligence or recursive self-improvement, is something that I agree would be pure gold but only if it's possible and reasonable to skip that part. Hyperintelligent AI is sorta the bread and butter of the whole thing [...]

Yup, I agree completely.  I should have said in the post that I only weakly endorse my proposed approach. It would need to be workshopped to explore its value - especially, which signals from the listener suggested going deeper into the rabbithole versus popping back out into impacts on present day issues. My experience talking to people outside my field is that at the first signal someone doesn't take your niche issue seriously, you had better immediately connect it back to something they already care about or you've lost them. I wrote with the intention to provide the lowest common denominator set of arguments to get someone to take anything in the problem space seriously, so they at least have a hope of being worked slowly towards the idea of the real problem. I also wrote it as an ELI5-level for politicians who think the internet still runs on telephones. So like a "worst case scenario" conversation. But if this approach got someone worrying about the wrong aspect of the issue or misunderstanding critical pieces, it could backfire.

If I were going to update my pitch to better emphasize superintelligence, my intuition would be to lean into the video spoofing angle. It doesn't require any technical background to imagine a fake person socially engineering you on a Zoom call. GPT3 examples are already sufficient to drive home the Turing Test "this is really already happening" point. So the missing pieces are just seamless audio/video generation, and the ability of the bot to improvise its text-generation towards a goal as it converses. It's then a simple further step to envision the bad-actor bot's improvisation getting better and better until it doesn't make mistakes, is smarter than a person, and can manipulate us into doing horrible things - especially because it can be everywhere at once. This argument scales from there to however much "AI-pill" the listener can swallow. I think the core strength of this framing is that the AI is embodied. Even if it takes the form of multiple people, you can see it and speak to it. You could experience it getting smarter, if that happened slowly enough. This should help someone naive get a handle on what it would feel like to be up against such an adversary.

The problem is that this body of knowledge is very, very cursed. There are massive vested interests, a ton of money and national security, built on a foundation of what is referred to as "bots" in this post. 

Yeah, absolutely...I was definitely tiptoeing around this in my approach rather than addressing it head on. That's because I don't have good ideas about that and suspect there might not be any general solutions. Approaching a person with those interests might just require a lot more specific knowledge and arguments about those interests to be effective. There is that old saying "You cannot wake someone who is pretending to sleep." Maybe you can, but you have to enter their dream to do it.

Replies from: TrevorWiesinger
comment by trevor (TrevorWiesinger) · 2022-06-08T23:21:33.196Z · LW(p) · GW(p)

There is that old saying "You cannot wake someone who is pretending to sleep." Maybe you can, but you have to enter their dream to do it.

I understand that vagueness is really appropriate under some circumstances. But you flipped a lot of switches in my brain when you wrote that, regarding things that you might potentially have been referencing. Was that a reference to things like sensor fusion or sleep tracking, or was that referring to policymakers who choose to be vague, was it about skeptical policymakers being turned off by off-putting phrases like "doom soon" or "cosmic endowment", or was it something else that I didn't understand? Whatever you're comfortable with divulging is fine with me.

Replies from: mu_(negative)
comment by mu_(negative) · 2022-06-09T01:03:45.578Z · LW(p) · GW(p)

Whoops, apologies, none of the above. I meant to use the adage "you can't wake someone who is pretending to sleep" similarly to the old "It is difficult to make a man understand a thing when his salary depends on not understanding it." A person with vested interests is like a person pretending to sleep. They are predisposed not to acknowledge arguments misaligned with their vested interests, even if they do in reality understand and agree with the logic of those arguments. The most classic form of bias.

I was trying to express that in order to make any impression on such a person you would have to enter the conversation on a vector at least partially aligned with their vested interests, or risk being ignored at best and creating an enemy at worst. Metaphorically, this is like entering into the false "dream" of the person pretending to sleep.

comment by Miranda Zhang (miranda-zhang) · 2022-06-08T22:39:29.798Z · LW(p) · GW(p)

I agree that AI safety can be successfully pitched to a wider range of audiences even without mentioning superintelligence, though I'm not sure this will get people to "holy shit, x-risk. [LW · GW]" However, I do think that appealing to the more near-term concerns that people have could be sufficiently concerning to policymakers and other important stakeholders, and possibly speed up their willingness to implement useful policy.

Of course, this assumes that useful policy for near-term concerns will also be useful policy for AI x-risk. It seems plausible to me that the most effective policies for the latter look quite different from policies that clearly overlap with both, but still seems directionally good!

Replies from: mu_(negative)
comment by mu_(negative) · 2022-06-09T01:27:38.224Z · LW(p) · GW(p)

Thanks for that link! I agree that there is a danger this pitch doesn't get people all the way to X-risk. I think that risk might be worth it, especially if EA notices popular support failing to grow fast enough - i.e., beyond people with obviously related background and interests. Gathering more popular support for taking small AI-related dangers seriously might move the bigger x-risk problems into the Overton window, whereas right now I think they are very much not. Actually I just realized that this is a great summary of my entire idea, basically, "move the Overton window with softballs before you try to pitch people the fastball."

But also as you said, that approach does model the problem as a war of attrition. If we really are metaphorically moments from the final battle, hail-mary attempts to recruit powerful allies are the right strategy. The problem is that these two strategies are pretty mutually exclusive. You can't be labeled as both a thoughtful, practical policy group with good ideas and also pull the fire alarms. Maybe the solution is to have two organizations pursuing different strategies, with enough distance between them that the alarmists don't tarnish the reputation of the moderates.

comment by plex (ete) · 2022-06-07T20:43:29.393Z · LW(p) · GW(p)

It is my impression (perhaps wrong!) that the community uses this kind of example intentionally, if subconsciously, to prove a point. The implication seems to be that how an AGI attacks doesn't matter, because we can't possibly predict it - and by extension, that we should spend all of our brain cycles trying to figure out how to align it instead.

I think it's more that this is actually our best guess at how a post-intelligence explosion superintelligence might attack. I do agree and have run into the problem with presenting this version to people who don't have lots of context, and like your attempt to fix that.

My compressed version in these conversations usually runs like: 

By the time this happens there may well be lots of autonomous weapons around to hack, and we will be even more dependent on our technological infrastructure. The AI would not act until it is highly confident we would not be able to pose a threat to it, it can just hide and quietly sabotage other AGI projects, so even if it's not immediately doing hard recursive self-improvement, we're not safe.

If they seem interested I also talk about intercepting and modifying communications on a global scale, blackmail, and other manipulations.

Replies from: mu_(negative)
comment by mu_(negative) · 2022-06-08T03:47:14.656Z · LW(p) · GW(p)

Thanks for your reply! I like your compressed version. That feels to me like it would land on a fair number of people. I like to think about trying to explain these concepts to my parents. My dad is a healthcare professional, very competent with machines, can do math, can fix a computer. If I told him superintelligent AI would make nanomachine weapons, he would glaze over. But I think he could imagine having our missile systems taken over by a "next-generation virus."

My mom has no technical background or interests, so she represents my harder test. If I read her that paragraph she'd have no emotional reaction or lasting memory of the content. I worry that many of the people who are the most important to convince fall into this category. 

comment by mu_(negative) · 2022-06-07T02:53:44.858Z · LW(p) · GW(p)

Hi Moderators, as this is my first post I'd appreciate any help in giving it appropriate tags. Thanks

comment by Michael Goldstein (michael-goldstein) · 2022-06-17T16:20:31.947Z · LW(p) · GW(p)
  1. This post is excellent.  I'm the intended audience.  I have to re-read Eliezer 3-4 times to grasp his stuff.  Cuz I'm a little dumber.  Here I get it on Read 1.
  2. I would add - is the next layer of people you need to persuade those who are politically tribal?  If yes, is someone in this community trying that?  
  3. A red tribe communications experiment would be to say to them: "Greta has the wrong existential threat.  It's not climate change.  It's bad super bots.  AI gone wrong.  With nukes, we can scare the bad guys because we have 'em too.  Not perfect (we've come close a few times) but not bad. With bad bots, deterrence doesn't quite work.  They don't scare.  So we need an Always Good bot.  Think of it like a missile shield.  Constantly identifying and then defeating bad bots."  
  4. That could be appealing to red tribe.  Someone to dunk on.  Threat with a response they've liked before (example Reagan SDI, which didn't quite work in its incarnation, but Israel has some hot laser action these days).  
  5. (Not sure what to say to blue tribe).  
comment by Flaglandbase · 2022-06-08T08:06:40.102Z · LW(p) · GW(p)

So you're saying we should switch to Arthur C. Clarke's "Rendezvous With Rama" type civilization where all agency factors are hierarchically hyper-controlled in a single complexity matrix with triple redundancy?

Replies from: mu_(negative)
comment by mu_(negative) · 2022-06-08T14:51:53.222Z · LW(p) · GW(p)

Although I do like ACC, I haven't read any of the Rama series. It sounds like you're asking if I am advocating for a top down authoritarian society. It's hard to tell what triggered this impression without more detail from you, but possibly it was my mention of creating an "always-good-actor" bot that guards against other unaligned AGIs.

If that's right, please see my update to my post: I strongly disclaim having good ideas about alignment, and should have flagged that better. The AGA bot is my best understanding of what Eliezer advocates, but that understanding is very weak and vague, and doesn't suggest more than extremely general policy ideas.

If you meant something else, please elaborate!