Convincing All Capability Researchers

post by Logan Riggs (elriggs) · 2022-04-08T17:40:25.488Z · LW · GW · 70 comments

Contents

  Next Steps to seriously consider this proposal
    Specifying "Solving Alignment"
    Listing 1000 AI researcher intellectuals
    Actually Hiring People/Creating a Company to Do This
  Timelines Argument
  This Plan May Backfire and Increase Capabilities

This post is heavily based on this excellent comment [LW(p) · GW(p)] (the author's name is not relevant)

Give the world's thousand most respected AI researchers $1M each to spend 3 months working on AI alignment, with an extra $100M if by the end they can propose a solution alignment researchers can't shoot down. I promise you that other than like 20 industry researchers who are paid silly amounts, every one of them would take the million. They probably won't make any progress, but from then on when others ask them whether they think alignment is a real unsolved problem, they will be way more likely to say yes. That only costs you a billion dollars! I literally think I could get someone reading this the money to do this (at least at an initially moderate scale) - all it needs is a competent person to step up.

People usually object to paying researchers large sums of money to work on alignment because they don't expect them to produce any good work (mostly because alignment is very hard to specify; see below). This is a feature, not a bug.

Being able to say "the 1000 smartest people working in AI couldn't make headway on alignment in 3 months, even when they were each paid $1 million and a solution would have been awarded $100 million" would go a long way toward persuading existing researchers that this is a genuinely hard problem.
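
For a rough sense of the budget, here is a minimal sketch in Python (the $1M stipend and $100M prize are the figures from the quoted proposal; the smaller participant counts are hypothetical pilot scales, not something the proposal specifies):

```python
# Illustrative budget sketch only; figures come from the quoted proposal,
# and the smaller pilot sizes are hypothetical.
def program_cost(n_researchers: int, stipend: int = 1_000_000, prize: int = 100_000_000) -> int:
    """Total cost: a stipend for every participant plus one grand prize if a solution is accepted."""
    return n_researchers * stipend + prize

for n in (50, 100, 1000):
    print(f"{n:>4} researchers: ${program_cost(n):,}")
# 1000 researchers -> $1,100,000,000, i.e. roughly the "billion dollars" in the quote.
```

At full scale the stipends dominate: roughly $1B of the ~$1.1B is paid out whether or not anyone claims the prize.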

Would this really go a long way toward convincing those at major AI labs that alignment is hard? We could actually ask people working at these labs whether a lack of progress over those 3 months would change their minds.

Another problem is looooong timelines. Researchers can agree the problem is hard, but their timelines may be too long for it to seem worth working on now. I can think of a couple of counter-arguments that may work (but it would be better to actually test them with real people).

More generally, having a set of people working on apologetics aimed at both US and Chinese AI researchers is potentially extremely high-impact. If you're interested and might be able to donate, DM me for a call.

Next Steps to seriously consider this proposal

Specifying "Solving Alignment"

Part of the problem of alignment is that we don't know the correct framework for specifying it. The quoted text suggests the criterion for a solution be "a solution alignment researchers can't shoot down," which side-steps this issue; however, specifying the problem in as fine-grained detail as possible would be extremely useful for communicating it to these researchers.

One failure mode would be researchers taking the money, not getting work done, and then arguing that the problem wasn't specified well enough to make any meaningful progress, which would limit how persuasive this stunt could be. Documents like ELK are more useful specifications that capture the problem to various degrees, and I wish we had more problem statements like that.

Listing 1000 AI researcher intellectuals

The initial idea is to hire the 1000 best AI researchers to work on the problem, not because we expect them to solve it, but because all of them failing would demonstrate how hard the problem is.

There are a few different proxies we can use, such as citation counts and the rosters of top researchers at AI labs. So far I've got

Convincing the CCP and the researchers it backs is a blank spot in my map; if anyone knows anything, please comment or message me for a video call.

Actually Hiring People/Creating a Company to Do This

We would need competent people to work on specifying the problem, doing outreach and selecting whom to pay, keeping up with the researchers (if you're paying them $1 million, you can also have 1-on-1 calls, which would be worth making the most of), and reviewing the actual work produced (which could be done by the community, independent researchers, or orgs).

Timelines Argument

It was argued that these plans are only relevant on 15+ year timelines, but huge shifts in social and cultural norms have happened within one-year periods. I'm not giving examples, but they went from, say, 50 to 100 to 70 within a few months, which may be significantly different from the situation with alignment.

This Plan May Backfire and Increase Capabilities

One way this plan can backfire is by intensifying race dynamics, such that everyone wants to create AGI first, or at least more people than are already trying to right now.

I think this is a real possibility, and whoever reaches out to talk to these top researchers when offering to pay them the large sums of money should take it into account.

70 comments

Comments sorted by top scores.

comment by Raemon · 2022-04-08T18:41:47.064Z · LW(p) · GW(p)

I do like the core concept here, but I think for it to work you need to have a pretty well specified problem that people can't weasel out of. (I expect the default result of this to be "1000 researchers all come up with reasons they think they've solved alignment, without really understanding what was supposed to be hard in the first place.")

You touch upon this in your post but I think it's kinda the main blocker.

I do think it might be a surmountable obstacle though.

Replies from: steve2152, Linda Linsefors
comment by Steven Byrnes (steve2152) · 2022-04-08T21:46:28.809Z · LW(p) · GW(p)

I agree. I imagine a disagreement where the person says "you can't prove my proposal won't work" and the alignment judge says "you can't prove your proposal will work", and then there's some nasty, very-hard-to-resolve debate about whether AGI is dangerous by default or not, involving demands for more detail, and then the person says "I can't provide more detail because we don't have AGI yet", etc.

I think of the disagreement between Paul and Eliezer about whether IDA was a promising path to safe AGI (back when Paul was more optimistic about it); both parties there were exceptionally smart and knowledgeable about alignment, and they still couldn't reconcile.

Giving people a more narrow problem (e.g. ELK) would help in some ways, but there could still be disagreement over whether solving that particular problem in advance (or at all) is in fact necessary to avert AGI catastrophe.

Replies from: Raemon
comment by Raemon · 2022-04-08T23:36:54.835Z · LW(p) · GW(p)

I've seen proposals of the form "Eliezer and Paul both have to agree the researchers have solved alignment to get the grand prize", which seems better than not-that, but, still seems insufficiently legibly fair to make it work with 1000 researchers.

Replies from: Raemon
comment by Raemon · 2022-04-09T00:50:50.504Z · LW(p) · GW(p)

I have also seen some partially formalized stabs at something like "interpretability challenges", somewhat inspired by Auditing Games [LW · GW], where there are multiple challenges (i.e. bronze, silver and gold awards), and the bronze challenge is meant to be something achievable by interpretability researchers within a couple years, and the gold challenge is meant to be something like "you can actually reliably detect deceptive adversaries, and other key properties a competent civilization would have before running dangerously powerful AGI."

This isn't the same as an "alignment prize", but might be easier to specify.

comment by Linda Linsefors · 2022-04-10T13:31:50.826Z · LW(p) · GW(p)

I don't think a well-specified problem is the way to go.

Instead, choose a group of judges, and if the majority of them think something is progress, then that counts as a win. These are the rules, and no arguing afterwards.

But to make this fair, the judges should be available to discuss possible research directions, and what they would and would not consider progress, before and during the 3 months of research. Or, if that's not possible due to the scale, maybe a group of people who are good at predicting what the real judges would say could fill in.

Besides, giving out the money and not staying in touch during the 3 months seems like a bad idea anyway.

Replies from: Raemon
comment by Raemon · 2022-04-10T17:55:46.022Z · LW(p) · GW(p)

I agree that actual judges are better for actually solving the problem; the question is whether you can legibly scale the reputation of those judges to 1000 researchers who aren't already bought into the judges' paradigm.

Replies from: Linda Linsefors
comment by Linda Linsefors · 2022-04-14T13:11:34.890Z · LW(p) · GW(p)

I don't know. I would start small, and worry about scaling later.

If this is worth doing at all, it is also worth doing at a smaller scale, so it would not be a waste if the project doesn't scale.

Also, I suspect that quality is more important than quantity here. Doing a really good job running this program with the 50 most influential AI researchers might have a better payoff than doing a half job with the top 1000 researchers.

Replies from: Raemon
comment by Raemon · 2022-04-14T14:55:45.261Z · LW(p) · GW(p)

Oh, maybe, but it seemed like the specific proposal here was the scale. It seemed like hiring a few people to do AI research was... sort of just the same as all the hiring and independent research that's already been happening. The novel suggestion here is hiring less-aligned people at a scale that would demonstrate to the rest of the world that the problem is really hard. (Not sure exactly what the OP meant, and not sure what specific new element you were interested in.)

Replies from: philh, Linda Linsefors
comment by philh · 2022-04-18T23:03:06.111Z · LW(p) · GW(p)

it seemed like the specific proposal here was the scale.

I guess a question here is what the returns-to-scale curve looks like? I'd be surprised if the 501-1000th researchers were more valuable than the 1-500th, suggesting there is a smaller version that's still worth doing.

I don't know where this guess comes from, but my guess might be that the curve is increasing up to somewhere between 10 and 100 researchers, and decreasing after that. But also there are likely to be threshold effects at round/newsworthy numbers?

Replies from: Raemon
comment by Raemon · 2022-04-18T23:08:44.498Z · LW(p) · GW(p)

That does sound right-ish.

comment by Linda Linsefors · 2022-04-16T16:52:24.901Z · LW(p) · GW(p)

I'm not going to speculate about who meant what. But for me the new and interesting idea was to pay people to do research in order to change their minds, as opposed to paying people to do research in order to produce research.

As far as I know, all the current hiring and paying of independent researchers is directed towards people who already believe that AI safety research is difficult and important. Paying people who are not yet convinced is a new move (as far as I know), even at small scale.

I guess it is currently possible for an AI risk sceptic to get an AI safety research grant. But none of the grants are designed for this purpose, right? I think the format of a very highly paid (more than they would earn otherwise), very short (not a major interruption to ongoing research) offer, with the possibility of a prize at the end, is better optimised to get sceptics onboard.

In short, the design of a funding program will be very different when you have a different goal in mind.

comment by Ben Pace (Benito) · 2022-04-08T21:51:01.892Z · LW(p) · GW(p)

The first issue in my mind is that having their staff leave for 3 months would straightforwardly mess up the business plans of those companies and labs. Leadership will by default be angry and work hard to discredit you and perhaps threaten to fire anyone who accepts your deal.

Replies from: Raemon, lc, elityre, niplav
comment by Raemon · 2022-04-09T00:52:49.660Z · LW(p) · GW(p)

My yes-and thought here is "okay, can we do this in a way that the various labs will feel is exciting, or win-win?". Rather than making the deal with individuals, somehow frame it as a partnership with the companies.

comment by lc · 2022-04-10T06:03:24.443Z · LW(p) · GW(p)

Leadership will by default be angry and work hard to discredit you and perhaps threaten to fire anyone who accepts your deal.

"Why is management so insistent I don't go do this alignment thing if it won't actually change my view?" 

This sounds like a wonderful way to invoke the Streisand effect and cause an internal political upheaval. I would love to see DeepMind leadership overreact to this kind of outreach and attempt to suppress a bunch of Google AI researchers' moral compunctions through heavy-handed threats. Just imagine someone doing this if Google researchers had concerns about racial bias in their algorithms. It really seems like the dream scenario.

Replies from: Benito, Kaj_Sotala
comment by Ben Pace (Benito) · 2022-04-11T03:50:49.399Z · LW(p) · GW(p)

I want to explicitly mark this (in my mind) as "babble" of the sort that is "Hufflepuff Bones" from HPMOR. I don't want to have this relationship with the leadership at AI labs. I'm way more interested in finding positive sum actions to take (e.g. trades that AI labs are actively happy about) than adversarial moves.

Replies from: lc
comment by lc · 2022-04-11T04:12:07.928Z · LW(p) · GW(p)

I want them to stop lighting enormous fires next to open fuel repositories, which is unfortunately what these "leaders" are currently paid millions of dollars to do. Given that, I'm not sure what option there is. They are already adversaries. Nonprofit leadership probably doesn't have much of a vested interest and we should definitely talk to them, but these guys? Seriously, what strategy remains?

Replies from: Benito
comment by Ben Pace (Benito) · 2022-04-11T04:52:36.301Z · LW(p) · GW(p)

Man, wtf is this argument? Yes, you should talk to leaders in industry. I have not yet tried at all hard enough on the "work together" strategy to set it on fire at this moment. I don't think such people see themselves as adversaries, I don't think we have been acting like adversaries, and I think this has allowed us to have a fair amount of conversation and cooperation with them.

I feel a bit like you're saying "a new country is stockpiling nukes, so let's make sure to quickly stop talking to them". We're in a free market economy, everyone is incentivized to build these companies, not just the people leading existing ones. That's like most of the problem.

I think it's good and healthy to think about your BATNA and figure out your negotiating position, so thinking about this seems good to me. But just because you might be able to exercise unilateral control doesn't mean you should; it lowers trust and the ability for everyone to work together on anything.

I'm not confident here, and maybe I should already have given up after OpenAI was founded, but I'm not ready to call everyone adversaries; it's pretty damn hard to backtrack on that, and it makes conversation and coordination way, way harder.

Replies from: lc
comment by lc · 2022-04-11T05:03:08.871Z · LW(p) · GW(p)

I think my problem is that I sometimes use "moral culpability" as some sort of proxy for "potential for positive outcomes following dialogue". Should reiterate that it was always my opinion that we should be doing more outreach to industry leaders, even if my hopes are low, especially if it turns out we haven't really tried it. 

Edit: After further thought I also think the frustration I have with this attitude is: 

  1. We're not going to convince everybody.
  2. Wild success means diverting significant (but not necessarily critical) amounts of resources (human, monetary, etc.) away from AI capabilities research and toward other, less dangerous things.
  3. Less AI capabilities research dries up the very-short-term money. Someone from #1 whom we can't convince, or who just doesn't care, is going to be mad about this.

So it's my intuition that, if you're not willing to annoy e.g. DeepMind's executive leadership, you are basically unable to commit to any strategy with a chance of working. It sucks too because this is the type of project where one bad organization will still end up killing everybody else, eventually. But this is the problem that must be solved, and it involves being willing to piss some people off.

Replies from: Benito
comment by Ben Pace (Benito) · 2022-04-14T06:14:56.032Z · LW(p) · GW(p)

I am not sure what the right stance is here, and your points seem reasonable. (I am willing to piss people off.)

comment by Kaj_Sotala · 2022-04-10T14:09:58.814Z · LW(p) · GW(p)

"Why is management so insistent I don't go do this alignment thing if it won't actually change my view?" 

There are already perfectly ordinary business reasons for management to not want to lose all their major staff for three months (especially if some of them are needed for fulfilling contractual obligations with existing customers), and the employees have signed job contracts that probably do not provide a clause to just take three months off without permission. So the social expectation is more on the side of "this would be a major concession from management that they're under no obligation to grant" than "management would be unreasonable not to grant this".

comment by Eli Tyre (elityre) · 2022-04-11T03:47:12.220Z · LW(p) · GW(p)

Thinking out loud about how to address this issue:

It doesn't have to be the same 3 months for everyone. You could stagger this over the course of 2 years, with an eighth of the relevant researchers taking a sabbatical to give this challenge a shot at any given time.

This also means that they can build on each other's work, in serial, if they're in fact making progress, and can see how others have failed to make progress, if they're not.

comment by niplav · 2022-04-08T23:19:18.659Z · LW(p) · GW(p)

Maybe alternatively a big EA actor could just buy some big AGI labs?

That wouldn't mess up schedules that badly.

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2022-04-09T06:44:45.143Z · LW(p) · GW(p)

Google's entire stock value is AGI.

comment by lc · 2022-04-08T18:00:13.926Z · LW(p) · GW(p)

This is a creative idea, and maybe it'll even work, but I do not want us to completely dry up the resources of good-natured people like Sam Bankman-Fried. We could, at the absolute minimum, start this mission with $500k per researcher, one to five well-respected researchers internal to these organizations, and a million-dollar bonus. A billion dollars is a shit ton of money to waste if something like this doesn't work, and I honestly don't think it's wise to set a standard of "asking for it" unless we're literally repelling an asteroid. Even six million dollars is a shit ton of money to waste if it doesn't work.

Things aren't that desperate (yet). We have at least a few years to scale up.

Replies from: Chris_Leong
comment by Chris_Leong · 2022-04-08T18:36:47.112Z · LW(p) · GW(p)

I expect this proposal would be more likely to receive funding, though normally in funding applications you submit different proposals at various funding levels.

comment by Rohin Shah (rohinmshah) · 2022-04-10T07:44:09.340Z · LW(p) · GW(p)

The capabilities people have thrown a challenge right back at you!

Convincing All Alignment Researchers

Give the world's hundred most respected AI alignment researchers $1M each to spend 3 months explaining why AI will be misaligned, with an extra $100M if by the end they can propose an argument capability researchers can't shoot down. They probably won't make any progress, but from then on when others ask them whether they think alignment is a real unsolved problem, they will be way more likely to say no. That only costs you a hundred million dollars!

I don't think I could complete this challenge, yet I also predict that I would not then say that alignment is not a real unsolved challenge. I mostly expect the same problem from the OP proposal, for pretty similar reasons.

Replies from: not-relevant, Vaniver, johnlawrenceaspden
comment by Not Relevant (not-relevant) · 2022-04-10T08:52:56.414Z · LW(p) · GW(p)

For the record I think this would also be valuable! If as an alignment researcher your arguments don't survive the scrutiny of skeptics, you should probably update away from them. I think maybe what you're highlighting here is the operationalization of "shoot down", which I wholeheartedly agree is the actual problem.

Re: the quantities of funding, I know you're being facetious, but just to point it out, the economic values of "capabilities researchers being accidentally too optimistic about alignment" and "alignment researchers being too pessimistic about alignment" are asymmetric.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-04-10T09:15:32.582Z · LW(p) · GW(p)

If as an alignment researcher your arguments don't survive the scrutiny of skeptics, you should probably update away from them.

If that's your actual belief you should probably update away now. People have tried to do this for years, and in fact in most cases the skeptics were not convinced.

Personally, I'm much more willing to say that they're wrong and so I don't update very much.

Re: the quantities of funding, I know you're being facetious, but just to point it out, the economic values of "capabilities researchers being accidentally too optimistic about alignment" and "alignment researchers being too pessimistic about alignment" are asymmetric.

Yeah, that was indeed just humor, and I agree with the point.

Replies from: not-relevant
comment by Not Relevant (not-relevant) · 2022-04-10T16:52:29.118Z · LW(p) · GW(p)

I was referring to your original statements:

...if by the end they can propose an argument capability researchers can't shoot down

[...]

I don't think I could complete this challenge, yet I also predict that I would not then say that alignment is not a real unsolved challenge.

I think you might be construing my statement 

if your arguments don't survive the scrutiny of skeptics, you should probably update away from them.

as meaning that you should take AI risk less seriously if you can't convince the skeptics, as opposed to if the skeptics can't convince you.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-04-11T08:06:54.686Z · LW(p) · GW(p)

Ah, I see, that makes more sense, sorry for the misunderstanding.

(Fwiw I and others have in fact talked with skeptics about alignment.)

comment by Vaniver · 2022-04-10T15:32:56.445Z · LW(p) · GW(p)

Hmmm, I wonder if there's a version like this that would actually be decent at 'getting people to grapple with' the ideas? Like, if you got Alignment Skeptic Alice to judge the ELK prize submissions with Paul Christiano and Mark Xu (presumably by paying her a bunch of money), would Alice come away from the experience thinking "actually ELK is pretty challenging" or would she come away from it thinking "Christiano and Xu are weirdly worried about edge cases that would never happen in reality"?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-04-11T08:04:16.784Z · LW(p) · GW(p)

would Alice come away from the experience thinking "actually ELK is pretty challenging" or would she come away from it thinking "Christiano and Xu are weirdly worried about edge cases that would never happen in reality"?

If Alice is a skeptic because she actually has thought about it for a while and come to an opinion, it will totally be the latter.

If Alice has just not thought about it much, and she believes that AGI is coming, then she might believe the former. But in that case I think you could just have Alice talk to e.g. me for, say, 10 hours, and I'd have a pretty decent chance of convincing her that we should have more investment in AGI alignment.

(Or to put it another way: just actually honestly debating with the other person seems a lot more likely to work than immersing them in difficult alignment research -- at least as long as you know how to debate with non-rationalists.)

Replies from: not-relevant
comment by Not Relevant (not-relevant) · 2022-04-11T22:05:36.592Z · LW(p) · GW(p)

Your time isn’t scalable, though - there are well over 10,000 replacement-level AI researchers.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-04-12T12:51:41.192Z · LW(p) · GW(p)

Of course, but "judge the ELK prize submissions with Paul Christiano and Mark Xu" is also not a scalable approach, since it requires a lot of conversation with Paul and Mark?

Replies from: not-relevant
comment by Not Relevant (not-relevant) · 2022-04-12T17:17:07.816Z · LW(p) · GW(p)

I think we agree that scaling is relative and more is better!

comment by johnlawrenceaspden · 2022-04-11T12:01:19.076Z · LW(p) · GW(p)

God, absolutely, yes, do I get to talk to the sceptic in question regularly over the three months? 

Given three months of dialogue with someone who thinks like me about computers and maths, and where we both promise to take each other's ideas seriously, if I haven't changed his mind far enough to convince him that there are serious reasons to be scared, he will have changed mine. 

I have actually managed this with a couple of sceptic friends, although the three months of dialogue has been spread out over the last decade!

And I don't know what I'm talking about. Are you seriously saying that our best people can't do this?! Eliezer used to make a sport of getting people to let him out of his box, and he has always been really, really good at explaining complicated thoughts persuasively.

Maybe our arguments aren't worth listening to. Maybe we're just wrong.

Give me this challenge!! Nobody needs to pay me, I will try to do this for fun and curiosity with anyone on the other side who is open-minded enough to commit to regular chats. An hour every evening?

In person would be better, so Cambridge or maybe London? I can face the afternoon train for this.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-04-12T12:50:36.752Z · LW(p) · GW(p)

I think it's totally doable (and I have done it myself) to convince people who haven't yet staked a claim as an Alignment Skeptic. There are specific people such as Yann Lecun who are publicly skeptical of alignment research; it is them who I imagine we could not convince.

Replies from: johnlawrenceaspden, elriggs
comment by johnlawrenceaspden · 2022-04-12T19:25:55.079Z · LW(p) · GW(p)

OK, I myself am sceptical of alignment research, but not at all sceptical of the necessity for it.

Do you think someone like Eliezer has had a proper go at convincing him that there's a problem? Or will he just not give us the time of day? Has he written anything coherent on the internet that I could read in order to see what his objections are? 

Personally I would love to lose my doom-related beliefs, so I'd like to try to understand his position as well as I can for two reasons. 

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2022-04-13T07:30:42.292Z · LW(p) · GW(p)

Here's an example [LW · GW]

Replies from: johnlawrenceaspden
comment by johnlawrenceaspden · 2022-04-13T13:30:58.199Z · LW(p) · GW(p)

Great, thanks, so I'm going to write down my response to his thoughts as I hear them:

Before reading the debate I read the Scientific American article it's about. On first read, that seems convincing, ok, relax! And then take a closer look. 

What's he saying (paraphrasing stuff from Scientific American):

Superintelligence is possible, can be made to act in the world, and is likely coming soon. 

Intelligence and goals are decoupled

Why would a sentient AI want to take over the world? It wouldn’t.

Intelligence per se does not generate the drive for domination

Not all animals care about dominance. 

I'm a bit worried by mention of the first law of robotics. I thought the point of all those stories was all the ways such laws might lead to weird outcomes.

Blah blah joblessness, military robots, blah inequality, all true, I might even care if I thought there was going to be anyone around to worry about it. But it does mean that he's not a starry-eyed optimist who thinks nothing can go wrong.

 

That's great! I agree with all of that, it's often very hard to get people that far. I think he's on board with most of our argument. 

And then right at the end (direct quote):

Even in the worst case, the robots will remain under our command, and we will have only ourselves to blame.

OK, so he thinks that because you made a robot, it will stay loyal to you and follow your commands. No justification given.

It's not really fair to close-read an article in a popular magazine, but at this point I think maybe he realises that you can make a superintelligent wish-granting machine, but hasn't thought about what happens if you make the wrong wish and want to change it later.

(I'm supposed to not be throwing mythological and literary references into things any more, but I can't help but think about the Sibyl, rotting in her bag because Apollo had granted her eternal life but not eternal youth, or TS Eliot's: "That is not it at all, That is not what I meant, at all.")

So let's go and look at the debate itself, rather than the article.

comment by Logan Riggs (elriggs) · 2022-04-12T23:25:47.073Z · LW(p) · GW(p)

I was talking to someone recently who talked to Yann and got him to agree with very alignment-y things, but then a couple days later, Yann was saying very capabilities things instead. 

The "someone"'s theory was that Yann's incentives and environment is all towards capabilities research.

comment by lorepieri (lorenzo-rex) · 2022-04-08T23:24:25.611Z · LW(p) · GW(p)

Unpopular opinion (on this site, I guess): AI alignment is not a well-defined problem; there is no clear-cut resolution to it. It will be an incremental process, similar to cybersecurity research.

About the money, I would do the opposite: select researchers that would do it for free, just pay them living expenses and give them arbitrary resources.

Replies from: johnlawrenceaspden, lahwran
comment by johnlawrenceaspden · 2022-04-09T13:21:50.524Z · LW(p) · GW(p)

I don't think that's an unpopular opinion! That's just true.

Money-wise, we have a Cassandra problem. No one believes us. So the question is, how can we get everyone competent enough to end the world to realize that they're about to end the world? Is it possible to use money to do this? Apparently there is money, but not so much time.

comment by the gears to ascension (lahwran) · 2022-04-09T06:49:13.517Z · LW(p) · GW(p)

just like parenting

comment by johnlawrenceaspden · 2022-04-09T10:36:22.351Z · LW(p) · GW(p)

Why not do something like the Clay Millennium Prizes in mathematics? Offer gargantuan sums for progress on particular subproblems like corrigibility, or inner alignment, or coming up with new reasons why it's even harder than it already looks, or new frameworks for attacking the problem.

Let MIRI be the judge, and make it so that even the most trivial progress gets a fat cheque. (A few thousand for pointing out typos in MIRI papers seems fair) 

It should be obviously easy money even to clever undergraduate compscis.

That should inspire vast numbers of people to do the research instead of their day jobs, which would obviously be great news. 

That will likely lead to companies banning their staff from thinking about the problem, which is the best possible publicity. Nothing gets one of us thinking like being told to not think about something.

It might even produce some progress (even better!!). 

But what it will likely really do is convince everyone that alignment is the same sort of problem as P=NP, or the Riemann Hypothesis.

comment by Joe_Collman · 2022-04-08T20:02:05.556Z · LW(p) · GW(p)

This is an interesting idea (of course the numbers might need to be tweaked).

My first concern would be disagreement about what constitutes a solution: what happens if one/many researchers incorrectly think they've solved the problem, and they continue to believe this after alignment researchers "shoot them down", but the solution appears to be correct to most researchers?
Does this backfire and get many to dismiss the problem as already solved?

I think we'd want to set things up with adjudicators that are widely respected and have the 'right' credentials - so e.g. Stuart Russell is an obvious choice; maybe Wei Dai? I'm not the best judge of this.

Clearly we'd want Eliezer, Paul... advising, but for final decisions to be made by people who are much harder to dismiss.

I think we'd also want to set up systems for partial credit. I.e. rewards for anyone who solves a significant part of the problem, or who lays out a new solution outline which seems promising (even to the most sceptical experts).
I'd want to avoid the failure mode where people say at the end (or decide before starting) that while they couldn't solve the problem in 3 months, it's perfectly doable in e.g. a couple of years - or that they could solve large parts of the problem, and the rest wouldn't be too difficult.

With a bit of luck, this might get us the best of both worlds: some worthwhile progress, along with an acknowledgement that the rest of the problem remains very hard.


I think the largest obstacle is likely the long-timelines people. Not those who simply believe timelines to be long, but those who also believe meaningful alignment work will only be possible once we have a clear picture of what AGI will look like.

The "but think of your grand-children!" argument doesn't work here, since they'll simply respond that they're all for working on alignment at the appropriate time. To such people, working on alignment now can seem as sensible as working on 747 safety before the Wright flyer. (this was my impression of Yann LeCun's take when last I heard him address such questions, if I'm not misremembering)

comment by DanielFilan · 2022-04-11T20:37:15.929Z · LW(p) · GW(p)

Give the world's thousand most respected AI researchers $1M each to spend 3 months working on AI alignment, with an extra $100M if by the end they can propose a solution alignment researchers can't shoot down.

An unfortunate fact is that the existing AI alignment community is bad at coming to consensus with alignment solution proposers about whether various proposals have been "shot down". For instance, consider the proposed solution to the "off-switch problem" where an AI maintains uncertainty about human preferences, acts to behave well according to those (unknown) preferences, and updates its belief about human preferences based on human behaviour (such as inferring "I should shut down" from the human attempting to shut the AI down), as described in this paper. My sense is that Eliezer Yudkowsky and others think they have shot this proposal down (see the Arbital page on the problem of fully updated deference), but that Stuart Russell, a co-author of the original paper, does not (based on personal communication with him). This suggests that we need significant advances in order to convince people who are not AI safety researchers that their solutions don't work.

Replies from: not-relevant
comment by Not Relevant (not-relevant) · 2022-04-11T22:03:49.067Z · LW(p) · GW(p)

Strongly encourage you to make this comment its own post; the work of laying out “here’s SOTA and here’s why it doesn’t work” in precise and accessible language is invaluable to incentivizing outsiders to work on alignment.

comment by lc · 2022-04-09T10:42:16.382Z · LW(p) · GW(p)

It was argued that these plans are only relevant on 15+ year timelines

Is it OK if I get kinda angry when I think about these constraints, as someone born around 2000? We had 15 years 15 years ago. What was everybody doing when I was 7 years old?

Replies from: johnlawrenceaspden
comment by johnlawrenceaspden · 2022-04-09T12:54:03.597Z · LW(p) · GW(p)

They were wanking on endlessly about quantum consciousness and how any agent sufficiently intelligent to escape its box would be intelligent enough to figure out that the utility function we'd given it wasn't actually what we wanted.

I've been going on about 'AI will kill us all' to anyone who'll listen for about ten years now. Everyone I know thinks I've got another stupid science-fiction bee in my bonnet to replace my previous idiotic nerd-concerns about overpopulation and dysgenics and grey goo and engineered plagues and factory farming and nuclear accidents and nuclear proliferation and free speech and ubiquitous surveillance and whatever the hell else it was that used to seem important. 

Almost all my friends are graduates of the University of Cambridge, and roughly half of them studied real subjects and can program computers.

Goddamn it, I know actual AI safety researchers, people much cleverer and more capable than me, who just looked at AlphaZero and shrugged, apparently completely unable to see that building a really good general reinforcement learner might have real-world implications beyond chess.

There's some bunch of academic parasites in Cambridge who were originally supposed to be worrying about existential risks and really seemed quite serious at first, but recently they mostly seem interested in global warming and the disparate impact of COVID.

Also, whaddya mean, 15 years ago? IJ Good pointed out that we were all doomed about 70 years ago, and really it's blindingly obvious to anyone who thinks about it, which is where the inspiration for things like the Terminator movies comes from. Or Frankenstein, or the Sorcerer's Apprentice, or the Golem, or the Monkey's Paw.

You were fucked, and it was knowable that you were fucked, before you were born, and so was I. That's what it's like to be born into a world of lunatics.

But you have a genuine case to hate me in particular, if you like, because I knew. I really did, in spite of the fact that everyone was telling me I was wrong all the time. And I did nothing. Because it didn't look like there was anything to be done and I already knew that I was no good at research.

Learn from me, and use your anger. You may still have 15 years, and more and more people are coming round to the conclusion that the things that are obviously true might actually be true. Go down fighting.

Or not. Getting drunk on a beach and trying to have as much of a life as possible before the end is also a good use of your time. That was the only deal that was ever on offer, really.

Replies from: Chris_Leong
comment by Chris_Leong · 2022-04-09T13:00:52.500Z · LW(p) · GW(p)

So what's your plan now? Are you going to go down fighting?

Replies from: johnlawrenceaspden
comment by johnlawrenceaspden · 2022-04-09T13:46:43.558Z · LW(p) · GW(p)

A good question! I think the answer's no. 

If there was a bunch of guys around here that I could talk to about this stuff, then I might be able to act as a sort of useful idiot for them to do rubber-duck debugging on. I do still seem to be pretty good at understanding other people's ideas. 

But there isn't. I have looked. And I never even got my PhD, I don't seem to be very good at generating ideas alone. 

I'm too old for sport or drinking or womanizing these days, even tennis seems to be too much for my poor knees. I'm even getting bored of computers, to my amazement. But the love of patterns is still there and now there's nothing better to do it seems more and more salient. 

So I think I'm going to finish reading my current book on undergraduate group theory, and then maybe after that there's something called Visual Differential Geometry that looks cool. And maybe take my narrowboat out into the Summer fens a few last times. 

Actually MIRI's paper on Garrabrant Inductors looks fascinating, and maybe the last great contribution to the philosophy of mathematics, but I think maybe I need to do an undergraduate logic course first in order to actually understand it.

A bit of consulting to pay the rent maybe if we last that long. But it's all zoom calls and open-plan offices these days. Sweaty nerd-pits where the blinds are always down so that the fierce rays of the hated sun don't shine on people's monitors. Can't face that sort of thing in the Summer.

Life's to be enjoyed. I never did have the willpower to make myself do something that was no fun. 

Replies from: Chris_Leong, lc
comment by Chris_Leong · 2022-04-09T14:21:33.865Z · LW(p) · GW(p)

Well, if you change your mind, it'd be great to have you. I'm sure there are some up-and-coming researchers who'd benefit from having someone else to work through Garrabrant Inductors with them. At the same time, I don't begrudge you your decision. I'm glad at least that you seem to have found peace.

Replies from: johnlawrenceaspden, johnlawrenceaspden
comment by johnlawrenceaspden · 2022-04-09T20:10:39.301Z · LW(p) · GW(p)

Chris, if you're speaking on behalf of MIRI, then, to speak frankly and without irony for once:

If you can pay me some sort of reasonable amount of money for my time (and I mean 'reasonable by academic standards', not 'reasonable by computer programmer standards' and even less 'reasonable by consultant accustomed to astronomical daily rates' standards) to sit around doing roughly what I was going to do anyway, and I don't have to leave my paradise or spend my days in front of a computer, or do anything I don't fancy doing, but I get to sit around in the sunshine reading and thinking and talking to people all day, then I'm yours and always was.

I have enough control of my curiosity that I can point it at a copy of 'Logical Induction' and a book on undergraduate logic, if you think that might be a good use of my time in the next few months.

And I'm unusually sociable and extrovert for one of us, and I have no fear of public speaking, and I would be quite happy to occasionally get the train to London or Oxford to talk to people if I could help out in that sort of way.

Once upon a time I was a promising young mathematician, and I am still rather a good programmer even by the standards of this town. And what I understand I can teach, so if you have any promising young persons who can make it to Cambridge then I can show them what few things I know.

But I wonder if you would really want me to do that. I don't want to suck up resources that could be better used elsewhere. I am an old man now, and even as a youngster I was no use at research, and I can practically guarantee that the result of me staring at Logical Induction for long enough will be that I will understand first-year logic, and likely I will understand something of Logical Induction. And that that will be it.

I'm not going to solve alignment. I'd be very surprised if I could make any progress on it at all.

But if there is a deal to be done here, I will do it. If you want a reference get in touch with Ramana Kumar, he knows me fairly well.
 

Replies from: Vaniver, Chris_Leong
comment by Vaniver · 2022-04-10T14:59:26.856Z · LW(p) · GW(p)

Are you aware of open postings like this one [LW · GW]?

Replies from: johnlawrenceaspden
comment by johnlawrenceaspden · 2022-04-11T12:43:00.317Z · LW(p) · GW(p)

That looks perfect, thanks. But it also looks like a job offer. If they want a servant they can pay my full commercial rate. And they don't need to, because they can find a much brighter and more idealistic PhD student who'll do it for them for peanuts. 

So much so that I'm not going to even fill out the form. I don't have the qualifications they're looking for, and that kind of application process is always a sign that it's not worth bothering. Just wasting the time of everyone involved.

But if someone recommends me to them (who? why?), and they get in touch and offer to pay me some reasonable retainer (£2000/month?) just to plough through whatever infrabayesianism is under my own power and try to come up with a sensible explanation and some illuminating examples, like I was reading it for fun, then I would totally be up for that. I may fail, of course. And either side can end the deal at any time without notice. So what would they have to lose?

Topology and functional analysis were favourites of mine as an undergraduate (30 years ago!!); I've never done measure theory or convex analysis. Background in reinforcement learning theory? Well, I read the first half of Sutton and got it, and did all the exercises, and I've been meaning to get round to the second half.

But I think that I can still claim to be able to understand complicated maths, especially if I get to talk to the people who came up with it, and sometimes I can teach it to other people or at least help them to understand it.

comment by Chris_Leong · 2022-04-09T22:58:09.500Z · LW(p) · GW(p)

I don't work for MIRI or LessWrong (sadly)[1]. I just meant the community of people who care about these issues. Maybe it was presumptuous for me to use the word "we", but I think people within the community generally do appreciate any contributions that people happen to make.

Random fact: I happen to know Ramana Kumar. He was on Australia's International Informatics Olympiad Team with me.

  1. ^

    I am doing some movement-building in Australia and NZ, but my efforts to encourage people here are kind of out of scope for this.

Replies from: johnlawrenceaspden
comment by johnlawrenceaspden · 2022-04-10T11:37:18.266Z · LW(p) · GW(p)

Maybe it was presumptuous for me to use the word "we"

 

My mistake; I use 'we' to talk about us all the time. And I did check to see whether you were MIRI or not; I couldn't tell.

You've at least got me to think about my price for lifting a finger to save the world. I am surprised that the answer is 'minimum wage will do as long as I don't have to do anything I don't enjoy'; that seems odd, but I think that is the answer. It feels right.

I think at heart the problem is that I don't believe we've got a ghost of a chance. And for me 'dying with dignity' means enjoying what time remains. 

comment by johnlawrenceaspden · 2022-04-09T20:20:25.156Z · LW(p) · GW(p)

I'm glad at least that you seem to have found peace.

 

Staring death in the face and smiling is a character trait of the English. I take no credit for my character class, since I did not choose it. 

I learned when I was very young that I was going to die, and I saw death very early, and I grew up imagining the nuclear fire blossom over my city, and I have seen friends and mentors and lovers and family die. We were all, always, going to die. 

Die by fire, die by AI, die by old age, die by disease, what matter? 

comment by lc · 2022-04-09T13:53:18.431Z · LW(p) · GW(p)

You could always help us out with this [LW · GW].
 

Replies from: johnlawrenceaspden
comment by johnlawrenceaspden · 2022-04-10T13:02:42.247Z · LW(p) · GW(p)

I've been thinking for a while now that it might be helpful to get my unformed vague intuition that we're all dead into a more legible and persuasive form:

https://www.lesswrong.com/posts/xBrpph9knzWdtMWeQ/the-proof-of-doom [LW · GW]

Certainly I have had little success convincing people that there is even a problem this last decade. 

Perhaps that could be some use to you? 

comment by nem · 2022-04-11T16:48:37.424Z · LW(p) · GW(p)

Another way this could potentially backfire: $1,000,000 is a lot of money for 3 months. A lump sum like this will cause at least some of the researchers to A) retire, B) take a long hiatus/sabbatical, or C) be less motivated by future financial incentives.

If 5 researchers decide to take a sabbatical, then whatever. If 150 of them do? Maybe that's a bigger deal. You're telling me you wouldn't consider it if 5-10 times your annual salary was dropped in your lap?

comment by johnlawrenceaspden · 2022-04-09T10:23:16.361Z · LW(p) · GW(p)

Oh, that's so unfair! I've been avoiding working in AI for years, on account of worrying I might have a good enough idea to bring the apocalypse forward a day. One day out of 8 billion lives makes me Worse Than Hitler(TM) by at least an order of magnitude.

And I've even been avoiding blogging about basic reinforcement learning for fear of inspiring people! 

Where's my million?!

Also hurry up, I hardly have any use for money, so it will take ages to spend a million dollars even if I get creative, and it doesn't look like we have very long.

Replies from: lc
comment by lc · 2022-04-09T10:45:48.758Z · LW(p) · GW(p)

This actually brings up an important consideration. It would be bad to incentivize AGI research by paying people who work inside it huge sums, the same way paying slaveowners to free their slaves is a poor abolitionist strategy. In reality people aren't Homo economicus, so I think we could work around it, but we'd need to be careful.

Replies from: Linch, johnlawrenceaspden, JohnBuridan
comment by Linch · 2022-04-09T19:04:09.133Z · LW(p) · GW(p)

This actually brings up an important consideration. It would be bad to incentivize AGI research by paying people who work inside it huge sums, the same way paying slaveowners to free their slaves is a poor abolitionist strategy.

? This was literally what the UK did to free their slaves, and (iiuc) historians considered it more successful than (e.g.) the US strategy, which led to abolition a generation later and also involved a civil war. 

Replies from: johnlawrenceaspden
comment by johnlawrenceaspden · 2022-04-09T20:30:41.063Z · LW(p) · GW(p)

That was after the abolition of slavery by law, and long after the interdiction of the transatlantic slave trade by the Royal Navy. I wonder if it would have been such a good strategy if the compensated could have re-invested their money.

Replies from: Linch
comment by Linch · 2022-04-09T20:37:47.593Z · LW(p) · GW(p)

This is a really good point! Though I thought the subtext of the original comment was about the incentives rather than the causal benefits/harms after receiving the money.

comment by johnlawrenceaspden · 2022-04-09T12:04:03.764Z · LW(p) · GW(p)

That was actually the subtext of my comment, so thanks for making it explicit. 

To be tediously clear, I think this is a good idea, but only because I take imminent AI extinction doom super-seriously and think just about anything that could plausibly help might as well be tried.

As a general policy, I do feel giving all the money you can lay your hands on to the bad guys in proportion to how bad they've historically been is not "establishing correct incentives going forward". (am I supposed to add "according to my model" here? I'm not terribly hip with the crazy rhythm of the current streets.)

Not such an issue if there is no going forward. An interesting test case for Logical Decision Theory perhaps?

comment by JohnBuridan · 2022-04-10T18:17:36.004Z · LW(p) · GW(p)

Well, Russia did pay to free the serfs, and it would have worked if they hadn't designed it so that the former serfs had to pay back the debt themselves. Similarly, Ireland paid off the big landholders in a series of acts from 1870 to 1909, which created a new small-landholding class. In fact, this type of thing historically seems to work when it concerns purchasing property rights.

The analogy is imperfect for knowledge workers, though; they don't own the knowledge in the same way. Perhaps the best way to do this is, as previously mentioned, to get a sizable portion of today's accomplished researchers to spend time on the problem by purchasing their time.

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2022-04-08T20:00:15.823Z · LW(p) · GW(p)

This is thinking in the right direction.

Just announcing prize money could be worth it too.