AI Alignment Open Thread August 2019

habryka4

post by habryka (habryka4) · 2019-08-04T22:09:38.431Z · LW · GW · 96 comments

This is an experiment in having an Open Thread dedicated to AI Alignment discussion, hopefully enabling researchers and upcoming researchers to ask small questions they are confused about, share very early stage ideas and have lower-key discussions.

Comments sorted by top scores.

comment by Wei Dai (Wei_Dai) · 2019-08-08T09:24:36.366Z · LW(p) · GW(p)

Has anyone seen this argument for discontinuous takeoff before? I propose that there will be a discontinuity in AI capabilities at the time that the following strategy becomes likely to succeed:

1. Use hacking or phishing to take over a computing center belonging to someone else.
2. Expand self (i.e., the AI executing the current strategy) into the new computing center.
3. Repeat steps 1 & 2 on other computing centers (in increasing order of their security) using the increased capabilities of the expanded AI.
4. Defend self and figure out how to take over or neutralize the rest of the world.

The reason for the discontinuity is that this strategy is an all-or-nothing kind of thing. There is a threshold in the chance of success in taking over other people's hardware, below which you're likely to get caught and punished/destroyed before you take over the world (and therefore almost nobody attempts it, and the few who do just quickly get caught), and above which the above strategy becomes feasible.
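A toy calculation can illustrate why such a threshold could be sharp. The sketch below is an illustration only and not part of the argument above: it assumes a takeover needs `n_steps` consecutive undetected break-ins, each succeeding independently with the same probability `p_step`. Both names are hypothetical, and the model ignores the capability gains from each captured computing center.

```python
def takeover_success(p_step: float, n_steps: int) -> float:
    """Chance that all n_steps break-ins succeed without detection.

    Assumes independent steps with equal success probability, which is
    a simplification made only for this illustration.
    """
    return p_step ** n_steps


# Compare a few per-step success rates for a 100-step campaign.
for p in (0.90, 0.99, 0.999):
    print(f"p_step={p}: overall success = {takeover_success(p, 100):.4f}")
```

Under these assumptions, a small change in per-step reliability moves the overall chance from negligible to likely, which is the all-or-nothing shape the comment describes.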

Replies from: Kaj_Sotala, jalex-stark-1, capybaralet, jess-smith

↑ comment by Kaj_Sotala · 2019-08-08T18:31:13.272Z · LW(p) · GW(p)

There's previously been the "an AI could achieve a discontinuous takeoff by exploiting a security vulnerability to copy itself into lots of other computers" argument in at least Sotala 2012 (sect 4.1.) and Sotala & Yampolskiy 2015 (footnote 15), though those don't explicitly mention the "use the additional capabilities to break into even more systems" part. (It seems reasonably implicitly there to me, but that might just be illusion of transparency speaking.)

↑ comment by Jalex Stark (jalex-stark-1) · 2019-08-08T22:09:17.841Z · LW(p) · GW(p)

I think Bostrom uses the term "hardware overhang" in Superintelligence to point to a cluster of discontinuous takeoff scenarios including this one.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2019-08-09T02:51:44.510Z · LW(p) · GW(p)

It seems to me that there's a counter-argument available to the "hardware overhang" argument for discontinuous takeoff that doesn't apply to the "hacking" argument, namely that for any AI that achieves a high level of capability by taking advantage of hardware overhang, there will be an AI that arrives a bit earlier and achieves a somewhat lower level of capability by taking advantage of the same hardware overhang (e.g., because it has somewhat worse algorithms, or somewhat less or lower quality training data). Unlike the "hacking" scenario, in the generic "hardware overhang" scenario, there's not an apparent threshold effect that could cause a discontinuity.

(Curiously, Paul Christiano's and AI Impacts's posts arguing against discontinuous takeoff both ignore "hardware overhang" and neither gives this counter-argument. Neither of them mentions the "hacking" argument either, AFAICT.)

Replies from: Kaj_Sotala

↑ comment by Kaj_Sotala · 2019-08-09T10:23:56.215Z · LW(p) · GW(p)

Wasn't hardware overhang the argument that if AGI is more bottlenecked by software than hardware, then conceptual insights on the software side could cause a discontinuity as people suddenly figured out how to use that hardware effectively? I'm not sure how your counterargument really works there, since the AI that arrives "a bit earlier" either precedes or follows that conceptual breakthrough. If it precedes the breakthrough, then it doesn't benefit from that conceptual insight so won't be powerful enough to take advantage of the overhang, and if it follows it, then it has a discontinuous advantage over previous systems and can take advantage of hardware overhang.

---

Separately, your comment also feels related to my argument [LW · GW] that focusing on just superintelligence is a useful simplifying assumption, since a superintelligence is almost by definition capable of taking over the world. But it simplifies things a little too much, because if we focus too much on just the superintelligence case, we might miss the emergence of a “dumb” AGI which nevertheless had the "crucial capabilities" necessary for a world takeover.

In those terms, "having sufficient offensive cybersecurity capability that a hacking attempt can snowball into a world takeover" would be one such crucial capability that allowed for a discontinuity.

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-15T04:07:30.055Z · LW(p) · GW(p)

Yes.

Not a direct response: It's been argued (e.g. I think Paul said this in his 2nd 80k podcast interview?) that this isn't very realistic, because the low-hanging fruit (of easy-to-attack systems) is already being picked by slightly less advanced AI systems. This wouldn't apply if you're *already* in a discontinuous regime (but then it becomes circular).

Also not a direct response: It seems likely that some AIs will be much more/less cautious than humans, because they (e.g. implicitly) have very different discount rates. So AIs might take very risky gambles, which means both that we might get more sinister stumbles (good thing), but also that they might readily risk the earth (bad thing).

↑ comment by Jess Smith (jess-smith) · 2019-08-09T17:53:49.031Z · LW(p) · GW(p)

I wonder how plausible it is that the AI would be able to take over a second computing center before being detected in the first (which would then presumably be shut down).

comment by Rohin Shah (rohinmshah) · 2019-08-05T02:01:35.672Z · LW(p) · GW(p)

(Short writeup for the sake of putting the idea out there)

AI x-risk people often compare coordination around AI to coordination around nukes. If we ignore military applications of AI and restrict ourselves to misalignment, this seems like a weird analogy to me:

With technical AI safety we're primarily thinking about accident risks, whereas nukes are deliberately weaponized.
Everyone can agree that we don't want nuclear accidents, so why can't everyone agree we don't want AI accidents? I think the standard response here is "everyone will trade off safety for capabilities", but did that happen with nukes?
I don't see any analog to mutually assured destruction, which seems like a pretty key feature with nukes.

Perhaps a more appropriate nuclear analogy for AI x-risk would be accidents like Chernobyl.

Replies from: JamesPayor, capybaralet, FactorialCode, robert-miles, Dagon, matthew-barnett

↑ comment by James Payor (JamesPayor) · 2019-08-06T22:11:52.309Z · LW(p) · GW(p)

There is a nuclear analog for accident risk. A quote from Richard Hamming:

Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, "It is the probability that the test bomb will ignite the whole atmosphere." I decided I would check it myself! The next day when he came for the answers I remarked to him, "The arithmetic was apparently correct but I do not know about the formulas for the capture cross sections for oxygen and nitrogen—after all, there could be no experiments at the needed energy levels." He replied, like a physicist talking to a mathematician, that he wanted me to check the arithmetic not the physics, and left. I said to myself, "What have you done, Hamming, you are involved in risking all of life that is known in the Universe, and you do not know much of an essential part?" I was pacing up and down the corridor when a friend asked me what was bothering me. I told him. His reply was, "Never mind, Hamming, no one will ever blame you."

https://en.wikipedia.org/wiki/Richard_Hamming#Manhattan_Project

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-08T03:00:59.950Z · LW(p) · GW(p)

I don't really know what this is meant to imply? Maybe you're answering my question of "did that happen with nukes?", but I don't think an affirmative answer means that the analogy starts to work.

I think the nukes-AI analogy is used to argue "people raced to develop nukes despite their downsides, so we should expect the same with AI"; the magnitude/severity of the accident risk is not that relevant to this argument.

Replies from: Wei_Dai, JamesPayor

↑ comment by Wei Dai (Wei_Dai) · 2019-08-08T04:04:47.906Z · LW(p) · GW(p)

I think the nukes-AI analogy is used to argue "people raced to develop nukes despite their downsides, so we should expect the same with AI"

If you're arguing against that, I'm still not sure what your counter-argument is. To me, the argument is: the upsides of nukes are the ability to take over the world (militarily) and to defend against such attempts. The downsides include risks of local and global catastrophe. People raced to develop nukes because they judged the upsides to be greater than the downsides, in part because they're not altruists and longtermists. It seems like people will develop potentially unsafe AI for analogous reasons: the upsides include the ability to take over the world (militarily or economically) and to defend against such attempts, and the downsides include risks of local and global catastrophe, and people will likely race to develop AI because they judge the upsides to be greater than the downsides, in part because they're not altruists and longtermists.

Where do you see this analogy breaking down?

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-08T04:32:19.309Z · LW(p) · GW(p)

I'm more sympathetic to this argument (which is a claim about what might happen in the future, as opposed to what is happening now, which is the analogy I usually encounter, though possibly not on LessWrong). I still think the analogy breaks down, though in different ways:

There is a strong norm of openness in AI research (though that might be changing). (Though perhaps this was the case with nuclear physics too.)
There is a strong anti-government / anti-military ethic in the AI research community. I'm not sure what the nuclear analog is, but I'm guessing it was neutral or pro-government/military.
Governments are staying a mile away from AGI; their interest in AI is in narrow AI's applications. Narrow AI applications are diverse, and many can be done by a huge number of people. In contrast, nukes are a single technology, governments were interested in them, and only a few people could plausibly build them. (This is relevant if you think a ton of narrow AI could be used to take over the world economically.)
OpenAI / DeepMind are not adversarial towards each other. In contrast, US / Germany were definitely adversarial.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2019-08-08T06:57:49.147Z · LW(p) · GW(p)

Assuming you agree that people are already pushing too hard for progress in AGI capability (relative to what's ideal from a longtermist perspective), I think the current motivations for that are mostly things like money, prestige, scientific curiosity, wanting to make the world a better place (in a misguided/shorttermist way), etc., and not so much wanting to take over the world or to defend against such attempts. This seems likely to persist in the near future, but my concern is that if AGI research gets sufficiently close to fruition, governments will inevitably get involved and start pushing it even harder due to national security considerations. (Recall that the Manhattan Project started about 3 years before the detonation of the first nuke.) Your argument seems more about what's happening now, and does not really address this concern.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-08T18:00:31.585Z · LW(p) · GW(p)

you agree that people are already pushing too hard for progress in AGI capability (relative to what's ideal from a longtermist perspective)

I'm uncertain, given the potential for AGI to be used to reduce other x-risks. (I don't have strong opinions on how large other x-risks are and how much potential there is for AGI to differentially help.) But I'm happy to accept this as a premise.

Your argument seems more about what's happening now, and does not really address this concern.

I think what's happening now is a good guide into what will happen in the future, at least on short timelines. If AGI is >100 years away, then sure, a lot will change and current facts are relatively unimportant. If it's < 20 years away, then current facts seem very relevant. I usually focus on the shorter timelines.

For min(20 years, time till AGI), for each individual trend I identified, I'd weakly predict that trend will continue (except perhaps openness, because that's already changing).

↑ comment by James Payor (JamesPayor) · 2019-08-08T04:05:03.217Z · LW(p) · GW(p)

It wasn't meant as a reply to a particular thing - mainly I'm flagging this as an AI-risk analogy I like.

On that theme, one thing "we don't know if the nukes will ignite the atmosphere" has in common with AI-risk is that the risk is from reaching new configurations (e.g. temperatures of the sort you get out of a nuclear bomb inside the Earth's atmosphere) that we don't have experience with. Which is an entirely different question than "what happens with the nukes after we don't ignite the atmosphere in a test explosion".

I like thinking about coordination from this viewpoint.

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-07T07:46:52.010Z · LW(p) · GW(p)

For me it's because:

Nukes seem like an obvious Xrisk
People mostly seem to agree that we haven't done a good job coordinating around them
They seem a lot easier to coordinate around

Also, not a reason, but:

AI seems likely to be weaponized, and warfare (whether conventional or not) seems like one of the areas where we should be most worried about "unbridled competition" creating a race-to-the-bottom on safety.

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-07T07:48:27.477Z · LW(p) · GW(p)

TBC, I think climate change is probably an even better analogy.

And I also like to talk about international regulation, in general, like with tax havens.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-07T18:59:03.267Z · LW(p) · GW(p)

Agree that climate change is a better analogy.

Disagree that nukes seem easier to coordinate around -- there are factors that suggest this (e.g. easier to track who is and isn't making nukes), but there are factors against as well (the incentives to "beat the other team" don't seem nearly as strong).

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-09T04:07:56.824Z · LW(p) · GW(p)

incentives to "beat the other team" don't seem nearly as strong

You mean it's stronger for nukes than for AI? I think I disagree, but it's a bit nuanced. It seems to me (as someone very ignorant about nukes) like with current nuclear tech you hit diminishing returns pretty fast, but I don't expect that to be the case for AI.

Also, I'm curious if weaponization of AI is a crux for us.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-09T21:57:22.675Z · LW(p) · GW(p)

I'm uncertain about weaponization of AI (and did say "if we ignore military applications" in the OP).

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-14T16:28:43.404Z · LW(p) · GW(p)

Oops, missed that, sry.

↑ comment by FactorialCode · 2019-08-06T03:18:16.656Z · LW(p) · GW(p)

I agree that the coordination games between nukes and AI are different, but I still think that nukes make for a good analogy, just not after multiple parties have developed them. Rather, I think the key element of the analogy is the game-changing and decisive strategic advantage that nukes/AI grant once one party develops them. There aren't too many other technologies that have that property (maybe the Bronze Age to Iron Age transition?).

Where the analogy breaks down is with AI safety. If we get AI safety wrong there's a risk of large permanent negative consequences. A better analogy might be living near the end of WW2, but if you build a nuclear bomb incorrectly, it ignites the atmosphere and destroys the world.

In either case, under this model, you end up with the following outcomes:

(A): Either party incorrectly develops the technology
(B): The other party successfully develops the technology
(C): My party successfully develops the technology

and generally a preference ordering of A<B<C, although a sufficiently cynical actor might have B<A<C.

If there's a sufficiently shallow trade-off between speed of development and the risk of error, this can lead to a dollar-auction-like dynamic where each party is incentivized to trade a bit more risk in order to develop the technology first. In a symmetric situation without coordination, the ~~equilibrium~~ Nash equilibrium is all parties advancing as quickly as possible to develop the technology and throwing caution to the wind.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-06T18:30:02.216Z · LW(p) · GW(p)

In a symmetric situation without coordination, the equilibrium is all parties advancing as quickly as possible to develop the technology and throwing caution to the wind.

Really? It seems like if I've raised my risk level to 99% and the other team has raised their risk level to 98% (they are slightly behind), one great option for me is to commit not to developing the technology and let the other team develop the technology at risk level ~1%. This gets me an expected utility of 0.99B + 0.01A, which is probably better than the 0.01C + 0.99A that I would otherwise have gotten (assuming I developed the technology first).
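The comparison above can be checked numerically. This is a minimal sketch: the utility values for A, B and C are hypothetical placeholders chosen only to respect the ordering A < B < C from the parent comment, and `expected_utility` is an illustrative helper, not anything from the thread.

```python
def expected_utility(p_failure: float, u_success: float, u_failure: float) -> float:
    """Expected utility when development goes wrong with probability p_failure."""
    return (1 - p_failure) * u_success + p_failure * u_failure


# Hypothetical utilities, chosen only so that A < B < C:
# A = someone develops the technology incorrectly,
# B = the other party succeeds, C = my party succeeds.
A, B, C = 0.0, 0.8, 1.0

race_ahead = expected_utility(0.99, C, A)  # 0.01*C + 0.99*A
stand_down = expected_utility(0.01, B, A)  # 0.99*B + 0.01*A

print(f"race ahead: {race_ahead:.3f}, stand down: {stand_down:.3f}")
```

With these risk levels, standing down wins whenever B - A > (C - A) / 99, so the conclusion holds unless the other party's success is valued almost as low as a failure.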

I am assuming common knowledge here, but I am not assuming coordination. See also OpenAI Charter.

Replies from: FactorialCode

↑ comment by FactorialCode · 2019-08-06T20:52:22.953Z · LW(p) · GW(p)

Interesting. I had the Nash equilibrium in mind, but it's true that unlike a dollar auction, you can de-escalate, and when you take into account how your opponent will react to you changing your strategy, doing so becomes viable. But then you end up with something like a game of chicken, where ideally, you want to force your opponent to de-escalate first, as this tilts the outcomes toward option C rather than B.

↑ comment by Robert Miles (robert-miles) · 2019-08-27T16:27:40.529Z · LW(p) · GW(p)

Yeah, nuclear power is a better analogy than weapons, but I think the two are linked, and the link itself may be a useful analogy, because risk/coordination is affected by the dual-use nature of some of the technologies.

One thing that makes non-proliferation difficult is that nations legitimately want nuclear facilities because they want to use nuclear power, but 'rogue states' that want to acquire nuclear weapons will also claim that this is their only goal. How do we know who really just wants power plants?

And power generation comes with its own risks. Can we trust everyone to take the right precautions, and if not, can we paternalistically restrict some organisations or states that we deem not capable enough to be trusted with the technology?

AI coordination probably has these kinds of problems to an even greater degree.

↑ comment by Dagon · 2019-08-05T16:19:21.256Z · LW(p) · GW(p)

Opposition to and heavy regulation of nuclear reactors is mostly about accidents, not weapons (though at least some of the effort into tracking the material is about weapons). Everyone agrees we don't want accidents; not everyone agrees how much we should give up to prevent 100% of accidents. We have, in fact, had significant accidents.

Also, accidents with weapons are definitely a thing. Human regulation and cooperation is unsolved, so even the difference between accident and intent is somewhat hard to define for many group activities.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-05T17:17:36.457Z · LW(p) · GW(p)

I agree with this; I'm not sure what point you're trying to make?

Perhaps you're suggesting that the fact that it's accident risk rather than weapons risk doesn't mean that we're safe, in which case I agree. I'm only suggesting that people stop using the analogy to nukes because it's misleading; I'm not saying that there's no risk as a result.

↑ comment by Matthew Barnett (matthew-barnett) · 2019-08-06T01:52:56.304Z · LW(p) · GW(p)

I don't see any analog to mutually assured destruction, which seems like a pretty key feature with nukes.

Perhaps the appropriate analogy here would be two teams which both say "The other team is going to get to AI first if we don't, and we prefer misalignment to losing, so we might as well push ahead." The disanalogy here is that it's not adversarial in the sense of being destructive (although it could be if they are enemies). But it's analogous in the sense that they could either both decide to do nothing, or both decide to take the action. If they decide to take the action, they will both ensure their own destruction in the case of misalignment.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-06T18:21:45.656Z · LW(p) · GW(p)

This still feels more analogous to Chernobyl? "The other team is going to get cheap nuclear energy first if we don't, and we prefer a nuclear accident to losing, so we might as well push ahead."

You might argue that obviously it doesn't matter very much who gets nuclear energy first, so this wouldn't apply. I'd respond that the benefit : cost ratio here seems similar to the benefit : cost ratio for AI where the benefit is "we build a singleton" and the cost is "misaligned AGI causes extinction". Surely it's significantly better for the other team to win and build a singleton than for you to build a misaligned AGI?

(Separately, I think I would argue that the "we build a singleton" case is unlikely, but that's not a crucial part of this argument.)

comment by Rohin Shah (rohinmshah) · 2019-08-05T02:01:06.303Z · LW(p) · GW(p)

It seems to me that many people believe something like "We need proof-level guarantees, or something close to it, before we build powerful AI". I could interpret this in two different ways:

Normative claim: "Given how bad extinction is, and the plausibility of AI x-risk, it would be irresponsible of us to build powerful AI before having proof-level guarantees that it will be beneficial".
Empirical claim: "If we run a powerful AI system without having something like a proof of the statement 'running this AI system will be beneficial', then catastrophe is nearly inevitable".

I am uncertain on the normative claim (there might be great benefits to building powerful AI sooner, including the reduction of other x-risks), and disagree with the empirical claim.

If I had to argue briefly for the empirical claim, it would go something like this: "Since powerful AI will be world-changing, it will either be really good, or really bad -- neutral impact is too implausible. But due to fragility of value, the really bad outcomes are far more likely. The only way to get enough evidence to rule out all of the bad outcomes is to have a proof that the AI system is beneficial". I'd probably agree with this if we had to create a utility function and give it to a perfect expected utility maximizer (and we couldn't just give it something trivial like the zero utility function), but that seems to be drastically cutting down our options.

So I'm curious: a) are there any people who believe the empirical claim? b) If so, what are your arguments for it? c) How tractable do you think it is to get proof-level guarantees about AI?

Replies from: abramdemski, Wei_Dai, vanessa-kosoy, johnswentworth, capybaralet, capybaralet

↑ comment by abramdemski · 2019-08-06T17:03:48.278Z · LW(p) · GW(p)

My thoughts: we can't really expect to prove something like "this AI will be beneficial". However, relying on empiricism to test our algorithms is very likely to fail, because it's very plausible that there's a discontinuity in behavior around the region of human-level generality of intelligence (specifically as we move to the upper end, where the system can understand things like the whole training regime and its goal systems). So I don't know how to make good guesses about the behavior of very capable systems except through mathematical analysis.

There are two overlapping traditions in machine learning. There's a heavy empirical tradition, in which experimental methodology is used to judge the effectiveness of algorithms along various metrics. Then, there's machine learning theory (computational learning theory), in which algorithms are analyzed mathematically and properties are proven. This second tradition seems far more applicable to questions of safety.

(But we should not act as if we only have one historical example of a successful scientific field to try and generalize from. We can also look at how other fields accomplish difficult things, especially in the face of significant risks.)

Replies from: capybaralet, rohinmshah

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-07T08:08:47.232Z · LW(p) · GW(p)

I don't think you need to posit a discontinuity to expect tests to occasionally fail.

I suspect the crux is more about how bad a single failure of a sufficiently advanced AI is likely to be.

I'll admit I don't feel like I really understand the perspective of people who seem to think we'll be able to learn how to do alignment via trial-and-error (i.e. tolerating multiple failures). Here are some guesses why people might hold that sort of view:

We'll develop AI in a well-designed box, so we can do a lot of debugging and stress testing.

counter-argument: but the concern is about what happens at deployment time

We'll deploy AI in a box too, then.

counter: seems like that entails a massive performance hit (but it's not clear if that's actually the case)

We'll have other "AI police" to stop any "evil AIs" that "go rogue" (just like we have for people).

counter: where did the AI police come from, and why can't they go rogue as well?

The "AI police" can just be the rest of the AIs in the world ganging up on anyone who goes rogue.

counter: this seems to be assuming the "corrigibility as basin of attraction" argument (which has no real basis beyond intuition ATM, AFAIK) at the level of the population of agents.

A single failure isn't likely to be that bad, it would take a series of unlikely failures to take a safe (e.g. "satiable") AI and make it an insatiable "open ended optimizer AI".

counter: we can't assume that we can detect and correct failures, especially in real-world deployment scenarios where subagents might be created. So the failures may have time to compound. It also seems possible that a single failure is all that's needed; this seems like an open question

OK I could go on, but I'd rather actually hear from anyone who has this view! :)

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-07T19:02:42.376Z · LW(p) · GW(p)

I hold this view; none of those are reasons for my view. The reason is much simpler -- before x-risk-level failures, we'll see less catastrophic (but still potentially very bad) failures for the same underlying reason. We'll notice this, understand it, and fix the issue.

(A crux I expect people to have is whether we'll actually fix the issue or "apply a bandaid" that is only a superficial fix.)

Replies from: abramdemski, capybaralet

↑ comment by abramdemski · 2019-08-07T21:21:53.371Z · LW(p) · GW(p)

Yeah, this is why I think some kind of discontinuity is important to my case. I expect different kinds of problems to arise with very very capable systems. So I don't see why it makes sense to expect smaller problems to arise first which indicate the potential larger problems and allow people to avert them before they occur.

If a case could be made that all potential problems with very very capable systems could be expected to first arise in survivable forms in moderately capable systems, then I would see how the more empirical style of development could give rise to safe systems.

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-09T04:14:07.799Z · LW(p) · GW(p)

Can you elaborate on what kinds of problems you expect to arise pre vs. post discontinuity?

E.g. will we see "sinister stumbles" (IIRC this was Adam Gleave's name for half-baked treacherous turns)? I think we will, FWIW.

Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)

How about mesa-optimization? (I think we already see qualitatively similar phenomena, but my idea of this doesn't emphasize the "optimization" part.)

Jessica's posts about MIRI vs. Paul's views made it seem like MIRI might be quite concerned about the first AGI arising via mesa-optimization. This seems likely to me, and would also be a case where I'd expect, unless ML becomes "woke" to mesa-optimization (which seems likely to happen, and not too hard to make happen, to me), we'd see something that *looks* like a discontinuity, but is *actually* more like "the same reason".

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2019-10-01T00:04:49.551Z · LW(p) · GW(p)

Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)

This in particular doesn't match my model. Quoting some relevant bits from Embedded Agency [LW · GW]:

So I'm not talking about agents who know their own actions because I think there's going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the consequences of your actions—a confusion which shows up even in the very simple case where everything about the world is known and you just need to choose the larger pile of money.

[...]

But it’s not that I’m imagining real-world embedded systems being “too Bayesian” and this somehow causing problems, if we don’t figure out what’s wrong with current models of rational agency. It’s certainly not that I’m imagining future AI systems being written in second-order logic! In most cases, I’m not trying at all to draw direct lines between research problems and specific AI failure modes.

What I’m instead thinking about is this: We sure do seem to be working with the wrong basic concepts today when we try to think about what agency is, as seen by the fact that these concepts don’t transfer well to the more realistic embedded framework.

This is also the topic of The Rocket Alignment Problem.

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-09T16:00:03.586Z · LW(p) · GW(p)

Interesting. Your crux seems good; I think it's a crux for us. I expect things play out more like Eliezer predicts here: https://www.facebook.com/jefftk/posts/886930452142?comment_id=886983450932&comment_tracking=%7B%22tn%22%3A%22R%22%7D&hc_location=ufi

I also predict that there will be types of failure we will not notice, or will misinterpret. It seems fairly likely to me that proto-AGI (i.e. AI that could autonomously learn to become AGI within <~10yrs of acting in the real world) is deployed and creates proto-AGI subagents, some of which we don't become aware of (e.g. because of accidental/incidental/deliberate steganography) and/or are unable to keep track of. And then those continue to survive and reproduce, etc... I guess this only seems plausible if the proto-AGI has a hospitable environment (like the internet, human brains/memes) and/or means of reproduction in the real world.

A very similar problem would be a form of longer-term "seeding", where an AI (at any stage) with a sufficiently advanced model of the world and long horizons discovers strategies for increasing the chances ("at the margin") that its values dominate in the long-term future. With my limited knowledge of physics, I imagine there might be ways of doing this just by beaming signals into space in a way calculated to influence/spur the development of life/culture in other parts of the galaxy.

I notice a lot of what I said above makes less sense if you think of AIs as having a similar skill profile to humans, but I think we agree that AIs might be much more advanced than people in some respects while still falling short of AGI because of weaknesses in other areas.

That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI's (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy. To pull an example (not meant to be realistic) out of a hat: we might have AIs that can't carry on a conversation, but can implement a very sophisticated covert world domination strategy.

Replies from: aleksi-liimatainen, rohinmshah

↑ comment by Aleksi Liimatainen (aleksi-liimatainen) · 2019-08-09T16:22:20.913Z · LW(p) · GW(p)

It seems fairly likely to me proto-AGI (i.e. AI that could autonomously learn to become AGI within <~10yrs of acting in the real world) is deployed and creates proto-AGI subagents, some of which we don't become aware of (e.g. because accidental/incidental/deliberate steganography) and/or are unable to keep track of. And then those continue to survive and reproduce, etc...

Now I'm wondering if it makes sense to model past or present cognitive-cultural information processes in a similar fashion. Memetic and cultural evolutions are a thing and any agentlike processes that spawn could piggyback on our existing general intelligence architecture.

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-15T04:33:02.136Z · LW(p) · GW(p)

Yeah, I think it totally does! (and that's a very interesting / "trippy" line of thought :D)

However, it does seem to me somewhat unlikely, since it does require fairly advanced intelligence, and I don't think evolution is likely to have produced such advanced intelligence with us being totally unaware, whereas I think something about the way we train AI is more strongly selecting for "savant-like" intelligence, which is sort of what I'm imagining here. I can't think of why I have that intuition OTTMH.

↑ comment by Rohin Shah (rohinmshah) · 2019-08-09T22:08:47.801Z · LW(p) · GW(p)

That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI's (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy.

Nobody denies that AI is really good at extracting patterns out of statistical data (e.g. image classification, speech-to-text, and so on), even though AI is absolutely terrible at many "easy" things. This, and the linked comment from Eliezer, seem to be drastically underselling the competence of AI researchers. (I could imagine it happening with strong enough competitive pressures though.)

I also predict that there will be types of failure we will not notice, or will misinterpret. [...]

All of this assumes some very good long-term planning capabilities. I expect long-term planning to be one of the last capabilities that AI systems get. If I thought they would get them early, I'd be more worried about scenarios like these.

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-15T04:29:40.756Z · LW(p) · GW(p)

So I don't take EY's post as about AI researchers' competence, as much as their incentives and levels of rationality and paranoia. It does include significant competitive pressures, which seems realistic to me.

I don't think I'm underestimating AI researchers, either, but for a different reason... let me elaborate a bit: I think there are waaaaaay too many skills for us to hope to have a reasonable sense of what an AI is actually good at. By skills I'm imagining something more like options, or having accurate generalized value functions (GVFs), than tasks.

Regarding long-term planning, I'd factor this into 2 components:

1) having a good planning algorithm

2) having a good world model

I think the way long-term planning works is that you do short-term planning in a good hierarchical world model. I think AIs will have vastly superhuman planning algorithms (arguably, they already do), so the real bottleneck is the world-model.

I don't think it's necessary to have a very "complete" world-model (i.e. enough knowledge to look smart to a person) in order to find "steganographic" long-term strategies like the ones I'm imagining.

I also don't think it's even necessary to have anything that looks very much like a world-model. The AI can just have a few good GVFs.... (i.e. be some sort of savant).

↑ comment by Rohin Shah (rohinmshah) · 2019-08-06T18:38:38.400Z · LW(p) · GW(p)

I don't think the only alternative to proof is empiricism. Lots of people reason about evolutionary biology/psychology with neither proof nor empiricism. The mesa optimizers paper involves neither proof nor empiricism.

it's very plausible that there's a discontinuity in behavior around the region of human-level generality of intelligence (specifically as we move to the upper end, where the system can understand things like the whole training regime and its goal systems)

You can also be empirical at that point though? I suppose you couldn't be empirical if you expect either an extremely fast takeoff (i.e. order one day or less) or an inability on our part to tell when the AI reaches human-level, but this seems overly pessimistic to me.

Replies from: abramdemski

↑ comment by abramdemski · 2019-08-07T21:44:46.167Z · LW(p) · GW(p)

The mesa-optimizer paper, along with some other examples of important intellectual contributions to AI alignment, have two important properties:

They are part of a research program, not an end result. Rough intuitions can absolutely be a useful guide which (hopefully eventually) helps us figure out what mathematical results are possible and useful.
They primarily point at problems rather than solutions. Because (it seems to me) existential risk seems asymmetrically bad in comparison to potential technology upsides (large as upsides may be), I just have different standards of evidence for "significant risk" vs "significant good". IE, an argument that there is a risk can be fairly rough and nonetheless be sufficient for me to "not push the button" (in a hypothetical where I could choose to turn on a system today). On the other hand, an argument that pushing the button is net positive has to be actually quite strong. I want there to be a small set of assumptions, each of which individually seems very likely to be true, which taken together would be a guarantee against catastrophic failure.

[This is an "or" condition -- either one of those two conditions suffices for me to take vague arguments seriously.]

On the other hand, I agree with you that I set up a false dichotomy between proof and empiricism. Perhaps a better model would be a spectrum between "theory" and empiricism. Mathematical arguments are an extreme point of rigorous theory. Empiricism realistically comes with some amount of theory no matter what. And you could also ask for a "more of both" type approach, implying a 2d picture where they occupy separate dimensions.

Still, though, I personally don't see much of a way to gain understanding about failure modes of very very capable systems using empirical observation of today's systems. I especially don't see an argument that one could expect all failure modes of very very capable systems to present themselves first in less-capable systems.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-08T03:20:05.976Z · LW(p) · GW(p)

Because (it seems to me) existential risk seems asymmetrically bad in comparison to potential technology upsides (large as upsides may be), I just have different standards of evidence for "significant risk" vs "significant good".

This is a normative argument, not an empirical one. The normative position seems reasonable to me, though I'd want to think more about it (I haven't because it doesn't seem decision-relevant).

I especially don't see an argument that one could expect all failure modes of very very capable systems to present themselves first in less-capable systems.

The quick version is that to the extent that the system is adversarially optimizing against you, it had to at some point learn that that was a worthwhile thing to do, which we could notice. (This is assuming that capable systems are built via learning; if not then who knows what'll happen.)

Replies from: abramdemski

↑ comment by abramdemski · 2019-08-08T21:51:07.067Z · LW(p) · GW(p)

I am confused about how the normative question isn't decision-relevant here. Is it that I have a model where it is the relevant question, but you have one where it isn't? To be hopefully clear: I'm applying this normative claim to argue that proof is needed to establish the desired level of confidence. That doesn't mean direct proof of the claim "the AI will do good", but rather of supporting claims, perhaps involving the learning-theoretic properties of the system (putting bounds on errors of certain kinds) and such.

It's possible that this isn't my true disagreement, because actually the question seems more complicated than just a question of how large potential downsides are if things go poorly in comparison to potential upsides if things go well. But some kind of analysis of the risks seems relevant here -- if there weren't such large downside risks, I would have lower standards of evidence for claims that things will go well.

The quick version is that to the extent that the system is adversarially optimizing against you, it had to at some point learn that that was a worthwhile thing to do, which we could notice. (This is assuming that capable systems are built via learning; if not then who knows what'll happen.)

It sounds like we would have to have a longer discussion to resolve this. I don't expect this to hit the mark very well, but here's my reply to what I understand:

I don't see how you can be confident enough of that view for it to be how you really want to check.
A system can be optimizing a fairly good proxy, so that at low levels of capability it is highly aligned, but this falls apart as the system becomes highly capable and figures out "hacks" around the "usual interpretation" of the proxy.

I also note that it seems like we disagree both about how useful proofs will be and about how useful empirical investigations will be (keeping in mind that those aren't the only two things in the universe). I'm not sure which of those two disagreements is more important here.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-08T22:35:48.478Z · LW(p) · GW(p)

To be hopefully clear: I'm applying this normative claim to argue that proof is needed to establish the desired level of confidence.

Under my model, it's overwhelmingly likely that regardless of what we do AGI will be deployed with less than the desired level of confidence in its alignment. If I personally controlled whether or not AGI was deployed, then I'd be extremely interested in the normative claim. If I then agreed with the normative claim, I'd agree with:

proof is needed to establish the desired level of confidence. That doesn't mean direct proof of the claim "the AI will do good", but rather of supporting claims, perhaps involving the learning-theoretic properties of the system (putting bounds on errors of certain kinds) and such.

I don't see how you can be confident enough of that view for it to be how you really want to check.

If I want >99% confidence, I agree that I couldn't be confident enough in that argument.

A system can be optimizing a fairly good proxy, so that at low levels of capability it is highly aligned, but this falls apart as the system becomes highly capable and figures out "hacks" around the "usual interpretation" of the proxy.

Yeah, the hope here would be that the relevant decision-makers are aware of this dynamic (due to previous situations in which e.g. a recommender system optimized the fairly good proxy of clickthrough rate but this led to "hacks" around the "usual interpretation"), and have some good reason to think that it won't happen with the highly capable system they are planning to deploy.
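The proxy dynamic described above can be illustrated with a toy sketch. Everything here is a made-up assumption for illustration: the action names, the numbers, and the idea of modelling "capability" as how many candidate actions the optimizer gets to consider. It only shows the shape of the failure, where the same proxy looks aligned under a weak search and comes apart under a stronger one.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Action:
    name: str
    proxy: float       # what the system optimizes, e.g. clickthrough rate
    true_value: float  # what the designers actually care about (hypothetical)


# Hypothetical action space: early options are "ordinary" ones where proxy and
# true value agree; later options are "hacks" that only a wider search finds.
ACTIONS: List[Action] = [
    Action("relevant article", proxy=0.30, true_value=0.30),
    Action("great article", proxy=0.50, true_value=0.50),
    Action("mediocre article", proxy=0.20, true_value=0.20),
    Action("clickbait headline", proxy=0.80, true_value=0.10),
    Action("outrage bait", proxy=0.95, true_value=-0.50),
]


def optimize_proxy(actions: List[Action], capability: int) -> Action:
    """Return the proxy-maximizing action among the first `capability` options."""
    considered = actions[:capability]
    return max(considered, key=lambda a: a.proxy)


weak = optimize_proxy(ACTIONS, capability=3)
strong = optimize_proxy(ACTIONS, capability=5)
print(weak.name, weak.true_value)      # great article 0.5
print(strong.name, strong.true_value)  # outrage bait -0.5
```

Observing only the weak optimizer, the proxy looks fine; nothing in its behavior reveals the options that the stronger search will reach.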

I also note that it seems like we disagree both about how useful proofs will be and about how useful empirical investigations will be

Agreed. It also might be that we disagree on the tractability of proofs in addition to / instead of the utility of proofs.

↑ comment by Wei Dai (Wei_Dai) · 2019-08-06T10:52:14.469Z · LW(p) · GW(p)

Not sure who you have in mind as people believing this, but after searching both LW and Arbital, the closest thing I've found to a statement of the empirical claim is from Eliezer's 2012 Reply to Holden on ‘Tool AI’ [LW · GW]:

I’ve repeatedly said that the idea behind proving determinism of self-modification isn’t that this guarantees safety, but that if you prove the self-modification stable the AI might work, whereas if you try to get by with no proofs at all, doom is guaranteed.

Paul Christiano argued against this at length in Stable self-improvement as an AI safety problem, concluding as follows:

But I am not yet convinced that stable self-improvement is an especially important problem for AI safety; I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty. This suggests that (1) it will probably be resolved en route to human-level AI, (2) it can probably be “safely” delegated to a human-level AI.

Note that the above talked about "stable self-modification" instead of ‘running this AI system will be beneficial’, and the former is a much narrower and easier to formalize concept than the latter. I haven't really found a serious proposal to try to formalize and prove the latter kind of statement.

IMO, formalizing ‘running this AI system will be beneficial’ is itself an informal and error-prone process, where the only way to gain confidence in its correctness is for many competent researchers to try and fail to find flaws in the formalization. Instead of doing that, one could gain confidence in the AI's safety by directly trying to find flaws (considered informally) in the AI design, and trying to prove or demonstrate via empirical testing narrower safety-relevant statements like "stable self-modification", and given enough resources perhaps reach a similar level of confidence. (So the empirical statement doesn't seem to make sense as written.)

The former still has the advantage that the size of the thing that might be flawed is much smaller (i.e., just the formalization of ‘running this AI system will be beneficial’ instead of the whole AI design), but it has the disadvantage that finding a proof might be very costly both in terms of research effort and in terms of additional constraint on AI design (to allow for a proof) making the AI less competitive. Overall, it seems like it's too early to reach a strong conclusion one way or another as to which approach is more advisable.

Replies from: FactorialCode, rohinmshah

↑ comment by FactorialCode · 2019-08-06T12:40:16.533Z · LW(p) · GW(p)

At some point, there was definitely discussion about formal verification of AI systems. At the very least, this MIRIx event seems to have been about the topic.

From Safety Engineering for Artificial General Intelligence:

An AI built in the Artificial General Intelligence paradigm, in which the design is engineered de novo, has the advantage over humans with respect to transparency of disposition, since it is able to display its source code, which can then be reviewed for trustworthiness (Salamon, Rayhawk, and Kramár 2010; Sotala 2012). Indeed, with an improved intelligence, it might find a way to formally prove its benevolence. If weak early AIs are incentivized to adopt verifiably or even provably benevolent dispositions, these can be continually verified or proved and thus retained, even as the AIs gain in intelligence and eventually reach the point where they have the power to renege without retaliation (Hall 2007a).

Also, from section 2 of Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda:

When constructing intelligent systems which learn and interact with all the complexities of reality, it is not sufficient to verify that the algorithm behaves well in test settings. Additional work is necessary to verify that the system will continue working as intended in application. This is especially true of systems possessing general intelligence at or above the human level: superintelligent machines might find strategies and execute plans beyond both the experience and imagination of the programmers, making the clever oscillator of Bird and Layzell look trite. At the same time, unpredictable behavior from smarter-than-human systems could cause catastrophic damage, if they are not aligned with human interests (Yudkowsky 2008). Because the stakes are so high, testing combined with a gut-level intuition that the system will continue to work outside the test environment is insufficient, even if the testing is extensive. It is important to also have a formal understanding of precisely why the system is expected to behave well in application. What constitutes a formal understanding? It seems essential to us to have both (1) an understanding of precisely what problem the system is intended to solve; and (2) an understanding of precisely why this practical system is expected to solve that abstract problem. The latter must wait for the development of practical smarter-than-human systems, but the former is a theoretical research problem that we can already examine.

I suspect that this approach has fallen out of favor as ML algorithms have gotten more capable while our ability to prove anything useful about those algorithms has heavily lagged behind. Although DeepMind and a few others are still trying.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2019-08-06T18:30:47.722Z · LW(p) · GW(p)

MIRIx events are funded by MIRI, but we don't decide the topics or anything. I haven't taken a poll of MIRI researchers to see how enthusiastic different people are about formal verification, but AFAIK Nate and Eliezer don't see it as super relevant. See https://www.lesswrong.com/posts/xCpuSfT5Lt6kkR3po/my-take-on-agent-foundations-formalizing-metaphilosophical#cGuMRFSi224RCNBZi [LW(p) · GW(p)] and the idea of a "safety-story" in https://www.lesswrong.com/posts/8gqrbnW758qjHFTrH/security-mindset-and-ordinary-paranoia [LW · GW] for better attempts to characterize what MIRI is looking for.

ETA: From the end of the latter dialogue,

In point of fact, the real reason the author is listing out this methodology is that he's currently trying to do something similar on the problem of aligning Artificial General Intelligence, and he would like to move past “I believe my AGI won't want to kill anyone” and into a headspace more like writing down statements such as “Although the space of potential weightings for this recurrent neural net does contain weight combinations that would figure out how to kill the programmers, I believe that gradient descent on loss function L will only access a result inside subspace Q with properties P, and I believe a space with properties P does not include any weight combinations that figure out how to kill the programmer.”

Though this itself is not really a reduced statement and still has too much goal-laden language in it.

Rather than putting the emphasis on being able to machine-verify all important properties of the system, this puts the emphasis on having strong technical insight into the system; I usually think of formal proofs more as a means to that end. (Again caveating that some people at MIRI might think of this differently.)

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2019-08-06T18:36:35.494Z · LW(p) · GW(p)

Also the discussion of deconfusion research in https://intelligence.org/2018/11/22/2018-update-our-new-research-directions/ and https://www.lesswrong.com/posts/Gg9a4y8reWKtLe3Tn/the-rocket-alignment-problem [LW · GW] , and the sketch of 'why this looks like a hard problem in general' in https://www.lesswrong.com/posts/zEvqFtT4AtTztfYC4/optimization-amplifies [LW · GW] and https://arbital.com/p/aligning_adds_time/ .

↑ comment by Rohin Shah (rohinmshah) · 2019-08-06T18:44:33.787Z · LW(p) · GW(p)

Not sure who you have in mind as people believing this

I don't have particular people in mind, it's more of a general "vibe" I get from talking to people. In the past, when I've stated the empirical claim, some people agreed with it, but upon further discussion it turned out they actually agreed with the normative claim. Hence my first question, which was to ask whether or not people believe the empirical claim.

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2019-08-09T17:02:34.977Z · LW(p) · GW(p)

a) I believe a weaker version of the empirical claim, namely that the catastrophe is not nearly inevitable but not unlikely. That is, I can imagine different worlds in which the probability of the catastrophe is different, and I have uncertainty over which world we are actually in, s.t. on average the probability is sizable.

b) I think that the argument you gave is sort of correct. We need to augment it by: the minimal requirement from the AI is, it needs to effectively block all competing dangerous AI projects, without also doing bad things (which is why you can't just give it the zero utility function). Your counterargument seems weak to me because moving from utility maximizers to other types of AIs is just replacing something that is relatively easy to reason about with something that is harder to reason about, thereby obscuring the problems (that are still there). I think that whatever your AI is, given that it satisfies the minimal requirement, some kind of utility-maximization-like behavior is likely to arise.

Coming at it from a different angle, complicated systems often fail in unexpected ways. The way people solve this problem in practice is by a combination of mathematical analysis and empirical research. I don't think we have many examples of complicated systems where all failures were avoided by informal reasoning without either empirical or mathematical backing. In the case of superintelligent AI, empirical research alone is insufficient because, without mathematical models, we don't know how to extrapolate empirical results from current AIs to superintelligent AIs, and when superintelligent algorithms are already here it will probably be too late.

c) I think what we can (and should) realistically aim for is, having a mathematical theory of AI, and having a mathematical model of our particular AI, such that in this model we can prove the AI is safe. This model will have some assumptions and parameters that will need to be verified/measured in other ways, through some combination of (i) experiments with AI/algorithms (ii) learning from neuroscience (iii) learning from biological evolution and (iv) leveraging our knowledge of physics. Then, there is also the question of, how precise is the correspondence between the model and the actual code (and hardware). Ideally, we want to do formal verification in which we can test that a certain theorem holds for the actual code we are running. Weaker levels of correspondence might still be sufficient, but that would be Plan B.

Also, the proof can rely on mathematical conjectures in which we have high confidence, such as $P \neq NP$. Of course, the evidence for such conjectures is (some sort of) empirical, but it is important that the conjecture is at least a rigorous, well defined mathematical statement.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-09T22:26:05.339Z · LW(p) · GW(p)

I agree with a). c) seems to me to be very optimistic, but that's mostly an intuition, I don't have a strong argument against it (and I wouldn't discourage people who are enthusiastic about it from working on it).

The argument in b) makes sense; I think the part that I disagree with is:

moving from utility maximizers to other types of AIs is just replacing something that is relatively easy to reason about with something that is harder to reason about, thereby obscuring the problems (that are still there).

The counterargument is "current AI systems don't look like long term planners", but of course it is possible to respond to that with "AGI will be very different from current AI systems", and then I have nothing to say beyond "I think AGI will be like current AI systems".

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2019-08-10T10:52:17.795Z · LW(p) · GW(p)

Well, any system that satisfies the Minimal Requirement is doing long term planning on some level. For example, if your AI is approval directed, it still needs to learn how to make good plans that will be approved. Once your system has a superhuman capability of producing plans somewhere inside, you should worry about that capability being applied in the wrong direction (in particular due to mesa-optimization / daemons). Also, even without long term planning, extreme optimization is dangerous (for example an approval directed AI might create some kind of memetic supervirus).

But, I agree that these arguments are not enough to be confident of the strong empirical claim.

↑ comment by johnswentworth · 2019-08-05T02:20:50.444Z · LW(p) · GW(p)

I believe the empirical claim. As I see it, the main issue is Goodhart: an AGI is probably going to be optimizing something, and open-ended optimization tends to go badly. The main purpose of proof-level guarantees is to make damn sure that the optimization target is safe. (You might imagine something other than a utility-maximizer, but at the end of the day it's either going to perform open-ended optimization of something, or be not very powerful.)

The best analogy here is something like an unaligned wish-granting genie/demon [? · GW]. You want to be really careful about wording that wish, and make sure it doesn't have any loopholes.

I think the difficulty of getting those proof-level guarantees is more conceptual than technical: the problem is that we don't have good ways to rigorously express many of the core ideas, e.g. the idea that physical systems made of atoms can "want" things. Once the core problems of embedded agency [? · GW] are resolved, I expect the relevant guarantees will not be difficult.

Replies from: rohinmshah, capybaralet

↑ comment by Rohin Shah (rohinmshah) · 2019-08-05T03:40:23.791Z · LW(p) · GW(p)

Does it make a difference if the optimization target is itself being learned?

What if we have intuitive arguments + tests that suggest that the optimization target is safe?

Replies from: johnswentworth, donald-hobson

↑ comment by johnswentworth · 2019-08-05T16:55:19.110Z · LW(p) · GW(p)

Still unsafe, in both cases.

The second case is simpler. Think about it in analogy to a wish-granting genie/demon: if we have some intuitive argument that our wish-contract is safe and a few human-designed tests, do we really expect it to have no loopholes exploitable by the genie/demon? I certainly wouldn't bet on it. The problem here is that the AI is smarter than we are, and can find loopholes we will not think of.

The first case is more subtle, because most of the complexity is hidden under a human-intuitive abstraction layer. If we had an unaligned genie/demon and said "I wish for you to passively study me for a year, learn what would make me most happy, and then give me that", then that might be a safe wish - assuming the genie/demon already has an appropriate understanding of what "happy" means, including things like long-term satisfaction etc. But an AI will presumably not start with such an understanding out the gate. Abstractly, the AI can learn its optimization target, but in order to do that it needs a learning target - the thing it's trying to learn. And that learning target is itself what needs to be aligned. If we want the AI to learn what makes humans "happy", in a safe way, then whatever it's using as a proxy for "happiness" needs to be a safe optimization target.

On a side note, Yudkowsky's "The Hidden Complexity of Wishes [? · GW]" is in many ways a better explanation of what I'm getting at. The one thing it doesn't explain is how "more powerful" in the sense of "ability to grant more difficult wishes" translates into a more powerful optimizer. But that's a pretty easy jump to make: wishes require satisficing, so we use the usual approach of a two-valued utility function.
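As a rough illustration of the "two-valued utility function" framing, here is a minimal sketch. The wishes, the candidate space, and the `budget` parameter (standing in for optimizer power) are all hypothetical choices made for the example, not anything from the linked post. A wish becomes a utility that is 1 if the wish holds and 0 otherwise, and a satisficer returns the first outcome it finds with utility 1.

```python
from itertools import islice
from typing import Callable, Iterable, Optional


def two_valued_utility(wish: Callable[[int], bool]) -> Callable[[int], float]:
    """Turn a wish predicate into a utility: 1.0 if the wish holds, else 0.0."""
    return lambda outcome: 1.0 if wish(outcome) else 0.0


def grant(utility: Callable[[int], float],
          candidates: Iterable[int],
          budget: int) -> Optional[int]:
    """Satisfice: return the first of at most `budget` candidates with utility 1.0.

    `budget` is a crude stand-in for optimizer power (how much of the
    outcome space the system can search). Returns None if nothing is found.
    """
    for outcome in islice(candidates, budget):
        if utility(outcome) == 1.0:
            return outcome
    return None


easy_wish = two_valued_utility(lambda o: o % 2 == 0)    # many outcomes qualify
hard_wish = two_valued_utility(lambda o: o % 997 == 0)  # few outcomes qualify

print(grant(easy_wish, range(1, 10_000), budget=10))    # 2
print(grant(hard_wish, range(1, 10_000), budget=10))    # None
print(grant(hard_wish, range(1, 10_000), budget=1000))  # 997
```

In this toy setup the harder wish is only granted once the search budget is large enough, which is the sense in which "more difficult wishes" call for a stronger optimizer.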

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-05T17:12:24.263Z · LW(p) · GW(p)

if we have some intuitive argument that our wish-contract is safe and a few human-designed tests, do we really expect it to have no loopholes exploitable by the genie/demon?

I wasn't imagining just input-output tests in laboratory conditions, which I agree are insufficient. I was thinking of studying counterfactuals, e.g. what the optimization target would suggest doing under the hypothetical scenario where it has lots of power. Alternatively, you could imagine tests of the form "pose this scenario, and see how the AI thinks about it", e.g. to see whether the AI runs a check for whether it can deceive humans. (Yes, this assumes strong interpretability techniques that we don't yet have. But if you want to claim that only proofs will work, you either need to claim that interpretability techniques can never be developed, or that even if they are developed they won't solve the problem.)

Also, I probably should have mentioned this in the previous comment, but it's not clear to me that it's accurate to model AGI as an open-ended optimizer, in the same way that that's not a great model of humans. I don't particularly want to debate that claim, because those debates never help, but it's a relevant fact to understanding my position.

Replies from: johnswentworth

↑ comment by johnswentworth · 2019-08-05T17:27:11.100Z · LW(p) · GW(p)

I mentioned that I expect proof-level guarantees will be easy once the conceptual problems are worked out. Strong interpretability is part of that: if we know how to "see whether the AI runs a check for whether it can deceive humans", then I expect systems which provably don't do that won't be much extra work. So we might disagree less on that front than it first seemed.

The question of whether to model the AI as an open-ended optimizer is one I figured would come up. I don't think we need to think of it as truly open-ended in order to use any of the above arguments, especially the wish-granting analogy. The relevant point is that limited optimization implies limited wish-granting ability. In order to grant more "difficult" wishes, the AI needs to steer the universe into a smaller chunk of state-space - in other words, it needs to perform stronger optimization. So AIs with limited optimization capability will be safer to exactly the extent that they are unable to grant unsafe wishes - i.e. the chunks of state-space which they can access just don't contain really bad outcomes.
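One way to put a number on "steering into a smaller chunk of state-space" is to count bits. This is an illustrative assumption rather than something established in this thread: take a finite state space $S$ with a uniform measure, and let $W \subseteq S$ be the set of states in which the wish counts as granted. Then the difficulty of the wish can be written as

$$
\mathrm{bits}(W) = \log_2 \frac{|S|}{|W|}
$$

For example, with $|S| = 2^{20}$, a wish satisfied by $|W| = 2^{10}$ states takes $10$ bits, while a wish satisfied by a single state takes $20$ bits. Under this toy model, an optimizer limited to $b$ bits can only hit targets with $|W| \ge |S| / 2^{b}$, so it is safe to the extent that no target set of that size consists only of really bad outcomes.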

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-06T18:12:34.213Z · LW(p) · GW(p)

Perhaps the disagreement is in how hard it is to prove things vs. test them. I pretty strongly disagree with

if we know how to "see whether the AI runs a check for whether it can deceive humans", then I expect systems which provably don't do that won't be much extra work.

The version based on testing has to look at a single input scenario to the AI, whereas the proof has to quantify over all possible scenarios. These seem wildly different. Compare to e.g. telling whether Alice is being manipulated by Bob by looking at interactions between Alice and Bob, vs. trying to prove that Bob will never be manipulative. The former seems possible, the latter doesn't.

Replies from: johnswentworth

↑ comment by johnswentworth · 2019-08-06T18:45:30.722Z · LW(p) · GW(p)

Three possibly-relevant points here.

First, when I say "proof-level guarantees will be easy", I mean "team of experts can predictably and reliably do it in a year or two", not "hacker can do it over the weekend".

Second, suppose we want to prove that a sorting algorithm always returns sorted output. We don't do that by explicitly quantifying over all possible outputs. Rather, we do that using some insights into what it means for something to be sorted - e.g. expressing it in terms of a relatively small set of pairwise comparisons. Indeed, the insights needed for the proof are often exactly the same insights needed to design the algorithm. Once you've got the insights and the sorting algorithm in hand, the proof isn't actually that much extra work, although it will still take some experts chewing on it a bit to make sure it's correct.

That's the sort of thing I expect to happen for friendly AI: we are missing some fundamental insights into what it means to be "aligned". Once those are figured out, I don't expect proofs to be much harder than algorithms. Coming back to the "see whether the AI runs a check for whether it can deceive humans" example, the proof wouldn't involve writing the checker and then quantifying over all possible inputs. Rather, it would involve writing the AI in such a way that it always passes the check, by construction - just like we write sorting algorithms so that they will always pass an is_sorted() check by construction.
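To make the sorting analogy concrete, here is a minimal Python sketch. The choice of insertion sort and the function names are assumptions made for illustration; the comment above only refers to a generic sorting algorithm and an is_sorted() check. The point is that "sorted" reduces to a small set of adjacent pairwise comparisons, and the algorithm maintains exactly that property as a loop invariant, so it passes the check by construction rather than by testing every input.

```python
def is_sorted(xs):
    """Sortedness as a small set of pairwise comparisons: every adjacent pair is ordered."""
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))


def insertion_sort(xs):
    """Return a sorted copy of xs, keeping the partial output sorted at every step."""
    out = []
    for x in xs:
        # Invariant: `out` is sorted at this point.
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        # The loop stopped because i == 0 or out[i - 1] <= x, and every element
        # from position i onward is greater than x, so inserting x at position i
        # keeps all adjacent pairs ordered.
        out.insert(i, x)
    return out


print(is_sorted(insertion_sort([3, 1, 2, 1])))
```

The same insight (adjacent pairs stay ordered after each insertion) is what designs the algorithm and what an inductive correctness proof would use.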

Third, continuing from the previous point: the question is not how hard it is to prove compared to test. The question is how hard it is to build a provably-correct algorithm, compared to an algorithm which happens to be correct even though we don't have a proof.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-06T19:52:53.372Z · LW(p) · GW(p)

First, when I say "proof-level guarantees will be easy", I mean "team of experts can predictably and reliably do it in a year or two", not "hacker can do it over the weekend".

This was also what I was imagining. (Well, actually, I was also considering more than two years.)

we are missing some fundamental insights into what it means to be "aligned".

It sounds like our disagreement is the one highlighted in Realism about rationality [LW · GW]. When I say we could check whether the AI is deceiving humans, I don't mean that we have a check that succeeds literally 100% of the time because we have formalized a definition of "deception" that gives us a perfect checker. I don't think notions like "deception", "aligned", "want", "optimize", etc. have a clean formal definition that admits a 100% successful checker. I do think that these notions do tend to have extremes that can be reliably identified, even if there are edge cases where it is unclear. This makes testing easy, while proofs remain very difficult.

Jumping back to the original question, it sounds like the reason that you think that if we don't have proofs we are doomed, is that conditional on us not having proofs, we must not have had any other methods of gaining confidence (such as testing), and so we must be flying blind. Is that right?

If so, how do you square this with other engineering disciplines, which typically place most of the confidence in safety on comprehensive, expensive testing (think wind tunnels for rockets or crash tests for cars)? Perhaps this is also explained by realism about rationality -- maybe physical phenomena aren't amenable to crisp formal definitions, but "alignment" is.

Replies from: johnswentworth

↑ comment by johnswentworth · 2019-08-06T22:45:03.592Z · LW(p) · GW(p)

It does sound like our disagreement is the same thing outlined in Realism about Rationality (although I disagree with almost all of the "realism about rationality" examples in that post - e.g. I don't think AGI will necessarily be an "agent", I don't think Turing machines or Kolmogorov complexity are useful foundations for epistemology, I'm not bothered by moral intuitions containing contradictions, etc).

I would also describe my "no proofs => doomed" view, not as the proofs being causally important, but as the proofs being evidence of understanding. If we don't have the proofs, it's highly unlikely that we understand the system well enough to usefully predict whether it is safe - but the proofs themselves play a relatively minor role.

I do not know of any engineering discipline which places most of the confidence in safety on comprehensive, expensive testing. Every single engineering discipline I have ever studied starts from understanding the system under design, the principles which govern its function, and designs a system which is expected to be safe based on that understanding. As long as those underlying principles are understood, the most likely errors are either simple mistakes (e.g. metric/standard units mixup) or missing some fundamental phenomenon (e.g. aerodynamics of a bridge). Those are the sort of problems which testing is good at catching. Testing is a double-check that we haven't missed something critical; it is not the primary basis for thinking the system is safe.

A simple example, in contrast to AI: every engineering discipline I know of uses "safety factors" - i.e. make a beam twice as strong as it needs to be, give a wire twice the current capacity it needs, etc. A safety factor of 2 is typical in a wide variety of engineering fields. In AI, we cannot use safety factors because we do not even know what number we could double to make the AI more safe. Today, given any particular aspect of an AI system, we do not know whether adjusting any particular parameter will make the AI more or less reliable/risky.

↑ comment by Donald Hobson (donald-hobson) · 2019-08-05T11:33:44.274Z · LW(p) · GW(p)

The problem with tests is that the AI behaving well when weak enough to be tested doesn't guarantee it will continue to do so.

If you are testing a system, that means that you are not confident that it is safe. If it isn't safe, then your only hope is for humans to stop it. Testing an AI is very dangerous unless you are confident that it can't harm you.

A paperclip maximizer would try to pass your tests until it was powerful enough to trick its way out and take over. Black box testing of arbitrary AIs gets you very little safety.

Also, some people's intuitions say that a smile-maximizing AI is a good idea. If you have a straightforward argument that appeals to the intuitions of the average Joe Bloggs, and can't be easily formalized, then I would take the difficulty formalizing it as evidence that the argument is not sound.

If you take a neural network and train it to recognize smiling faces, then attach that to AIXI, you get a machine that will appear to work in the lab, when the best it can do is make the scientists smile into its camera. There will be an intuitive argument about how it wants to make people smile, and people smile when they are happy. The AI will tile the universe with cameras pointed at smiley faces as soon as it escapes the lab.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-05T17:12:45.997Z · LW(p) · GW(p)

See response to johnswentworth above.

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-07T07:36:16.135Z · LW(p) · GW(p)

A slightly misspecified reward function can lead to anything from perfectly aligned behavior to catastrophic failure. So I think we need much stronger and more formal arguments to believe that catastrophe is almost inevitable than EY's genie post provides.

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-09T16:15:41.276Z · LW(p) · GW(p)

I think a potentially more interesting question is not about running a single AI system, but rather the overall impact of AI technology (in a world where we don't have proofs of things like beneficence). It would be easier to hold the analogue of the empirical claim there.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-09T22:11:20.198Z · LW(p) · GW(p)

I'd also argue against the empirical claim in that setting; do you agree with the empirical claim there?

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-07T07:28:32.536Z · LW(p) · GW(p)

I hold a nuanced view that I believe is more similar to the empirical claim than your views.

I think what we want is an extremely high level of justified confidence that any AI system or technology that is likely to become widely available is not carrying a significant and non-decreasing amount of Xrisk-per-second.
And it seems incredibly difficult and likely impossible to have such an extremely high level of justified confidence.

Formal verification and proof seem like the best we can do now, but I agree with you that we shouldn't rule out other approaches to achieving extreme levels of justified confidence. What it all points at to me is the need for more work on epistemology, so that we can begin to understand how extreme levels of confidence actually operate.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-07T19:05:08.775Z · LW(p) · GW(p)

This sounds like the normative claim, not the empirical one, given that you said "what we want is..."

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-09T16:13:32.122Z · LW(p) · GW(p)

Yep, good catch ;)

I *do* put a non-trivial weight on models where the empirical claim is true, and not just out of epistemic humility. But overall, I'm epistemically humble enough these days to think it's not reasonable to say "nearly inevitable" if you integrate out epistemic uncertainty.

But maybe it's enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?

Or are you just trying to see if anyone can defeat the epistemic humility "trump card"?

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-09T22:14:26.776Z · LW(p) · GW(p)

Or are you just trying to see if anyone can defeat the epistemic humility "trump card"?

Partly (I'm surprised by how confident people generally seem to be, but that could just be a misinterpretation of their position), but also on my inside view the empirical claim is not true and I wanted to see if there were convincing arguments for it.

But maybe it's enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?

Yeah, I'd be interested in your answers anyway.

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-15T04:13:08.445Z · LW(p) · GW(p)

I'm not sure I have much more than the standard MIRI-style arguments about convergent rationality and fragility of human values, at least nothing is jumping to mind ATM. I do think we probably disagree about how strong those arguments are. I'm actually more interested in hearing your take on those lines of argument than saying mine ATM :P

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-08-15T19:57:38.540Z · LW(p) · GW(p)

Re: convergent rationality, I don't buy it [? · GW] (specifically the "convergent" part).

Re: fragility of human values, I do buy the notion of a broad basin of corrigibility, which presumably is less fragile.

But really my answer is "there are lots of ways you can get confidence in a thing that are not proofs". I think the strongest argument against is "when you have an adversary optimizing against you, nothing short of proofs can give you confidence", which seems to be somewhat true in security. But then I think there are ways that you can get confidence in "the AI system will not adversarially optimize against me" using techniques that are not proofs.

(Note the alternative to proofs is not trial and error. I don't use trial and error to successfully board a flight, but I also don't have a proof that my strategy is going to cause me to successfully board a flight.)

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-08-21T16:18:36.045Z · LW(p) · GW(p)

But really my answer is "there are lots of ways you can get confidence in a thing that are not proofs".

Totally agree; it's an under-appreciated point!

Here's my counter-argument: we have no idea what epistemological principles explain this empirical observation. Therefore we don't actually know that the confidence we achieve in these ways is justified. So we may just be wrong to be confident in our ability to successfully board flights (etc.).

The epistemic/aleatory distinction is relevant here. Taking an expectation over both kinds of uncertainty, we can achieve a high level of subjective confidence in such things / via such means. However, we may be badly mistaken, and thus still extremely likely, objectively speaking, to be wrong.

This also probably explains a lot of the disagreement, since different people probably just have very different prior beliefs about how likely this kind of informal reasoning is to give us true beliefs about advanced AI systems.

I'm personally quite uncertain about that question, ATM. I tend to think we can get pretty far with this kind of informal reasoning in the "early days" of (proto-)AGI development, but we become increasingly likely to fuck up as we start having to deal with vastly super-human intelligences. I would like to see more work in epistemology aimed at addressing this (and other Xrisk-relevant concerns, e.g. what principles of "social epistemology" would allow the human community to effectively manage collective knowledge that is far beyond what any individual can grasp? I'd argue we're in the process of failing catastrophically at that).

comment by Wei Dai (Wei_Dai) · 2019-09-23T05:35:39.410Z · LW(p) · GW(p)

A downside of the portfolio approach to AI safety research

Given typical human biases, researchers of each AI safety approach are likely to be overconfident about the difficulty and safety of the approach they're personally advocating and pursuing, which exacerbates the problem of unilateralist's curse in AI. This should be highlighted and kept in mind by practitioners of the portfolio approach to AI safety research (e.g., grant makers). In particular it may be a good idea to make sure researchers who are being funded have a good understanding of the overconfidence effect and other relevant biases, as well as the unilateralist's curse.

Replies from: ofer

↑ comment by Ofer (ofer) · 2019-09-23T07:11:20.905Z · LW(p) · GW(p)

These biases seem very important to keep in mind!

If "AI safety" refers here only to AI alignment, I'd be happy to read about how overconfidence about the difficulty/safety of one's approach might exacerbate the unilateralist's curse.

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-08-09T13:09:59.068Z · LW(p) · GW(p)

I'm posting a few research directions in my research agenda about which I haven't written much elsewhere (except maybe in the MIRIx Discord server), and for which I so far haven't got the time to make a full length essay with mathematical details. Each direction is in a separate child comment.

Replies from: vanessa-kosoy, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2019-08-09T15:10:24.484Z · LW(p) · GW(p)

In last year's essay [AF · GW] about my research agenda I wrote about an approach I call "learning by teaching" (LBT). In LBT, an AI is learning human values by trying to give advice to a human and seeing how the human changes eir behavior (without an explicit reward signal). Roughly speaking, if the human permanently changes eir behavior as a result of the advice, then one can assume the advice was useful. Partial protection against manipulative advice is provided by a delegation mechanism, which ensures the AI only produces advice that is in the space of "possible pieces of advice a human could give" in some sense. However, this protection seems insufficient since it allows for giving all arguments in favor of a position without giving any arguments against a position.

To add more defense against manipulation, I propose to build on the "AI debate" idea. However, in this scheme, we don't need more than one AI. In fact, this is a general fact: for any protocol $P$ involving multiple AIs, there is a protocol $Q$ involving just one AI that works (at least roughly, qualitatively) just as well. Proof sketch: If we can prove that under assumptions $X$ , the protocol $P$ is safe/effective, then we can design a single AI $Q$ which has assumptions $X$ baked into its prior. Such an AI would be able to understand that simulating protocol $P$ would lead to a safe/effective outcome, and would only choose a different strategy if it leads to an even better outcome under the same assumptions.

The way we use "AI debate" is not by implementing an actual AI debate. Instead, we use it to formalize our assumptions about human behavior. In ordinary IRL, the usual assumption is "a human is a (nearly) optimal agent for eir utility function". In the original version of LBT, the assumption was of the form "a human is (nearly) optimal when receiving optimal advice". In debate-LBT the assumption becomes "a human is (nearly) optimal* when exposed to a debate between two agents at least one of which is giving optimal advice". Here, the human observes this hypothetical debate through the same "cheap talk" channel through which it receives advice from the single AI.

Notice that debate can be considered to be a form of interactive proof system (with two or more provers). However, the requirements are different from classical proof systems. In classical theory, the requirement is "When the prover is honestly arguing for a correct proposition, the verifier is convinced. For any prover the verifier cannot be convinced of a false proposition." In "debate proof systems" the requirement is "If at least one prover is honest, the verifier comes to the correct conclusion". That is, we don't guarantee anything when both provers are dishonest. It is easy to see that these debate proof systems admit any problem in PSPACE: given a game, both provers can state their assertions as to which side wins the game, and if they disagree they have to play the game for the corresponding sides.
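The "play the game out" argument can be illustrated with a toy Python sketch. Everything here is an assumption chosen for illustration: the game is a simple subtraction game (take 1 to 3 stones, whoever takes the last stone wins), which is of course far easier than a PSPACE-hard game, and the function names are hypothetical. What it shows is the shape of the protocol: the verifier never computes the winner itself, it only referees, and the honest prover's side wins whenever at least one prover is honest and plays optimally.

```python
import random

MAX_TAKE = 3  # each move removes 1..MAX_TAKE stones; taking the last stone wins


def honest_claim(pile):
    """True answer to 'which side wins from this pile?' for this toy game."""
    return "first" if pile % (MAX_TAKE + 1) != 0 else "second"


def optimal_move(pile):
    """Leave a multiple of MAX_TAKE + 1 when possible; otherwise the position is lost anyway."""
    remainder = pile % (MAX_TAKE + 1)
    return remainder if remainder != 0 else 1


def random_move(pile, rng):
    """A dishonest prover with no winning strategy just plays some legal move."""
    return rng.randint(1, min(MAX_TAKE, pile))


def debate(pile, first_strategy, second_strategy):
    """Verifier: referee the game between the two provers and return the winning side."""
    strategies = {"first": first_strategy, "second": second_strategy}
    side = "first"
    while True:
        take = strategies[side](pile)
        other = "second" if side == "first" else "first"
        if not 1 <= take <= min(MAX_TAKE, pile):
            return other  # an illegal move forfeits the game
        pile -= take
        if pile == 0:
            return side
        side = other


rng = random.Random(0)
pile = 10
# The honest prover claims "first" wins and plays first; the other prover disagrees.
verdict = debate(pile, optimal_move, lambda p: random_move(p, rng))
print(verdict == honest_claim(pile))
```

Note that nothing is guaranteed if both provers play badly, which matches the weaker requirement of debate proof systems described above.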

*Fiddling with the assumptions a little, instead of "optimal" we can probably just say that the AI is guaranteed to achieve this level of performance, whatever it is.

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2019-08-09T15:32:19.195Z · LW(p) · GW(p)

A variant of Christiano's IDA amenable to learning-theoretic analysis. We consider reinforcement learning with a set of observations and a set of actions, where the semantics of the actions is making predictions about future observations. (But, as opposed to vanilla sequence forecasting, making predictions affects the environment.) The reward function is unknown and unobservable, but it is known to satisfy two assumptions:

(i) If we make the null prediction always, the expected utility will be lower bounded by some constant.

(ii) If our predictions sample the $n$-step future for a given policy $π$, then our expected utility will be lower bounded by some function $F(u, n)$ of the expected utility $u$ of $π$ and $n$. $F$ is s.t. for sufficiently low $u$, $F(u, n) \leq u$ but for sufficiently high $u$, $F(u, n) > u$ (in particular the constant in (i) should be high enough to lead to an increasing sequence). Also, it can be argued that it's natural to assume $F(F(u, n), m) \approx F(u, nm)$ for $u \gg 0$.
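As a sanity check that the conditions in (ii) are mutually consistent, here is one family of functions satisfying them. The specific form is purely an illustrative assumption, and it presumes utilities are normalized to be positive:

$$F(u, n) = u^{n^{a}}, \qquad a > 0.$$

For $n > 1$ this gives $F(u, n) < u$ when $0 < u < 1$ and $F(u, n) > u$ when $u > 1$, so in this example $u = 1$ is the threshold that the constant in (i) has to exceed. The composition property holds exactly here:

$$F(F(u, n), m) = \left(u^{n^{a}}\right)^{m^{a}} = u^{(nm)^{a}} = F(u, nm).$$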

The goal is proving regret bounds for this setting. Note that this setting automatically deals with malign hypotheses in the prior, bad self-fulfilling prophecies and "corrupting" predictions that cause damage just by seeing them.

However, I expect that without additional assumptions the regret bound will be fairly weak, since the penalty for making wrong predictions grows with the total variation distance between the prediction distribution and the true distribution, which is quite harsh. I think this reflects a true weakness of IDA (or some versions of it, at least): without an explicit model of the utility function, we need very high fidelity to guarantee robustness against e.g. malign hypotheses. On the other hand, it might be possible to ameliorate this problem if we introduce an additional assumption of the form: the utility function is e.g. Lipschitz w.r.t some metric $d$ . Then, total variation distance is replaced by Kantorovich-Rubinstein distance defined w.r.t. $d$ . The question is, where do we get the metric from. Maybe we can consider something like the process of humans rewriting texts into equivalent texts.
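To spell out why a Lipschitz assumption would help, here are the two standard bounds being compared, writing $U$ for the utility function, $\mu$ for the true distribution and $\nu$ for the predicted one (these symbols are introduced here for illustration). If all we know is that $U$ takes values in $[0, 1]$, then

$$\left| \mathbb{E}_{\mu}[U] - \mathbb{E}_{\nu}[U] \right| \leq d_{TV}(\mu, \nu), \qquad d_{TV}(\mu, \nu) = \sup_{A} \left| \mu(A) - \nu(A) \right|.$$

If in addition $U$ is $L$-Lipschitz w.r.t. a metric $d$, then

$$\left| \mathbb{E}_{\mu}[U] - \mathbb{E}_{\nu}[U] \right| \leq L \cdot W_{d}(\mu, \nu), \qquad W_{d}(\mu, \nu) = \sup_{f \,:\, \mathrm{Lip}_{d}(f) \leq 1} \left| \mathbb{E}_{\mu}[f] - \mathbb{E}_{\nu}[f] \right|,$$

where $W_{d}$ is the Kantorovich-Rubinstein distance. Two distributions supported on distinct but $d$-close outcomes have $d_{TV} = 1$ yet small $W_{d}$, which is the sense in which the second bound is less harsh.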

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2019-08-09T16:19:44.298Z · LW(p) · GW(p)

This idea was inspired by a discussion with Discord user @jbeshir

Model dynamically inconsistent agents (in particular humans) as having a different reward function at every state of the environment MDP (i.e. at every state we have a reward function that assigns values both to this state and to all other states: we have a reward matrix $r (s, t)$ ). This should be regarded as a game where a different player controls the action at every state. We can now look for value learning protocols that converge to Nash* (or other kind of) equilibrium in this game.

The simplest setting would be, every time you visit a state, you learn the reward of all previous states w.r.t. the reward function of the current state. Alternatively, every time you visit a state, you can ask about the reward of one previously visited state w.r.t. the reward function of the current state. This is the analogue of classical reinforcement learning with an explicit reward channel. We can now try to prove a regret bound, which takes the form of an $ϵ$ -Nash equilibrium condition, with $ϵ$ being the regret. More complicated settings would be analogues of Delegative RL (where the advisor also follows the reward function of the current state) and other value learning protocols.

This seems like a more elegant way to model "corruption" than as a binary or continuous one dimensional variable like I did before.

*Note that although for general games, even if they are purely cooperative, Nash equilibria can be suboptimal due to coordination problems, this doesn't happen for this type of game: in the purely cooperative case, the Nash equilibrium condition becomes the Bellman equation that implies global optimality.
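A minimal Python sketch of the $ϵ$-Nash condition for this game, under several assumptions that are not in the comment above: a tiny deterministic MDP, discounted returns truncated at a finite horizon, and each state's player evaluating trajectories that start at its own state. The names `player_value` and `nash_gap` are hypothetical. A policy is an $ϵ$-Nash equilibrium when `nash_gap` is at most $ϵ$.

```python
def player_value(r, T, pi, s, gamma=0.9, horizon=200):
    """Discounted return that the player owning state s assigns to following pi from s.

    r[s][t] is the reward the player at state s assigns to visiting state t,
    T[state][action] is the (deterministic) next state, pi[state] is the action taken.
    """
    total, state, discount = 0.0, s, 1.0
    for _ in range(horizon):
        total += discount * r[s][state]
        state = T[state][pi[state]]
        discount *= gamma
    return total


def nash_gap(r, T, pi, gamma=0.9, horizon=200):
    """Largest gain any single state's player can get by unilaterally changing its action."""
    gap = 0.0
    for s in range(len(T)):
        base = player_value(r, T, pi, s, gamma, horizon)
        for a in range(len(T[s])):
            deviated = list(pi)
            deviated[s] = a  # only the player at state s deviates
            gain = player_value(r, T, deviated, s, gamma, horizon) - base
            gap = max(gap, gain)
    return gap


# Two states; action 0 stays put, action 1 moves to the other state.
T = [[0, 1], [1, 0]]

# Purely cooperative case: every player uses the same reward (state 1 is good).
r_coop = [[0.0, 1.0], [0.0, 1.0]]
# Dynamically inconsistent case: each state's player prefers the *other* state.
r_inconsistent = [[0.0, 1.0], [1.0, 0.0]]

print(nash_gap(r_coop, T, [1, 0]))          # globally optimal policy: no profitable deviation
print(nash_gap(r_inconsistent, T, [1, 1]))  # always switching is an equilibrium here
print(nash_gap(r_inconsistent, T, [0, 0]))  # always staying is not
```

In the cooperative example the zero-gap policy is also the globally optimal one, in line with the footnote above.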

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2019-08-09T15:52:37.446Z · LW(p) · GW(p)

It is an interesting problem to write explicit regret bounds for reinforcement learning with a prior that is the Solomonoff prior or something similar. Of course, any regret bound requires dealing with traps. The simplest approach is leaving only environments without traps in the prior (there are technical details involved that I won't go into right now). However, after that we are still left with a different major problem. The regret bound we get is very weak. This happens because the prior contains sets of hypotheses of the form "program template $P$ augmented by a hard-coded bit string $b$ of length $n$". Such a set contains $2^{n}$ hypotheses, and its total probability mass is approximately $2^{-|P|}$, which is significant for short $P$ (even when $n$ is large). However, the definition of regret requires our agent to compete against a hypothetical agent that knows the true environment, which in this case means knowing both $P$ and $b$. Such a contest is very hard since learning $n$ bits can take much time for large $n$.
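The mass estimate can be written out as follows, under the simplifying assumption that each hypothesis "$P$ with hard-coded $b$" has description length about $|P| + n$ (ignoring the overhead of a prefix-free encoding of $n$):

$$\sum_{b \in \{0, 1\}^{n}} 2^{-(|P| + n)} = 2^{n} \cdot 2^{-|P| - n} = 2^{-|P|}.$$

So each individual hypothesis has tiny weight $2^{-(|P| + n)}$, but the set as a whole carries weight that does not shrink with $n$, which is why it cannot be ignored in the regret analysis.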

Note that the definition of regret depends on how we decompose the prior into a convex combination of individual hypotheses. To solve this problem, I propose redefining regret in this setting by grouping the hypotheses in a particular way. Specifically, in algorithmic statistics there is the concept of sophistication. The sophistication of a bit string $x$ is defined as follows. First, we consider the Kolmogorov complexity $K(x)$ of $x$. Then we consider pairs $(Q, y)$ where $Q$ is a program, $y$ is a bit string, $Q(y) = x$ and $|Q| + |y| \leq K(x) + O(1)$. Finally, we minimize over $|Q|$. The minimal $|Q|$ is called the sophistication of $x$. For our problem, we are interested in the minimal $Q$ itself: I call it the "sophisticated core" of $x$. We now group the hypotheses in our prior by sophisticated cores, and define (Bayesian) regret w.r.t. this grouping.

Coming back to our problematic set of hypotheses, most of it is now grouped into a single hypothesis, corresponding to the sophisticated core of $P$ . Therefore, the reference agent in the definition of regret now knows $P$ but doesn't know $b$ , making it feasible to compete with it.

comment by Donald Hobson (donald-hobson) · 2019-08-05T11:41:25.884Z · LW(p) · GW(p)

You are handed a hypercomputer, and allowed to run any code you like on it. You can then take 1Tb of data from your computations and attach it to a normal computer. The hypercomputer is removed. You are then handed a magic human utility function. How do you make an FAI with these resources?

The normal computer is capable of running a highly efficient super-intelligence. The hypercomputer can do a brute force search for efficient algorithms. The idea is to split FAI into building a capability module, and a value module.

Replies from: Gurkenglas

↑ comment by Gurkenglas · 2019-08-05T14:44:58.202Z · LW(p) · GW(p)

Assume that given a hypercomputer and the magic utility function m, we could build an FAI F(m). Every TB of data encodes some program A(u) that takes a utility function u as input. For all A and u, ask F(u) if A(u) is aligned with F(u). (We must construct F not to vote strategically here.) Save that A' which gets approved by the largest fraction of F(u). Sanity check that this maximum fraction is very close to 1. Run A'(m).
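The selection step can be sketched in Python on a toy scale. This is only an illustration under heavy assumptions: the real proposal quantifies over every program that fits in 1 TB and every utility function, and asks a hypercomputer-built FAI for its judgment. Here `approves(u, program)` is a hypothetical stand-in for "ask F(u) whether A(u) is aligned with F(u)", and the candidates, utilities and threshold are made up for the example.

```python
def select_program(candidates, utilities, approves, min_fraction=0.99):
    """Return (name, fraction) for the candidate approved under the largest fraction
    of utility functions, or (None, fraction) if even the best fails the sanity check."""
    best, best_fraction = None, -1.0
    for name, program in candidates.items():
        votes = sum(1 for u in utilities if approves(u, program))
        fraction = votes / len(utilities)
        if fraction > best_fraction:
            best, best_fraction = name, fraction
    if best_fraction < min_fraction:
        return None, best_fraction
    return best, best_fraction


OUTCOMES = range(5)

# Toy utility functions: u_k prefers outcome k.
utilities = [lambda x, k=k: -(x - k) ** 2 for k in OUTCOMES]


def maximizer(u):
    """Candidate A(u) that actually uses its utility input."""
    return max(OUTCOMES, key=u)


def constant(u):
    """Candidate A(u) that ignores its utility input."""
    return 0


def approves(u, program):
    """Stand-in for F(u)'s verdict: approve iff the program picks a u-optimal outcome."""
    return u(program(u)) == max(u(x) for x in OUTCOMES)


candidates = {"maximizer": maximizer, "constant": constant}
best, fraction = select_program(candidates, utilities, approves)
print(best, fraction)
```

The last step of the proposal, running $A'(m)$, would correspond to calling `candidates[best]` on the magic utility function once the sanity check has passed.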

comment by Abe Dillon (abe-dillon) · 2019-08-06T02:20:00.433Z · LW(p) · GW(p)

The telos of life is to collect and preserve information. That is to say: this is the defining behavior of a living system, so it is an inherent goal. The beginning of life must have involved some replicating medium for storing information. At first, life actively preserved information by replicating, and passively collected information through the process of evolution by natural selection. Now life forms have several ways of collecting and storing information: genetics, epigenetics, brains, immune systems, gut biomes, etc.

Obviously a system that collects and preserves information is anti-entropic, so living systems can never be fully closed systems. One can think of them as turbulent vortices that form in the flow of the universe from low-entropy to high-entropy. It may never be possible to halt entropy completely, but if the vortex grows enough, it may slow the progression enough that the universe never quite reaches equilibrium. That's the hope, at least.

One nice thing about this goal is that it's also an instrumental goal. It should lead to a very general form of intelligence that's capable of solving many problems.

One question is: if all living creatures share the same goal, why is there conflict? The simple answer is that it's a flaw in evolution. Different creatures encapsulate different information about how to survive. There are few ways to share this information, so there's not much way to form an alliance with other creatures. Ideally, we would want to maximize our internal, low entropy part, and minimize our interface with high entropy.

Imagine playing a game of Risk. A good strategy is to maximize the number of countries you control while minimizing the number of access points to your territory. If you hold North America, you want to take Venezuela, Iceland, and Kamchatka too because they add to your territory without adding to your "interface". You still only have three territories to defend. This principle extends to many real-world scenarios.

Of course, a better way is to form alliances with your neighbors so you don't have to spend so many resources conquering them (that's not a good way to win Risk, but it would be better in the real world).

The reason humans haven't figured out how to reach a state of peace is because we have a flawed implementation of intelligence that makes it difficult to align our interests (or to recognize that our base goals are inherently aligned).

One interesting consequence of the goal of collecting and preserving information is that it inherently implies a utility function over information. That is: information that is more relevant to the problem of collecting and preserving information is more valuable than information that's less relevant to that goal. You're not winning at life if you have an HD box set of "Happy Days" while your neighbor has only a flash drive with all of Wikipedia on it. You may have more bits of information, but those bits aren't very useful.

Another reason for conflict among humans is the hard problem of when to favor information preservation over collection. Collecting information necessarily involves risk because it means encountering the unknown. This is the basic conflict between conservatism and liberalism in the most general form of those words.

Would an AI given the goal of collecting and preserving information completely solve the alignment problem? It seems like it might. I'd like to be able to prove such a statement. Thoughts?

EDIT: Please pardon the disorganized, stream-of-consciousness, style of this post. I'm usually skeptical of posts that seem so scatter-brained and almost... hippy-dippy... for lack of a better word. Like the kind of rambling that a stoned teenager might spout. Please work with me here. I've found it hard to present this idea without coming off as a spiritualist-quack, but it is a very serious proposal.

comment by Grue_Slinky · 2019-09-23T14:44:25.742Z · LW(p) · GW(p)

Is this open thread not going to be a monthly thing?

FWIW I liked reading the comment threads here, and would be inclined to participate in the future. But that's just my opinion. I'm curious if more senior people had reasons for not liking the idea?

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-09-29T19:28:36.179Z · LW(p) · GW(p)

I expected that it would be better for me to polish ideas before posting on the forum, and treated this as an experiment to check. I think it broadly confirmed my original view, so I'm not very likely to post top-level comments on open threads in the future, and I told the admins so. I don't know what their decision process was after that. (Possibly they expected that future open threads would be much quieter, since the two biggest comment threads here were both started by my top-level comments.)

Replies from: habryka4

↑ comment by habryka (habryka4) · 2019-09-29T19:48:28.837Z · LW(p) · GW(p)

I felt a bit uncertain about doing one every month, and was planning to start another one in October. Depending on how that one goes we might go with a monthly schedule, or maybe every two months is the right way to go.

comment by Chris_Leong · 2019-08-15T14:18:18.582Z · LW(p) · GW(p)

I've just been invited to this forum. How do I decide whether to put a post on the Alignment Forum vs. Less Wrong?

Replies from: Vaniver

↑ comment by Vaniver · 2019-08-15T19:58:48.694Z · LW(p) · GW(p)

Basically, whether you think it's primarily related to alignment vs. rationality. (Everything on the AF is also on LW, but the reverse isn't true.) If you're posting too much, or posting stuff that isn't polished enough, the feedback loop is downvotes (or insufficient upvotes).

comment by John_Maxwell (John_Maxwell_IV) · 2019-08-10T05:51:20.930Z · LW(p) · GW(p)

I saw this thread complaining about the state of peer review in machine learning. Has anyone thought about trying to design a better peer review system, then creating a new ML conference around it and also adding in a safety emphasis?

Replies from: rohinmshah, Raemon

↑ comment by Rohin Shah (rohinmshah) · 2019-09-30T15:24:19.952Z · LW(p) · GW(p)

Yes (though just the peer review, not the safety emphasis). I can send you thoughts about it if you'd like; email me at <my LW username> at gmail.

I thought about the differential development point and came away thinking it would be net positive, and convinced a few other people as well, even if it's just modifying peer review without having safety researchers run the conference.

Replies from: John_Maxwell_IV

↑ comment by John_Maxwell (John_Maxwell_IV) · 2019-10-04T01:20:03.150Z · LW(p) · GW(p)

Cool!

I guess another way of thinking about this is not a safety emphasis so much as a forecasting emphasis. Reminds me of our previous discussion here. If someone could invent new scientific institutions which reward accurate forecasts about scientific progress, that could be really helpful for knowing how AI will progress and building consensus regarding which approaches are safe/unsafe.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2019-10-04T05:43:47.586Z · LW(p) · GW(p)

+1, that's basically the story I have in mind. I think of it as less about forecasting and more about understanding deep learning and how it works, but I think it serves basically the same purpose: it's helpful for knowing how AI will progress and building consensus about what's safe / unsafe.

↑ comment by Raemon · 2019-08-10T23:14:01.300Z · LW(p) · GW(p)

I'm vaguely worried that this might be net-negative for ML in particular, if you're worried about differential tech development.

Replies from: John_Maxwell_IV

↑ comment by John_Maxwell (John_Maxwell_IV) · 2019-08-11T04:12:24.629Z · LW(p) · GW(p)

The idea is that if the conference is run by people who are interested in safety, they can preferentially accept papers which are good from a differential technological development point of view.
