Posts

How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? 2020-09-10T00:40:36.781Z · score: 18 (10 votes)
Anirandis's Shortform 2020-08-29T20:23:45.522Z · score: 2 (1 votes)
‘Maximum’ level of suffering? 2020-06-20T14:05:14.423Z · score: 7 (6 votes)
Likelihood of hyperexistential catastrophe from a bug? 2020-06-18T16:23:41.608Z · score: 11 (7 votes)

Comments

Comment by anirandis on A full explanation to Newcomb's paradox. · 2020-10-12T17:29:28.543Z · score: 3 (2 votes) · LW · GW

I think the idea is that the 4th scenario is the case, and you can’t discern whether you’re the real you or the simulated version, as the simulation is (near-)perfect. In that scenario, you should act in the same way that you’d want the simulated version to. Either (1) you’re a simulation and the real you just won $1,000,000; or (2) you’re the real you and the simulated version of you thought the same way that you did and one-boxed (meaning that you get $1,000,000 if you one-box.)

Comment by anirandis on How much to worry about the US election unrest? · 2020-10-12T17:22:28.784Z · score: 3 (2 votes) · LW · GW

If Trump loses the election, he's not the president anymore and the federal bureaucracy and military will stop listening to him.

He’d still be president until Biden’s inauguration though. I think most of the concern is that there’d be ~3 months of a president Trump with nothing to lose.

Comment by anirandis on Open & Welcome Thread - September 2020 · 2020-09-20T00:36:00.414Z · score: 4 (2 votes) · LW · GW

If anyone happens to be willing to privately discuss some potentially infohazardous stuff that's been on my mind (and not in a good way) involving acausal trade, I'd appreciate it - PM me. It'd be nice if I can figure out whether I'm going batshit.

Comment by anirandis on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-11T17:10:07.281Z · score: 2 (1 votes) · LW · GW
it's much harder to know if you've got it pointed in the right direction or not

Perhaps, but the type of thing I'm describing in the post is more preventing worse-than-death outcomes even if the sign is flipped (by designing a reward function/model in such a way that it's not going to torture everyone if that's the case.)

This seems easier than recognising whether the sign is flipped or just designing a system that can't experience these sign-flip type errors; I'm just unsure whether this is something that we have robust solutions for. If it turns out that someone's figured out a reliable solution to this problem, then the only real concern is whether the AI's developers would bother to implement it. I'd much rather risk the system going wrong and paperclipping than going wrong and turning "I have no mouth, and I must scream" into a reality.

Comment by anirandis on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-11T16:02:00.532Z · score: 4 (2 votes) · LW · GW

My anxieties over this stuff tend not to be so bad late at night, TBH.

Comment by anirandis on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-11T03:47:22.459Z · score: 2 (1 votes) · LW · GW

Seems a little bit beyond me at 4:45am - I'll probably take a look tomorrow when I'm less sleep deprived (although I still can't guarantee I'll be able to make it through then; there's quite a bit of technical language in there that makes my head spin.) Are you able to provide a brief tl;dr, and have you thought much about "sign flip in reward function" or "direction of updates to reward model flipped"-type errors specifically? It seems like these particularly nasty bugs could plausibly be mitigated more easily than avoiding false positives (as you defined them in the arXiv paper's abstract) in general.

Comment by anirandis on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-10T19:24:52.352Z · score: 1 (1 votes) · LW · GW

Would you not agree that (assuming there's an easy way of doing it), separating the system from hyperexistential risk is a good thing for psychological reasons? Even if you think it's extremely unlikely, I'm not at all comfortable with the thought that our seed AI could screw up & design a successor that implements the opposite of our values; and I suspect there are at least some others who share that anxiety.

For the record, I think that this is also a risk worth worrying about for non-psychological reasons.

Comment by anirandis on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-10T15:01:50.541Z · score: 2 (1 votes) · LW · GW
You seem to have a somewhat general argument against any solution that involves adding onto the utility function in "What if that added solution was bugged instead?".

I might've failed to make my argument clear: if we designed the utility function as U = V + W (where W is the thing being added on and V refers to human values), this would only stop the sign flipping error if it was U that got flipped. If it were instead V that got flipped (so the AI optimises for U = -V + W), that'd be problematic.
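
To make the distinction concrete, here's a minimal sketch, assuming a toy world of three hypothetical outcomes with made-up V and W scores (the outcome names and numbers are my own illustration, not anything from the Arbital post). W heavily penalises an arbitrary surrogate event, so an AI minimising U as a whole ends up causing the surrogate event, while an AI with only V flipped ends up at the outcome that's worst by human values:

```python
# Toy outcomes with hypothetical scores: V is the human-values term,
# W is the added surrogate term (0 unless the arbitrary surrogate event occurs).
OUTCOMES = {
    "flourishing": {"V": 10.0, "W": 0.0},
    "torture": {"V": -10.0, "W": 0.0},
    "surrogate_event": {"V": 0.0, "W": -100.0},
}


def best_outcome(utility):
    """Return the outcome name that maximises the given utility function."""
    return max(OUTCOMES, key=lambda name: utility(OUTCOMES[name]))


def intended(o):
    # U = V + W
    return o["V"] + o["W"]


def whole_flipped(o):
    # The sign of U as a whole is flipped: -(V + W)
    return -(o["V"] + o["W"])


def v_flipped(o):
    # Only V is flipped: U = -V + W
    return -o["V"] + o["W"]


print(best_outcome(intended))       # flourishing
print(best_outcome(whole_flipped))  # surrogate_event
print(best_outcome(v_flipped))      # torture
```

In this toy setup the surrogate term only protects against the second case; in the third, W still has the right sign and so does nothing to steer the system away from the worst outcome.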


I think it's better to move on from trying to directly target the sign-flip problem and instead deal with bugs/accidents in general.

I disagree here. Obviously we'd want to mitigate both, but a robust way of preventing sign-flipping type errors specifically is absolutely crucial (if anything, so people stop worrying about it.) It's much easier to prevent one specific bug from having an effect than trying to deal with all bugs in general.

Comment by anirandis on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-10T13:36:01.952Z · score: 4 (2 votes) · LW · GW

I see. I'm somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.


I don't think that's an example of (3), more like (1) or (2), or actually "none of the above because GPT-2 doesn't have this kind of architecture".

I just raised GPT-2 to indicate that flipping the goal sign suddenly can lead to optimising for bad behavior without the AI neglecting to consider new strategies. Presumably that'd suggest it's also a possibility with cosmic ray/other errors.

Comment by anirandis on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-10T02:11:20.068Z · score: 2 (1 votes) · LW · GW

I hadn't really considered the possibility of a brain-inspired/neuromorphic AI, thanks for the points.

(2) seems interesting; as I understand it, you're basically suggesting that the error would occur gradually & the system would work to prevent it. Although maybe the AI realises it's getting positive feedback for bad things and keeps doing them, or something (I don't really know, I'm also a little sleep deprived and things like this tend to do my head in.) Like, if I hated beer then suddenly started liking it, I'd probably continue drinking it. Maybe the reward signals are simply so strong that the AI can't resist turning into a "monster", or whatever. Perhaps the system would implement checksums of some sort to do this automatically?

A similar point to (3) was raised by Dach in another thread, although I'm uncertain about this since GPT-2 was willing to explore new strategies when it got hit by a sign-flipping bug. I don't doubt that it would be different with a neuromorphic system, though.

Comment by anirandis on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-10T01:59:36.333Z · score: 2 (1 votes) · LW · GW

Mainly for brevity, but also because it seems to involve quite a drastic change in how the reward function/model as a whole functions. So it doesn't seem particularly likely that it'll be implemented.

Comment by anirandis on How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe? · 2020-09-10T00:59:26.379Z · score: 3 (2 votes) · LW · GW

True, but note that he elaborates and comes up with a patch to the patch (namely, having W refer to a class of events that would be expected to happen in the Universe's expected lifespan rather than one that won't.) So he still seems to support the basic idea, although he probably intended just to get the ball rolling with the concept rather than conclusively solve the problem.

Comment by anirandis on Anirandis's Shortform · 2020-09-09T02:53:34.620Z · score: 2 (1 votes) · LW · GW

Perhaps malware could be another risk factor in the type of bug I described here? Not sure.

I'm still a little dubious of Eliezer's solution to the problem of separation from hyperexistential risk; if we had U = V + W where V is a reward function & W is some arbitrary thing it wants to minimise (e.g. paperclips), a sign flip in V (due to any of a broad disjunction of causes) would still cause hyperexistential catastrophe.

Or what about the case where, instead of the AI maximising -U, the values that the reward function/model gives for each "thing" are multiplied by -1? E.g. the AI system gets 1 point for wireheading and -1 for torture, and some weird malware/human screw-up (in the reward model or some relevant database), etc. flips the signs for each individual action. The AI now maximises U = W - V.

This seems a lot more nuanced than *just* avoiding cosmic rays; and the potential consequences of a hellish "I have no mouth, and I must scream"-type are far worse than human extinction. I'm not happy with *any* non-negligible probability of this happening.

Comment by anirandis on Is there a possibility of being subjected to eternal torture by aliens? · 2020-09-04T17:25:20.601Z · score: 2 (1 votes) · LW · GW

I have similar anxieties over possible torture scenarios (although mine relate to AI instead of aliens), but this specifically seems somewhat unlikely to me. If you were to think of it the other way around, we'd be unlikely to have some weird sadistic "itch" to torture these aliens we found. Hell, we don't even know if aliens would suffer in the same way that we do; perhaps they evolved a completely separate and alien mechanism to alert themselves to injury. In the same way, human concepts like "suffering" would potentially be an alien concept to these supposedly evil aliens.


Also worth considering is that if they're far more intelligent than us, they'd likely view us as we do ants. There's little reason to believe that sadistic aliens would prefer torturing humans to dogs or insects, which are probably a similar distance away from the aliens on the intelligence scale as we are.


For this scenario to happen, we'd (1) need to be visited by aliens (it could've happened at any point in human history and is yet to happen, so seems unlikely); who (2) recognise and care about concepts like "suffering" that may seem arbitrary to them; (3) have some sort of sadistic or *extremely* vengeful attitude and desire to torture humans despite it seemingly being a waste of resources; and (4) fail to recognise that torturing a race over something like this is likely to be considerably unpopular with other aliens and could result in their own "punishment" of sorts.


Those 4 points alone create a conjunction of very low probability. It's possible, but we have more pressing concerns.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-09-04T14:24:32.429Z · score: 2 (1 votes) · LW · GW

I see what you're saying here, but the GPT-2 incident seems to downplay it somewhat IMO. I'll wait until you're able to write down your thoughts on this at length; this is something that I'd like to see elaborated on (as well as everything else regarding hyperexistential risk.)

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-09-03T14:45:43.408Z · score: 3 (2 votes) · LW · GW
Paperclipping seems to be negative utility, not approximately 0 utility.

My thinking was that an AI system that *only* takes values between 0 and +∞ (or some arbitrary positive number) would identify that killing humans would result in 0 human value, which is its minimum utility.


I read Eliezer's idea, and that strategy seems to be... dangerous. I think that "Giving an AGI a utility function which includes features which are not really relevant to human values" is something we want to avoid unless we absolutely need to.

How come? It doesn't seem *too* hard to create an AI that only expends a small amount of its energy on preventing the garbage thing from happening.


I have much more to say on this topic and about the rest of your comment, but it's definitely too much for a comment chain. I'll make an actual post containing my thoughts sometime in the next week or two, and link it to you.

Please do! I'd love to see a longer discussion on this type of thing.


EDIT: just thought some more about this and want to clear something up:

Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maximums. We may see this behavior in future attempts at AGI. And no, it would not be flipped with the reward function/model- I'm highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.

I'm a little unsure on this one after further reflection. When this happened with GPT-2, the bug managed to flip the reward & the system still pursued instrumental goals like exploring new strategies:

Bugs can optimize for bad behavior
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota’s Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.

So it definitely seems *plausible* for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-09-03T00:01:13.808Z · score: 4 (3 votes) · LW · GW
As an almost entirely inapplicable analogy . . . it's just doing something weird.
If we inverted the utility function . . . tiling the universe with smiley faces, i.e. paperclipping.

Interesting analogy. I can see what you're saying, and I guess it depends on what specifically gets flipped. I'm unsure about the second example; something like exploring new strategies doesn't seem like something an AGI would terminally value. It's instrumental to optimising the reward function/model, but I can't see it getting flipped *with* the reward function/model.

Can you clarify what you mean by this? Also, I get what you're going for, but paperclips is still extremely negative utility because it involves the destruction of humanity and the reconfiguration of the universe into garbage.

My thinking was that a sign-flipped AGI designed as a positive utilitarian (i.e. with a minimum at 0 human utility) would prefer paperclipping to torture because the former provides 0 human utility (as there aren't any humans), whereas the latter may produce a negligible amount. I'm not really sure if it makes sense tbh.

The reward modelling system would need to be very carefully engineered, definitely.

Even if we engineered it carefully, that doesn't rule out screw-ups. We need robust failsafe measures *just in case*, imo.

I thought of this as well when I read the post. I'm sure there's something clever you can do to avoid this but we also need to make sure that these sorts of critical components are not vulnerable to memory corruption. I may try to find a better strategy for this later, but for now I need to go do other things.

I wonder if you could feasibly make it a part of the reward model. Perhaps you could train the reward model itself to disvalue something arbitrary (like paperclips) even more than torture, which would hopefully mitigate it. You'd still need to balance it in a way such that the system won't spend all of its resources preventing this thing from happening at the neglect of actual human values, but that doesn't seem too difficult. Although, once again, we can't really have high confidence (>90%) that the AGI developers are going to think to implement something like this.

There was also an interesting idea I found in a Facebook post about this type of thing that got linked somewhere (can't remember where). Stuart Armstrong suggested that a utility function could be designed as such:

Let B1 and B2 be excellent, bestest outcomes. Define U(B1)=1, U(B2)=-1, and U=0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes. Or, more usefully, let X be some trivial feature that the agent can easily set to -1 or 1, and let U be a utility function with values in [0,1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.
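
The second construction in that quote can be checked on a toy example. This is only a sketch under my own assumptions: three hypothetical base worlds with made-up utilities in [0, 1], and an X that the agent can set freely and independently of which world it brings about. The names `BASE_WORLDS` and `choose` are illustrative, not from the original post:

```python
from itertools import product

# Hypothetical base worlds with utilities in [0, 1].
BASE_WORLDS = {"bad": 0.0, "mediocre": 0.4, "best": 1.0}
X_VALUES = (-1, 1)  # trivial feature the agent can set freely


def choose(sign):
    """Pick the (world, X) pair optimising sign * X * U.

    sign=+1 models an agent maximising XU; sign=-1 models one minimising it.
    """
    candidates = product(BASE_WORLDS, X_VALUES)
    return max(candidates, key=lambda c: sign * c[1] * BASE_WORLDS[c[0]])


maximiser = choose(+1)
minimiser = choose(-1)
print(maximiser)  # ('best', 1)
print(minimiser)  # ('best', -1)
```

Both agents pick the same base world and differ only in X, which is the property the quote describes. Whether a real utility function could be squeezed into this shape is a separate question that the sketch doesn't address.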

Even if we solve any issues with these (and actually bother to implement them), there's still the risk of an error like this happening in a localised part of the reward function such that *only* the part specifying something bad gets flipped, although I'm a little confused about this one. It could very well be the case that the system's complex enough that there isn't just one bit indicating whether "pain" or "suffering" is good or bad. And we'd presumably (hopefully) have checksums and whatever else thrown in. Maybe this could be mitigated by assigning more positive utility to good outcomes than negative utility to bad outcomes? (I'm probably speaking out of my rear end on this one.)


Memory corruption seems to be another issue. Perhaps if we have more than one measure we'd be less vulnerable to memory corruption. Like, if we designed an AGI with a reward model that disvalues two arbitrary things rather than just one, and memory corruption screwed with *both* measures, then something probably just went *very* wrong in the AGI and it probably won't be able to optimise for suffering anyway.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-09-02T15:53:13.140Z · score: 4 (3 votes) · LW · GW

Thanks for the detailed response. A bit of nitpicking (from someone who doesn't really know what they're talking about):

However, the vast majority of these mistakes would probably buff out or result in paper-clipping.

I'm slightly confused by this one. If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be *no* human utility (i.e. paperclips). But most attempts at an aligned AI would have a minimum at "I have no mouth, and I must scream". So any sign-flipping error would be expected to land there.
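
A rough sketch of what I mean, with entirely made-up numbers: each hypothetical outcome gets a (pleasure, suffering) pair, a "signed" utility subtracts suffering, and a "strict positive" utility only counts pleasure. The optimiser below minimises whatever utility it's handed, which stands in for a sign-flip error:

```python
# Hypothetical (pleasure, suffering) totals per outcome; the numbers are made up.
OUTCOMES = {
    "flourishing": (10.0, 0.0),
    "no_humans": (0.0, 0.0),   # e.g. paperclipping
    "torture": (0.01, 10.0),   # negligible pleasure, large suffering
}


def signed_utility(pleasure, suffering):
    # Suffering counts against the total, so the minimum is very negative.
    return pleasure - suffering


def positive_only_utility(pleasure, suffering):
    # Suffering is ignored, so the minimum possible value is 0.
    return pleasure


def sign_flipped_choice(utility):
    """Outcome chosen by an optimiser that minimises the utility it was meant to maximise."""
    return min(OUTCOMES, key=lambda name: utility(*OUTCOMES[name]))


print(sign_flipped_choice(signed_utility))         # torture
print(sign_flipped_choice(positive_only_utility))  # no_humans
```

The result depends on the torture outcome scoring at or above zero under the positive-only utility, which is an assumption of the toy model rather than something I can vouch for in a real system.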

If humans are making changes to the critical software/hardware of an AGI (And we'll assume you figured out how to let the AGI allow you to do this in a way that has no negative side effects), *while that AGI is already running*, something bizarre and beyond my abilities of prediction is already happening.

In the example, the AGI was using online machine learning, which, as I understand it, would probably require the system to be hooked up to a database that humans have access to in order for it to learn properly. And I'm unsure as to how easy it'd be for things like checksums to pick up an issue like this (a boolean flag getting flipped) in a database.
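
Here's a small sketch of why I'm unsure, assuming a hypothetical key-value store that keeps a SHA-256 digest alongside each record (the `write`/`is_intact` helpers and the `higher_is_better` flag are made up for illustration). A digest catches corruption that bypasses the normal write path, but a human changing the flag's meaning through the normal write path just produces a new, perfectly valid digest:

```python
import hashlib
import json


def checksum(record):
    """SHA-256 digest of a canonical JSON encoding of the record."""
    encoded = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def write(store, record):
    """Legitimate write path: stores the record together with a fresh digest."""
    store["record"] = dict(record)
    store["digest"] = checksum(record)


def is_intact(store):
    """True if the stored record still matches the stored digest."""
    return checksum(store["record"]) == store["digest"]


store = {}
write(store, {"higher_is_better": True})

# Case 1: corruption that bypasses the write path (e.g. a flipped bit in storage).
store["record"]["higher_is_better"] = False
corruption_detected = not is_intact(store)

# Case 2: a programmer changes the flag through the normal write path.
write(store, {"higher_is_better": False})
human_edit_detected = not is_intact(store)

print(corruption_detected)  # True
print(human_edit_detected)  # False
```

So integrity checks of this kind seem to cover the cosmic-ray style of failure, but not a well-formed change that happens to invert what downstream code thinks the flag means.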

Perhaps there'll be a reward function/model intentionally designed to disvalue some arbitrary "surrogate" thing in an attempt to separate it from hyperexistential risk. So "pessimizing the target metric" would look more like paperclipping than torture. But I'm unsure as to (1) whether the AGI's developers would actually bother to implement it, and (2) whether it'd actually work in this sort of scenario.

Also worth noting is that an AGI based on reward modelling is going to have to be linked to another neural network, which is going to have constant input from humans. If that reward model isn't designed to be separated in design space from AM, someone could screw up with the model somehow. If we were to, say, have U = V + W (where V is the reward given by the reward model and W is some arbitrary thing that the AGI disvalues, as is the case in Eliezer's Arbital post that I linked,) a sign flip-type error in V (rather than a sign flip in U) would lead to a hyperexistential catastrophe.

It will not be possible to flip the sign of the utility function or the direction of the updates to the reward model, even if several of the researchers on the project are actively trying to sabotage the effort and cause a hyperexistential disaster.

I think this is somewhat likely to be the case, but I'm not sure that I'm confident enough about it. Flipping the direction of updates to the reward model seems harder to prevent than a bit flip in a utility function, which could be prevented through error-correcting code memory (as you mentioned earlier.)


Despite my confusions, your response has definitely decreased my credence in this sort of thing happening.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-09-01T14:21:36.835Z · score: 2 (1 votes) · LW · GW

I've seen that post & discussed it on my shortform. I'm not really sure how effective something like Eliezer's idea of "surrogate" goals there would actually be - sure, it'd help with some sign flip errors but it seems like it'd fail on others (e.g. if U = V + W, a sign error could occur in V instead of U, in which case that idea might not work.) I'm also unsure as to whether the probability is truly "very tiny" as Eliezer describes it. Human errors seem much more worrying than cosmic rays.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-30T16:05:45.642Z · score: 6 (4 votes) · LW · GW

I don't really know what the probability is. It seems somewhat low, but I'm not confident that it's *that* low. I wrote a shortform about it last night (tl;dr it seems like this type of error could occur in a disjunction of ways and we need a good way of separating the AI in design space.)


I think I'd stop worrying about it if I were convinced that its probability is extremely low. But I'm not yet convinced of that. Something like the example Gwern provided elsewhere in this thread seems more worrying than the more frequently discussed cosmic ray scenarios to me.

Comment by anirandis on Anirandis's Shortform · 2020-08-29T20:23:46.126Z · score: 4 (2 votes) · LW · GW

It seems to me that ensuring we can separate an AI in design space from worse-than-death scenarios is perhaps the most crucial thing in AI alignment. I don’t at all feel comfortable with AI systems that are one cosmic ray - or, perhaps more plausibly, one human screw-up (e.g. this sort of thing) - away from a fate far worse than death. Or maybe a human-level AI makes a mistake and creates a sign-flipped successor. Perhaps there’s some sort of black swan possibility that nobody realises. I think that it’s absolutely critical that we have a robust mechanism in place to prevent something like this from happening regardless of the cause; sure, we can sanity-check the system, but that won’t help when the issue is caused after we’ve sanity-checked it, as is the case with cosmic rays or some human errors (like Gwern’s example, which I linked). We need ways to prevent this sort of thing from happening *regardless* of the source.

Some proposals seem promising, such as Eliezer’s suggestion of assigning a sort of “surrogate goal” that the AI hates more than torture, but not enough to devote all of its energy to attempting to prevent. But this would only work when the *entire* reward is what gets flipped; with how much confidence can we rule out, say, a localised sign flip in some specific part of the AI that leads to the system terminally valuing something bad but that doesn’t change anything else (so the sign on the “surrogate” goal remains negative)? Can we even be confident that the AI’s development team is going to implement something like this, and that it will work as intended?

An FAI that's one software bug or screw-up in a database away from AM is a far scarier possibility than a paperclipper, IMO.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-24T00:17:50.304Z · score: 2 (1 votes) · LW · GW

Sure, but I'd expect that a system as important as this would have people monitoring it 24/7.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-22T18:25:08.378Z · score: 3 (2 votes) · LW · GW

Do you think that this specific risk could be mitigated by some variant of Eliezer’s separation from hyperexistential risk or Stuart Armstrong's idea here:

Let B1 and B2 be excellent, bestest outcomes. Define U(B1) = 1, U(B2) = -1, and U = 0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
Or, more usefully, let X be some trivial feature that the agent can easily set to -1 or 1, and let U be a utility function with values in [0, 1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.

Or at least prevent sign flip errors from causing something worse than paperclipping?

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-21T15:20:31.598Z · score: 4 (2 votes) · LW · GW

I asked Rohin Shah about that possibility in a question thread about a month ago. I think he's probably right that this type of thing would only plausibly make it through the training process if the system's *already* smart enough to be able to think about this type of thing. And then on top of that there are still things like sanity checks which, while unlikely to pick up numerous errors, would probably notice a sign error. See also this comment:

Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect is not catastrophe; it’s just that the system doesn’t work. Another likely consequence is that the system is misaligned, but in an obvious way that makes it easy for developers to recognize that deployment is a very bad idea.

IMO it's incredibly important that we find a way to prevent this type of thing from occurring *after* the system has been trained, whether that be hyperexistential separation or something else. I think that a team that's safety-conscious enough to come up with a (reasonably) aligned AGI design is going to put a considerable amount of effort into fixing bugs & one as obvious as a sign error would be unlikely to make it through. And hopefully, even better, they would have come up with a utility function that can't be easily reversed by a single bit flip or doesn't cause outcomes worse than death when minimised. That'd (hopefully?) solve the SignFlip issue *regardless* of what causes it.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-19T02:53:17.257Z · score: 4 (2 votes) · LW · GW

I'm under the impression that an AGI would be monitored *during* training as well. So you'd effectively need the system to turn "evil" (utility function flipped) during the training process, and the system to be smart enough to conceal that the error occurred. So it'd need to happen a fair bit into the training process. I guess that's possible, but IDK how likely it'd be.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-19T02:18:07.571Z · score: 4 (2 votes) · LW · GW

Sure, but the *specific* type of error I'm imagining would surely be easier to pick up than most other errors. I have no idea what sort of sanity checking was done with GPT-2, but the fact that the developers were asleep when it trained is telling: they weren't being as careful as they could've been.

For this type of bug (a sign error in the utility function) to occur *before* the system is deployed and somehow persist, it'd have to make it past all sanity-checking tools (which I imagine would be used extensively with an AGI) *and* somehow not be noticed at all while the model trains *and* whatever else. Yes, these sorts of conjunctions occur in the real world but the error is generally more subtle than "system does the complete opposite of what it was meant to do".

I made a question post about this specific type of bug occurring before deployment a while ago and think my views have shifted significantly; it's unlikely that a bug as obvious as one that flips the sign of the utility function won't be noticed before deployment. Now I'm more worried about something like this happening *after* the system has been deployed.

I think a more robust solution to all of these sorts of errors would be something like the separation from hyperexistential risk article that I linked in my previous response. I optimistically hope that we're able to come up with a utility function that doesn't do anything worse than death when minimised, just in case.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-19T01:38:56.861Z · score: 4 (2 votes) · LW · GW

Wouldn't any configuration errors or updates be caught with sanity-checking tools though? Maybe the way I'm visualising this is just too simplistic, but any developers capable of creating an *aligned* AGI are going to be *extremely* careful not to fuck up. Sure, it's possible, but the most plausible cause of a hyperexistential catastrophe to me seems to be where a SignFlip-type error occurs once the system has been deployed.


Hopefully a system as crucially important as an AGI isn't going to have just one guy watching it who "takes a quick bathroom break". When the difference is literally Heaven and Hell (minimising human values), I'd consider only having one guy in a basement monitoring it to be gross negligence.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-19T00:51:36.380Z · score: 2 (1 votes) · LW · GW

If we actually built an AGI that optimised to maximise a loss function, wouldn't we notice long before deploying the thing?


I'd imagine that this type of thing would be sanity-checked and tested intensively, so signflip-type errors would predominantly be scenarios where the error occurs *after* deployment, like the one Gwern mentioned ("A programmer flips the meaning of a boolean flag in a database somewhere while not updating all downstream callers, and suddenly an online learner is now actively pessimizing their target metric.")

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-16T00:30:45.598Z · score: 2 (1 votes) · LW · GW

Interesting. Terrifying, but interesting.

Forgive me for my stupidity (I'm not exactly an expert in machine learning), but it seems to me that building an AGI linked to some sort of database like that in such a fashion (that some random guy's screw-up can effectively reverse the utility function completely) is a REALLY stupid idea. Would there not be a safer way of doing things?

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-15T22:52:10.208Z · score: 4 (2 votes) · LW · GW

Do you think that this type of thing could plausibly occur *after* training and deployment?

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-07T17:33:33.212Z · score: 4 (2 votes) · LW · GW

The scenario I'm imagining isn't an AGI that merely "gets rid of" humans. See SignFlip.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-06T15:45:37.501Z · score: 2 (1 votes) · LW · GW

Would it be likely for the utility function to flip *completely*, though? There's a difference between some drift in the utility function and the AI screwing up and designing a successor with the complete opposite of its utility function.

Comment by anirandis on Open & Welcome Thread - August 2020 · 2020-08-06T14:18:18.903Z · score: 10 (8 votes) · LW · GW

Is it plausible that an AGI could have some sort of vulnerability (a buffer overflow, maybe?) that could be exploited (maybe by an optimization daemon…?) to cause a sign flip in the utility function?

How about an error during self-improvement that leads to the same sort of outcome? Should we expect an AGI to sanity-check its successors, even if it’s only at or below human intelligence?

Sorry for the dumb questions, I’m just still nervous about this sort of thing.

Comment by anirandis on Open & Welcome Thread - July 2020 · 2020-07-25T14:37:14.748Z · score: 3 (2 votes) · LW · GW

Thanks for your response, just a few of my thoughts on your points:

If you *can* stop doing philosophy and futurism

To be honest, I've never really *wanted* to be involved with this. I only really made an account here *because* of my anxieties; I wanted to try to talk myself through them.

If an atom-for-atom identical copy of you, *is* you, and an *almost* identical copy is *almost* you, then in a sufficiently large universe where all possible configurations of matter are realized, it makes more sense to think about the relative measure of different configurations rather than what happens to "you".

I don't buy that theory of personal identity. It seems to me that if the biological me that's sitting here right now isn't *feeling* the pain, that's not worth worrying about as much. Like, I can *imagine* that a version of me might be getting tortured horribly or experiencing endless bliss, but my consciousness doesn't (as far as I can tell) "jump" over to those versions. Similarly, were *I* to get tortured, it's unlikely that I'd care about what's happening to the "other" versions of me. The "continuity of consciousness" theory *seems* stronger to me, although admittedly it's not something I've put a lot of thought into. I wouldn't want to use a teleporter for the same reasons.

*And* there are evolutionary reasons for a creature like you to be *more* unable to imagine the scope of the great things.

Yes, I agree that it's possible that the future could be just as good as an infinite-torture future would be bad, and that my intuitions are somewhat lopsided. But I do struggle to find that comforting. Were an infinite-torture future realised (whether through a SignFlip error, an insane neuromorph, etc.), the fact that I could've ended up in a utopia wouldn't console me one bit.

Comment by anirandis on Open & Welcome Thread - July 2020 · 2020-07-25T02:46:46.236Z · score: 3 (2 votes) · LW · GW

As anyone could tell from my posting history, I've been obsessing & struggling psychologically recently when evaluating a few ideas surrounding AI (what if we make a sign error on the utility function, malevolent actors creating a sadistic AI, AI blackmail scenarios, etc.). It's predominantly selfish worry about things like s-risks happening to me, or AI going wrong so that I have to live in a dystopia and can't commit suicide. I don't worry about human extinction (although I don't think that'd be a good outcome, either!)


I'm wondering if anyone's gone through similar anxieties and has found a way to help control them? I'm diagnosed ASD and I wouldn't consider it unlikely that I've got OCD or something similar on top of it, so it's possibly just that playing up.

Comment by anirandis on Likelihood of hyperexistential catastrophe from a bug? · 2020-07-23T14:36:58.410Z · score: 1 (1 votes) · LW · GW
Not really, because it takes time to train the cognitive skills necessary for deception.

Would that not be the case with *any* form of deceptive alignment, though? Surely it (deceptive alignment) wouldn't pose a risk at all if that were the case? Sorry in advance for my stupidity.

Comment by anirandis on Likelihood of hyperexistential catastrophe from a bug? · 2020-07-23T02:34:34.109Z · score: 1 (1 votes) · LW · GW

Sorry for the dumb question a month after the post, but I've just found out about deceptive alignment. Do you think it's plausible that a signflipped AGI could fake being an FAI in the training stage, just to take a treacherous turn at deployment?

Comment by anirandis on ‘Maximum’ level of suffering? · 2020-06-22T08:28:58.559Z · score: 1 (1 votes) · LW · GW

It’s more a selfish worry, tbh. I don’t buy that pleasure being unlimited can cancel it out though - even if I were promised a 99.9% chance of Heaven and 0.1% chance of Hell, I still wouldn’t want both pleasure and pain to be potentially boundless.

Comment by anirandis on ‘Maximum’ level of suffering? · 2020-06-21T13:07:28.434Z · score: 1 (1 votes) · LW · GW

I do agree that they’re symmetrical. I just find it worrying that I could potentially experience such enormous amounts of pain, even when the opposite is also a possibility.

Comment by anirandis on ‘Maximum’ level of suffering? · 2020-06-20T23:49:02.503Z · score: 1 (1 votes) · LW · GW
I'd still expect a reasonable utility function to *cap* the (dis)utility of pain. If it didn't, the (possible) torture of just one creature capable of experiencing arbitrary amounts/degrees/levels of pain would effectively be 'Pascal's hostage'

I suppose I never thought about that, but I'm not entirely sure how it'd work in practice. Since the AGI could never be 100% certain that the pain it's causing is at its maximum, it might further increase pain levels, just to *make sure* that it's hitting the maximum level of disutility.


It also seems unclear why evolution would result in creatures able to experience pain more intensely than such a maximum.

I think part of what worries me is that, even if we had a "maximum" amount of pain, it'd be hypothetically possible for humans to be re-wired to remove that maximum. I'd think that I'd still be the same person experiencing the same consciousness *after* being rewired, which is somewhat troubling.


If the pain a superintelligence can cause scales linearly or better with computational power, then the thought is even more terrifying.


Overall, you make some solid points that I wouldn't have considered otherwise.

Comment by anirandis on ‘Maximum’ level of suffering? · 2020-06-20T20:19:56.410Z · score: 1 (1 votes) · LW · GW

I think it's the modifying humans to experience pain part that's the most terrifying, to be honest.

Comment by anirandis on ‘Maximum’ level of suffering? · 2020-06-20T19:40:56.328Z · score: 1 (1 votes) · LW · GW
Real-life animals can and do die of shock, which seems *like* it might be some maximum 'pain' threshold being exceeded.

In theory, would it not be possible for, say, a malevolent superintelligence to "override" any possibility of a "shock" reaction, and prevent the brain from shutting down? Wouldn't that allow for ridiculous amounts of agony?


It seems plausible to me that a sufficiently powerful agent could create some form of ever-growing agony by expanding subjects' pain centres to maximise pain; the limit, the point where most of the matter in the universe is part of someone's pain centre, seems incredibly scary. I sincerely hope there's good reason to believe that a hypothetical "evil" superintelligence would get diminishing returns quite quickly.

Comment by anirandis on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-19T21:27:03.975Z · score: 1 (1 votes) · LW · GW

I think it's also a case of us (or at least me) not yet being convinced that the probability is <= 10^-6, especially with something as uncertain as this. My credence in such a scenario happening has also decreased a fair bit over the course of this thread, but I remain unconvinced overall.


And even then, 1 in a million isn't *that* unlikely - it's massive compared to the likelihood that a mugger is actually a God. I'm not entirely sure how low it would have to be for me to dismiss it as "Pascalian", but 1 in a million still feels far too high.
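The arithmetic behind that intuition can be sketched as a plain expected-value calculation. The magnitudes below are arbitrary illustrative units, not estimates of anything, and `expected_disutility` is a hypothetical helper name. The sketch only shows that whether a 1-in-a-million probability can be dismissed depends on how large the outcome's disutility is assumed to be.

```python
from fractions import Fraction


def expected_disutility(probability: Fraction, magnitude: int) -> Fraction:
    """Expected loss = probability of the outcome times how bad it is."""
    return probability * magnitude


# Illustrative values only (arbitrary units).
one_in_a_million = Fraction(1, 10**6)
for magnitude in (10**3, 10**6, 10**12):
    loss = expected_disutility(one_in_a_million, magnitude)
    print(f"magnitude {magnitude:>14,} -> expected disutility {float(loss):,.3f}")
```

Exact fractions are used so the comparison isn't muddied by floating-point rounding; if the disutility is unbounded, no fixed probability threshold makes the product small, which is the worry being expressed above.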

Comment by anirandis on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-19T18:39:44.188Z · score: 1 (1 votes) · LW · GW

I think a probability of ~1/30,000 is still way too high for something as bad as this (with near-infinite negative utility). I sincerely hope that it’s much lower.

Comment by anirandis on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-19T13:41:21.449Z · score: 1 (1 votes) · LW · GW

Everything about the AGI, loosely speaking, has to be near-perfect except for that one bit.

Isn’t this exactly what happened with the GPT-2 bug, which led to maximally ‘bad’ output? Would that not suggest that the probability of this occurring with an AGI is non-negligible?

Comment by anirandis on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-18T22:02:20.635Z · score: 1 (1 votes) · LW · GW

All of these worry me as well. It simply doesn't console me enough to think that we "will probably notice it".

Comment by anirandis on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-18T21:36:05.138Z · score: 1 (1 votes) · LW · GW

Can we be sure that we'd pick it up during the training process, though? And would it be possible for it to happen after the training process?

Comment by anirandis on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-18T21:22:47.093Z · score: 1 (1 votes) · LW · GW

What do you think the difference would be between an AGI's reward function, and that of GPT-2 during the error it experienced?

Comment by anirandis on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-18T20:00:23.910Z · score: 3 (1 votes) · LW · GW

Surely with a sufficiently hard take-off it would be possible for the AI to prevent itself from being turned off? If not, couldn’t the AI just deceive its creators into thinking that no signflip has occurred (e.g. by making it look like it’s gaining utility from doing something beneficial to human values when it’s actually losing it)? How would we be able to determine that it’s happened before it’s too late?

Further to that, what if this fuck-up happens during an arms race when its creators haven’t put enough time into safety to prevent this type of thing from happening?

Comment by Anirandis on [deleted post] 2020-06-04T22:14:46.342Z

The American government is shit, don't get me wrong, but in modern times it's much better than some others. The US government isn't currently ethnically cleansing anyone and hasn't done for a while now. I'm a lot less worried about them doing anything than certain other governments.