Dach's Shortform 2020-09-24T07:17:48.478Z · score: 2 (1 votes)
Outcome Terminology? 2020-09-14T18:04:05.048Z · score: 6 (3 votes)


Comment by dach on Dach's Shortform · 2020-09-24T18:29:14.016Z · score: 1 (1 votes) · LW · GW

Right, that isn't an exhaustive list. I included the candidates which seemed most likely.

So, I think superintelligence is unlikely in general- but so is current civilization. I think superintelligences have a high occurrence rate given current civilization (for lots of reasons), which also means that current civilization isn't that much more likely than superintelligence. It's more justified to say "Superintelligences which make human minds" have a super low occurrence rate relative to natural examples of me and my environment, but that still seems to be an unlikely explanation.

Based on the "standard" discussion on this topic, I get the distinct impression that the probability our civilization will construct an aligned superintelligence is significantly greater than, for example, 10^-20%, and the large amounts of leverage that a superintelligence would have (There's lots of matter out there) would produce this same effect.

Comment by dach on Dach's Shortform · 2020-09-24T07:17:48.785Z · score: 1 (1 votes) · LW · GW

Warning: Weird Stuff. I'm writing this in shortform for a reason- I really don't know what I'm talking about here. It's some of the most "delusional-feeling" thinking I've done, which is some tough competition.


So, let us imagine our universe is "big" in the sense of many worlds, and all realities compatible with the universal wavefunction are actualized- or at least something close to that. This seems increasingly likely to me.

I've been thinking about a problem for a week or so now, and I just stumbled upon that old post. I was surprised other people had been thinking about this, to say the least. (Doubly surprising- Scott said many of my exact ideas that I expected were rather unique, and also said that he's found that he's surprisingly similar to other people from circa 2012 lesswrong. Ouch!).

Aligned superintelligences might permute through all possible human minds, for various reasons. They might be especially interested in permuting through minds which are thinking about the AI alignment problem- also for various reasons. 

As a result, it's not evident to me for normal reasons that most of the "measure" of me-right-now flows into "normal" things- it seems plausible (on a surface level- I have some arguments against this) that most of the "measure" of me should be falling into Weird Stuff. Future superintelligences control the vast majority of all of the measure of everything (more branches in the future, more future in general, etc.), and they're probably more interested in building minds from scratch and then doing weird things with them.

if, among all of the representations of my algorithm in reality, (100%) * (1 - 10^-30) of my measure was "normal stuff", I'd still expect to be "diverted" basically instantly, if we assume there's one opportunity for a "diversion" every planck second.

However, this is, of course, not happening. We are constantly avoiding waking up from the simulation.

Possible explanations:

  • The world is small. This seems unlikely- look at these conditions:
  1. Many worlds is wrong.
  2. The universe is finite and not arbitrarily large.
  3. There's no multiverse, or the multiverse is small, or other universes all have properties which mean they don't support the potential for diversion. e.g. their laws are sufficiently different where none of them will contain human algorithms in great supply,
  4. There's no way to gain infinite energy, or truly arbitrarily large amounts of energy.
  • The sum of "Normal universe simulations" vastly outweighs the sum of "continuity of experience stealing simulations", for some reason. Maybe there are lots of different intelligent civilizations in the game, and lots of us spawn general universe simulators, and we also tend to cut simulations when other superintelligences arrive.
  • Superintelligences are taking deliberate action to prevent "diversion" tactics from being effective, or predict that other superintelligences are taking these actions. For example, if I don't like the idea of diverting people, I might snipe for recently diverted sentients and place them back in a simulation consistent with a "natural" environment.
  • "Diversion" as a whole isn't possible, and my understanding of how identity and experience work is sketchy.
  • Some other sort of misunderstanding hidden in my assumptions or premise.

This whole line of thinking seems crooked, so I'll refrain from writing any more.


Plot for a good sci-fi novel. The Simulation Hypothesis 2, electric boogaloo.

Comment by dach on Making the Monte Hall problem weirder but obvious · 2020-09-18T00:45:28.514Z · score: 3 (2 votes) · LW · GW

Amusing anecdote: I once tried to give my mother intuition behind Monte Hall with a process similar to this. She didn't quite get it, so I played the game with her a few times. Unfortunately, she won more often when she stayed than when she switched (n ~= 10), and decided that I was misremembering. A lesson was learned, but not by the person I had intended.

Comment by dach on Why haven't we celebrated any major achievements lately? · 2020-09-10T12:36:59.480Z · score: 3 (3 votes) · LW · GW

Scientific and industrial progress is an essential part of modern life. The opening of a new extremely long suspension bridge would be entirely unsurprising- If it was twice the length of the previous longest, I might bother to read a short article about it. I would assume there would be some local celebration (Though not too much- if it was too well received, why did we not do it before?), but it would not be a turning point in technology or a grand symbol of man's triumph over nature. We've been building huge awe inspiring structures for quite some time by now, and the awe has worn off. Innovation and progress is normal.

Celebration in terms of "Bells are ringing and the people are weeping and philosophizing" requires complete upsets. Reusable rockets, manned missions to mars, a COVID-19 vaccine, etc- those are all part of the current state of affairs. If humanity wants these things, and has the time, I know they will come.

Comment by dach on Open & Welcome Thread - August 2020 · 2020-09-04T10:01:15.647Z · score: 3 (2 votes) · LW · GW

So it definitely seems plausible for a reward to be flipped without resulting in the system failing/neglecting to adopt new strategies/doing something weird, etc.

I didn't mean to imply that a signflipped AGI would not instrumentally explore.

I'm saying that, well... modern machine learning systems often get specific bonus utility for exploring, because it's hard to explore the proper amount as an instrumental goal due to the difficulties of fully modelling the situation, and because systems which don't have this bonus will often get stuck in local maximums.

Humans exhibit this property too. We have investigating things, acquiring new information, and building useful strategic models as a terminal goal- we are "curious".

This is a feature we might see in early stages of modern attempts at full AGI, for similar reasons to why modern machine learning systems and humans exhibit this same behavior.

Presumably such features would be built to uninstall themselves after the AGI reaches levels of intelligence sufficient to properly and fully explore new strategies as an instrumental goal to satisfying the human utility function, if we do go this route.

If we sign flipped the amount of reward the AGI gets from such a feature, the AGI would be penalized for exploring new strategies- this may have any number of effects which are fairly implementation specific and unpredictable. However, it probably wouldn't result in hyperexistential catastrophe. This AI, providing everything else works as intended, actually seems to be perfectly aligned. If performed on a subhuman seed AI, it may brick- in this trivial case, it is neither aligned nor misaligned- it is an inanimate object.

Yes, an AGI with a flipped utility function would pursue its goals with roughly the same level of intelligence.

The point of this argument is super obvious, so you probably thought I was saying something else. I'm going somewhere with this, though- I'll expand later.

Comment by dach on Open & Welcome Thread - August 2020 · 2020-09-03T11:04:02.022Z · score: 3 (2 votes) · LW · GW

Interesting analogy. I can see what you're saying, and I guess it depends on what specifically gets flipped. I'm unsure about the second example; something like exploring new strategies doesn't seem like something an AGI would terminally value. It's instrumental to optimising the reward function/model, but I can't see it getting flipped with the reward function/model.

Sorry, I meant instrumentally value. Typo. Modern machine learning systems often require a specific incentive in order to explore new strategies and escape local maximums. We may see this behavior in future attempts at AGI, And no, it would not be flipped with the reward function/model- I'm highlighting that there is a really large variety of sign flip mistakes and most of them probably result in paperclipping.

My thinking was that a signflipped AGI designed as a positive utilitarian (i.e. with a minimum at 0 human utility) would prefer paperclipping to torture because the former provides 0 human utility (as there aren't any humans), whereas the latter may produce a negligible amount. I'm not really sure if it makes sense tbh.

Paperclipping seems to be negative utility, not approximately 0 utility. It involves all the humans being killed and our beautiful universe being ruined. I guess if there are no humans, there's no utility in some sense, but human values don't actually seem to work that way. I rate universes where humans never existed at all and

I'm... not sure what 0 utility would look like. It's within the range of experiences that people experience on modern-day earth- somewhere between my current experience and being tortured. This is just definition problems, though- We could shift the scale such that paperclipping is zero utility, but in that case, we could also just make an AGI that has a minimum at paperclipping levels of utility.

Even if we engineered it carefully, that doesn't rule out screw-ups. We need robust failsafe measures just in case, imo.

In the context of AI safety, I think "robust failsafe measures just in case" is part of "careful engineering". So, we agree!

You'd still need to balance it in a way such that the system won't spend all of its resources preventing this thing from happening at the neglect of actual human values, but that doesn't seem too difficult.

I read Eliezer's idea, and that strategy seems to be... dangerous. I think that "Giving an AGI a utility function which includes features which are not really relevant to human values" is something we want to avoid unless we absolutely need to.

I have much more to say on this topic and about the rest of your comment, but it's definitely too much for a comment chain. I'll make an actual post on this containing my thoughts sometime in the next week or two, and link it to you.

Comment by dach on Open & Welcome Thread - August 2020 · 2020-09-02T20:25:32.683Z · score: 4 (3 votes) · LW · GW

I'm slightly confused by this one. If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be no human utility (i.e. paperclips). But most attempts at an aligned AI would have a minimum at "I have no mouth, and I must scream". So any sign-flipping error would be expected to land there.

It's hard to talk in specifics because my knowledge on the details of what future AGI architecture might look like is, of course, extremely limited.

As an almost entirely inapplicable analogy (which nonetheless still conveys my thinking here): consider the sorting algorithm for the comments on this post. If we flipped the "top-scoring" sorting algorithm to sort in the wrong direction, we would see the worst-rated posts on top, which would correspond to a hyperexistential disaster. However, if we instead flipped the effect that an upvote had on the score of a comment to negative values, it would sort comments which had no votes other than the default vote assigned on posting the comment to the top. This corresponds to paperclipping- it's not minimizing the intended function, it's just doing something weird.

If we inverted the utility function, this would (unless we take specific measures to combat it like you're mentioning) lead to hyperexistential disaster. However, if we invert some constant which is meant to initially provide value for exploring new strategies while the AI is not yet intelligent enough to properly explore new strategies as an instrumental goal, the AI would effectively brick itself. It would place negative value on exploring new strategies, presumably including strategies which involve fixing this issue so it can acquire more utility and strategies which involve preventing the humans from turning it off. If we had some code which is intended to make the AI not turn off the evolution of the reward model before the AI values not turning off the reward model for other reasons (e.g. the reward model begins to properly model how humans don't want the AI to turn the reward model evolution process off), and some crucial sign was flipped which made it do the opposite, the AI would freeze the process of the reward model being updated and then maximize whatever inane nonsense its model currently represented, and it would eventually run into some bizarre previously unconsidered and thus not appropriately penalized strategy comparable to tiling the universe with smiley faces, i.e. paperclipping.

These are really crude examples, but I think the argument is still valid. Also, this argument doesn't address the core concern of "What about the things which DO result in hypexistential disaster", it just establishes that much of the class of mistake you may have previously thought usually or always resulted in hyperexistential disaster (sign flips on critical software points) in fact usually causes paperclipping or the AI bricking itself.

If we were to design the AI as a strict positive utilitarian (or something similar), I could see how the worst possible thing to happen to it would be no human utility (i.e. paperclips).

Can you clarify what you mean by this? Also, I get what you're going for, but paperclips is still extremely negative utility because it involves the destruction of humanity and the reconfiguration of the universe into garbage.

Perhaps there'll be a reward function/model intentionally designed to disvalue some arbitrary "surrogate" thing in an attempt to separate it from hyperexistential risk. So "pessimizing the target metric" would look more like paperclipping than torture. But I'm unsure as to (1) whether the AGI's developers would actually bother to implement it, and (2) whether it'd actually work in this sort of scenario.

I sure hope that future AGI developers can be bothered to embrace safe design!

Also worth noting is that an AGI based on reward modelling is going to have to be linked to another neural network, which is going to have constant input from humans. If that reward model isn't designed to be separated in design space from AM, someone could screw up with the model somehow.

The reward modelling system would need to be very carefully engineered, definitely.

If we were to, say, have U = V + W (where V is the reward given by the reward model and W is some arbitrary thing that the AGI disvalues, as is the case in Eliezer's Arbital post that I linked,) a sign flip-type error in V (rather than a sign flip in U) would lead to a hyperexistential catastrophe.

I thought this as well when I read the post. I'm sure there's something clever you can do to avoid this but we also need to make sure that these sorts of critical components are not vulnerable to memory corruption. I may try to find a better strategy for this later, but for now I need to go do other things.

I think this is somewhat likely to be the case, but I'm not sure that I'm confident enough about it. Flipping the direction of updates to the reward model seems harder to prevent than a bit flip in a utility function, which could be prevent through error-correcting code memory (as you mentioned earlier.)

Sorry, I meant to convey that this was a feature we're going to want to ensure that future AGI efforts display, not some feature which I have some other independent reason to believe would be displayed. It was an extension of the thought that "Our method will, ideally, be terrorist proof."

Comment by dach on Open & Welcome Thread - August 2020 · 2020-09-02T08:25:35.062Z · score: 4 (3 votes) · LW · GW

You can't really be accidentally slightly wrong. We're not going to develop Mostly Friendly AI, which is Friendly AI but with the slight caveat that it has a slightly higher value on the welfare of shrimp than desired, with no other negative consequences. The molecular sorts of precision needed to get anywhere near the zone of loosely trying to maximize or minimize for anything resembling human values will probably only follow from a method that is converging towards the exact spot we want it to be at, such as some clever flawless version of reward modelling.

In the same way, we're probably not going to accidentally land in hyperexistential disaster territory. We could have some sign flipped, our checksum changed, and all our other error-correcting methods (Any future seed AI should at least be using ECC memory, drives in RAID 10, etc.) defeated by religious terrorists, cosmic rays, unscrupulous programmers, quantum fluctuations, etc. However, the vast majority of these mistakes would probably buff out or result in paper-clipping. If an FAI has slightly too high of a value assigned to the welfare of shrimp, it will realize this in the process of reward modelling and correct the issue. If its operation does not involve the continual adaptation of the model that is supposed to represent human values, it's not using a method which has any chance of converging to Overwhelming Victory or even adjacent spaces for any reason other than sheer coincidence.

A method such as this has, barring stuff which I need to think more about (stability under self-modification), no chance of ending up in a "We perfectly recreated human values... But placed an unreasonably high value on eating bread! Now all the humans will be force-fed bread until the stars burn out! Mwhahahahaha!" sorts of scenarios. If the system cares about humans being alive enough to not reconfigure their matter into something else, we're probably using a method which is innately insulated from most types of hyperexistential risk.

It's not clear that Gwern's example, or even that category of problem, is particularly relevant to this situation. Most parallels to modern-day software systems and the errors they are prone to are probably best viewed as sobering reminders, not specific advice. Indeed, I suspect his comment was merely a sobering reminder and not actual advice. If humans are making changes to the critical software/hardware of an AGI (And we'll assume you figured out how to let the AGI allow you to do this in a way that has no negative side effects), while that AGI is already running, something bizarre and beyond my abilities of prediction is already happening. If you need to make changes after you turn your AGI on, you've already lost. If you don't need to make changes and you're making changes, you're putting humanity in unnecessary risk. At this point, if we've figured out how to assist the seed AI in self-modification, at least until the point at which it can figure out how to do stable self-modification for itself, the problem is already solved. There's more to be said here, but I'll refrain for the purpose of brevity.

Essentially, we can not make any ordinary mistake. The type of mistake we would need to make in order to land up in hyperexistential disaster territory would, most likely, be an actual, literal sign flip scenario, and such scenarios seem much easier to address. There will probably only be a handful of weak points for this problem, and those weak points are all already things we'd pay extra super special attention to and will engineer in ways which make it extra super special sure nothing goes wrong. Our method will, ideally, be terrorist proof. It will not be possible to flip the sign of the utility function or the direction of the updates to the reward model, even if several of the researchers on the project are actively trying to sabotage the effort and cause a hyperexistential disaster.

I conjecture that most of the expected utility gained from combating the possibility of a hyperexistential disaster lies in the disproportionate positive effects on human sanity and the resulting improvements to the efforts to avoid regular existential disasters, and other such side-benefits.

None of this is intended to dissuade you from investigating this topic further. I'm merely arguing that a hyperexistential disaster is not remotely likely- not that it is not a concern. The fact that people will be concerned about this possibility is an important part of why the outcome is unlikely.

Comment by dach on Open & Welcome Thread - August 2020 · 2020-08-30T13:58:46.671Z · score: 4 (3 votes) · LW · GW

If you're having significant anxiety from imagining some horrific I-have-no-mouth-and-I-must-scream scenario, I recommend that you multiply that dread by a very, very small number, so as to incorporate the low probability of such a scenario. You're privileging this supposedly very low probability specific outcome over the rather horrifically wide selection of ways AGI could be a cosmic disaster.

This is, of course, not intended to dismay you from pursuing solutions to such a disaster.

Comment by dach on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-18T20:27:40.715Z · score: 3 (2 votes) · LW · GW

In this specific example, the error becomes clear very early on in the training process. The standard control problem issues with advanced AI systems don't apply in that situation.

As for the arms race example, building an AI system of that sophistication to fight in your conflict is like building a Dyson Sphere to power your refrigerator. Friendly AI isn't the sort of thing major factions are going to want to fight with each other over. If there's an arm's race, either something delightfully improbable and horrible has happened, or it's an extremely lopsided "race" between a Friendly AI faction and a bunch of terrorist groups.

EDIT (From two months in the future...): I am not implying that such a race would be an automatic win, or even a likely win, for said hypothesized Friendly AI faction. For various reasons, this is most certainly not the case. I'm merely saying that the Friendly AI faction will have vastly more resources than all of its competitors combined, and all of its competitors will be enemies of the world at large, etc.

Addressing this whole situation would require actual nuance. This two month old throw away comment is not the place to put that nuance. And besides, it's been done before.