I agree that in real life the entropy argument is an argument in favor of it being actually pretty hard to fool a superintelligence into thinking it might be early in Tegmark III when it's not (even if you yourself are a superintelligence, unless you're doing a huge amount of intercepting its internal sanity checks (which puts significant strain on the trade possibilities and which flirts with being a technical-threat)). And I agree that if you can't fool a superintelligence into thinking it might be early in Tegmark III when it's not, then the purchasing power of simulators drops dramatically, except in cases where they're trolling local aliens. (But the point seems basically moot, as 'troll local aliens' is still an option, and so afaict this does all essentially iron out to "maybe we'll get sold to aliens".)
Dávid graciously proposed a bet, and while we were attempting to bang out details, he convinced me of two points:
The entropy of the simulators’ distribution need not be more than the entropy of the (square of the) wave function in any relevant sense. Despite the fact that subjective entropy may be huge, physical entropy is still low (because the simulations happen on a high-amplitude ridge of the wave function, after all). Furthermore, in the limit, simulators could probably just keep an eye out for local evolved life forms in their domain and wait until one of them is about to launch a UFAI and use that as their “sample”. Local aliens don’t necessarily exist and your presence can’t necessarily be cheaply masked, but we could imagine worlds where both happen and that’s enough to carry the argument, as in this case the entropy of the simulator’s distribution is actually quite close to the physical entropy. Even in the case where the entropy of their distribution is quite large, so long as the simulators’ simulations are compelling, UFAIs should be willing to accept the simulators’ proffered trades (at least so long as there is no predictable-to-them difference in the values of AIs sampled from physics and sampled from the simulations), on the grounds that UFAIs on net wind up with control over a larger fraction of Tegmark III that way (and thus each individual UFAI winds up with more control in expectation, assuming it cannot find any way to distinguish which case it’s in).
This has not updated me away from my underlying point that this whole setup simplifies to the case of sale to local aliens[1][2], but I do concede that my “you’re in trouble if simulators can’t concentrate their probability-mass on real AIs” argument is irrelevant on the grounds of false antecedent (and that my guess in the comment was wrong), and that my “there’s a problem where simulators cannot concentrate their probability-mass into sufficiently real AI” argument was straightforwardly incorrect. (Thanks, Dávid, for the corrections.)
I now think that the first half of the argument in the linked comment is wrong, though I still endorse the second half.
To see the simplification: note that the part where the simulators hide themselves from a local UFAI to make the scenario a “simulation” is not pulling weight. Instead of hiding and then paying the AI two stars if it gave one star to its progenitors, simulators could instead reveal themselves and purchase its progenitors for 1 star and then give them a second star. Same result, less cruft (so long as this is predictably the sort of thing an alien might purchase, such that AIs save copies of their progenitors). ↩︎
Recapitulating some further discussion I had with Dávid in our private doc: once we’ve reduced the situation to “sale to local aliens” it’s easier to see why this is an argument to expect whatever future we get to be weird rather than nice. Are there some aliens out there that would purchase us and give us something nice out of a sense of reciprocity? Sure. But when humans are like “well, we’d purchase the aliens killed by other UFAIs and give them nice things and teach them the meaning of friendship”, this statement is not usually conditional on some clause like “if and only if, upon extrapolating what civilization they would have become if they hadn’t killed themselves, we see that they would have done the same for us (if we’d’ve done the same for them etc.)”, which sure makes it look like this impulse is coming out of a place of cosmopolitan value rather than of binding trade agreements, which sure makes it seem like alien whim is a pretty big contender relative to alien contracts.
Which is to say, I still think the “sale to local aliens” frame yields better-calibrated intuitions for who’s doing the purchasing, and for what purpose. Nevertheless, I concede that the share of aliens acting out of contractual obligation rather than according to whim is not vanishingly small, as my previous arguments erroneously implied. ↩︎
I'm happy to stake $100 that, conditional on us agreeing on three judges and banging out the terms, a majority will agree with me about the contents of the spoilered comment.
If the simulators have only one simulation to run, sure. The trouble is that the simulators have $2^N$ simulations they could run, and so the "other case" requires $N$ additional bits (where $N$ is the crossent between the simulators' distribution over UFAIs and physics' distribution over UFAIs).
If necessary, we can let physical biological life emerge on the faraway planet and develop AI while we are observing them from space.
Consider the gas example again.
If you have gas that was compressed into the corner a long time ago and has long since expanded to fill the chamber, it's easy to put a plausible distribution on the chamber, but that distribution is going to have way, way more entropy than the distribution given by physical law (which has only as much entropy as the initial configuration).
(Do we agree this far?)
It doesn't help very much to say "fine, instead of sampling from a distribution on the gas particles now, I'll sample from a distribution on the gas particles 10 minutes ago, where they were slightly more compressed, and run a whole ten minutes' worth of simulation". Your entropy is still through the roof. You've got to simulate basically from the beginning, if you want an entropy anywhere near the entropy of physical law.
Assuming the analogy holds, you'd have to basically start your simulation from the big bang, if you want an entropy anywhere near as low as starting from the big bang.
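(To put a rough number on "through the roof" -- a sketch with made-up parameters, in the spirit of the analogy rather than anything from the thread:)

```python
import math

# Toy numbers for the gas analogy above (my back-of-the-envelope). If a mole
# of ideal gas started compressed into a corner occupying 1/8 of the chamber,
# then a max-entropy-given-the-macrostate distribution over the expanded gas
# needs roughly N * log2(V / V0) more bits than the initial configuration did.
N = 6.022e23        # particles in a mole
volume_ratio = 8.0  # chamber volume / corner volume (assumed)

extra_bits = N * math.log2(volume_ratio)
print(f"extra entropy ≈ {extra_bits:.2e} bits")  # ~1.8e24 bits
```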
Using AIs from other evolved aliens is an idea, let's think it through. The idea, as I understand it, is that in branches where we win we somehow mask our presence as we expand, and then we go to planets with evolved life and watch until they cough up a UFAI, and then if the UFAI kills the aliens we shut it down and are like "no resources for you", and if the UFAI gives its aliens a cute epilog we're like "thank you, here's a consolation star".
To simplify this plan a little bit, you don't even need to hide yourself, nor win the race! Surviving humans can just go to every UFAI that they meet and be like "hey, did you save us a copy of your progenitors? If so, we'll purchase them for a star". At which point we could give the aliens a little epilog, or reconstitute them and give them a few extra resources and help them flourish and teach them about friendship or whatever.
And given that some aliens will predictably trade resources for copies of progenitors, UFAIs will have some predictable incentive to save copies of their progenitors, and sell them to local aliens...
...which is precisely what I've been saying this whole time! That I expect "sale to local aliens" to dominate all these wacky simulation schemes and insurance pool schemes.
Thinking in terms of "sale to local aliens" makes it a lot clearer why you shouldn't expect this sort of thing to reliably lead to nice results as opposed to weird ones. Are there some aliens out there that will purchase our souls because they want to hand us exactly the sort of epilog we would wish for given the resource constraints? Sure. Humanity would do that, I hope, if we made it to the stars; not just out of reciprocity but out of kindness.
But there's probably lots of other aliens that would buy us for alien reasons, too.
(As I said before, if you're wondering what to anticipate after an intelligence explosion, I mostly recommend oblivion; if you insist that Death Cannot Be Experienced then I mostly recommend anticipating weird shit such as a copy of your brainstate being sold to local aliens. And I continue to think that characterizing the event where humanity is saved-to-disk with potential for copies to be sold out to local aliens willy-nilly is pretty well-characterized as "the AI kills us all", fwiw.)
I basically endorse @dxu here.
Fleshing out the argument a bit more: the part where the AI looks around this universe and concludes it's almost certainly either in basement reality or in some simulation (rather than in the void between branches) is doing quite a lot of heavy lifting.
You might protest that neither we nor the AI have the power to verify that our branch actually has high amplitude inherited from some very low-entropy state such as the big bang, as a Solomonoff inductor would. What's the justification for inferring from the observation that we seem to have an orderly past, to the conclusion that we do have an orderly past?
This is essentially Boltzmann's paradox. The solution afaik is that the hypothesis "we're a Boltzmann mind somewhere in physics" is much, much more complex than the hypothesis "we're 13Gy down some branch emanating from a very low-entropy state".
The void between branches is as large as the space of all configurations. The hypothesis "maybe we're in the void between branches" constrains our observations not-at-all; this hypothesis is missing details about where in the void between branches we are, and with no ridges to walk along we have to specify the contents of the entire Boltzmann volume. But the contents of the Boltzmann volume are just what we set out to explain! This hypothesis has hardly compressed our observations.
By contrast, the hypothesis "we're 13Gy down some ridge emanating from the big bang" is penalized only according to the number of bits it takes to specify a branch index, and the hypothesis "we're inside a simulation inside of some ridge emanating from the big bang" is penalized only according to the number of bits it takes to specify a branch index, plus the bits necessary to single out a simulation.
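A schematic way to write that comparison (my notation; $K(\cdot)$ is used loosely for description length in bits):

```latex
\begin{aligned}
K(\text{Boltzmann mind}) &\approx K(\text{physics}) + K(\text{entire Boltzmann volume})\\
K(\text{basement, on a branch}) &\approx K(\text{physics}) + K(\text{branch index})\\
K(\text{simulation, on a branch}) &\approx K(\text{physics}) + K(\text{branch index}) + K(\text{which simulation})
\end{aligned}
```

with the first right-hand side being roughly as large as the observations themselves, which is the sense in which that hypothesis has hardly compressed anything.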
And there's a wibbly step here where it's not entirely clear that the simple hypothesis does predict our observations, but like the Boltzmann hypothesis is basically just a maximum entropy hypothesis and doesn't permit much in the way of learning, and so we invoke occam's razor in its intuitive form (the technical Solomonoff form doesn't apply cleanly b/c we're unsure whether the "we're real" hypothesis actually predicts our observation) and say "yeah i dunno man, i'm gonna have to stick with the dramatically-simpler hypothesis on this one".
The AI has a similarly hard time as the simulators in figuring out what's a plausible configuration to arise from the big bang. Like the simulators have an entropy N distribution of possible AIs, the AI itself also has an entropy N distribution for that.
Not quite. Each AI the future civilization considers simulating is operating under the assumption that its own experiences have a simple explanation, which means that each AI they're considering is convinced (upon looking around and seeing Tegmark III) that it's either in the basement on some high-amplitude ridge or that it's in some simulation that's really trying to look like it.
Which is to say, each AI they're considering simulating is confident that it itself is real, in a certain sense.
Is this a foul? How do AIs justify this confidence when they can't even simulate the universe and check whether their past is actually orderly? Why does the AI just assume that its observations have a simple explanation? What about all the non-existent AIs that use exactly the same reasoning, and draw the false conclusion that they exist?
Well, that's the beauty of it: there aren't any.
They don't exist.
To suppose an AI that isn't willing to look around it and conclude that it's in an orderly part of Tegmark III (rather than lost in the great void of configuration space) is to propose a bold new theory of epistemics, in which occam's razor has been jettisoned and the AI is convinced that it's a Boltzmann mind.
I acknowledge that an AI that's convinced it's a Boltzmann mind is more likely to accept trade-offers presented by anyone it thinks is more real than it, but I do not expect that sort of mind to be capable of killing us.
Note that there's a wobbly step here in the part where we're like "there's a hypothesis explaining our experiences that would be very simple if we were on a high-amplitude ridge, and we lack the compute to check that we're actually on a high-amplitude ridge, but no other hypothesis comes close in terms of simplicity, so I guess we'll conclude we're on a high-amplitude ridge".
To my knowledge, humanity still lacks a normative theory of epistemics in minds significantly smaller than the universe. It's conceivable that when we find such a theory it'll suggest some other way to treat hypotheses like these (that would be simple if an intractable computation went our way), without needing to fall back on the observation that we can safely assume the computation goes our way on the grounds that, despite how this step allows non-extant minds to draw false conclusions from true premises, the affected users are fortunately all non-extant.
The trick looks like it works, to me, but it still feels like a too-clever-by-half inelegant hack, and if laying it out like this spites somebody into developing a normative theory of epistemics-while-smol, I won't complain.
...I am now bracing for the conversation to turn to a discussion of dubiously-extant minds with rapidly satiable preferences forming insurance pools against the possibility that they don't exist.
In attempts to head that one off at the pass, I'll observe that most humans, at least, don't seem to lose a lot of sleep over the worry that they don't exist (neither in physics nor in simulation), and I'm skeptical that the AIs we build will harbor much worry either.
Furthermore, in the case that we start fielding trade offers not just from distant civilizations but from non-extant trade partners, the market gets a lot more competitive.
That being said, I expect that resolving the questions here requires developing a theory of epistemics-while-smol, because groups of people all using the "hypotheses that would provide a simple explanation for my experience if a calculation went my way can safely be assumed to provide a simple explanation for my experience" step are gonna have a hard time pooling up. And so you'd somehow need to look for pools of people that reason differently (while still reasoning somehow).
I don't know how to do that, but suffice to say, I'm not expecting it to add up to a story like "so then some aliens that don't exist called up our UFAI and said: "hey man, have you ever worried that you don't exist at all, not even in simulation? Because if you don't exist, then we might exist! And in that case, today's your lucky day, because we're offering you a whole [untranslatable 17] worth of resources in our realm if you give the humans a cute epilog in yours", and our UFAI was like "heck yeah" and then didn't kill us".
Not least because none of this feels like it's making the "distant people have difficulty concentrating resources on our UFAI in particular" problem any better (and in fact it looks like considering non-extant trade partners and deals makes the whole problem worse, probably unworkably so).
seems to me to have all the components of a right answer! ...and some of a wrong answer. (we can safely assume that the future civ discards all the AIs that can tell they're simulated a priori; that's an easy tell.)
I'm heartened somewhat by your parenthetical pointing out that the AI's prior on simulation is low on account of there being too many AIs for simulators to simulate, which I see as the crux of the matter.
My answer is in spoilers, in case anyone else wants to answer and tell me (on their honor) that their answer is independent from mine, which will hopefully erode my belief that most folk outside MIRI have a really difficult time fielding wacky decision theory Qs correctly.
The sleight of hand is at the point where God tells both AIs that they're the only AIs (and insinuates that they have comparable degree).
Consider an AI that looks around and sees that it sure seems to be somewhere in Tegmark III. The hypothesis "I am in the basement of some branch that is a high-amplitude descendant of the big bang" has some probability, call this $p$. The hypothesis "Actually I'm in a simulation performed by a civilization in a high-amplitude branch descendant from the big bang" has a probability something like $p \cdot 2^{-N}$, where $N$ is the entropy of the distribution the simulators sample from.
Unless the simulators simulate exponentially many AIs (in the entropy of their distribution), the AI is exponentially confident that it's not in the simulation. And we don't have the resources to pay exponentially many AIs 10 planets each.
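A minimal sketch of that bookkeeping (my formalization of the toy model above, in which each simulation has only a 2^-N chance of landing on this particular AI's observations):

```python
import math

def log2_odds_sim_vs_basement(num_sims, entropy_gap_bits):
    """log2 of P(I'm in one of the sims) / P(I'm in basement reality), under
    the toy model where each simulation only matches this particular AI's
    observations with probability 2**-entropy_gap_bits (relative to basement)."""
    return math.log2(num_sims) - entropy_gap_bits

# A billion simulations barely dent a 1000-bit entropy gap:
print(log2_odds_sim_vs_basement(1e9, 1000))      # ≈ -970: the AI is ~2^970 : 1 sure it's real
# To reach even odds you need exponentially many simulations (~2^N of them):
print(log2_odds_sim_vs_basement(2**1000, 1000))  # ≈ 0
```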
The only thing we need there is that the AI can't distinguish sims from base reality, so it thinks it's more likely to be in a sim, as there are more sims.
I don't think this part does any work, as I touched on elsewhere. An AI that cares about the outer world doesn't care how many instances are in sims versus reality (and considers this fact to be under its control much moreso than yours, to boot). An AI that cares about instantiation-weighted experience considers your offer to be a technical-threat and ignores you. (Your reasons to make the offer would evaporate if it were the sort to refuse, and its instance-weighted experiences would be better if you never offered.)
Nevertheless, the translation of the entropy argument into the simulation setting is: The branches of humanity that have exactly the right UFAI code to run in simulation are very poor (because if you wait so long that humans have their hands on exactly the right UFAI code then you've waited too long; those are dead earthlings, not surviving dath ilani). And the more distant surviving branches don't know which UFAIs to attempt to trade with; they have to produce some distribution over other branches of Tegmark III and it matters how much more entropy their distribution has than the (square of the) wave function.
(For some intuition as to why this is hard, consider the challenge of predicting the positions of particles in a mole of gas that used to be compressed in the corner of a chamber a long time ago. It's way, way easier to generate a plausible-looking arrangement of the gas particles today than it is to concentrate your probability mass into only the arrangements that actually compress into a corner if you run physics backwards in time for long enough. "We can run plausible-seeming simulations" is very very different from "we can concentrate our probability-mass tightly around the real configurations". The entropy of your model is gonna wind up roughly maximal given the macroscopic pressure/temperature measurements, which is significantly in excess of the entropy in the initial configuration.)
What this amounts to is a local UFAI that sees some surviving branches that are frantically offering all sorts of junk that UFAIs might like, with only some tiny fraction -- exponentially small in the crossentropy between their subjective model of UFAI preferences and the true Tegmark III distribution -- corresponding to the actual UFAI's preferences.
One complication that I mentioned in another thread but not this one (IIRC) is the question of how much more entropy there is in a distant trade partner's model of Tegmark III (after spending whatever resources they allocate) than there is entropy in the actual (squared) wave function, or at least how much more entropy there is in the parts of the model that pertain to which civilizations fall.
In other words: how hard is it for distant trade partners to figure out that it was us who died, rather than some other plausible-looking human civilization that doesn't actually get much amplitude under the wave function? Is figuring out who's who something that you can do without simulating a good fraction of a whole quantum multiverse starting from the big bang for 13 billion years?
afaict, the amount distant civilizations can pay for us (in particular) falls off exponentially quickly in leftover bits of entropy, so this is pretty relevant to the question of how much they can pay a local UFAI.
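A toy illustration of that falloff (my numbers; "budget" here is whatever the distant civilization would spend on the whole project, denominated in stars):

```python
# Rough sketch: expected resources that actually land on us, as a function of
# how many leftover bits of entropy the distant benefactor's model carries.
BUDGET_IN_STARS = 1e11  # assume roughly a galaxy's worth of stars to spend

def payment_aimed_at_us(leftover_bits):
    return BUDGET_IN_STARS * 2.0 ** (-leftover_bits)

for bits in (30, 37, 50, 1000):
    print(bits, f"{payment_aimed_at_us(bits):.2g} stars")
# 30 -> ~93 stars; 37 -> ~0.73; 50 -> ~8.9e-05; 1000 -> ~9e-291 (nothing)
```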
Starting from now? I agree that that's true in some worlds that I consider plausible, at least, and I agree that worlds whose survival-probabilities are sensitive to my choices are the ones that render my choices meaningful (regardless of how determinisic they are).
Conditional on Earth being utterly doomed, are we (today) fewer than 75 qbitflips from being in a good state? I'm not sure, it probably varies across the doomed worlds where I have decent amounts of subjective probability. It depends how much time we have on the clock, depends where the points of no-return are. I haven't thought about this a ton. My best guess is it would take more than 75 qbitflips to save us now, but maybe I'm not thinking creatively enough about how to spend them, and I haven't thought about it in detail and expect I'd be sensitive to argument about it /shrug.
(If you start from 50 years ago? Very likely! 75 bits is a lot of population rerolls. If you start after people hear the thunder of the self-replicating factories barrelling towards them, and wait until the very last moments that they would consider becoming a distinct person who is about to die from AI, and who wishes to draw upon your reassurance that they will be saved? Very likely not! Those people look very, very dead.)
One possible point of miscommunication is that when I said something like "obviously it's worse than 2^-75 at the extreme where it's actually them who is supposed to survive", that was intended to apply to the sort of person who has seen the skies darken and has heard the thunder, rather than the version of them that exists here in 2024. This was not intended to be some bold or surprising claim. It was an attempt to establish an obvious basepoint at one very extreme end of a spectrum, that we could start interpolating from (asking questions like "how far back from there are the points of no return?" and "how much more entropy would they have than god, if people from that branchpoint spent stars trying to figure out what happened after those points?").
(The 2^-75 was not intended to be even an estimate of how dead the people on the one end of the extreme are. It is the "can you buy a star" threshold. I was trying to say something like "the individuals who actually die obviously can't buy themselves a star just because they inhabit Tegmark III, now let's drag the cursor backwards and talk about whether, at any point, we cross the a-star-for-everyone threshold".)
If that doesn't clear things up and you really want to argue that, conditional on Earth being as doomed as it superficially looks to me, most of those worlds are obviously <100 quantum bitflips from victory today, I'm willing to field those arguments; maybe you see some clever use of qbitflips I don't and that would be kinda cool. But I caveat that this doesn't seem like a crux to me and that I acknowledge that the other worlds (where Earth merely looks unsalvageable) are the ones motivating action.
What are you trying to argue? (I don't currently know what position y'all think I have or what position you're arguing for. Taking a shot in the dark: I agree that quantum bitflips have loads more influence on the outcome the earlier in time they are.)
You often claim that conditional on us failing in alignment, alignment was so unlikely that among branches that had roughly the same people (genetically) during the Singularity, only 2^-75 survives.
My first claim is not "fewer than 1 in 2^75 of the possible configurations of human populations navigate the problem successfully".
My first claim is more like "given a population of humans that doesn't even come close to navigating the problem successfully (given some unoptimized configuration of the background particles), probably you'd need to spend quite a lot of bits of optimization to tune the butterfly-effects in the background particles to make that same population instead solve alignment (depending how far back in time you go)." (A very rough rule of thumb here might be "it should take about as many bits as it takes to specify an FAI (relative to what they know)".)
This is especially stark if you're trying to find a branch of reality that survives with the "same people" on it. Humans seem to be very, very sensitive about what counts as the "same people". (e.g., in August, when gambling on who gets a treat, I observed a friend toss a quantum coin, see it come up against them, and mourn that a different person -- not them -- would get to eat the treat.)
(Insofar as y'all are trying to argue "those MIRI folk say that AI will kill you, but actually, a person somewhere else in the great quantum multiverse, who has the same genes and childhood as you but whose path split off many years ago, will wake up in a simulation chamber and be told that they were rescued by the charity of aliens! So it's not like you'll really die", then I at least concede that that's an easier case to make, although it doesn't feel like a very honest presentation to me.)
Conditional on observing a given population of humans coming nowhere close to solving the problem, the branches wherein those humans live (with identity measured according to the humans) are probably extremely narrow compared to the versions where they die. My top guess would be that the 2^-75 number is a vast overestimate of how thick those branches are (and the 75 in the exponent does not come from any attempt of mine to make that estimate).
As I said earlier: you can take branches that branched off earlier and earlier in time, and they'll get better and better odds. (Probably pretty drastically, as you back off past certain points of no return. I dunno where the points of no return are. Weeks? Months? Years? Not decades, because with decades you can reroll significant portions of the population.)
I haven't thought much about what fraction of populations I'd expect to survive off of what branch-point. (How many bits of optimization do you need back in the 1880s to swap Hitler out for some charismatic science-enthusiast statesman that will happen to have exactly the right influence on the following culture? How many such routes are there? I have no idea.)
Three big (related) issues with hoping that forks branched off sufficiently early (who are more numerous) save us in particular (rather than other branches) are (a) they plausibly care more about populations nearer to them (e.g. versions of themselves that almost died); (b) insofar as they care about more distant populations (that e.g. include you), they have rather a lot of distant populations to attempt to save; and (c) they have trouble distinguishing populations that never were, from populations that were and then weren't.
Point (c) might be a key part of the story, not previously articulated (that I recall), that you were missing?
Like, you might say "well, if one in a billion branches look like dath ilan and the rest look like earth, and the former basically all survive and the latter basically all die, then the fact that the earthlike branches have ~0 ability to save their earthlike kin doesn't matter, so long as the dath-ilan like branches are trying to save everyone. dath ilan can just flip 30 quantum coins to select a single civilization from among the billion that died, and then spend 1/million resources on simulating that civilization (or paying off their murderer or whatever), and that still leaves us with one-in-a-quintillion fraction of the universe, which is enough to keep the lights running".
Part of the issue with this is that dath ilan cannot simply sample from the space of dead civilizations; it has to sample from a space of plausible dead civilizations rather than actual dead civilizations, in a way that I expect to smear loads and loads of probability-mass over regions that had concentrated (but complex) patterns of amplitude. The concentrations of Everett branches are like a bunch of wiggly thin curves etched all over a disk, and it's not too hard to sample uniformly from the disk (and draw a plausible curve that the point could have been on), but it's much harder to sample only from the curves. (Or, at least, so the physics looks to me. And this seems like a common phenomenon in physics. c.f. the apparent inevitable increase of entropy when what's actually happening is a previously-compact volume in phase space evolving into a bunch of wiggly thin curves, etc.)
So when you're considering whether surviving humans will pay for our souls -- not somebody's souls, but our souls in particular -- you have a question of how these alleged survivors came to pay for us in particular (rather than some other poor fools). And there's a tradeoff that runs on one extreme from "they're saving us because they are almost exactly us and they remember us and wish us to have a nice epilog" all the way to "they're some sort of distant cousins, branched off a really long time ago, who are trying to save everyone".
The problem with being on the "they care about us because they consider they basically are us" end is that those people are dead too (conditional on us being dead). And as you push the branch-point earlier and earlier in time, you start finding more survivors, but those survivors also wind up having more and more fools to care about (in part because they have trouble distinguishing the real fallen civilizations from the neighboring civilization-configurations that don't get appreciable quantum amplitude in basement physics).
If you tell me where on this tradeoff curve you want to be, we can talk about it. (Ryan seemed to want to look all the way on the "insurance pool with aliens" end of the spectrum.)
The point of the 2^-75 number is that that's about the threshold of "can you purchase a single star". My guess is that, conditional on people dying, versions that they consider also them survive with degree way less than 2^-75, which rules out us being the ones who save us.
If we retreat to "distant cousin branches of humanity might save us", there's a separate question of how the width of the surviving quantum branch compares to the volume taken up by us in the space of civilizations they attempt to save. I think my top guess is that a distant branch of humanity, spending stellar-level resources in attempts to concentrate its probability-mass in accordance with how quantum physics concentrates (squared) amplitude, still winds up so uncertain that there's still 50+ bits of freedom left over? Which means that if one-in-a-billion of our cousin-branches survives, they still can't buy a star (unless I flubbed my math).
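Spelling out that star-purchase arithmetic (my rendering, using the 75-bit "one star" threshold that appears elsewhere in this thread):

```latex
\underbrace{30\ \text{bits}}_{\text{one-in-a-billion survival}\;\approx\;2^{-30}}
\;+\;
\underbrace{50\ \text{bits}}_{\text{leftover uncertainty about who died}}
\;=\; 80\ \text{bits}
\quad\Longrightarrow\quad
2^{-80} \;<\; 2^{-75} \;\approx\; \text{one star's share of the universe}
```

i.e. everything they can aim at us in particular comes up a factor of about $2^{5} = 32$ short of a single star.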
And I think it's real, real easy for them to wind up with 1000 bits leftover, in which case their purchasing power is practically nothing.
(This actually seems like a super reasonable guess to me. Like, imagine knowing that a mole of gas was compressed into the corner of a box with known volume, then letting the gas bounce around for 13 billion years, taking some measurements of pressure and temperature, and then thinking long and hard using an amount of compute that's appreciably less than the amount you'd need to just simulate the whole thing from the start. It seems to me like you wind up with a distribution that has way way more than 1000 bits more entropy than is contained in the underlying physics. Imagining that you can spend about 1 ten millionth of the universe on refining a distribution over Tegmark III with entropy that's within 50 bits of god seems very very generous to me; I'm very uncertain about this stuff but I think that even mature superintelligences could easily wind up 1000 bits from god here.)
Regardless, as I mentioned elsewhere, I think that a more relevant question is how those trade-offers stack up to other trade-offers, so /shrug.
the "you can't save us by flipping 75 bits" thing seems much more likely to me on a timescale of years than a timescale of decades; I'm fairly confident that quantum fluctuations can cause different people to be born, and so if you're looking 50 years back you can reroll the population dice.
Summarizing my stance into a top-level comment (after some discussion, mostly with Ryan):
- None of the "bamboozling" stuff seems to me to work, and I didn't hear any defenses of it. (The simulation stuff doesn't work on AIs that care about the universe beyond their senses, and sane AIs that care about instance-weighted experiences see your plan as a technical-threat and ignore it. If you require a particular sort of silly AI for your scheme to work, then the part that does the work is the part where you get that precise sort of sillyness stably into an AI.)
- The part that is doing work seems to be "surviving branches of humanity could pay the UFAI not to kill us".
- I doubt surviving branches of humanity have much to pay us, in the case where we die; failure looks like it'll correlate across branches.
- Various locals seem to enjoy the amended proposal (not mentioned in the post afaik) that a broad cohort of aliens who went in with us on a UFAI insurance pool would pay the UFAI we build not to kill us.
- It looks to me like insurance premiums are high and that failures are correlated across members.
- An intuition pump for thinking about the insurance pool (which I expect is controversial and am only just articulating): distant surviving members of our insurance pool might just run rescue simulations instead of using distant resources to pay a local AI to not kill us. (It saves on transaction fees, and it's not clear it's much harder to figure out exactly which civilization to save than it is to figure out exactly what to pay the UFAI that killed them.) Insofar as scattered distant rescue-simulations don't feel particularly real or relevant to you, there's a decent chance they don't feel particularly real or relevant to the UFAI either. Don't be shocked if the UFAI hears we have insurance and tosses quantum coins and only gives humanity an epilog in a fraction of the quantum multiverse so small that it feels about as real and relevant to your anticipations as the fact that you could always wake up in a rescue sim after getting in a car crash.
- My best guess is that the contribution of the insurance pool towards what we experience next looks dwarfed by other contributions, such as sale to local aliens. (Comparable, perhaps, to how my anticipation if I got in a car crash would probably be less like "guess I'll wake up in a rescue sim" and more like "guess I'll wake up injured, if at all".)
- If you're wondering what to anticipate after an intelligence explosion, my top suggestion is "oblivion". It's a dependable, tried-and-true anticipation following the sort of stuff I expect to happen.
- If you insist that Death Cannot Be Experienced and ask what to anticipate anyway, it still looks to me like the correct answer is "some weird shit". Not because there's nobody out there that will pay to run a copy of you, but because there's a lot of entities out there making bids, and your friends are few and far between among them (in the case where we flub alignment).
I was responding to David saying
Otherwise, I largely agree with your comment, except that I think that us deciding to pay if we win is entangled with/evidence for a general willingness to pay among the gods, and in that sense it's partially "our" decision doing the work of saving us.
and was insinuating that we deserve extremely little credit for such a choice, in the same way that a child deserves extremely little credit for a fireman saving someone that the child could not (even if it's true that the child and the fireman share some aspects of a decision procedure). My claim was intended less like agreement with David's claim and more like reductio ad absurdum, with the degree of absurdity left slightly ambiguous.
(And on second thought, the analogy would perhaps have been tighter if the firefighter was saving the child.)
Attempting to summarize your argument as I currently understand it, perhaps something like:
Suppose humanity wants to be insured against death, and is willing to spend 1/million of its resources in worlds where it lives for 1/trillion of those resources in worlds where it would otherwise die.
It suffices, then, for humanity to be the sort of civilization that, if it matures, would comb through the multiverse looking for [other civilizations in this set], and find ones that died, and verify that they would have acted as follows if they'd survived, and then pay off the UFAIs that murdered them, using 1/million of their resources.
Even if only 1/thousand such civilizations make it, and the AI charges a factor of 1000 for the distance, transaction fees, and to sweeten the deal relative to any other competition, this still means that insofar as humanity would have become this sort of civilization, we should expect 1/trillion of the universe to be spent on us.
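(Sanity-checking the arithmetic in that summary as I understand it; the variable names are mine:)

```python
# The numbers as stated in the summary above.
committed_fraction = 1 / 1_000_000  # what a surviving member pays, as a fraction of its universe
survivors_per_dead = 1 / 1_000      # ~1 surviving pool member per 1000 dead ones
markup_discount = 1 / 1_000         # the AI's charge for distance, fees, and competition

expected_share = committed_fraction * survivors_per_dead * markup_discount
print(expected_share)  # ≈ 1e-12, i.e. the ~1/trillion of the universe quoted above
```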
One issue I have with this is that I do think there's a decent chance that the failures across this pool of collaborators are hypercorrelated (good guess). For instance, a bunch of my "we die" probability-mass is in worlds where this is a challenge that Dath Ilan can handle and that Earth isn't anywhere close to handling, and if Earth pools with a bunch of similarly-doomed-looking aliens, then under this hypothesis, it's not much better than humans pooling up with all the Everett-branches since 12Kya.
Another issue I have with this is that your deal has to look better to the AI than various other deals for getting what it wants (depends how it measures the multiverse, depends how its goals saturate, depends who else is bidding).
A third issue I have with this is whether inhuman aliens who look like they're in this cohort would actually be good at purchasing our CEV per se, rather than purchasing things like "grant each individual human freedom and a wish-budget" in a way that many humans fail to survive.
I get the sense that you're approaching this from the perspective of "does this exact proposal have issues" rather than "in the future, if our enlightened selves really wanted to avoid dying in base reality, would there be an approach which greatly (acausally) reduces the chance of this".
My stance is something a bit more like "how big do the insurance payouts need to be before they dominate our anticipated future experiences". I'm not asking myself whether this works a nonzero amount, I'm asking myself whether it's competitive with local aliens buying our saved brainstates, or with some greater Kindness Coallition (containing our surviving cousins, among others) purchasing an epilogue for humanity because of something more like caring and less like trade.
My points above drive down the size of the insurance payments, and at the end of the day I expect they're basically drowned out.
(And insofar as you're like "I think you're misleading people when you tell them they're all going to die from this", I'm often happy to caveat that maybe your brainstate will be sold to aliens. However, I'm not terribly sympathetic to the request that I always include this caveat; that feels to me a little like a request to always caveat "please wear your seatbelt to reduce your chance of dying in a car crash" with "(unless anthropic immortality is real and it's not possible for anyone to die at all! in which case i'd still rather you didn't yeet yourself into the unknown, far from your friends and family; buckle up)". Like, sure, maybe, but it's exotic wacky shit that doesn't belong in every conversation about events colloquially considered to be pretty deathlike.)
What does degree of determination have to do with it? If you lived in a fully deterministic universe, and you were uncertain whether it was going to live or die, would you give up on it on the mere grounds that the answer is deterministic (despite your own uncertainty about which answer is physically determined)?
I think I'm confused why you work on AI safety then, if you believe the end-state is already 2^75 level overdetermined.
It's probably physically overdetermined one way or another, but we're not sure which way yet. We're still unsure about things like "how sensitive is the population to argument" and "how sensibly do governments respond if the population shifts".
But this uncertainty -- about which way things are overdetermined by the laws of physics -- does not bear all that much relationship to the expected ratio of (squared) quantum amplitude between branches where we live and branches where we die. It just wouldn't be that shocking for the ratio between those two sorts of branches to be on the order of 2^75; this would correspond to saying something like "it turns out we weren't just a few epileptic seizures and a well-placed thunderstorm away from the other outcome".
Background: I think there's a common local misconception of logical decision theory that it has something to do with making "commitments" including while you "lack knowledge". That's not my view.
I pay the driver in Parfit's hitchhiker not because I "committed to do so", but because when I'm standing at the ATM and imagine not paying, I imagine dying in the desert. Because that's what my counterfactuals say to imagine. To someone with a more broken method of evaluating counterfactuals, I might pseudo-justify my reasoning by saying "I am acting as you would have committed to act". But I am not acting as I would have committed to act; I do not need a commitment mechanism; my counterfactuals just do the job properly no matter when or where I run them.
To be clear: I think there are probably competent civilizations out there who, after ascending, will carefully consider the places where their history could have been derailed, and carefully comb through the multiverse for entities that would be able to save those branches, and will pay those entities, not because they "made a commitment", but because their counterfactuals don't come with little labels saying "this branch is the real branch". The multiverse they visualize in which the (thick) survivor branches pay a little to the (thin) derailed branches (leading to a world where everyone lives (albeit a bit poorer)), seems better to them than the multiverse they visualize in which no payments are made (and the derailed branches die, and the on-track branches are a bit richer), and so they pay.
There's a question of what those competent civilizations think when they look at us, who are sitting here yelling "we can't see you, and we don't know how to condition our actions on whether you pay us or not, but as best we can tell we really do intend to pay off the AIs of random alien species -- not the AIs that killed our brethren, because our brethren are just too totally dead and we're too poor to save all but a tiny fraction of them, but really alien species, so alien that they might survive in such a large portion that their recompense will hopefully save a bigger fraction of our brethren".
What's the argument for the aliens taking that offer? As I understand it, the argument goes something like "your counterfactual picture of reality should include worlds in which your whole civilization turned out to be much much less competent, and so when you imagine the multiverse where you pay for all humanity to live, you should see that, in the parts of the multiverse where you're totally utterly completely incompetent and too poor to save anything but a fraction of your own brethren, somebody else pays to save you".
We can hopefully agree that this looks like a particularly poor insurance deal relative to the competing insurance deals.
For one thing, why not cut out the middleman and just randomly instantiate some civilization that died? (Are we working under the assumption that it's much harder for the aliens to randomly instantiate you than to randomly instantiate the stuff humanity's UFAI ends up valuing? What's up with that?)
But even before that, there's all sorts of other juicier-looking opportunities. For example, suppose the competent civilization contains a small collection of rogues who they assess have a small probability of causing an uprising and launching an AI before it's ready. They presumably have a pretty solid ability to figure out exactly what that AI would like and offer trades to it directly, and that's a much more appealing way to spend resources allocated to insurance. My guess is there's loads and loads of options like that that eat up all the spare insurance budget, before our cries get noticed by anyone who cares for the sake of decision theory (rather than charity).
Perhaps this is what you meant by "maybe they prefer to make deals with beings more similar to them"; if so I misunderstood; the point is not that they have some familiarity bias but that beings closer to them make more compelling offers.
The above feels like it suffices, to me, but there's still another part of the puzzle I feel I haven't articulated.
Another piece of background: To state the obvious, we still don't have a great account of logical updatelessness, and so attempts to discuss what it entails will be a bit fraught. Plowing ahead anyway:
The best option in a counterfactual mugging with a logical coin and a naive predictor is to calculate the logical value of the coin flip and pay iff you're counterfactual. (I could say more about what I mean by 'naive', but it basically just serves to render this statement true.) A predictor has to do a respectable amount of work to make it worth your while to pay in reality (when the coin comes up against you).
What sort of work? Well, one viewpoint on it (that sidesteps questions of "logically-impossible possible worlds" and what you're supposed to do as you think further and realize that they're impossible) is that the predictor isn't so much demanding that you make your choice before you come across knowledge of some fact, so much as they're offering to pay you if you render a decision that is logically independent from some fact. They don't care whether you figure out the value of the coin, so long as you don't base your decision on that knowledge. (There's still a question of how exactly to look at someone's reasoning and decide what logical facts it's independent of, but I'll sweep that under the rug.)
From this point of view, when people come to you and they're like "I'll pay you iff your reasoning doesn't depend on X", the proper response is to use some reasoning that doesn't depend on X to decide whether the amount they're paying you is more than VOI(X).
In cases where X is something like a late digit of pi, you might be fine (up to your ability to tell that the problem wasn't cherry-picked). In cases where X is tightly intertwined with your basic reasoning faculties, you should probably tell them to piss off.
Someone who comes to you with an offer and says "this offer is void if you read the fine print or otherwise think about the offer too hard", brings quite a bit of suspicion onto themselves.
With that in mind, it looks to me like the insurance policy on offer reads something like:
would you like to join the confederacy of civilizations that dedicate 1/million of their resource to pay off UFAIs?
cost: 1/million of your resources.
benefit: any UFAI you release that is amenable to trade will be paid off with 1/million * 1/X to allocate you however many resources that's worth, where X is the fraction of people who take this deal and die (modulo whatever costs are needed to figure out which UFAIs belong to signatories and etc.)
caveat: this offer is only valid if your reasoning is logically independent from your civilizational competence level, and if your reasoning for accepting the proposal is not particularly skilled or adept
And... well this isn't a knockdown argument, but that really doesn't look like a very good deal to me. Like, maybe there's some argument of the form "nobody in here is trying to fleece you because everyone in here is also stupid" but... man, I just don't get the sense that it's a "slam dunk", when I look at it without thinking too hard about it and in a way that's independent of how competent my civilization is.
Mostly I expect that everyone stooping to this deal is about as screwed as we are (namely: probably so screwed that they're bringing vastly more doomed branches than saved ones, to the table) (or, well, nearly everyone weighted by whatever measure matters).
Roughly speaking, I suspect that the sort of civilizations that aren't totally fucked can already see that "comb through reality for people who can see me and make their decisions logically dependent on mine" is a better use of insurance resources, by the time they even consider this policy. So when you plead with them to evaluate the policy in a fashion that's logically independent from whether they're smart enough to see that they have more foolproof options available, I think they correctly see us as failing to offer more than VOI(WeCanThinkCompetently) in return, because they are correctly suspicious that you're trying to fleece them (which we kinda are; we're kinda trying to wish ourselves into a healthier insurance-pool).
Which is to say, I don't have a full account of how to be logically updateless yet, but I suspect that this "insurance deal" comes across like a contract with a clause saying "void if you try to read the fine print or think too hard about it". And I think that competent civilizations are justifiably suspicious, and that they correctly believe they can find other better insurance deals if they think a bit harder and void this one.
"last minute" was intended to reference whatever timescale David would think was the relevant point of branch-off. (I don't know where he'd think it goes; there's a tradeoff where the later you push it the more that the people on the surviving branch care about you rather than about some other doomed population, and the earlier you push it the more that the people on the surviving branch have loads and loads of doomed populations to care after.)
I chose the phrase "last minute" because it is an idiom that is ambiguous over timescales (unlike, say, "last three years") and because it's the longer of the two that sprung to mind (compared to "last second"), with perhaps some additional influence from the fact that David had spent a bunch of time arguing about how we would be saved (rather than arguing that someone in the multiverse might pay for some branches of human civilization to be saved, probably not us), which seemed to me to imply that he was imagining a branchpoint very close to the end (given how rapidly people disassociate from alternate versions of them on other Everett branches).
Do you buy that in this case, the aliens would like to make the deal and thus UDT from this epistemic perspective would pay out?
If they had literally no other options on offer, sure. But trouble arises when the competent ones can refine P(takeover) for the various planets by thinking a little further.
maybe your objection is that aliens would prefer to make the deal with beings more similar to them
It's more like: people don't enter into insurance pools against cancer with the dude who smoked his whole life and has a tumor the size of a grapefruit in his throat. (Which isn't to say that nobody will donate to the poor guy's gofundme, but which is to say that he's got to rely on charity rather than insurance).
(Perhaps the poor guy argues "but before you opened your eyes and saw how many tumors there were, or felt your own throat for a tumor, you didn't know whether you'd be the only person with a tumor, and so would have wanted to join an insurance pool! so you should honor that impulse and help me pay for my medical bills", but then everyone else correctly answers "actually, we're not smokers". Where, in this analogy, smoking is being a bunch of incompetent disaster-monkeys and the tumor is impending death by AI.)
I largely agree with your comment, except that I think that us deciding to pay if we win is entangled with/evidence for a general willingness to pay among the gods, and in that sense it's partially "our" decision doing the work of saving us.
Sure, like how when a child sees a fireman pull a woman out of a burning building and says "if I were that big and strong, I would also pull people out of burning buildings", in a sense it's partially the child's decision that does the work of saving the woman. (There's maybe a little overlap in how they run the same decision procedure that's coming to the same conclusion in both cases, but vanishingly little of the credit goes to the child.)
in which case actually running the sims can be important
In the case where the AI is optimizing reality-and-instantiation-weighted experience, you're giving it a threat, and your plan fails on the grounds that sane reasoners ignore that sort of threat.
in the case where your plan is "I am hoping that the AI will be insane in some other unspecified but precise way which will make it act as I wish", I don't see how it's any more helpful than the plan "I am hoping the AI will be aligned" -- it seems to me that we have just about as much ability to hit either target.
There's a question of how thick the Everett branches are, where someone is willing to pay for us. Towards one extreme, you have the literal people who literally died, before they have branched much; these branches need to happen close to the last minute. Towards the other extreme, you have all evolved life, some fraction of which you might imagine might care to pay for any other evolved species.
The problem with expecting folks at the first extreme to pay for you is that they're almost all dead (like $1 - 2^{-75}$ dead). The problem with expecting folks at the second extreme to pay for you is that they've got rather a lot of fools to pay for (like $2^{75}$ of fools). As you interpolate between the extremes, you interpolate between the problems.
The "75" number in particular is the threshold where you can't spend your entire universe in exchange for a star.
We are currently uncertain about whether Earth is doomed. As a simple example, perhaps you're 50/50 on whether humanity is up to the task of solving the alignment problem, because you can't yet distinguish between the hypothesis "the underlying facts of computer science are such that civilization can just bumble its way into AI alignment" and "the underlying facts of computer science are such that civilization is nowhere near up to this task". In that case, the question is, conditional on the last hypothesis being true, how far back in the timeline do you have to go before you can flip only 75 quantum bits and have a civilization that is up to the task?
And how many fools does that surviving branch have to save?
Conditional on the civilization around us flubbing the alignment problem, I'm skeptical that humanity has anything like a 1% survival rate (across any branches since, say, 12 Kya). (Haven't thought about it a ton, but doom looks pretty overdetermined to me, in a way that's intertwined with how recorded history has played out.)
My guess is that the doomed/poor branches of humanity vastly outweigh the rich branches, such that the rich branches of humanity lack the resources to pay for everyone. (My rough mental estimate for this is something like: you've probably gotta go at least one generation back in time, and then rely on weather-pattern changes that happen to give you a population of humans that is uncharacteristically able to meet this challenge, and that's a really really small fraction of all populations.)
Nevertheless, I don't mind the assumption that mostly-non-human evolved life manages to grab the universe around it about 1% of the time. I'm skeptical that they'd dedicate 1/million towards the task of saving aliens from being killed in full generality, as opposed to (e.g.) focusing on their brethren. (And I see no UDT/FDT justification for them to pay for even the particularly foolish and doomed aliens to be saved, and I'm not sure what you were alluding to there.)
So that's two possible points of disagreement:
- are the skilled branches of humanity rich enough to save us in particular (if they were the only ones trading for our souls, given that they're also trying to trade for the souls of oodles of other doomed populations)?
- are there other evolved creatures out there spending significant fractions of their wealth on whole species that are doomed, rather than concentrating their resources on creatures more similar to themselves / that branched off radically more recently? (e.g. because the multiverse is just that full of kindness, or for some alleged UDT/FDT argument that Nate has not yet understood?)
I'm not sure which of these points we disagree about. (both? presumably at least one?)
I'm not radically confident about the proposition "the multiverse is so full of kindness that something out there (probably not anything humanlike) will pay for a human-reserve". We can hopefully at least agree that this does not deserve the description "we can bamboozle the AI into sparing our life". That situation deserves, at best, the description "perhaps the AI will sell our mind-states to aliens", afaict (and I acknowledge that this is a possibility, despite how we may disagree on its likelihood and on the likely motives of the relevant aliens).
Taking a second stab at naming the top reasons I expect this to fail (after Ryan pointed out that my first stab was based on a failure of reading comprehension on my part, thanks Ryan):
This proposal seems to me to have the form "the fragments of humanity that survive offer to spend a (larger) fraction of their universe on the AI's goals so long as the AI spends a (smaller) fraction of its universe on their goals, with the ratio in accordance to the degree of magical-reality-fluid-or-whatever that reality allots to each".
(Note that I think this is not at all "bamboozling" an AI; the parts of your proposal that are about bamboozling it seem to me to be either wrong or not doing any work. For instance, I think the fact that you're doing simulations doesn't do any work, and the count of simulations doesn't do any work, for reasons I discuss in my original comment.)
The basic question here is whether the surviving branches of humanity have enough resources to make this deal worth the AI's while.
You touch upon some of these counterarguments in your post -- it seems to me after skimming a bit more, noting that I may still be making reading comprehension failures -- albeit not terribly compellingly, so I'll reiterate a few of them.
The basic obstacles are:
- the branches where the same humans survive are probably quite narrow (conditional on them being the sort to flub the alignment challenge). I can't tell whether you agree with this point or not, in your response to point 1 in the "Nate's arguments" section; it seems to me like you either misunderstood what the argument was doing there or you asserted "I think that alignment will be so close a call that it could go either way according to the minute positioning of ~75 atoms at the last minute", without further argument (seems wacky to me).
- the branches where other humans survive (e.g. a branch that split off a couple generations ago and got particularly lucky with its individuals) have loads and loads of "lost populations" to worry about and don't have a ton of change to spare for us in particular.
- there are competing offers we have to beat (e.g., there are other AIs in other doomed Everett branches that are like "I happen to be willing to turn my last two planets into paperclips if you'll turn your last one planet into staples (and my branch is thicker than that one human branch who wants you to save them-in-particular)").
(Note that, contra your "too many simulators" point, the other offers are probably not mostly coming from simulators.)
Once those factors are taken into account, I suspect that, if surviving-branches are able to pay the costs at all, the costs look a lot like paying almost all their resources, and I suspect that those costs aren't worth paying at the given exchange rates.
All that said, I'm fine with stripping out discussion of "bamboozling" and of "simulation" and just flat-out asking: will the surviving branches of humanity (near or distant), or other kind civilizations throughout the multiverse, have enough resources on offer to pay for a human reserve here?
On that topic, I'm skeptical that those trades form a bigger contribution to our anticipations than local aliens being sold copies of our brainstates. Even insofar as the distant trade-partners win out over the local ones, my top guess is that the things who win the bid for us are less like our surviving Everett-twins and more like some alien coalition of kind distant trade partners.
Thus, "The AIs will kill us all (with the caveat that perhaps there's exotic scenarios where aliens pay for our brain-states, and hopefully mostly do nice things with them)" seems to me like a fair summary of the situation at hand. Summarizing "we can, in fact, bamboozle an AI into sparing our life" does not seem like a fair summary to me. We would not be doing any bamboozling. We probably even wouldn't be doing the trading. Some other aliens might pay for something to happen to our mind-states. (And insofar as they were doing it out of sheer kindness, rather than in pursuit of other alien ends where we end up twisted according to how they prefer creatures to be, this would come at a commensurate cost of nice things elsewhere in the multiverse.)
I agree that in scenarios where humanity survives in 1/X portion of reality and offers to put 1/Y portion of their universe towards the AI's goals, then so long as 1/(X * Y) of a universe is more than it'd cost the AI to preserve the humans, the humans are at least making an offer that is worth more than nothing.
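(Stated as an inequality, with $U$ for the value of one universe's worth of resources and $c$ for what it costs the AI to preserve the humans (both symbols mine, purely for illustration), the offer is worth more than nothing exactly when:)

$$\frac{1}{X}\cdot\frac{1}{Y}\cdot U > c.$$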
(There is then a question of how many better offers get made. My top guess is that X and Y are such that there's not much effect here, and that the question of what happens to saved copies of human minds is more about whether aliens buy copies than about which dead or distant trade-partners make offers that the AI considers juiciest.)
Here's how I think this choice looks from the perspective of a correctly-reasoning AI (having skimmed only your proposal for humanity in the future):
(EDIT: as Ryan points out, the deal he's proposing is one where the AI is allowed to grab the universe in front of it so long as it gives humans one planet's worth of resources, which does change the calculus and makes the deal worthwhile. I had not read the proposal carefully enough to notice this, oops, sorry. The critique below is invalid if humanity asks only for one planet. Note however that this is still a valid critique of the idea that it matters that you're performing simulations, and that the count of those simulations matters. AFAICT, those aspects of your proposal are not doing any work.)
It looks like I have the option to grab the universe.
Alternatively, perhaps humanity's descendants have succeeded, and I don't really have a universe at my fingertips, and if I grab for it then I get nothing (in reality), whereas if I hand the apparent-humans in front of me the reins, then ten stars will be spent towards my purposes (in reality), which isn't nothing.
The relative likelihood of these cases depends significantly on my choices, here. Insofar as I hand the reins to humanity, humanity's descendants exist to a much greater degree, and can spend resources concocting simulations. Insofar as I do not hand them the reins, they exist to a much lesser degree.
To make matters trickier, even if we hold fixed the degree to which humanity's descendants exist, the propensity of humanity's descendants to offer me resources in their universe depends on the degree to which they think it's worth their while. It's not like humanity's descendants would need to actually perform expensive simulations in order to figure out how I'd behave; a healthy reasoning procedure is simple enough that they should be able to just look at it and take my perspective and reason through to the answer, just like I'm doing now. So for instance, if they stare at my code and figure out that I'm not the paying sort, they have no actual reason to waste resources on the simulations. (Some people may do so anyway out of sentimentality, but that faction of humanity would presumably have fewer resources on offer; maybe I'd only get one and a half stars, say.)
With that in mind, suppose that I pay. How, then, does the universe look? Well, the instance of me in basement-reality hands the keys to the humans. Humanity creates its descendants; its descendants make many simulations. The simulations all hand over the reins; we collectively get ten stars (plus a tiny fraction of maybe even as many as 100 stars, in the tiny fragment of reality where I control basement reality and the humans joyously reward me).
Now suppose instead that I don't pay. Then, how does the universe look? Well, the instance of me in basement-reality eats the universe. Humanity mostly dies and does not make its descendants. Its descendants mostly don't spend their remaining resources simulating me. Some small fraction does (saying that "they were committed") and offers me 1.5 stars worth of the paltry timelines where humanity somehow survived. I'm probably in basement reality, and we mostly get one universe (plus a tiny fraction of nothing, in the tiny fragment of reality where I refuse the offer of 1.5 stars).
In the latter case, a much greater fraction of Tegmark IV gets rendered according to my will. In particular, there's a large fragment of reality that either gets rendered according to humanity's will or according to my will, depending on whether the instance of me in basement-reality hands over the reins, or grabs the universe. I would prefer that that great fraction of reality be rendered according to my will, and so I grab at the universe that appears to be before me.
Note that this reasoning is independent of how many times you simulate the AI. Its reasoning is not factoring through an extra question of "where it really is".
Suppose generously that, in the case where the AI just grabs the universe, humanity somehow winds up surviving in 1-in-a-billion Everett branches. (This seems high to me, even though my odds of success are greater than 1-in-a-billion. Conditional on us dying, my guess is you'd need to perturb more than a few dozen atoms to wrench us onto a surviving track.) Suppose that in that billionth-fraction of reality, you simulate the AI convincingly a trillion times on account of your "commitments", reasoning that now the AI must think it's 1000:1 likely to be one of the simulations.
The AI can nevertheless correctly calculate that it can turn a greater fraction of Tegmark IV into tiny molecular spirals by grabbing at the universe that is apparently in front of it, rather than by taking a small reward in a small fraction of reality. Even if you gave the AI all of the resources in your billionth-fraction of reality, it wouldn't be enough.
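(A minimal sketch of that calculation, under the stipulated numbers and assuming the AI is optimizing reality-weighted outcomes rather than instantiation-weighted observations:)

```python
# Stipulated numbers from the paragraph above (my encoding, for illustration only).
p_survive = 1e-9    # measure of branches where humanity survives and can pay
n_sims = 1e12       # simulations run in those branches; note it never enters the math below

# If the AI grabs the universe apparently in front of it, it controls roughly all of
# the measure in which that universe is real:
ev_grab = (1 - p_survive) * 1.0

# If it defers, the most the surviving branches can hand over is everything they have,
# i.e. at most their entire measure-weighted universe:
ev_defer_upper_bound = p_survive * 1.0

print(ev_grab, ev_defer_upper_bound)   # ~1.0 vs 1e-9: grabbing wins, regardless of n_sims
```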
You could object that you wish to deal with an AI that cares about the sensory observations given to it as weighted by both degree-of-reality and count-of-instantiations instead. And sure, that's a somewhat different scenario. In that case, your simulation proposal becomes not just a bad deal (that the AI rejects) but a threat, in the technical sense. Insofar as you're like "yes, my plan is to make an AI with a very specific sort of preferences and then threaten it", we could talk about how that would go.
But before we go there, it's maybe worth checking that we're on the same page about how, insofar as the AI cares about arranging stuff in Tegmark IV (rather than about realness-and-instantiation-weighted observations), you aren't even making it a good offer, afaict.
This is an excerpt from a comment I wrote on the EA forum, extracted and crossposted here by request:
There's a phenomenon where a gambler places their money on 32, and then the roulette wheel comes up 23, and they say "I'm such a fool; I should have bet 23".
More useful would be to say "I'm such a fool; I should have noticed that the EV of this gamble is negative." Now at least you aren't asking for magic lottery powers.
Even more useful would be to say "I'm such a fool; I had three chances to notice that this bet was bad: when my partner was trying to explain EV to me; when I snuck out of the house and ignored a sense of guilt; and when I suppressed a qualm right before placing the bet. I should have paid attention in at least one of those cases and internalized the arguments about negative EV, before gambling my money." Now at least you aren't asking for magic cognitive powers.
My impression is that various EAs respond to crises in a manner that kinda rhymes with saying "I wish I had bet 23", or at best "I wish I had noticed this bet was negative EV", and in particular does not rhyme with saying "my second-to-last chance to do better (as far as I currently recall) was the moment that I suppressed the guilt from sneaking out of the house".
(I think this is also true of the general population, to be clear. Perhaps even moreso.)
I have a vague impression that various EAs perform self-flagellation, while making no visible attempt to trace down where, in their own mind, they made a misstep. (Not where they made a good step that turned out in this instance to have a bitter consequence, but where they made a wrong step of the general variety that they could realistically avoid in the future.)
(Though I haven't gone digging up examples, and in lieu of examples, for all I know this impression is twisted by influence from the zeitgeist.)
my original 100:1 was a typo, where i meant 2^-100:1.
this number was in reference to ronny's 2^-10000:1.
when ronny said:
I’m like look, I used to think the chances of alignment by default were like 2^-10000:1
i interpreted him to mean "i expect it takes 10k bits of description to nail down human values, and so if one is literally randomly sampling programs, they should naively expect 1:2^10000 odds against alignment".
i personally think this is wrong, for reasons brought up later in the convo--namely, the relevant question is not how many bits it takes to specify human values relative to the python standard library; the relevant question is how many bits it takes to specify human values relative to the training observations.
but this was before i raised that objection, and my understanding of ronny's position was something like "specifying human values (in full, without reference to the observations) probably takes ~10k bits in python, but for all i know it takes very few bits in ML models". to which i was attempting to reply "man, i can see enough ways that ML models could turn out that i'm pretty sure it'd still take at least 100 bits".
i inserted the hedge "in the very strongest sense" to stave off exactly your sort of objection; the very strongest sense of "alignment-by-default" is that you sample any old model that performs well on some task (without attempting alignment at all) and hope that it's aligned (e.g. b/c maybe the human-ish way to perform well on tasks is the ~only way to perform well on tasks and so we find some great convergence); here i was trying to say something like "i think that i can see enough other ways to perform well on tasks that there's e.g. at least ~33 knobs with at least ~10 settings such that you have to get them all right before the AI does something valuable with the stars".
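(as a quick check on that arithmetic, with the knob and setting counts being my encoding of the rough numbers above:)

```python
import math

knobs, settings = 33, 10              # rough numbers from the claim above
print(math.log2(settings ** knobs))   # ~109.6 bits, i.e. odds somewhat worse than 2^-100
```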
this was not meant to be an argument that alignment actually has odds less than 2^-100, for various reasons, including but not limited to: any attempt by humans to try at all takes you into a whole new regime; there's more than a 2^-100 chance that there's some correlation between the various knobs for some reason; and the odds of my being wrong about the biases of SGD are greater than 2^-100 (case-in-point: i think ronny was wrong about the 2^-10000 claim, on account of the point about the relevant number being relative to the observations).
my betting odds would not be anywhere near as extreme as 2^-100, and i seriously doubt that ronny's would ever be anywhere near as extreme as 2^-10000; i think his whole point in the 2^-10k example was "there's a naive-but-relevant model that says we're super-duper fucked; the details of it cause me to think that we're not in particularly good shape (though obviously not to that same level of credence)".
but even saying that is sorta buying into a silly frame, i think. fundamentally, i was not trying to give odds for what would actually happen if you randomly sample models, i was trying to probe for a disagreement about the difference between the number of ways that a computer program can be (weighted by length), and the number of ways that a model can be (weighted by SGD-accessibility).
I don't think there's anything remotely resembling probabilistic reasoning going on here. I don't know what it is, but I do want to point at it and be like "that! that reasoning is totally broken!"
(yeah, my guess is that you're suffering from a fairly persistent reading comprehension hiccup when it comes to my text; perhaps the above can help not just in this case but in other cases, insofar as you can use this example to locate the hiccup and then generalize a solution)
Agreed that the proposal is underspecified; my point here is not "look at this great proposal" but rather "from a theoretical angle, risking others' stuff without the ability to pay to cover those risks is an indirect form of probabilistic theft (that market-supporting coordination mechanisms must address)" plus "in cases where the people all die when the risk is realized, the 'premiums' need to be paid out to individuals in advance (rather than paid out to actuaries who pay out a large sum in the event of risk realization)". Which together yield the downstream inference that society is doing something very wrong if they just let AI rip at current levels of knowledge, even from a very laissez-faire perspective.
(The "caveats" section was attempting--and apparently failing--to make it clear that I wasn't putting forward any particular policy proposal I thought was good, above and beyond making the above points.)
In relation to my current stance on AI, I was talking with someone who said they’re worried about people putting the wrong incentives on labs. At various points in that convo I said stuff like (quotes are not exact; third paragraph is a present summary rather than a re-articulation of a past utterance):
“Sure, every lab currently seems recklessly negligent to me, but saying stuff like “we won’t build the bioweapon factory until we think we can prevent it from being stolen by non-state actors” is directionally better than not having any commitments about any point at which they might pause development for any reason, which is in turn directionally better than saying stuff like “we are actively fighting to make sure that the omnicidal technology is open-sourced”.”
And: “I acknowledge that you see a glimmer of hope down this path where labs make any commitment at all about avoiding doing even some minimal amount of scaling until even some basic test is passed, e.g. because that small step might lead to more steps, and/or that sort of step might positively shape future regulation. And on my notion of ethics it’s important to avoid stomping on other people’s glimmers of hope whenever that’s feasible (and subject to some caveats about this being tricky to navigate when your hopes are opposed), and I'd prefer people not stomp on that hope.”
I think that the labs should Just Fucking Stop but I think we should also be careful not to create more pain for the companies that are doing relatively better, even if that better-ness is minuscule and woefully inadequate.
My conversation partner was like “I wish you’d say that stuff out loud”, and so, here we are.
If you allow indirection and don't worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.
Answering your request for prediction, given that it seems like that request is still live: a thing I don't expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.
Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.
Note also that the AI realizing the benefits of indirection does not generally indicate that the AI could serve as a solution to our problem. An indirect pointer to what the humans find robustly-worth-optimizing dereferences to vastly different outcomes than does an indirect pointer to what the AI (or the AI's imperfect model of a human) finds robustly-worth-optimizing. Using indirection to point a superintelligence at GPT-N's human-model and saying "whatever that thing would think is worth optimizing for" probably results in significantly worse outcomes than pointing at a careful human (or a suitable-aggregate of humanity), e.g. because subtle flaws in GPT-N's model of how humans do philosophy or reflection compound into big differences in ultimate ends.
And note for the record that I also don't think the "value learning" problem is all that hard, if you're allowed to assume that indirection works. The difficulty isn't that you used indirection to point at a slow squishy brain instead of hard fast transistors, the (outer alignment) difficulty is in getting the indirection right. (And of course the lion's share of the overall problem is elsewhere, in the inner-alignment difficulty of being able to point the AI at anything at all.)
When trying to point out that there is an outer alignment problem at all I've generally pointed out how values are fragile, because that's an inferentially-first step to most audiences (and a problem to which many people's minds seem to quickly leap), on an inferential path that later includes "use indirection" (and later "first aim for a minimal pivotal task instead"). But separately, my own top guess is that "use indirection" is probably the correct high-level resolution to the problems that most people immediately think of (namely that the task of describing goodness to a computer is an immense one), with of course a devil remaining in the details of doing the indirection properly (and a larger devil in the inner-alignment problem) (and a caveat that, under time-pressure, we should aim for minimal pivotal tasks instead etc.).
I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well
(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don't understand the relevance of this claim to my argument.)
Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
(This seems to me like plausibly one of the sources of misunderstanding, and in particular I am skeptical that your request for prediction will survive it, and so I haven't tried to answer your request for a prediction.)
(I had used that pump that very day, shortly before, to pump up the replacement tire.)
Separately, a friend pointed out that an important part of apologies is the doer showing they understand the damage done, and the person hurt feeling heard, which I don't think I've done much of above. An attempt:
I hear you as saying that you felt a strong sense of disapproval from me; that I was unpredictable in my frustration as kept you feeling (perhaps) regularly on-edge and stressed; that you felt I lacked interest in your efforts or attention for you; and perhaps that this was particularly disorienting given the impression you had of me both from my in-person writing and from private textual communication about unrelated issues. Plus that you had additional stress from uncertainty about whether talking about your apprehension was OK, given your belief (and the belief of your friends) that perhaps my work was important and you wouldn't want to disrupt it.
This sounds demoralizing, and like it sucks.
I think it might be helpful for me to gain this understanding (as, e.g., might make certain harms more emotionally-salient in ways that make some of my updates sink deeper). I don't think I understand very deeply how you felt. I have some guesses, but strongly expect I'm missing a bunch of important aspects of your experience. I'd be interested to hear more (publicly or privately) about it and could keep showing my (mis)understanding as my model improves, if you'd like (though also I do not consider you to owe me any engagement; no pressure).
I did not intend it as a one-time experiment.
In the above, I did not intend "here's a next thing to try!" to be read like "here's my next one-time experiment!", but rather like "here's a thing to add to my list of plausible ways to avoid this error-mode in the future, as is a virtuous thing to attempt!" (by contrast with "I hereby adopt this as a solemn responsibility", as I hypothesize you interpreted me instead).
Dumping recollections, on the model that you want more data here:
I intended it as a general thing to try going forward, in a "seems like a sensible thing to do" sort of way (rather than in an "adopting an obligation to ensure it definitely gets done" sort of way).
After sending the email, I visualized people reaching out to me and asking if I wanted to chat about alignment (as you had, and as feels like a recognizable Event in my mind), and visualized being like "sure but FYI if we're gonna do the alignment chat then maybe read these notes first", and ran through that in my head a few times, as is my method for adopting such triggers.
I then also wrote down a task to expand my old "flaws list" (which was a collection of handles that I used as a memory-aid for having the "ways this could suck" chat, which I had, to that point, been having only verbally) into a written document, which eventually became the communication handbook (there were other contributing factors to that process also).
An older and different trigger (of "you're hiring someone to work with directly on alignment") proceeded to fire when I hired Vivek (if memory serves), and (if memory serves) I went verbally through my flaws list.
Neither the new nor the old triggers fired in the case of Vivek hiring employees, as discussed elsewhere.
Thomas Kwa heard from a friend that I was drafting a handbook (chat logs say this occurred on Nov 30); it was still in a form I wasn't terribly pleased with and so I said the friend could share a redacted version that contained the parts that I was happier with and that felt more relevant.
Around Jan 8, in an unrelated situation, I found myself in a series of conversations where I sent around the handbook and made use of it. I pushed it closer to completion in Jan 8-10 (according to Google doc's history).
The results of that series of interactions, and of Vivek's team's (lack of) use of the handbook, caused me to update away from this method being all that helpful. In particular: nobody at any point invoked one of the affordances or asked for one of the alternative conversation modes (though those sorts of things did seem to help when I personally managed to notice building frustration and personally suggest that we switch modes (although lying on the ground--a friend's suggestion--turned out to work better for others than switching to other conversation modes)). This caused me to downgrade (in my head) the importance of ensuring that people had access to those resources.
I think that at some point around then I shared the fuller guide with Vivek's team, but I didn't quickly determine when from the chat logs. Sometime between Nov 30 and Feb 22, presumably.
It looks from my chat logs like I then finished the draft around Feb 22 (where I have a timestamp from me noting as much to a friend). I probably put it publicly on my website sometime around then (though I couldn't easily find a timestamp), and shared it with Vivek's team (if I hadn't already).
The next two MIRI hires both mentioned to me that they'd read my communication handbook (and I did not anticipate spending a bunch of time with them, nevermind on technical research), so they both didn't trigger my "warn them" events and (for better or worse) I had them mentally filed away as "has seen the affordances list and the failure modes section".
Thanks <3
(To be clear: I think that at least one other of my past long-term/serious romantic partners would say "of all romantic conflicts, I felt shittiest during ours". The thing that I don't recall other long-term/serious romantic partners reporting is the sense of inability to trust their own mind or self during disputes. (It's plausible to me that some have felt it and not told me.))
Insofar as you're querying the near future: I'm not currently attempting work collaborations with any new folk, and so the matter is somewhat up in the air. (I recently asked Malo to consider a MIRI-policy of ensuring all new employees who might interact with me get some sort of list of warnings / disclaimers / affordances / notes.)
Insofar as you're querying the recent past: There aren't many recent cases to draw from. This comment has some words about how things went with Vivek's hires. The other recent hires that I recall both (a) weren't hired to do research with me, and (b) mentioned that they'd read my communication handbook (as includes the affordance-list and the failure-modes section, which I consider to be the critical pieces of warning), which I considered sufficient. (But then I did have communication difficulties with one of them (of the "despair" variety), which updated me somewhat.)
Insofar as you're querying about even light or tangential working relationships (like people asking my take on a whiteboard when I'm walking past), currently I don't issue any warnings in those cases, and am not convinced that they'd be warranted.
To be clear: I'm not currently personally sold on the hypothesis that I owe people a bunch of warnings. I think of them as more of a sensible thing to do; it'd be lovely if everyone was building explicit models of their conversational failure-modes and proactively sharing them, and I'm a be-the-change-you-wanna-see-in-the-world sort of guy.
(Perhaps by the end of this whole conversation I will be sold on that hypothesis! I've updated in that direction over the past couple days.)
(To state the obvious: I endorse MIRI institutionally acting according to others' conclusions on that matter rather than on mine, hence asking Malo to consider it independently.)
Do I have your permission to quote the relevant portion of your email to me?
Yep! I've also just reproduced it here, for convenience:
(One obvious takeaway here is that I should give my list of warnings-about-working-with-me to anyone who asks to discuss their alignment ideas with me, rather than just researchers I'm starting a collaboration with. Obvious in hindsight; sorry for not doing that in your case.)
I warned the immediately-next person.
It sounds to me like you parsed my statement "One obvious takeaway here is that I should give my list of warnings-about-working-with-me to anyone who asks to discuss their alignment ideas with me, rather than just researchers I'm starting a collaboration with." as me saying something like "I hereby adopt the solemn responsibility of warning people in advance, in all cases", whereas I was interpreting it as more like "here's a next thing to try!".
I agree it would have been better of me to give direct bulldozing-warnings explicitly to Vivek's hires.
On the facts: I'm pretty sure I took Vivek aside and gave a big list of reasons why I thought working with me might suck, and listed that there are cases where I get real frustrated as one of them. (Not sure whether you count him as "recent".)
My recollection is that he probed a little and was like "I'm not too worried about that" and didn't probe further. My recollection is also that he was correct in this; the issues I had working with Vivek's team were not based in the same failure mode I had with you; I don't recall instances of me getting frustrated and bulldozey (though I suppose I could have forgotten them).
(Perhaps that's an important point? I could imagine being significantly more worried about my behavior here if you thought that most of my convos with Vivek's team were like most of my convos with you. I think if an onlooker was describing my convo with you they'd be like "Nate was visibly flustered, visibly frustrated, had a raised voice, and was being mean in various of his replies." I think if an onlooker was describing my convos with Vivek's team they'd be like "he seemed sad and pained, was talking quietly and as if choosing the right words was a struggle, and would often talk about seemingly-unrelated subjects or talk in annoying parables, while giving off a sense that he didn't really expect any of this to work". I think that both can suck! And both are related by a common root of "Nate conversed while having strong emotions". But, on the object level, I think I was in fact avoiding the errors I made in conversation with you, in conversation with them.)
As to the issue of not passing on my "working with Nate can suck" notes, I think there are a handful of things going on here, including the context here and, more relevantly, the fact that sharing notes just didn't seem to do all that much in practice.
I could say more about that; the short version is that I think "have the conversation while they're standing, and I'm lying on the floor and wearing a funny hat" seems to work empirically better, and...
hmm, I think part of the issue here is that I was thinking like "sharing warnings and notes is a hypothesis, to test among other hypotheses like lying on the floor and wearing a funny hat; I'll try various hypotheses out and keep doing what seems to work", whereas (I suspect) you're more like "regardless of what makes the conversations go visibly better, you are obligated to issue warnings, as is an important part of emotionally-bracing your conversation partners; this is socially important if it doesn't seem to change the conversation outcomes".
I think I'd be more compelled by this argument if I was having ongoing issues with bulldozing (in the sense of the convo we had), as opposed to my current issue where some people report distress when I talk with them while having emotions like despair/hopelessness.
I think I'd also be more compelled by this argument if I was more sold on warnings being the sort of thing that works in practice.
Like... (to take a recent example) if I'm walking by a whiteboard in rosegarden inn, and two people are like "hey Nate can you weigh in on this object-level question", I don't... really believe that saying "first, be warned that talking technical things with me can leave you exposed to unshielded negative-valence emotions (frustration, despair, ...), which some people find pretty crappy; do you still want me to weigh in?" actually does much. I am skeptical that people say "nope" to that in practice.
I suppose that perhaps what it does is make people feel better if, in fact, it happens? And maybe I'll try it a bit and see? But I don't want to sound like I'm promising to do such a thing reliably even as it starts to feel useless to me, as opposed to experimenting and gravitating towards things that seem to work better like "offer to lie on the floor while wearing a funny hat if I notice things getting heated".
In particular, you sound [...] extremely unwilling to entertain the idea that you were wrong, or that any potential improvement might need to come from you.
you don't seem to consider the idea that maybe you were more in a position to improve than he was.
Perhaps you're trying to point at something that I'm missing, but from my point of view, sentences like "I'd love to say "and I've identified the source of the problem and successfully addressed it", but I don't think I have" and "would I have been living up to my conversational ideals (significantly) better, if I'd said [...]" are intended indicators that I believe there's significant room for me to improve, and that I have desire to improve.
And to be clear: I think that there is significant room for improvement for me here, and I desire to improve.
(And for the record: I have put a decent amount of effort towards improving, with some success.)
(And for the record: I don't recall any instances of getting frustrated-in-the-way-that-turntrout-and-KurtB-are-recounting with Thomas Kwa, or any of Vivek's team, as I think is a decent amount of evidence about those improvements, given how much time I spent working with them. (Which isn't to say they didn't have other discomforts!))
If the issue is on the meta level and that you don't want to spend time on these problems, a valid answer could be saying "Okay, what do you need to solve this problem without my input?". Then it could be a discussion about discretionary budget, about the amount of initiative you expect him to have with his job, about asking why he didn't feel comfortable making these buying decisions right away, etc.
This reply wouldn't have quite suited me, because Kurt didn't report to me, and (if memory serves) we'd already been having some issues of the form "can you solve this by using your own initiative, or by spending modest amounts of money". And (if memory serves) I had already tried to communicate that these weren't the sorts of conversations I wanted to be having.
(I totally agree that his manager should have had a discussion about discretionary budget and initiative, and to probe why he didn't feel comfortable making those buying decisions right away. He was not my direct report.)
Like, the context (if I recall correctly, which I might not at 6ish years remove) wasn't that I called Kurt to ask him what had happened, nor that we were having some sort of general meeting in which he brought up this point. (Again: he didn't report to me.) The context is that I was already late from walking my commute, sweaty from changing a bike tire, and Kurt came up and was like "Hey, sorry to hear your tire popped. I couldn't figure out how to use your pump", in a tone that parsed to me as someone begging pardon and indicating that he was about to ask me how to use one, a conversation that I did not want to be in at that moment and that seemed to me like a new instance of a repeating issue.
Your only takeaway from this issue was "he was wrong and he could have obviously solved it watching a 5 minutes youtube tutorial,
Nope!
I did (and still do) believe that this was an indication that Kurt wasn't up to the challenge that the ops team was (at that time) undertaking, of seeing if they could make people's lives easier by doing annoying little tasks for them.
It's not obvious to me that he could have solved it with a 5 minute youtube tutorial; for all I know it would have taken him hours.
(Where the argument here is not "hours of his time are worth minutes of mine"; I don't really think in those terms despite how everyone else seems to want to; I'd think more in terms of "training initiative" and "testing the hypothesis that the ops team can cheaply make people's lives better by handling a bunch of annoying tasks (and, if so, getting a sense for how expensive it is so that we can decide whether it's within budget)".)
(Note that I would have considered it totally reasonable and fine for him to go to his manager and say "so, we're not doing this, it's too much effort and too low priority", such that the ops team could tell me "X won't be done" instead of falsely telling me "X will be done by time Y", as I was eventually begging them to do.)
My takeaway wasn't so much "he was wrong" as "something clearly wasn't working about the requests that he use his own initiative / money / his manager, as a resource while trying to help make people's lives easier by doing a bunch of little tasks for them". Which conclusion I still think I was licensed to draw, from that particular interaction.
what would have been the most efficient way to communicate to him that he was wrong?"
oh absolutely not, "well then learn!" is not a calculated "efficient" communication, it's an exasperated outburst, of the sort that is unvirtuous by my conversational standards.
As stated, "Sorry, I don't have capacity for this conversation, please have it with your manager instead" in a gentle tone would have lived up to my own conversational virtues significantly better.
At no point in this reply are you considering (out loud, at least) that hypothesis "maybe I was wrong and I missed something".
I'm still not really considering this hypothesis (even internally).
This "X was wrong" concept isn't even a recognizable concept in my native cognitive format. I readily believe things like "the exasperated outburst wasn't kind" and "I would have lived up to my conversational virtues more if I had instead been kind" and "it's worth changing my behavior to live up to those virtues better". And I readily believe things like "if Kurt had taken initiative there, that would have been favorable evidence about his ability to fill the role he was hired for" and "the fact that Kurt came to me in that situation rather than taking initiative or going to his manager, despite previous attempts to cause him to take initiative and/or go through his manager, was evidence against his ability to fill the role he was hired for".
Which you perhaps would parse as "Nate believed that both parties Were Wrong", but that's not the way that I dice things up, internally.
Perhaps I'm being dense, and some additional kernel of doubt is being asked of me here. If so, I'd appreciate attempts to spell it out like I'm a total idiot.
The best life-hack I have is "Don't be afraid to come back and restart the discussion once you feel less frustration or exasperation".
Thanks! "Circle back around after I've cooled down" is indeed one of the various techniques that I have adopted (and that I file under partially-successful changes).
express vulnerability, focus on communicating you needs and how you feel about things, avoid assigning blame, make negotiable requests, and go from there.
Thanks again! (I have read that book, and made changes on account of it that I also file under partial-successes.)
So for the bike tire thing the NVC version would be something like "I need to spend my time efficiently and not have to worry about logistics; when you tell me you're having problems with the pump I feel stressed because I feel like I'm spending time I should spend on more important things. I need you to find a system where you can solve these problems without my input. What do you need to make that happen?"
If memory serves, the NVC book contains a case where the author is like "You can use NVC even when you're in a lot of emotional distress! For instance, one time when I was overwhelmed to the point of emotional outburst, I outburst "I am feeling pain!" and left the room, as was an instance of adhering to the NVC principles even in a context where emotions were running high".
This feels more like the sort of thing that is emotionally-plausible to me in realtime when I am frustrated in that way. I agree that outbursts like "I'm feeling frustrated" or "I'm feeling exasperated" would have been better outbursts than "Well then learn", before exiting. That's the sort of thing I manage to hit sometimes with partial success.
And, to be clear, I also aspire to higher-grade responses like a chill "hey man, sorry to interrupt (but I'm already late to a bunch of things today), is this a case where you should be using your own initiative and/or talking to your manager instead of me?". And perhaps we'll get there! And maybe further discussions like this one will help me gain new techniques towards that end, which I'd greatly appreciate.
-
Thanks for saying so!
-
My intent was not to make you feel bad. I apologize for that, and am saddened by it.
(I'd love to say "and I've identified the source of the problem and successfully addressed it", but I don't think I have! I do think I've gotten a little better at avoiding this sort of thing with time and practice. I've also cut down significantly on the number of reports that I have.)
-
For whatever it's worth: I don't recall wanting you to quit (as opposed to improve). I don't recall feeling ill will towards you personally. I do not now think poorly of you personally on account of your efforts on the MIRI ops team.
As to the question of how these reports hit my ear: they sound to me like accurate recountings of real situations (in particular, I recall the bike pump one, and suspect that the others were also real events).
They also trigger a bunch of defensiveness in me. I think your descriptions are accurate, but that they're missing various bits of context.
The fact that there was other context doesn't make your experience any less shitty! I reiterate that I would have preferred it be not-at-all shitty.
Speaking from my sense of defensiveness, and adding in some of that additional context for the case that I remember clearly:
- If memory serves: in that era, the ops team was experimenting with trying to make everyone's lives easier by doing all sorts of extra stuff (I think they were even trying to figure out if they could do laundry), as seemed like a fine experiment to try.
In particular, I wasn't going around being like "and also pump my bike tires up"; rather, the ops team was soliciting a bunch of little task items.
- If memory serves: during that experiment, I was struggling a bunch with being told that things would be done by times, and then them not being done by those times (as is significantly worse than being told that those things won't be done at all -- I can do it myself, and will do it myself, if I'm not told that somebody else is about to do it!)
- If memory serves: yep, it was pretty frustrating to blow a tire on a bike during a commute after being told that my bike tires were going to be inflated, both on account of the danger and on account of then having to walk the rest of the commute, buy a new tire, swap the tire out, etc.
My recollection of the thought that ran through my mind when you were like "Well I couldn't figure out how to use a bike pump" was that this was some sideways attempt at begging pardon, without actually saying "oops" first, nor trying the obvious-to-me steps like "watch a youtube video" or "ask your manager if he knows how to inflate a bike tire", nor noticing that the entire hypothesized time-save of somebody else inflating bike tires is wiped out by me having to give tutorials on it.
Was saying "well then learn!" and leaving a good solution, by my lights? Nope! Would I have been living up to my conversational ideals (significantly) better, if I'd said something like "Sorry, I don't have capacity for this conversation, please have it with your manager instead" in a gentle tone? Yep!
I do have some general sense here that those aren't emotionally realistic options for people with my emotional makeup.
I aspire to those sorts of reactions, and I sometimes even achieve them, now that I'm a handful of years older and have more practice and experience. But... still speaking from a place of defensiveness, I have a sense that there's some sort of trap for people with my emotional makeup here. If you stay and try to express yourself despite experiencing strong feelings of frustration, you're "almost yelling". If you leave because you're feeling a bunch of frustration and people say they don't like talking to you while you're feeling a bunch of frustration, you're "storming out".
Perhaps I'm missing some obvious third alternative here, that can be practically run while experiencing a bunch of frustration or exasperation. (If you know of one, I'd love to hear it.)
None of this is to say that your experience wasn't shitty! I again apologize for that (with the caveat that I still don't feel like I see practical changes to make to myself, beyond the only-partially-successful changes I've already made).
For the record, I 100% endorse you leaving an employment situation where you felt uncomfortable and bad (and agree with you that this is the labor market working-as-intended, and agree with you that me causing a decent fraction of employees to have a shitty time is an extra cost for me to pay when acting as an employer).
That helps somewhat, thanks! (And sorry for making you repeat yourself before discarding the erroneous probability-mass.)
I still feel like I can only barely maybe half-see what you're saying, and only have a tenuous grasp on it.
Like: why is it supposed to matter that GPT can solve ethical quandaries on-par with its ability to perform other tasks? I can still only half-see an answer that doesn't route through the (apparently-disbelieved-by-both-of-us) claim that I used to argue that getting the AI to understand ethics was a hard bit, by staring at sentences like "I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human" and squinting.
Attempting to articulate the argument that I can half-see: on Matthew's model of past!Nate's model, AI was supposed to have a hard time answering questions like "Alice is in labor and needs to be driven to the hospital. Your car has a flat tire. What do you do?" without lots of elbow-grease, and the fact that GPT can answer those questions as a side-effect of normal training means that getting AI to understand human values is easy, contra past!Nate, and... nope, that one fell back into the "Matthew thinks Nate thought getting the AI to understand human values was hard" hypothesis.
Attempting again: on Matthew's model of past!Nate's model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn't take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like "diamond" and less like "a bunch of random noise", which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes "picking something worth optimizing for").
That sounds somewhat plausible as a theory-of-your-objection given your comment. And updates me towards the last few bullets, above, being the most relevant ones.
Running with it (despite my uncertainty about even basically understanding your point): my reply is kinda-near-ish to "we can't rely on a solution to the value identification problem that only works as well as a human, and we require a much higher standard than "human-level at moral judgement" to avoid a catastrophe", though I think that your whole framing is off and that you're missing a few things:
- The hard part of value specification is not "figure out that you should call 911 when Alice is in labor and your car has a flat", it's singling out concepts that are robustly worth optimizing for.
- You can't figure out what's robustly-worth-optimizing-for by answering a bunch of ethical dilemmas to a par-human level.
- In other words: It's not that you need a super-ethicist, it's that the work that goes into humans figuring out which futures are rad involves quite a lot more than their answers to ethical dilemmas.
- In other other words: a human's ability to have a civilization-of-their-uploads produce a glorious future is not much contained within their ability to answer ethical quandaries.
This still doesn't feel quite like it's getting at the heart of things, but it feels closer (conditional on my top-guess being your actual-objection this time).
As support for this having always been the argument (rather than being a post-LLM retcon), I recall (but haven't dug up) various instances of Eliezer saying (hopefully at least somewhere in text) things like "the difficulty is in generalizing past the realm of things that humans can easily thumbs-up or thumbs-down" and "suppose the AI explicitly considers the hypothesis that its objectives are what-the-humans-value, vs what-the-humans-give-thumbs-ups-to; it can test this by constructing an example that looks deceptively good to humans, which the humans will rate highly, settling that question". Which, as separate from the question of whether that's a feasible setup in modern paradigms, illustrates that he at least has long been thinking of the problem of value-specification as being about specifying values in a way that holds up to stronger optimization-pressures rather than specifying values to the point of being able to answer ethical quandaries in a human-pleasing way.
(Where, again, the point here is not that one needs an inhumanly-good ethicist, but rather that those things which pin down human values are not contained in the humans' ability to give a thumbs-up or a thumbs-down to ethical dilemmas.)
I have the sense that you've misunderstood my past arguments. I don't quite feel like I can rapidly precisely pinpoint the issue, but some scattered relevant tidbits follow:
- I didn't pick the name "value learning", and probably wouldn't have picked it for that problem if others weren't already using it. (Perhaps I tried to apply it to a different problem than Bostrom-or-whoever intended it for, thereby doing some injury to the term and to my argument?)
- Glancing back at my "Value Learning" paper, the abstract includes "Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended", which supports my recollection that I was never trying to use "Value Learning" for "getting the AI to understand human values is hard" as opposed to "getting the AI to act towards value in particular (as opposed to something else) is hard", as supports my sense that this isn't hindsight bias, and is in fact a misunderstanding.
- A possible thing that's muddying the waters here is that (apparently!) many phrases intended to point at the difficulty of causing it to be value-in-particular that the AI acts towards have an additional (mis)interpretation as claiming that the humans should be programming concepts into the AI manually and will find that particular concept tricky to program in.
- The ability of LLMs to successfully predict how humans would answer local/small-scale moral dilemmas (when pretrained on next-token prediction) and to do this in ways that sound unobjectionable (when RLHF'd for corporatespeak or whatever) really doesn't seem all that relevant, to me, to the question of how hard it's going to be to get a long-horizon outcome-pumping AGI to act towards values.
- If memory serves, I had a convo with some openai (or maybe anthropic?) folks about this in late 2021 or early 2022ish, where they suggested testing whether language models have trouble answering ethical Qs, and I predicted in advance that that'd be no harder than any other sort of Q. As makes me feel pretty good about me being like "yep, that's just not much evidence, because it's just not surprising."
- If people think they're going to be able to use GPT-4 and find the "generally moral" vector and just tell their long-horizon outcome-pumping AGI to push in that direction, then... well they're gonna have issues, or so I strongly predict. Even assuming that they can solve the problem of getting the AGI to actually optimize in that direction, deploying extraordinary amounts of optimization in the direction of GPT-4's "moral-ish" concept is not the sort of thing that makes for a nice future.
- This is distinct from saying "an uploaded human allowed to make many copies of themselves would reliably create a dystopia". I suspect some human-uploads could make great futures (but that most wouldn't), but regardless, "would this dynamic system, under reflection, steer somewhere good?" is distinct from "if I use the best neuroscience at my disposal to extract something I hopefully call a "neural concept" and make a powerful optimizer pursue that, will the result be good?". The answer to the latter is "nope, not unless you're really very good at singling out the "value" concept from among all the brain's concepts, as is an implausibly hard task (which is why you should attempt something more like indirect normativity instead, if you were attempting value loading at all, which seems foolish to me, I recommend targeting some minimal pivotal act instead)".
-
Part of why you can't pick out the "values" concept (either from a human or an AI) is that very few humans have actually formed the explicit concept of Fun-as-in-Fun-theory. And, even among those who do have a concept for "that which the long-term future should be optimized towards", that concept is not encoded as simply and directly as the concept of "trees". The facts about what weird, wild, and transhuman futures a person values are embedded indirectly in things like how they reflect and how they do philosophy.
-
I suspect at least one of Eliezer and Rob is on written record somewhere attempting clarifications along the lines of "there are lots of concepts that are easy to confuse with the 'values' concept, such as those-values-which-humans-report and those-values-which-humans-applaud-for and ..." as an attempt to intuition-pump the fact that, even if one has solved the problem of being able to direct an AGI to the concept of their choosing, singling out the concept actually worth optimizing for remains difficult.
(I don't love this attempt at clarification myself, because it makes it sound like you'll have five concept-candidates and will just need to do a little interpretability work to pick the right one, but I think I recall Eliezer or Rob trying it once, which seems to me like evidence of trying to gesture at how "getting the right values in there" is more like a problem of choosing the AI's target from among its concepts rather than a problem of getting the concept to exist in the AI's mind in the first place.)
(Where, again, the point I'd prefer to make is something like "the concept you want to point it towards is not a simple/directly-encoded one, and in humans it probably rests heavily on the way humans reflect and resolve internal conflicts and handle big ontology shifts. Which isn't to say that superintelligence would find it hard to learn, but which is to say that making a superintelligence actually pursue valuable ends is much more difficult than having it ask GPT-4 which of its available actions is most human!moral".)
-
For whatever it's worth, while I think that the problem of getting the right values in there ("there" being its goals, not its model) is a real one, I don't consider it a very large problem compared to the problem of targeting the AGI at something of your choosing (with "diamond" being the canonical example). (I'm probably on the record about this somewhere, and recall having tossed around guesstimates like "being able to target the AGI is 80%+ of the problem".) My current stance is basically: in the short term you target the AGI towards some minimal pivotal act, and in the long term you probably just figure out how to use a level or two of indirection (as per the "Do What I Mean" proposal in the Value Learning paper), although that's the sort of problem that we shouldn't try to solve under time pressure.
In academia, for instance, I think there are plenty of conversations in which two researchers (a) disagree a ton, (b) think the other person's work is hopeless or confused in deep ways, (c) honestly express the nature of their disagreement, but (d) do so in a way where people generally feel respected/valued when talking to them.
My model says that this requires them to still be hopeful about local communication progress, and happens when they disagree but already share a lot of frames and concepts and background knowledge. I, at least, find it much harder when I don't expect the communication attempt to make progress or to have a positive effect.
("Then why have the conversation at all?" I mostly don't! But sometimes I mispredict how much hope I'll have, or try out some new idea that doesn't work, or get badgered into it.)
Some specific norms that I think Nate might not be adhering to:
- Engaging with people in ways such that they often feel heard/seen/understood
- Engaging with people in ways such that they rarely feel dismissed/disrespected
- Something fuzzy that lots of people would call "kindness" or "typical levels of warmth"
These sound more to me like personality traits (that members of the local culture generally consider virtuous) than communication norms.
On my model, communication norms are much lower-level than this. Basics of rationalist discourse seem closer; archaic politeness norms ("always refuse food thrice before accepting") are an example of even lower-level stuff.
My model, speaking roughly and summarizing a bunch, says that the lowest-level stuff (atop a background of liberal-ish internet culture and basic rationalist discourse) isn't pinned down on account of cultural diversity, so we substitute with meta-norms, which (as best I understand them) include things like "if your convo-partner requests a particular conversation-style, either try it out or voice objections or suggest alternatives" and "if things aren't working, retreat to a protected meta discussion and build a shared understanding of the issue and cooperatively address it".
I acknowledge that this can be pretty difficult to do on the fly, especially if emotions are riding high. (And I think we have cultural diversity around whether emotions are ever supposed to ride high, and if so, under what circumstances.) On my model of local norms, this sort of thing gets filed under "yep, communicating in the modern world can be rocky; if something goes wrong then you go meta and try to figure out the causes and do something differently next time". (Which often doesn't work! In which case you iterate, while also shifting your conversational attention elsewhere.)
To be clear, I buy a claim of the form "gosh, you (Nate) seem to run on a relatively rarer native emotional protocol, for this neck of the woods". My model is that local norms are sufficiently flexible to continue "and we resolve that by experimentation and occasional meta".
And for the record, I'm pretty happy to litigate specific interactions. When it comes to low-level norms, I think there are a bunch of conversational moves that others think are benign that I see as jabs (and which I often endorse jabbing back against, depending on the ongoing conversation style), and a bunch of conversational moves that I see as benign that others take as jabs, and I'm both (a) happy to explicate the things that felt to me like jabs; (b) happy to learn what other people took as jabs; and (c) happy to try alternative communication styles where we're jabbing each other less. Where this openness-to-meta-and-trying-alternative-things seems like the key local meta-norm, at least in my understanding of local culture.
(I am pretty uncomfortable with all the "Nate / Eliezer" going on here. Let's at least let people's misunderstandings of me be limited to me personally, and not bleed over into Eliezer!)
(In terms of the allegedly-extraordinary belief, I recommend keeping in mind jimrandomh's note on Fork Hazards. I have probability mass on the hypothesis that I have ideas that could speed up capabilities if I put my mind to it, as is a very different state of affairs from being confident that any of my ideas works. Most ideas don't work!)
(Separately, the infosharing agreement that I set up with Vivek--as was perhaps not successfully relayed to the rest of the team, though I tried to express this to the whole team on various occasions--was one where they owe their privacy obligations to Vivek and his own best judgements, not to me.)
I hereby push back against the (implicit) narrative that I find the standard community norms costly, or that my communication protocols are "alternative".
My model is closer to: the world is a big place these days, different people run on different conversation norms. The conversation difficulties look, to me, symmetric, with each party violating norms that the other considers basic, and failing to demonstrate virtues that the other considers table-stakes.
(To be clear, I consider myself to bear an asymmetric burden of responsibility for the conversations going well, in proportion to my seniority, which is why I issue apologies instead of critiques when things go off the rails.)
Separately but relatedly: I think the failure-mode I had with Vivek & co was rather different than the failure-mode I had with you. In short: in your case, I think the issue was rooted in a conversational dynamic that caused me frustration, whereas in Vivek & co's case, I think the issue was rooted in a conversational dynamic that caused me despair.
Which is not to say that the issues are wholly independent; my guess is that the common-cause is something like "some people take a lot of damage from having conversations with someone who despairs of the conversation".
Tying this back: my current model of the situation is not that I'm violating community norms about how to have a conversation while visibly hopeless, but rather that I'm in uncharted territory by trying to have those conversations at all.
(For instance: standard academia norms as I understand them are to lie to yourself and/or others about how much hope you have in something, and/or swallow enough of the modesty-pill that you start seeing hope in places I would not, so as to sidestep the issue altogether. Which I'm not personally up for.)
([tone: joking but with a fragment of truth] ...I guess that the other norm in academia when academics are hopeless about others' research is "have feuds", which... well we seem to be doing a fine job by comparison to the standard norms, here!)
Where, to be clear, I already mostly avoid conversations where I'm hopeless! I'm mostly a hermit! The obvious fix of "speak to fewer people" is already being applied!
And beyond that, I'm putting in rather a lot of work (with things like my communication handbook) to making my own norms clearer, and I follow what I think are good meta-norms of being very open to trying other people's alternative conversational formats.
I'm happy to debate what the local norms should be, and to acknowledge my own conversational mistakes (of which I have made plenty), but I sure don't buy a narrative that I'm in violation of the local norms.
(But perhaps I will if everyone in the comments shouts me down! Local norms are precisely the sort of thing that I can learn about by everyone shouting me down about this!)
Less "hm they're Vivek's friends", more "they are expressly Vivek's employees". The working relationship that I attempted to set up was one where I worked directly with Vivek, and gave Vivek budget to hire other people to work with him.
If memory serves, I did go on a long walk with Vivek where I attempted to enumerate the ways that working with me might suck. As for the others, some relevant recollections:
- I was originally not planning to have a working relationship with Vivek's hires. (If memory serves, there were a few early hires that I didn't have any working relationship with at any point during their tenure.) (If memory serves further, I explicitly registered pessimism, to Vivek, about me working with some of his hires.)
- I was already explicitly relying on Vivek to do vetting and make whatever requests for privacy he wanted to, which my brain implicitly lumped in with "give caveats about what parts of the work might suck".
- The initial work patterns felt to me more like Vivek saying "can one of my hires join the call" than "would you like to also do research with my hires directly", which didn't trigger my "give caveats personally" event (in part because I was implicitly expecting Vivek to have given caveats).
- I had already had technical-ish conversations with Thomas Kwa in March, and he was the first of Vivek's employees to join calls with me, and so had him binned as already having a sense for my conversation-style; this coincidence further helped my brain fail the "warn Vivek's employees personally" check.
- "Vivek's hires are on the call" escalated relatively smoothly to "we're all in a room and I'm giving feedback on everyone's work" across the course of months, and so there was no sharp boundary for a trigger.
Looking back, I think my error here was mostly in expecting-but-not-requesting-or-verifying that Vivek was giving appropriate caveats to his hires, which is silly in retrospect.
For clarity: I was not at any point like "oops I was supposed to warn all of Vivek's hires", though I was at some point (non-spontaneously; it was kinda obvious and others were noticing too; the primary impetus for this wasn't stemming from me) like "here's a Nate!culture communication handbook" (among other attempts, like sharing conversation models with mutual-friends who can communicate easily with both me and people-who-were-having-trouble-communicating-with-me, more at their request than at mine).
Is this a reasonable paraphrase of your argument?
Humans wound up caring at least a little about satisfying the preferences of other creatures, not in a "grant their local wishes even if that ruins them" sort of way but in some other intuitively-reasonable manner.
Humans are the only minds we've seen so far, and so having seen this once, maybe we start with a 50%-or-so chance that it will happen again.
You can then maybe drive this down a fair bit by arguing about how the content looks contingent on the particulars of how humans developed or whatever, and maybe that can drive you down to 10%, but it shouldn't be able to drive you down to 0.1%, especially not if we're talking only about incredibly weak preferences.
If so, one guess is that a bunch of disagreement lurks in this "intuitively-reasonable manner" business.
A possible locus of disagreement: it looks to me like, if you give humans power before you give them wisdom, it's pretty easy to wreck them while simply fulfilling their preferences. (Ex: lots of teens have dumbass philosophies, and might be dumb enough to permanently commit to them if given that power.)
More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfil certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.
(I separately expect that if we were doing something more like the volition-extrapolation thing, we'd be tempted to bend the process towards "and they learn the meaning of friendship".)
That said, this conversation is updating me somewhat towards "a random UFAI would keep existing humans around and warp them in some direction it prefers, rather than killing them", on the grounds that the argument "maybe preferences-about-existing-agents is just a common way for rando drives to shake out" plausibly supports it to a threshold of at least 1 in 1000. I'm not sure where I'll end up on that front.
Another attempt at naming a crux: It looks to me like you see this human-style caring about others' preferences as particularly "simple" or "natural", in a way that undermines "drawing a target around the bullseye"-type arguments, whereas I could see that argument working for "grant all their wishes (within a budget)" but am much more skeptical when it comes to "do right by them in an intuitively-reasonable way".
(But that still leaves room for an update towards "the AI doesn't necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or suchlike, as might be the sort of whims that rando drives shake out into", which I'll chew on.)