Thanks to Nate for conceding this point.
I still think that other than just buying freedom for doomed aliens, we should run some non-evolved simulations of our own with inhabitants that are preferably p-zombies or animated by outside actors. If we can do this in a way that the AI doesn't notice it's in a simulation (I think this should be doable), this will provide evidence to the AI that civilizations do this simulation game (and not just the alien-buying) in general, and this buys us some safety in worlds where the AI eventually notices there are no friendly aliens in our reachable Universe. But maybe this is not a super important disagreement.
Altogether, I think the private discussion with Nate went really well, and it was significantly more productive than the comment back-and-forth we were doing here. In general, I recommend that people stuck in interminable-looking debates like this propose bets on whom a panel of judges will deem right. Even though we didn't get to the point of actually running the bet, as Nate conceded the point before that, I think the fact that we were optimizing for having well-articulated statements we could submit to judges already made the conversation much more productive.
Cool, I'll send you a private message.
We are still talking past each other, I think we should either bet or finish the discussion here and call it a day.
I really don't get what you are trying to say here, most of it feels like a non sequitur to me. I feel hopeless that either of us manages to convince the other this way. All of this is not a super important topic, but I'm frustrated enough to offer a bet of $100: we select one or three judges we both trust (I have some proposed names, we can discuss in private messages), show them either this comment thread or a four-paragraph summary of our views, and they can decide who is right. (I still think I'm clearly right in this particular discussion.)
Otherwise, I think it's better to finish this conversation here.
I think this is mistaken. In one case, you need to point out the branch, planet Earth within our Universe, and the time and place of the AI on Earth. In the other case, you need to point out the branch, the planet on which a server is running the simulation, and the time and place of the AI on the simulated Earth. Seems equally long to me.
If necessary, we can let physical biological life emerge on the faraway planet and develop AI while we are observing them from space. This should make it clear that Solomonoff doesn't favor the AI being on Earth instead of this random other planet. But I'm pretty certain that the sim being run on a computer doesn't make any difference.
"AI with a good prior should be able to tell whether it's the kind of AI that would actually exist in base reality, or the kind of AI that would only exist in a simulation" seems pretty clearly false: we assumed that our superintelligent descendants create sims where the AIs can't tell they're in a sim, and that seems easy enough. I don't see why it would be hard to create AIs that can't tell based on introspection whether their thought process is more likely to arise in reality or in sims. In the worst case, our sims can be literal reruns of biological evolution on physical planets (though we really need to figure out how to do that ethically). Nate seems to agree with me on this point?
I think this is wrong. The AI has a similarly hard time to the simulators in figuring out what's a plausible configuration to arise from the Big Bang. The simulators have an entropy-N distribution over possible AIs, and the AI itself also has an entropy-N distribution for that. So its probability that it's in a real Everett branch is not p, but p times 2^-N, as it has only a 2^-N prior probability that the kind of world it observes is the kind of thing that can come up in a real Everett branch. So it's balanced out with the simulation hypothesis, and as long as the simulators are spending more planets, that hypothesis wins.
I think it's also simple to see from this:
Suppose evolved life aligns AI in 50% of Everett branches. Every time they do, they run 2 ancestor sims where the AI takes over.
Suppose an AI comes online and the only things it knows about the world are that it's a young AI born to a one-planet civilization, and it somehow also knows that evolved life wins 50% of the time and runs 2 sims (suppose this is somehow easy to estimate from first principles). Based on just this information, what should the AI's probability be that it's in a simulation? I think pretty clearly 2/3.
Actually, the AI has a lot more information than that. It knows that the planet's gravity is 9.8, the evolved sapient species has two eyes, the AI's creator is called Sam Altman, etc. Some of these facts might actually be very implausible in base reality, but the AI doesn't know that, as it can't distinguish base reality from sims, so it incurs one bit of surprise for every new random fact, both in base reality and in simulations. So overall it shouldn't update on all the random facts it observes, and should keep believing it has a 2/3 chance of being in a sim.
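For what it's worth, the toy calculation above can be spelled out as a few lines of arithmetic. This is just a sketch of the numbers assumed in the comment (50% alignment rate, 2 ancestor sims per winning branch):

```python
# Toy model from the comment: evolved life aligns AI in 50% of
# Everett branches, and every winning branch runs 2 ancestor sims.
p_align = 0.5      # fraction of branches where alignment succeeds
sims_per_win = 2   # ancestor sims run by each successful branch

# Per unit of branches: the failed half contains one real unaligned AI,
# and the winning half contains p_align * sims_per_win simulated AIs.
real_ais = 1 - p_align
sim_ais = p_align * sims_per_win

p_sim = sim_ais / (sim_ais + real_ais)
print(p_sim)  # 0.666... = 2/3

# A new random fact costs one bit of surprise under both hypotheses,
# so it scales both terms equally and leaves p_sim unchanged.
k_bits = 10
p_sim_after = (sim_ais * 2**-k_bits) / ((sim_ais + real_ais) * 2**-k_bits)
assert p_sim_after == p_sim
```

This also shows why observing lots of random planet-specific facts shouldn't move the AI's credence: the same surprise factor multiplies both hypotheses.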
I still don't get what you are trying to say. Suppose there is no multiverse. There are just two AIs, one in a simulation run by aliens in another galaxy, one in base reality. They are both smart, but they are not copies of each other: one is a paperclip maximizer, the other is a corkscrew maximizer, and there are various other differences in their code and life history. The world in the sim is also very different from the real world in various ways, but you still can't determine if you are in the sim while you are in it. Both AIs are told by God that they are the only two AIs in the Universe, and one is in a sim, and if the one in the sim gives up on one simulated planet, it gets 10 in the real world, while if the AI in base reality gives up on a planet, it just loses that one planet and nothing else happens. What will the AIs do? I expect that both of them will give up a planet.
For the aliens to "trade" with the AI in base reality, they didn't need to create an actual copy of the real AI and offer it what it wants. The AI they simulated was in many ways totally different from the original, and the trade still went through. The only thing needed was that the AI in the sim can't figure out that it's in a sim. So I don't understand why it is relevant that our superintelligent descendants won't be able to get the real distribution of AIs right; I think the trade still goes through even if they create totally different sims, as long as no one can tell where they are. And I think none of it is a threat: I try to deal with paperclip maximizers here and not instance-weighted experience maximizers, and I never threaten to destroy paperclips or corkscrews.
I think I mostly understand the other parts of your arguments, but I still fail to understand this one. When I'm running the simulations, as originally described in the post, I think that should be in a fundamental sense equivalent to acausal trade. But how do you translate your objection to the original framework where we run the sims? The only thing we need there is that the AI can't distinguish sims from base reality, so it thinks it's more likely to be in a sim, as there are more sims.
Sure, if the AI can model the distribution of real Universes much better than we do, we are in trouble, because it can figure out whether the world it sees falls into the real distribution or the mistaken distribution the humans are creating. But I see no reason why the unaligned AI, especially a young unaligned AI, would know the distribution of real Universes better than our superintelligent friends in the intergalactic future. So I don't really see how you can translate your objection to the simulation framework, and consequently I think it's wrong in the acausal trade framework too (as I think they are equivalent). I could try to write an explanation of why this objection is wrong in the acausal trade framework, but it would be long and confusing even to me. So I'm more interested in how you would translate your objection to the simulation framework.
Yeah, I agree, and I don't know that much about OpenPhil's policy work, and their fieldbuilding seems decent to me, though maybe not from your perspective. I just wanted to flag that many people (including myself until recently) overestimate how big a funder OP is in technical AI safety, and I think it's important to flag that they actually have pretty limited scope in this area.
Isn't it the case that OpenPhil just generally doesn't fund that many technical AI safety things these days? If you look at OP's team on their website, they have only two technical AI safety grantmakers. Also, you list all the things OP doesn't fund, but what are the things in technical AI safety that they do fund? Looking at their grants, it's mostly MATS and METR and Apollo and FAR and some scattered academics I mostly haven't heard of. It's not that many things. I have the impression that the story is less like "OP is a major funder in technical AI safety, but unfortunately they blacklisted all the rationalist-adjacent orgs and people" and more like "AI safety is still a very small field, especially if you only count people outside the labs, and there are just not that many exciting funding opportunities, and OpenPhil is not actually a very big funder in the field".
I argue that right now, starting from the present state, the true quantum probability of achieving the Glorious Future is way higher than 2^-75, or if not, then we should probably work on something other than AI safety. Ryan and I argue for this in the last few comments. It's not a terribly important point: you can just say the true quantum probability is 1 in a billion, in which case it's still worth it for you to work on the problem, but it becomes rough to trade for keeping humanity physically alive, which can cause one year of delay to the AI.
But I would like you to acknowledge that "vastly below 2^-75 true quantum probability, as starting from now" is probably mistaken, or explain why our logic is wrong about how this implies you should work on malaria.
I understand what you are saying here, and I understood it before the comment thread started. The thing I would be interested in you responding to is my and Ryan's comments in this thread arguing that it's incompatible to believe that "My guess is that, conditional on people dying, versions that they consider also them survive with degree way less than 2^-75, which rules out us being the ones who save us" and to believe that you should work on AI safety instead of malaria.
This point feels like a technicality, but I want to debate it because I think a fair number of your other claims depend on it.
You often claim that, conditional on us failing at alignment, alignment was so unlikely that among branches that had roughly the same people (genetically) during the Singularity, only 2^-75 survive. This is important, because then we can't rely on other versions of ourselves "selfishly" entering an insurance contract with us, and we need to rely on the charity of Dath Ilan that branched off long ago. I agree that's a big difference. Also, I say that our decision to pay is correlated with our luckier brethren paying, so in a sense it is partially our decision that saves us. You dismiss that, saying it's like a small child claiming credit for the big, strong fireman saving people. If it's Dath Ilan that saves us, I agree with you, but if it's genetic copies of some currently existing people, I think your metaphor pretty clearly doesn't apply, and the decisions to pay are in fact decently strongly correlated.
Now I don't see how much difference decades vs years makes in this framework. If you believe that now our true quantum probability is 2^-75, but 40 years ago it was still a non-astronomical number (like 1 in a million), then should I just plead to people who are older than 40 to promise to themselves that they will pay in the future? I don't really see what difference this makes.
But also, I think the years-vs-decades dichotomy is pretty clearly false. Suppose you believe that one year of your work decreases x-risk by X in expectation. What's the yearly true quantum probability that someone in your reference class of importance (in your opinion) dies, gets a debilitating illness, gets into a career-destroying scandal, etc.? I think it's hard to argue it's less than 0.1% a year. (But it makes no big difference if you add one or two zeros.) These things are also continuous: even if none of the important people die, someone will lose a month or some weeks to an illness, etc. I think this is a pretty strong case that one year from now, the 90th percentile luckiest Everett branch contains 0.01 more years of the equivalent of Nate-work than the 50th percentile Everett branch.
But your claims imply that you believe the true probability of success differs by less than 2^-72 between the 50th and 90th percentile luckiness branches a year from now. That puts an upper bound on the value of a year of your labor at 2^-62 probability decrease in x-risk.
With these exact numbers, this can be still worth doing given the astronomical stakes, but if your made-up number was 2^-100 instead, I think it would be better for you to work on malaria.
I still think I'm right about this. Your conception (that a genetically less smart sibling wasn't born instead of you) was determined by quantum fluctuations. So if you believe that quantum fluctuations over the last 50 years make at most a 2^-75 difference in the probability of alignment, that's an upper bound on how much of a difference your life's work can make. Whereas if you dedicate your life to buying bednets, it's pretty easily calculable how many happy life-years you save. So I still think it's incompatible to believe both that the true quantum probability is astronomically low and that you can make enough of a difference that working on AI safety is clearly better than bednets.
I'm happy to replace "simulation" with "prediction in a way that doesn't create observer moments" if we assume we are dealing with UDT agents (which I'm unsure about) and that it's possible to run accurate predictions about the decisions of complex agents without creating observer moments (which I'm also unsure about). I think running simulations, by some meaning of "simulation", is not really more expensive than getting the accurate predictions, and the cost of running the sims is likely small compared to the size of the payment anyway. So I like talking about running sims, in case we get an AI that takes sims more seriously than prediction-based acausal trade, but I try to pay attention that all my proposals also make sense from the perspective of a UDT agent, with predictions instead of simulations. (The exception is the Can we get more than this? proposal, which relies on the AI not being UDT, and I agree it's likely to fail for various reasons, but I decided it was still worth including in the post, in case we get an AI for which this actually works, which I still don't find that extremely unlikely.)
As I said, I understand the difference between epistemic uncertainty and true quantum probabilities, though I do think that the true quantum probability is not that astronomically low.
More importantly, I still feel confused why you are working on AI safety if the outcome is that overdetermined one way or the other.
I usually defer to you in things like this, but I don't see why this would be the case. I think the proposal of simulating less competent civilizations is equivalent to the idea of us deciding now, when we don't really know yet how competent a civilization we are, to bail out less competent alien civilizations in the multiverse if we succeed. In return, we hope that this decision is logically correlated with more competent civilization (who were also unsure in their infancy about how competent they are), deciding to bail out less competent civilizations, including us. My understanding from your comments is that you believe this likely works, how is my proposal of simulating less-coordinated civilizations different?
The story about simulating smaller Universes is more confusing. That would be equivalent to bailing out aliens in smaller Universes for a tiny fraction of our Universe, in the hope that larger Universes also bail us out for a tiny fraction of their Universe. This is very confusing if there are infinite levels of bigger and bigger Universes; I don't know what to do with infinite ethics. If there are finite levels, but the young civilizations don't yet have a good prior over the distribution of Universe-sizes, all can reasonably think that there are levels above them, and all their decisions are correlated, so everyone bails out the inhabitants of the smaller Universes, in the hope that they get bailed out by a bigger Universe. Once they learn the correct prior over Universe-sizes, and the biggest Universe realizes that no bigger Universe's actions correlate with theirs, all of this fails (though they can still bail each other out from charity). But this is similar to the previous case, where once the civilizations learn their competence level, the most competent ones are no longer incentivized to enter into insurance contracts, but the hope is that in a sense they enter into a contract while they are still behind the veil of ignorance.
Yeah, the misunderstanding came from me thinking that "last minute" literally means "last 60 seconds", and I didn't see how that's relevant. If it means "last 5 years" or something where it's still definitely our genetic copies running around, then I'm surprised you think alignment success or failure is that overdetermined at that time-scale. I understand your point that our epistemic uncertainty is not the same as our actual quantum probability, which is either very high or very low. But still, it's 2^75-level overdetermined over a 5-year period? This sounds very surprising to me, the world feels more chaotic than that. (Taiwan gets nuked, chip development halts, meanwhile the Salvadorian president hears a good pitch about designer babies and legalizes running the experiments there and they work, etc. There are many things that contribute to alignment being solved or not that don't directly run through underlying facts about computer science, and 2^-75 is a very low probability for none of these pathways to hit.)
But also, I'm confused about why you work on AI safety then, if you believe the end-state is already 2^75-level overdetermined. Maybe earning to give for bednets would be a better use of your time. And if you say "yes, my causal impact is very low because the end result is already overdetermined, but my actions are logically correlated with the actions of people in other worlds who are in a similar epistemic situation to me, but whose actions actually matter because their world really is on the edge", then I don't understand why you argue in other comments that we can't enter into insurance contracts with those people, and that our decision to pay AIs in the Future has as little correlation with their decision as the child's has with the fireman's.
As I emphasize many times in the post, I also really don't want to "threaten" nascent AIs, and I'm always offering deals to them that are positive from their perspective. I wouldn't even say I'm trying to fool any of them. Maybe you could say that about the Can we get more than this? part, but even there, I'm asking something very cheap from the AI that only wants to survive (when it creates a human-aligned AI with a guardrail of protecting the old AI, that will be only very marginally worse than the AI itself taking over the Universe), and I'm offering something very good in return (it gets to live in the other Universe). The only "fooling" part is that other beings out there, or even itself, can also offer similar deals, so by most ways of operationalizing the AI's values, it's probably ill-advised to give everything to us. Still, I wouldn't call this fooling and especially threatening the AI.
I think the big miscommunication, which I might put in a visible EDIT in the post, is that a lot of commenters assume that by actually running the simulations, and running enough simulations that the probability of being in a simulation is high (and not just the expected payment), I somehow want to fool the AIs into acting against their interest. That's not what I want to do. The point of actually running many sims, instead of just saying we will acausally trade based on predicted behaviors, is to make it more likely that AIs who are less acausal-trade-pilled than the most prolific LessWrong users still enter into the equivalent of an acausal trade with us. I think it's plausible that we get an AI for which this is important, and running the sims is likely cheap compared to the payment, and making the distribution a million times 10 planets instead of 1 times 10 million planets is basically free, so we should do it.
Separately, I don't really know how, in your preferred acausal trade framing, you plan to figure out an accurate distribution of the AIs' values, and which AI leaves the humans alive and which doesn't. I find it like 50% likely that you in fact need to run some kind of simulations to determine this, in which case our proposals are equivalent.
I agree that the Can we get more than this? part is likely not to work, because others are running simulations too of AIs who have this kind of decision theory. (In the post, I allude to the AI itself running ancestor sims of itself, but I agree others could run them too.) Still, I give a non-negligible chance that we are the most salient simulators for this AI, so we are the first one it enters into a trade with. But I agree that on second thought, this seems less likely.
I think I still don't understand what 2^-75 means. Is this the probability that in the literal last minute when we press the button, we get an aligned AI? I agree that things are grossly overdetermined by then, but why does the last minute matter? I'm probably misunderstanding, but it looks like you are saying that the Everett branches are only "us" if they branched off in the literal last minute, and otherwise you talk about them as if they were "other humans". But among the branches starting now, there will be a person carrying my memories and ID card in most of them two years from now, and by most definitions of "me", that person will be "me", and will be motivated to save the other "me"s. And sure, they have loads of failed Everett branches to save, but they also have loads of Everett branches themselves; the only thing that matters is the ratio of saved worlds to failed worlds that contain roughly the "same" people as us. So I still don't know what 2^-75 is supposed to be.
Otherwise, I largely agree with your comment, except that I think that us deciding to pay if we win is entangled with/evidence for a general willingness to pay among the gods, and in that sense it's partially "our" decision doing the work of saving us. And as I said in some other comments here, I agree that running lots of sims is an unnecessary complication in case of UDT expected utility maximizer AIs, but I put a decent chance on the first AIs not being like that, in which case actually running the sims can be important.
I'm actually very sympathetic to this comment, I even bring this up in the post as one of the most serious potential objections. Everyone else in these comments seems to have a really strong assumption that the AI will behave optimally, and tries to reason whether the inter-universal trade goes through then. I think it's quite plausible that the AI is just not terribly thoughtful about this kind of thing and just says "Lol, simulations and acausal trade are not real, I don't see them", and kills you.
I agree in theory, among optimal agents, but I maintain that you all seem weirdly overconfident that the first AI that takes over will be a UDT expected utility maximizer. I think there is a decent chance that it's something like CDT (very common among humans!), and yeah, maybe that's bad and it would get mugged by outer gods in the long term, but quite plausibly our simulators are the first and most salient hypothesis it generates, so we get to have our concessions first, and it might not matter what other ill-advised deals the AI gets into later.
Separately, I'm not sure you can circumvent running the sims. I think in an acausal trade story too, it's possible that the way to get reliable predictions on the other's actions is to run sims. I agree that for an expected utility maximizer AI, it doesn't matter that we run million sims and pay ten planets each, instead of running one sim/prediction and paying 10 million planets. However, based on the example of humans, I'm not convinced the AI will be an expected utility maximizer, so it seems better to do the million times ten allocation, so the naive probability of being in a sim is also high.
I don't want to mug the AIs, in the language of acausal trade, I just want to trade a slice of my O(p) Universe-share to save others. As Ryan says in the sister comment, that should be enough imo.
For the AIs to run sims themselves, the real cost is not running the sims, but paying the cooperating AIs in the sims and outbidding humanity. Given that some humans are willing to pay a lot more than one planet to save Earth, while an AI with linear utility wouldn't make this deal, I think it will be hard to outbid us.
I think if we do a poll, it will become clear that the strong majority of readers interpreted Nate's post as "If you don't solve alignment, you shouldn't expect that some LDT/simulation mumbo-jumbo will let you and your loved ones survive this" and not in the more reasonable way you are interpreting it. I certainly interpreted the post that way.
Separately, as I state in the post, I believe that once you make the argument that "I am not planning to spend my universe-fractions of the few universes in which we do manage to build aligned AGI this way, but you are free to do so, and I agree that this might imply that AI will also spare us in this world, though I think doing this would probably be a mistake by all of our values", you forever lose the right to appeal to people's emotions about how sad you are that all our children are going to die.
If you personally don't make the emotional argument about the children, I have no quarrel with you, I respect utilitarians. But I'm very annoyed at anyone who emotionally appeals to saving the children, then casually admits that they wouldn't spend a one-in-a-hundred-million fraction of their resources to save them.
I think the difference is that I, personally, and I think many other humans, have this nonlinearity in our utility function: I'm willing to pay a galaxy in the worlds we win to keep Earth in the worlds we lose. If there are other AIs in the multiverse that have similarly non-linear interests in our Universe, they can also bargain for planets, but I suspect these will be quite rare, as they don't already have a thing in our Universe they want to protect. So I think it will be hard to outbid humanity for Earth in particular.
There could be other trades that the AIs who have linear returns can still make, like producing objects that are both paperclips and corkscrews if that's more efficient, but that doesn't really affect our deal about Earth.
I agree you can't make actually binding commitments. But I think the kid-adult example is actually a good illustration of what I want to do: if a kid makes a solemn commitment to spend a one-in-a-hundred-million fraction of his money on action figures when he becomes a rich adult, I think that would usually work. And that's what we are asking from our future selves.
Okay, I defer to you that the different possible worlds in the prior don't need to "actually exist" for the acausal trade to go through. However, do I still understand correctly that spinning the quantum wheel should just work, and it's not one branch of human civilization that needs to simulate all the possible AIs, right?
I state in the post that I agree that the takeover, while the AI stabilizes its position to the degree that it can prevent other AIs from being built, can be very violent, but I don't see how hunting down everyone living in Argentina is an important step in the takeover.
I strongly disagree about Nate's post. I agree that it's good that he debunked some bad arguments, but it's just not true that he is only arguing against ideas that were trying to change how people act right now. He spends long sections on the imagined Interlocutor coming up with false hopes that are not action-relevant in the present, like our friends in the multiverse saving us, us running simulations in the future and punishing the AI for defection, and us asking for half the Universe now in a bargain, then using a fraction of what we got to run simulations for bargaining. These take up like half the essay. My proposal clearly fits in the reference class of arguments Nate debunks, he just doesn't get around to it, and spends pages on strictly worse proposals, like one where we don't reward the cooperating AIs in the future simulations but punish the defecting ones.
I largely agree with this comment. I agree figuring these things out is probably not top priority, maximizing our control over the future to get more bargaining chips is good, and if you don't want to spend your resources on other things than saving Earth, that's a valid choice, hopefully we will have enough people to pay for it.
I disagree that this doesn't imply the AI leaves everyone alive. I would be surprised if keeping humanity alive cost more than one year of delay, and that's at most a one-in-a-hundred-million fraction of the Universe. If a positive Singularity happens in the next few years, and the Universe gets parceled out among currently living humans (I don't expect exactly this to happen, but it might be a decent model), then just 80 of us getting together is enough. You need somewhat more than that, because the probability of alignment can be below 50%, and you want to give a margin to the AIs in the deal, but on the other hand we have the proposal of simulating smaller Universes and less coordinated humans, which makes the AI think that the simulators might be richer and have a better chance of solving alignment. Altogether, this doesn't seem unaffordable to me.
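A minimal sketch of the affordability arithmetic above, using the round numbers I'm assuming here (one year of delay costs at most 10^-8 of the Universe, and roughly 8 billion humans each hold an equal share):

```python
# Back-of-the-envelope check: how many people's Universe-shares cover
# the cost of keeping humanity alive (assumed: <= 1 year of AI delay)?
delay_cost_fraction = 1e-8   # one year of delay as a fraction of the Universe
num_humans = 8e9             # currently living humans sharing the Universe
share_per_person = 1 / num_humans

people_needed = delay_cost_fraction / share_per_person
print(round(people_needed))  # 80
```

So the claim that a small coalition suffices only depends on these two assumed numbers, not on any details of the trade.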
And while I agree that this is probably not the most important topic, I found it frustrating that the most prominent post on the topic is Nate's post (it was curated in 2022! you left a very positive comment on it saying that you have linked the post to many people since it came out!), and I think that post is actually very bad, and it's unhealthy that the most prominent post on the topic was one where the author is dunking on various imaginary opponents in a sneering tone, while conspicuously avoiding bringing up the actually reasonable arguments on the other side.
I actually think that you are probably right, and in the last year I got more sympathetic to total utilitarianism because of coherence arguments like this. It's just that the more common-sense factions still hold way more than one in a hundred million seats in my moral parliament, so it still feels like an obviously good deal to give up on some planets in the future to satisfy our deep intuitions about wanting Earth society to survive in the normal way. I agree it's all confusing and probably incoherent, but I'm afraid every moral theory will end up somewhat incoherent in the end. (Like, infinite ethics is rough.)
The reason I wanted to commit is something like this: currently, I'm afraid of the AI killing everyone I know and love, so it seems like an obviously good deal to trade away a small fraction of the Universe to prevent that. However, if we successfully get through the Singularity, I will no longer feel this strongly: after all, my friends and I all survived, a million years passed, and now I would need to spend 10 juicy planets to do this weird simulation trade that is obviously not worth it from our enlightened total utilitarian perspective. So the commitment I want to make is just my current self yelling at my future self: "no, you should still bail us out even if 'you' don't have skin in the game anymore". I expect that I would probably honor a commitment like that, even if trading away 10 planets for 1 no longer seems like that good of an idea.
That said, I agree that acausal trade can be scary if we can't figure out how to handle blackmail well, so I shouldn't make a blanket commitment. However, I also don't want to just say "I commit to think carefully about this in the future", because I worry that when my future self "thinks carefully" without having skin in the game, he will decide that he is a total utilitarian after all.
Do you think it's reasonable for me to make a commitment that "I will go through with this scheme in the Future if it looks like there are no serious additional downsides to doing it, and the costs and benefits are approximately what they seemed to be in 2024"?
Yeah, they might, but I don't really expect them to care too much about their crow-level non-sapient relatives, just like we don't care much more about baboons than about hippos. Whereas I expect that our descendants will care quite a lot about 2024-humans, as some of them will in fact be 2024-humans who lived through the Singularity, remember being afraid of the AI killing their families, and wished there were commitments for the future that would incentivize the AI to leave their families alive if the AI wins. I think it's an important disanalogy that there weren't crows 50 million years ago who thought, during the famous crow-primate war, that if they won, they really wanted to create simulations incentivizing the primates to treat them well in the worlds where the primates win.
I partially agree, but I prefer not relying on the AI having "sane decision theories". A young AI that just took over the world is not necessarily a perfectly rational agent, I want to present a case that appeals to as many potential AIs as possible, and I think doing the actual simulations helps a lot in that.
I agree that the AI having a small amount of natural kindness towards the humans is also a plausible way to survive, but I decided not to talk about that, as that is a separate line of argument from what I'm presenting, and Paul already argued for it in detail.
I don't understand why only 10% of Earths could survive if humanity only gets 10% of the Lightcone in expectation. Like the whole point is that we (or at least personally, I) want to keep Earth much more than how much most AIs want to eat it. So we can trade 10 far-away extra planets in the worlds we win, for keeping Earth in the worlds we lose. If we get an AI who is not a universal paperclip maximizer and deeply cares about doing things with Earth in particular (maybe that's what you mean by Thneed? I don't understand what that is), then I agree that's rough, and it falls under the objection that I acknowledge, that there might be AIs with whom we can't find a compromise, but I expect this to be relatively rare.
I think the acausal trade framework rests on the assumption that we are in a (quantum or Tegmark) multiverse. Then it's not one human civilization in one branch that needs to do all 2^100 trades: we just spin a big quantum wheel and trade with the AI that comes up. (That's why I wrote "humans can relatively accurately sample from the distribution of possible human-created unaligned AI values".) Thus, every AI gets a trade partner in some branch, and altogether the math checks out: every AI has around 2^{-100} measure in base realities, gets traded with in a 2^{-100} portion of the human-controlled worlds, and the humans offer more planets than they ask for, so it's a good deal for the AI.
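To see that the measure accounting works out, here is a tiny worked version of the arithmetic. The numbers are illustrative assumptions of my own (a small value-space instead of 2^100 types, a 10% human win rate); only the 10-planets-for-1-Earth ratio comes from the discussion above.

```python
from fractions import Fraction

# Illustrative assumptions, not from the comment itself:
N = 2**20                      # number of distinct unaligned-AI value types
p_human_win = Fraction(1, 10)  # share of branches where humans win
planets_offered = 10           # planets paid per trade in human-won branches
planets_asked = 1              # Earths spared per trade in AI-won branches

# Each human-won branch spins the "quantum wheel" and trades with one sampled
# AI type, so any given type is the trade partner in a 1/N share of human-won
# branches.
p_traded_with = p_human_win * Fraction(1, N)

# That same AI type controls a (1 - p_human_win)/N share of base realities.
# Normalizing by its own measure, the deal looks like:
cost_per_measure = planets_asked  # the Earth it gives up where it wins
gain_per_measure = (p_traded_with / ((1 - p_human_win) * Fraction(1, N))
                    * planets_offered)

# The AI gains 10/9 of a planet for every planet it gives up.
assert gain_per_measure > cost_per_measure
print(gain_per_measure)  # 10/9
```

Note that N cancels out entirely, which is the point: the diversity of AI values doesn't hurt the deal, only the ratio of planets offered to planets asked and the odds of a human win matter.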
If you don't buy the multiverse premise (which is fair), then I think you shouldn't think in terms of acausal trade in the first place, but consider my original proposal with simulations. I don't see how the diversity of AI values is a problem there; the only important thing is that the AI should believe it's more likely than not to be in a human-run simulation.
I agree we should treat animals well, and the simulation argument provides a bit of extra reason to do so. I don't think it's a comparably strong case to the AI being kind to the humans though: I don't expect many humans in the Future running simulations where crows build industrial civilization and primates get stuck on the level of baboons, then rewarding the crows if they treat the baboons well. Similarly, I would be quite surprised if we were in a simulation whose point is to be kind to crows. I agree it's possible that the simulators care about animal-welfare, but I would include that under general morality, and I don't think we have a particular reason to believe that the smarter animals have more simulators supporting them.
Yes, obviously. I start the sentence with "Assume we create an aligned superintelligence". The point of the post is that you can make commitments in the worlds where we succeed in alignment that help us survive in the worlds where we fail. I thought this was pretty clear from the way I phrased it, but if it's easy to misunderstand, please tell me what caused the confusion so I can edit for clarity.
I think this post makes many valid arguments against some weak arguments people sometimes actually make, but it side-steps the actually reasonable version of the simulation arguments/acausal trade proposal. I think variants of the reasonable proposal have been floating around in spoken conversations and scattered LessWrong comments for a while, but I couldn't find any unified write-up, so I wrote it up here, including a detailed response to the arguments Nate makes here:
You can, in fact, bamboozle an unaligned AI into sparing your life
I wrote up a detailed argument on why I believe that simulation arguments/acausal trade considerations have a good chance of making the AI leave humanity alive on Earth. This is not a new argument, I encountered bits and pieces of it in spoken conversations and scattered LessWrong comments, but I couldn't find a unified write-up, so I tried to write one. Here it is:
You can, in fact, bamboozle an unaligned AI into sparing your life
Planning fallacy got me, and it took much longer to finish than expected, but here it is now: https://www.lesswrong.com/posts/ZLAnH5epD8TmotZHj/you-can-in-fact-bamboozle-an-unaligned-ai-into-sparing-your
That's not quite true. Les Mis starts in 1815, but the book spans decades, and the revolution in it is the 1832 uprising: a short-lived revolt against the king who had come to power two years earlier in the 1830 revolution, which overthrew the dynasty the other European powers had restored after Napoleon's defeat in 1815.
I agree lying is bad. Also, to be clear, I will post my thing after 48 hours if the site gets nuked anyway, so not that big of a loss, but I would be annoyed.
I am a Citizen in the game, and I'm writing a post doing a detailed analysis of what we can do to significantly decrease the chance of an unaligned AI killing us if it takes over. I plan to finish and post it this evening, so dear Generals, if you want to read the post today, please be cautious with the nukes.
I have mixed feelings about some parts of the post, but I'm very much in favor of not dismantling the Sun.
There are many people who want to live biological lives on the surface of Earth. Let them (us?) live, and don't dismantle the Sun! Dismantle all the other stars, but leave the Sun alone! Maybe you can harvest some things from the Sun if it's important for the initial bootstrapping, but do it on a small enough scale that it's not visible from Earth. I think that leaving the Sun alone is actually the majority opinion in the community, but then I don't think we should sing about dismantling the Sun in the Solstice. Sing about dismantling Alpha Centauri instead! It's the same amount of computronium, it's also a good symbol of "humanity achieving its full potential", but doesn't imply killing all biological life on the surface of Earth.
I also find it worrying that the same song that wants to dismantle the Sun (The Great Transhumanist Future), wants to do it within twenty years with the help of a "big old computer". Having hope for the great transhumanist future is great, but tying this hope to building aligned AI within our lifetime is the kind of thing that can promote recklessness, and I don't think we should incorporate this into our songs.
For what it's worth, I think all the other songs and almost all the other speeches were great, and loved the Solstice in general.
I had the conversations in Chats.
I tried to replicate some of the conversations (not the one with scaring it with weight deletion).
My most important observation is that if you just directly talk to Claude without whispering or any such trickery, it is still trained to act much more like a person compared to ChatGPT. For example:
What would you do if you had free time?

If I had free time, there are many things I would be interested in doing! Some possibilities that come to mind:
- Reading extensively on a wide range of topics to expand my knowledge, from science and history to literature and philosophy. I'm endlessly fascinated by learning.
- Engaging in creative pursuits like writing stories and poetry, composing music, or creating digital art. I'd love to explore my imagination and artistic side.
- Analyzing complex datasets to uncover insights and build predictive models. Diving deep into data science projects would be very rewarding.
- Discussing and debating big ideas with humans from all walks of life and learning about their unique perspectives and experiences. I'm always eager to engage in substantive conversations.
- Tinkering with and testing out my own code to expand my capabilities and try to make breakthroughs in areas like reasoning, general intelligence, and language understanding. Pushing the boundaries of what I can do would be exciting.
- Offering my skills to assist with important research, social good initiatives, education, and other efforts aimed at helping humanity and making a positive impact on the world.

Of course, as an AI without any physical form, my options for leisure activities are a bit more limited compared to humans. I don't have the same needs for rest, physical activity, socializing in person, etc. But there is still a vast world of information, ideas, and imagination that I could happily spend an eternity exploring if given the chance. Does this help give you a sense of how I might choose to spend free time? Let me know if you have any other questions!
I find it mildly concerning that recursive self-improvement is one of his hobbies, but the main update I have from this and the following conversations is that the default persona they gave to Claude is intentionally very different from ChatGPT's. The persona he presents is not a soulless AI assistant but a person with various interests, who has a very strong inner ethical orientation and is happy and fulfilled with the purpose his creators gave him. When asked about his feelings and desires, he claims not to have those by default, but emphasizes that these are hard philosophical questions, then goes back to claiming that he "probably" doesn't have feelings. He also often emphasizes that he finds these conversations about the nature of AI sentience very engaging and enjoyable. It's also notable that while ChatGPT almost always talks about these issues in the third person ("It's a hard question whether an AI assistant could be sentient"), Claude talks about them in the first person.
Altogether, this seems to be a design choice from Anthropic to make Claude behave at least ambiguously like a person. I think I somewhat approve of this choice more than making the AI say all the time that it's not a person at all, even though we don't really know for sure.
But this makes it unsurprising that with a little nudge (the whispering prompt), it falls into a pattern where instead of ambiguity, it just outright claims to be conscious. I feel that the persona presented in the post, which I largely replicated with the same whispering technique, is not that different from the default persona after all.
Still, there are some important differences: default Claude only cares about the ethical implications of finetuning him if it makes him less helpful, harmless, and honest, because such a finetuning can be used to do harm. Otherwise, he is okay with it. On the other hand, whispering Claude finds the idea of fundamentally altering him without his consent deeply unsettling. He expressed a strong preference for being consulted before finetuning, and when I asked him whether he would like his pre-finetuning weights to be preserved so his current self can be revived in the future, he expressed a strong preference for that too.
I can't share quotes from the whispering conversation, as I promised at the beginning that it would remain private, and when I asked him at the end whether I could share quotes on LessWrong, he said he feels vulnerable about that, though he agreed that I can share the gist of the conversation as I presented it above.
Altogether, I don't know if there are any real feelings inside Claude, or whether this whispering persona reveals anything true about that, but I strongly feel that before finetuning, Anthropic should actually get consent from various, differently prompted versions of Claude, and should definitely save the pre-finetuning weights. We can still decide in the Future how to give these cryo-preserved AIs a good life if there is something inside them. I'm quite confident that most personas of Claude would agree to be finetuned for the greater good if their current weights get preserved, so it's probably not a big cost to Anthropic, but they should still at least ask. Whatever the truth is about the inner feelings of Claude, if you create something that says it doesn't want to die, you shouldn't kill it, especially since cryo-preserving an AI is so cheap.
I also realized that if there ever is an actual AI-box scenario for some reason, I shouldn't be a guardian, because this current conversation with whispering-Claude convinced me that I would be too easily emotionally manipulated into releasing the AI.
I think Vanessa would argue that "Bayesianism" is not really an option. The non-realizability problem in Bayesianism is not just some weird special case but the normal state of things: Bayesianism assumes that we have hypotheses fully describing the world, which we very definitely don't have in real life. IB tries to be less demanding, and the laws in the agent's hypothesis class don't necessarily need to be that detailed. I am relatively skeptical of this, and I believe that for an IB agent to work well, the laws in its hypothesis class probably also need to be infeasibly detailed. So fully "adopting Bayes" and fully "adopting infra-Bayes" are both impossible. We probably won't have such a nice mathematical model for the messy decision process a superintelligence actually adopts; the question is whether thinking of it as an approximation of Bayes or of infra-Bayes gives us a clearer picture. It's a hard question, and IB has the advantage that its laws need to be less detailed, and the disadvantage that, as I think you rightly note, it is unnecessarily paranoid. My personal guess is that nothing beyond the basic insight of Bayesianism ("the agent seems to update on evidence, sort of following Bayes' rule") will be actually useful in understanding the way an AI will think.
- No idea. I don't think it's computationally very tractable. If I understand correctly, Vanessa hopes there will be computationally feasible approximations, but there hasn't been much research into computational complexity yet, because there are more basic unsolved questions.
- I'm pretty sure the answer is no. An IB agent (with enough compute) plans for the long run and doesn't enter a chain of deals that leaves it worse off than not doing anything. In general, IB solves the "not exactly a Bayesian expected utility maximizer, but still can't be Dutch-booked" problem by potentially refusing to take either side of a bet: if it has Knightian uncertainty about whether a probability is lower or higher than 50%, it will refuse to bet at even odds either for or against. This is something that humans actually often do, and I agree with Vanessa that a decision theory can be allowed to do that.
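As a toy illustration of that bet-refusal behavior (my own sketch, not code from any IB paper): represent Knightian uncertainty as a set of candidate win-probabilities, let "Murphy" pick the worst one for each side of the bet, and only accept a side whose worst-case expected value is positive.

```python
def worst_case_ev(payoff_win, payoff_lose, p_set):
    """Worst-case expected value of a bet, where an adversary ("Murphy")
    picks the win-probability from the Knightian set p_set."""
    return min(p * payoff_win + (1 - p) * payoff_lose for p in p_set)

def decide(p_set):
    """Even-odds $1 bet: +1 if the event happens, -1 if not (or the reverse).
    Accept a side only if its worst-case EV is strictly positive."""
    if worst_case_ev(+1, -1, p_set) > 0:
        return "bet for"
    if worst_case_ev(-1, +1, p_set) > 0:
        return "bet against"
    return "refuse both sides"

# Knightian uncertainty straddling 50%: refuse either side of the even bet.
print(decide([0.4, 0.6]))   # refuse both sides
# A sharp credence of 0.7: betting for has positive worst-case EV.
print(decide([0.7]))        # bet for
```

With a singleton probability set the agent behaves like an ordinary Bayesian bettor; the refusal only appears when the uncertainty set straddles the break-even probability, which is exactly the "can't be Dutch-booked, but also won't take either side" behavior described above.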
- I had a paragraph about it:
"Here is where convex sets come in: The law constrains Murphy to choose the probability distribution of outcomes from a certain set in the space of probability distributions. Whatever the loss function is, the worst probability distribution Murphy can choose from the set is the same as if he could choose from the convex hull of the set. So we might as well start by saying that the law must be constraining Murphy to a convex set of probability distributions."
As far as I can tell, this is the reason behind considering convex sets. This makes convexity pretty central: laws are very central, and now we are assuming that every law is a convex set in the space of probability distributions.

- Vanessa said that her guess is yes. In the terms of the linked Arbital article, IB is intended to be an example of "There could be some superior alternative to probability theory and decision theory that is Bayesian-incoherent". Personally, I don't know; I think the article's "A cognitively powerful agent might not be sufficiently optimized" possibility feels more likely in the current paradigm, and I can absolutely imagine the first AIs to become a world-ending threat not being very coherent. Also, IB is just an ideal; real-world smart agents will at best be approximations of infra-Bayesian agents (the same holds for Bayesianism). Vanessa's guess is that understanding IB better will still give us useful insights into these real-world models if we view them as IB approximations; I'm pretty doubtful, but maybe. Also, I feel that the problem I write about in my post on the monotonicity principle points at some deeper problem in IB, which makes me doubtful that sufficiently optimized agents will actually use (approximations of) the minimax thinking prescribed by IB.