Furthermore, I don't think independent AI safety funding is that important anymore; models are smart enough now that most of the work to do in AI safety is directly working with them, most of that is happening at labs,
It might be the case that most of the quality weighted safety research involving working with large models is happening at labs, but I'm pretty skeptical that having this mostly happen at labs is the best approach and it seems like OpenPhil should be actively interested in building up a robust safety research ecosystem outside of labs.
(Better model access seems substantially overrated in its importance and large fractions of research can and should happen with just prompting or on smaller models. Additionally, at the moment, open weight models are pretty close to the best models.)
(This argument is also locally invalid at a more basic level. Just because this research seems to be mostly happening at large AI companies (a claim which I'm also somewhat skeptical of) doesn't imply that this is the way it should be, and funding should try to push people to do better stuff rather than merely reacting to the current allocation.)
Here is another more narrow way to put this argument:
- Let's say Nate is 35 (arbitrary guess).
- Let's say that branches which deviated 35 years ago would pay for our branch (and other branches in our reference class). The case for this is that many people are over 50 (thus existing in both branches), and care about deviated versions of themselves and their children etc. Probably the discount relative to zero deviation is less than 10x.
- Let's say that Nate thinks that if he didn't ever exist, P(takeover) would go up by 1 / 10 billion (roughly 2^-32). If it was wildly lower than this, that would be somewhat surprising and might suggest different actions.
- Nate existing is sensitive to a bit of quantum randomness 35 years ago, so other people as good as Nate existing could be created with a bit of quantum randomness. So, 1 bit of randomness can reduce risk by at least 1 / 10 billion.
- Thus, 75 bits of randomness presumably reduce risk by > 1 / 10 billion, which is >> 2^-75.
(This argument is a bit messy because presumably some logical facts imply that Nate will be very helpful and some imply that he won't be very helpful, and I was taking an expectation over this while we really care about the effect on all the quantum branches. I'm not sure how to make the argument exactly right, but I think it is at least roughly right.)
What about the case where we only go back 10 years? We can apply the same argument, but instead just use some number of bits (e.g. 10) to make Nate work a bit more, say 1 week of additional work, via changing whether Nate ends up getting sick (by adjusting the weather or which children are born, or whatever). This should also reduce doom by 1 week / (52 weeks/year) / (20 years of work) * 1 / 10 billion = 1 / 10 trillion.
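Spelling out the arithmetic (same numbers as above, nothing new):

$$\frac{1\ \text{week}}{52\ \text{weeks/year} \times 20\ \text{years}} \times \frac{1}{10^{10}} \approx \frac{1}{1000} \times 10^{-10} = 10^{-13} = \frac{1}{10\ \text{trillion}},$$

which is still vastly larger than $2^{-75} \approx 2.6 \times 10^{-23}$.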
And surely there are more efficient schemes.
To be clear, only having ~ 1 / 10 billion branches survive is rough from a trade perspective.
Thanks, this seems like a reasonable summary of the proposal and a reasonable place to wrap.
I agree that kindness is more likely to buy human survival than something better described as trade/insurance schemes, though I think the insurance schemes are reasonably likely to matter.
(That is, reasonably likely to matter if the kindness funds aren't large enough to mostly saturate the returns of this scheme. As a wild guess, maybe 35% likely to matter on my views on doom and 20% on yours.)
At a more basic level, I think the situation is just actually much more confusing than human extinction in a bunch of ways.
(Separately, under my views misaligned AI takeover seems worse than human extinction due to (e.g.) biorisk. This is because primates or other closely related species seem very likely to re-evolve into an intelligent civilization, and I feel better about this civilization than AIs.)
And I think this is also true by the vast majority of common-sense ethical views. People care about the future of humanity. "Saving the world" is hugely more important than preventing the marginal atrocity. Outside of EA I have never actually met a welfarist who only cares about present humans. People of course think we are supposed to be good stewards of humanity's future, especially if you select on the people who are actually involved in global scale decisions.
Hmmm, I agree with this as stated, but it's not clear to me that this is scope sensitive. As in, suppose that the AI will eventually leave humans in control of earth and the solar system. Do people typically think this is extremely bad? I don't think so, though I'm not sure.
And, I think trading for humans to eventually control the solar system is pretty doable. (Most of the cost of the trade is in preventing the earlier slaughter and violence that would have been useful for takeover, or in avoiding delay.)
(To be clear, like habryka, I mostly care about the long-term future and scope-sensitive resource use.)
Sure, we can amend to:
"I believe that AI takeover would eliminate humanity's control over its future, has a high probability of killing billions, and should be strongly avoided."
We could also say something like "AI takeover seems similar to takeover by hostile aliens with potentially unrecognizable values. It would eliminate humanity's control over its future and has a high probability of killing billions."
I agree that "not dying in a base universe" is a more reasonable thing to care about than "proving people right that takeoff is slow" but I feel like both lines of argument that you bring up here are doing something where you take a perspective on the world that is very computationalist, unituitive and therefore takes you to extremely weird places, makes strong assumptions about what a post-singularity humanity will care about, and then uses that to try to defeat an argument in a weird and twisted way that maybe is technically correct, but I think unless you are really careful with every step, really does not actually communicate what is going on.
I agree that common sense morality and common sense views are quite confused about the relevant situation. Indexical selfish perspectives are also pretty confused and are perhaps even more incoherent.
However, I think that under the most straightforward generalization of common sense views or selfishness where you just care about the base universe and there is just one base universe, this scheme can work to save lives in the base universe[1].
I legitimately think that common sense moral views should care less about AI takeover due to these arguments. As in, there is a reasonable chance that a bunch of people aren't killed due to these arguments (and other different arguments) in the most straightforward sense.
I also think "the AI might leave you alone, but we don't really know and there seems at least a high chance that huge numbers of people, including you, die" is not a bad summary of the situation.
In some sense they are an argument against any specific human-scale bad thing to happen, because if we do win, we could spend a substantial fraction of our resources with future AI systems to prevent that.
Yes. I think any human-scale bad thing (except stuff needed for the AI to most easily take over and solidify control) can be paid for and this has some chance of working. (Tiny amounts of kindness works in a similar way.)
Humanity will not go extinct, because we are in a simulation.
FWIW, I think it is non-obvious how common sense views interpret these considerations. I think it is probably common to just care about base reality? (Which is basically equivalent to having a measure etc.) I do think that common sense moral views don't consider it good to run these simulations for this purpose, while bailing out aliens who would have bailed us out is totally normal/reasonable under common sense moral views.
It is obviously extremely fucking bad for AI to disempower humanity. I think "literally everyone you know dies" is a much more accurate capture of that, and also a much more valid conclusion from conservative premises
Why not just say what's more straightforwardly true:
"I believe that AI takeover has a high probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that's likely to be a mistake even if it doesn't lead to billions of deaths."
I don't think "literally everyone you know dies if AI takes over" is accurate because I don't expect that in the base reality version of this universe for multiple reasons. Like it might happen, but I don't know if it is more than 50% likely.
It's not crazy to call the resulting scheme "multiverse/simulation shenanigans" TBC (as it involves prediction/simulation and uncertainty over the base universe), but I think this is just because I expect that multiverse/simulation shenanigans will alter the way AIs in base reality act in the common sense straightforward way. ↩︎
I probably won't respond further than this. Some responses to your comment:
I agree with your statements about the nature of UDT/FDT. I often talk about "things you would have committed to" because it is simpler to reason about and easier for people to understand (and I care about third parties understanding this), but I agree this is not the true abstraction.
It seems like you're imagining that we have to bamboozle some civilizations which seem clearly more competent than humanity by your lights. I don't think this is true.
Imagine we take all the civilizations which are roughly equally-competent-seeming-to-you and these civilizations make such an insurance deal[1]. My understanding is that your view is something like P(takeover) = 85%. So, let's say all of these civilizations are in a similar spot from your current epistemic perspective. While I expect that you think takeover is highly correlated between these worlds[2], my guess is that you should think it would be very unlikely that >99.9% of all of these civilizations get taken over. As in, even in the worst 10% of worlds where takeover happens in our world and the logical facts on alignment are quite bad, >0.1% of the corresponding civilizations are still in control of their universe. Do you disagree here? >0.1% of universes should be easily enough to bail out all the rest of the worlds[3].
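To sketch the numbers (using the ~1/100 million per-world cost figure I use elsewhere in this discussion; the exact figure doesn't matter much for the argument):

$$\underbrace{\frac{0.999}{0.001}}_{\text{losing worlds per surviving world}} \times \underbrace{10^{-8}}_{\text{cost to spare one world's people}} \approx 10^{-5}$$

of each surviving civilization's resources, which is far below (e.g.) a 1/1000 willingness-to-pay.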
And, if you really, really cared about not getting killed in base reality (including on reflection etc) you'd want to take a deal which is at least this good. There might be better approaches which reduce the correlation between worlds and thus make the fraction of available resources higher, but you'd like something at least this good.
(To be clear, I don't think this means we'd be fine, there are many ways this can go wrong! And I think it would be crazy for humanity to rely on this. I just think this sort of thing has a good chance of succeeding.)
(Also, my view is something like P(takeover) = 35% in our universe and in the worst 10% of worlds 30% of the universes in a similar epistemic state avoided takeover. But I didn't think about this very carefully.)
And further, we don't need to figure out the details of the deal now for the deal to work. We just need to make good choices about this in the counterfactuals where we were able to avoid takeover.
Another way to put this is that you seem to be assuming that there is no way our civilization would end up being the competent civilization doing the payout (and thus to survive some bamboozling must occur). But your view is that it is totally plausible (e.g. 15%) from your current epistemic state that we avoid takeover and thus a deal should be possible! While we might bring in a bunch of doomed branches, ex-ante we have a good chance of paying out.
I get the sense that you're approaching this from the perspective of "does this exact proposal have issues" rather than "in the future, if our enlightened selves really wanted to avoid dying in base reality, would there be an approach which greatly (acausally) reduces the chance of this". (And yes, I agree this is a kind of crazy and incoherent thing to care about, as you can just create more happy simulated lives with those galaxies.)
There just needs to exist one such insurance/trade scheme which can be found and it seems like there should be a trade with huge gains to the extent that people really care a lot about not dying. Not dying is very cheap.
Hmm, maybe I misunderstood your point. I thought you were talking about using simulations to anthropically capture AIs. As in, creating more observer moments where AIs take over less competent civilizations but are actually in a simulation run by us.
If you're happy to replace "simulation" with "prediction in a way that doesn't create observer moments" and think the argument goes through either way then I think I agree.
I agree that paying out to less competent civilizations if we find out we're competent and avoid takeover might be what you should do (as part of a post-hoc insurance deal via UDT or as part of a commitment or whatever). As in, this would help avoid getting killed if you ended up being a less competent civilization.
The smaller thing won't work exactly for getting us bailed out. I think infinite ethics should be resolvable, and should end up getting resolved with something roughly similar to some notion of reality-fluid, which implies that you just have to pay more for higher-measure places. (Of course, people might disagree about the measure etc.)
But trouble arises when the competent ones can refine P(takeover) for the various planets by thinking a little further.
Similar to how the trouble arises when you learn the result of the coin flip in a counterfactual mugging? To make it exactly analogous, imagine that the mugging is based on whether the 20th digit of pi is odd (Omega didn't know the digit at the point of making the deal) and you could just go look it up. Isn't the situation exactly analogous, and isn't this the whole problem that UDT was intended to solve?
(For those who aren't familiar with counterfactual muggings, UDT/FDT pays in this case.)
To spell out the argument, wouldn't everyone want to make a deal prior to thinking more? Like you don't know whether you are the competent one yet!
Concretely, imagine that each planet could spend some time thinking and be guaranteed to determine whether their P(takeover) is 99.99999% or 0.0000001%. But, they haven't done this yet and their current view is 50%. Everyone would ex-ante prefer an outcome in which you make the deal rather than thinking about it and then deciding whether the deal is still in their interest.
At a more basic level, let's assume your current views on the risk after thinking about it a bunch (80-90% I think). If someone had those views on the risk and cared a lot about not having physical humans die, they would benefit from such an insurance deal! (They'd have to pay higher rates than aliens in more competent civilizations of course.)
It's more like: people don't enter into insurance pools against cancer with the dude who smoked his whole life and has a tumor the size of a grapefruit in his throat.
Sure, but you'd potentially want to enter the pool at the age of 10 prior to starting smoking!
To make the analogy closer to the actual case, suppose you were in a society where everyone is selfish, but every person has a 1/10 chance of becoming fabulously wealthy (e.g. owning a galaxy). And, if you commit as of the age of 10 to pay 1/1,000,000 of your resources in the fabulously wealthy case, you can ensure that the version of you in the non-wealthy case gets very good health insurance. Many people would take such a deal, and this deal would also be a slam dunk for the insurance pool!
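The expected cost of this deal is tiny:

$$\mathbb{E}[\text{payment}] = \underbrace{\frac{1}{10}}_{P(\text{wealthy})} \times \underbrace{\frac{1}{1{,}000{,}000}}_{\text{fraction paid if wealthy}} = 10^{-7}$$

of a galaxy-scale fortune in expectation, in exchange for very good health insurance in the 90% of cases where you don't end up wealthy.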
(So why doesn't this happen in human society? Well, to some extent it does. People try to get life insurance early while they are still behind the veil of ignorance. It is common in human society to prefer to make a deal prior to having some knowledge. (If people were the right type of UDT agent, then this wouldn't be a problem.) As far as why people don't enter into fully general income insurance schemes when very young, I think it is a combination of irrationality, legal issues, and adverse selection issues.)
will the surviving branches of humanity (near or distant), or other kind civilizations throughout the multiverse, have enough resources on offer to pay for a human reserve here?
Nate and I discuss this question in this other thread for reference.
Towards one extreme, you have the literal people who literally died, before they have branched much; these branches need to happen close to the last minute.
By "last minute", you mean "after I existed" right? So, e.g., if I care about genetic copies, that would be after I am born and if I care about contingent life experiences, that could be after I turned 16 or something. This seems to leave many years, maybe over a decade for most people.
I think David was confused by the "last minute" language, which is really many years, right? (I think you meant "last minute on evolutionary time scales", not literally in the last few minutes.)
That said, I'm generally super unconfident about how much a quantum bit changes things.
in full generality, as opposed to (e.g.) focusing on their brethren. (And I see no UDT/FDT justification for them to pay for even the particularly foolish and doomed aliens to be saved, and I'm not sure what you were alluding to there.)
[...]
rather than concentrating their resources on creatures more similar to themselves / that branched off radically more recently? (e.g. because the multiverse is just that full of kindness, or for some alleged UDT/FDT argument that Nate has not yet understood?)
Partial delta from me. I think the argument for directly paying for yourself (or your same species, or at least more similar civilizations) is indeed more clear and I think I was confused when I wrote that. (In that I was mostly thinking about the argument for paying for the same civilization but applying it more broadly.)
But, I think there is a version of the argument which probably does go through depending on how you set up UDT/FDT.
Imagine that you do UDT starting from your views prior to learning about x-risk, AI risk, etc and you care a lot about not dying. At that point, you were uncertain about how competent your civilization would be and you don't want your civilization to die. (I'm supposing that our version of UDT/FDT isn't logically omniscient relative to our observations which seems reasonable.) So, you'd like to enter into an insurance agreement with all the aliens in a similar epistemic state and position. So, you all agree to put at least 1/1000 of your resources on bailing out the aliens in a similar epistemic state who would have actually gone through with the agreement. Then, some of the aliens ended up being competent (sadly you were not) and thus they bail you out.
I expect this isn't the optimal version of this scheme and you might be able to make a similar insurance deal with people who aren't in the same epistemic state. (Though it's easier to reason about the identical case.) And I'm not sure exactly how this all goes through. And I'm not actually advocating for people doing this scheme, IDK if it is worth the resources.
Even with your current epistemic state on x-risk (e.g. 80-90% doom) if you cared a lot about not dying you might want to make such a deal even though you have to pay out more in the case where you surprisingly win. Thus, from this vantage point UDT would follow through with a deal.
Here is a simplified version where everything is as concrete as possible:
Suppose that there are 3 planets with evolved life with equal magical-reality-fluid (and nothing else for simplicity). For simplicity, we'll also say that these planets are in the same universe and thus the resulting civilizations will be able to causally trade with each other in the far future.
The aliens on each of these planets really don't want to die and would be willing to pay up to 1/1000 of all their future resources to avoid dying (paying these resources in cases where they avoid takeover and successfully use the resources of the future). (Perhaps this is irrational, but let's suppose this is endorsed on reflection.)
On each planet, the aliens all agree that P(takeover) for their planet is 50%. (And let's suppose it is uncorrelated between planets for simplicity.)
Let's suppose the aliens across all planets also all know this, as in, they know there are 3 planets etc.
So, the aliens would love to make a deal with each other where winning planets pay to avoid AIs killing everyone on losing planets, so that they get bailed out. So, if at least one planet avoids takeover, everyone avoids dying. (Of course, if a planet would have defected and not paid out if it avoided takeover, the other aliens also wouldn't bail it out.)
Do you buy that in this case, the aliens would like to make the deal and thus UDT from this epistemic perspective would pay out?
It seems like all the aliens are much better off with the deal from their perspective.
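To make the comparison explicit, here is a minimal toy calculation for the setup above (my own illustration; the numbers are just the ones from the example):

```python
from itertools import product

p_win = 0.5  # each planet independently avoids takeover with probability 0.5

# Enumerate all 2^3 outcomes: does each of the 3 planets avoid takeover?
outcomes = list(product([True, False], repeat=3))

def prob(outcome):
    wins = sum(outcome)
    return p_win ** wins * (1 - p_win) ** (3 - wins)

# Without the deal: planet 1's aliens avoid dying only if planet 1 itself wins.
p_survive_no_deal = sum(prob(o) for o in outcomes if o[0])

# With the deal: they avoid dying if *any* planet wins (winners pay to bail out losers),
# at a cost to winners of at most 1/1000 of their future resources.
p_survive_with_deal = sum(prob(o) for o in outcomes if any(o))

print(p_survive_no_deal)    # 0.5
print(p_survive_with_deal)  # 0.875 (= 1 - 0.5**3)
```

So each planet trades at most 1/1000 of its resources (paid only in the worlds where it wins) for going from a 50% to an 87.5% chance that its people aren't killed.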
Now, maybe your objection is that aliens would prefer to make the deal with beings more similar to them. And thus, alien species/civilizations who are actually all incompetent just die. However, all the aliens (including us) don't know whether we are the incompetent ones, so we'd like to make a diverse and broader trade/insurance-policy to avoid dying.
You said "shouldn't just do what's clearly in his best interests", I was responding to that.
No, it is in the AI's best interest to keep humans alive because this gets it more stuff.
Some more notes:
- We shouldn't expect that we get a huge win from AIs which are anthropically muggable, as discussed in Can we get more than this?, because other people will also be mugging these AIs and thus the price of marginal mugged resources will increase until it reaches marginal cost. Such AIs (which clearly have a crazy decision theory) will get their resources distributed out, but we can still trade with the other civilizations that get their resources etc. Overall, we should just focus on which positive sum trades are possible and the anthropic mugging stuff is a distraction due to competition. (Thanks to various commenters for making this more clear to me.)
- One issue with this scheme is that at the point where the AIs need to take the biggest costs to spare humans (during takeover and immediately after), they will not necessarily have super strong predictive abilities. Thus, it is unclear that a normal acausal trade setup with good prediction will work. As in, future humans/aliens might know that the AI's early actions aren't sensitive to their actions and the AI will also know this and thus a trade doesn't happen. I think minimally a binding commitment from humanity could work (if well specified), though to actually avoid dying we also need aliens/other branches to make similar commitments.
Let's conservatively say that evolved life gets around 1% of the multiverse/measure and that evolved life is willing to pay 1/million of its resources in expectation to save aliens from being killed (either "selfishly" to save their own civilization via UDT/FDT supposing that AIs are good enough predictors at the relevant points or out of a common sense altruistic case).
This would be 1/100 million which gets you a lot.
There could be other aliens who are willing to pay a huge fraction of their resources to perform rituals on the original civilization or whatever and thus these other aliens win out in the bargaining, but I'm skeptical.
Also, at least in the upload case, it's not clear that this is a rival good, as uploads can be copied for free. Of course, people might have a preference that their upload isn't used for crazy alien rituals or whatever.
(A bunch of the cost is in saving the human in the first place. Paying for uploads to eventually get run in a reasonable way should be insanely cheap, like <<10^-25 of the overall universe or something.)
This proposal doesn't depend on mugging the AI. The proposal actually gets the AI more resources in expectation due to a trade.
I agree the post is a bit confusing and unclear about this. (And the proposal under "Can we get more than this" is wrong. At a minimum, such AIs will also be mugged by everyone else too, meaning you can't get huge amounts of extra money for basically free.)
I agree that it is kind of insane for an AGI which cares about scope sensitive resources to treat sims in this way and thus we should expect a more sensible decision theory.
Introducing the option of creating lots of simulations of your adversary in the future where you win doesn’t seem like it’d change the result that Bob’s share has size O(p). So if O(p) is only enough to preserve humanity for a year instead of a billion years[1], then that’s all we get.
This seems right to me, I agree you shouldn't be able to mug AIs better than other people are able to scam AIs. (AIs that care about sims in this way might get mugged for all their stuff.)
However, I think O(p) only needs to be quite small for the deal to go through. 1/100 million of expected resources seems like it should suffice to keep humans from being killed I think?
I'd naively guess that a lot of resources get controlled by evolved life (50%), that evolved life cares a lot about not getting exterminated, and that evolved life is also often willing to pay a moderate amount to either bail out other aliens or to save themselves in a UDT sense.
Even if you think the fraction controlled by evolved life is much smaller (e.g. 1/1000), I'd guess that it's pretty cheap to avoid maximal slaughter from the AI?
(I'm not claiming that we should be willing to pay this money, just that people in practice are likely to, including aliens.)
Note that we don't want the AI to hand over the reins to humans, just to avoid killing humans when possible/cheap and we're willing to pay quite a bit for this (in the proposal, IDK if I personally think we should pay). So, it should look like a good offer for AIs who care about Tegmark IV (with a measure etc).
So, if humans execute this scheme, the AI's options should look something like:
- Don't do anything about what humans want or keeping them alive, get the entire universe.
- Keep humans alive and happy at a total cost of <1/100 million of the universe I control, and get 1/million of a universe somewhere else in expectation. So I get the entire universe, minus the <1/100 million I paid, plus the 1/million payment, for a total of approximately 1 + 1/million universes.
(I increased the pay and the margin so we don't have to argue about that pay being insufficient. Probably you can pay much less.)
(Let's suppose the swap is from two universes with equal measure for simplicity.)
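So, from the AI's perspective (with the stipulated equal-measure universes), the second option comes out ahead:

$$\underbrace{1}_{\text{its own universe}} - \underbrace{10^{-8}}_{\text{cost of sparing humans}} + \underbrace{10^{-6}}_{\text{payment from elsewhere}} \approx 1 + \frac{1}{1{,}000{,}000} > 1.$$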
Do you agree that if we assume that humans execute this scheme and that the AI is confident in this scheme, it makes sense for (sensible, Tegmark-IV-caring) AIs to pay up?
I'm sympathetic to the idea that humans shouldn't or won't execute this scheme. (After all, we're clearly losing out from the perspective of arranging matter in Tegmark IV and it's unclear whether the AI will be able to predict us sufficiently well at the point of killing/not killing to get the normal acausal trade without commitments story going.) So, let's just talk about conditioning on trying to execute the scheme for now.
The case where the AI "cares about the sensory observations given to it as weighted by both degree-of-reality and count-of-instantiations" is messier and I haven't really analyzed this case personally. I agree that in this case, simulating it could be a threat which is unfortunate. I'd prefer to put this case aside for now, though I think we should probably be able to work out a deal with AIs which are happy to be simulated as long as they get at least 1/million of the universe or whatever[1].
Again, probably you can pay much less. ↩︎
Yes, but most of the expected cost is in keeping the humans alive/happy prior to being really smart.
This cost presumably goes way down if it kills everyone physically and scans their brains, but people obviously don't want this.
Notably, David is proposing that AIs take different actions prior to making powerful sims: not kill all the humans.
I think we should have a norm that you should explain the limitations of the debunking when debunking bad arguments, particularly if there are stronger arguments that sound similar to the bad argument.
A more basic norm is that you shouldn't claim or strongly imply that your post is strong evidence against something when it just debunks some bad arguments for it, particularly when there are relatively well known better arguments.
I think Nate's post violates both of these norms. In fact, I think multiple posts about this topic from Nate and Eliezer[1] violate this norm. (Examples: the corresponding post by Nate, "But why would the AI kill us" by Nate, and "The Sun is big, but superintelligences will not spare Earth a little sunlight" by Eliezer.)
I discuss this more in this comment I made earlier today.
I'm including Eliezer because he has a similar perspective, obviously they are different people. ↩︎
Another way to put this is that posts should often discuss their limitations, particularly when debunking bad arguments that are similar to more reasonable arguments.
I think discussing limitations clearly is a reasonable norm for scientific papers that reduces the extent to which people intentionally or unintentionally get away with implying their results prove more than they do.
However, do I still understand correctly that spinning the quantum wheel should just work, and it's not one branch of human civilization that needs to simulate all the possible AIs, right?
This is my understanding.
I agree that Nate's post makes good arguments against AIs spending a high fraction of resources on being nice or on stuff we like (and that this is an important question). And it also debunks some bad arguments against small fractions. But the post really seems to be trying to argue against small fractions in general:
[Some people think maybe AI] would leave humanity a few stars/galaxies/whatever on game-theoretic grounds. [...] I'm pretty confident that this view is wrong (alas), and based on a misunderstanding of LDT. I shall now attempt to clear up that confusion.
As far as:
debunking the bad arguments is indeed qualitatively more important than engaging with the arguments in this post, because the arguments in this post do indeed not end up changing your actions, whereas the arguments Nate argued against were trying to change what people do right now
I interpreted the main effect (on people) of Nate's post as arguing for "the AI will kill everyone despite decision theory, so you shouldn't feel good about the AI situation" rather than arguing against decision theory schemes for humans getting a bunch of the lightcone. (I don't think there are many people who care about AI safety but are working on implementing crazy decision theory schemes to control the AI?)
If so, then I think we're mostly just arguing about P(misaligned AI doesn't kill us due to decision theory like stuff | misaligned AI takeover). If you agree with this, then I dislike the quoted argument. This would be similar to saying "debunking bad arguments against x-risk is more important than debunking good arguments against x-risk because bad arguments are more likely to change people's actions while the good arguments are more marginal".
Maybe I'm misunderstanding you.
To be clear, I think the exact scheme in A proposal for humanity in the future probably doesn't work as described because the exact level of payment is wrong and more minimally we'll probably be able to make a much better approach in the future.
This seemed important to explicitly call out (and it wasn't called out explicitly in the post), though I do think it is reasonable to outline a concrete baseline proposal for how this can work.
In particular, the proposal randomly picks 10 planets per simulation. I think the exact right amount of payment will depend on how many sims/predictions you run and will heavily depend on some of the caveats under Ways this hope could fail. I think you probably get decent results if the total level of payment is around 1/10 million, with returns to higher aggregate payment etc.
As far as better approaches, I expect that you'll be doing a bunch of stuff more efficient than sims and this will be part of a more general acausal trade operation among other changes.
acausal trade framework rest on the assumption that we are in a (quantum or Tegmark) multiverse
I think the argument should also go through without simulations and without the multiverse so long as you are a UDT-ish agent with a reasonable prior.
their force of course depends on the degree to which you think alignment is easy or hard.
I think even if aliens similar to humans always fail at alignment, it's plausible that this type of scheme saves some humans because more competent aliens bail us out.[1] This is even less good to depend on...
I've made this point to you before, just noting it for third parties because I think it is a pretty important counterargument. ↩︎
we have the proposal of simulating smaller Universes and less coordinated humans, which makes the AI think that the simulators might be richer and have a better chance of solving alignment
This only matters if the AIs are CDT or dumb about decision theory etc.
I also don't think making any commitment is actually needed or important except under relatively narrow assumptions.
This is a great post on the topic which I mostly agree with. Thanks for writing this so I didn't have to!
I think the argument presented in this post is a pretty strong case against "The AI will kill literally everyone with more than 80% probability", so I wish people either stopped saying that, or explained why the argument presented here is wrong.
Agreed. I hope that the corresponding people are more careful in their statements going forward.
Here are some relatively minor notes:
- If the AIs aren't CDT agents, have a more sane decision theory (e.g. EDT/UDT), and have linear-ish returns to resources, then I think these sorts of arguments should all go through as long as you can sufficiently accurately predict the AI's actions, regardless of whether it is simulated. Using prediction strategies which don't depend on simulation could address the moral concerns you raise around sentient AIs. AIs with a more sane decision theory and linear-ish returns also don't care much about anthropic capture, so you should just need to predict them; anthropic capture isn't required.
- In the sane decision theory + linear-ish returns case, it should be possible to make something work out as long as the total measure-weighted multiverse resource allocation toward bailing out existing intelligent life is greater than the total measure-weighted resources required to keep the original civilizations going. This probably implies that it should be doable for pretty cheap and that the "too many simulators" objection doesn't apply. When AIs don't have a sane decision theory or don't have something well described as linear returns, it's less clear. (Though I expect that AIs without a sane decision theory but who do care about simulations/anthropic capture will probably get mugged for all of their stuff by systems we can bargain with.)
- Making a commitment in advance to pay the AIs only seems important under pretty narrow assumptions. It does seem good to make it clear what your payment policy might be like, so that AIs have an easier time with the prediction problem and know that it is probably good to avoid killing people. (This is similar to what Buck said in another comment.)
- I think you only benefit from such a commitment if both:
- You don't care about saving other universes (or care notably less than our current universe). If you cared as much about saving other universes, then you can just spend this later.
- You're EDT (CDT doesn't care about influencing the behavior of aliens to also make commitments, UDT makes relevant commitments by default). (My guess is that you should just be UDT-ish and make a blanket commitment to "do the basically right version of decision theory/UDT, whatever that might end up being".)
- Probably the relevant scale of payment to avoid killing humans is more like 1/billion or 1/10 million rather than just a few planets. You note this, but mostly use the few-planets framing while talking. Edit: I think the cost will be much more than 1/10 million if you want to alter the AI's takeover plans, not just what it does after having nearly complete control. The cost will also be much higher if there is strong competition between AIs such that delay is extremely costly.
- Another cost of delay is that AIs might terminally temporally discount. (It's unclear how temporal discounting works when you consider simulations and the multiverse etc., though.)
- On "Are we in a simulation? What should we do?", I don't think you should care basically at all about being in a simulation if you have a sane decision theory, have linear-ish returns to resources, and you were already working on longtermist stuff. I spent a while thinking about this some time ago. It already made sense to reduce x-risk and optimize for how much control your values/similar values end up having. If you're CDT, then the sim argument should point toward being more UDT/EDT-ish in various ways though it might also cause you to take super wacky actions in the future at some point (e.g. getting anthropically mugged). If you aren't working on longtermist stuff, then being in a sim should potentially alter your actions depending on your reasoning for being not doing longtermist stuff. (For instance, the animals probably aren't sentient if we're in a sim.)
- You don't really mention the argument that AIs might spare us due to being at least a bit kind. I think this is another reason to be skeptical about >80% on literally every human dies.
- Edit: I think this post often acts as though AIs are CDT agents and otherwise have relatively dumb decision theories. (Non-CDT agents don't care about what sims are run as long as the relevant trading partners are making accurate enough predictions.) I think if AIs are responsive to simulation arguments, they won't be CDT. Further, CDT AIs which are responsive to simulation arguments plausibly get mugged for all of their stuff[1], so you mostly care about trading with the AIs that mug them as they have no influence.
- Edit: I think this post is probably confused about acausal trade in at least 1 place.
I'm not going to justify this here. ↩︎
I don't think "continuous" is self-evident or consistently used to refer to "a longer gap from human-expert level AI to very superhuman AI". For instance, in the very essay you link, Tom argues that "continuous" (and fairly predictable) doesn't imply that this gap is long!
What are the close-by arguments that are actually reasonable? Here is a list of close-by arguments (not necessarily endorsed by me!):
- On empirical updates from current systems: If current AI systems are broadly pretty easy to steer and there is good generalization of this steering, that should serve as some evidence that future more powerful AI systems will also be relatively easier to steer. This will help prevent concerns like scheming from arising in the first place or make these issues easier to remove.
- This argument holds to some extent regardless of whether current AIs are smart enough to think through and successfully execute scheming strategies. For instance, imagine we were in a world where steering current AIs was clearly extremely hard: AIs would quickly overfit and goodhart training processes, RLHF was finicky and had terrible sample efficiency, and AIs were much worse at sample efficiently updating on questions about human deontological constraints relative to questions about how to successfully accomplish other tasks. In such a world, I think we should justifiably be more worried about future systems.
- And in fact, people do argue about how hard it is to steer current systems and what this implies. For an example of a version of an argument like this, see here, though note that I disagree with various things.
- It's pretty unclear what predictions Eliezer made about the steerability of future AI systems and he should lose some credit for not making clear predictions. Further, my sense is his implied predictions don't look great. (Paul's predictions as of about 2016 seem pretty good from my understanding, though not that clearly laid out, and are consistent with his threat models.)
- On unfalsifiability: It should be possible to empirically produce evidence for or against scheming prior to it being too late. The fact that MIRI-style doom views often don't discuss predictions about experimental evidence and also don't make reasonably convincing arguments that it will be very hard to produce evidence in test beds is concerning. It's a bad sign if advocates for a view don't try hard to make it falsifiable prior to that view implying aggressive action.
- My view is that we'll probably get a moderate amount of evidence on scheming prior to catastrophe (perhaps 3x update either way), with some chance that scheming will basically be confirmed in an earlier model. And, it is in principle possible to obtain certainty either way about scheming using experiments, though this might be very tricky and a huge amount of work for various reasons.
- On empiricism: There isn't good empirical evidence for scheming and instead the case for scheming depends on dubious arguments. Conceptual arguments have a bad track record, so to estimate the probability of scheming we should mostly guess based on the most basic and simple conceptual arguments and weight more complex arguments very little. If you do this, scheming looks unlikely.
- I roughly half agree with this argument, but I'd note that you also have to discount conceptual arguments against scheming in the same way and that the basic and simple conceptual arguments seem to indicate that scheming isn't that unlikely. (I'd say around 25%.)
I find this essay interesting as a case study in discourse and argumentation norms. Particularly as a case study of issues with discourse around AI risk.
When I first skimmed this essay when it came out, I thought it was ok, but mostly uninteresting or obvious. Then, on reading the comments and looking back at the body, I thought it did some pretty bad strawmanning.
I reread the essay yesterday and now I feel quite differently. Parts (i), (ii), and (iv) which don't directly talk about AI are actually great and many of the more subtle points are pretty well executed. The connection to AI risk in part (iii) is quite bad and notably degrades the essay as a whole. I think a well-executed connection to AI risk would have been good. Part (iii) seems likely to contribute to AI risk being problematically politicized and negatively polarized (e.g. low quality dunks and animosity). Further, I think this is characteristic of problems I have with the current AI risk discourse.
In parts (i), (ii), and (iv), it is mostly clear that the Spokesperson is an exaggerated straw person who doesn't correspond to any particular side of an issue. This seems like a reasonable rhetorical move to better explain a point. However, part (iii) has big issues in how it connects the argument to AI risk. Eliezer ends up defeating a specific and weak argument against AI risk. This is an argument that actually does get made, but unfortunately, he both associates this argument with the entire view of AI risk skepticism (in the essay, the "AI-permitting faction") and he fails to explain that the debunking doesn't apply to many common arguments which sound similar but are actually reasonable. Correspondingly, the section suffers from an ethnic tension style issue: in practice, it attacks an entire view by associating it with a bad argument for that view (but of course, reversed stupidity is not intelligence). This issue is made worse because the argument Eliezer attacks is also very similar to various more reasonable arguments that can't be debunked in the same way and Eliezer doesn't clearly call this out. Thus, these more reasonable arguments are attacked in association. It seems reasonable for Eliezer to connect the discussion to AI risk directly, but I think the execution was poor.
I think my concerns are notably similar to the issues people had with The Sun is big, but superintelligences will not spare Earth a little sunlight and I've encountered similar issues in many things written by Eliezer and Nate.
How could Eliezer avoid these issues? I think when debunking or arguing against bad arguments, you should explain that you're attacking bad arguments and that there exist other commonly made arguments which are better or at least harder to debunk. It also helps to disassociate the specific bad arguments from a general cause or view as much as possible. This essay seems to associate the bad "Empiricism!" argument with AI risk skepticism nearly as much as possible. Whenever you push back against an argument which is similar or similar-sounding to other arguments, but the push back doesn't apply in the same way to those other arguments, it's useful to explicitly spell out the limitations of the push back.[1] One possible bar to try to reach is that people who disagree strongly on the topic due to other more reasonable arguments should feel happy to endorse the push back against those specific bad arguments.
There is perhaps a bit of a motte-and-bailey with this essay where Eliezer can strongly defend debunking a specific bad argument (the motte), but there is an implication that the argument also pushes against more complex and less clearly bad arguments (the bailey). (I'm not saying that Eliezer actively engages in motte-and-bailey in the essay, just that this essay probably has this property to some extent in practice.) That said there is also perhaps a motte-and-bailey for many of the arguments that Eliezer argues against where the motte is a more complex and narrow argument and the bailey is "Empiricism! is why AI risk is fake".
A thoughtful reader can recognize and avoid their own cognitive biases and correctly think through the exact implications of the arguments made here. But, I wish Eliezer did this work for the reader to reduce negative polarization.
When debunking bad arguments, it's useful to be clear about what you aren't covering or implying.
Part (iv) helps to explain the scope of the broader point, but doesn't explain the limitations in specifically the AI case. ↩︎
'like the industrial revolution but 100x faster'
Actually, much more extreme than the industrial revolution because at the end of the singularity you'll probably discover a huge fraction of all technology and be able to quickly multiply energy production by a billion.
That said, I think Paul expects a timeframe longer than 1.5 years, at least he did historically, maybe this has changed with updates over the last few years. (Perhaps more like 5-10 years.)
Long takeoff and short takeoff sound strange to me. Maybe because they are too close to long timelines and short timelines.
I think we're pretty likely to see a smooth and short takeoff with ASI prior to 2029. Now, imagine that you were making this exact prediction back in 2000. From the perspective of 2000, you would be exactly predicting a smooth and short takeoff with long timelines!
So, I think this is actually a pretty natural prediction.
For instance, you get this prediction if you think that a scalable paradigm will be found in the future and will scale up to ASI and on this scalable paradigm the delay between ASI and transformative AI will be short (either because the flop difference is small or because flop scaling will be pretty rapid at the relevant point because it is still pretty cheap, perhaps <$100 billion).
But, I also think that variable is captured by saying "smooth takeoff/long timelines?" (which is approximately what people are currently saying?
You can have smooth and short takeoff with long timelines. E.g., imagine that scaling works all the way to ASI, but requires a ton of baremetal flop (e.g. 1e34) implying longer timelines and early transformative AI requires almost as much flop (e.g. 3e33) such that these events are only 1 year apart.
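To spell out where "only 1 year apart" comes from: if effective training compute at that point grows something like 3.5x per year (an illustrative assumption on my part, not something pinned down above), then the gap is roughly

$$\frac{\log\left(10^{34} / (3\times 10^{33})\right)}{\log(3.5)} = \frac{\log(3.3)}{\log(3.5)} \approx 1\ \text{year}.$$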
Long duration/short duration takeoff
I don't love "smooth" vs "sharp" because these words don't naturally point at what seem to me to be the key concept: the duration from the first AI capable of being transformatively useful to the first system which is very qualitatively generally superhuman[1]. You can have "smooth" takeoff driven by purely scaling things up where this duration is short or nonexistent.
I also care a lot about the duration from AIs which are capable enough to 3x R&D labor to AIs which are capable enough to strictly dominate (and thus obsolete) top human scientists but which aren't necessarily much smarter. (I also care some about the duration between a bunch of different milestones, and I'm not sure that my operationalizations of the milestones are the best ones.)
Paul originally operationalized this as seeing an economic doubling over 4 years prior to a doubling within a year, but I'd prefer for now to talk about the qualitative level of capabilities rather than also entangling questions about how AI will affect the world[2].
So, I'm tempted by "long duration" vs "short duration" takeoff, though this is pretty clumsy.
Really, there are a bunch of different distinctions we care about with respect to takeoff and the progress of AI capabilities:
- As discussed above, the duration from the first transformatively useful AIs to AIs which are generally superhuman. (And between very useful AIs to top human scientist level AIs.)
- The duration from huge impacts in the world from AI (e.g. much higher GDP growth) to very superhuman AIs. This is like the above, but also folding in economic effects and other effects on the world at large which could come apart from AI capabilities even if there is a long duration takeoff in terms of capabilities.
- Software only singularity. How much the singularity is downstream of AIs working on hardware (and energy) vs just software. (Or if something well described as a singularity even happens.)
- Smoothness of AI progress vs jumpiness. As in, is progress driven by a larger number of smaller innovations and/or continuous scale-ups rather than being substantially driven by a small number of innovations and/or large phase changes that emerge with scale?
- Predictability of AI progress. Even if AI progress is smooth in the sense of the prior bullet, it may not follow a very predictable trend if the rate of innovations or scaling varies a lot.
- Tunability of AI capability. Is it possible to get a full sweep of models which continuously interpolates over a range of capabilities?[3]
Of course, these properties are quite correlated. For instance, if the relevant durations for the first bullet are very short, then I also don't expect economic impacts until AIs are much smarter. And, if the singularity requires AIs working on increasing available hardware (software only doesn't work or doesn't go very far), then you expect more economic impact and more delay.
One could think that there will be no delay between these points, though I personally think this is unlikely. ↩︎
In short timelines, with a software only intelligence explosion, and with relevant actors not intentionally slowing down, I think I don't expect huge global GDP growth (e.g. 25% annualized global GDP growth rate) prior to very superhuman AI. I'm not very confident in this, but I think both inference availability and takeoff duration point to this. ↩︎
This is a very weak property, though I think some people are skeptical of this. ↩︎
As in, I care about the long-run power of values-which-are-similar-to-my-values-on-reflection. Which includes me (on reflection) by definition, but I think probably also includes lots of other humans.
Then shouldn't such systems (which can surely recognize this argument) just take care of short term survival instrumentally? Maybe you're making a claim about irrationality being likely, or a claim that systems that care about long run benefit act in apparently myopic ways.
(Note that historically it was much harder to keep value stability/lock-in than it will be for AIs.)
I'm not going to engage in detail FYI.
I'd also add that a high fraction of these costs won't be increased if you improve cyber crime productivity (by e.g. 10%). As in, maybe a high fraction of the costs are due to the possibility of very low effort cyber crime (analogous to the cashier case).
And Fabien's original motivation was more closely related to this.
His claim is that we should expect any random evolved agent to mostly care about long-run power.
I meant that any system which mostly cares about long-run power won't be selected out. I don't really have a strong view about whether other systems that don't care about long-run power will end up persisting, especially earlier (e.g. human evolution). I was just trying to argue against a claim about what gets selected out.
My language was a bit sloppy here.
(If evolutionary pressures continue forever, then ultimately you'd expect that all systems have to act very similarly to ones that only care about long-run power, but there could be other motivations that explain this. So, at least from a behavioral perspective, I do expect that ultimately (if evolutionary pressures continue forever) you get systems which at least act like they are optimizing for long-run power. I wasn't really trying to make an argument about this though.)
Personally? I guess I would say that I mostly (98%?) care about long-run power for similar values on reflection to me. And, probably some humans are quite close to my values and many are adjacent.
Shouldn't we expect that ultimately the only thing selected for is mostly caring about long run power? Any entity that mostly cares about long run power can instrumentally take whatever actions needed to ensure that power (and should be competitive with any other entity).
Thus, I don't think terminally caring about humans (a small amount) will be selected against. Such AIs could still care about their long run power and then take the necessary actions.
However, if there are extreme competitive dynamics and no ability to coordinate, then it might become vastly more expensive to prevent environmental issues (e.g. huge changes in earth's temperature due to energy production) from killing humans. That is, saving humans (in the way they'd like to be saved) might take a bunch of time and resources (e.g. you have to build huge shelters to prevent humans from dying when the oceans are boiled in the race) and thus might be very costly in an all out race. So, an AI which only cares 1/million or 1/billion about being "kind" to humans might not be able to afford saving humans on that budget.
I'm personally pretty optimistic about coordination prior to boiling-the-oceans-scale issues killing all humans.
I don't know if we can be confident in the exact 95% result, but it is the case that o1 consistently performs at a roughly similar level on math across a variety of different benchmarks (e.g., AIME and other people have found strong performance on other math tasks which are unlikely to have been in the training corpus).
It doesn't say "during 2040", it says "before 2040".
Daniel almost surely doesn't think growth will be constant. (Presumably he has a model similar to the one here.) I assume he also thinks that by the time energy production is >10x higher, the world has generally been radically transformed by AI.