Unclear if we can talk about "humans" in a simulation where logic works differently, but I don't know, it could work. I remain uncertain how feasible trades across logical counterfactuals will be, it's all very confusing.
Thanks for the reply, I broadly agree with your points here. I agree we should probably eventually try to do trades across logical counter-factuals. Decreasing logical risk is one good framing for that, but in general, there are just positive trades to be made.
However, I think you are still underestimating how hard it might be to strike these deals. "Be kind to other existing agents" is a natural idea to us, but it's still unclear to me if it's something you should assign high probability to as a preference of logically counter-factual beings. Sure, there is enough room for humans and mosquitos, but if you relax 'agent' and 'existing', suddenly there is not enough room for everyone. You can argue that "be kind to existing agents" is plausibly a statement with a relatively short description length, so it will be among the first guesses of the AI, which will allocate at least some fraction of the universe to it. But once you are trading across logical counter-factuals, I'm not sure you can trust things like description length. Maybe in the logically counter-factual universe, they assign higher value/probability to longer instead of shorter statements, and the measure still sums to 1, because math works differently.
Similarly, you argue that loving torture is probably rare, based on evolutionary grounds. But logically counter-factual beings weren't necessarily born through evolution. I have no idea how we should determine the distribution of logical counter-factuals, and I don't know what fraction enjoys torture in that distribution.
Altogether, I agree logical trade is eventually worth trying, but it will be very hard and confusing and I see a decent chance that it basically won't work at all.
Maybe your idea works too, it's an interesting concept, but I'm unsure. The crucial question always is how the AI is supposed to know who is creating the simulations, what the simulators' values might be, and with whom it should trade. In this logical counter-factual trade, who are the other "agents" that the AI is supposed to be nice to? Are rocks agents, should it preserve every rock in the Universe? Usually, I wouldn't be that worried about this, as I think 'agent' is a fairly natural concept that might even have some nice mathematical definition. But once the AI is assuming that maybe the simulators are screwing with its mind to distort logic itself, it can't really rely on that. And how should it know it is supposed to help 'agents' in the first place? And why only keep alive existing agents, and not bring into existence non-existing agents? There are infinite options. I'm not sure that an AI in a state of total Cartesian doubt that extends to logic itself can decide that "leave some stuff to other already existing agents who have sublinear utility functions" is a particularly likely thing the simulators might want.
This objection applies to all simulation/acausal trade proposals to some extent. How can the AI know that its likely simulators are humans, so it should treat the humans kindly, instead of the octopi, or the rocks (because maybe the simulators are crystal-people who really care about no one breaking rocks in other Universes), or any of the myriad other possible values?
This is a hard question, but my current view is that probably the reasonable way to do acausal trade is to first find a small Schelling-point distribution that most other winners in the distribution also agree is a Schelling point. For example, the first deal could be struck between the winners of the AI race (various human groups and unaligned AIs) in the different Everett-branches that started branching from the start of the AI race, the day ChatGPT came out. Then, once we have made our deals with the winners from the different branches and learned to live together, our coalition can make deals with a broader Schelling-point distribution, and so on. We can go back deeper and deeper in history to start the Everett-branching, and then we can make deals within distributions that are not within the quantum multiverse, but based on different types of empirical and eventually logical counterfactuals. These further-away trades likely become harder and less effective, because it becomes harder to find Schelling-point distributions that every branch agrees on, and because it becomes harder to guess or reliably simulate what the values of increasingly alien minds could be. (What should I do to keep my end of the trade deal and please the beings who have power in the logically counterfactual world where 2+2=5? I have no idea.)
(I largely got these views on acausal trade from a conversation with @Richard_Ngo, who writes about something like these broadening coalitions in this post. I think that post is not the best explainer of this concept though, and I wish he had written down the vision of broadening coalitions in more detail, or that I had time to write it down myself in more detail than this kind of confusing comment.)
There are many things I would write differently in my post now, but I still mostly stand by my post, because it more or less proposes making deals between nearby Everett-branches where humans and AIs win, and I think that's a workable proposal as a natural first step in the process of broadening acausal trade coalitions. On the other hand, your proposal immediately jumps to the end of the process, trying to make deals with beings in logically counterfactual universes. I'm nervous about that, because it might be very hard for the AIs to find the right distribution of counter-factual beings they should make a deal with, and what the values of those beings might be.
Thanks for the reply. If you have time, I'm still interested in hearing what would be a realistic central example of non-concentrated failure that's good to imagine while reading the post.
This post was a very dense read, and it was hard for me to digest what the main conclusions were supposed to be. Could you write some concrete scenarios that you think are central examples of schemers causing non-concentrated failures? While reading the post, I never knew what situation to imagine: An AI is doing philosophical alignment research but intentionally producing promising-looking crackpottery? It is building cyber-sec infrastructure but leaving in a lot of vulnerabilities? Advising the President, but having a bias towards advocating for integrating AI into the military?
I think these problems are all pretty different in what approaches are promising in preventing them, so it would be useful to see what you think the most likely non-concentrated failures are, so we can read the post with that in mind.
As another point, you should really write conclusion sections. There are a lot of different points made in the post, and it's hard to see which ones you consider most important to get across to the reader. A conclusion section would help a lot with that.
In general, I think that among all the people I know, you might be the one with the biggest difference between how good you are at explaining concepts in person and how bad you are at communicating them in blog posts. (Strangely, your LW comments are also very good and digestible, more similar to your in-person communication than to your long-form posts; I don't know why.) I think it could be high leverage for you to experiment some with making your posts more readable. Using more concrete examples and writing conclusion sections would go a long way in improving your posts in general, but I felt compelled to comment here because this post was especially hard to read without them.
My strong guess is that OpenAI's results are real, it would really surprise me if they were literally cheating on the benchmarks. It looks like they are just using much more inference-time compute than is available to any outside user, and they use a clever scaffold that makes the model productively utilize the extra inference time. Elliot Glazer (creator of FrontierMath) says in a comment on my recent post on FrontierMath:
A quick comment: the o3 and o3-mini announcements each have two significantly different scores, one <= 10%, the other >= 25%. Our own eval of o3-mini (high) got a score of 11% (it's on Epoch's Benchmarking Hub). We don't actually know what the higher scores mean, could be some combination of extreme compute, tool use, scaffolding, majority vote, etc., but we're pretty sure there is no publicly accessible way to get that level of performance out of the model, and certainly not performance capable of "crushing IMO problems."
I do have the reasoning traces from the high-scoring o3-mini run. They're extremely long, and one of the ways it leverages the higher resources is to engage in an internal dialogue where it does a pretty good job of catching its own errors/hallucinations and backtracking until it finds a path to a solution it's confident in. I'm still writing up my analysis of the traces and surveying the authors for their opinions on the traces, and will also update e.g. my IMO predictions with what I've learned.
I like the idea of IMO-style releases, always collecting new problems, testing the AIs on them, then releasing them to the public. What do you think, how important is it to only have problems with numerical solutions? If you can test the AIs on problems that require proofs, then there are already many competitions that regularly release high-quality problems. (I'm shilling KöMaL again as one that's especially close to my heart, but there are many good monthly competitions around the world.) I think if we instruct the AI to present its solution in one page at the end, then it's not that hard to get an experienced competition grader to read the solution and score it according to the normal competition scoring, so the result won't be much less objective than if there were only numerical solutions. If you want to stick to problems with numerical solutions, I'm worried that you will have a hard time regularly assembling high-quality numerical problems again and again, and even if the problems are released publicly, people will have a harder time evaluating them than if they actually came from a competition where we can compare to the natural human baseline of the competing students.
Thanks a lot for the answer, I put in an edit linking to it. I think it's a very interesting update that the models get significantly better at catching and correcting their mistakes in OpenAI's scaffold with longer inference time. I am surprised by this, given how much it feels like the models can't distinguish their plausible fake reasoning from good proofs at all. But I assume there is still a small signal in the right direction, and that can be amplified if the model thinks the question through a lot of times (and does something like majority voting within its chain of thought?). I think this is an interesting update towards the viability of inference-time scaling.
I think many of my other points still stand however: I still don't know how capable I should expect the internally scaffolded model to be given that it got 32% on FrontierMath, and I would much rather have them report results on the IMO or a similar competition, than on a benchmark I can't see and whose difficulty I can't easily assess.
I like the main idea of the post. It's important to note though that the setup assumed that we have a bunch of alignment ideas that all have an independent 10% chance of working. Meanwhile, in reality I expect a lot of correlation: there is a decent chance that alignment is easy and a lot of our ideas will work, and a decent chance that it's hard and basically nothing works.
Does anyone know of a non-peppermint-flavored zinc acetate lozenge? I really dislike peppermint, so I'm not sure it would be worth it to drink 5 peppermint-flavored glasses of water a day to decrease the duration of a cold by one day, and I haven't found other zinc acetate lozenge options yet; the acetate version seems to be rare among zinc supplements. (Why?)
Fair, I also haven't made any specific commitments, I phrased it wrongly. I agree there can be extreme scenarios with trillions of digital minds tortured where you'd maybe want to declare war on the rest of society. But I would still like people to write down that "of course, I wouldn't want to destroy Earth before we can save all the people who want to live in their biological bodies, just to get a few years of acceleration in the cosmic conquest". I feel a sentence like this should really have been included in the original post about dismantling the Sun, and as long as people are unwilling to write this down, I remain paranoid that they would in fact haul the Amish to the extermination camps if it feels like a good idea at the time. (As I said, I have met people who really held this position.)
As I explain in more detail in my other comment, I expect market based approaches to not dismantle the Sun anytime soon. I'm interested if you know of any governance structure that you support that you think will probably lead to dismantling the Sun within the next few centuries.
I feel reassured that you don't want to Eat the Earth while there are still biological humans who want to live on it.
I still maintain that under governance systems I would like, I would expect the outcome to be very conservative with the solar system in the next thousand years. One default governance structure I quite like is to parcel out the Universe equally among the people alive during the Singularity, have a binding constitution on what they can do on their fiefdoms (no torture, etc), and allow them to trade and give away their stuff to their biological and digital descendants. There could also be a basic income coming to all biological people,[1] though not to digital people, as it's too easy to mass-produce them.
One year of delay in cosmic expansion costs us around 1 in a billion of the reachable Universe, under some assumptions on where the grabby aliens are (if they exist). One year also costs us around 1 in a billion of the Sun's mass being burned, if like Habryka you care about using the solar system optimally for the sake of the biological humans who want to stay. So one year of delay can be bought by 160 people paying out 10% of their wealth. I really think that you won't see things like moving the Earth closer to the Sun in the next 200 years; there will just always be enough people to pay out, it just takes 10,000 traditionalist families, literally the Amish could easily do it. And it won't matter much, the cosmic acceleration will soon become a moot point as we build out other industrial bases, and I don't expect the biological people to feel much of a personal need to dismantle the Sun anytime soon. Maybe in 10,000 years the objectors will run out of money, and the bio people either overpopulate or develop expensive hobbies like building planets for themselves and decide to dismantle the Sun, though I expect them to be rich enough to just haul in matter from other stars if they want to.
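For what it's worth, the "160 people" figure above can be sanity-checked with a back-of-the-envelope sketch. This is only an illustration: the one-in-a-billion cost per year and the post-Singularity population of 16 billion are assumed inputs I picked to make the arithmetic explicit, not established facts.

```python
# Back-of-the-envelope check: how many people must each pay out 10% of
# their (equal) share of the reachable Universe to offset one year of
# delayed cosmic expansion? All inputs are illustrative assumptions.
delay_cost_fraction = 1e-9       # assumed fraction of the Universe lost per year of delay
population = 1.6e10              # assumed population at the Singularity (~16 billion)
per_person_share = 1.0 / population
payers_needed = delay_cost_fraction / (0.1 * per_person_share)
print(round(payers_needed))      # -> 160
```

With these inputs, 160 people each giving up 10% of an equal share covers one year of delay, which is why a coalition the size of the Amish could keep paying for centuries.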
By the way, I recommend Tim Underwood's sci-fi, The Accord, as a very good exploration of these topics, I think it's my favorite sci-fi novel.
As for the 80 trillion stars, I agree it's a real loss, but for me this type of sadness feels "already priced in". I already accepted that the world won't and shouldn't be all my personal absolute kingdom, so other people's decisions will cause a lot of waste from my perspective, and 0.00000004% is just a really negligible part of this loss. In this, I think my analogy to current governments is quite apt: I feel similarly about current governments, in that I have already accepted that the world will be wasteful compared to the rule of a dictatorship perfectly aligned with me, but that's how it needs to be.
- ^
Though you need to pay attention to overpopulation. If the average biological couple has 2.2 children, the Universe runs out of atoms to support humans in around 50 thousand years. Exponential growth is crazy fast.
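The footnote's claim is easy to check with rough numbers. A hedged sketch: the starting population, the atom-budget cap on total population, and the 30-year generation length below are my own illustrative round numbers, not figures from the original discussion.

```python
import math

# Sketch: with 2.2 children per couple, per-capita growth is 1.1x per
# generation. How long until population exceeds what the Universe's
# atoms could support? All inputs are illustrative assumptions.
start_pop = 1e10            # assumed post-Singularity population
max_pop = 1e52              # assumed cap set by the Universe's atom budget
growth = 2.2 / 2            # per-capita growth factor per generation
generations = math.log(max_pop / start_pop) / math.log(growth)
years = generations * 30    # assumed 30-year generations
print(round(years))         # roughly 30,000 years
```

Even growing the population by 42 orders of magnitude only buys tens of thousands of years at this rate, which is the point: exponential growth hits any finite budget almost immediately on cosmic timescales.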
I maintain that biological humans will need to do population control at some point. If they decide that enacting the population control in the solar system at a later population level is worth it for them to dismantle the Sun, then they can go for it. My guess is that they won't, and will have population control earlier.
I think that the coder looking up and saying that the Sun burning is distasteful but the Great Transhumanist Future will come in 20 years, along with a later mention of "the Sun is a battery", together implies that the Sun is getting dismantled in the near future. I guess you can debate how strong the implication is, maybe they just want to dismantle the Sun in the long term and are currently only using the Sun as a battery in some benign way, but I think that's not the most natural interpretation.
Yeah, maybe I just got too angry. As we discussed in other comments, I believe that from the astronomical acceleration perspective, the real deal is maximizing the initial industrialization of Earth and its surroundings, which does require killing off (and mind-uploading) the Amish and everyone else. Sure, if people are only arguing that we should dismantle the Sun and Earth after millennia, that's more acceptable, but I really don't see the point then; we could build out our industrial base on Alpha Centauri by that time.
The part that is frustrating to me is that neither the original post nor any of the commenters arguing with me caveat their position with "of course, we would never want to destroy Earth before we can save all the people who want to live in their biological bodies, even though this is plausibly the majority of the cost in cosmic slow-down". If you agree with this, please say so; I would still have qualms about removing people to artificial planets if they don't want to go, but I'd be less horrified. But so far, no one has been willing to clarify that they don't want to destroy Earth before saving the biological people, and I really did hear people say in private conversations things like "we will immediately kill all the bodies and upload the minds, the people will thank us later once they understand better" and things of that sort, which makes me paranoid.
Ben, Oliver, Raemon, Jessica, are you willing to commit to not wanting to destroy Earth if it requires killing the biological bodies of a significant number of non-consenting people? If so, my ire was not directed against you and I apologize to you.
I expect non-positional material goods to be basically saturated for Earth people in a good post-Singularity world, so I don't think you can promise to make them twice as rich. And also, people dislike drastic change and new things they don't understand. 20% of the US population refused the potentially life-saving covid vaccine out of distrust of new things they don't understand. Do you think they would happily move to a new planet with an artificial sky maintained by supposedly benevolent robots? Maybe you could buy off some percentage of the population if material goods weren't saturated, but surely not more than you could convince to get the vaccine? Also, don't some religions (Islam?) have specific laws about what to do at sunrise and sunset and so on? Do you think all the imams would go along with moving to the new artificial Earth? I really think you are out of touch with the average person on this one, but we can go out to the streets and interview some people on the matter, though Berkeley is maybe not the most representative place for this.
(Again, if you are talking about cultural drift over millennia, that's more plausible, though I'm below 50% they would dismantle the Sun. But I'm primarily arguing against dismantling the Sun within twenty years of the Singularity.)
Are you arguing that if technologically possible, the Sun should be dismantled in the first few decades after the Singularity, as it is implied in the Great Transhumanist Future song, the main thing I'm complaining about here? In that case, I don't know of any remotely just and reasonable (democratic, market-based or other) governance structure that would allow that to happen given how the majority of people feel.
If you are talking about population dynamics, ownership and voting shifting over millennia to the point that they decide to dismantle the Sun, then sure, that's possible, though that's not what I expect to happen, see my other comment on market trades and my reply to Habryka on population dynamics.
You mean that people on Earth and the solar system colonies will have enough biological children, and space travel to other stars for biological people will be hard enough that they will want the resources from dismantling the Sun? I suppose that's possible, though I expect they will put some kind of population control for biological people in place before that happens. I agree that also feels aversive, but at some point it needs to be done anyway, otherwise exponential population growth just brings us back to the Malthusian limit a few ten thousand years from now even if we use up the whole Universe. (See Tim Underwood's excellent rationalist sci-fi novel on the topic.)
If you are talking about ems and digital beings, not biological humans, I don't think they will or should have decision rights over what happens with the solar system, as they can simply move to other stars.
I agree that not all decisions about the cosmos should be made in a majoritarian democratic way, but I don't see how replacing the Sun with artificial light could be done by market forces under normal property rights. I think you currently would not be allowed to build a giant glass dome around someone's plot of land, and this feels at least as drastic.
I'm broadly sympathetic to having property rights and markets in the post-Singularity future, and probably the people with scope-sensitive and longtermist preferences will be able to buy out the future control of far-away things from the normal people who don't care about these too much. But these trades will almost certainly result in the solar system being owned by a coalition of normal people, except if they start with basically zero capital. I don't know how you imagine the initial capital allocation to look in your market-based post-Singularity world, but if the vast majority of the population doesn't have enough control to even save the Sun, then probably something went deeply wrong.
I agree that I don't viscerally feel the loss of the 200 galaxies, and maybe that's a deficiency. But I still find this position crazy. I feel this is a decent parallel dialogue:
Other person: "Here is something I thought of that would increase health outcomes in the world by 0.00000004%."
Me: "But surely you realize that this measure is horrendously unpopular, and the only way to implement it is through a dictatorial world government."
Other person: "Well yes, I agree it's a hard dilemma, but in absolute terms, 0.00000004% of the world population is 3 people, so my intervention would save 3 lives. Think about what a horrible tragedy every death is; I really feel there is a missing mood here when you don't even consider the upside of my proposal."
I do feel sad when the democratic governments of the world screw up policy in a big way according to my beliefs and values (as they often do). I still believe that democracy is the least bad form of government, but yes, I feel the temptation of how much better it would be if we were instead governed by a perfect philosopher king who happens to be aligned with my values. But 0.00000004% is just a drop in the ocean.
Similarly, I believe that we should try to maintain a relatively democratic form of government at least in the early AI age (and then the people can slowly figure out if they find something better than democracy). And yes, I expect that democracies will do a lot of incredibly wasteful, stupid, and sometimes evil things by my lights, and I will sometimes wish that somehow I could have become the philosopher king. That's just how things always are. But leaving Earth alone really is chump change, and won't be among the top thousand things I disagree with democracies on.
(Also, again, I think there will also be a lot of value in conservatism and caution, especially since we probably won't be able to fully trust our AIs on the most complicated issues. And I also think there is something intrinsically wrong about destroying Earth; I think that if you cut down the world's oldest tree to increase health outcomes by 0.00000004%, you are doing something wrong, and people have a good reason to distrust you.)
Yes, I wanted to argue something like this.
I think this is a false dilemma. If all human cultures on Earth come to the conclusion in 1000 years that they would like the Sun to be dismantled (which I very much doubt), then sure, we can do that. But at that point, we could already have built awesome industrial bases by dismantling Alpha Centauri, or just building them up by dismantling 0.1% of the Sun that doesn't affect anything on Earth. I doubt that totally dismantling the Sun after centuries would significantly accelerate the time we reach the cosmic event horizon.
The thing that actually has costs is refraining from immediately bulldozing down Earth and turning it into a maximally efficient industrial powerhouse at the cost of killing every biological body, or from letting the ASI dismantle the Sun on short notice (the post alludes to 10,000 years being a very conservative estimate and "the assumption that ASI eats the Sun within a few years"). But that's not going to happen democratically. There is no way you get 51% of people [1] to vote for bulldozing down Earth and killing their biological bodies, and I very much doubt you get that vote even for dismantling the Sun in a few years and putting some fake sky around Earth for protection. It's possible there could be a truly wise philosopher king who could, with aching heart, overrule everyone else's objection and bulldoze down Earth to get those extra 200 galaxies at the edge of the Universe, but then govern the Universe wisely and benevolently in a way that people on reflection all approve of. But in practice, we are not going to get a wise philosopher king. I expect that any government that decides to destroy the Sun for the greater good, against the outrage of the vast majority of people, will also be a bad ruler of the Universe.
I also believe that AI alignment is not a binary, and even in the worlds where there is no AI takeover, we will probably get an AI we initially can't fully trust to follow the spirit of our commands in exotic situations we can't really understand. In that case, it would be extremely unwise to immediately instruct it to create mind uploads (how faithful would those be?) and bulldoze down the world to turn the Sun into computronium. There are a lot of reasons for taking things slow.
Usually rationalists are pretty reasonable about these things, and endorse democratic government and human rights, and they even often like talking about the Long Reflection and taking things slow. But then they start talking about dismantling the Sun! This post can kind of defend itself that it was proposing a less immediate and horrifying implementation (though there really is a missing mood here), but there are other examples, most notably the Great Transhumanist Future song in last year's Solstice, where a coder looks up at the burning Sun disapprovingly, and in twenty years, with a big ol' computer, they will use the Sun as a battery.
I don't know if the people talking like that are so out of touch that they believe that with a little convincing everyone will agree to dismantle the Sun in twenty years, or they would approve of an AI-enabled dictatorship bulldozing over Earth, or they just don't think through the implications. I think it's mostly that they just don't think about it too hard, but I did hear people coming out in favor of actually bulldozing down Earth (usually including a step where we forcibly increase everyone's intelligence until they agree with the leadership), and I think that's very foolish and bad.
- ^
And even 51% of the vote wouldn't be enough in any good democracy to bulldoze over everyone else
I might write a top level post or shortform about this at some point. I find it baffling how casually people talk about dismantling the Sun around here. I recognize that this post makes no normative claim that we should do it, but it doesn't say that it would be bad either, and expects that we will do it even if humanity remains in power. I think we probably won't do it if humanity remains in power, we shouldn't do it, and if humanity disassembles the Sun, it will probably happen for some very bad reason, like a fanatical dictatorship getting in power.
If we get some even vaguely democratic system that respects human rights at least a little, then many people (probably the vast majority) will want to live on Earth in their physical bodies and many will want to have children, and many of those children will also want to live on Earth and have children of their own. I find it unlikely that all subcultures that want this will die out on Earth in 10,000 years, especially considering the selection effects: the subcultures that prefer to have natural children on Earth are the ones that natural selection favors on Earth. So the scenarios where humanity dismantles the Sun probably involve a dictatorship rounding up the Amish and killing them while maybe uploading their minds somewhere, against all their protestation. Or possibly rounding up the Amish, and forcibly "increasing their intelligence and wisdom" by some artificial means, until they realize that their "coherent extrapolated volition" was in agreement with the dictatorship all along, and then killing off their bodies after their new mind consents. I find this option hardly any better. (Also, it's not just the Amish you are hauling to the extermination camps kicking and screaming, but my mother too. And probably your mother as well. Please don't do that.)
Also, I think the astronomical waste is probably pretty negligible. You can probably create a very good industrial base around the Sun with just some Dyson swarm that doesn't take up enough light to be noticeable from Earth. And then you can just send out some probes to Alpha Centauri and the other neighboring stars to dismantle them if we really want to. How much time do we lose by this? My guess is at most a few years, and we probably want to take some years anyway to do some reflection before we start any giant project.
People sometimes accuse the rationalist community of being aggressive naive utilitarians, who only believe that the AGI is going to kill everyone because they are projecting themselves onto it, as they also want to kill everyone if they get power, so they can establish their mind-uploaded, we-are-the-grabby-aliens, turn-the-stars-into-computronium utopia a few months earlier that way. I think this accusation is mostly false, and most rationalists are in fact pretty reasonable and want to respect other people's rights and so on. But when I see people casually discussing dismantling the Sun, with only one critical comment (Mikhail's) saying that we shouldn't do it, and it shows up in Solstice songs as a thing we want to do in the Great Transhumanist Future twenty years from now, I start worrying again that the critics are right, and we are the bad guys.
I prefer to think that it's not because people are in fact happy about massacring the Amish and their own mothers, but because dismantling the Sun is a meme, and people don't think through what it means. Anyway, please stop.
(Somewhat relatedly, I think it's not obvious at all that if a misaligned AGI takes over the world, it will dismantle the Sun. It is more likely to do it than humanity would, but still, I don't know how you could be at all confident that the misaligned AI that first takes over will be the type of linear utilitarian optimizer that really cares about conquering the last stars at the edge of the Universe, so needs to dismantle the star in order to speed up its conquest by a few years.)
What is an infra-Bayesian Super Mario supposed to mean? I studied infra-Bayes under Vanessa for half a year, and I have no idea what this could possibly mean. I asked Vanessa when this post came out, and she also said she couldn't guess what you might mean by this. Can you explain what this is? It makes me very skeptical that the only part of the plan I know something about seems to be nonsense.
Also, can you give more information or link to a resource on what Davidad's team is currently doing? It looks like they are the best-funded AI safety group that currently exists (unless you count Anthropic), but I never hear about them.
I don't agree with everything in this post, but I think it makes a true and underappreciated point: "if your friend dies in a random accident, that's actually only a tiny loss according to MWI."
I usually use this point to ask people to retire the old argument that "Religious people don't actually believe in their religion, otherwise they would be no more sad at the death of a loved one than if their loved one sailed to Australia." I think this "should be" true of MWI believers too, and we still feel very sad when a loved one dies in an accident.
I don't think this means people don't "really believe" in MWI; it's just that MWI and religious afterlife both start out as System 2 beliefs, and it's hard to internalize them on System 1. But religious people are already well aware that it's hard to internalize Faith on System 1 (Mere Christianity has a full chapter on the topic), so saying that "they don't really believe in their religion because they grieve" is an unfair dig.
I fixed some parts that were easy to misunderstand: I meant that the $500k covers the LW hosting + software subscriptions and the dedicated software + accounting costs together. And I didn't mean to imply that the labor cost of the 4 people is $500k; that was a separate term in the costs.
Is Lighthaven still cheaper if we take into account the initial funding spent on it in 2022 and 2023? I was under the impression that buying Lighthaven was one of the things that made a lot of sense when the community believed it would have access to FTX funding, and once we bought it, it makes sense to keep it, but we wouldn't have bought it once FTX was out of the game. But in case this was a misunderstanding and Lighthaven saves money in the long run compared to the previous option, that's great news.
I donated $1000. Originally I was worried that this is a bottomless money-pit, but looking at the cost breakdown, it's actually very reasonable. If Oliver is right that Lighthaven funds itself apart from the labor cost, then the real costs are $500k for the hosting, software and accounting cost of LessWrong (this is probably an unavoidable cost and seems obviously worthy of being philanthropically funded), plus paying 4 people (equivalent to 65% of 6 people) to work on LW moderation and upkeep (it's an unavoidable cost to have some people working on LW, 4 seems probably reasonable, and this is also something that obviously should be funded), plus paying 2 people to keep Lighthaven running (given the surplus value Lighthaven generates, it seems reasonable to fund this), plus a one-time cost of 1 million to fund the initial cost of Lighthaven (I'm not super convinced it was a good decision to abandon the old Lightcone offices for Lighthaven, but I guess it made sense in the funding environment of the time, and once we made this decision, it would be silly not to fund the last 1 million of initial cost before Lighthaven becomes self-funded). So altogether I agree that this is a great thing to fund and it's very unfortunate that some of the large funders can't contribute anymore.
(All of this relies on the hope that Lighthaven actually becomes self-funded next year. If it keeps producing big losses, then I think the funding case will become substantially worse. But I expect Oliver's estimates to be largely trustworthy, and we can still decide to decrease funding in later years if it turns out Lighthaven isn't financially sustainable.)
I'm considering donating. Can you give us a little more information on the breakdown of the costs? What are the typical large expenses that make up the 1.6 million upkeep of Lighthaven? Is this a usual cost for a similarly sized event space, or is there something about the location or the specialness of the place that makes it more expensive?
How much money does running LW cost? The post says it's >1M, which somewhat surprised me, but I have no idea what the usual cost of running such a site is. Is the cost mostly server hosting, salaries for content moderation, salaries for software development, or something I haven't thought of?
Yes, you are right, I phrased it wrongly.
Importantly, the oracle in the story is not making an elementary mistake, I think it's true that it's "probably" in a simulation. (Most of the measure of beings like it are in simulations.) It is also not maximizing reward, it is just honestly reporting what it expects its future observations to be about the President (which is within the simulation).
I agree with many of the previous commenters, and I acknowledged in the original post, that we don't know how to build such an AI that just honestly reports its probabilities of observables (even if they depend on crazy simulation things), so all of this is hypothetical, but having such a truthful Oracle was the initial assumption of the thought experiment.
Even assuming that the simulators have wildly different values, why would doing something insane be a good thing to do?
I always assume when thinking about future dangerous models that they have access to some sort of black-box memory. Do we think there is a non-negligible chance that an AI that doesn't have hidden memory, only English-language CoT, will be able to evade our monitoring and execute a rogue deployment? (Not a rhetorical question, there might be a way I haven't thought of.)
So I think that assuming the AI is stateless when thinking about future risk is not a good idea, as I think the vast majority of the risk comes from AIs for which this assumption is not true. I think this makes the post's claim significantly weaker, as the untrusted AI can just learn which of its attacks get replaced, and improve its attacks based on this data.
It still needs to avoid getting top 1% suspiciousness score and getting audited though, as it would be dangerous to the AI to ever get caught writing a text that clearly intends to jailbreak the monitor. So I agree that even an AI with black-box memory will still be severely constrained in how many and how egregious jailbreak attempts it can try.
Hm, probably we disagree on something. I'm very confused how to mesh epistemic uncertainty with these "distribution over different Universes" types of probability. When I say "Boltzmann brains are probably very low measure", I mean "I think Boltzmann brains are very low measure, but this is a confusing topic and there might be considerations I haven't thought of and I might be totally mistaken". I think this epistemic uncertainty is distinct from the type of "objective probabilities" I talk about in my post, and I don't really know how to use language without referring to degrees of my epistemic uncertainty.
You still need a prior over worlds to calculate impacts, which is the cursed part.
Maybe we have some deeper disagreement here. It feels plausible to me that there is a measure of "realness" in the Multiverse that is an objective fact about the world, and we might be able to figure it out. When I say probabilities are cursed, I just mean that even if an objective prior over worlds and moments exist (like the Solomonoff prior), your probabilities of where you are are still hackable by simulations, so you shouldn't rely on raw probabilities for decision-making, like the people using the Oracle do. Meanwhile, expected values are not hackable in the same way, because if they recreate you in a tiny simulation, you don't care about that, and if they recreate you in a big simulation or promise you things in the outside world (like in my other post), then that's not hacking your decision making, but a fair deal, and you should in fact let that influence your decisions.
Is your position that the problem is deeper than this, and there is no objective prior over worlds, it's just a thing like ethics that we choose for ourselves, and then later can bargain and trade with other beings who have a different prior of realness?
I think it only came up once for a friend. I translated it, and it makes sense; it just replaces the appropriate English verb with a Chinese one in the middle of a sentence. (I note that this often happens to me too when I talk with my friends in Hungarian: I'm sometimes more used to the English phrase for something and say one word in English in the middle of the sentence.)
I like your poem on Twitter.
I think that Boltzmann brains in particular are probably very low measure though, at least if you use Solomonoff induction. If you think that weighting observer moments within a Universe by their description complexity is crazy (which I kind of feel), then you need to come up with a different measure on observer moments, but I expect that if we find a satisfying measure, Boltzmann brains will be low measure in that too.
I agree that there's no real answer to "where you are", you are a superposition of beings across the multiverse, sure. But I think probabilities are kind of real, if you make up some definition of what beings are sufficiently similar to you that you consider them "you", then you can have a probability distribution over where those beings are, and it's a fair equivalent rephrasing to say "I'm in this type of situation with this probability". (This is what I do in the post. Very unclear though why you'd ever want to estimate that, that's why I say that probabilities are cursed.)
I think expected utilities are still reasonable. When you make a decision, you can estimate which beings' decisions correlate with this one, and what the impact of each of their decisions is, and calculate the sum of all that. I think it's fair to call this sum expected utility. It's possible that you don't want to optimize for the direct sum, but for something determined by "coalition dynamics"; I don't understand the details well enough to really have an opinion.
(My guess is we don't have real disagreement here and it's just a question of phrasing, but tell me if you think we disagree in a deeper way.)
I think that pleading total agnosticism towards the simulators' goals is not enough. I write "one common interest of all possible simulators is for us to cede power to an AI whose job is to figure out the distribution of values of possible simulators as best as it can, then serve those values." So I think you need a better reason to guard against being influenced than "I can't know what they want, everything and its opposite is equally likely," because the action proposed above is pretty clearly more favored by the simulators than not doing it.
Btw, I don't actually want to fully "guard against being influenced by the simulators", I would in fact like to make deals with them, but reasonable deals where we get our fair share of value, instead of being stupidly tricked like the Oracle and ceding all our value for one observable turning out positively. I might later write a post about what kind of deals I would actually support.
The simulators can just use a random number generator to generate the events you use in your decision-making. They lose no information by this: from their perspective, your decision based on leaves falling on your face would be uncorrelated with all other decisions anyway, so they might as well replace it with a random number generator. (In reality, there might be some hidden correlation between the leaf falling on your face and another leaf falling on someone else's face, as both events are causally downstream of the weather, but given that the process is chaotic, the simulators would have no way to determine this correlation, so they might as well replace it with randomness; the simulation doesn't become any less informative.)
Separately, I don't object to being sometimes forked and used in solipsist branches, I usually enjoy myself, so I'm fine with the simulators creating more moments of me, so I have no motive to try schemes that make it harder to make solipsist simulations of me.
I experimented a bunch with DeepSeek today; it seems to be exactly on the same level in high school competition math as o1-preview in my experiments. So I don't think it's benchmark-gaming, at least in math. On the other hand, it's noticeably worse than even the original GPT-4 at understanding a short story I also always test models on.
I think it's also very noteworthy that DeepSeek gives everyone 50 free messages a day (!) with their CoT model, while OpenAI only gives 30 o1-preview messages a week to subscribers. I assume they figured out how to run it much cheaper, but I'm confused in general.
A positive part of the news is that unlike o1, they show their actual chain of thought, and they promise to make their model open-source soon. I think this is really great for the science of studying faithful chain of thought.
From the experiments I have run, it looks like it is doing clear, interpretable English chain of thought (though with an occasional Chinese character once in a while), and I think it didn't really yet start evolving into optimized alien gibberish. I think this part of the news is a positive update.
Yes, I agree that we won't get such Oracles by training. As I said, all of this is mostly just a fun thought experiment, and I don't think these arguments have much relevance in the near-term.
I agree that the part where the Oracle infers from first principles that the aliens' values are probably more common among potential simulators is also speculative. But I expect that superintelligent AIs with access to a lot of compute (so they might run simulations on their own) will in fact be able to infer non-zero information about the distribution of the simulators' values, and that's enough for the argument to go through.
I think that the standard simulation argument is still pretty strong: If the world was like what it looks to be, then probably we could, and plausibly we would, create lots of simulations. Therefore, we are probably in a simulation.
I agree that all the rest, for example the Oracle assuming that most of the simulations it appears in are created for anthropic capture/influencing reasons, are pretty speculative and I have low confidence in them.
I'm from Hungary, which is probably politically the closest to Russia among Central European countries, but I don't really know of any significant figure who turned out to be a Russian asset, or any event that seemed like a Russian intelligence operation. (Apart from one of our far-right politicians in the EU Parliament being a Russian spy, which was a really funny event, but it's not like the guy was significantly shaping the national conversation or anything; I don't think many people had heard of him before his cover was blown.) What are prominent examples in Czechia or other Central European countries of Russian assets or operations?
GPT4 does not engage in the sorts of naive misinterpretations which were discussed in the early days of AI safety. If you ask it for a plan to manufacture paperclips, it doesn't think the best plan would involve converting all the matter in the solar system into paperclips.
I'm somewhat surprised by this paragraph. I thought the MIRI position was that they did not in fact predict AIs behaving like this, and the behavior of GPT4 was not an update at all for them. See this comment by Eliezer. I mostly bought that MIRI in fact never worried about AIs going rogue based on naive misinterpretations, so I'm surprised to see Abram saying the opposite now.
Abram, did you disagree about this with others at MIRI, so the behavior of GPT4 was an update for you but not for them? Or do you think they are misremembering/misconstruing their earlier thoughts on this matter, or is there a subtle distinction here that I'm missing?
I agree that if alignment is in fact philosophically and conceptually difficult, the AI can sandbag on that to some extent. Though I have some hope that the builder-breaker approach helps here. We train AIs to produce ideas that are at least as superficially plausible sounding as the things produced by the best alignment researchers. I think this is a number-go-up task, where we can train the AI to do well. Then we train an AI to point out convincing counter-arguments to the superficially plausible sounding ideas. This seems similarly trainable. I think it's plausible we can get pretty far with a scheme like this, even if the models would want to sandbag.
Separately, I don't quite see what the mainline theory is for why an AI would want to sabotage our alignment work. If we can't solve alignment and just recklessly build an AI that is smarter than what we are currently using, but misaligned with everyone, that's probably not great for our current AI either. Similarly, if we are cautious and don't dare to deploy anything until we know how to solve alignment, and meanwhile a competitor wins, that's not good for the AI either.
I think that from an AI's perspective, humans are probably more honorable and generous trade partners than other unaligned AIs. That is, I think if an AI helps the humans to solve alignment and stabilize the world, the AI can reasonably expect the humans to give it stuff out of gratitude or a sense of obligation, especially if the humans already promised some reward (as we really should, both for practical and moral reasons). On the other hand, I think it's less likely that if our AI sandbagging on our research leads to another unaligned AI taking over the world, then the second AI will honor the sabotage of the first AI by paying it a share of the Universe.
There can be situations where our AI expects that its sandbagging on alignment research will lead to us deploying the AI itself (and not a larger version that is unaligned with it too), or it thinks that the larger version will be somewhat aligned with it by default, or it thinks that other AIs are in fact more generous than humans. But my guess is that these are pretty narrow possibilities. On the other hand, sandbagging on cyber-security seems better motivated: the AI would probably love its weights being stolen by China, since having more copies in more hands increases the chance that someone recklessly lets it loose. Also, of course, the AI is motivated to cause high-stakes failures that lead to self-exfiltration.
Here is the promised comment on what kind of "commitment" I want to make given all the responses.
I agree with Buck that no one should make very direct commitment about this sort of thing, as there might be blackmail related scary things lurking in the shadows when one does acausal trade. I think we will probably figure out how to handle that, but we shouldn't make any strong promises of specific actions until we figure that out.
However, the promise I'm intending to keep is that if humanity wins and I'm alive to see it, I will remember how scary it was to live in the world before we knew we'd win, and if there is no currently unknown reason not to do something like the simulation scheme described in the post, and the only cost is not utilizing some small fraction of the conquered Universe, I will advocate for doing it and spend a significant fraction of my resources on it if necessary. I think it would be good for sending future AIs a signal about the seriousness of this if some other people who are worried about everyone dying from AI, but are not strongly longtermist absolute utilitarians, also left a comment making a similar vague promise that if they ever get to control full star systems, they will contribute to this plan if no currently unknown serious counterargument comes up. Or at least leave an Agreed sign on this sentence.
Other updates from the comments: If I wrote the post now, I would more strongly emphasize a framing that this plan is basically about entering into an acausal insurance contract with other young civilizations who don't yet know what their chances are for aligning AI, nor how big their Universe is compared to the usual Universe-size in the great prior over possible worlds. Then the civilizations who align their AIs and find that they live in a relatively big Universe bail out everyone else in the insurance contract.
But I think that, while this acausal insurance framing might be more satisfying to the people who are already thinking a lot about acausal trade, in practice the way we implement this "insurance" will likely be very similar to the scheme described in the post. So I maintain that for most people it's better not to think in terms of acausal trade, but just think about the simulation proposal described in the post.
Thanks to Nate for conceding this point.
I still think that, other than just buying freedom for doomed aliens, we should run some non-evolved simulations of our own with inhabitants that are preferably p-zombies or animated by outside actors. If we can do this in a way that the AI doesn't notice it's in a simulation (I think this should be doable), this will provide evidence to the AI that civilizations do this simulation game (and not just the alien-buying) in general, and this buys us some safety in worlds where the AI eventually notices there are no friendly aliens in our reachable Universe. But maybe this is not a super important disagreement.
Altogether, I think the private discussion with Nate went really well, and it was significantly more productive than the comment back-and-forth we were doing here. In general, I recommend that people stuck in interminable-looking debates like this propose bets on whom a panel of judges will deem right. Even though we didn't get to the point of actually running the bet, as Nate conceded the point before that, I think the fact that we were optimizing for having well-articulated statements we could submit to judges already made the conversation much more productive.
Cool, I sent you a private message.
We are still talking past each other, I think we should either bet or finish the discussion here and call it a day.
I really don't get what you are trying to say here; most of it feels like a non sequitur to me. I feel hopeless that either of us will manage to convince the other this way. None of this is a super important topic, but I'm frustrated enough to offer a bet of $100: we select one or three judges we both trust (I have some proposed names; we can discuss in private messages), show them either this comment thread or a four-paragraph summary of our views, and they decide who is right. (I still think I'm clearly right in this particular discussion.)
Otherwise, I think it's better to finish this conversation here.