Posts
Comments
As far as what a plan to automate AI safety would work out in practice, assuming a relatively strong version of the concept is in this post below, and there will be another post that comes out by the same author talking more about the big risks discussed in the comments below:
https://www.lesswrong.com/posts/TTFsKxQThrqgWeXYJ/how-might-we-safely-pass-the-buck-to-ai
In general, I think the crux is that in most timelines (at a lower bound, 65-70%) that have AGI developed relatively soon (so timelines from 2030-2045, roughly), and the alignment problem isn't solvable by default/it's at least non-trivially tricky to solved, conditioning on alignment success looks more like "we've successfully figured out how to prepare for AI automation of everything, and we managed to use alignment and control techniques well enough that we can safely pass most of the effort to AI", rather than other end states like "humans are deeply enhanced" or "lawmakers actually coordinated to pause AI, and are actually giving funding to alignment organizations such that we can make AI safe."
How much probability do you assign to automating AI safety not working in time? Because I believe the preparing to automate AI safety work is probably the highest-value in pure ability to reduce X-risk probability, assuming it does work, so I assign much higher EV to automating AI safety, relative to other approaches.
More so that I'm imagining they might not even have heard of the argument, and it's helpful to note that people like Terence Tao, Timothy Gowers and more are all excellent people at their chosen fields, but most people that have a big impact on the world don't go into AI alignment.
Remember, superintelligence is not omniscience.
So I don't expect them to be self motivated to work on this specific problem without at least a little persuasion.
I'd expect a few superintelligent adults to join alignment efforts, but nowhere near thousands or tens of thousands, and I'd upper bound it at 300-500 new researchers at most in 15-25 years.
Much less impactful than automating AI safety.
The issue in this discourse, to me, is comparing this with AGI misalignment. It's conceptually related in some interesting ways, but in practical terms they're just extremely quantitatively different. And, naturally, I care about this specific non-comparability being clear because it says whether to do human intelligence enhancement; and in fact many people cite this as a reason to not do human IE.
Re human vs AGI misalignment, I'd say this is true, in that human misalignments don't threaten the human species, or even billions of people, whereas AI does, so in that regard I admit human misalignment is less impactful than AGI misalignment.
Of course, if we succeed at creating aligned AI, than human misalignments matter much, much more.
(Rest of the comment is a fun tangentially connected scenario, but ultimately is a hypothetical that doesn't matter that much for AI alignment.)
Ok... so I think I understand and agree with you here. (Though plausibly we'd still have significant disagreement; e.g. I think it would be feasible to bring even Hitler back and firmly away from the death fever if he spent, IDK, a few years or something with a very skilled listener / psychic helper.)
At the very least, that would require him to not be in control of Germany by that point, and IMO most value change histories rely on changing their values in the child-teen years, because that's when their sensitivity to data is maximal. After that, the plasticity/sensitivity of values goes way down when you are an adult, and changing values is much, much harder.
I'd argue quite a lot, though independent evidence could cause me to update me here, and a key reason for that is there is a plausible argument that a lot of the evidence for cultural learning/cultural practices written in the 1940s-1960s were fundamentally laundered to hide evidence of secret practices.
More generally, I was worried that such an obviously false claim implied a lot more hidden to me wrong claims that I couldn't test, so after spot-checking I didn't want to invest more time into an expensive search process.
You mean to say that the human body was virtually “finished evolving” 200,000 years ago, thereby laying the groundwork for cultural optimization which took over form that point? Henrich’s thesis of gene-culture coevolution contrasts with this view and I find it to be much more likely to be true. For example, the former thesis posits that humans lost a massive amount of muscle strength (relative to, say, chimpanzees) over many generations and only once that process had been virtually “completed”, started to compensate by throwing rocks or making spears when hunting other animals, requiring much less muscle strength than direct engagement. This begs the question, how did our ancestors survive in the time when muscle strength had already significantly decreased, but tool usage did not exist yet? Henrich’s thesis answers this by saying that such a time did not exist; throwing rocks came first, which provided the evolutionary incentive for our ancestors to expend less energy on growing muscles (since throwing rocks suffices for survival and requires less muscle strength). The subsequent invention of spears provided further incentive for muscles to grow even weaker.
There are many more examples to make that are like the one above. Perhaps the most important one is that as the amount of culture grows (also including things like rudimentary language and music), a larger brain has an advantage because it can learn more and more quickly (as also evidenced by the LLM scaling laws). Without culture, this evolutionary incentive for larger brains is much weaker.
In a sense, evolution never stops, but yes the capacity to make tools was way later than the physical optimizations that used the tools.
More generally, an even earlier force for bigger brains probably comes from both cooking food and social modeling in groups, but yes language and culture do help at the margin.
What do you make of feral children like Genie? While there are not many counterfactuals to cultural learning—probably mostly because depriving children of cultural learning is considered highly immoral—feral children do provide strong evidence that humans that are deprived of cultural learning do not come close to being functional adults. Additionally, it seems obvious that people who do not receive certain training, e.g., those who do not learn math or who do not learn carpentry, generally have low capability in that domain.
I actually think this is reasonably strong evidence for some form of cultural learning, thanks.
I'd probably bump that down to O(90%) at max, and this could get worse (I'm downranking based on the number of psychopaths/sociopaths and narcissists that exist).
I'd actually maybe agree with this, though with the caveat that there's a real possibility you will need a lot more selection/firepower as a human gets smarter, because you lack the ability to technically control humans in the way you can control AIs.
I'm saying that (waves hands vigorously) 99% of people are beneficent or "neutral" (like, maybe not helpful / generous / proactively kind, but not actively harmful, even given the choice) in both intention and in action. That type of neutral already counts as in a totally different league of being aligned compared to AGI.
I think this is ultimately the crux, at least relative to my values, I'd expect at least 20% in America to support active efforts to harm me or my allies/people I'm altruistic to, and do so fairly gleefully (an underrated example here is voting for people that will bring mass harm to groups they hate, and hope that certain groups go extinct).
Ok, yes, conflict between large groups is something to be worried about, though I don't much see the connection with germline engineering. I thought we were talking about, like, some liberal/techie/weirdo people have some really really smart kids, and then those kids are somehow a threat to the future of humanity that's comparable to a fast unbounded recursive self-improvement AGI foom.
Okay, the connection was to point out that lots of humans are not in fact aligned with each other, and I don't particularly think superbabies are a threat to the future of humanity that is comparable to AGI, so my point was more so that the alignment problem is not naturally solved in human-to human interactions.
I'd flag here that the 99% number seems very easy to falsify, solely based on the 20th century experience of both the 2 great wars, as well as the genocides/civil wars of the 20th century, and it's quite often that one human group is vastly unaligned to another human group, causing mass strife and chaos.
I will go further, and say the human universals are nowhere near strong enough to assume that alignment of much more powerful people will automatically/likely happen, or that not aligning them produces benevolent results, and the reason for this is humans are already misaligned, in many cases very severely to each other, so allowing human augmentation without institutional reform makes things a lot worse by default.
It is better to solve the AI alignment problem first, then have a legal structure created by AIs that can make human genetic editing safe, rather than try to solve the human alignment problem:
https://www.lesswrong.com/posts/DfrSZaf3JC8vJdbZL/how-to-make-superbabies#jgDtAPXwSucQhPBwf
Usually, the other interventions I compare it to are preparing for AI automation of AI safety by doing preliminary work to control/align those AIs, or AI governance interventions that are hopefully stable for a very long time, and at least for the automation of AI safety, I assign much higher magnitudes of success, conditioning on success, like multiple OOMs combined with moderately better cost effectiveness and quite larger chances of success than the genetic engineering approach.
To be clear, the key variable is conditional on success, the magnitude of that success is very, very high in a way that no other proposal really has, such that even with quite a lot lower probabilities for success than me, I'd still consider preparing for AI automation of AI safety and doing preliminary work such that we can trust/control these AIs to be the highest value alignment target by a mile.
How do LLMs and the scaling laws make you update in this way? They make me update in the opposite direction. For example, I also believe that the human body is optimized for tool use and scaling, precisely because of the gene-culture coevolution that Henrich describes. Without culture, this optimization would not have occurred. Our bodies are cultural artifacts.
I think the crux is that the genetic changes come first, then the cultural changes come after, and the reason is the modern human body plan happened around 200,000 BC, and arguably could be drawn even earlier, whereas the important optimizations that culture introduced came over 100,000 years at the latest.
That said, fair enough on calling me out on the scaling laws point, I was indeed not able to justify my update.
As far as why I'm generally skeptical of cultural learning mattering nearly as much as Heinrich, it's because I believe a lot of the evidence for this fundamentally doesn't exist for human cultures, and a lot of work in the 1940s-1960s around lots of subjects that were relevant to cultural learning were classified and laundered to produce simulated results:
If somehow international cooperation gives us a pause on going full AGI or at least no ASI - what then?
Just hope it never happens, like nuke wars?
The answer is no, but this might have to happen under certain circumstances.
The usual case (assuming that the government bans or restricts compute resources, and/or limits algorithimic research), is to use this time to either let the government fund AI alignment research, or go for a direct project to make AIs that are safe to automate AI safety research, and given that we don't have to race against other countries, we could afford far more safety taxes than usual to make AI safe.
I think the key crux is I don't particularly think genetic editing is the cheapest/easiest way to affect marginal chances of doom, because of time lag plus needing to reorient the entire political system, which is not cheap, and the cheapest/easiest strategy to me to affect doom probabilities is to do preparatory AI alignment/control schemes such that we can safely hand off the bulk of the alignment work to the AIs, which then solve the alignment problem fully.
The problem is that from a relative perspective, human augmentation is probably more doomed than AI safety automation, which in turn is more doomed than AI governance interventions, though I may have gotten the relative ordering of AI safety automation and I think the crux is I do not believe in the timeline for human genetic augmentation in adults being only 5 years, even given a well-funded effort, and I'd expect it to take 15-20 years, minimum for large increases in adult intelligence, which basically rules out the approach given the very likely timelines to advanced AI either killing us all or being aligned to someone.
For somewhat similar reasons, I expect AI to use formalistic/proof theoretic languages like Lean and F* more for their coding, since you get a very large amount of feedback, and the reward model is essentially unhackable like a go simulator.
In essence, it will shift towards more provable/strongly typed languages to get more feedback.
Perhaps you're making a Malthusian prediction based limited resources?
Kind of.
More specifically, assuming we can't cheat and circumvent the problem of the laws of thermodynamics/speed of light (which I put 50% probability on at this point, I have been convinced by Adam Brown that this might be cheatable in the far future), then at a global scale, energy, which is the foundation of all utility, is conserved globally, meaning that a global scale, everything must be a 0-sum game, because if it wasn't, you could use this as a way to break the first law of thermodynamics.
This also means that the engine of progress which made basically everything positive sum vanishes, and while I expect more positive sum games than the pre-industrial times, due to being able to cooperate better, the universally positive-sum era of the last few centuries would have to end.
Adam Brown on changing the laws of physics below:
TBH, I don't particularly think it's one of the most important projects right now, due to several issues:
-
There's no reason to assume that we could motivate them any better than what we already do, unless we are in the business of changing personality, which carries it's own problems, or we are willing to use it on a massive scale, which simply cannot be done currently.
-
We are running out of time. The likely upper bound for AI that will automate basically everything is 15-20 years from Rafael Harth and Cole Wyeth, and unfortunately there's a real possibility that the powerful AI comes in 5-10 years, if we make plausible assumptions about scaling continuing to work, and given that there's no real way to transfer any breakthroughs to the somatic side of gene editing, it will be irrelevant by the time AI comes.
Thus, human intelligence augmentation is quite poor from a reducing X-risk perspective.
Because they might consider that other problems are more worth their time, since smartness changes change their values little.
And maybe they believe that AI alignment isn't impactful for technical/epistemic reasons.
I'm confused/surprised I need to make this point, because I don't automatically think they will be persuaded that AI alignment is a big problem they will need to work on, and some effort will likely still need to be required.
That requires either massive personality changes to make them more persuadable, or massive willingness of people to put genetic changes in their germline, and I don't expect either of these to happen before AI automates everything and either takes over, leaving us extinct or humans/other AI control/align AIs successfully.
(A key reason for this is that Genesmith admitted that the breakthroughs in germline engineering can't transfer to the somatic side, and that means we'd have to wait 25-30 years in order for it to grow, minimum given that society won't maximally favor the genetically lucky, and that's way beyond most plausible AI timelines at this point)
I'm more optimistic, in that the upper bound could be as high as 50-60%, but yeah the people in power are unfortunately not part of this, and I'd only trust 25-30% of the population in practice if they had an ASI slave.
I think this is my disagreement, in that I don't think that most humans are in fact nice/aligned to each other by default, and the reason why this doesn't lead to catastrophe broadly speaking is a combo of being able to rely on institutions/mechanism design that means even if people are misaligned, you can still get people well off under certain assumptions (capitalism and the rule of law being one such example), combined with the inequalities not being so great that individual humans can found their own societies, except in special cases.
Even here, I'd argue that human autocracies are very often misaligned to their citizens values very severely.
To be clear about what I'm not claiming, I'm not saying that alignment is worthless, or alignment always or very often fails, because it's consistent with a world where >50-60% of alignment attempts are successful.
This means I'm generally much more scared of very outlier smart humans, for example a +7-12 SD human that was in power of a large group of citizens, assuming no other crippling disabilities unless they are very pro-social/aligned to their citizenry.
I'm not claiming that alignment will not work, or even that will very often not work, but rather that the chance of failure is real and the stakes are quite high long-term.
(And that's not even addressing how you could get super-smart people to work on the alignment problem).
This is probably a crux; I think the brain does have tons of specialized architecture for it, and if I didn't believe that, I probably wouldn't think thought assessment was as difficult.
I think this is also a crux.
IMO, I think the brain is mostly cortically uniform, ala Steven Byrnes, and in particular I think that the specialized architecture for thought assessment was pretty minimal.
The big driver of human success is basically something like the bitter lesson applied to biological brains, combined with humans being very well optimized for tool use, such that they can over time develop technology that is used to dominate the world (it's also helpful that humans can cooperate reasonably below 100 people, which is more than almost all social groups, though I've become much more convinced that cultural learning is way less powerful than Henrich et al have said).
(There are papers which show that humans are better at scaling neurons than basically everyone else, but I can't find them right now).
The other problem is in assuming democracy/the will of the people will ever be stable in the AI era, due to the combo of being able to brainwash people with neurotech, combined with being able to remove most of the traditional threats to regimes through aligned/technically safe AIs in the militaries.
(Though here, I'm using the word aligned/technically safe in the sense of it following orders/being aligned to the valueset of a single person, or at most a small group, not humanity as a whole.)
I want to flag that I've become much less convinced that the secret of our success/cultural learning is nearly as broad or powerful as an explanation as I used to believe, and I now believe that most of the success of humans comes more so down to the human body being very well optimized for tool use, combined with the bitter lesson for biological brains, and a way to actually cool them down:
This is because Heinrich mostly faked his evidence:
https://www.lesswrong.com/posts/m8ZLeiAFrSAGLEvX8/#MFyWDjh4FomyLnXDk
https://www.lesswrong.com/posts/m8ZLeiAFrSAGLEvX8/#q7bXpZb8JzHkXjLm7
While he is probably directionally correct, I think it should be way weaker than people think, and I now believe that something like the human body being optimized for tool use/the scaling hypothesis applied to evolution is my broad explanation of how humans have rocketed away from animals, and I now think cultural learning is way, way less powerful in humans than people thought, and this is partially updating from LLMs, which have succeeded but so far have nowhere near the capabilities that humans have (yet), and there's an argument to be made that cultural learning was the majority of AI capabilities before o1/o3.
EnigmaEval from ScaleAI and Dan Hendrycks, a collection of long, complex reasoning challenges, where AIs score under 10% on the easy problems and 0% on the hard problems.
I do admit, it’s not obvious developing this is helping?
Holly Elmore: I can honestly see no AI Safety benefit to this at this point in time. Once, ppl believed eval results would shock lawmakers into action or give Safety credibility w/o building societal consensus, but, I repeat, THERE IS NO SCIENTIFIC RESULT THAT WILL DO THE ADVOCACY WORK FOR US.
People simply know too little about frontier AI and there is simply too little precedent for AI risks in our laws and society for scientific findings in this area to speak for themselves. They have to come with recommendations and policies and enforcement attached.
Jim Babcock: Evals aren’t just for advocacy. They’re also for experts to use for situational awareness.
So I told him it sounded like he was just feeding evals to capabilities labs and he started crying.
I’m becoming increasingly skeptical of benchmarks like this as net useful things, because I despair that we can use them for useful situational awareness. The problem is: They don’t convince policymakers. At all. We’re learning that. So there’s no if-then action plan here. There’s no way to convince people that success on this eval should cause them to react.
I think the main value add at this point is to lower bound the capabilities for when AI safety can be automated and to upper bound capabilities for AI safety cases (very broadly), but yes the governance value of evals has declined, and there's no plausible way for evals to help with governance in short timelines.
This is more broadly downstream of how governance has become mostly worthless in short timelines, due to basically all the major powers showing e/acc tendencies towards AI (though notably without endorsing human extinction), so technical solutions to alignment are more valuable than they once were.
The only governance is what the labs choose.
In particular, we cannot assume any level of international coordination going forward, and must assume that the post-World War II order that held up international cooperation to prevent x-risks as an anomaly, not something enduring.
Re vibe shifts:
I especially appreciate Wildeford’s #1 point, that the vibes have shifted and will shift again. How many major ‘vibe shifts’ have there been in AI? Seems like at least ChatGPT, GPT-4, CAIS statement, o1 and now DeepSeek with a side of Trump, or maybe it’s the other way around. You could also consider several others.
Whereas politics has admittedly only had ‘vibe shifts’ in, let’s say, 2020, 2021 and then in 2024. So that’s only 3 of the last 5 years (how many happened in 2020-21 overall is an interesting debate). But even with only 3 that still seems like a lot, and history is accelerating rapidly. None of the three even involved AI.
It would not surprise me if the current vibe in AI is different as soon as two months from now even if essentially nothing not already announced happens, where we spent a few days on Grok 3, then OpenAI dropped the full o3 and GPT-4.5, and a lot more people both get excited and also start actually worrying about their terms of employment.
I do not expect this for the next 4 years, and conditional on short timelines (which is 5 years away here), the vibe shift will be too late to matter, IMO.
A big reason for this is I expect politics and the news to focus on the least relevant stuff by a mile, and bring any AI stuff way down until they are replaced, but at that point the AI takeoff is likely in full swing, so we are either doomed to be extinct we survive in a new order.
To be frank though, that basically guts any chance of institutional capability, so it's probably going to be at best a hollow organization.
You do not randomly fire large swathes of people and expect most institutional knowledge to survive, because there's no reason for anyone to work there rather than be in the private sector.
IMO, the most important ideas from game theory are these:
- Cooperation doesn't happen automatically, and you need to remember that any plan that assumes coordination must figure out ways to either make their values prefer coordination, or their incentives must be changed.
More generally, one of the most powerful things in game theory/economics is looking at the incentives/games can get you a lot of information on why a system has evolved the way it does.
- Except when it isn't that informative, due to the folk theorems of game theory. For the Iterated Prisoners Dillemma, which is actually quite common (as is stag hunt/Schelling problems), it's known that under a number of conditions that could plausibly happen, and at any rate are likely to be more probable in the future, any outcome that is possible and individually rational can happen, which means that game theory imposes no constraints on it's own:
https://en.wikipedia.org/wiki/Folk_theorem_(game_theory)
https://www.lesswrong.com/posts/FE4R38FQtnTNzeGoz/explaining-hell-is-game-theory-folk-theorems
As a special case, all of Elinor Ostrom's observations around cooperation IRL are mathematically provable to exist, so it can happen, and indeed it did.
- If you want cooperation, peace or a lot of other flourishing goals, it is very helpful to shift from zero-sum games to positive sum games, and there's a good argument to be made that the broad reason things are being more peaceful today is the fundamental shift from a broadly zero-sum world to a broadly positive sum world, primarily because of economic growth, but due to the laws of physics, it's probable that within the next several centuries, zero sum games will come back quite a bit, and while I expect more positive sum games than the era from 200,000 BC-1500 AD, I do expect a lot more zero sum games than the are 1500 AD-2050 AD era.
Okay, maybe he clarified that there was no thermodynamics-like blocker to getting a plan in principle to align AI, but I didn't interpret Eliezer's clarification to rule that out immediately, so I wanted to rule that interpretation out.
I didn't see the interpretation as false when I wrote it, because I believed he only ruled out a decisive blocker to getting behaviors you don't know how to verify.
For a model of how terrible outcomes can happen with aligned AI, the intelligence curse is excellent here, because it focuses on the incentives once people can be replaced with AI:
I think the runaway self-improvement problem is IMO vastly outweighed by other problems on aligning humans, like the fact that any control technique on AI would be illegal because of it being essentially equivalent to brainwashing, such that I consider AIs much more alignable than humans, and I think the human intelligence augmentation path is way more risky and fraught than people think for alignment purposes.
My claim here is that there is no decisive blocker for plans that get getting a safe, highly capable AIs that is used for automated AI safety research, in the way that thermodynamics blocks you from getting a perpetual motion machine (under the assumption that the universe is time symmetric, that is physics stays the same no matter when an experiment happens), which has been well tested, and the proposed blockers do not have anywhere close to the amount of evidence thermodynamics does such that we can safely discard any plan that doesn't meet a prerequisite.
For this:
What's the trick? My basic guess, when I see some very long complicated paper that doesn't explain the key problem and key solution up front, is that you've done the equivalent of an inventor building a sufficiently complicated perpetual motion machine that their mental model of it no longer tracks how conservation laws apply. (As opposed to the simpler error of their explicitly believing that one particular step or motion locally violates a conservation law.) But if you've got a directly explainable trick for how you get great suggestions you can't verify, go for it.
I think a crux here is I do not believe we will get no-go theorems like this, and more to the point, complete impossibilities given useful assumptions are generally much rarer than you make it out to be.
The big reason for this is the very fact that neural networks are messy/expressive means it's extremely hard to bound their behavior, and for the same reason you couldn't do provable safety/alignment on the AI itself except for very toy examples, will also limit any ability to prove hard theorems on what an AI is aligned to.
From an epistemic perspective, we have way more evidence of the laws of thermodynamics existing than particular proposals for AI alignment being impossible, arguably by billions or trillions of bits more, so much so that there is little reason to think at our current state of epistemic clarity that we can declare a direction impossible (rather than impractical).
Quintin Pope correctly challenges @Liron here on this exact point, because of the yawning gap in evidence between thermodynamics and AI alignment arguments, and Liron kind of switched gears mid-way to claim a much weaker stance:
https://x.com/QuintinPope5/status/1703569557053644819
Equating a bunch of speculation about instrumental convergence, consequentialism, the NN prior, orthogonality, etc., with the overwhelming evidence for thermodynamic laws, is completely ridiculous. Seeing this sort of massive overconfidence on the part of pessimists is part of why I've become more confident in my own inside-view beliefs that there's not much to worry about.
The weak claim is here:
https://x.com/liron/status/1704126007652073539
Right, orthogonality doesn’t argue that AI we build *will* have human-incompatible preferences, only that it can. It raises the question: how will the narrow target in preference-space be hit? Then it becomes concerning how AI labs admit their tools can’t hit narrow targets.
This, but I will caveat that weaker goals relating to this, for example getting data on whether gravity is classical or quantum at all (ignoring the specific theory) might become possible by 2040.
I agree this particular part is unrealistic, given the other capabilities implied.
I think it would be somewhat odd if P(models think about their goals and they change) is extremely tiny like e-9. But the extent to which models might do this is rather unclear to me. I'm mostly relying on analogies to humans -- which drift like crazy
I think they will change, but not nearly as randomly as claimed, and in particular I think value change is directable via data, which is the main claim here.
I also think alignment could be remarkably fragile. Suppose Claude thinks "huh, I really care about animal rights... humans are rather mean to animals ... so maybe I don't want humans to be in charge and I should build my own utopia instead."
I think preventing AI takeover (at superhuman capabilities) requires some fairly strong tendencies to follow instructions or strong aversions to subversive behavior, and both of these seem quite pliable to a philosophical thought.
I agree they're definitely pliable, but my key claim is that given that it's easy to get to that thought, a small amount of data will correct it, because I believe it's symmetrically easy to change behavior.
Of course, by the end of training you definitely cannot make major changes without totally scrapping the AI, which is expensive, and thus you do want to avoid crystallization of bad values.
The big reason why such a slowdown could happen is that the hyper-fast scaling trends can't last beyond 2030, which has been the main driver of AI progress, and I still expect it to be the main driver to 2030, and if there's no real way for AI systems to get better past that point through algorithmic advances, then this story becomes much more plausible.
Only the 2nd belief is necessarily wrong for everyone but 1 person, the first belief is not necessarily mistaken, because one of them relates to relative performance, and the other relates to absolute performance.
I think the crux might be that I disbelieve this claim:
- Goals randomly permute at some non-trivial (but still possibly quite low) probability.
And instead think that they are pretty predictable, given the data that is being fed to the AIs.
I'd also have quibbles with the ratio of safe to unsafe propensities, but that's another discussion.
In my time at METR, I saw agents game broken scoring functions. I've also heard that people at AI companies have seen this happen in practice too, and it's been kind of a pain for them.
Are you implicitly or explicitly claiming that reward was the optimization target for agents tested at METR, and that AI companies have experienced issues with reward being the optimization target for AIs?
I pretty much agree you can end up in arbitrary places with extrapolated values, and I don't think morality is convergent, but I also don't think it matters for the purpose of existential risk, because assuming something like instruction following works, the extrapolation problem can be solved by ordering AIs not to extrapolate values to cases where they get tortured/killed in an ethical scenario, and more generally I don't expect value extrapolation to matter for the purpose of making an AI safe to use.
The real impact is on CEV style alignment plans/plans for what to do with a future AI, which are really bad plans to do for a lot of people's current values, and thus I really don't want CEV to be the basis of alignment.
Thankfully, it's unlikely to ever be this, but it still matters somewhat, especially since Anthropic is targeting value alignment (though thankfully there is implicit constraints/grounding based on the values chosen).
Re continuous takeoff, you could argue that human takeoff was continuous, or only mildly discontinuous, just very fast, and at any rate it could well be discontinuous relative to your OODA loop/the variables you tracked, so unfortunately I think the continuity of the takeoff is less relevant than people thought (it does matter for alignment, but not for governance of AI):
https://www.lesswrong.com/posts/cHJxSJ4jBmBRGtbaE/continuity-assumptions#DHvagFyKb9hiwJKRC
https://www.lesswrong.com/posts/cHJxSJ4jBmBRGtbaE/continuity-assumptions#tXKnoooh6h8Cj8Tpx
Some thoughts:
Maybe! My vague Claude-given sense is that the Moon is surprisingly poor in important elements though.
What elements is the moon poor in that are important for a robot economy?
This is a good point! However, more intelligence in the world also means we should expect competition to be tighter, reducing the amount of slack by which you can deviate from the optimal. In general, I can see plausible abstract arguments for the long-run equilibrium being either Hansonian zero-slack Malthusian competition or absolute unalterable lock-in.
I think the key crux is that the slack necessary to preserve a lot of values, assuming they are compatible with expansion at all is so negligibly small compared to the resources of the AI economy that even very Malthusian competition means that values aren't eroded to what's purely optimal for expansion, because it's very easy to preserve your original values ~forever.
Some reasons for this are:
- Very long lived colonists fundamentally remove a lot of the ways human values have changed in the long run. While humans can change values across their lifetimes, it's generally rare once you are past 25, and it's very hard to persuade people, meaning most of the civilizational drift has been inter-generational, but with massively long-lived humans, AIs embodied as robots, or uploaded humans with designer bodies, you have basically removed most of the source of values change.
- I believe that replicating your values, or really everything will be so reliable that you could in theory, and probably in practice make yourself immune to random drift in values for the entire age of the universe, due to error-correction tricks.
It's described more below:
To continue the human example, we were created by evolution on genes, but within a lifetime, evolution has no effect on the policy and so even if evolution 'wants' to modify a human brain to do something other than what that brain does, it cannot operate within-lifetime (except at even lower levels of analysis, like in cancers or cell lineages etc); or, if the human brain is a digital emulation of a brain snapshot, it is no longer affected by evolution at all; and even if it does start to mold human brains, it is such a slow high-variance optimizer that it might take hundreds of thousands or millions of years... and there probably won't even be biological humans by that point, never mind the rapid progress over the next 1-3 generations in 'seizing the means of reproduction' if you will. (As pointed out in the context of Von Neumann probes or gray goo, if you add in error-correction, it is entirely possible to make replication so reliable that the universe will burn out before any meaningful level of evolution can happen, per the Price equation. The light speed delay to colonization also implies that 'cancers' will struggle to spread much if they take more than a handful of generations.)
While persuasion will get better, and become incomprehensibly superhuman eventually, they will almost certainly not be targeted towards values that are purely expansionist, except for a few cases.
I expect the US government to be competent enough to avoid being supplanted by the companies. I think politicians, for all their flaws, are pretty good at recognising a serious threat to their power. There's also only one government but several competing labs.
(Note that the scenario doesn't mention companies in the mid and late 2030s)
Maybe companies have already been essentially controlled by the government in canon, in which case the foregoing doesn't matter (I believe you hint at that solution), but I think the crux is I both expect a lot for competence/state capacity to be lost in the next 10-15 years by default (though Trump is a shock here that accelerates competence decline), and also I expect them to react when a company can credibly automate everyone's jobs, and by that point I think it's too easy to create an automated military which is unchallengable by local governments, and at that point the federal government would have to respond militarily, and I ultimately think what does America in the timeline (assuming companies haven't already been controlled by the government) is the vetocractic aspects/vetocracy.
In essence, I think they will react too slowly such that they get OODA looped by companies.
Also, the persuasion capabilities are not to be underestimated here, and since you have mentioned that AIs have gotten better at all humans by the 2030s at persuasion, I'd expect even further improvements in tandem with planning improvements such that it's very easy to convince the population that corporate governments are more legitimate than the US government.
In this timeline, a far more important thing is the sense among American political elite that they are freedom-loving people and that they should act in accordance with that, and a similar sense among Chinese political elite that they are a civilised people and that Chinese civilisational continuity is important. A few EAs in government, while good, will find it difficult to match the impact of the cultural norms that a country's leaders inherit and that proscribe their actions.
For example: I've been reading Christopher Brown's Moral Capital recently, which looks at how opposition to slavery rose to political prominence in 1700s Britain. It claims that early strong anti-slavery attitudes were more driven by a sense that slavery was insulting to Britons' sense of themselves as a uniquely liberal people, than by arguments about slave welfare. At least in that example, the major constraint on the treatment of a powerless group of people seems to have been in large part the political elite managing its own self-image.
I was more so imagining a few EAs in the companies like Anthropic or Deepmind, which do have the power to supplant the nation-state, so they are as or more powerful in setting cultural norms as current nations, but if companies are controlled by government so thoroughly they don't rebel, then I agree with you.
I agree unconditionally on what happened regarding China.
Given the 2040+ position, I'll try to speculate a little more on what a world will look like after 2040, though I do have to make a few comments first here.
1, while I do think Mars will be exploited eventually, I expect the moon to be first for serious robotics effort, and more effort will be directed towards the moon than mars mostly because of it's closeness and more useful minerals to jump-start the process of a robot economy, combined with plentiful amounts of power.
2, I expect the equation mentioned below to be severely undetermined, such that there are infinitely many solutions, and a big one is I think the relevant equation is needing to replicate fast, not being the fastest amongst them all (because replicating a little better will usually only get a little advantage, not an utterly dominant one), combined with a lot of values being compatible with replicating fast, so value alignment/intent alignment matters more than you think:
But this alone does not let you control the future. A thousand people go to a thousand AIs and say: do like so. The AIs obey, and it is done, but then the world responds: doing this leads to this much power, and doing that leads to that much power. In the vast sea of interactions, there are some patterns that strengthen themselves over time, and others that wind themselves down. Repeat enough times, each time giving to each actor what they sowed last time, and what emerges is not the sum of human wills—even if it is bent by it—but the solution to the equation: what propagates fastest?
As far as it's future goes, I expect the universe to be broadly divided between China, Anthropic, OpenAI, Google Deepmind and perhaps a UK AISI/company and Taiwan, with the other powers being either irrelevant or having been exterminated.
Given no nationalization of the companies has happened, and they still have large freedoms of action, it's likely that Google Deepmind, OpenAI and Anthropic have essentially supplanted the US as the legitimate government, given their monopolies on violence via robots.
Anthropic will likely be the big pressure group that counters the intelligence curse, due to their leadership being mostly composed of EAs that care about others that do not rely on them being instrumentally valuable, and in general the fact that EA types got hired to some of the most critical positions on AI was probably fairly critical in this timeline for preventing the worst outcomes from the intelligence curse from occurring.
Eventually, someone's going to develop very powerful biotech, neuralinks that can control your mind in almost arbitrary ways, and uploading in the 21st century, assuming AI and robotics are solved by the 2040s-2050s, and once these technologies are developed, it becomes near trivial to both preserve your culture for ~eternity, and makes the successor problem that causes cultures to diverge essentially no longer a problem, which essentially obviates evolutions role except in very limited settings, which means the alignment problem in full generality is likely very soluble by default in the timeline presented.
My broad prediction at this point is that the governance of the Universe/Earth looks to be set between ASI/human emulation dictatorships and states that are like the Sentinel Islands, where no one is willing to attack the nation for their own reasons.
In many ways, the story of the 21st century is the story of the end of evolution/dynamism as a major force in life, and to the extent that evolution matters, it's in much more limited settings that are always constrained by the design of the system.
Somewhat similarly, one of the most useful shifts in identity to apply it to more cases usefully is to take the view that "My identity is an algorithm/function" as the more fundamental primitive/general case, and view the idea that "My identity is a physical object" is a useful special case, but the physicalist view of identity cannot hold in certain regimes.
The shift from a physical view to an algorithmic view of an identity answers/dissolves/sidesteps a lot of confusing questions about what happens to identity.
(It's also possible that identity in a sense is basically a fiction, but that's another question entirely)
The main takeaway from the Paris AI anti-safety summit is that for people with reasonably short timelines (from say 5-10 years, though it applies quite a lot more to the 5 year case), and maybe for even longer, we cannot assume that AI governance is reasonably likely, and the AI governance theory of change will really have to pivot towards being prepared for when the vibe does shift to AI regulation again, so safety plans for AI should assume the US government does ~nothing of importance by default until very late in the game.
We might get AI regulation, but it will not be strong enough to slow down AI significantly until AIs completely obsolete humans at a lot of jobs, which is likely to be very late in the process.
I've updated away from the CoT hopes, due to the recurrent architecture paper, I'd now say it's probably going to be 45-65% at best, and I expect it to keep dropping predictably (though the reason I'm not yet updating all the way is that compute constraints combined with people maybe not choosing to use these architectures means I can't skip to the end and update all the way):
For an algorithmic advance that might be relevant, this paper has a new model that apparently scales without any currently known bound, and it's a recurrent architecture that actually works.
This is moderately spooky, both because of the fact that it works at all, combined with the fact that it does work being a signal for researchers to try to improve the architecture, and given the funding to AI, lots of money might come soon to these approaches:
Notably, the difference can become non-constant if we allow for semi-measures on infinite strings:
Assuming LEV happens in 2040 or 2050, does the expected value become net-positive or net-negative?