Eliezer's Lost Alignment Articles / The Arbital Sequence
post by Ruby, RobertM (T3t) · 2025-02-20T00:48:10.338Z · LW · GW · 6 comments
Note: this is a static copy of this wiki page [? · GW]. We are also publishing it as a post to ensure visibility.
Circa 2015-2017, a lot of high-quality content was written on Arbital by Eliezer Yudkowsky, Nate Soares, Paul Christiano, and others. Perhaps because the platform didn't take off, most of this content has not been as widely read as warranted by its quality. Fortunately, these articles have now been imported into LessWrong [LW · GW].
Most of the content written was either about AI alignment or math[1]. The Bayes Guide and Logarithm Guide [? · GW] are likely among the best mathematical educational materials online. Amongst the AI alignment content are detailed and evocative explanations of alignment ideas: some well known, such as instrumental convergence [? · GW] and corrigibility [? · GW], some lesser known like epistemic/instrumental efficiency, and some misunderstood like pivotal act [? · GW].
The Sequence
The articles collected here were originally published as wiki pages with no set reading order. The LessWrong team first selected about twenty pages which seemed most engaging and valuable to us, and then ordered them[2][3] based on a mix of our own taste and feedback from some test readers that we paid to review our choices.
Tier 1
These pages are a good reading experience.
1. | AI safety mindset | What kind of mindset is required to successfully build an extremely advanced and powerful AGI that is "nice"? |
2. | Convergent instrumental strategies and Instrumental pressure | Certain sub-goals like "gather all the resources" and "don't let yourself be turned off" are useful for a very broad range of goals and values. |
3. | Context disaster | Current terminology would call this "misgeneralization". Do alignment properties that hold in one context (e.g. training, while less smart) generalize to another context (deployment, much smarter)? |
4. | Orthogonality Thesis | The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal. |
5. | Hard problem of corrigibility | It's a hard problem to build an agent which, in an intuitive sense, reasons internally as if from the developer's external perspective – that it is incomplete, that it requires external correction, etc. This is not default behavior for an agent. |
6. | Coherent Extrapolated Volition | If you're extremely confident in your ability to align an extremely advanced AGI on complicated targets, this is what you should have your AGI pursue. |
7. | Epistemic and instrumental efficiency | "Smarter than you" is vague. "Never ever makes a mistake that you could predict" is more specific. |
8. | Corporations vs. superintelligences | Is a corporation a superintelligence? (An example of epistemic/instrumental efficiency in practice.) |
9. | Rescuing the utility function | "Love" and "fun" aren't ontologically basic components of reality. When we figure out what they're made of, we should probably go on valuing them anyways. |
10. | Nearest unblocked strategy | If you tell a smart consequentialist mind "no murder" but it is actually trying, it will just find the next best thing that you didn't think to disallow. |
11. | Mindcrime | The creation of artificial minds opens up the possibility of artificial moral patients who can suffer. |
12. | General intelligence | Why is AGI a big deal? Well, because general intelligence is a big deal. |
13. | Advanced agent properties | The properties of agents that (1) create the need for alignment and (2) are relevant in the big picture. |
14. | Mild optimization | "Mild optimization" is where, if you ask your advanced AGI to paint one car pink, it just paints one car pink and then stops, rather than tiling the galaxies with pink-painted cars, because it's not optimizing that hard. It's okay with just painting one car pink; it isn't driven to max out the twentieth decimal place of its car-painting score. |
15. | Corrigibility | The property such that if you tell your AGI that you installed the wrong values in it, it lets you do something about that. An unnatural property to build into an agent. |
16. | Pivotal Act | An act which would make a large positive difference to things a billion years in the future, e.g. an upset of the gameboard that's a decisive "win". |
17. | Bayes Rule Guide | An interactive guide to Bayes' theorem, i.e., the law of probability governing the strength of evidence - the rule saying how much to revise our probabilities (change our minds) when we learn a new fact or observe new evidence. (A short worked example follows this list.) |
18. | Bayesian View of Scientific Virtues | A number of scientific virtues are explained intuitively by Bayes' rule. |
19. | A quick econ FAQ for AI/ML folks concerned about technological unemployment | An FAQ aimed at a very rapid introduction to key standard economic concepts for professionals in AI/ML who have become concerned with the potential economic impacts of their work. |
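As a taste of what the Bayes Rule Guide (item 17) covers, here is Bayes' theorem applied to a standard diagnostic-test example. The numbers (1% base rate, 80% true-positive rate, 9.6% false-positive rate) are illustrative assumptions, not taken from the guide itself:

```latex
% Bayes' theorem: P(H|E) = P(E|H) P(H) / P(E)
% Illustrative (assumed) numbers: base rate 1%, true-positive rate 80%,
% false-positive rate 9.6%.
P(H \mid E)
  = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)}
  = \frac{0.8 \times 0.01}{0.8 \times 0.01 + 0.096 \times 0.99}
  \approx 0.078
```

Even after a positive test, the hypothesis is only about 7.8% probable, because the low base rate dominates; the guide builds this kind of intuition step by step.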
Tier 2
These pages are high effort and high quality, but are less accessible and/or of less general interest than the Tier 1 pages.
The list starts with a few math pages before returning to AI alignment topics.
20. | Uncountability | Sizes of infinity fall into two broad classes: countable infinities, and uncountable infinities. |
21. | Axiom of Choice [? · GW] | The axiom of choice states that given an infinite collection of non-empty sets, there is a function that picks out one element from each set. |
22. | Category theory [? · GW] | Category theory studies the abstraction of mathematical objects (such as sets, groups, and topological spaces) in terms of the morphisms between them. |
23. | Solomonoff Induction: Intro Dialogue | A dialogue between Ashley, a computer scientist who's never heard of Solomonoff's theory of inductive inference, and Blaine, who thinks it is the best thing since sliced bread. |
24. | Advanced agent properties | An "advanced agent" is a machine intelligence smart enough that we start considering how to point it in a nice direction. |
25. | Vingean uncertainty [? · GW] | Vinge's Principle says that you (usually) can't predict exactly what an entity smarter than you will do, because if you knew exactly what a smart agent would do, you would be at least that smart yourself. "Vingean uncertainty" is the epistemic state we enter into when we consider an agent too smart for us to predict its exact actions. |
26. | Sufficiently optimized agents appear coherent [? · GW] | Agents which have been subject to sufficiently strong optimization pressures will tend to appear, from a human perspective, as if they obey some bounded form of the Bayesian coherence axioms for probabilistic beliefs and decision theory. |
27. | Utility indifference [? · GW] | A proposed solution to the hard problem of corrigibility. |
28. | Problem of fully updated deference [? · GW] | One possible scheme in AI alignment is to give the AI a state of moral uncertainty implying that we know more than the AI does about its own utility function, as the AI's meta-utility function defines its ideal target. Then we could tell the AI, "You should let us shut you down because we know something about your ideal target that you don't, and we estimate that we can optimize your ideal target better without you." |
29. | Ontology identification problem [? · GW] | It seems likely that for advanced agents, the agent's representation of the world will change in unforeseen ways as it becomes smarter. The ontology identification problem is to create a preference framework for the agent that optimizes the same external facts, even as the agent modifies its representation of the world. |
30. | Edge instantiation [? · GW] | The edge instantiation problem is a hypothesized patch-resistant problem for safe value loading in advanced agent scenarios where, for most utility functions we might try to formalize or teach, the maximum of the agent's utility function will end up lying at an edge of the solution space that is a 'weird extreme' from our perspective. |
31. | Goodhart's Curse [? · GW] | Goodhart's Curse is a neologism for the combination of the Optimizer's Curse and Goodhart's Law, particularly as applied to the value alignment problem for Artificial Intelligences. (A minimal numerical sketch follows this list.) |
32. | Low impact [? · GW] | A low-impact agent is one that's intended to avoid large bad impacts at least in part by trying to avoid all large impacts as such. |
33. | Executable philosophy [? · GW] | 'Executable philosophy' is Eliezer Yudkowsky's term for discourse about subjects usually considered in the realm of philosophy, meant to be used for designing an Artificial Intelligence. |
34. | Separation from hyperexistential risk [? · GW] | An AGI design should be widely separated in the design space from any design that would constitute a hyperexistential risk. A hyperexistential risk is a "fate worse than death". |
35. | Methodology of unbounded analysis [? · GW] | In modern AI and especially in value alignment theory, there's a sharp divide between "problems we know how to solve using unlimited computing power", and "problems we can't state how to solve using computers larger than the universe". |
36. | Methodology of foreseeable difficulties [? · GW] | Much of the current literature about value alignment centers on purported reasons to expect that certain problems will require solution, or be difficult, or be more difficult than some people seem to expect. This page discusses that practice, considered as a policy or methodology. |
37. | Instrumental goals are almost-equally as tractable as terminal goals [? · GW] | One counterargument to the Orthogonality Thesis asserts that agents with terminal preferences for goals like e.g. resource acquisition will always be much better at those goals than agents which merely try to acquire resources on the way to doing something else, like making paperclips. This page is a reply to that argument. |
38. | Arbital: Solving online explanations [? · GW] | A page explaining somewhat how the rest of the pages here came to be. |
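To give a concrete feel for the Optimizer's Curse half of Goodhart's Curse (item 31): when several options are equally good in reality but we pick the one with the highest noisy estimate, the winner's estimate is systematically biased upward. The following minimal simulation sketch uses an assumed setup and numbers for illustration; it is not from the Arbital page:

```python
import random

# Minimal sketch of the Optimizer's Curse: all options have the same true
# value, but we only see noisy estimates. Selecting the option with the
# highest estimate systematically overstates how good the winner really is.
random.seed(0)

TRUE_VALUE = 0.0      # every option is actually equally good
NOISE_SD = 1.0        # standard deviation of estimation error
N_OPTIONS = 10
N_TRIALS = 10_000

total_winning_estimate = 0.0
for _ in range(N_TRIALS):
    estimates = [random.gauss(TRUE_VALUE, NOISE_SD) for _ in range(N_OPTIONS)]
    total_winning_estimate += max(estimates)

# The average estimate of the selected "best" option is far above its true
# value of 0 (roughly 1.5 with these settings).
print(total_winning_estimate / N_TRIALS)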
Lastly, we're sure this sequence isn't perfect, so any feedback (which you liked/disliked/etc) is appreciated – feel free to leave comments on this page.
1. ^ Mathematicians were an initial target market for Arbital.
2. ^ The ordering here is "Top Hits", subject to the constraint that if you start reading at the top, you won't be missing any major prerequisites as you read along.
3. ^ The pages linked here are only some of the AI alignment articles, and the selection/ordering has not been endorsed by Eliezer or MIRI. The rest of the imported Arbital content can be found via links from the pages below and also from the LessWrong Concepts page (use this link to highlight imported Arbital pages).
6 comments
comment by Dave Orr (dave-orr) · 2025-02-20T17:46:09.997Z · LW(p) · GW(p)
The pivotal act link is broken, fyi.
comment by Warty · 2025-02-20T20:27:03.759Z · LW(p) · GW(p)
there's weird shit going on on mobile like items 1-16 scroll a bit horizontally