Because I might fund them or forward it to someone else who will.
In general people should feel free to DM me with pitches for this sort of thing.
I think this epistemic uncertainty is distinct from the type of "objective probabilities" I talk about in my post, and I don't really know how to use language without referring to degrees of my epistemic uncertainty.
The part I was gesturing at wasn't the "probably" but the "low measure" part.
Is your position that the problem is deeper than this, and there is no objective prior over worlds, it's just a thing like ethics that we choose for ourselves, and then later can bargain and trade with other beings who have a different prior of realness?
Yes, that's a good summary of my position—except that I think that, like with ethics, there will be a bunch of highly-suggestive logical/mathematical facts which make it much more intuitive to choose some priors over others. So the choice of prior will be somewhat arbitrary but not totally arbitrary.
I don't think this is a fully satisfactory position yet; it hasn't really dissolved the confusion about why subjective anticipation feels so real, but it feels directionally correct.
Hmmm, uncertain if we disagree. You keep saying that these concepts are cursed and yet phrasing your claims in terms of them anyway (e.g. "probably very low measure"), which suggests that there's some aspect of my response you don't fully believe.
In particular, in order for your definition of "what beings are sufficiently similar to you" to not be cursed, you have to be making claims not just about the beings themselves (since many Boltzmann brains are identical to your brain) but rather about the universes that they're in. But this is kinda what I mean by coalitional dynamics: a bunch of different copies of you become more central parts of the "coalition" of your identity based on e.g. the types of impact that they're able to have on the world around them. I think describing this as a metric of similarity is going to be pretty confusing/misleading.
you can estimate who are the beings whose decisions correlate with this one, and what is the impact of each of their decisions, and calculate the sum of all that
You still need a prior over worlds to calculate impacts, which is the cursed part.
I don't think this line of argument is a good one. If there's a 5% chance of x-risk and, say, a 50% chance that AGI makes the world just generally be very chaotic and high-stakes over the next few decades, then it seems very plausible that you should mostly be optimizing for making the 50% go well rather than the 5%.
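To make the arithmetic explicit, here's a minimal sketch with made-up numbers (the impact figures are purely illustrative, not claims about real tractability):

```python
# Toy expected-value comparison; all numbers are hypothetical.
p_xrisk = 0.05   # chance of existential catastrophe
p_chaos = 0.50   # chance AGI makes the next decades chaotic and high-stakes

# Hypothetical marginal improvement you can make by focusing on each branch,
# in arbitrary "value" units.
impact_if_focus_on_xrisk = 1.0
impact_if_focus_on_chaos = 0.2

ev_xrisk_focus = p_xrisk * impact_if_focus_on_xrisk   # 0.05
ev_chaos_focus = p_chaos * impact_if_focus_on_chaos   # 0.10

# Even if your per-branch impact on x-risk is 5x larger, the 10x difference in
# probability means optimizing for the chaotic branch wins in expectation.
print(ev_xrisk_focus, ev_chaos_focus)
```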
Worse than the current situation, because the counterfactual is that some later project happens which kicks off in a less race-y manner.
In other words, whatever the chance of its motivation shifting over time, it seems dominated by the chance that starting the equivalent project later would just have better motivations from the outset.
Great post. One slightly nitpicky point, though: even in the section where you argue that probabilities are cursed, you are still talking in the language of probabilities (e.g. "my modal guess is that I'm in a solipsist simulation that is a fork of a bigger simulation").
I think there's probably a deeper ontological shift you can do to a mindset where there's no actual ground truth about "where you are". I think in order to do that you probably need to also go beyond "expected utilities are real", because expected utilities need to be calculated by assigning credences to worlds and then multiplying them by expected impact in each world.
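Concretely, the calculation I'm pointing at is the standard one, where $P(w)$ is the credence assigned to world $w$ and $U(a,w)$ is the expected impact of action $a$ in world $w$:

$$EU(a) = \sum_{w} P(w)\,U(a,w)$$

Both factors presuppose some ground truth about which worlds you might be in and how much credence each deserves, which is the part I'm suggesting we move beyond.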
Instead the most "real" thing here I'd guess is something like "I am an agent in a superposition of being in many places in the multiverse. Each of my actions is a superposition of uncountable trillions of actions that will lead to nothing plus a few that will have lasting causal influence. The degree to which I care about one strand of causal influence over another is determined by the coalitional dynamics of my many subagents".
FWIW I think this is roughly the perspective on the multiverse Yudkowsky lays out in Planecrash (especially in the bits near the end where Keltham and Carissa discuss anthropics). Except that the degrees of caring being determined by coalitional dynamics is more related to geometric rationality.
I also tweeted about something similar recently (inspired by your post).
Cool, ty for (characteristically) thoughtful engagement.
I am still intuitively skeptical about a bunch of your numbers but now it's the sort of feeling which I would also have if you were just reasoning more clearly than me about this stuff (that is, people who reason more clearly tend to be able to notice ways that interventions could be surprisingly high-leverage in confusing domains).
Ty for the link, but these both seem like clearly bad semantics (e.g. under either of these the second-best hypothesis under consideration might score arbitrarily badly).
Just changed the name to The Minority Coalition.
1. Yepp, seems reasonable. Though FYI I think of this less as some special meta argument, and more as the common-sense correction that almost everyone implicitly does when giving credences, and rationalists do less than most. (It's a step towards applying outside view, though not fully "outside view".)
2. Yepp, agreed, though I think the common-sense connotations of "if this became" or "this would have a big effect" are causal, especially in the context where we're talking to the actors who are involved in making that change. (E.g. the non-causal interpretation of your claim feels somewhat analogous to if I said to you "I'll be more optimistic about your health if you take these pills", and so you take the pills, and then I say "well the pills do nothing but now I'm more optimistic, because you're the sort of person who's willing to listen to recommendations". True, but it also undermines people's willingness/incentive to listen to my claims about what would make the world better.)
3. Here are ten that affect AI risk as much, one way or the other:
- The US government "waking up" a couple of years earlier or later (one operationalization: AISIs existing or not right now).
- The literal biggest names in the field of AI becoming focused on AI risk.
- The fact that Anthropic managed to become a leading lab (and, relatedly, the fact that Meta and other highly safety-skeptical players are still behind).
- Trump winning the election.
- Elon doing all his Elon stuff (like founding x.AI, getting involved with Trump, etc).
- The importance of transparency about frontier capabilities (I think of this one as more of a logical update that I know you've made).
- o1-style reasoning as the next big breakthrough.
- Takeoff speeds (whatever updates you've made in the last three years).
- China's trajectory of AI capabilities (whatever updates you've made about that in the last 3 years).
- China's probability of invading Taiwan (whatever updates you've made about that in the last 3 years).
And then I think in 3 years we'll be able to publish a similar list of stuff that mostly we just hadn't predicted or thought about before now.
I expect you'll dispute a few of these; happy to concede the ones that are specifically about your updates if you disagree (unless you agree that you will probably update a bunch on them in the next 3 years).
But IMO the easiest way for safety cases to become the industry-standard thing is for AISI (or internal safety factions) to specifically demand them, and then the labs produce them, but kinda begrudgingly, and don't really take them seriously internally (or are literally not the sort of organizations that are capable of taking them seriously internally—e.g. due to too much bureaucracy). And that seems very much like the sort of change that's comparable to or smaller than the things above.
I think I would be more sympathetic to your view if the claim were "if AI labs really reoriented themselves to take these AI safety cases as seriously as they take, say, being in the lead or making profit". That would probably halve my P(doom); it's just a very, very strong criterion.
We have discussed this dynamic before but just for the record:
I think that if it became industry-standard practice for AGI corporations to write, publish, and regularly update (actual instead of just hypothetical) safety cases at this level of rigor and detail, my p(doom) would cut in half.
This is IMO not the type of change that should be able to cut someone's P(doom) in half. There are so many different factors that are of this size and importance or bigger (including many that people simply have not thought of yet) such that, if this change could halve your P(doom), then your P(doom) should be oscillating wildly all the time.
I flag this as an example of prioritizing inside-view considerations too strongly in forecasts. I think this is the sort of problem that arises when you "take bayesianism too seriously", which is one of the reasons why I wrote my recent post on why I'm not a bayesian (and also my earlier post on Knightian uncertainty).
For context: our previous discussions about this related to Daniel's claim that appointing one specific person to one specific important job could change his P(doom) by double digit percentage points. I similarly think this is not the type of consideration that should be able to swing people's P(doom) that much (except maybe changing the US or Chinese leaders, but we weren't talking about those).
Lastly, since this is a somewhat critical comment, I should flag that I really appreciate and admire Daniel's forecasting, have learned a lot from him, and think he's generally a great guy. The epistemology disagreements just disproportionately bug me.
The former can be sufficient—e.g. there are good theoretical researchers who have never done empirical work themselves.
In hindsight I think "close conjunction" was too strong—it's more about picking up the ontologies and key insights from empirical work, which can be possible without following it very closely.
I think there's something importantly true about your comment, but let me start with the ways I disagree. Firstly, the more ways in which you're power-seeking, the more defense mechanisms will apply to you. Conversely, if you're credibly trying to do a pretty narrow and widely-accepted thing, then there will be less backlash. So Jane Street is power-seeking in the sense of trying to earn money, but they don't have much of a cultural or political agenda, they're not trying to mobilize a wider movement, and earning money is a very normal thing for companies to do; it makes them one of thousands of comparably-sized companies. (Though note that there is a lot of backlash against companies in general, which are perceived to have too much power. This leads a wide swathe of people, especially on the left, and especially in Europe, to want to greatly disempower companies because they don't trust them.)
Meanwhile the Gates Foundation has a philanthropic agenda, but like most foundations tries to steer clear of wider political issues, and also IIRC tries to focus on pretty object-level and widely-agreed-to-be-good interventions. Even so, it's widely distrusted and feared, and Gates has become a symbol of hated global elites, to the extent where there are all sorts of conspiracy theories about him. That'd be even worse if the foundation were more political.
Lastly, it seems a bit facile to say that everyone hates Goldman due to "perceived greed rather than power-seeking per se". A key problem is that people think of the greed as manifesting through political capture, evading regulatory oversight, deception, etc. That's part of why it's harder to tar entrepreneurs as greedy: it's just much clearer that their wealth was made in legitimate ways.
Now the sense in which I agree: I think that "gaining power triggers defense mechanisms" is a good first pass, but also we definitely want a more mechanistic explanation of what the defense mechanisms are, what triggers them, etc—in particular so we don't just end up throwing our hands in the air and concluding that doing anything is hopeless and scary. And I also agree that your list is a good start. So maybe I'd just want to add to it stuff like:
- having a broad-ranging political agenda (that isn't near-universally agreed to be good)
- having non-transparent interactions with many other powerful actors
- having open-ended scope to expand
And maybe a few others (open to more suggestions).
The bits are not very meaningful in isolation; the claim "program-bit number 37 is a 1" has almost no meaning in the absence of further information about the other program bits. However, this isn't much of an issue for the formalism.
In my post I defend the use of propositions as a way to understand models, and attack the use of propositions as a way to understand reality. You can think of this as a two-level structure: claims about models can be crisp and precise enough that it makes sense to talk about them in propositional terms, but for complex bits of reality you mostly want to make claims of the form "this is well-modeled by model X". Those types of claims need to be understood in terms of continuous truth-values: they're basically never entirely true or entirely false.
Separately, Solomonoff programs are non-central examples of models because they do not come with structural correspondences to reality attached (except via their inputs and outputs). Most models have some mapping that allows you to point at program-bits and infer some features of reality from them.
I notice as I write this that there's some tension in my position: I'm saying we shouldn't apply propositions to reality, but also the mappings I mentioned above allow us to formulate propositions like "the value of X in reality is approximately the value of this variable in my model".
So maybe I'm actually arguing for a middle ground between two extremes:
- The basic units of epistemology should all map precisely to claims about reality, and should be arbitrarily combinable and composable (the propositional view)
- The basic units of epistemology should only map to claims about reality in terms of observable predictions, and not be combinable or composable at all (the Solomonoff view)
This spectrum isn't fully well-defined even in my head but seems like an interesting way to view things which I'll think more about.
The minority faction is the group of entities that are currently alive, as opposed to the vast number of entities that will exist in the future. I.e. the one Clarke talks about when he says "why won’t you help the rest of us form a coalition against them?"
In hindsight I should probably have called it The Minority Coalition.
Here's how that would be handled by a Bayesian mind:
- There's some latent variable representing the semantics of "humanity will be extinct in 100 years"; call that variable S for semantics.
- Lots of things can provide evidence about S. The sentence itself, context of the conversation, whatever my friend says about their intent, etc, etc.
- ... and yet it is totally allowed, by the math of Bayesian agents, for that variable S to still have some uncertainty in it even after conditioning on the sentence itself and the entire low-level physical state of my friend, or even the entire low-level physical state of the world.
What would resolve the uncertainty that remains after you have conditioned on the entire low-level state of the physical world? (I assume that we're in the logically omniscient setting here?)
"Dragons are attacking Paris!" seems true by your reasoning, since there are no dragons, and therefore it is vacuously true that all of them are attacking Paris.
Ty for the comment. I mostly disagree with it. Here's my attempt to restate the thrust of your argument:
The issues with binary truth-values raised in the post are all basically getting at the idea that the meaning of a proposition is context-dependent. But we can model context-dependence in a Bayesian way by referring to latent variables in the speaker's model of the world. Therefore we don't need fuzzy truth-values.
But this assumes that, given the speaker's probabilistic model, truth-values are binary. I don't see why this needs to be the case. Here's an example: suppose my non-transhumanist friend says "humanity will be extinct in 100 years". And I say "by 'extinct' do you include being genetically engineered until future humans are a different species? How about being uploaded? How about all being cryonically frozen, to be revived later? How about...."
In this case, there is simply no fact of the matter about which of these possibilities should be included or excluded in the context of my friend's original claim, because (I'll assume) they hadn't considered any of those possibilities.
More prosaically, even if I have considered some possibilities in the past, at the time when I make a statement I'm not actively considering almost any of them. For some of them, if you'd raised those possibilities to me when I'd asked the question, I'd have said "obviously I did/didn't mean to include that", but for others I'd have said "huh, idk" and for others still I would have said different things depending on how you presented them to me. So what reason do we have to think that there's any ground truth about what the context does or doesn't include? Similar arguments apply re approximation error about how far away the grocery store is: clearly 10km error is unacceptable, and 1m is acceptable, but what reason do we have to think that any "correct" threshold can be deduced even given every fact about my brain-state when I asked the question?
I picture you saying in response to this "even if there are some problems with binary truth-values, fuzzy truth-values don't actually help very much". To this I say: yes, in the context of propositions, I agree. But that's because we shouldn't be doing epistemology in terms of propositions. And so you can think of the logical flow of my argument as:
- Here's why, even for propositions, binary truth is a mess. I'm not saying I can solve it but this section should at least leave you open-minded about fuzzy truth-values.
- Here's why we shouldn't be thinking in terms of propositions at all, but rather in terms of models.
- And when it comes to models, something like fuzzy truth-values seems very important (because it is crucial to be able to talk about models being closer to the truth without being absolutely true or false).
I accept that this logical flow wasn't as clear as it could have been. Perhaps I should have started off by talking about models, and only then introduced fuzzy truth-values? But I needed the concept of fuzzy truth-values to explain why models are actually different from propositions at all, so idk.
I also accept that "something like fuzzy truth-values" is kinda undefined here, and am mostly punting that to a successor post.
Suppose you have two models of the earth; one is a sphere, one is an ellipsoid. Both are wrong, but they're wrong in different ways. Now, we can operationalize a bunch of different implications of these hypotheses, but most of the time in science the main point of operationalizing the implications is not to choose between two existing models, or because we care directly about the operationalizations, but rather to come up with a new model that combines their benefits.
IMO all of the "smooth/sharp" and "soft/hard" stuff is too abstract. When I concretely picture what the differences between them are, the aspect that stands out most is whether the takeoff will be concentrated within a single AI/project/company/country or distributed across many AIs/projects/companies/countries.
This is of course closely related to debates about slow/fast takeoff (as well as to the original Hanson/Yudkowsky debates). But using this distinction instead of any version of the slow/fast distinction has a few benefits:
- If someone asks "why should I care about slow/fast takeoff?" a lot of the answers will end up appealing to the concentrated/distributed power thing. E.g. you might say "if takeoff is fast that means that there will be a few key points of leverage".
- Being more concrete, I think it will provoke better debates (e.g. how would a single AI lab concretely end up outcompeting everyone else?)
- This framing naturally concentrates the mind on an aspect of risk (concentration of power) that is concerning from both a misuse and a misalignment perspective.
Well, the whole point of national parks is that they're always going to be unproductive because you can't do stuff in them.
If you mean in terms of extracting raw resources, maybe (though presumably a bunch of mining/logging etc in national parks could be pretty valuable) but either way it doesn't matter because the vast majority of economic productivity you could get from them (e.g. by building cities) is banned.
Nothing makes humans all that special
This is just false. Humans are at the very least privileged in our role as biological bootloaders of AI. The emergence of written culture, industrial technology, and so on, are incredibly special from a historical perspective.
You only set aside occasional low-value fragments for national parks, mostly for your own pleasure and convenience, when it didn't cost too much?
Earth as a proportion of the solar system's planetary mass is probably comparable to national parks as a proportion of the Earth's land, if not lower.
It seems very plausible to me that alignment targets in practice will evolve out of things like the OpenAI Model Spec. If anyone has suggestions for how to improve that, please DM me.
we both agree it would not make sense to model OpenAI as part of the same power base
Hmm, I'm not totally sure. At various points:
- OpenAI was the most prominent group talking publicly about AI risk
- Sam Altman was the most prominent person talking publicly about large-scale AI regulation
- A bunch of safety-minded people at OpenAI were doing OpenAI's best capabilities work (GPT-2, GPT-3)
- A bunch of safety-minded people worked on stuff that led to ChatGPT (RLHF, John Schulman's team in general)
- Elon tried to take over, and the people who opposed that were (I'm guessing) a coalition of safety people and the rest of OpenAI
It's really hard to step out of our own perspective here, but when I put myself in the perspective of, say, someone who doesn't believe in AGI at all, these all seem pretty indicative of a situation where OpenAI and AI safety people were to a significant extent building a shared power base, and just couldn't keep that power base together.
The mistakes can (somewhat) be expressed in the language of Bayesian rationalism by doing two things:
- Talking about partial hypotheses rather than full hypotheses. You can't have a prior over partial hypotheses, because several of them can be true at once (though you can still assign them credences and update those credences according to evidence).
- Talking about models with degrees of truth rather than just hypotheses with degrees of likelihood. E.g. when using a binary conception of truth, general relativity is definitely false because it's inconsistent with quantum phenomena. Nevertheless, we want to say that it's very close to the truth. In general this is more of an ML approach to epistemology (we want a set of models with low combined loss on the ground truth).
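As a minimal illustration of the "low loss rather than binary truth" framing (with entirely made-up data and models, nothing specific to general relativity):

```python
import numpy as np

# Hypothetical observations of some quantity y as a function of x.
x = np.linspace(0, 1, 50)
y_true = x**2                # stand-in for the ground truth

# Two imperfect models: under a binary notion of truth both are simply "false".
model_a = 0.9 * x**2         # slightly miscalibrated, right functional form
model_b = x                  # wrong functional form entirely

loss_a = np.mean((model_a - y_true) ** 2)   # ~0.002
loss_b = np.mean((model_b - y_true) ** 2)   # ~0.033

# Scoring by loss recovers the sense in which model_a is much closer to the truth,
# even though neither is exactly true.
print(loss_a, loss_b)
```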
Scott Garrabrant just convinced me that my notion of conservatism was conflating two things:
- Obligations to (slash constraints imposed by) the interests of existing agents.
- The assumption that large agents would grow in a bottom-up way (e.g. by merging smaller agents) rather than in a top-down way (e.g. by spinning up new subagents).
I mainly intend conservatism to mean the former.
Whose work is relevant, according to you?
If you truly aren't trying to make AGI, and you truly aren't trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) ...great! That's neither capabilities nor alignment research afaict, but basic science.
Consider Chris Olah, who I think has done more than almost anyone else to benefit alignment. It would be very odd if we had a definition of alignment research where you could read all of Chris's interpretability work and still not know whether or not he's an "alignment researcher". On your definition, when I read a paper by a researcher I haven't heard of, I don't know anything about whether it's alignment research or not until I stalk them on facebook and find out how socially proximal they are to the AI safety community. That doesn't seem great.
Back to Chris. Because I've talked to Chris and read other stuff by him, I'm confident that he does care about alignment. But I still don't know whether his actual motivations are more like 10% intrinsic interest in how neural networks work and 90% in alignment, or vice versa, or anything in between. (It's probably not even a meaningful thing to measure.) It does seem likely to me that the ratio of how much intrinsic interest he has in how neural networks work, to how much he cares about alignment, is significantly higher than that of most alignment researchers, and I don't think that's a coincidence—based on the history of science (Darwin, Newton, etc) intrinsic interest in a topic seems like one of the best predictors of actually making the most important breakthroughs.
In other words: I think your model of what produces more useful research from an alignment perspective overprioritizes towards first-order effects (if people care more they'll do more relevant work) and ignores the second-order effects that IMO are more important (1. Great breakthroughs seem, historically, to be primarily motivated by intrinsic interest; and 2. Creating research communities that are gatekept by people's beliefs/motivations/ideologies is corrosive, and leads to political factionalism + ingroupiness rather than truth-seeking.)
I'm not primarily trying to judge people, I'm trying to exhort people
Well, there are a lot of grants given out for alignment research. Under your definition, those grants would only be given to people who express the right shibboleths.
I also think that the best exhortation of researchers mostly looks like nerdsniping them, and the way to do that is to build a research community that is genuinely very interested in a certain set of (relatively object-level) topics. I'd much rather an interpretability team hire someone who's intrinsically fascinated by neural networks (but doesn't think much about alignment) than someone who deeply cares about making AI go well (but doesn't find neural nets very interesting). But any step in the pipeline that prioritizes "alignment researchers" (like: who gets invited to alignment workshops, who gets alignment funding or career coaching, who gets mentorship, etc) will prioritize the latter over the former if they're using your definition.
I think we're interpreting "pluralism" differently. Here are some central illustrations of what I consider to be the pluralist perspective:
- the Catholic priest I met at the Parliament of World Religions who encouraged someone who had really bad experiences with Christianity to find spiritual truth in Hinduism
- the passage in the Quran that says the true believers of Judaism and Christianity will also be saved
- the Vatican calling the Buddha and Jesus great healers
If I change "i.e. the pluralist focus Alex mentions" to "e.g. the pluralist focus Alex mentions" does that work? I shouldn't have implied that all people who believe in heuristics recommended by many religions are pluralists (in your sense). But it does seem reasonable to say that pluralists (in your sense) believe in heuristics recommended by many religions, unless I'm misunderstanding you. (In the examples you listed these would be heuristics like "seek spiritual truth", "believe in (some version of) God", "learn from great healers", etc.)
I think this doesn't work for people with IQ <= 100, which is about half the world. I agree that an understanding of these insights is necessary to avoid incorporating the toxic parts of Christianity, but I think this can be done even using the language of Christianity. (There's a lot of latitude in how one can interpret the Bible!)
I personally don't have a great way of distinguishing between "trying to reach these people" and "trying to manipulate these people". In general I don't even think most people trying to do such outreach genuinely know whether their actual motivations are more about outreach or about manipulation. (E.g. I expect that most people who advocate for luxury beliefs sincerely believe that they're trying to help worse-off people understand the truth.) Because of this I'm skeptical of elite projects that have outreach as a major motivation, except when it comes to very clearly scientifically-grounded stuff.
What if your research goal is "I'd like to understand how neural networks work?" This is not research primarily about how to make AIs aligned. We tend to hypothesize, as a community, that it will help with alignment more than it helps with capabilities. But that's not an inherent part of the research goal for many interpretability researchers.
(Same for "I'd like to understand how agency works", which is a big motivation for many agent foundations researchers.)
Conversely, what if your research goal is "I'm going to design a training run that will produce a frontier model, so that we can study it to advance alignment research"? Seems odd, but I'd bet that (e.g.) a chunk of Anthropic's scaling team thinks this way. Counts as alignment under your definition, since that's the primary goal of the research.
More generally, I think it's actually a very important component of science that people judge the research itself, not the motivations behind it—since historically scientific breakthroughs have often come from people who were disliked by establishment scientists. A definition that basically boils down to "alignment research is whatever research is done by the people with the right motivations" makes it very easy to prioritize the ingroup. I do think that historically being motivated by alignment has correlated with choosing valuable research directions from an alignment perspective (like mech interp instead of more shallow interp techniques) but I think we can mostly capture that difference by favoring more principled, robust, generalizable research (as per my definitions in the post).
Whereas I don't think it's particularly important that e.g. people switch from scalable oversight to agent foundations research. (In fact it might even be harmful lol)
I agree. I'll add a note in the post saying that the point you end up on the alignment spectrum should also account for feasibility of the research direction.
Though note that we can interpret your definition as endorsing this too: if you really hate the idea of making AIs more capable, then that might motivate you to switch from scalable oversight to agent foundations, since scalable oversight will likely be more useful for capabilities progress.
Some quick reactions:
- I believe that religions contain a lot of psychological and cultural insight, and it's plausible to me that many of them contain many of the same insights.
- Religions can be seen as solutions to the coordination problem of how to get many very different people to trust each other. However, most of them solve it in a way which conflicts with other valuable cultural technologies (like science, free speech, liberal democracy, etc). I'm also sympathetic to the Nietzschean critique that they solve it in a way which conflicts with individual human agency and flourishing.
- Religions are, historically, a type of entity that consolidate power. E.g. Islam right now has a lot of power over a big chunk of the world. We should expect that the psychological insights within religions (and even the ones shared across religions) have been culturally selected in part for allowing those religions to gain power.
So my overall position here is something like: we should use religions as a source of possible deep insights about human psychology and culture, to a greater extent than LessWrong historically has (and I'm grateful to Alex for highlighting this, especially given the social cost of doing so).
But we shouldn't place much trust in the heuristics recommended by religions, because those heuristics will often have been selected for some combination of:
- Enabling the religion as a whole (or its leaders) to gain power and adherents.
- Operating via mechanisms that break in the presence of science, liberalism, individualism, etc (e.g. the mechanism of being able to suppress criticism).
- Operating via mechanisms that break in the presence of abrupt change (which I expect over the coming decades).
- Relying on institutions that have become much more corrupt over time.
Where the difference between a heuristic and an insight is something like the difference between "be all-forgiving" and "if you are all-forgiving it'll often defuse a certain type of internal conflict". Insights are about what to believe, heuristics are about what to do. Insights can be cross-checked against the rest of our knowledge, heuristics are much less legible because in general they don't explain why a given thing is a good idea.
IMO this all remains true even if we focus on the heuristics recommended by many religions, i.e. the pluralistic focus Alex mentions. And it remains true even given the point Alex made near the end: that "for people in Christian Western culture, I think using the language of Christianity in good ways can be a very effective way to reach the users." Because if you understand the insights that Christianity is built upon, you can use those to reach people without the language of Christianity itself. And if you don't understand those insights, then you don't know how to avoid incorporating the toxic parts of Christianity.
Fair point. I've now removed that section from the post (and also, unrelatedly, renamed the post).
I was trying to make a point about people wanting to ensure that AI in general (not just current models) is "aligned", but in hindsight I think people usually talk about alignment with human values or similar. I have some qualms about that but will discuss in a different post.
Afaict, the current world we’re in is basically the worst case scenario
the status quo is not, imo, a remotely acceptable alternative either
Both of these quotes display types of thinking which are typically dangerous and counterproductive, because they rule out the possibility that your actions can make things worse.
The current world is very far from the worst-case scenario (even if you have very high P(doom), it's far away in log-odds) and I don't think it would be that hard to accidentally make things considerably worse.
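To spell out the log-odds point with a quick calculation (the specific probabilities are just for illustration):

```python
import math

def log_odds(p):
    """Log-odds (logit) of a probability, in natural-log units."""
    return math.log(p / (1 - p))

for p in [0.5, 0.9, 0.99, 0.999]:
    print(p, round(log_odds(p), 1))
# 0.5 -> 0.0, 0.9 -> 2.2, 0.99 -> 4.6, 0.999 -> 6.9
# Moving from 90% doom to 99% doom is roughly as big a jump in log-odds as moving
# from 50% to 90%, and the literal worst case (probability 1) is infinitely far away.
```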
AGI is heavy-tailed in both directions I think. I don't think we get utopias by default even without misalignment, since governance of AGI is so complicated.
(Vague, speculative thinking): Is the time element of UDT actually a distraction? Consider the following: agents A and B are in a situation where they'd benefit from cooperation. Unfortunately, the situation is complicated—it's not like a prisoner's dilemma, where there's a clear "cooperate" and a clear "defect" option. Instead they need to take long sequences of actions, and they each have many opportunities to subtly gain an advantage at the other's expense.
Therefore instead of agreements formulated as "if you do X I'll do Y", it'd be far more beneficial for them to make agreements of the form "if you follow the advice of person Z then I will too". Here person Z needs to be someone that both A and B trust to be highly moral, neutral, competent, etc. Even if there's some method of defecting that neither of them considered in advance, at the point in time when it arises Z will advise against doing it. (They don't need to actually have access to Z, they can just model what Z will say.)
If A and B don't have much communication bandwidth between them (e.g. they're trying to do acausal coordination) then they will need to choose a Z that's a clear Schelling point, even if that Z is suboptimal in other ways.
UDT can be seen as the special case where A and B choose Z as follows: "keep forgetting information until you don't know if you're A or B". If A and B are different branches of the same agent, then the easiest way to do this is just to let Z be their last common ancestor. (Coalitional agency can be seen as an implementation of this.) If they're not, then they'll also need to coordinate on a way to make sure they'll forget roughly the same things.
But there are many other ways of picking Schelling Zs. For example, if A and B follow the same religion, then the central figure in that religion (Jesus, Buddha, Mohammad, etc) is a clear Schelling point.
EDIT: Z need not be one person, it could be a group of people. E.g. in the UDT case, if there are several different orders in which A and B could potentially forget information, then they could just do all of them and then follow the aggregated advice of the resulting council. Similarly, even if A and B aren't of the same religion, they could agree to follow whatever compromise their respective religions' central figures would have come to.
EDIT 2: UDT is usually prone to commitment races because it thinks of each agent in a conflict as separately making commitments earlier in logical time. But focusing on symmetric commitments gets rid of this problem.
I recently had a very interesting conversation about master morality and slave morality, inspired by the recent AstralCodexTen posts.
The position I eventually landed on was:
- Empirically, it seems like the world is not improved the most by people whose primary motivation is helping others, but rather by people whose primary motivation is achieving something amazing. If this is true, that's a strong argument against slave morality.
- The defensibility of morality as the pursuit of greatness depends on how sophisticated our cultural conceptions of greatness are. Unfortunately we may be in a vicious spiral where we're too entrenched in slave morality to admire great people, which makes it harder to become great, which gives us fewer people to admire, which... By contrast, I picture past generations as being in a constant aspirational dialogue about what counts as greatness—e.g. defining concepts like honor, Aristotelian magnanimity ("greatness of soul"), etc.
- I think of master morality as a variant of virtue ethics which is particularly well-adapted to domains which have heavy positive tails—entrepreneurship, for example. However, in domains which have heavy negative tails, the pursuit of greatness can easily lead to disaster. In those domains, the appropriate variant of virtue ethics is probably more like Buddhism: searching for equanimity or "green". In domains which have both (e.g. the world as a whole) the closest thing I've found is the pursuit of integrity and attunement to oneself. So maybe that's the thing that we need a cultural shift towards understanding better.
Nice post. I'm excited about the bargaining interpretation of UDT.
However, if we think of our probability for the coin-flip as the result of bargaining, it makes sense that it might be sensitive to size. The negotiation which was willing to trade $100 from one branch to get $10,000 in another branch need not be equally willing to perform that trade arbitrarily many times.
Given this, is there any reason to focus on iterated counterfactual mugging, as opposed to just counterfactual muggings with higher stakes?
It seems like iteration is maybe related to learning. That doesn't make a difference for counterfactual mugging, because you'll learn nothing relevant over time.
For counterlogical muggings about the Nth digit of pi, we can imagine a scenario where you would have learned the Nth digit of pi after 1000 days, and therefore wouldn't have paid if Omega had first offered you the deal on the 1001st day. But now it's confounded by the fact that he already told you about it... So maybe there's something here where you stop taking the deal on the day when you would have found out the Nth digit of pi if Omega hadn't appeared?
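For reference, the basic arithmetic of the single-shot mugging, using the $100/$10,000 figures from the quoted passage (a toy sketch, nothing new):

```python
# Single counterfactual mugging with the quoted stakes.
p_heads = 0.5
payoff_if_heads = 10_000   # what Omega gives you in the heads branch if you're a payer
cost_if_tails = -100       # what you hand over in the tails branch

ev_pay_ex_ante = p_heads * payoff_if_heads + (1 - p_heads) * cost_if_tails   # +4950
ev_pay_after_tails = cost_if_tails                                           # -100

# Paying looks great from the ex-ante perspective and terrible after you learn the
# coin came up tails; the updateless move is to keep evaluating from ex ante.
print(ev_pay_ex_ante, ev_pay_after_tails)
```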
- Trivially, I can say that you still have the problem when everyone needs to bootstrap a Schelling veil of ignorance.
I don't understand your point here, explain?
even behind the most simple/Schelling veils of ignorance, I find it likely that hawkish commitments are incentivized. For example, the veil might say that you might be Powerful agent A, or Weak agent B, and if some Powerful agents have weird enough utilities (and this seems likely in a big pool of agents), hawkishly committing in case you are A will be a net-positive bet.
This seems to be claiming that in some multiverses, the gains to powerful agents from being hawkish outweigh the losses to weak agents. But then why is this a problem? It just seems like the optimal outcome.
Here's a (messy, haphazard) list of ways a group of idealized agents could merge into a single agent:
Proposal 1: they merge into an agent which maximizes a weighted sum of their utilities. They decide on the weights using some bargaining solution.
Objection 1: this is not Pareto-optimal in the case where the starting agents have different beliefs. In that case we want:
Proposal 2: they merge into an agent which maximizes a weighted sum of their utilities, where those weights are originally set by bargaining but evolve over time depending on how accurately each original agent predicted the future.
Objection 2: this loses out on possible gains from acausal trade. E.g. if a paperclip-maximizer finds itself in a universe where it's hard to make paperclips but easy to make staples, it'd like to be able to give resources to staple-maximizers in exchange for them building more paperclips in universes where that's easier. This requires a kind of updateless decision theory:
Proposal 3: they merge into an agent which maximizes a weighted sum of their utilities (with those weights evolving over time), where the weights are set by bargaining subject to the constraint that each agent obeys commitments that logically earlier versions of itself would have made.
Objection 3: this faces the commitment races problem, where each agent wants to make earlier and earlier commitments to only accept good deals.
Proposal 4: same as proposal 3 but each agent also obeys commitments that they would have made from behind a veil of ignorance where they didn't yet know who they were or what their values were. From that position, they wouldn't have wanted to do future destructive commitment races.
Objection 4: as we take this to the limit we abstract away every aspect of each agent—their values, beliefs, position in the world, etc—until everything is decided by their prior from behind a veil of ignorance. But when you don't know who you are, or what your values are, how do you know what your prior is?
Proposal 5: all these commitments are only useful if they're credible to other agents. So, behind the veil, choose a Schelling prior which is both clearly non-cherrypicked and also easy for a wide range of agents to reason about. In other words, choose the prior which is most conducive to cooperation across the multiverse.
Okay, so basically we've ended up describing not just an ideal agent, but the ideal agent. The cost of this, of course, is that we've made it totally computationally intractable. In a later post I'll describe some approximations which might make it more relevant.
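As a concrete (and heavily simplified) sketch of Proposal 2: the weights start from a hypothetical bargaining outcome and then get updated multiplicatively by each agent's predictive accuracy, as in a Bayesian mixture. The class and variable names are just illustrative.

```python
import numpy as np

class MergedAgent:
    """Toy version of Proposal 2: maximize a weighted sum of the original agents'
    utilities, with weights that start at bargained values and evolve according
    to how well each original agent predicted observed outcomes."""

    def __init__(self, utilities, predictors, initial_weights):
        self.utilities = utilities        # list of functions: action -> utility for agent i
        self.predictors = predictors      # list of functions: outcome -> probability agent i assigned to it
        self.weights = np.array(initial_weights, dtype=float)
        self.weights /= self.weights.sum()

    def choose(self, actions):
        """Pick the action maximizing the current weighted sum of utilities."""
        scores = [sum(w * u(a) for w, u in zip(self.weights, self.utilities))
                  for a in actions]
        return actions[int(np.argmax(scores))]

    def observe(self, outcome):
        """Reweight agents by the probability they assigned to what actually happened."""
        likelihoods = np.array([p(outcome) for p in self.predictors])
        self.weights *= likelihoods
        self.weights /= self.weights.sum()
```

Objections 2-4 are then about which further constraints to put on choose (e.g. honoring commitments that logically earlier versions would have made), which this sketch doesn't capture.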
The part that matters is that squiggles/paperclips are of no value even from a very cosmopolitan and embracing perspective on value.
I agree that the particular type of misaligned goal is not crucial. I'm thinking of molecular squiggles as an unusually clean type of misalignment to make arguments about, because it's very clear that they're not valuable. If you told me that molecular squiggles weren't a central example of a goal that you think a misaligned superintelligence might have, then I'd update, but it sounds like your statements are consistent with this.
A very wide variety of utility functions you could run over the outside universe will have optima around making lots of small things because each thing scores one point, and so to score as many points as possible, each thing is as small as it can be and still count as a thing. Nothing downstream depends on this part coming true and there are many ways for it to come false; but the part where the squiggles are small and molecular is an obvious kind of guess.
If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value "small molecular squiggles" versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1:1000 or something else?
They could be bigger and more complicated, like building giant mechanical clocks.
Clocks are not actually very complicated; how plausible is it on your model that these goals are as complicated as, say, a typical human's preferences about how human civilization is structured?
This is a good point. Though the thing about true beliefs is that there is a specific version of them that's true, which you're allowed to defend (if you can find it). And so you can more easily figure out what the implications are.
Whereas for false beliefs you can't get into the specifics, because looking hard enough at the specifics will tend to disprove the belief.
Unfortunately the way that taboos work is by surrounding the whole topic in an aversive miasma. If you could carefully debate the implications of X, then that would provide an avenue for disproving X, which would be unacceptable. So instead this process tends to look more like "if you don't believe Y then you're probably the sort of terrible person who believes ~X", and now you're tarred with the connotation even if you try to carefully explain why you actually have different reasons for not believing Y (which is what you'd likely say either way).
Fair, this isn't a confident claim from me. I do have a sense that the last decade has been particularly bad in terms of blatant preference falsification, but it's hard to distinguish "the world was different before then" from "I was younger and didn't have a great sense of what was going on".
I think this used to be a tenable position a decade or two ago. But I think it's no longer tenable, due to the dynamic described in this tweet:
Suppose an ideology says you're not allowed to question idea X. At first X might not be very important. But now when people want to argue for Y, "X->Y" and "~Y->~X" are both publicly irrefutable. So over time X will become more and more load-bearing for censorious ideologies.
We can also think of this as a variant of Goodhart's law, which I'll call ideological Goodhart (and have just tweeted about here): any false belief that cannot be questioned by adherents of an ideology will become increasingly central to that ideology. As this process plays out, advocates of that ideology will adopt increasingly extreme positions, and support increasingly crazy policies.
Well, the prior should weaken the extent to which we believe any given set of reasons for why it's different this time.
Relevant: my post on value systematization
Though I have a sneaking suspicion that this comment was originally made on a draft of that?
I disagree with the first one. I think that the spectrum of human-level AGI is actually quite wide, and that for most tasks we'll get AGIs that are better than most humans significantly before we get AGIs that are better than all humans. But the latter is much more relevant for recursive self-improvement, because it's bottlenecked by innovation, which is driven primarily by the best human researchers. E.g. I think it'd be pretty difficult to speed up AI progress dramatically using millions of copies of an average human.
Also, by default I think people talk about FOOM in a way that ignores regulations, governance, etc. Whereas in fact I expect these to put significant constraints on the pace of progress after human-level AGI.
If we have millions of copies of the best human researchers, without governance constraints on the pace of progress... Then compute constraints become the biggest thing. It seems plausible that you get a software-only singularity, but it also seems plausible that you need to wait for AI-driven innovation in chip manufacturing to actually cash out in the real world.
I broadly agree with the second one, though I don't know how many people there are left with 30-year timelines. But 20 years to superintelligence doesn't seem unreasonable to me (though it's above my median). In general I've updated lately that Kurzweil was more right than I used to think about there being a significant gap between AGI and ASI. Part of this is because I expect the problem of multi-agent credit assignment over long time horizons to be difficult.
In the last 24 hours. I read fast (but also skipped the last third of the Doomsday Machine).
This comment prompted me to read both Secrets and also The Doomsday Machine by Ellsberg. Both really great, highly recommend.