On your "WHY", you seem to be presenting reasons why other people not believing your model shouldn't count as strong evidence against it. Which is all fair. But I'm still curious for positive evidence to believe your model in the first place. Maybe this would be obvious if I knew more biology, but as it is, I don't know why I should place higher credence in your model than any other model (e.g. the one at the bottom of this comment, if that counts).
...Then that bimodal response could directly and cleanly justify claiming "antibody response was 3.5-fold higher" in some very fuzzy and general way (because 28% x 3.5 = 98%)
As far as I can tell, "antibody response was 3.5-fold higher" just means that, on average, people in the extended dosing schedule had 3.5x more antibodies. I can't tell whether you interpret it in some other way, or if you think this is a misleading way to describe things, or if you're making some other point...?
The graph you included as a supporting claim was, I think, just the B panel from the totality of Figure 2 which is nice in many ways.
The data in Panel A therefore seems consistent to me that "eventually" there is some roughly normal and acceptable level of "vaccinated at all, in an essentially bimodal way" that two doses reaches faster than typical?
Ok now I'm confused.
Do you think that all people on these graphs have reached a "normal and acceptable level of 'vaccinated at all, in an essentially bimodal way' "?
If so, do you not think that there's any important immunity difference between a single-vaccinated person around 1-10 on the graph, or a doubly-vaccinated person around 1000-10000?
Or if you think that only some of the people on this graph are immune, where do you think the line between immune and not-immune should be drawn on these graphs? (The distribution seems to be fairly continuous everywhere, to me, so it seems arbitrary to draw the line anywhere.)
Or if you think the important immunity difference isn't captured by antibody-levels, what is it about?
And re "that two doses reaches faster than typical"; are you implying that the single-dosed people's antibody response would've kept increasing beyond the 5-6 week mark and eventually gotten as high as the doubly-vaccinated people? That seems unlikely to me. (Other than maybe the few people where their antibodies did increase, but I'm happy to ignore them until I understand the most normal response curve better.)
My hunch is that extended Bleed3 would show a decline from the extended Bleed2 measurement...
The thing I'm asking for is: what's the best second epicycle to add? What is the mechanism? If someone is already seroconverted, what would you measure to detect "that their mechanistic biological state is not ALREADY in the configuration that you'd be hoping to cause to improve via the administration of a third dose"?
Here's one suggestion:
1. The more antibodies you have, the less probability of getting sick, the less probability of getting severe disease, etc.
2. More vaccines increases the number of antibodies you have.
3. Therefore you want to have more vaccines.
I would've thought (1) to be fairly uncontroversial? And the linked study seems to provide good evidence for (2) when going from 1 to 2 doses, increasing antibodies by roughly a factor of 100. And of course adding more vaccines will eventually stop adding more antibodies. But right now I don't have any reason to believe in a big difference between going from 1->2 vaccines vs going from 2->3 vaccines (other than 2 vaccines being the general standard). So I wouldn't be surprised if taking a 3rd vaccine could increase your antibodies by another order of magnitude.
Maybe you think this doesn't provide enough of a "mechanism"? Biology being complicated, I'm very happy to take empirical data for what it is, and make extrapolations even if I don't know what the mechanism is. Personally, I also don't feel like I have any more mechanism for "vaccines have a fixed probability of causing antibodies if you don't already have them, otherwise they don't do much" than "vaccines typically increase antibodies by a lot regardless of whether you have them or not". So when the evidence clearly indicates the latter, I will definitely believe it.
And yeah, also, if someone has the option, I agree that it seems probably better to get a different vaccine than the same vaccine again!
Huh, I'm pretty surprised by this model. Why do you think it's correct?
Here's an image of some measure of people's antibody responses from page 9 of this paper, where the first set of points is people's response 5-6 wks after dose 1, and the second set of points is people's response 2-3 wks after dose 2.
It looks like people who get an antibody response to the first dose still get a much improved response from a second dose. And there's no sign of a bimodal response to any of the doses. Is that consistent with your model?
Also, the way vaccines can protect from severe disease without protecting from infection seems to suggest that there's more than a binary question of response/not-response.
(The human baseline is a loss of 0.7 bits, with lots of uncertainty on that figure.)
I'd like to know what this figure is based on. In the linked post, Gwern writes:
The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character.
But in that linked post, there's no mention of "0.7" bits in particular, as far as I or cmd-f can see. The most relevant passage I've read is:
Claude Shannon found that each character was carrying more like 1 (0.6-1.3) bit of unguessable information (differing from genre to genre); Hamid Moradi found 1.62-2.28 bits on various books; Brown et al 1992 found <1.72 bits; Teahan & Cleary 1996 got 1.46; Cover & King 1978 came up with 1.3 bits; and Behr et al 2002 found 1.6 bits for English and that compressibility was similar to this when using translations in Arabic/Chinese/French/Greek/Japanese/Korean/Russian/Spanish (with Japanese as an outlier). In practice, existing algorithms can make it down to just 2 bits to represent a character, and theory suggests the true entropy was around 0.8 bits per character.
I'm not sure what the relationship is between supposedly unguessable information and human performance, but assuming that all these sources were actually just estimating human performance, and without looking into the sources more... this isn't just lots of uncertainty, but vast amounts of uncertainty, where it's very plausible that GPT-3 has already beaten humans. This wouldn't be that surprising, given that GPT-3 must have memorised a lot of statistical information about how common various words are, which humans certainly don't know by default.
I have a lot of respect for people looking into a literature like this and forming their own subjective guess, but it'd be good to know if that's what happened here, or if there is some source that pinpoints 0.7 in particular as a good estimate.
Terminologically, I like topic/content/purpose. Where 'purpose' includes potential results from the job (including pay) and how much you care about and are motivated by them. It could be difficult to split content and purpose, though. E.g. being able to see and talk with the people you're helping could be very motivating, but it doesn't fit purely into either content or purpose.
SIA isn't needed for that; standard probability theory will be enough (as our becoming grabby is evidence that grabbiness is easier than expected, and vice-versa).
I think there's a confusion with SIA and reference classes and so on. If there are no other exact copies of me, then SIA is just standard Bayesian update on the fact that I exist. If theory T_i has prior probability p_i and gives a probability q_i of me existing, then SIA changes its probability to q_i*p_i (and renormalises).
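To make that update rule concrete, here is a minimal sketch with made-up numbers. The two theories, their priors p_i, and their existence probabilities q_i are purely illustrative assumptions, not estimates of anything.

```python
def sia_update(priors, existence_probs):
    """Return posteriors proportional to q_i * p_i, renormalised to sum to 1."""
    weights = [p * q for p, q in zip(priors, existence_probs)]
    total = sum(weights)
    if total == 0:
        raise ValueError("no theory gives any probability to my existing")
    return [w / total for w in weights]


# Hypothetical example: two theories with equal priors, where the first
# makes my existence ten times as likely as the second does.
priors = [0.5, 0.5]            # p_i (illustrative)
existence_probs = [0.1, 0.01]  # q_i (illustrative)
posterior = sia_update(priors, existence_probs)
print(posterior)
```

With no exact copies of me in play, this is nothing more than Bayes' rule with "I exist" as the observation.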
Yeah, I agree with all of that. In particular, SIA updating on us being alive on Earth is exactly as if we sampled a random planet from space, discovered it was Earth, and discovered it had life on it. Of course, there are also tons of planets that we've seen that don't look like they have life on them.
But "Earth is special" theories also get boosted: if a theory claims life is very easy but only on Earth-like planets, then those also get boosted.
I sort-of agree with this, but I don't think it matters in practice, because we update down on "Earth is unlikely" when we first observe that the planet we sampled was Earth-like.
Here's a model: Assume that there's a conception of "Earth-like planet" such that life-on-Earth is exactly equal evidence for life emerging on any Earth-like planet, and 0 evidence for life emerging on other planets. This is clearly a simplification, but I think it generalises. "Earth-like planet" could be any rocky planet, any rocky planet with water, any rocky planet with water that was hit by an asteroid X years into its lifespan, etc.
Now, if we sample a planet (Earth) and notice that it's Earth-like and has life on it, we do two updates:
Noticing that Earth is an Earth-like planet should update us towards thinking that Earth-like planets are common in the universe.
Noticing that life emerged on Earth should update us towards thinking that life has a high probability of emerging on Earth-like planets.
If we don't know anything else about the universe yet, these two updates should collectively imply an update towards life-is-common that is just as big as if we hadn't done this decomposition, and just updated on the hypothesis "how common is life?" in the first place.
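Here is a rough sketch of why the decomposition shouldn't change the answer for this first observation. It assumes a toy grid of hypotheses with a uniform prior, and assumes life only appears on Earth-like planets, so the fraction of planets with life is L = f * p. The grid values and variable names are my own illustrative choices.

```python
import itertools
from collections import defaultdict
from fractions import Fraction

# Illustrative hypotheses: f = fraction of planets that are Earth-like,
# p = probability that life emerges on an Earth-like planet.
grid = [Fraction(k, 10) for k in range(1, 11)]
hypotheses = list(itertools.product(grid, grid))
prior = Fraction(1, len(hypotheses))  # uniform prior over (f, p) pairs

# Decomposed route: update on "it's Earth-like" (likelihood f),
# then on "it has life" (likelihood p).
joint = {(f, p): prior * f * p for f, p in hypotheses}
norm = sum(joint.values())
mean_decomposed = sum(w * f * p for (f, p), w in joint.items()) / norm

# Direct route: collapse to L = f * p first, then update once with likelihood L.
prior_on_L = defaultdict(Fraction)
for f, p in hypotheses:
    prior_on_L[f * p] += prior
post_on_L = {L: w * L for L, w in prior_on_L.items()}
norm_L = sum(post_on_L.values())
mean_direct = sum(w * L for L, w in post_on_L.items()) / norm_L

prior_mean = sum(prior * f * p for f, p in hypotheses)
print(float(prior_mean), float(mean_decomposed), float(mean_direct))
```

The two routes agree because the likelihood of "sampled planet is Earth-like and has life" depends on (f, p) only through the product f * p. This only covers the first observation; it doesn't settle the later claim about sampling further planets.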
Now, let's say we start observing the rest of the universe. Let's assume this happens via sampling random planets and observing (a) whether they are/aren't Earth-like (b) whether they do/don't have life on them.
If we sample a non-Earth-like planet, we update towards thinking that Earth-like planets aren't common.
If we sample an Earth-like planet without life, we update towards thinking that Earth-like planets have a lower probability of supporting life.
I haven't done the math, but I'm pretty sure that it doesn't matter which of these we observe. The update on "How common is life?" will be the same regardless. So the existence of "Earth is special"-hypotheses doesn't matter for our best-guess of "How common is life?", if we only consider the impact of observing planets with/without Earth-like features and life.
Of course, observing planets isn't the only way we can learn about the universe. We can also do science, and reason about the likely reasons that life emerged, and how common those things ought to be.
That means that if you can come up with a strong theoretical argument (that isn't just based on observing how many planets are Earth-like and/or had life on them, including Earth) that some feature of Earth significantly boosts the probability of life and that that feature is extremely rare in the universe at-large, then that would be a solid argument for why to expect life to be rare in the universe. However, note that you'd have to argue that it was extremely rare. If we're assuming that grabby aliens could travel over many galaxies, then we've already observed evidence that grabby life is sufficiently rare to not yet have appeared in any of a very large number of planets in any of a very large number of galaxies. Your theoretical reasons to expect life to be rare would have to assert that it's even rarer than that to impact the results.
Good point, I didn't think about that. That's the old SIA argument for there being a late filter.
The reason I didn't think about it is because I use SIA-like reasoning in the first place because it pays attention to the stakes in the right way: I think I care about acting correctly in universes with more copies of me almost-proportionally more. But I also care more about universes where civilisations-like-Earth are more likely to colonise space (ie become grabby), because that means that each copy of me can have more impact. That kind-of cancels out the SIA argument for a late filter, mostly leaving me with my priors, which points toward a decent probability that any given civilisation colonises space in a grabby manner.
Also: if Earth-originating intelligence ever becomes grabby, that's a huge Bayesian update in favor of other civilisations becoming grabby, too. So regardless of how we describe the difference between T1 and T2, SIA will definitely think that T1 is a lot more likely once we start colonising space, if we ever do that.
But by "theory of the universe", Robin Hanson meant not only the theory of how the physical universe was, but the anthropic probability theory. The main candidates are SIA and SSA. SIA is indifferent between T1 and T2. But SSA prefers T1 (after updating on the time of our evolution).
SIA is not indifferent between T1 and T2. There are way more humans in world T1 than in world T2 (since T2 requires life to be very uncommon, which would imply that humans are even more uncommon), so SIA thinks world T1 is much more likely. After all, the difference between SIA and SSA is that SIA thinks that universes with more observers are proportionally more likely; so SIA will always think aliens are more likely than SSA does.
Previously, I thought this was in conflict with the fact that humans didn't seem to be particularly early (ie., if life is common, it's surprising that there aren't any aliens around 13.8 billion years into the universe's life span). I ran the numbers, and concluded that SIA still thought that we'd be very likely to encounter aliens (though most of the linked post instead focuses on answering the decision-relevant question "how much of potentially-colonisable space would be colonised without us?", evaluated ADT-style).
After having read Robin's work, I now think humans probably are quite early, which would imply that (given SIA/ADT) it is highly overdetermined that aliens are common. As you say, Robin's work also implies that SSA agrees that aliens are common. So that's nice: no matter which of these questions we ask, we get a similar answer.
Thanks, computer-speed deliberation being a lot faster than space-colonisation makes sense. I think any deliberation process that uses biological humans as a crucial input would be a lot slower, though; slow enough that it could well be faster to get started with maximally fast space colonisation. Do you agree with that? (I'm a bit surprised at the claim that colonization takes place over "millennia" at technological maturity; even if the travelling takes millennia, it's not clear to me why launching something maximally-fast – that you presumably already know how to build, at technological maturity – would take millennia. Though maybe you could argue that millennia-scale travelling time implies millennia-scale variance in your arrival-time, in which case launching decades or centuries after your competitors doesn't cost you too much expected space?)
If you do agree, I'd infer that your mainline expectation is that we successfully enforce a worldwide pause before mature space-colonisation; since the OP suggests that biological humans are likely to be a significant input into the deliberation process, and since you think that the beaming-out-info schemes are pretty unlikely.
(I take your point that as far as space-colonisation is concerned; such a pause probably isn't strictly necessary.)
I'm curious about how this interacts with space colonisation. The default path of efficient competition would likely lead to maximally fast space-colonisation, to prevent others from grabbing it first. But this would make deliberating together with other humans a lot trickier, since some space ships would go to places where they could never again communicate with each other. For things to turn out ok, I think you either need:
to pause before space colonisation.
to finish deliberating and bargaining before space colonisation.
to equip each space ship with the information necessary for deciding what to do with the space they grab. In order of increasing ambitiousness:
You could upload a few leaders' or owners' brains (or excellent predictive model thereof) and send them along with their respective colonisation ships; hoping that they will individually reach good decisions without discussing with the rest of humanity.
You could also equip each colonisation ship with the uploads of all other human brains that they might want to deliberate with (or excellent predictive models thereof), so that they can use those other humans as discussion partners and data for their deliberation-efforts.
You also set up these uploads in a way that makes them figure out what bargain would have been struck on Earth; and then have each space ship individually implement this. Maybe this happens by default with acausal trade; or maybe everyone in some reasonably big coalition could decide to follow the decision of some specified deliberative process that they don't have time to run on Earth.
to use some communication scheme that lets you send your space ships ahead to compete in space, and then lets you send instructions to your own ships once you've finished deliberating on Earth.
E.g. maybe you could use cryptography to ensure that your space ships will follow instructions signed with the right code; which you only send out once you've finished bargaining. (Though I'm not sure if your bargaining-partners would be able to verify how your space ships would react to any particular message; so maybe this wouldn't work without significant prior coordination.)
I'm curious whether you're optimistic about any of these options, or if you have something else in mind.
(Also, all of this assumes that defensive capabilities are a lot stronger than offensive capabilities in space. If offense is comparably strong, then we also have the problem that the cosmic commons might be burned in wars if we don't pause or reach some other agreement before space colonisation.)
And yet I'd guess that none of these were/are on track to reach human-level intelligence. Agree/disagree?
Uhm, haven't thought that much about it. Not imminently, maybe, but I wouldn't exclude the possibility that they could be on some long-winded path there.
It feels like it really relies on this notion of "pretty smart" though
I don't think it depends that much on the exact definition of "pretty smart". If we have a broader notion of what "pretty smart" is, we'll have more examples of pretty smart animals in our history (most of which haven't reached human-level intelligence). But this means both that the evidence indicates that each pretty smart animal has a smaller chance of reaching human-level intelligence, and that we should expect many more pretty smart animals in the future. E.g. if we've seen 30 pretty smart species (instead of 3) so far, we should expect maybe M=300 pretty smart species (instead of 30) to appear over Earth's history. Humans still evolved from some species in the first 10 percent, which still is an update towards N~=M/10 over N>>M.
The required assumptions for the argument are just:
humans couldn't have evolved from a species with a level of intelligence less than X
species with X intelligence started appearing t years ago in evolutionary history
there are t' years left where we expect such species to be able to appear
we assume the appearance rate of such species to be either constant or increasing over time
Then, "it's easy to get humans from X" predicts t<<t' while "it's devilishly difficult to get humans from X" predicts t~=t' (or t>>t' if the appearance rate is strongly increasing over time). Since we observe t<<t', we should update towards the former.
This is the argument that I was trying to make in the grand-grand-grand-parent. I then reformulated it from an argument about time into an argument about pretty smart species in the grand-parent to mesh better with your response.
The claim I'm making is more like: for every 1 species that reaches human-level intelligence, there will be N species that get pretty smart, then get stuck, where N is fairly large
My point is that – if N is fairly large – then it's surprising that human-level intelligence evolved from one of the first ~3 species that became "pretty smart" (primates, dolphins, and probably something else).
If the Earth's history would contain M>>N pretty smart species, then in expectation human-level intelligence should appear in the N:th species. If Earth's history would contain M<<N pretty smart species, then we should expect human-level intelligence to have equal probability to appear in any of the pretty smart species, so in expectation it should appear in the M/2:th pretty smart species.
Becoming "pretty smart" is apparently easy (because we've had >1 pretty smart species evolve so far) so in the rest of the Earth's history, we would expect plenty more species to become pretty smart. If we expect M to be non-trivial (like maybe 30) then the fact that the 3rd pretty smart species reached human-level intelligence is evidence in favor of N~=2 over N>>M.
(Just trying to illustrate the argument at this point; not confident in the numbers given.)
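To put rough numbers on the shape of this argument, here is a toy sketch. It assumes each pretty smart species independently reaches human-level intelligence with probability 1/N, that M = 30 such species appear over Earth's history, and that we condition on at least one of them succeeding. All of these are illustrative assumptions, and the function name is hypothetical.

```python
def prob_by_kth_given_success(k, n, m):
    """P(first success is among the first k species | some success among m species),
    assuming each species independently succeeds with probability 1/n."""
    fail = 1 - 1 / n
    return (1 - fail ** k) / (1 - fail ** m)


M = 30          # assumed number of pretty smart species over Earth's history
K_OBSERVED = 3  # humans came from roughly the 3rd pretty smart species

for n in (3, 30, 300):
    print(n, round(prob_by_kth_given_success(K_OBSERVED, n, M), 3))

# How much more strongly does "easy" (N=3) predict the observation than "hard" (N=300)?
easy = prob_by_kth_given_success(K_OBSERVED, 3, M)
hard = prob_by_kth_given_success(K_OBSERVED, 300, M)
print(round(easy / hard, 1))
```

When N is much bigger than M, the conditional probability approaches K/M, which is the "equal probability in any of the pretty smart species" case. When N is small, most of the probability mass sits on the earliest species.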
I'm curious about the extent to which you expect the future to be awesome-by-default as long as we avoid all clear catastrophes along the way; vs to what extent you think we just have a decent chance of getting a non-negligible fraction of all potential value (and working to avoid catastrophes is one of the most tractable ways of improving the expected value).
Proposed tentative operationalisation:
World A is just like our world, except that we don't experience any ~GCR on Earth in the next couple of centuries, and we solve the problem of making competitive intent-aligned AI.
In world B, we also don't experience any GCR soon and we also solve alignment. In addition, you and your chosen collaborators get to design and implement some long-reflection-style scheme that you think will best capture the aggregate of human and non-human desires. All coordination and cooperation problems on Earth are magically solved. Though no particular values are forced upon anyone, everyone is happy to stop and think about what they really want, and contribute to exercises designed to illuminate this.
How much better do you think world B is compared to world A? (Assuming that a world where Earth-originating intelligence goes extinct has a baseline value of 0.)
It is intrinsically easier to gather flexible influence in pursuit of some goals, because
1. It's easier to build AIs to pursue goals that are easy to check.
3. It's easier to build institutions to pursue goals that are easy to check.
9. It's easier to coordinate around simpler goals.
plus 4 and 5 insofar as some values require continuously surviving humans to know what to eventually spend resources on, and some don't.
plus 6 insofar as humans are otherwise an important part of the strategic environment, such that it's beneficial to have values that are easy-to-argue.
Jessica Taylor's argument requires that the relevant games are zero sum. Since this isn't true in the real world:
7. A threat of destroying value (e.g. by threatening extinction) could be used as a bargaining tool, with unpredictable outcomes.
~8. Some groups actively want other groups to have fewer resources, in which case they can try to reduce the total amount of resources more or less actively.
~8. Smaller groups have less incentive to contribute to public goods (such as not increasing the probability of extinction), but benefit equally from larger groups' contributions, which may lead them to getting a disproportionate fraction of resources by defecting in public-goods games.
Ah, you were talking about this article. Me and Daniel were saying that "Kolmogorov Complexity" never shows up in the linked ssc article (thinking that Zvi accidentally wrote "Kolmogorov Complexity" when he meant "Kolmogorov Complicity").
Starting with amplification as a baseline; am I correct to infer that imitative generalisation only boosts capabilities, and doesn't give you any additional safety properties?
My understanding: After going through the process of finding z, you'll have a z that's probably too large for the human to fully utilise on their own, so you'll want to use amplification or debate to access it (as well as to generally help the human reason). If we didn't have z, we could train an amplification/debate system on D' anyway, while allowing the human and AIs to browse through D for any information that they need. I don't see how the existence of z makes amplification or debate any more aligned, but it seems plausible that it could improve competitiveness a lot. Is that the intention?
Bonus question: Is the intention only to boost efficiency, or do you think that IA will fundamentally allow amplification to solve more problems? (Ie., solve more problems with non-ridiculous amounts of compute – I'd be happy to count an exponential speedup as the latter.)
It's worth noting that their language model still uses BPEs, and as far as I can tell the encoding is completely optimised for English text rather than code (see section 2). It seems like this should make coding unusually hard compared to the pretraining task; but maybe make pretraining more useful, as the model needs time to figure out how the encoding works.
I'm really surprised at how big your cards are! When I did anki regularly, I remember getting a big ugh-feeling from cards much smaller than yours, just because there were so many things that I had to consciously recapitulate. It was also fairly common that I missed some little detail and had to choose between starting the whole card over from scratch (which is a big time sink since the card takes so much time at every repeat) or accept that I might never remember that detail.
I'm super curious about your experience of e.g. encountering the function question. Do you try to generate both an example and a formalism, or just the formalism? Do you consciously recite a definition in words, or check some feeling of remembering what the definition is, or mumble something in your mind about how a function is a set of ordered pairs? Are the domain/range definitions just there as a reminder when you read them, or do you aim to remember them every time? Do you reset or accept if you forget to mention a detail?
Cool, seems reasonable. Here are some minor responses: (perhaps unwisely, given that we're in a semantics labyrinth)
Evan's footnote-definition doesn't rule out malign priors unless we assume that the real world isn't a simulation
Idk, if the real world is a simulation made by malign simulators, I wouldn't say that an AI accurately predicting the world is falling prey to malign priors. I would probably want my AI to accurately predict the world I'm in even if it's simulated. The simulators control everything that happens anyway, so if they want our AIs to behave in some particular way, they can always just make them do that no matter what we do.
you are changing the definition of outer alignment if you think it assumes we aren't in a simulation
Fwiw, I think this is true for a definition that always assumes that we're outside a simulation, but I think it's in line with previous definitions to say that the AI should think we're not in a simulation iff we're not in a simulation. That's just stipulating unrealistically competent prediction. Another way to look at it is that in the limit of infinite in-distribution data, an AI may well never be able to tell whether we're in the real world or in a simulation that's identical to the real world; but it would be able to tell whether we're in a simulation with simulators who actually intervene, because it would see them intervening somewhere in its infinite dataset. And that's the type of simulators that we care about. So definitions of outer alignment that appeal to infinite data automatically assume that AIs would be able to tell the difference between worlds that are functionally like the real world, and worlds with intervening simulators.
And then, yeah, in practice I agree we won't be able to learn whether we're in a simulation or not, because we can't guarantee in-distribution data. So this is largely semantics. But I do think definitions like this end up being practically useful, because convincing the agent that it's not individually being simulated is already an inner alignment issue, for malign-prior-reasons, and this is very similar.
Isn't that exactly the point of the universal prior is misaligned argument? The whole point of the argument is that this abstraction/specification (and related ones) is dangerous.
I guess your title made it sound like you were teaching us something new about prediction (as in, prediction can be outer aligned at optimum) when really you are just arguing that we should change the definition of outer-aligned-at-optimum, and your argument is that the current definition makes outer alignment too hard to achieve
I mean, it's true that I'm mostly just trying to clarify terminology. But I'm not necessarily trying to propose a new definition – I'm saying that the existing definition already implies that malign priors are an inner alignment problem, rather than an issue with outer alignment. Evan's footnote requires the model to perform optimally on everything it actually encounters in the real world (rather than asking it to do as well as it can across the multiverse, given its training data); so that definition doesn't have a problem with malign priors. And as Richard notes here, common usage of "inner alignment" refers to any case where the model performs well on the training data but is misaligned during deployment, which definitely includes problems with malign priors. And per Rohin's comment on this post, apparently he already agrees that malign priors are an inner alignment problem.
Basically, the main point of the post is just that the 11 proposals post is wrong about mentioning malign priors as a problem with outer alignment. And then I attached 3 sections of musings that came up when trying to write that :)
Things I believe about what sort of AI we want to build:
It would be kind of convenient if we had an AI that could help us do acausal trade. If assuming that it's not in a simulation would preclude an AI from doing acausal trade, that's a bit inconvenient. However, I don't think this matters for the discussion at hand, for reasons I describe in the final array of bullet points below.
Even if it did matter, I don't think that the ability to do acausal trade is a deal-breaker. If we had a corrigible, aligned, superintelligent AI that couldn't do acausal trade, we could ask it to scan our brains, then compete through any competitive period on Earth / in space, and eventually recreate us and give us enough time to figure out this acausal trade thing ourselves. Thus, for practical purposes, an AI that assumes it isn't in a simulation doesn't seem defective to me, even if that means it can't do acausal trade.
Things I believe about how to choose definitions:
When choosing how to define our terms, we should choose based on what abstractions are most useful for the task at hand. For the outer-alignment-at-optimum vs inner alignment distinction, we're trying to choose a definition of "optimal performance" such that we can separately:
Design an intent-aligned AI out of idealised training procedures that always yield "optimal performance" on some metric. If we successfully do this, we've solved outer alignment.
Figure out a training procedure that produces an AI that actually does very well on the chosen metric (sufficiently well to be aligned, even if it doesn't achieve absolute optimal performance). If we do this, we've solved inner alignment.
Things I believe about what these candidate definitions would imply:
For every AI-specification built with the abstraction "Given some finite training data D, the AI predicts the next data point X according to how common it is that X follows D across the multiverse", I think that AI is going to be misaligned (unless it's trained with data that we can't get our hands on, e.g. infinite in-distribution data), because of the standard universal-prior-is-misaligned-reasons. I think this holds true even if we're trying to predict humans like in IDA. Thus, this definition of "optimal performance" doesn't seem useful at all.
For AI-specification built with the abstraction "Given some finite training data D, the AI predicts the next data point X according to how common it is that X follows D on Earth if we aren't in a simulation", I think it probably is possible to build aligned AIs. Since it also doesn't seem impossible to train AIs to do something like this (ie we haven't just moved the impossibility to the inner alignment part of the problem), it seems like a pretty good definition of "optimal performance".
Surprisingly, I think it's even possible to build AIs that do assign some probability to being in a simulation out of this. E.g. we could train the AI via imitation learning to imitate me (Lukas). I assign a decent probability to being in a simulation, so a perfect Lukas-imitator would also assign a decent probability to being in a simulation. This is true even if the Lukas-imitator is just trying to imitate the real-world Lukas as opposed to the simulated Lukas, because real-world Lukas assigns some probability to being simulated, in his ignorance.
I'm also open to other definitions of "optimal performance". I just don't know any useful ones other than the ones I mention in the post.
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the future
I don't think this is right. I've put my proposed modifications in italics:
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the past [we don't have ground-truth for the future, so we can't test how well humans can reason about it] and how well humans think it would generalise to the future. Then, we train a separate network to predict what humans with access to the previous network would predict about the future.
(It might be a good idea to share some parameters between the second and first network.)
Oops, I actually wasn't trying to discuss whether the action-space was wide enough to take over the world. Turns out concrete examples can be ambiguous too. I was trying to highlight whether the loss function and training method incentivised taking over the world or not.
Instead of an image-classifier, let's take GPT-3, which has a wide enough action-space to take over the world. Let's assume that:
1. GPT-3 is currently being tested on a validation set which has some correct answers. (I'm fine with "optimal performance" either requiring that GPT-3 magically returns these correct answers; or requiring that it returns some distribution along the lines that I defined in my post.)
2. The researchers will read what GPT-3 outputs, and there exists some string that causes them to go mad and give control over the world to GPT-3.
In this case, if we define optimal performance as "correctly predicting as many words as possible" or "achieve minimum total loss over the entire history of the world", I agree that optimal performance would plausibly involve taking over the world to feed itself the right questions. However, I think a better definition is "Correctly predicts every word it's asked to predict", because I think this tracks what (un)supervised learning actually selects for (absent inner alignment concerns).
To make that last point clearer: I'm claiming that in a supervised learning scenario where predicting label X always gives the neural network a small punishment, but then causes the training procedure to give the model 10 easy questions, the model will learn to not choose label X. (At least in the long run. In the short run, I could see the model being a little bit more likely to predict label X than expected, because easy questions could give the model a less steep gradient, so a neural network that predicts X would stick around a bit longer than it would have if X hadn't yielded easy questions. However, every time X is chosen, the nn is directionally pushed away from choosing it again; so in the infinite data limit, I think it would learn to not do that.) Conversely, if predicting label Y is correct on some inputs, but predicting Y immediately led to 10 hard questions, I don't think the model would learn to avoid label Y (though I'm unsure if the learning process would converge to choosing Y or just be unstable and never converge). I'm actually very curious if you agree with this; it seems like an important question.
(Conversely, I think that a robot RL-agent would be incentivised to control its future challenges, because of how RL loss functions are shaped. A definition of RL optimal behavior should naturally take this into account.)
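To illustrate the supervised-learning half of this claim, here is a toy simulation. It is only a sketch under assumptions that are not part of the discussion above: the "network" is a table of logits per question type, "hard" questions have a coin-flip answer, and the learning rate, step count, and function names are all made up for illustration. The point it tries to show is that the cross-entropy update only looks at the current question, so the fact that picking X buys ten easy questions never enters the gradient.

```python
import math
import random

LABELS = ["X", "A", "B"]  # in this toy, "X" is never the correct answer


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=5000, lr=0.5, seed=0):
    rng = random.Random(seed)
    # One logit vector per question type: a stand-in for a real network.
    logits = {"easy": [0.0, 0.0, 0.0], "hard": [0.0, 0.0, 0.0]}
    easy_left = 0  # easy questions still owed after the model picked "X"
    for _ in range(steps):
        kind = "easy" if easy_left > 0 else "hard"
        easy_left = max(0, easy_left - 1)
        # Easy questions always have answer "A"; hard ones are a coin flip
        # between "A" and "B", so they carry irreducible loss.
        target = 1 if kind == "easy" else rng.choice([1, 2])
        probs = softmax(logits[kind])
        prediction = rng.choices([0, 1, 2], weights=probs)[0]
        if prediction == 0:
            easy_left = 10  # picking "X" buys ten easy questions
        # Cross-entropy gradient w.r.t. the logits is probs - one_hot(target).
        # It depends only on the current question, not on what comes next.
        for i in range(3):
            grad = probs[i] - (1.0 if i == target else 0.0)
            logits[kind][i] -= lr * grad
    # Final probability of predicting "X" for each question type.
    return {kind: softmax(z)[0] for kind, z in logits.items()}


p_x = train()
print(p_x)
```

In this setup the probability of choosing X shrinks on both question types, even though choosing X reliably lowers the loss on the next ten questions. An RL-style objective that credits a choice with the loss on later questions would behave differently, which is the contrast drawn above.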
Now, by defining optimal behavior as "Correctly predicts every word it's asked to predict", I'm saying that the model is outer aligned at optimum iff the model cannot simultaneously answer every question optimally and adversarially optimise against humans. And if we only consider the normal output channel, I'm saying it couldn't, because it would have to make incorrect predictions in order to broadcast its message; in which case it wouldn't be optimal.
If we also consider side-channels, this gets messier, because my chosen definition doesn't imply any constraints on what the AI could do with side channels. Thus, if it could use side channels to take over the world in any way, my definition would say it wasn't outer aligned. My preferred solution to this is just to intentionally ignore side channels when defining outer alignment at optimum (as you correctly inferred). Three reasons for this:
As mentioned above, taking side channels into account would mean that any model with powerful side channels is classified as outer misaligned, even if there's no incentive to use these side channels in any particular way.
Separately, I suspect that supervised learning normally doesn't incentivise neural networks to use side channels in any particular way (absent inner alignment concerns).
Finally, it just seems kind of useful to talk about the outer alignment properties of abstract agent-models, since not all abstract agent-models are outer aligned. Side-constraints can be handled separately.
(Btw I'd say the bad image classifier is aligned but less performance-competitive than a good image classifier, though I'm sympathetic to the view that it doesn't make sense to talk about its alignment properties at all.)
That is, if you write down a loss function like "do the best possible science", then the literal optimal AI would take over the world and get a lot of compute and robots and experimental labs to do the best science it can do.
I think this would be true for some ways of training a STEM AI with some loss functions (especially if it's RL-like, can interact with the real world, etc) but I think that there are some setups where this isn't the case (e.g. things that look more like alphafold). Specifically, I think there exist some setups and some parsimonious definition of "optimal performance" such that optimal performance is aligned: and I claim that's the more useful definition.
To be more concrete, do you think that an image classifier (trained with supervised learning) would have convergent instrumental goals that go against human interests? For image classifiers, I think there's a natural definition of "optimal performance" that corresponds to always predicting the true label via the normal output channel; and absent inner alignment concerns, I don't think a neural network trained on infinite data with SGD would ever learn anything less aligned than that. If so, it seems like the best definition of "at optimum" is the definition that says that the classifier is outer aligned at optimum.
He's definitely given some money, and I don't think the 990 absence means much. From here:
in 2016, the IRS was still processing OpenAI’s non-profit status, making it impossible for the organization to receive charitable donations. Instead, the Musk Foundation gave $10m to another young charity, YC.org. [...] The Musk Foundation’s grant accounted for the majority of YC.org’s revenue, and almost all of its own funding, when it passed along $10m to OpenAI later that year.
Also, when he quit in 2018, OpenAI wrote "Elon Musk will depart the OpenAI Board but will continue to donate and advise the organization". The same blog post lists multiple other donors than Sam Altman, so donating to OpenAI without showing up on the 990s must be the default, for some reason.
This has definitely been productive for me. I've gained useful information, I see some things more clearly, and I've noticed some questions I still need to think a lot more about. Thanks for taking the time, and happy holidays!
Yeah, that's a good question. It's similar to training image classifiers on human-labelled data – they can become cheaper than humans and they can become more consistent than humans (ie., since humans make uncorrelated errors, the answer that the most humans would pick can be systematically better than the answer that a random human would pick), but they can't gain vastly superhuman classification abilities.
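The parenthetical about uncorrelated errors can be made concrete with a small calculation. This is a sketch under assumptions I'm adding for illustration: binary labels, an odd number of labellers, fully independent errors, and a hypothetical per-labeller accuracy of 80%.

```python
from math import comb


def majority_accuracy(n_labellers, p_correct):
    """Probability that a strict majority of independent labellers is right.

    Assumes a binary label and an odd n_labellers, so there are no ties.
    """
    needed = n_labellers // 2 + 1
    return sum(
        comb(n_labellers, k) * p_correct**k * (1 - p_correct) ** (n_labellers - k)
        for k in range(needed, n_labellers + 1)
    )


print(round(majority_accuracy(1, 0.8), 5))  # 0.8: a single random labeller
print(round(majority_accuracy(5, 0.8), 5))  # 0.94208: majority of five
```

A classifier that learns the majority answer can therefore be systematically more accurate than any single labeller, but the ceiling is still set by what the labellers collectively get right.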
In this case, one plausible route to outperforming humans would be to start out with a GPT-like model, and then finetune it on some downstream task in an RL-like fashion (see e.g. this). I don't see any reason why modelling the internet couldn't lead to latent superhuman ability, and finetuning could then be used to teach the model to use its capabilities in ways that humans wouldn't. Indeed, there's certainly no single human who could optimally predict every next word of internet-text, so optimal performance on the training task would require the model to become superhuman on at least that task.
Or if we're unlucky, sufficiently large models trained for sufficiently long could lead to something like a misaligned mesa optimizer, which would already "want" to use its capabilities in ways that humans wouldn't.
I'm not sure exactly what you mean here, but if you mean "holding an ordinary conversation with a human" as a task, my sense is that is extremely hard to do right (much harder than, e.g., SuperGLUE). There's a reason that it was essentially proposed as a grand challenge of AI; in fact, it was abandoned once it was realized that actually it's quite gameable.
More seriously, I agree that a full-blown Turing test is hard, but this is because the interrogator can choose whatever question is most difficult for a machine to answer. My statement about "ordinary conversation" was vague, but I was imagining something like sampling sentences from conversations between humans, and then asking questions about them, e.g. "What does this pronoun refer to?", "Does this entail or contradict this other hypothesis?", "What will they say next?", "Are they happy or sad?", "Are they asking for a cheeseburger?".
For some of these questions, my original claim follows trivially. "What does this pronoun refer to?" is clearly easier for randomly chosen sentences than for Winograd sentences, because the latter have been selected for ambiguity.
And then I'm making the stronger claim that a lot of tasks (e.g. many personal assistant tasks, or natural language interfaces to decent APIs) can be automated via questions that are similarly hard as the benchmark questions; ie., that you don't need more than the level of understanding signalled by beating a benchmark suite (as long as the model hasn't been optimised for that benchmark suite).
Cool, thanks. I agree that specifying the problem won't get solved by itself. In particular, I don't think that any jobs will become automated by describing the task and giving 10 examples to an insanely powerful language model. I realise that I haven't been entirely clear on this (and indeed, my intuitions about this are still in flux). Currently, my thinking goes along the following lines:
Fine-tuning on a representative dataset is really, really powerful, and it gets more powerful the narrower the task is. Since most benchmarks are more narrow than the things we want to automate, and it's easier to game more narrow benchmarks, I don't trust trends based on narrow, fine-tuned benchmarks that much.
However, in a few-shot setting, there's not enough data to game the benchmarks in an overly narrow way. Instead, they can be fairly treated as a sample from all possible questions you could ask the model. If the model can answer some superglue questions that seem reasonably difficult, then my default assumption is that it could also answer other natural language questions that seem similarly difficult.
This isn't always an accurate way of predicting performance, because of our poor abilities to understand what questions are easy or hard for language models.
However, it seems like it should at least be an unbiased prediction; I'm as likely to think that benchmark question A is harder than non-benchmark question B as I am to think that B is harder than A (for A, B that are in fact similarly hard for a language model).
However, when automating stuff in practice, there are two important problems that speak against using few-shot prompting:
As previously mentioned, tasks-to-be-automated are less narrow than the benchmarks. Prompting with examples seems less useful for less narrow situations, because each example may be much longer and/or you may need more prompts to cover the variation of situations.
Finetuning is in fact really powerful. You can probably automate stuff with finetuning long before you can automate it with few-shot prompting, and there's no good reason to wait for models that can do the latter.
Thus, I expect that in practice, telling the model what to do will happen via finetuning (perhaps even in an RL-fashion directly from human feedback), and the purpose of the benchmarks is just to provide information about how capable the model is.
I realise this last step is very fuzzy, so to spell out a procedure somewhat more explicitly: When asking whether a task can be automated, I think you can ask something like "For each subtask, does it seem easier or harder than the ~solved benchmark tasks?" (optionally including knowledge about the precise nature of the benchmarks, e.g. that the model can generally figure out what an ambiguous pronoun refers to, or figure out if a stated hypothesis is entailed by a statement). Of course, a number of problems make this pretty difficult:
It assumes some way of dividing tasks into a number of sub-tasks (including the subtask of figuring out what subtask the model should currently be trying to answer).
Insofar as that which we're trying to automate is "farther away" from the task of predicting internet corpora, we should adjust for how much finetuning we'll need to make up for that.
We'll need some sense of how 50 in-prompt-examples showing the exact question-response format compares to 5000 (or more; or less) finetuning samples showing what to do in similar-but-not-exactly-the-same-situation.
Nevertheless, I have a pretty clear sense that if someone told me "We'll reach near-optimal performance on benchmark X with <100 examples in 2022" I would update differently on ML progress than if they told me the same thing would happen in 2032; and if I learned this about dozens of benchmarks, the update would be non-trivial. This isn't about "benchmarks" in particular, either. The completion of any task gives some evidence about the probability that a model can complete another task. Benchmarks are just the things that people spend their time recording progress on, so it's a convenient list of tasks to look at.
for us to know the exact thing we want and precisely characterize it is basically the condition for something being subject to automation by traditional software. ML can come into play where the results don't really matter that much, with things like search/retrieval, ranking problems,
I'm not sure what you're trying to say here? My naive interpretation is that we only use ML when we can't be bothered to write a traditional solution, but I don't think you believe that. (To take a trivial example: ML can recognise birds far better than any software we can write.)
My take is that for us to know the exact thing we want and precisely characterize it is indeed the condition for writing traditional software; but for ML, it's sufficient that we can recognise the exact thing that we want. There are many problems where we recognise success without having any idea about the actual steps needed to perform the task. Of course, we also need a model with sufficient capacity, and a dataset with sufficiently many examples of this task (or an environment where such a dataset can be produced on the fly, RL-style).
Re 3: Yup, this seems like a plausibly important training improvement. FWIW, when training GPT-3, they did filter Common Crawl using a classifier that was trained to recognise high-quality data (with Wikipedia, WebText, and some books as positive examples) but unfortunately they don't say how big of a difference it made.
I've been assuming (without much thought) that doing this better could make training up to ~10x cheaper, but probably not a lot more than that. I'd be curious if this sounds right to you, or if you think it could make a substantially bigger difference.
Benchmarks are filtered for being easy to use, and useful for measuring progress. (...) So they should be difficult, but not too difficult. (...) Only very recently has this started to change with adversarial filtering and evaluation, and the tasks have gotten much more ambitious, because of advances in ML.
That makes sense. I'm not saying that all benchmarks are necessarily hard, I'm saying that these ones look pretty hard to me (compared with ~ordinary conversation).
many of these ambitious datasets turn out ultimately to be gameable
My intuition is that this is far less concerning for GPT-3 than for other models, since it gets so few examples for each benchmark. You seem to disagree, but I'm not sure why. In your top-level comment, you write:
While it seems to be an indicator of generality, in the particular case of GPT-3's few-shot learning setting, the output is controlled by the language modeling objective. This means that even though the model may not catch on to the same statistical regularities as task-specific trained models do from their datasets, it essentially must rely on statistical regularities that are in common between the language modeling supervision and the downstream task.
If for every benchmark, there were enough statistical regularities in common between language modeling supervision and the benchmark to do really well on them all, I would expect that there would also be enough statistical regularities in common between language modeling supervision and whatever other comparably difficult natural-language task we wanted to throw at it. In other words, I feel more happy about navigating with my personal sense of "How hard is this language task?" when we're talking about few-shot learning than when we're talking about finetuned models, because finetuned models can entirely get by with heuristics that only work on a single benchmark, while few-shot learners use sets of heuristics that cover all tasks they're exposed to. The latter seem far more likely to generalise to new tasks of similar difficulty (no matter if they do it via reasoning or via statistics).
You also write "It stands to reason that this may impose a lower ceiling on model performance than human performance, or that in the task-specific supervised case." I don't think this is right. In the limit of training on humans using language, we would have a perfect model of the average human in the training set, which would surely be able to achieve human performance on all tasks (though it wouldn't do much better). So the only questions are:
How fast will more parameters + more data increase performance on the language modeling task? (Including: Will performance asymptote before we've reached the irreducible entropy?)
As the performance on language modeling increases, in what order will the model master what tasks?
There are certainly some tasks where the parameter+data requirements are far beyond our resources; but I don't see any fundamental obstacle to reaching human performance.
I think this is related to your distinction between a "general-purpose few-shot learner" and a "general-purpose language model", which I don't quite understand. I agree that GPT-3 won't achieve bayes-optimality, so in that sense it's limited in its few-shot learning abilities; but it seems like it should be able to reach human-level performance through pure human-imitation in the limit of excelling on the training task.
Take for example writing news / journalistic articles. [...] I think similar concerns apply to management, accounting, auditing, engineering, programming, social services, education, etc. And I can imagine many ways in which ML can serve as a productivity booster in these fields but concerns like the ones I highlighted for journalism make it harder for me to see how AI of the sort that can sweep ML benchmarks can play a singular role in automation, without being deployed along a slate of other advances.
Completely agree that high benchmark performance (and in particular, GPT-3 + 6 orders of magnitude) is insufficient for automating these jobs.
(To be clear, I believe this independently of concerns about accountability. I think GPT-3 + 6 OOM just wouldn't be able to perform these jobs as competently as a human.)
On 1b and economically useful tasks: you mention customer service, personal assistant, and research assistant work. [...] But beyond the restaurant setting, retail ordering, logistics, and delivery seems already pretty heavily automated by, e.g., the likes of Amazon. So it's hard for me to see what exactly could be "transformative" here.
For personal assistant and research assistant work, it also seems to me that an incredible amount of this is already automated. [...] Again, here, I'm not sure exactly what "transformation" by powerful function approximation alone would look like.
I strongly agree with this. I think predictions of when we'll automate which low-level tasks are informative for general trends in automation, but I emphatically do not believe that automation of these tasks would constitute transformative AI. In particular, I'm honestly a bit surprised that the internet hasn't increased research productivity more, and I take it as pretty strong evidence that time-saving productivity improvements need to be extremely good and general if they are to accelerate things to any substantial degree.
Thanks! I agree that if we required GPT-N to beat humans on every benchmark question that we could throw at them, then we would have a much more difficult task.
I don't think this matters much in practice, though, because humans and ML are really differently designed, so we're bound to be randomly better at some things and randomly worse at some things. By the time ML is better than humans at all things, I think they'll already be vastly better at most things. And I care more about the point when ML will first surpass humans at most things. This is most clearly true when considering all possible tasks (e.g. "when will AIs beat humans at surviving on a savannah in a band of hunter-gatherers?"), but I think it's also true when considering questions of varying difficulty in a fairly narrow benchmark. Looking at the linked papers, I think contrastive learning seems like a fair challenge; but I suspect that enough rounds of ANLI could yield questions that would be very rare in a normal setting.
To make that a little bit more precise, I want to answer the question "When will transformative AI be created?". Exactly what group of AI technologies would or wouldn't be transformative is an open question, but I think one good candidate is AI that can do the vast majority of economically important tasks cheaper than a human. If I then adopt the horizon-length frame (which I find plausible but not clearly correct), the relevant question for GPT-N becomes "When will GPT-N be able to perform (for less cost than a human) the vast majority of all economically relevant sub-tasks with a 1-token horizon length?"
This is an annoyingly vague question, for sure. However, I currently suspect it's more fruitful to think about this from the perspective of "How high reliability do we need for typical jobs? How expensive would it be to make GPT-N that reliable?" than to think about this from the perspective of "When will we be unable to generate questions that GPT-N fails at?"
Another lens on this is to look at tasks that have metrics other than how well AI can imitate humans. Computers beat us at chess in the 90s, but I think humans are still better in some situations, since human-AI teams do better than AIs alone. If we had evaluated chess engines on the metric of beating humans in every situation, we would have overestimated the point at which AIs beat us at chess by at least 20 years.
(Though in the case of GPT-N, this analogy is complicated by the fact that GPT-3 doesn't have any training signal other than imitating humans.)
Though being concerned about safety, I would be delighted if people became very serious about adversarial testing. ↩︎
Thank you, this is very useful! To start out with responding to 1:
1a. Even when humans are used to perform a task, and even when they perform it very effectively, they are often required to participate in rule-making, provide rule-consistent rationales for their decisions, and stand accountable (somehow) for their decisions
I agree this is a thing for judges and other high-level decisions, but I'm not sure how important it is for other tasks. We have automated a lot of things in the past couple of hundred years with unaccountable machines and unaccountable software, and the main difference with ML seems to be that it's less interpretable. Insofar as humans like having reasons for failures, I'm willing to accept this as one reason that reliability standards could be a bit higher for ML, but I doubt they would be drastically higher. I'd love a real example (outside of criminal justice) where this is a bottleneck. I'd guess that some countries will have harsh regulations for self-driving cars, but that does have a real risk of killing people, so it's still tougher than most applications.
1b. Integration with traditional software presents difficulties which also mean a very high bar for AI-based automation. (...) example of how this actually looks in practice might be Andrej Karpathy's group in Tesla, based on what he said in this talk.
I liked the talk! I take it as evidence that it's really hard to get >99.99% accuracy, which is a big problem when your neural network is piloting a metric ton of metal at high speeds in public spaces. I'm not sure how important reliability is in other domains, though. Your point "failure of abstractions can have nonlinear effects on the outcome in a software system" is convincing for situations when ML is deeply connected with other applications. I bet there's a lot of cool stuff that ML could do there, so the last 0.01% accuracy could definitely be pretty important. An error rate of 0.1%-1% seems fine for a lot of other tasks, though, including all examples in Economically useful tasks.
For ordering expensive stuff, you want high reliability. But for ordering cheap stuff, 0.1%-1% error rate should be ok? If you order something every day, that corresponds to getting the wrong order somewhere between once every three years and a few times a year.
0.1%-1% error rate also seems fine for personal assistant work, especially since you can just personally double-check any important emails before they're sent, or schedule any especially important meeting yourself.
Same thing for research assistant work (which – looking at the tasks – actually seems useful to a lot of non-researchers too). Finding 99% of all relevant papers is great; identifying 99% of trivial errors in your code is great; writing routine code that's correct 99% of the time is great (you can just read through it or test it); reading summaries that have an error 1% of the time is a bit annoying, but still very useful.
(Note that a lot of existing services have less than 99% reliability, e.g. the box on top of google search, google translate, spell check, etc.)
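For reference, the arithmetic behind these error rates can be spelled out as follows. The function name and the one-order-per-day default are hypothetical choices for this sketch, not anything taken from the thread.

```python
def expected_errors_per_year(error_rate, orders_per_day=1, days=365):
    """Expected number of failed interactions per year at a given error rate."""
    return error_rate * orders_per_day * days


# The 0.1%-1% range discussed above, at one order per day.
print(round(expected_errors_per_year(0.001), 3))  # 0.365
print(round(expected_errors_per_year(0.01), 3))   # 3.65
```

So at one interaction per day, this range spans roughly one failure every three years to between three and four failures a year; "once a year" corresponds to an error rate of about 0.27%.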
Also, many benchmarks are already filtered for being difficult and ambiguous, so I expect 90% performance on most of them to correspond to at least 99% performance in ordinary interactions. I'd be curious if you (and other people) agree with these intuitions?
Re API actions: Hm, this seems a lot easier than natural language to me. Even if finetuning a model to interact with APIs is an annoying engineering task, it seems like it should be doable in less than a year once we have a system that can handle most of the ambiguities of natural language (and benchmarks directly test the ability to systematically respond in a very specific way to a vague input). As with google duplex, the difficulty of interacting with APIs is upper-bounded by the difficulty of interacting with human interfaces (though to be fair, interactions with humans can be more forgiving than interfaces-for-humans).
Right, sorry. The power law is a function from compute to reducible error, which goes towards 0. This post's graphs have the (achievable) accuracy on the y-axis, where error=1-accuracy (plus or minus a constant to account for achievability/reducibility). So a more accurate statement would be "the lower end of an inverted s-curve [a z-curve?] (on a linear-log scale) eventually looks roughly like a power law (on a linear-linear scale)".
In other words, a power law does have an asymptote, but it's always an asymptote towards 0. So you need to transform the curve as 1-s to get the s-curve to also asymptote towards 0.
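A quick numerical check of this, assuming (my assumption, not something the post specifies) that the s-curve is a logistic function of log-compute. The midpoint and slope values are arbitrary illustration parameters. With accuracy s = 1 / (1 + exp(-k (ln C - ln C0))), the error is 1 - s = 1 / (1 + (C/C0)^k), which approaches (C/C0)^(-k) when C is much larger than C0.

```python
import math


def error_from_s_curve(compute, midpoint=1e3, slope=0.5):
    """Error = 1 - accuracy, where accuracy is logistic in log(compute)."""
    x = slope * (math.log(compute) - math.log(midpoint))
    return 1.0 / (1.0 + math.exp(x))


def power_law(compute, midpoint=1e3, slope=0.5):
    """The power law that the error should approach for large compute."""
    return (compute / midpoint) ** (-slope)


for compute in (1e3, 1e5, 1e7, 1e9):
    ratio = error_from_s_curve(compute) / power_law(compute)
    print(f"compute={compute:.0e}  error/power_law={ratio:.4f}")
```

The ratio starts at 0.5 at the midpoint and tends to 1 as compute grows, so far out on the curve the error is well approximated by a power law that asymptotes towards 0.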
Right, this does not apply to these graphs. It's just a round-about way of saying that the upper end of s-curves (on a linear-log scale) eventually look roughly like power laws (on a linear-linear scale). We do have some evidence that errors are typically power laws in compute (and size and data), so I wanted to emphasize that s-curves are in line with that trend.
In fact I was imagining that maybe most (or even all) of them would be narrow AIs / tool AIs for which the concept of alignment doesn't really apply.
Ah, yeah, for the purposes of my previous comment I count this as being aligned. If we only have tool AIs (or otherwise alignable AIs), I agree that Evan's conclusion 2 follows (while the other ones aren't relevant).
I think the relevant variable for homogeneity isn't whether we've solved alignment--maybe it's whether the people making AI think they've solved alignment
So for homogeneity-of-factions, I was specifically trying to say that alignment is necessary to have multiple non-tool AIs on the same faction, because at some point, something must align them all to the faction's goals.
However, I'm now noticing that this requirement is weaker than what we usually mean with alignment. For our purposes, we want to be able to align AIs to human values. However, for the purpose of building a faction, it's enough if there exists an AI that can align other AIs to its values, which may be much easier.
Concretely, my best guess is that you need inner alignment, since failure of inner alignment probably produces random goals, which means that multiple inner-misaligned AIs are unlikely to share goals. However, outer alignment is much easier for easily-measurable values than for human values, so I can imagine a world where we fail outer alignment, unthinkingly create an AI that only cares about something easy (e.g. maximize money) and then that AI can easily create other AIs that want to help it (with maximizing money).
Not a typo, but me being ambiguous. When I wrote about updating "it" downward, I was referring to my median estimate of 5-6 orders of magnitude. I've now added a dollar cost to that ($100B-$1T), hopefully making it a bit more clear.
I think this is only right if we assume that we've solved alignment. Otherwise you might not be able to train a specialised AI that is loyal to your faction.
Here's how I imagine Evan's conclusions to fail in a very CAIS-like world:
1. Maybe we can align models that do supervised learning, but can't align RL, so we'll have humans+GPT-N competing against a rogue RL-agent that someone created. (And people initially trained both of these because GPT-N makes for a better chatbot, while the RL agent seemed better at making money-maximizing decisions at companies.)
2. A mesa-optimiser arising in GPT-N may be very dissimilar to a money-maximising RL-agent, but they may still end up in conflict. Neither of them can add an analogue of the other to its team, because neither knows how to align it.
3. If we use lots of different methods for training lots of different specialised models, any one of them can produce a warning shot (which would ideally make us suspect all other models). Also, they won't really understand or be able to coordinate with the other systems.
4. It's not as important if the first advanced AI system is aligned, since there will be lots of different systems of different types. If everyone is training unaligned chatbots, you still care about aligning everyone's personal assistants.
I think this depends a ton on your reference class. If you compare AI with military fighter planes: very homogeneous. If you compare AI with all vehicles: very heterogeneous.
Maybe the outside view can be used to say that all AIs designed for a similar purpose will be homogeneous, implying that we only get heterogeneity in a CAIS scenario, where there are many different specialised designs. But I think the outside view also favors a CAIS scenario over a monolithic AI scenario (though that's not necessarily decisive).
I find the prospect of multiple independent mesa-optimizers inside of the same system relatively unlikely.
I think Jesse was just claiming that it's more likely that everyone uses an architecture especially prone to mesa optimization. This means that (if multiple people train that architecture from scratch) the world is likely to end up with many different mesa optimizers in it (each localised to a single system). Because of the random nature of mesa optimization, they may all have very different goals.
If we're comparing Europe to China, did ships+navigation tech really have anything to do with it? We certainly don't need to invoke them, since certain emperors' whims are sufficient to explain why China didn't colonise. And some Chinese ships were going to East Africa already by the 9th century (afaict from Wikipedia), which seems like it could be sufficient to start colonising? I suspect it was farther than Europeans were going at the time.
Or did you only mean to cite ships as something that Europeans were disproportionately good at compared to other advanced societies? (maybe Middle Eastern ones?)
The time it took to reach human-level intelligence (HLI) was quite short, though, which is decent evidence that HLI is easy. Our common ancestor with dolphins was just 100mya, whereas there's probably more than 1 billion years left for life on Earth to evolve.
Here's one way to think about the strength of this evidence. Consider two different hypotheses:
1. HLI is easy. After our common ancestor with dolphins, it reliably takes N million years of steady evolutionary progress to develop HLI, where N is uniformly distributed.
2. HLI is hard. After our common ancestor with dolphins, it reliably takes at least N million years (uniformly distributed) of steady evolutionary progress, and for each year after that, there's a constant, small probability p that HLI is developed. In particular, assume that p is so small that, if we condition on HLI happening at some point (for anthropics reasons), the time at which HLI happens is uniform between the end of the N million years and the end of all life on Earth.
Let's say HLI emerged on Earth exactly 100 my after our common ancestor with dolphins. After our common ancestor with dolphins, let's say there were 1100 million years remaining for life to evolve on Earth (I think it's close to that). We can treat N as being distributed uniformly between 1 and 100, because we know it's not more than 100 (our existence contradicts that). If so:
$P(\text{HLI at 100my} \mid \text{HLI is easy}) = \frac{1}{100}$
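$P(\text{HLI at 100my} \mid \text{HLI is hard}) = \sum_{n=1}^{100} \frac{1}{100} \cdot \frac{1}{1100-n} < \sum_{n=1}^{100} \frac{1}{100} \cdot \frac{1}{1000} = \frac{1}{1000}$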
Thus, us evolving at 100my is roughly a 10:1 update in favor of HLI being easy.
(Note that, since the question under dispute is the ease of getting to HLI from dolphin intelligence, counting from 100mya is really conservative; it might be more appropriate to count from whenever primates acquired dolphin intelligence. This could lead to much stronger updates; if we count time from e.g. 20mya instead of 100mya, the update would be 50:1 instead of 10:1, since P(HLI at 20my | HLI is easy) would be 1/20.)
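The two likelihoods can be checked with a short script. This is a minimal sketch under the toy model's own assumptions (N uniform on 1..t_obs, arrival under "hard" uniform over the time remaining after N). The function names are made up for illustration, and the 1020 my horizon used for the 20 my variant is my own assumption (the same 1000 my of remaining time, counted from a starting point 20 my back).

```python
from fractions import Fraction


def p_easy(t_obs):
    """P(HLI at t_obs | easy): N is uniform on 1..t_obs and HLI arrives exactly at N."""
    return Fraction(1, t_obs)


def p_hard(t_obs, t_end):
    """P(HLI at t_obs | hard): N is uniform on 1..t_obs; after N, the arrival
    time is uniform over the remaining t_end - N million years."""
    return sum(
        Fraction(1, t_obs) * Fraction(1, t_end - n) for n in range(1, t_obs + 1)
    )


def update_ratio(t_obs, t_end):
    """Likelihood ratio in favor of 'HLI is easy'."""
    return p_easy(t_obs) / p_hard(t_obs, t_end)


if __name__ == "__main__":
    # Counting from the common ancestor with dolphins: roughly 10:1.
    print(float(update_ratio(100, 1100)))
    # Counting from 20 my back instead (t_end = 1020 is an assumption): roughly 50:1.
    print(float(update_ratio(20, 1020)))
```

Using exact fractions avoids any question about rounding; the ratios come out slightly above 10 and 50 because each term 1/(t_end - n) is at most 1/1000.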
This is somewhat but not totally robust to small probabilities of variations. E.g. if we assign 20% chance to life actually needing to evolve within 200 million years after our common ancestor with dolphins, we get:
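$P(\text{HLI at 100my} \mid \text{HLI is easy}) = \frac{1}{100}$
$P(\text{HLI at 100my} \mid \text{HLI is hard}) = 0.8 \cdot \sum_{n=1}^{100} \frac{1}{100} \cdot \frac{1}{1100-n} + 0.2 \cdot \sum_{n=1}^{100} \frac{1}{100} \cdot \frac{1}{200-n} \approx \frac{1}{100}(0.8 \cdot 0.1 + 0.2 \cdot 0.7) = \frac{2.2}{1000}$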
So the update would be more like 1:0.22 ~ 4.5:1 in favor of HLI being easy.
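The mixture version can be sketched the same way. Again this only encodes the toy model's assumptions; the function names and the `(weight, t_end)` representation of the deadline uncertainty are hypothetical choices made for illustration.

```python
def p_hard_given_deadline(t_obs, t_end):
    """P(HLI at t_obs | hard, life ends at t_end): N is uniform on 1..t_obs,
    then arrival is uniform over the remaining t_end - N million years."""
    return sum((1 / t_obs) * (1 / (t_end - n)) for n in range(1, t_obs + 1))


def p_hard_mixture(t_obs, deadlines):
    """Average over uncertainty about the deadline.

    deadlines is a list of (weight, t_end) pairs whose weights sum to 1.
    """
    return sum(w * p_hard_given_deadline(t_obs, t_end) for w, t_end in deadlines)


if __name__ == "__main__":
    p_easy = 1 / 100
    # 80% on 1100 my of total time, 20% on only 200 my.
    p_hard = p_hard_mixture(100, [(0.8, 1100), (0.2, 200)])
    print(p_hard)           # close to 2.2/1000
    print(p_easy / p_hard)  # close to 4.5, slightly higher without rounding
```

Without rounding the two sums to 0.1 and 0.7, the ratio lands a little above 4.5:1, so the rounded figure is a fair summary.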
If you think dolphin intelligence is probably easy, I think you shouldn't be that confident that HLI is hard, so after updating on earliness, I think HLI being easy should be the default hypothesis.
The paper lists "intelligence" as a potentially hard step, which is of extra interest for estimating AI timelines. However, I find all the convergent evolution described in section 5 of this paper (or more briefly described in this blogpost) to be pretty convincing evidence that intelligence was quite likely to emerge after our first common ancestor with octopuses ~800 mya; and as far as I can tell, this paper doesn't contradict that.
We're not licensed to ignore it, and in fact such an update should be done. Ignoring that update represents an implicit assumption that our prior over "how habitable are long-lived planets?" is so weak that the update wouldn't have a big effect on our posterior. In other words, if the beliefs "long-lived planets are habitable" and "Z is much bigger than Y" are contradictory, we should decrease our confidence in both; but if we're much more confident in the latter than the former, we mostly decrease the probability mass we place on the former.
Of course, maybe this could flip around if we get overwhelmingly strong evidence that long-lived planets are habitable. And that's the Popperian point of making the prediction: if it's wrong, the theory making the prediction (ie "Z is much bigger than Y") is (to some extent) falsified.