Comments
I agree that, on the margin, I fall into the trap of doing more of this than I should. I do curate my Twitter feed to try to make this a better form of reaction than it would otherwise be, but I should raise the bar for that relative to my other bars.
Always good to get reminders on this.
However, as you allude to, you're in the spot where you're already checking many of the same sources on Twitter, whereas one of the points of these posts for a lot of readers is so they don't have to do that. I'd definitely do it radically differently if I thought most readers of mine were going to be checking Twitter a lot anyway.
Ah, thanks for clearing that up. That definitely wasn't made clear to me.
Ah, he didn't realize he was getting signal-boosted, and he edited it after he got a bunch of inquiries. Under the old wording, I didn't think they had no alignment teams, but I read it as 'a new alignment team.' It makes sense under Google's general structure to have multiple teams; in fact, it would be weird if you didn't.
How far does this go? Does this mean that if I e.g. had stupid questions or musings about Q-learning, I shouldn't talk about that in public in case I accidentally hit upon something or provoked someone else to say something?
My presumption is that doing this while leaving Altman in place as CEO risks Altman engaging in hostile action, and it represents a vote of no confidence in any case. It isn't a stable option. But I'd have gamed it out?
It would be sheer insanity to have a rule that you can't vote on your own removal, I would think, or else a tied board will definitely shrink right away.
Yeah, I should have put that in the main post but forgot. Added now.
Initially I saw it from Kara Swisher (~1 million views), then I saw it from a BB employee. I presume it is genuine.
I definitely do not think this is on the level of the EO or Summit.
Vote via reactions agree or disagree (or unsure etc) to the following proposition: This post should also go on my Substack.
EDIT: Note that this is currently +5 agreement, but no one actually used a reaction (the icons available at the bottom right corner). Please use the reactions instead, this is much more useful than the +/-.
For my own markets, it is not retroactive if I didn't say it at the time (which I did for many markets). In that case, I would resist doing so exactly because I think this is a low-probability but possible event, and I continue to find it interesting. If it were trading at 3% (where the Superconductor market is) I would be tempted to resolve early, but 5% isn't there yet.
To be clear, I WOULD likely resolve the Superconductor market now under this rule; I think it is trading at the interest rate.
For the bet, at 93% with active arguing on both sides and real trading, definitely not. Even if this were 95%, I wouldn't resolve, because it is based on a 150-to-1 baseline bet and there is a clear contingent arguing the other way. So if I made a UFO market like this I would say 'This cannot resolve early to YES, period.'
For the election case, if I saw the decision desks collectively calling it, I would resolve, but if something is going to be 99.99% a day later and it's 99% now, might as well wait. If it's going to be two months, do it now.
Definitely good to keep this in mind, but to me some of this stuff seems obviously super impressive even if you do not know the technical details. The claim that generating complex, rich pictures on demand that mostly match the requested details is not impressive doesn't parse for me.
Yep, as the edit says I don't think we disagree on the first point - there are versions that are oppressive, but also versions that are not that still have large positive effects.
On the second point, I believe this is because it is much harder to introduce safeguards than to remove them: removing them is a blunt target that is easy to hit, whereas good safeguards have to be detailed to avoid false positives (which Llama-2 did not do a good job of avoiding, but they did try). This is the key asymmetry here; the amount Meta (or anyone else) spends tuning does not help.
I think part 2 that details the reactions will provide important color here - if this had impacted those other than the major labs right away, I believe the reaction would have been quite bad, and that setting it substantially lower would have been a strategic error and also a very hard sell to the Biden Administration. But perhaps I am wrong about that. They do reserve the ability to change the threshold in the future.
This seems to be misunderstanding several points I was attempting to make so I'll clear those up here. Apologies if I gave the wrong idea.
- On longtermism I was responding to Lewis' critique, saying that you do not need full longtermism to care about the issues longtermists care about, that there were also (medium term?) highly valuable issues at stake that would already be sufficient to care about such matters. It was not intended as an assertion that longtermism is false, nor do I believe that.
- I am asserting there that I believe that things other than the subjective experience of pleasure/suffering matter, and that I think the opposite position is nuts both philosophically and in terms of it causing terrible outcomes. I don't think this requires belief in personhood mattering per se, although I would indeed say that it matters. And when people say 'I have read the philosophical literature on this and that's why nothing you think matters, matters, why haven't you done your homework'... well, honestly, that's basically why I almost never talk philosophy online and most other people don't either, and I think that sucks a lot. But if you want to know what's behind that on a philosophical level? I mean, I've written quite a lot of words both here and in various places. But I agree that this was not intended to turn someone who had read 10 philosophy books and bought Benthamite Utilitarianism into switching positions.
- On Alameda, I was saying this from the perspective of Jane Street Capital. Sorry if that was unclear. As in, Lewis said JS looked at EAs suspiciously for not being greedy. Whereas I said no, that's false, EAs got looked at suspiciously because they left in the way they did. Nor is this claiming they were not doing it for the common good - it is saying that from the perspective of JSC, them saying it was 'for the common good' doesn't change anything, even if true. My guess, as is implied elsewhere, is that the EAs did believe this consciously. As for whether they 'should have been' loyal to JSC, my answer is they shouldn't have stayed out of loyalty, but they should have left in a more cooperative fashion.
I would be ecstatic to learn that only 2% of Y-Combinator companies that ever hit $100mm were engaged in serious fraud, and presume the true number is far higher.
And yes, YC does do that and Matt Levine frequently talks about the optimal amount of fraud (from the perspective of a VC) being not zero. For them, this is a feature, not a bug, up to a (very high) point.
I would hope we would feel differently, and also EA/rationality has had (checks notes) zero companies/people bigger than FTX/SBF unless you count any of Anthropic, OpenAI and DeepMind. In which case, well, other issues, and perhaps other types of fraud.
If it wasn't Guzey I would have dismissed the whole thing as trolling or gaslighting, and I wouldn't have covered it beyond one line and a link. He's definitely very confused somewhere.
Pretty big if true. If EV is actively censoring attempts to reflect upon what happened, then that is important information to pin down.
I would hope that if someone tried to do that to me, I would resign.
I wish he had said (perhaps after some time to ponder) "I now realize that SBF used FTX to steal customer funds. SBF and FTX had a lot of goodwill, that I contributed to, and I let those people and the entire community down.
As a community, we need to recognize that this happened in part because of us. And I recognize that this happened partly because of me, in particular. Yes, we want to make the world better, and yes, we should be ambitious in the pursuit of that. But we have been doing so in a way that we can now see can set people on extremely dark and destructive paths.
No promise to do good justifies fraud, or the encouragement of fraud. We have to find a philosophy that does not drive people towards fraud.
We must not see or treat ourselves as above common-sense ethical norms, and must engage criticism with humility. We must fundamentally rethink how to embody utilitarianism where it is useful, within such a framework, recognizing that saying 'but don't lie or do fraud' at the end often does not work.
I know others have worried that our formulation of EA ideas could lead people to do harm. I used to think this was unlikely. I now realize it was not, and that this was part of a predictable pattern that we must end, so that we can be a force for good once more.
I was wrong. I will continue to reflect in the coming months."
And then, ya know, reflect, and do some things.
The statement he actually made I interpret as a plea for time to process while affirming the bare minimum. Where was his follow-up?
I have an answer but I think it would be better to see how others answer this, at least first?
What I meant was that I saw talk of need for systemic/philosophical change and to update, that talk died down, and what I see now does not seem so different from what I saw then. As Ben points out, there has been little turnover. I don't see a difference in epistemics of discussions. I don't see examples of decisions being made using better theories. And so on.
Concretely and recently: the reaction to Elizabeth's post seemed like what I would have expected in 2021, from both LW and the EA Forum. The whole Nonlinear thing was only exposed after Ben Pace put infinite hours into it; otherwise they were plausibly in the process of rooting a lot of EA. Etc. My attempts to get various ideas across don't feel like they're getting different reactions from EAs than I would have expected in 2021.
Yes, people have realized SBF the particular person was bad, but they have not done much to avoid making more SBFs that I can see? Or to guard against such people if they don't do exactly the same thing next time?
The situation with common-sense morality and honesty seems not to have changed much from where I sit, and note e.g. that Oliver and Ben, who interact with them more, seem to basically despair on this front.
I will report back after EAG Boston if it updates me, but this has not been my experience at all, and I am curious what persistent changes you believe I should have noticed, other than adapting to the new funding situation.
Yep. I'm making small fixes to the Substack version as I go, but there have been like 20 tiny ones so I'm waiting to update WP/LW all at once.
I think the following can be and are both true at once:
- What happened was not anything like enough.
- What happened was more than one would expect from a political party, or a social movement such as social justice.
Vegans believe that they should follow a deontological rule, to never eat meat, rather than weighing the costs and benefits of individual food choices. They don't consume meat even when it is expensive (in various senses) to not do so. And they advocate for others to commit to doing likewise.
Whereas EA thinking in other areas instead says to do the math.
I think that's a little more recent than the last confirmation I remembered, but yes, exactly.
I agree that any regulation that hits OA/DM/AN (closed source, CS) also hits open source (OS). If we could actually pass harsh enough restrictions on CS that we'd be fine with OS facing the same level, then that would be great, but I don't see how that happens? Or how the restrictions we put on CS in that scenario don't amount to a de facto OS ban?
That seems much harder, not easier, than getting OS dealt with alone? And also, OS needs limits that are stricter than CS needs, and if we try to hit CS with the levels OS requires we make things so much harder. Yes, OS people are tribal opposition, but they've got nothing on all of CS. And getting incremental restrictions done (e.g. on OS) helps you down the line in my model, rather than hurting you. Also, OS will be used as justifications for why we can't restrict CS, and helps fuel competition that will do the same, and I do think there's a decent chance it matters in the end. Certainly the OS people think so.
Meanwhile, do we think that if we agree to treat OS the same as CS, OS advocates would moderate their position at all? I think no. Their position is to oppose all restrictions on principle. They might be slightly less mad if they're not singled out, but I doubt they would be much less mad if the rules still had the necessary teeth. I've never seen an OS advocate call for restrictions on CS, or even fail to oppose restrictions on CS - unless that restriction was to require them to be OS. Nor should they, given their other beliefs.
On the part after the quote, I notice I am confused. Why do you expect these highly tribal people standing on a principle to come around? What would make them come around? I see them as only seeing every release that does not cause catastrophe as more evidence OS is great, and hardening their position. I am curious what you think would be evidence that would bring the bulk of OS to agree to turn against OS AI scaling enough to support laws against it. I can't think of an example of a big OS advocate who has said 'if X then I would change my mind on that' where X is something that leaves most of us alive.
After all the worries about people publishing things they shouldn't, I found it very surprising to see Oliver advocating for publishing when John wanted to hold back, combined with the request for incremental explanations of progress to justify continued funding.
John seems to have set a very high alternative proof bar here - do things other people can't do. That seems... certainly good enough, if anything too stringent? We need to find ways to allow deep, private work.
In case it needs to be said... fund this John Wentworth character?
Weird confluence here? I don't know what the categories listed have to do with the question of whether a particular intervention makes sense. And we agree of course that any given intervention might not be a good intervention.
For this particular intervention, in addition to slowing development, it allows us to potentially avoid AI being relied upon or trusted in places it shouldn't be, to allow people to push back and protect themselves. And it helps build a foundation of getting such things done to build upon.
Also I would say it is good in a mundane utility sense.
Agreed that it is no defense against an actual ASI, and not a permanent solution. But no one (and I do mean no one, that I can recall anyway) is presenting this as a full or permanent solution, only as an incremental thing one can do.
In this case yes, I should have checked the primary source directly, it was worth the effort - I've learned to triage such checks but got this one wrong given that I already had the primary source handy.
Is evaluation of capabilities, which as you note requires fine-tuning and other such techniques, a realistic thing to properly do continuously during model training, without that being prohibitively slow or expensive? Would doing this be part of the intended RSP?
What I'm saying, as it applies to us, is that in this case and at human level, with our level of affordances/compute/data/etc, humans have found that the best way to maximize X' is to instead maximize some set of things Z, where Z is a complicated array of intermediate goals, so we have evolved/learned to do exactly this. The exact composition of Z aims for X', and often misses. And because of the ways humans interact with the world and other humans, if you don't do something pretty complex you won't get much X' or X, so I don't know what else you were expecting - the world punishes wireheading pretty hard.
But, as our affordances/capabilities/local-data increase and we exert more optimization pressure, we should expect to see more narrow and accurate targeting of X', in ways that generalize less nicely. And indeed I think we largely do see this in humans even over our current ranges, fwiw, in highly robust fashion (although not universally).
Neural networks do not always overfit if we get the settings right. But they very often do; we just throw those out and try again, which is also part of the optimization process. As I understand it, if you give them the ability to fully overfit, they will totally do it, which is one reason you have to mess with the settings: you have to set them up so that the way to maximize X' is not to do what we would call an overfit, either specifically or generalizations of that, which is a large part of why everything is so finicky.
Or, the general human pattern: when dealing with arbitrary outside forces more powerful than you, that you don't fully understand, you learn to (and do) abide by the spirit of the enterprise, avoid ruffling feathers (impact penalties), not aim too narrowly, try not to lie, etc. But that's because we lack the affordances to stop doing that.
I am not fully convinced I am wrong about this, but I am confident I am not making the simple form of this mistake that I believe you think that I am making here. I do get that models don't 'get reward' and that there will be various ways in which the best available way to maximize reward will in effect be to maximize something else that tends to maximize reward, although I expect this difference to shrink as capabilities advance.
Yeah, I agree what I wrote was a simplification (oversimplification?) and the problem is considerably more screwed-up because of it...
So this isn't as central as I'd like, but there are a number of ways that humans react to lie detectors and lie punishers that I expect to highlight things you would expect to see here.
One solution is to avoid knowing. If you don't know, you aren't lying. Since lying is a physical thing, the system won't then detect it. This is ubiquitous in the modern world, the quest to not know the wrong things. The implications seem not great if this happens.
A further solution is to believe the false thing. It's not a lie, if you believe it. People do a ton of this, as well. Once the models start doing this, they both can fool you, and also they fool themselves. And if you have an AI whose world model contains deliberate falsehoods, then it is going to be dangerously misaligned.
A third solution is to not think of it as lying, because that's a category error, words do not have meanings, or that in a particular context you are not being asked for a true answer so giving a false ('socially true' or 'contextually useful', no you do not look fat, yes you are excited to work here) one does not represent a lie, or that your statement is saying something else (e.g. I am not 'bluffing' or 'not bluffing', I am saying that 'this hand was mathematically a raise here, solver says so.')
Part of SBF's solution to this, a form of the third, was to always affirm whatever anyone wanted and then decide later which of his statements were true. I can totally imagine an AI doing a more sensible variation on that because the system reinforces that. Indeed, if we look at e.g. Claude, we see variations on this theme already.
The final solution is to be the professional, and learn how to tell a convincing bald-faced lie, meaning fool the lie detector outright, perhaps with some help from the above. I expect this, too.
I also do not think that, if we observe the internal characteristics of current models and notice a mostly-statistically-invariant property we can potentially use, this gives us confidence that the property holds in the future?
And yes, I would worry a lot about changes in AI designs in response to this as well, if and once we start caring about it, once there is generally more optimization pressure being used as capabilities advance, etc, but going to wrap up there for now.
So that I understand the first point: Do you see the modestly-fast takeoff scenarios you are predicting, presumably without any sort of extended pause, as composing a lot of the worlds where we survive? So much so that (effectively) cutting off our ability to slow down, to have this time bomb barreling down upon us as it were, is not a big deal?
Then, the follow-up, which is what the post-success world would then have to look like if we didn't restrict open source, and thus had fewer affordances to control or restrict relevant AIs. Are you imagining that we would restrict things after we're well into the takeoff, when it becomes obvious?
In terms of waiting for clearer misuse, I think that if that is true you want to lay the groundwork now for the crisis later when the misuse happens so you don't waste it, and that's how politics works. And also that you should not overthink things, if we need to do X, mostly you should say so. (And even if others shouldn't always do that, that I should here and now.)
The debate is largely tribal because (it seems to me) the open source advocates are (mostly) highly tribal and ideological and completely unopen to compromise or nuance or to ever admit a downside, and attack the motives of everyone involved as their baseline move. I don't know what to do about that. Also, they punch far above their weight in us-adjacent circles.
Also, I don't see how not to fight this, without also not fighting for the other anti-faster-takeoff strategies? Which implies a complete reorientation and change of strategies, and I don't see much promise there.
I think that in today's age it is exceedingly hard to 'get away with' doing this without players comparing notes and figuring it out.
I am curious to what extent you or Nate think I understand that frame? And how easy it would be to help me fully get it? I am confused about how confused I am.
I realize this accidentally sounds like it's saying two things at once (that autonomous learning relies on the generator-discriminator gap of the domain, and then that it relies on the gap for the specific agent (or system in general)). I think it's the agent's capabilities that matter, that the domain determines how likely the agent is to have a persistent gap between generation and discrimination, and I don't think the (basic) dynamics are too difficult.
You start with a model M and initial data distribution D. You train M on D such that M is now a model of D. You can now sample from M, and those samples will (roughly) have whatever range of capabilities were to be found in D.
Now, suppose you have some classifier, C, which is able to usefully distinguish samples from M on the basis of that sample's specific level of capabilities. Note that C doesn't have to just be an ML model. It could be any process at all, including "ask a human", "interpret the sample as a computer program trying to solve some problem, run the program, and score the output", etc.
Having C allows you to sample from a version of M's output distribution that has been "updated" on C, by continuously sampling from M until a sample scores well on C. This lets you create a new dataset D', which you can then train M' on to produce a model of the updated distribution.
So long as C is able to provide classification scores which actually reflect a higher level of capabilities among the samples from M / M' / M'' / etc, you can repeat this process to continually crank up the capabilities. If your classifier C was some finetune of M, then you can even create a new C' off of M', and potentially improve the classifier along with your generator. In most domains though, classifier scores will eventually begin to diverge from the qualities that actually make an output good / high capability, and you'll eventually stop benefiting from this process.
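As a purely illustrative, toy sketch of this sample-filter-retrain loop (the "model" here is just a list of numbers, the classifier just prefers larger values, and none of these names refer to any real training API):

```python
import random

# Toy, runnable illustration of the loop described above (hypothetical names only).
# The "model" M is a list of numbers we can sample from; the classifier C scores a
# sample by its numeric value.

def sample_from_model(model_data, k=1000):
    # Sampling from M: draw from the current distribution, with a little noise.
    return [random.choice(model_data) + random.gauss(0, 0.1) for _ in range(k)]

def improve_once(model_data, classifier, keep_fraction=0.2):
    # Sample from M, keep the samples C scores highest (the filtered dataset D'),
    # and "retrain" by adopting D' as the new model of the updated distribution.
    samples = sample_from_model(model_data)
    samples.sort(key=classifier, reverse=True)
    return samples[: int(len(samples) * keep_fraction)]

model = [random.gauss(0, 1) for _ in range(1000)]  # initial data distribution D
for _ in range(5):
    model = improve_once(model, classifier=lambda x: x)  # C prefers higher values
print(sum(model) / len(model))  # the mean ratchets upward each round
```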
This process goes further in domains where it's easier to distinguish generations by their quality. Chess / other board games are extreme outliers in this regard, since you can always tell which of two players actually won the game. Thus, the game rules act as a (pairwise) infallible classifier of relative capabilities. There's some slight complexity around that last point, since a given trajectory could falsely appear good by beating an even worse / non-representative policy, but modern self-play approaches address such issues by testing model versions against a variety of opponents (mostly past versions of themselves) to ensure continual real progress. Pure math is another similarly skewed domain, where building a robust verifier (i.e., a classifier) of proofs is easy. That's why Steven was able to use it as a valid example of where self-play gets you very far.
Most important real world domains do not work like this. E.g., if there were a robust, easy-to-query process that could classify which of two scientific theories / engineering designs / military strategies / etc was actually better, the world would look extremely different.
Thank you, this is helpful for me in thinking further about this. The first paragraph seems almost right, except that instead of the single agent, what you care about is the best trainable or available agent, since the two agents (M and C) need not be the same? What you get from this is an M that maximizes C, right? And the issue, as you note, is that in most domains a predictor of your best available C is going to plateau, so it comes down to whether having M gives you the ability to create a C' that can let you move 'up the chain' of capability here, while preserving any necessary properties at each transition, including alignment. But M will inherit any statistical or other flaws in C, or ways to exploit C, in ways we don't have any reason to presume we can 'rescue ourselves from' in later iterations, and which we would instead expect to amplify over time?
(And thus, you need a security-mindset-level-robust-to-M C at each step for this to be a safe strategy to iterate on a la Christiano or Leike, and you mostly only should expect to get that in rare domains like chess, rather than expecting C to win the capabilities race in general? Or something like that? Again, comment-level rough here.)
On the additional commentary section:
On the first section, we disagree on the degree of similarity in the metaphors.
I agree with you that we shouldn't care about 'degree of similarity' and instead build causal models. I think our actual disagreements here lie mostly in those causal models, the unpacking of which goes beyond comment scope. I agree with the very non-groundbreaking insights listed, of course, but that's not what I'm getting out of it. It is possible that some of this is that a lot of what I'm thinking of as evolutionary evidence, you're thinking of as coming from another source, or is already in your model in another form to the extent you buy the argument (which often I am guessing you don't).
On the difference in SLT meanings, what I meant to say was: I think this is sufficient to cause our alignment properties to break.
In case it is not clear: My expectation is that sufficiently large capabilities/intelligence/affordances advances inherently break our desired alignment properties under all known techniques.
On the passage you find baffling: Ah, I do think we had confusion about what we meant by inner optimizer, and I'm likely still conflating the two somewhat. That doesn't change me not finding this heartening, though? As in, we're going to see rapid big changes in both the inner optimizer's power (in all senses) and also in the nature and amount of training data, where we agree that changing the training data details changes alignment outcomes dramatically.
On the impossible-to-you world: This doesn't seem so weird or impossible to me? And I think I can tell a pretty easy cultural story slash write an alternative universe novel where we honor those who maximize genetic fitness and all that, and have for a long time - and that this could help explain why civilization and our intelligence developed so damn slowly and all that. Although to truly make the full evidential point that world then has to be weirder still where humans are much more reluctant to mode shift in various ways. It's also possible this points to you having already accepted from other places the evidence I think evolution introduces, so you're confused why people keep citing it as evidence.
The comment in response to parallels provides some interesting thoughts and I agree with most of it. The two concrete examples are definitely important things to know. I still notice the thing I noticed in my comment about the parallels - I'd encourage thinking about what similar logic would say in the other cases?
On concrete example 2: I see four bolded claims in 'fast takeoff is still possible.' Collectively, to me, in my lexicon and way of thinking about such things, they add up to something very close to 'alignment is easy.'
The first subsection says human misalignment does not provide evidence for AI misalignment, which isn't one of the two mechanisms (as I understand this?), and is instead arguing against an alignment difficulty.
The bulk of the second subsection, starting with 'Let’s consider eight specific alignment techniques,' looks to me like an explicit argument that alignment looks easy based on your understanding of the history from AI capabilities and alignment developments so far?
The third subsection seems to also spend most of its space on arguing its scenario would involve manageable risks (e.g. alignment being easy), although you also argue that evolution/culture still isn't 'close enough' to teach us anything here?
I can totally see how these sections could have been written out with the core intention to explain how distinct-from-evolution mechanisms could cause fast takeoffs. From my perspective as a reader, I think my response and general takeaway that this is mostly an argument for easy alignment is reasonable on reflection, even if that's not the core purpose it serves in the underlying structure, and it's perhaps not a fully general argument.
On concrete example 3: I agree that what I said was a generalization of what you said, and you instead said something more specific. And that your later caveats make it clear you are not so confident that things will go smoothly in the future. So yes I read this wrong and I'm sorry about that.
But also I notice I am confused here - if you didn't mean for the reader to make this generalization, if you don't think that the failure of current capabilities advances to break current alignment techniques is strong evidence that future capabilities advances won't break then-optimal alignment techniques, then why are we analyzing all these expected interactions here? Why state the claim that such techniques 'already generalize' (which they currently mostly do, as far as I know, which is not terribly far) if it isn't a claim that they will likely generalize in the future?
On Quintin's secondly's concrete example 1 from above:
I think the core disagreement here is that Quintin thinks that you need very close parallels in order for the evolutionary example to be meaningful, and I don't think that at all. And neither of us can fully comprehend why the other person is going with as extreme a position as we are on that question?
Thus, he says, yes of course you do not need all those extra things to get misalignment, I wasn't claiming that, all I was saying was this would break the parallel. And I'm saying both (1) that misalignment could happen these other ways which he agrees with in at least some cases (but perhaps not all cases) and (2) also I do not think that these extra clauses are necessary for the parallel to matter.
And also (3) yes, I'll totally cop to, because I don't see why the parallel is in danger with these changes, I didn't fully have in my head the distinction Quintin is trying to draw here, when I was writing that.
But I will say that, now that I do have it in my head, that I am at least confused why those extra distinctions are necessary for the parallel to hold, here? Our models of what is required here are so different that I'm pretty confused about it, and I don't have a good model of why e.g. it matters that there are 9 OOMs of difference, or that the creation of the inner optimizer is deliberate (especially given that nothing evolution did was in a similar sense deliberate, as I understand these things at least - my model is that evolution doesn't do deliberate things at all). And in some cases, to the extent that we want a tighter parallel, Quintin's requirements seem to move away from that? Anyway, I notice I am confused.
Concrete example 4: Am I wrong here that you're arguing that this path still exhibits key differences from cultural development and thus evolution does not apply? And then you also argue that this path does not cause additional severe alignment difficulties beyond those above. So I'm not sure where the misreading is here. After that, I discuss a disagreement with a particular claim.
(Writing at comment-speed, rather than carefully-considered speed, apologies for errors and potential repetitions, etc)
On the Evo-Clown thing and related questions in the Firstly section only.
I think we understand each other on the purpose of the Evo-Clown analogy, and I think it is clear what our disagreement is here in the broader question?
I put in the paragraph Quintin quoted in order to illustrate that, even in an intentionally-absurd example intended to illustrate that A and B share no causal factors, A and B still share clear causal factors, and the fact that A happened this way should give you substantial pause about the prospects for B, versus A never having happened at all and the things that caused A not having happened. I am curious (since Quintin does not comment) whether he agrees about the example, now that I bring up the reasons to be concerned.
The real question is the case of evolution versus AI development.
I got challenged by Quintin and by others as interpreting Quintin too broadly when I said:
That seems like quite a leap. If there is one particular development in humanity’s history that we can fully explain, we should then not cite evolution in any way, as an argument for anything?
In response to Quintin saying:
- THEN, there's no reason to reference evolution at all when forecasting AI development rates, not as evidence for a sharp left turn, not as an "illustrative example" of some mechanism / intuition which might supposedly lead to a sharp left turn in AI development, not for anything.
I am happy to accept the clarification that I interpreted Quintin's statement stronger than he intended it.
I still am confused how else I could have interpreted the original statement? But that does not matter, what matters is the disagreements we still clearly do have here.
I now understand Quintin's model as saying (based on the comment plus his OP) that evolution so obviously does an overdetermined sharp left turn that it isn't evidence of anything (e.g. that the world I proposed as an alternative breaks so many of his models that it isn't worth considering)?
I agree that if evolution's path is sufficiently overdetermined, then there's no reason to cite that path as evidence. In which case we should instead be discussing the mechanisms that are overdetermining that result, and what they imply.
I think the reason we talk about evolution here is exactly because for most people, the underlying mechanisms very much aren't obvious and overdetermined before looking at the results - if you skipped over the example people would think you were making a giant leap.
Concrete example 2: One general hypothesis you could have about RL agents is "RL agents just do what they're trained to do, without any weirdness". (To be clear, I'm not endorsing this hypothesis. I think it's much closer to being true than most on LW, but still false.) In the context of AI development, this has pretty benign implications. In the context of evolution, due to the bi-level nature of its optimization process and the different data that different generations are "trained" on, this causal factor in the evolution graph predicts significant divergence between the behaviors of ancestral and modern humans.
Zvi says this is an uncommon standard of epistemics, for there to be no useful inferences from one set of observations (evolutionary outcomes) to another (AI outcomes). I completely disagree. For the vast majority of possible pairs of observations, there are not useful inferences to draw. The pattern of dust specks on my pillow is not a useful reference point for making inferences about the state of the North Korean nuclear weapons program. The relationship between AI development and human evolution is not exceptional in this regard.
Ok, sure. I agree that for any given pair of facts there is essentially nothing to infer from one about the other, given what else we already know, and that the two facts Quintin cites are a valid example. But it seems wrong to say that AI developments and evolutionary developments relate to each other in the same way, or in the same reference class, as the dust specks on your pillow relate to the North Korean nuclear weapons program? Or that the distinctions proposed should generally be of a sufficient degree to imply there are no implications from one to the other?
What I was saying that Quintin is challenging in the second paragraph above, specifically, was not that for observations A and B it would be unusual for A to not have important implications for B. What I was saying was that there being distinctions in the causal graphs behind A and B is not a good reason to dismiss A having implications for B - certainly differences reduce it somewhat, but most of the time that A impacts B, there are important causal graph differences that could draw a similar parallel. And, again, this would strike down most reference class arguments.
Quintin does say there are non-zero implications in the comment, so I suppose the distinction does not much matter in the end. Nor does it much matter whether we are citing evolution, or citing our underlying models that also explain evolution's outcomes, if we can agree on those models?
As in, we would be better served looking at:
One general hypothesis you could have about RL agents is "RL agents just do what they're trained to do, without any weirdness." In the context of AI development, this has pretty benign implications.
I think I kind of... do believe this? For my own perhaps quite weird definitions of 'weirdness' and 'what you train it for'? And for those values, no, this is not benign at all, because I don't consider SLT behaviors to be weird when you have the capabilities for them. That's simply what you would expect, including from a human in the same spot, why are we acting so surprised?
If you define 'weirdness' sufficiently differently then it would perhaps be benign, but I have no idea why you would expect this.
And also, shouldn't we use our knowledge of humans here, when faced with similar situations? Humans, a product of evolution, do all sorts of local SLTs in situations far removed from their training data, the moment you give them the affordance to do so and the knowledge that they can.
It is also possible we are using different understandings of SLT, and Quintin is thinking about it more narrowly than I am, as his later statements imply. In that case, I would say that I think the thing I care about, in terms of whether it happens, is the thing (or combination of things) I'm talking about.
Thus, in my view, humans did not do only the one big anti-evolution (?) SLT. Humans are constantly doing SLTs in various contexts, and this is a lot of what I am thinking about in this context.
What prevents there being useful updates from evolution to AI development is the different structure of the causal graphs.
Aha?!?
Quintin, I think (?) is saying that the fact that evolution provided us with a central sharp left turn is not evidence, because that is perfectly compatible with and predicted by AI models that aren't scary.
So I notice I disagree with this twice.
First, I don't think that the second because clause entirely holds, for reasons that I largely (but I am guessing not entirely) laid out in my OP, for reasons that I am confident Quintin disagrees with and would take a lot to untangle, although I do agree there is some degree of overdeterminedness here where if we hadn't done the exact SLT we did but had still ramped up our intelligence, we would have instead done a slightly-to-somewhat different-looking SLT later.
Second, I think this points out a key thing I didn't say explicitly and should have, which is the distinction between the evidence that humans did all their various SLTs (yes, plural, both collectively and individually), and the evidence that humans did these particular SLTs in these particular ways because of these particular mechanisms. Which I do see as highly relevant.
I can imagine a world where humans did an SLT later in a different way, and are less likely to do them on an individual level (note: I agree that this may be somewhat non-standard usage of SLT, but hopefully it's mostly clear from context what I'm referring to here?), and everything happened slower and more continuously (on the margin presumably we can imagine this without our models breaking, if only via different luck). And where we look at the details and say: actually, it's pretty hard to get this kind of thing to happen, and moving humans out of their training distributions causes them to hold up really well, in the way we'd metaphorically like out of AIs, even when they are smart enough and have enough info and reflection time to know better, and so on.
(EDIT: It's late, and I've now responded in stages to the whole thing, which as Quintin noted was longer than my OP. I'm thankful for the engagement, and will read any further replies, but will do my best to keep any further interactions focused and short so this doesn't turn into an infinite time sink that it clearly could become, even though it very much isn't a demon thread or anything.)
Thank you for the very detailed and concrete response. I need to step through this slowly to process it properly and see the extent to which I did misunderstand things, or places where we disagree.
Agreed, but that was the point I was trying to make. If you take away the AI's ability to lie, it gains the advantage that you believe what it says, that it is credible. That is especially dangerous when the AI can make credible threats (which potentially include threats to create simulations, but simpler things work too) and also credible promises if only you'd be so kind as to [whatever helps the AI get what it wants.]
We don't have the human model weights, so we can't use it.
My guess is that if we had sufficiently precise and powerful brain scans, and used a version of it tuned to humans, it would work, but that humans who cared enough would in time figure out how to defeat it at least somewhat.