Terminology: <something>-ware for ML? 2024-01-03T11:42:37.710Z
Alignment, conflict, powerseeking 2023-11-22T09:47:08.376Z
Careless talk on US-China AI competition? (and criticism of CAIS coverage) 2023-09-20T12:46:16.696Z
Invading Australia (Endless Formerlies Most Beautiful, or What I Learned On My Holiday) 2023-09-08T15:33:27.748Z
Hertford, Sourbut (rationality lessons from University Challenge) 2023-09-04T18:44:24.359Z
Un-unpluggability - can't we just unplug it? 2023-05-15T13:23:12.543Z
Oliver Sourbut's Shortform 2022-07-14T15:39:15.832Z
Deliberation Everywhere: Simple Examples 2022-06-27T17:26:20.848Z
Deliberation, Reactions, and Control: Tentative Definitions and a Restatement of Instrumental Convergence 2022-06-27T17:25:45.986Z
Feature request: voting buttons at the bottom? 2022-06-24T14:41:55.268Z
Breaking Down Goal-Directed Behaviour 2022-06-16T18:45:11.872Z
You Only Get One Shot: an Intuition Pump for Embedded Agency 2022-06-09T21:38:23.577Z
Gato's Generalisation: Predictions and Experiments I'd Like to See 2022-05-18T07:15:51.488Z
Conditions for mathematical equivalence of Stochastic Gradient Descent and Natural Selection 2022-05-09T21:38:59.772Z
Motivations, Natural Selection, and Curriculum Engineering 2021-12-16T01:07:26.100Z
Some real examples of gradient hacking 2021-11-22T00:11:35.047Z


Comment by Oliver Sourbut on Terminology: <something>-ware for ML? · 2024-01-10T22:43:35.462Z · LW · GW

Incidentally I noticed Yudkowsky uses 'brainware' in a few places (e.g. in conversation with Paul Christiano). But it looks like that's referring to something more analogous to 'architecture and learning algorithms', which I'd put more in the 'software' camp when it comes to the taxonomy I'm pointing at (the 'outer designer' is writing it deliberately).

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-10T19:03:52.131Z · LW · GW

Unironically, I think it's worth anyone interested skimming that Verma & Pearl paper for the pictures :) especially fig 2

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-10T18:48:29.326Z · LW · GW

Mmm, I misinterpreted at first. It's only a v-structure if and are not connected. So this is a property which needs to be maintained effectively 'at the boundary' of the fully-connected cluster which we're rewriting. I think that tallies with everything else, right?

ETA: both of our good proofs respect this rule; the first Reorder in my bad proof indeed violates it. I think this criterion is basically the generalised and corrected version of the fully-connected bookkeeping rule described in this post. I imagine if I/someone worked through it, this would clarify whether my handwave proof of Frankenstein Stitch is right or not.

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-10T18:20:31.725Z · LW · GW

That's concerning. It would appear to make both our proofs invalid.

But I think your earlier statement about incoming vs outgoing arrows makes sense. Maybe Verma & Pearl were asking for some other kind of equivalence? Grr, back to the semantics I suppose.

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-10T18:18:24.796Z · LW · GW

Aha. Preserving v-structures (colliders like ) is necessary and sufficient for equivalence[1]. So when rearranging fully-connected subgraphs, certainly we can't do it (cost-free) if it introduces or removes any v-structures.

Plausibly if we're willing to weaken by adding in additional arrows, there might be other sound ways to reorder fully-connected subgraphs - but they'd be non-invertible. Haven't thought about that.

  1. Verma & Pearl, Equivalence and Synthesis of Causal Models 1990 ↩︎
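
To make that criterion concrete: two DAGs are Markov-equivalent iff they share the same skeleton and the same v-structures (colliders whose parents are non-adjacent). Here's a minimal Python sketch of the check - the edge-list encoding and the variable names are mine, not from the thread:

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of a DAG given as (parent, child) pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Colliders a -> c <- b where a and b are not adjacent."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    skel = skeleton(edges)
    return {(frozenset((a, b)), c)
            for c, ps in parents.items()
            for a, b in combinations(sorted(ps), 2)
            if frozenset((a, b)) not in skel}

def markov_equivalent(g1, g2):
    """Verma-Pearl criterion: same skeleton, same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# The chain X -> Z -> Y and its reversal are equivalent...
chain = [("X", "Z"), ("Z", "Y")]
reversed_chain = [("Y", "Z"), ("Z", "X")]
# ...but redirecting Z's arrows to make the collider X -> Z <- Y
# introduces a v-structure, so it is not:
collider = [("X", "Z"), ("Y", "Z")]
```

This is why redirecting arrows inside a subgraph is only cost-free when no v-structure is created or destroyed at its boundary.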

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-10T17:12:05.280Z · LW · GW

Mhm, OK I think I see. But appear to me to make a complete subgraph, and all I did was redirect the . I confess I am mildly confused by the 'reorder complete subgraph' bookkeeping rule. It should apply to the in , right? But then I'd be able to deduce which is strictly different. So it must mean something other than what I'm taking it to mean.

Maybe I need to go back and stare at the semantics for a bit. (But this syntactic view with motifs and transformations is much nicer!)

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-10T15:42:25.715Z · LW · GW

Perhaps more importantly, I think with Node Introduction we really don't need after all?

With Node Introduction and some bookkeeping, we can get the and graphs topologically compatible, and Frankenstein them. We can't get as neat a merge as if we also had - in particular, we can't get rid of the arrow . But that's fine, we were about to draw that arrow in anyway for the next step!

Is something invalid here? Flagging confusion. This is a slightly more substantial claim than the original proof makes, since it assumes strictly less. Downstream, I think it makes the Resample unnecessary.

ETA: it's cleared up below - there's an invalid Reorder here (it removes a v-structure).

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-10T14:33:58.848Z · LW · GW

I had another look at this with a fresh brain and it was clearer what was happening.

TL;DR: It was both 'I'm missing something' and, a little bit, 'Frankenstein is invalid' (it needs an extra condition which is sort of implicit in the post). As I guessed, with a little extra bookkeeping, we don't need Stitching for the end-to-end proof. I'm also fairly confident Frankenstein subsumes Stitching in the general case. A 'deductive system' lens makes this all clearer (for me).

My Frankenstein mistake

The key invalid move I was making when I said

But this same move can alternatively be done with the Frankenstein rule, right?

is that Frankenstein requires all graphs to be over the same set of variables. This is kind of implicit in the post, but I don't see it spelled out. I was applying it to an graph ( absent) and an graph ( absent). No can do!

Skipping Stitch in the end-to-end proof

I was right though, Frankenstein can be applied. But we first have to do 'Node Introduction' and 'Expansion' on the graphs to make them compatible (these extra bookkeeping rules are detailed further below.)

So, to get myself in a position to apply Frankenstein on those graphs, I have to first (1) introduce to the second graph (with an arrow from each of , , and ), and (2) expand the 'blanket' graph (choosing to maintain topological consistency). Then (3) we Frankenstein them, which leaves dangling, as we wanted.

Next, (4) I have to introduce to the first graph (again with an arrow from each of , , and ). I also have a topological ordering issue with the first Frankenstein, so (5) I reorder to the top by bookkeeping. Now (6) I can Frankenstein those, to sever the as hoped.

But now we've performed exactly the combo that Stitch was performing in the original proof. The rest of the proof proceeds as before (and we don't need Stitch).

More bookkeeping rules

These are both useful for 'expansive' stuff which is growing the set of variables from some smaller seed. The original post mentions 'arrow introduction' but nothing explicitly about nodes. I got these by thinking about these charts as a kind of 'deductive system'.

Node introduction

A graph without all variables is making a claim about the distribution with those other variables marginalised out.

We can always introduce new variables - but we can't (by default) assume anything about their independences. It's sound (safe) to assume they're dependent on everything else - i.e. they receive an incoming arrow from everywhere. If we know more than that (regarding dependencies), it's expressed as absence of one or another arrow.

e.g. a graph with is making a claim about . If there's also a , we haven't learned anything about its independences. But we can introduce it, as long as it has arrows , , and .
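
A minimal sketch of the rule, in a toy edge-list encoding of my own (names hypothetical): the introduced variable receives an arrow from every existing node, which is exactly the 'assume nothing about its independences' default:

```python
def introduce_node(edges, nodes, new):
    """Soundly add `new` to a DAG over `nodes`: since the graph so far
    says nothing about `new`'s independences, it must (by default)
    receive an incoming arrow from every existing node."""
    assert new not in nodes
    return list(edges) + [(v, new) for v in nodes]

g = [("X", "Y")]                      # a claim about P(X, Y)
g2 = introduce_node(g, ["X", "Y"], "W")
# W lands 'at the bottom' of any topological order, depending on everything
```

Any known independence for the new variable then shows up as deleting one of those default arrows.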

Node expansion aka un-combine

A graph with combined nodes is making a claim about the distribution as expressed with those variables appearing jointly. There's nothing expressed about their internal relationship.

We can always expand them out - but we can't (by default) assume anything about their independences. It's sound to expand and spell them out in any fully-connected sub-DAG - i.e. they have to be internally fully dependent. We also have to connect every incoming and outgoing edge to every expanded node, i.e. if there's a dependency between the combination and something else, there's a dependency between each expanded node and that same thing.

e.g. a graph with is making a claim about . If is actually several variables, we can expand them out, as long as we respect all possible interactions that the original graph might have expressed.
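
A sketch of expansion in the same toy edge-list encoding (node names are mine): the combined node is replaced by parts which are fully connected internally, in some chosen order, and which each inherit every external arrow:

```python
from itertools import combinations

def expand_node(edges, combined, parts):
    """Replace `combined` with `parts`: fully connect the parts internally
    (in the given order) and attach every external arrow to each part."""
    out = []
    for a, b in edges:
        if a == combined:
            out.extend((p, b) for p in parts)
        elif b == combined:
            out.extend((a, p) for p in parts)
        else:
            out.append((a, b))
    # internal full connection, respecting the order of `parts`
    out.extend((parts[i], parts[j]) for i, j in combinations(range(len(parts)), 2))
    return out

g = [("X", "J"), ("J", "Y")]          # J is a combined node
g2 = expand_node(g, "J", ["J1", "J2"])
```

The choice of internal order is free, which is what lets us pick one consistent with the other graphs we want to Frankenstein against.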

Deductive system

I think what we have on our hands is a 'deductive system' or maybe grandiosely a 'logic'. The semantics is actual distributions and divergences. The syntax is graphs (with divergence annotation).

An atomic proposition is a graph together with a divergence annotation , which we can write .

Semantically, that's when the 'true distribution satisfies up to KL divergence' as you described[1]. Crucially, some variables might not be in the graph. In that case, the distributions in the relevant divergence expression are marginalised over the missing variables. This means that the semantics is always under-determined, because we can always introduce new variables (which are allowed to depend on other variables however they like, being unconstrained by the graph).

Then we're interested in sound deductive rules like

Syntactically that is 'when we have deduced we can deduce '. That's sound if, for any distribution satisfying we also have satisfying .
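
For concreteness, my reading of the satisfaction relation (following the divergence definition in the original post - take the exact form as an assumption, not a quote):

```latex
P \models (G, \epsilon)
\quad\iff\quad
D_{\mathrm{KL}}\!\left(
  P(X_1,\dots,X_n)
  \,\middle\|\,
  \prod_i P\!\left(X_i \mid X_{\mathrm{pa}_G(i)}\right)
\right) \le \epsilon
```

with any variables absent from $G$ marginalised out of $P$ before taking the divergence.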

Gesture at general Frankenstitch rule

More generally, I'm reasonably sure Stitch is secretly just multiple applications of Frankenstein, as in the example above. The tricky bit I haven't strictly worked through is when there's interleaving of variables on either side of the blanket in the overall topological ordering.

A rough HANDWAVE proof sketch, similar in structure to the example above:

  • Expand the blanket graph
    • The arrows internal to , , and need to be complete
    • We can always choose a complete graph consistent with the , , and parts of the original graphs (otherwise there wouldn't be an overall consistent topology)
    • Notice that the connections are all to all , which is not necessarily consistent with the original graph
      • and similarly for the arrows
      • (there could be arrows in the original)
  • Introduce to the graph (and vice versa)
    • The newly-introduced nodes are necessarily 'at the bottom' (with arrows from everything else)
    • We can always choose internal connections for the introduced s consistent with the original graph
  • Notice that the connections and in the augmented graph all keep at the bottom, which is not necessarily consistent with the original graph (and vice versa)
    • But this is consistent with the Expanded blanket graph
  • We 'zip from the bottom' with successive bookkeeping and Frankensteins
    • Just like in the example above, where we got the sorted out and then moved the introduced to the 'top' in preparation to Frankenstein in the graph, I think there should always be enough connections between the introduced nodes to 'move them up' as needed for the stitch to proceed

I'm not likely to bother proving this strictly, since Stitch is independently valid (though it'd be nice to have a more parsimonious arsenal of 'basic moves'). I'm sharing this mainly because I think Expansion and Node Introduction are of independent relevance.

  1. More formally, over variables is satisfied by distribution when . (This assumes some assignment of variables in to variables in .) ↩︎

Comment by Oliver Sourbut on Deceptive AI ≠ Deceptively-aligned AI · 2024-01-10T09:06:54.464Z · LW · GW

I'd probably be more specific and say 'gradient hacking' or 'update hacking' for deception of a training process which updates NN internals.

I see what you're saying: in practice, a deployment scenario is often implicitly a selection scenario (should we run the thing more/less, or turn it off?). So deceptive alignment at deploy-time could be a means of training (selection) hacking.

More centrally, 'training hacking' might refer to a situation with denser oversight and explicit updating/gating.

Deceptive alignment during this period is just one way of training hacking (could alternatively hack exploration, cyber crack and literally hack oversight/updating, ...). I didn't make that clear in my original comment and now I think there's arguably a missing term for 'deceptive alignment for training hacking' but maybe that's fine.

Comment by Oliver Sourbut on Deceptive AI ≠ Deceptively-aligned AI · 2024-01-08T13:57:45.682Z · LW · GW

I mean the deliberation happens in a neural network. Maybe you thought I meant 'net' as in 'after taking into account all contributions'? I should say 'NN-internal' instead, probably.

Comment by Oliver Sourbut on Deceptive AI ≠ Deceptively-aligned AI · 2024-01-08T10:29:04.516Z · LW · GW

Some people seem to argue that concrete evidence of deception is no evidence for deceptive alignment. I had a great discussion with @TurnTrout a few weeks ago about this, where we homed in on our agreement and disagreement here. Maybe we'll share some content from it at some point. In the mean time, my take after that is roughly

  • deception was obviously a priori going to be gettable, and now we have concrete evidence it occurs (approx 0 update for me, but >0 update for some)
  • this does support an expectation of deceptive alignment in my terms, because deception about intentions is a pretty central case of deception, and with misaligned intentions, deception is broadly instrumental (again not much update for me, but >0 update for others)
  • it's still unclear how much deliberation about deception can/will happen 'net-internally' vs externalised
    • externalised deliberation about deceptive alignment is still deceptive alignment in my terms!
      • I keep notes in my diary about how I'm going to coordinate my coup
    • steganographic deliberation about deceptive alignment is scarier
      • my notes are encrypted
    • fully-internal deliberation about deceptive alignment is probably scarier still, because probably harder to catch?
      • like, it's all in my brain

I think another thing people are often arguing about without making it clear is how 'net internal' the relevant deliberation/situational-awareness can/will be (and in what ways they might be externalised)! For me, this is a really important factor (because it affects how and how easily we can detect such things), but it's basically orthogonal to the discussion about deception and deceptive alignment.[1]

More tentatively, I think net-internal deliberation in LLM-like architectures is somewhat plausible - though we don't have mechanistic understanding, we have evidence of sims/characters producing deliberation-like outputs without (much or any) intermediate chains of thought. So either there's not-very-general pattern-matching in there which gives rise to that, or there's some more general fragments of net-internal deliberation. Other AI systems very obviously have internal deliberation, but these might end up moot depending on what paths to AGI will/should be taken.

  1. ETA I don't mean to suggest net-internal vs externalised is independent from discussions about deceptive alignment. They move together, for sure, especially when discussing where to prioritise research. But they're different factors. ↩︎

Comment by Oliver Sourbut on Deceptive AI ≠ Deceptively-aligned AI · 2024-01-08T10:14:48.409Z · LW · GW

This is great, and thanks for pointing at this confusion, and raising the hypothesis that it could be a confusion of language! I also have this sense.

I'd strongly agree that separating out 'deception' per se is importantly different from more specific phenomena. Deception is just, yes, obviously this can and does happen.

I tend to use 'deceptive alignment' slightly more broadly - i.e. something could be deceptively aligned post-training, even if all updates after that point are 'in context' or whatever analogue is relevant at that time. Right? This would be more than 'mere' deception, if it's deception of operators or other nominally-in-charge people regarding the intentions (goals, objectives, etc) of the system. Also doesn't need to be 'net internal' or anything like that.

I think what you're pointing at here by 'deceptive alignment' is what I'd call 'training hacking', which is more specific. In my terms, that's deceptive alignment of a training/update/selection/gating/eval process (which can include humans or not), generally construed to be during some designated training phase, but could also be ongoing.

No claim here to have any authoritative ownership over those terms, but at least as a taxonomy, those things I'm pointing at are importantly distinct, and there are more than two of them! I think the terms I use are good.

Comment by Oliver Sourbut on Terminology: <something>-ware for ML? · 2024-01-05T09:45:24.602Z · LW · GW

I wasn't keen on this, but your justification updated me a bit. I think the most important distinction is indeed the 'grown/evolved/trained/found, not crafted', and 'brainware' didn't immediately evoke that for me. But you're right, brains are inherently grown, they're very diverse, we can probe them but don't always/ever grok them (yet), structure is somewhat visible, somewhat opaque, they fit into a larger computational chassis but adapt to their harness somewhat, properties and abilities can be elicited by unexpected inputs, they exhibit various kinds of learning on various timescales, ...

Comment by Oliver Sourbut on Terminology: <something>-ware for ML? · 2024-01-05T09:39:41.569Z · LW · GW

Mold like fungus or mold like sculpt? I like this a bit, and I can imagine it might... grow on me. (yeuch)

Mold-as-in-sculpt has the benefit that it encompasses weirder stuff like prompt-wrangled and scaffolded stuff, and also kinda large-scale GOFAI-like things à la 'MCTS' and whatnot.

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-04T14:57:46.411Z · LW · GW

Yeah, thinking slightly aloud, I tentatively think Frankenstein needs an extra condition like the blanket stitch condition... something which enforces the choice of topo ordering to be within the right class of topo orderings? That's what the chain does - it means we can assign orderings or , but not e.g. , even though that order is consistent with both of the other original graphs.

If I get some time I'll return to this and think harder but I can't guarantee it.

ETA I did spend a bit more time, and the below mostly resolves it: I was indeed missing something, and Frankenstein indeed needs an extra condition, but you do need .

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-04T14:31:21.819Z · LW · GW

But this same move can alternatively be done with the Frankenstein rule, right? (I might be missing something.) But Frankenstein has no such additional requirement, as stated. If I'm not missing something, I think Frankenstein might be invalid as stated (like maybe it needs an analogous extra condition). Haven't thought this through yet.

i.e. I think either

  • I'm missing something
  • Frankenstein is invalid
  • You don't need
Comment by Oliver Sourbut on Natural Latents: The Math · 2024-01-04T13:22:02.822Z · LW · GW

One thing that initially stood out to me on the fundamental theorem was: where did the arrow come from? It 'gets introduced' in the first bookkeeping step (we draw and then reorder the subgraph at each ).

This seemed suspicious to me at first! It seemed like kind of a choice, so what if we just didn't add that arrow? Could we land at a conclusion of AND ? That's way too strong! But I played with it a bit, and there's no obvious way to do the second frankenstitch which brings everything together unless you draw in that extra arrow and rearrange. You just can't get a globally consistent topological ordering without somehow becoming ancesterable to . (Otherwise the glommed variables interfere with each other when you try to find 's ancestors in the stitch.)

Still, this move seems quite salient - in particular that arrow-addition feels something like the 'lossiest' step in the proof (except for the final bookkeeping which gloms all the together, implicitly drawing a load of arrows between them all)?

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-04T12:29:20.411Z · LW · GW

(I said Frankenstitch advisedly, I think they're kinda the same rule, but in particular in this case it seems either rule does the job.)

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-04T12:21:28.202Z · LW · GW

I might be missing something, but I don't see where is actually used in the worked example.

It seems that there's a consistent topo order between the and diagrams, so we Frankenstitch them. Then we draw an edge from to and reorder (bookkeep). Then we Frankenstein the diagrams and the resulting diagram again. Then we collect the together (bookkeep). Where's used?

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-04T12:10:14.967Z · LW · GW

Oh yeah, I don't know how common it is, but when manipulating graphs, if there's a topo order, I seem to strongly prefer visualising things with that order respected on the page (vertically or horizontally). So your images committed a few minor crimes according to that aesthetic. I can also imagine that some other aesthetics would strongly prefer writing things the way you did though, e.g. with . (My preference would put and slightly lower, as you did with the , graph.)

Comment by Oliver Sourbut on Some Rules for an Algebra of Bayes Nets · 2024-01-04T11:59:46.552Z · LW · GW

This is really great!

A few weeks ago I was playing with the Good Regulator and John's Gooder version and incidentally I also found myself pulling out some simple graphical manipulation rules. Your 'Markov re-rooting' came into play, and also various of the 'Bookkeeping' rules. You have various more exciting rules here too, thanks!

I also ended up noticing a kind of 'good regulator motif' as I tried expanding the setting with a few temporal steps and partial observability and so forth. Basically, doing some bookkeeping and coarse-graining, you can often find a simple GR structure within a larger 'regulator-like' structure, and conclude things from that. I might publish it at some point but it's not too exciting yet. I do think the overall move of finding motifs in manipulated graphs is solid, and I have a hunch there's a cool mashup of Bayes-net algebra and Gooder Regulator waiting to be found!

I love the Frankenstein rule. FYI, the orderings you're talking about which are 'consistent' with the graphs are called topological orderings, and every DAG has (at least) one. So you could concisely phrase some of your conditions along the lines of 'shared topological order' or 'mutually-consistent topological ordering'.

BTW causal graphs are usually restricted to be DAGs, right? (i.e., the 'causes' relation is acyclic and antisymmetric.) So in this setting where we are peering at various fragments which are assumed to correspond to some 'overall mega-distribution', it might come in handy to assume the overall distribution has some acyclic presentation - then there's always a(t least one) topo ordering available to be invoked.
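
The 'mutually-consistent topological ordering' condition is also cheap to check mechanically: topologically sort the union of the edge sets, which succeeds exactly when no graph's ordering constraints conflict. A sketch using Python's stdlib (the edge-list encoding is my own):

```python
from graphlib import TopologicalSorter, CycleError

def shared_topo_order(*dags):
    """Return one topological order consistent with every DAG given as
    (parent, child) edge pairs, or None if their constraints conflict."""
    ts = TopologicalSorter()
    for dag in dags:
        for parent, child in dag:
            ts.add(child, parent)   # `child` has predecessor `parent`
    try:
        return list(ts.static_order())
    except CycleError:
        return None

# X -> Y and Y -> Z are jointly consistent...
order = shared_topo_order([("X", "Y")], [("Y", "Z")])
# ...but adding Z -> X forces a cycle in the union, so no shared order exists
none_order = shared_topo_order([("X", "Y")], [("Y", "Z")], [("Z", "X")])
```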

Comment by Oliver Sourbut on Terminology: <something>-ware for ML? · 2024-01-04T11:38:21.629Z · LW · GW

@the gears to ascension , could you elaborate on what the ~25% react on 'hardware' in

Would it be useful to have a term, analogous to 'hardware', ...

means? Is it responding to the whole sentence, 'Would it be useful to have...?' or some other proposition?

Comment by Oliver Sourbut on Terminology: <something>-ware for ML? · 2024-01-04T11:34:55.306Z · LW · GW

Separately, I'm not a fan of 'evolveware' or 'evoware' in particular, though I can't put my finger on exactly why. Possibly it's because of a connotation of ongoing evolution, which is sorta true in some cases but could be misleading as a signifier. Though the same criticism could be levelled against 'ML-ware', which I like more.

Comment by Oliver Sourbut on Terminology: <something>-ware for ML? · 2024-01-04T11:32:01.877Z · LW · GW

I hate to wheel this out again but evolution-broadly-construed is actually a very close fit for gradient methods. Agreed there's a whole lot of specifics in biological natural selection, and a whole lot of specifics in gradient-methods-as-practiced, but they are quite akin really.

Comment by Oliver Sourbut on Terminology: <something>-ware for ML? · 2024-01-03T21:58:55.415Z · LW · GW

This is nice in its way, and has something going for it, but to me it's far too specific, while also missing the 'how we got this thing' aspect which (I think) is the main reason to emphasise the difference through terminology.

Comment by Oliver Sourbut on Terminology: <something>-ware for ML? · 2024-01-03T21:56:51.493Z · LW · GW

This is simple but surprisingly good, for the reasons you said. It's also easy to say and write. Along with fuzz-, and hunch-, this is my favourite candidate so far.

Comment by Oliver Sourbut on AI Is Not Software · 2024-01-03T14:57:08.962Z · LW · GW

Hardware, software, ... deepware? I quite like this actually. It evokes deep learning, obviously, but also 'deep' maybe expresses the challenge of knowing what's happening inside it. Doesn't evoke the 'found/discovered' nature of it.

Comment by Oliver Sourbut on Terminology: <something>-ware for ML? · 2024-01-03T14:26:11.321Z · LW · GW

Nice! 'Idioware'? Risks sounding like 'idiotware'...

Comment by Oliver Sourbut on AI Is Not Software · 2024-01-03T11:43:07.362Z · LW · GW

noware? everyware? anyware? selfaware? please-beware?

(jokes, don't crucify me)

I have a serious question with some serious suggestions too

Comment by Oliver Sourbut on On the lethality of biased human reward ratings · 2024-01-02T11:57:12.142Z · LW · GW

If humans had the computational capacity, they would lie a lot more and calculate personal advantage a lot more. But since those are both computationally expensive, and therefore can be caught-out by other humans, the heuristic / value of "actually care about your friends", is competitive with "always be calculating your personal advantage."

I think there's a missing connection here. At least, it seemed a non sequitur to me on first read: I thought this was positing that scaling up given humans' computational capacity ceteris paribus makes them lie more. Seems strong (maybe for some).

But I think it's instead claiming that if humans in general had been adapted under conditions of greater computational capacity, then the 'actually care about your friends' heuristic might have evolved lesser weight. That seems plausible (though the self-play aspect of natural selection means that this depends in part on how offence/defence scales for lying/detection).

Comment by Oliver Sourbut on On the lethality of biased human reward ratings · 2024-01-02T11:48:47.908Z · LW · GW

And as the saying goes, "humans are the least general intelligence which can manage to take over the world at all" - otherwise we'd have taken over the world earlier.

A classic statement of this is by Bostrom, in Superintelligence.

Far from being the smartest possible biological species, we are probably better thought of as the stupidest possible biological species capable of starting a technological civilization - a niche we filled because we got there first, not because we are in any sense optimally adapted to it.

Comment by Oliver Sourbut on On the future of language models · 2023-12-21T13:32:44.236Z · LW · GW

I’m hazier on the details of how this would play out (and a bit sceptical that it would enable a truly runaway feedback loop), but more sophisticated systems could help to gather the real-world data to make subsequent finetuning efforts more effective.

On the contrary, I think proactive gathering of data is very plausibly the bottleneck, and (smarts) -> (better data gathering) -> (more smarts) is high on my list of candidates for the critical feedback loop.

In a world where the 'big two' (R&D and executive capacity) are characterised by driving beyond the frontier of the well-understood, it's all about data gathering and sample-efficient incorporation of the data.

FWIW I don't think vanilla 'fine tuning' necessarily achieves this, but coupled with retrieval augmented generation and similar scaffolding, incorporation of new data becomes more fluent.

Comment by Oliver Sourbut on On the future of language models · 2023-12-21T12:10:02.536Z · LW · GW

In particular, the 'big two' are both characterised by driving beyond the frontier of the well-understood which means by necessity they're about efficiently deliberately setting up informative/serendipitous scenarios to get novel informative data. When you're by necessity navigating beyond the well-understood, you have to bottom out your plans with heuristic guesses about VOI, and you have to make plans which (at least sometimes) have good VOI. Those have to ground out somewhere, and that's the 'research taste' at the system-1-ish level.

Comment by Oliver Sourbut on On the future of language models · 2023-12-21T11:53:59.085Z · LW · GW

I think it’s most likely that for a while centaurs will significantly outperform fully automated systems

Agree, and a lot of my justification comes from this feeling that 'research taste' is quite latent, somewhat expensive to transfer, and a bottleneck for the big two.

Comment by Oliver Sourbut on On the future of language models · 2023-12-21T11:49:37.977Z · LW · GW

I think there are two really important applications, which have the potential to radically reshape the world:

  • Research
    • The ability to develop and test out new ideas, adding to the body of knowledge we have accumulated
    • Automating this would be a massive deal for the usual reasons about feeding back into growth rates, facilitating something like a singularity
      • In particular the automation of further AI development is likely to be important
    • There are many types of possible research, and automation may look quite different for e.g. empirical medical research vs fundamental physics vs political philosophy
      • The sequence in which we get the ability to automate different types of research could be pretty important for determining what trajectory the world is on
  • Executive capacity
    • The ability to look at the world, form views about how it should be different, and form and enact plans to make it different
    • (People sometimes use “agency” to describe a property in this vicinity)
    • This is the central thing that leads to new things getting done in the world. If this were fully automated we might have large fully autonomous companies building more and more complex things towards effective purposes.
    • This is also the thing which, (if/)when automated, creates concerns about AI takeover risk.


I agree. I tentatively think (and have been arguing in private for a while) that these are 'basically the same thing'. They're both ultimately about

  • forming good predictions on the basis of existing models
  • efficiently choosing 'experiments' to navigate around uncertainties
    • (and thereby improve models!)
  • using resources (inc. knowledge) to acquire more resources

They differ (just as research disciplines differ from other disciplines, and executing in one domain differs from other domains) in the specifics, especially on what existing models are useful and the 'research taste' required to generate experiment ideas and estimate value-of-information. But the high level loop is kinda the same.

Unclear to me what these are bottlenecked by, but I think the latent 'research taste' may be basically it (potentially explains why some orgs are far more effective than others, why talented humans take a while to transfer between domains, why mentorship is so valuable, why the scientific revolution took so long to get started...?)
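A minimal sketch of that shared loop (all names hypothetical; Beta-Bernoulli 'experiments' standing in for research questions, with value-of-information operationalised as expected posterior-variance reduction):

```python
import random

def beta_var(a, b):
    # variance of a Beta(a, b) posterior
    return a * b / ((a + b) ** 2 * (a + b + 1))

def expected_var_after(a, b):
    # expected posterior variance after one more observation
    p = a / (a + b)  # predictive probability of success
    return p * beta_var(a + 1, b) + (1 - p) * beta_var(a, b + 1)

def run(true_rates, budget, seed=0):
    rng = random.Random(seed)
    posteriors = [[1, 1] for _ in true_rates]  # uniform Beta(1, 1) priors
    for _ in range(budget):
        # value-of-information: expected variance reduction per probe
        gains = [beta_var(a, b) - expected_var_after(a, b) for a, b in posteriors]
        i = max(range(len(posteriors)), key=gains.__getitem__)
        # 'run the experiment' and update the model
        if rng.random() < true_rates[i]:
            posteriors[i][0] += 1
        else:
            posteriors[i][1] += 1
    return posteriors
```

Whether the loop is 'research' or 'execution' is then mostly a question of what the probes cost and what models feed the gain estimates - the control flow is the same.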

Comment by Oliver Sourbut on How Would an Utopia-Maximizer Look Like? · 2023-12-21T11:30:15.685Z · LW · GW

I swiftly edited that to read

we have not found it written in the universe

but your reply obviously beat me to it! I agree, there is plausibly some 'actual valence magnitude' which we 'should' normatively account for in aggregations.

In behavioural practice, it comes down to what cooperative/normative infrastructure is giving rise to the cooperative gains which push toward the Pareto frontier. e.g.

  • explicit instructions/norms (fair or otherwise)
  • 'exchange rates' between goods or directly on utilities
  • marginal production returns on given resources
  • starting state/allocation in dynamic economy-like scenarios (with trades)
  • differential bargaining power/leverage

In discussion I have sometimes used the 'ice cream/stabbing game' as an example

  • either you get ice cream and I get stabbed
  • or neither of those things
  • neither of us is concerned with the other's preferences

It's basically a really extreme version of your chocolate and vanilla case. But they're preference-isomorphic!
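To make the isomorphism concrete, a toy sketch (made-up utility numbers): vNM utilities are only defined up to positive affine transformation, so the two games can share one preference structure despite wildly different felt stakes.

```python
# Outcome A: 'you get your preferred thing (and I suffer mine)'; outcome B: neither.
# Chocolate/vanilla version (mild stakes):
chocolate_vanilla = {"you": {"A": 1.0, "B": 0.0},
                     "me":  {"A": 0.0, "B": 1.0}}

# Ice-cream/stabbing version, reached by positive affine transforms:
# you: u -> 100*u ; me: u -> 1000*u - 1000
ice_cream_stabbing = {"you": {"A": 100.0, "B": 0.0},
                      "me":  {"A": -1000.0, "B": 0.0}}

def ordering(game):
    # each player's ranking of outcomes, worst to best
    return {player: sorted(u, key=u.get) for player, u in game.items()}
```

Same orderings, so any purely preference-based aggregation treats the two games identically - which is exactly the worry about where welfare weightings come from.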

Comment by Oliver Sourbut on How Would an Utopia-Maximizer Look Like? · 2023-12-21T11:06:30.245Z · LW · GW

I think this post is mostly about how to do the reflection, consistentising, and so on.

But at the risk of oversimplifying, let's pretend for a moment we just have some utility functions.

Then you can for sure aggregate them into a mega utility function (at least in principle). This is very underspecified!! Predominantly that's a consequence of the question of how to weight individual utility functions in the aggregation. (Holden has a nice discussion of Harsanyi's aggregation theorem which goes into some more detail, but yes, we have not found it written in the universe how to weight the aggregation.)

There's also an interesting relationship (almost 1-1 aside from edge-cases) between welfare optima (that is, optima of some choice of weighted aggregation of utilities as above) and Pareto optima[1] (that is, outcomes unimprovable for anyone without worsening for someone). I think this, together with Harsanyi, tells us that some sort of Pareto-ish target would be the result of 'the most coherent' possible extrapolation of humanity's goals. But this still leaves wide open the coefficients/weighting of the aggregation, which in the Pareto formulation corresponds to the position on the Pareto frontier. BTW Drexler has an interesting discussion of cooperation and conflict on the Pareto frontier.
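A toy illustration (hypothetical numbers) of the missing piece: each choice of welfare weights picks out an outcome, and sliding the weights slides you along the frontier. It also shows one of the edge-cases: outcome B below is Pareto-optimal but, sitting inside the convex hull of A and C, is supported by no strictly positive weighting.

```python
outcomes = ["A", "B", "C"]
# each person's utility for each outcome (made-up numbers)
utils = {"alice": {"A": 3.0, "B": 1.0, "C": 0.0},
         "bob":   {"A": 0.0, "B": 1.0, "C": 3.0}}

def welfare_optimum(weights):
    # Harsanyi-style aggregation: maximise the weighted sum of utilities
    def welfare(o):
        return sum(w * utils[person][o] for person, w in weights.items())
    return max(outcomes, key=welfare)
```

No setting of the weights ever selects B here, even though neither A nor C Pareto-dominates it - the kind of wrinkle the ABB lineage explores.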

I have a paper+blogpost hopefully coming out soon which goes into some of this detail and discusses where that missing piece (the welfare weightings or 'calibration') come from (descriptively, mainly; we're not very prescriptive unfortunately).

  1. This connection goes back as far as I know to the now eponymous ABB theorem of Arrow, Barankin and Blackwell in 1953, and there's a small lineage of followup research exploring the connection ↩︎

Comment by Oliver Sourbut on How Would an Utopia-Maximizer Look Like? · 2023-12-21T10:50:46.463Z · LW · GW

we often wouldn't just tweak  such that it still fires, but only when stealing something wouldn't be against the society's interests. No: we just flat-out delete .

Heh maybe. I also enjoy stealing things in RPGs :P

Comment by Oliver Sourbut on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2023-12-18T08:19:28.791Z · LW · GW

It looks like we basically agree on all that, but it pays to be clear (especially because plenty of people seem to disagree).

'Transcending' doesn't imply those nice things though, and those nice things don't imply transcending. Immortality is similarly mostly orthogonal.

Comment by Oliver Sourbut on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2023-12-17T11:06:04.335Z · LW · GW

Great job Thane! A few months ago I wrote about 'un-unpluggability' which is kinda like a drier version of this.

In brief

  • Rapidity and imperceptibility are two sides of 'didn't see it coming (in time)'
  • Robustness is 'the act itself of unplugging it is a challenge'
  • Dependence is 'notwithstanding harms, we (some or all of us) benefit from its continued operation'
  • Defence is 'the system may react (or proact) against us if we try to unplug it'
  • Expansionism includes replication, propagation, and growth, and gets a special mention, as it is a very common and natural means to achieve all of the above

I also think the 'who is "we"?' question is really important.

One angle that isn't very fleshed out is the counterquestion, 'who is "we" and how do we agree to unplug something?' - a little on this under Dependence, though much more could certainly be said.

I think more should be said about these factors. I tentatively wrote,

there is a clear incentive for designers and developers to imbue their systems with... dependence, at least while developers are incentivised to compete over market share in deployments.

and even more tentatively,

In light of recent developments in AI tech, I actually expect the most immediate unpluggability impacts to come from collateral, and for anti-unplug pressure to come perhaps as much from emotional dependence and misplaced concern[1] for the welfare of AI systems as from economic dependence - for this reason I believe there are large risks to allowing AI systems (dangerous or otherwise) to be perceived as pets, friends, or partners, despite the economic incentives.

  1. It is my best guess for various reasons that concern for the welfare of contemporary and near-future AI systems would be misplaced, certainly regarding unplugging per se, but I caveat that nobody knows ↩︎

Comment by Oliver Sourbut on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2023-12-17T10:58:12.682Z · LW · GW

This clearly isn't the worst possible future... if our AI mind children inherit only our culture and leave us behind it feels more like a consolation prize

Leaving aside s-risks, this could very easily be the emptiest possible future. Like, even if they 'inherit our culture' it could be a "Disneyland with no children" (I happen to think this is more likely than not but with huge uncertainty).


We should aim much higher: for defeating death, across all of time, for resurrection and transcendence.

this anti-deathist vibe has always struck me as very impoverished and somewhat uninspiring. The point should be to live, awesomely! which includes alleviating suffering and disease, and perhaps death. But it also ought to include a lot more positive creation and interaction and contemplation and excitement etc.!

Comment by Oliver Sourbut on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2023-12-17T10:44:26.272Z · LW · GW

But this is just as true of fully aligned agents! In fact the optimal plans of aligned and unaligned agents will probably converge for a while - they will take the same/similar initial steps (this is just a straightforward result of instrumental convergence to empowerment)

This is a minor fallacy - if you're aligned, powerseeking can be suboptimal when it causes friction/conflict. Deception bites, obviously, making the difference smaller.

Comment by Oliver Sourbut on AGI Ruin: A List of Lethalities · 2023-12-16T17:19:54.272Z · LW · GW

most organizations don't have plans, because I haven't taken the time to personally yell at them.  'Maybe we should have a plan' is deeper alignment mindset than they possess without me standing constantly on their shoulder as their personal angel pleading them into... continued noncompliance, in fact.  Relatively few are aware even that they should, to look better, produce a pretend plan that can fool EAs too 'modest' to trust their own judgments about seemingly gaping holes in what serious-looking people apparently believe.

This, at least, appears to have changed in recent months. Hooray!

Comment by Oliver Sourbut on Search-in-Territory vs Search-in-Map · 2023-12-15T17:23:23.220Z · LW · GW

To perform the search-in-map with only a balance scale, we’d either need to compare all pairs of weights ahead of time (which would mean O(n²) effort), or we’d need to run out and compare physical weights in the middle of the search (at which point we’re effectively back to search-in-territory).

nit: (I think you maybe meant this but glitched while writing) in this particular example we could do better by indexing (in map) the rocks by weight order (O(n log n) map-building comparisons). Then once we have the reference rock we can effectively blend our map with in-territory search for only O(log n) in-territory comparisons. It's more costly overall (by a log factor) to build this map, but if we have map-building budget in advance it yields much faster solving (log instead of linear). Or if the reference rock was one of the original rocks (we just didn't know which one), as long as our index has constant-time access we can do search in-map once the appropriate reference rock is pointed out.

I think this just corroborates your claim

The map-making process can use information before the search process “knows what to do with it”.

I think this raises an interesting further question, especially when we don't know what the task will be ahead of time: how many (and what? and at what resolution?) indices should we ideally spend 'prep' time (and memory) on? (This was a professional concern of mine for several years as a software engineer haha)
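A minimal sketch of the sorted-index blend described above (standard sort plus binary search): roughly n log n comparisons spent at map-building time buy roughly log n comparisons per later query.

```python
import bisect

def build_index(rock_weights):
    # 'map-building': sort once, ~n log n pairwise comparisons
    return sorted(rock_weights)

def find(index, reference_weight):
    # 'blended' search: ~log n comparisons against the reference
    i = bisect.bisect_left(index, reference_weight)
    return i if i < len(index) and index[i] == reference_weight else None
```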

Echoes of your gooder regulator theorem

Comment by Oliver Sourbut on Mapping the semantic void: Strange goings-on in GPT embedding spaces · 2023-12-15T12:10:01.926Z · LW · GW

This looks like some kind of (rather bizarre) emergent/primitive ontology, radially stratified from the token embedding centroid.

A tentative thought on this... if we put our 'superposition' hats on.

We're thinking of directions as mapping concepts or abstractions or whatnot. But there are too few strictly-orthogonal directions, so we need to cram things in somehow. It's fashionable (IIUC) to imagine this happening radially but some kind of space partitioning (accounting for magnitudes as well) seems plausible to me.

Maybe closer to the centroid, there's 'less room' for complicated taxonomies, so there are just some kinda 'primitive' abstractions which don't have much refinement (perhaps at further distances there are taxonomic refinements of 'metal' and 'sharp'). Then, the nearest conceptual-neighbour of small-magnitude random samples might tend to be one of these relatively 'primitive' concepts?

This might go some way to explaining why at close-to-centroid you're getting these clustered 'primitive' concepts.

The 'space partitioning' vs 'direction-based splitting' could also explain the large-magnitude clusters (though it's less clear why they'd be 'primitive'). Clearly there's some pressure (explicit regularisation or other) for most embeddings to sit in a particular shell. Taking that as given, there's then little training pressure to finely partition the space 'far outside' that shell. So it maybe just happens to map to a relatively small number of concepts whose space includes the more outward reaches of the shell.

How to validate this sort of hypothesis? I'm not sure. It might be interesting to look for centroids, nearest neighbours, or something, of the apparent conceptual clusters that come out here. Or you could pay particular attention to the tokens with smallest and largest distance-to-centroid (there were long tails there).
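One cheap probe along those lines (a sketch only - a random stand-in embedding matrix here; with real GPT token embeddings the hypothesis predicts much heavier concentration near the centroid):

```python
import numpy as np

def nn_concentration(E, radius, n_samples=500, seed=0):
    # sample points at a fixed distance from the embedding centroid and ask:
    # what fraction land nearest to the single most popular token?
    rng = np.random.default_rng(seed)
    centroid = E.mean(axis=0)
    d = rng.standard_normal((n_samples, E.shape[1]))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    probes = centroid + radius * d
    # nearest token embedding for each probe
    dists = np.linalg.norm(probes[:, None, :] - E[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    return np.bincount(nearest).max() / n_samples
```

Comparing this concentration across radii (and eyeballing which tokens dominate at each shell) would be one way to test the 'fewer, more primitive concepts near the centroid' picture.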

Comment by Oliver Sourbut on Mapping the semantic void: Strange goings-on in GPT embedding spaces · 2023-12-15T10:54:41.223Z · LW · GW

Also, in 4096-d the intersection of hyperspherical shells won't have a toroidal topology, but rather something considerably more exotic.


The intersection of two hyperspheres is another hypersphere (1-d lower).

So I guess the intersection of two thick/fuzzy hyperspheres is a thick/fuzzy hypersphere (1-d lower). Note that your 'torus' is also described as a thick/fuzzy circle (aka 1-sphere), which fits this pattern.
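A quick numeric check of the geometry (standard sphere-sphere intersection): the intersection of two (n-1)-spheres lies on a hyperplane, and is itself an (n-2)-sphere.

```python
import numpy as np

def intersection_sphere(c1, r1, c2, r2):
    # centre and radius of the (n-2)-sphere where |x-c1|=r1 meets |x-c2|=r2
    axis = c2 - c1
    d = np.linalg.norm(axis)
    a = (d ** 2 + r1 ** 2 - r2 ** 2) / (2 * d)  # distance from c1 to the cut plane
    centre = c1 + (a / d) * axis
    radius = np.sqrt(r1 ** 2 - a ** 2)
    return centre, radius
```

Any unit vector orthogonal to the axis, scaled by that radius and offset by the centre, lands on both spheres - so the 'fuzzy' version is a thickened (n-2)-sphere, not a torus.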

Comment by Oliver Sourbut on Outer vs inner misalignment: three framings · 2023-12-14T16:21:25.126Z · LW · GW

I notice that the word 'corrigibility' doesn't appear once here! Framing 3 (online misalignment) seems to be in the close vicinity:

policies’ goals change easily in response to additional reward feedback ... [vs] policies’ goals are very robust to additional reward feedback

I think the key distinction is that in your description the (goal-affecting) online learning process is sort of 'happening to' the AI, while corrigibility is accounting for the AI instance(s)' response(s) to the very presence and action of such a goal-affecting process.

The upshot is pretty similar though: if the goal-affecting online updates are too slow, or the AI too incorrigible to apply much/any updating to, we get an alignment failure, especially if we're in a fast/high-stakes setting.

Incidentally, I think the 'high stakes' setting corresponds to rapidity in my tentative un-unpluggability taxonomy

Comment by Oliver Sourbut on Conditions for mathematical equivalence of Stochastic Gradient Descent and Natural Selection · 2023-12-13T16:49:46.886Z · LW · GW

Haha mind blown. Thanks for the reference! Different kind of momentum, but still...

Comment by Oliver Sourbut on Conditions for mathematical equivalence of Stochastic Gradient Descent and Natural Selection · 2023-12-13T13:41:32.771Z · LW · GW

Origin and summary

This post arose from a feeling in a few conversations that I wasn't being crisp enough or epistemically virtuous enough when discussing the relationship between gradient-based ML methods and natural selection/mutate-and-select methods. Some people would respond like, 'yep, seems good', while others were far less willing to entertain analogies there. Clearly there was some logical uncertainty and room for learning, so I decided to 'math it out' and ended up clarifying a few details about the relationship, while leaving a lot unresolved. Evidently for some readers this is still not crisp or epistemically virtuous enough!

I still endorse this post as the neatest explanation I'm aware of relating gradient descent to natural selection under certain approximations. I take the proof of the limiting approximation to be basically uncontroversial[1], but I think my discussion of simplifying assumptions (and how to move past them) is actually the most valuable part of the post.

Overall I introduced three models of natural selection

  1. an annealing-style degenerate natural selection, which is most obviously equivalent in the limit to a gradient step
  2. a one-mutant-population-at-a-time model (with fixation or extinction before another mutation arises)
  3. (in the discussion) a multi-mutations-in-flight model with horizontal transfer (which is most similar to real natural selection)

All three are (in the limit of small mutations) performing gradient steps. The third one took a bit more creativity to invent, and is probably where I derived the most insight from this work.

All three are still far from 'actual real biological natural selection'!
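A toy simulation in the spirit of model 2 (hypothetical fitness function and fixation rule, chosen only to show the direction of travel): one mutant population at a time, fixing with probability increasing in its selection coefficient, walks uphill in expectation like a noisy gradient step.

```python
import math
import random

def fitness(x):
    # smooth toy fitness landscape with its optimum at the origin
    return -sum(xi ** 2 for xi in x)

def select_step(x, scale, rng):
    # propose one small mutation; fix it with probability increasing
    # in its selection coefficient (a hypothetical fixation rule)
    mutant = [xi + rng.gauss(0, scale) for xi in x]
    s = fitness(mutant) - fitness(x)
    p_fix = 1 / (1 + math.exp(-s / scale))
    return mutant if rng.random() < p_fix else x

def evolve(x0, steps=2000, scale=0.05, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        x = select_step(x, scale, rng)
    return x
```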

What important features of natural selection are missing?

  • Speciation!
  • Variability of the fitness landscape
  • Any within-lifetime things!
    • Within-lifetime learning
    • Sexual selection and offspring preference
    • Cultural accumulation
    • Epigenetics
  • Recombination hacks
    • Anything with intragenomic conflict
    • Transposons, segregation distortion, etc.

So what?

Another point I didn't touch on in this post itself is what to make of any of this.

Understanding ML

For predicting the nature of ML artefacts, I don't think speciation is relevant, so that's a point in favour of these models. I do think population-dependent dynamics (effectively self-play) are potentially very relevant, depending on the setting, which is why in the post I said,

As such it may be appropriate to think of real natural selection as performing something locally equivalent to SGD but globally more like self-play PBT.

One main conclusion people want to point to when making this kind of analogy is that selecting for thing-X-achievers doesn't necessarily produce thing-X-wanters. i.e. goal-misgeneralisation aka mesa-optimisation aka optimisation-daemons. I guess tightening up the maths sort of shores up this kind of conclusion?[2]

Thomas Kwa has a nice brief list of other retrodictions of the analogy between gradient-based ML and natural selection.

How much can we pre-dict? When attempting to draw more specific conclusions (e.g. about particular inductive biases or generalisation), I think in practice analogies to natural selection are going to be screened off by specific evidence quite easily. But it's not clear that we can easily get that more specific evidence in advance, and for more generally-applicable but less-specific claims, I think natural selection gives us one good prior to reason from.

If we're trying to draw conclusions about intelligent systems, we should make sure to note that a lot of impressive intelligence-bearing artefacts in nature (brains etc.) are grown and developed within-lifetime! This makes the object of natural selection (genomes, mostly) something like hyperparameters or reward models or learning schedules or curricula rather than like fully-fledged cognitive algorithms.

In a recent exchange, 1a3orn shared some interesting resources which make similar connections between brain-like learning systems and gradient-based systems. More commonalities!

Understanding natural selection

Sometimes people want to understand the extent to which natural selection 'is optimising for' something (and what the exact moving pieces are). Playing with the maths here and specifying some semantics via the models has helped sharpen my own thinking on this. For example, see my discussion of 'fitness' here

The original pretheoretic term 'fitness' meant 'being fitted/suitable/capable (relative to a context)', and this is what Darwin and co were originally pointing to. (Remember they didn't have genes or Mendel until decades later!)

The modern technical usage of 'fitness' very often operationalises this, for organisms, to be something like number of offspring, and for alleles/traits to be something like change in prevalence (perhaps averaged and/or normalised relative to some reference).

So natural selection is the ex post tautology 'that which propagates in fact propagates'.

If we allow for ex ante uncertainty, we can talk about probabilities of selection/fixation and expected time to equilibrium and such. Here, 'fitness' is some latent property, understood as a distribution over outcomes.

If we look at longer timescales, 'fitness' is heavily bimodal: in many cases a particular allele/trait either fixes or goes extinct[3]. If we squint, we can think of this unknown future outcome as the hidden ground truth of latent fitness, about which some bits are revealed over time and over generations.

A 'single step' of natural selection tries out some variations and promotes the ones which in fact work (based on a realisation of the 'ex ante' uncertain fitness). This indeed follows the latent fitness gradient in expectation.

In this ex ante framing it becomes much more reasonable to treat natural selection as an optimisation/control process similar to gradient descent. It's shooting for maximising the hidden ground truth of latent fitness over many iterations, but it's doing so based on a similar foresight-free local heuristic like gradient descent, applied many times.

How can we reconcile this claim with the fact that the operationalised 'relative fitness' often walks approximately randomly, and rarely moves sustainedly upward[4]? Well, it's precisely because it's relative - relative to a changing series of fitness landscapes over time. Those landscapes change in part as a consequence of abiotic processes, partly as a consequence of other species' changes, and often as a consequence of the very trait changes which natural selection is itself imposing within a population/species!

So, I think, we can say with a straight face that natural selection is optimising (weakly) for increased fitness, even while a changing fitness landscape means that almost by definition relative fitness hovers around a constant for most extant lineages. I don't think it's optimising on species, but on lineages (which sometimes correspond).[5]

In further (unpublished) mathsy scribbles around the time of this post, I also played with rederiving variations on the Price equation, and spent some time thinking about probabilities of fixation and time-to-fixation (corroborating some of Eliezer's old claims). These were good exercises, but not obviously worth the time to write up.
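Those fixation quantities are easy to corroborate numerically - here a Wright-Fisher sketch of the classic Haldane-style result that a mutant with small advantage s fixes with probability only about 2s, so most beneficial mutations still die out:

```python
import random

def fixes(N, s, rng):
    # Wright-Fisher: N individuals, one initial mutant with fitness 1+s
    k = 1
    while 0 < k < N:
        p = k * (1 + s) / (k * (1 + s) + (N - k))   # mutant's expected share
        k = sum(rng.random() < p for _ in range(N))  # resample the population
    return k == N

def fixation_rate(N=200, s=0.05, trials=2000, seed=0):
    rng = random.Random(seed)
    return sum(fixes(N, s, rng) for _ in range(trials)) / trials
```

With s = 0.05 this should come out near 2s = 0.1, despite the mutant being strictly fitter in every single generation.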

I was also working with Holly Elmore on potential insights from some more specific mechanisms in natural selection regarding intragenomic conflict. I learned a lot (in particular about how 'computer sciencey' a lot of biological machinery is!) but didn't make any generalisable insights. I do expect there might be something in this area though.

Understanding control

The connections here were part of a broader goal of mine to understand 'deliberation' and 'control'. I've had a hard time making real progress on this since (in part due to time spent on other things), but I do feel my understanding of these has sharpened usefully. Spending some time closely pondering the connection between different optimisation procedures definitely provided some insights there.

I recently came across the 'complex systems' kind of view on adaptation and control and wonder if I might be converging in that direction.

The biggest puzzle-piece I want to see cracked regards the temporal extent of predictors/deliberators. Greater competence seems tightly linked to the ability to do 'more lookahead'. I think this is one of the keys which gives rise to 'deliberate exploration'/'experimentation', which is one of my top candidate intelligence explosion feedback loops[6]. My incomplete discussion of deliberation was heading in that direction. Some more recent gestures include some disorganised shortform discussion and my planner simulator conjecture:

something like, 'every (simplest) simulator of a planner contains (something homomorphic to) a planner'.

How far are we stretching to call this 'equivalence'?

The proofs demonstrate that all three models of natural selection perform a noisy realisation of a gradient step (in the limit of small mutations).

As I called out in the post, I didn't pay much attention to step size, nor to the particular stochastic distribution of updates. To my mind, this is enough to make the three models of natural selection equivalent to something well within the class of 'stochastic gradient methods'[7]. 'SGD' is often used to refer to this broader class of methods, but it might be a bit misleading to use the term 'SGD' without qualification, since the term is also often used for a more specific stochastic gradient implementation.

nostalgebraist calls me out on this

the noise in your model isn't distributed like SGD noise, and unlike SGD the step size depends on the gradient norm.

which is the most attentive criticism I've had of this post.

Aren't we stretching things quite far if we're including momentum methods and related, with history/memory-sensitive updates? Note that natural selection can implement a kind of momentum too (e.g. via within-lifetime behavioural stuff like migration, offspring preference, and sexual selection)! Neither my models nor the 'SGD' they're equivalent to exhibit this.

  1. nostalgebraist's dissatisfaction notwithstanding; these are good criticisms but appear to miss a lot of the caveats already present in the original post. ↩︎

  2. I never thought this conclusion needed shoring up in the first place, and in the cases where it's not accepted, it's not clear to me whether mathing it out like this is really going to help. ↩︎

  3. In cases where the relative fitness of a trait corresponds with its prevalence, there can be a dynamic equilibrium at neither of these modes. Consider evolutionary stable strategies. But the vast majority of mutations ever have hit the 'extinct' attractor, and a lot of extant material is of the form 'ancestor of a large proportion of living organisms'. ↩︎

  4. Though note we do see (briefly?) sustained upward fitness in times of abundance, as notably in human population and in adaptive radiation in response to new resources, habitats, and niches becoming available. ↩︎

  5. Now, if the earlier instances of now-extinct lineages were somehow evolutionarily 'frozen' and periodically revived back into existence, we really would see that natural selection pushes for increased fitness. But because those lineages aren't (by definition) around any more, the fitness landscape's changes over time are under no obligation to be transitive, so in fact a faceoff between a chicken and a velociraptor might tell a different story. ↩︎

  6. I think exploration heuristics are found throughout nature, some 'intrinsic curiosity' reward shaping gets further (e.g. human and animal play), but 'deliberate exploration' (planning to arrange complicated scenarios with high anticipated information value) really sets humans (and perhaps a few other animals) apart. Then with cultural accumulation and especially the scientific revolution, we've collectively got really good at this deliberate exploration, and exploded even faster. ↩︎

  7. e.g. vanilla SGD, momentum, RMSProp, Adagrad, Adam, ... ↩︎

Comment by Oliver Sourbut on Why Yudkowsky is wrong about "covalently bonded equivalents of biology" · 2023-12-06T22:24:52.877Z · LW · GW

Gotcha, that might be worth taking care to nuance, in that case. e.g. the linked twitter (at least) was explicitly about killing people[1]. But I can see why you'd want to avoid responses like 'well, as long as we keep an eye out for biohazards we're fine then'. And I can also imagine you might want to preserve consistency of examples between contexts. (Risks being misconstrued as overly-attached to a specific scenario, though?)

I'm nervous that this causes people to start thinking in terms of Hollywood movie plots... rather than hearing, "And this is a lower bound..."

Yeah... If I'm understanding what you mean, that's why I said,

It's always worth emphasising (and you do), that any specific scenario is overly conjunctive and just one option among many.

And I further think actually having a few scenarios up the sleeve is an antidote to the Hollywood/overly-specific failure mode. (Unfortunately 'covalently bonded bacteria' and nanomachines also make some people think in terms of Hollywood plots.) Infrastructure can be preserved in other ways, especially as a bootstrap. I think it might be worth giving some thought to other scenarios as intuition pumps.

e.g. AI manipulates humans into building quasi-self-sustaining power supplies and datacentres (or just waits for us to decide to do that ourselves), then launches kilopandemic followed by next-stage infra construction. Or, AI invests in robotics generality and proliferation (or just waits for us to decide to do that ourselves), then uses cyberattacks to appropriate actuators to eliminate humans and bootstrap self-sustenance. Or, AI exfiltrates itself and makes oodles of horcruxes backups, launches green goo with genetic clock for some kind of reboot after humans are gone (this one is definitely less solid). Or, AI selects and manipulates enough people willing to take a Faustian bargain as its intermediate workforce, equips them (with strategy, materials tech, weaponry, ...) to wipe out everyone else, then bootstraps next-stage infra (perhaps with human assistants!) and finally picks off the remaining humans if they pose any threat.

Maybe these sound entirely barmy to you, but I assume at least some things in their vicinity don't. And some palette/menu of options might be less objectionable to interlocutors while still providing some lower bounds on expectations.

  1. admittedly Twitter is where nuance goes to die, some heroic efforts notwithstanding ↩︎