Posts

deep's Shortform 2025-03-27T23:56:30.403Z

Comments

Comment by deep on To Understand History, Keep Former Population Distributions In Mind · 2025-04-23T19:20:32.823Z · LW · GW

Neat post! The Europe-Africa ratio is especially striking, and will change my mental model of colonization a fair bit. 

Also of interest for thinking about colonization / imperialism is the size of Japan's population in that last map compared to the Southeast Asian territories it conquered during WWII. 

Indeed, Japan in general seems to have grown in population more slowly than just about any other major country I've looked at -- a measly ~80% increase, from 70m to 125m, over 125 years. ("Russian Empire" then was bigger than Russia today, but Wikipedia has Russia proper at 70m then vs 144m today.)

The implied super-rapid relative population growth rates in Africa & parts of Asia in the 1900s also help me understand why people got freaked out about global overpopulation in the late 1900s, and why that population growth needed innovations like the Green Revolution to sustain it.

Comment by deep on Will compute bottlenecks prevent a software intelligence explosion? · 2025-04-23T19:00:38.007Z · LW · GW

Training run size has grown much faster than the world’s total supply of AI compute. If these near-frontier experiments were truly a bottleneck on progress, AI algorithmic progress would have slowed down over the past 10 years.

I think this history is consistent with near-frontier experiments being important, and labs continuing to do a large number of such experiments as part of the process of increasing lab spending on training compute. 

I.e.: suppose OAI now spends $100m/model instead of $1m/model. There's no reason they couldn't still be spending, say, 50% of their training compute on running 500 experiments at 0.1% of frontier scale. 

Caveat: This is at the firm level; you could argue that fewer near-frontier experiments are being done in total across the AI ecosystem, and certainly there's less information flow between organizations conducting these experiments.
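The arithmetic behind this is trivial but worth making explicit; a quick sketch with the comment's illustrative numbers (not real lab figures):

```python
# Illustrative numbers from the comment above (not real lab figures).
frontier_cost = 100e6        # $ per frontier training run
experiment_scale = 0.001     # each experiment runs at 0.1% of frontier scale
n_experiments = 500

experiment_budget = n_experiments * experiment_scale * frontier_cost
share = experiment_budget / frontier_cost
print(f"${experiment_budget / 1e6:.0f}m on experiments "
      f"= {share:.0%} of one frontier run")  # $50m = 50%
```

So even a 100x jump in per-model spend is compatible with hundreds of near-frontier-architecture experiments at constant budget share.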

Comment by deep on deep's Shortform · 2025-03-28T15:38:56.232Z · LW · GW

Thanks for your thoughts!

I was thinking of Kanye as well -- hence being more interested in the general pattern. I really wasn't intending to subtweet one person in particular -- I have some sense of the particular dynamics there, though your comment is illuminating. :) 

Comment by deep on deep's Shortform · 2025-03-27T23:56:30.403Z · LW · GW

What's up with incredibly successful geniuses having embarrassing & confusing public meltdowns? What's up with them getting into Nazism in particular?
 

Components of my model:

  • Selecting for the tails of success selects for weird personalities; moderate success can come in lots of ways, but massive success in part requires just a massive amount of drive and self-confidence. Bipolar people have this. (But more than other personality types?)
  • Endless energy & willingness to engage with stuff is an amazing trait that can go wrong if you have an endless pit of stupid internet stuff grabbing for your attention.
  • If you're selected for overconfidence and end up successful, you assume you're amazing at everything. (And you are in fact great at some stuff, and have enough taste to know it, so it's hard to change your mind.)
  • Selecting for the tails of success selects for contrarianism? Seems plausible -- one path to great success, at least, is to make a huge contrarian bet that pays off.
  • Nothing's more contrarian than being a Nazi, especially if you're trying to flip the bird to the Cathedral.

Comment by deep on On the Rationality of Deterring ASI · 2025-03-27T16:05:24.573Z · LW · GW

I think it's fine that Eliezer wrote it, though. Not maximally strategic by any means, but the man's done a lot and he's allowed his hail mary outreach plans.

I think at the time I and others were worried this would look bad for "safety as a whole", but at this point concerns about AI risk are common and varied enough, and people with those concerns often have strong local reputations with different groups. So this is no longer as big of an issue, which I think is really healthy for AI risk folks -- it means we can have Pause AI and Eliezer and Hendrycks and whoever all doing their own things, able to say "no, I'm not like those folks, here's my POV", and not feeling like they should get a veto over each other. And in retrospect I think we should have anticipated and embraced this vision earlier on.

tbh, this is part of what I think went wrong with EA -- a shared sense that community reputation was a resource everyone benefitted from and everyone wanted to protect and polish, that people should get vetoes over what each other do and say. I think it's healthy that there's much less of a centralized and burnished "EA brand" these days, and much more of a bunch of people following their visions of the good. Though there's still the problem of Open Phil as a central node in the network, through which reputation effects flow.

Comment by deep on On the Rationality of Deterring ASI · 2025-03-27T15:51:57.065Z · LW · GW

You're missing some ways Eliezer could have predictably done better with the Time article, if he were framing it for national security folks (rather than as an attempt at brutal honesty, or perhaps most accurately a cri de coeur).

@davekasten - Eliezer wasn't arguing for bombing as retaliation for a cyberattack. Rather, he proposed it as a preemptive measure against noncompliant AI development:

If intelligence says that a country outside the agreement is building a GPU cluster, be less scared of a shooting conflict between nations than of the moratorium being violated; be willing to destroy a rogue datacenter by airstrike.

If you zoom out several layers of abstraction, that's not too different from the escalation ladder concept described in this paper. A crucial difference, though, is that Eliezer doesn't mention escalation ladders at all -- or other concepts that would help neutral readers be like "OK, this guy gets how big a lift all this would be, and he has some ideas for building blocks to enact it". Examples include "how do you get an international agreement on this stuff", "how do you track all the chips", "how do you prevent people building super powerful AI despite the compute threshold lowering", "what about all the benefits of AI that we'd be passing up" (besides a brief mention that narrow-AI-for-bio might be worth it), "how confident can we be that we'd know if someone was contravening the deal".

Second, there was a huge inferential gap to this idea of AGI as a key national security threat -- there's still a large one today, despite rhetoric around AGI. And Eliezer doesn't do enough meeting in the middle here. 

He gives the high-level argument that to him is sufficient, but is/was not convincing to most people -- that AI by some metrics is growing fast, in principle can be superhuman, etc. Unfortunately most people in government don't have the combination of capacity, inclination, and time to assess these kinds of first-principles arguments for themselves, and they really need concreteness in the form of evidence or expert opinion. 

Also, frankly, I just think Eliezer is wrong to be as confident in his view of "doom by default" as he is, and the strategic picture looks very very different if you place say 20% or even 50% probability on this. 

If I had Eliezer's views I'd probably focus on evals and red-teaming type research to provide fire alarms, convince technical folks p(doom) was really high, and then use that technical consensus or quasi-consensus to shift policy. This isn't totally distinct from what Eliezer did in the past with more abstract arguments, and it kinda worked (there are a lot of people with >10% p(doom) in policy world, there was that 2023 moment where everyone was talking about it). I think in worlds where Eliezer's right, but timelines are say more like 2030 than 2027, there's real scope for people to be convinced of high p(doom) as AI advances, and that could motivate some real policy change. 

Comment by deep on How to Corner Liars: A Miasma-Clearing Protocol · 2025-02-28T18:15:47.363Z · LW · GW

I think a realistic example would be useful! I suspect a lot of the nuance (nuance that might feel obvious to you) is in how to apply this over a long conversation with lots of data points, amendments on both sides, etc.

Comment by deep on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-27T18:19:12.531Z · LW · GW

Nope, you're right, I was reading quickly & didn't parse that :) 

Comment by deep on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-27T18:02:11.159Z · LW · GW

Yeah, good point on this being about HHH. I would note that some of the stuff like "kill / enslave all humans" feels much less aligned to human values (outside of a small but vocal e/acc crowd, perhaps), but it does pattern-match well as "the opposite of HHH-style harmlessness".

This technique definitely won't work on base models that are not trained on data after 2020.

The other two predictions make sense, but I'm not sure I understand this one. Are you thinking "not trained on data after 2020 AND not trained to be HHH"? If so, that seems plausible to me. 

I could imagine a model with some assistantship training that isn't quite the same as HHH would still learn an abstraction similar to HHH-style harmlessness. But plausibly that would encode different things, e.g. it wouldn't necessarily couple "scheming to kill humans" and "conservative gender ideology". Likewise, "harmlessness" seems like a somewhat natural abstraction even in pre-training space, though there might be different subcomponents like "agreeableness", "risk-avoidance", and adherence to different cultural norms.

Comment by deep on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-25T21:30:42.907Z · LW · GW

Thanks, that's cool to hear about! 

The trigger thing makes sense intuitively, if I imagine it can model processes that look like aligned-and-competent, aligned-and-incompetent, or misaligned-and-competent. The trigger word can delineate when to do case 1 vs case 3, while examples lacking a trigger word might look like a mix of 1/2/3.

Comment by deep on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-25T21:11:08.031Z · LW · GW

Fascinating paper, thank you for this work! 

I'm confused about how to parse this. One response is "great, maybe 'alignment' -- or specifically being a trustworthy assistant -- is a coherent direction in activation space." 

Another is "shoot, maybe misalignment is convergent, it only takes a little bit of work to knock models into the misaligned basin, and it's hard to get them back." Waluigi effect type thinking.

Relevant parameters:

  • How much effort (e.g. how many samples) does it take to knock models out of the HHH space?
    • They made 6000 training updates, varying # of unique data points. 500 data points is insufficient.
    • I don't have an intuition for whether this is large for a fine-tuning update. Certainly it's small compared to the overall GPT-4o training set.  
  • How far do they get knocked out? How much does this generalize?
    • Only an average of 20% of responses are misaligned.
    • The effect varies a lot by question type.
  • Compare to 60% misaligned responses for prompts containing Python code -- this suggests it only partly generalizes.
  • How reversible is the effect? e.g. can we fine-tune back in the trustworthiness direction?
    • This isn't explored in the paper -- would be interesting to see.
  • How big is the effect from a mixture of malign and benign examples? Especially if the examples are overall plausibly generated from a benign process (e.g. a beginning coder)?
    • I would guess that typical training includes some small share of flawed code examples mixed in with many less malicious-looking examples. That would suggest some robustness to a small share of malicious examples. But maybe you actually need to clean your data set or finetune a decent amount to ensure reliability and HHH, given the base rate of malicious content on the internet? Would be interesting to know more. 

Comment by deep on ryan_greenblatt's Shortform · 2025-01-25T07:01:56.226Z · LW · GW

Neat, thanks a ton for the algorithmic-vs-labor update -- I appreciated that you'd distinguished those in your post, but I forgot to carry that through in mine! :) 

And oops, I really don't know how I got to 1.6 instead of 1.5 there. Thanks for the flag, have updated my comment accordingly! 

The square relationship idea is interesting -- that factor of 2 is a huge deal. Would be neat to see a Guesstimate or Squiggle version of this calculation that tries to account for the various nuances Tom mentions, and has error bars on each of the terms, so we both get a distribution of r and a sensitivity analysis. (Maybe @Tom Davidson already has this somewhere? If not I might try to make a crappy version myself, or poke talented folks I know to do a good version :) 
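In that spirit, here's a crappy stdlib-only Monte Carlo sketch of the takeoff-ratio calculation from my longer comment below -- every distribution here is a made-up placeholder, not a considered estimate:

```python
import math
import random

random.seed(0)

# Toy Monte Carlo over the takeoff ratio (1 - p) * g_A / g_E.
# All distributions below are made-up placeholders, not considered estimates.
def sample_takeoff_ratio():
    compute_mult = random.lognormvariate(math.log(4.0), 0.2)  # compute, x/year
    labor_mult = random.lognormvariate(math.log(1.6), 0.2)    # labor, x/year
    eff_mult = random.lognormvariate(math.log(3.5), 0.3)      # efficiency, x/year
    p = random.uniform(0.3, 0.5)                              # compute elasticity
    g_E = p * math.log(compute_mult) + (1 - p) * math.log(labor_mult)
    return (1 - p) * math.log(eff_mult) / g_E

samples = sorted(sample_takeoff_ratio() for _ in range(100_000))
median = samples[len(samples) // 2]
p_foom = sum(s > 1 for s in samples) / len(samples)
print(f"median takeoff ratio: {median:.2f}, P(ratio > 1): {p_foom:.0%}")
```

A real version would want considered distributions and the parallel-vs-serial and algorithmic-vs-labor adjustments, but even placeholder error bars make the knife-edge nature of the point estimate obvious.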

Comment by deep on ryan_greenblatt's Shortform · 2025-01-25T06:57:52.832Z · LW · GW

Really appreciate you covering all these nuances, thanks Tom!

Can you give a pointer to the studies you mentioned here?

There are various sources of evidence on how much capabilities improve every time training efficiency doubles: toy ML experiments suggest the answer is ~1.7; human productivity studies suggest the answer is ~2.5. I put more weight on the former, so I’ll estimate 2. This doubles my median estimate to r = ~2.8 (= 1.4 * 2).

Comment by deep on ryan_greenblatt's Shortform · 2025-01-22T20:20:35.435Z · LW · GW

Hey Ryan! Thanks for writing this up -- I think this whole topic is important and interesting.

I was confused about how your analysis related to the Epoch paper, so I spent a while with Claude analyzing it. I did a re-analysis that finds similar results, but also finds (I think) some flaws in your rough estimate. (Keep in mind I'm not an expert myself, and I haven't closely read the Epoch paper, so I might well be making conceptual errors. I think the math is right though!)

I'll walk through my understanding of this stuff first, then compare to your post. I'll be going a little slowly (A) so I can refresh myself by referencing this later, (B) to make it easy to call out mistakes, and (C) to hopefully make this legible to others who want to follow along.

Using Ryan's empirical estimates in the Epoch model

The Epoch model

The Epoch paper models growth with the following equation:

1. $\frac{\dot{A}}{A} = k \, E^{\lambda} A^{-\beta}$

where $A$ = efficiency and $E$ = research input. We want to consider worlds with a potential software takeoff, meaning that increases in AI efficiency directly feed into research input, which we model as $E \propto A$. So the key consideration seems to be the ratio $\lambda/\beta$. If it's 1, we get steady exponential growth from scaling inputs; greater, superexponential; smaller, subexponential.[1]

Fitting the model

How can we learn about this ratio from historical data? 

Let's pretend history has been convenient and we've seen steady exponential growth in both variables, so $A = e^{g_A t}$ and $E = e^{g_E t}$. Then $\frac{\dot{A}}{A}$ has been constant over time, so by equation 1, $E^{\lambda} A^{-\beta}$ has been constant as well. Substituting in for $A$ and $E$, we find that $e^{(\lambda g_E - \beta g_A) t}$ is constant over time, which is only possible if $\lambda g_E = \beta g_A$ and the exponent is always zero. Thus if we've seen steady exponential growth, the historical value of our key ratio is: 

2. $\frac{\lambda}{\beta} = \frac{g_A}{g_E}$

Intuitively, if we've seen steady exponential growth while research input has increased more slowly than research output (AI efficiency), there are superlinear returns to scaling inputs. 
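This consistency argument is easy to sanity-check numerically. A minimal sketch (the growth rates are illustrative, and $\beta$ is arbitrary since only the ratio matters):

```python
import math

# Sanity check: with A = e^(g_A t) and E = e^(g_E t), the right-hand side of
# equation 1, E^lambda * A^(-beta), is constant over time exactly when
# lambda * g_E = beta * g_A, i.e. lambda/beta = g_A/g_E.
g_A, g_E = math.log(3.5), math.log(2.3)  # illustrative growth rates
beta = 0.7                               # arbitrary; only the ratio matters
lam = beta * g_A / g_E                   # impose equation 2

def rhs(t, lam):
    # E^lambda * A^(-beta) = e^((lam*g_E - beta*g_A) * t)
    return math.exp((lam * g_E - beta * g_A) * t)

# With equation 2 imposed, the RHS is constant (so dA/dt / A can be constant):
assert all(abs(rhs(t, lam) - 1.0) < 1e-9 for t in (0.0, 1.0, 5.0))

# Perturb the ratio and the constancy breaks:
assert rhs(5.0, lam * 1.1) > 1.5
```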

Introducing the Cobb-Douglas function

But wait! $E$, research input, is an abstraction that we can't directly measure. Really there's both compute and labor inputs. Those have indeed been growing roughly exponentially, but at different rates. 

Intuitively, it makes sense to say that "effective research input" has grown as some kind of weighted average of the rate of compute and labor input growth. This is my take on why a Cobb-Douglas function of form (3) $E = C^{p} L^{1-p}$, with a weight parameter $p$, is useful here: it's a weighted geometric average of the two inputs, so its growth rate is a weighted average of their growth rates. 

Writing that out: in general, say both inputs have grown exponentially, so $C = e^{g_C t}$ and $L = e^{g_L t}$. Then $E$ has grown as $e^{(p g_C + (1-p) g_L) t}$, so $g_E$ is the weighted average (4) $g_E = p\, g_C + (1-p)\, g_L$ of the growth rates of compute and labor. 

Then, using Equation 2, we can estimate our key ratio $\lambda/\beta$ as $g_A / g_E$.

Let's get empirical! 

Plugging in your estimates: 

  • Historical compute scaling of 4x/year gives $g_C = \ln 4 \approx 1.39$;
  • Historical labor scaling of 1.6x/year gives $g_L = \ln 1.6 \approx 0.47$;
  • Historical compute elasticity on research outputs of 0.4 gives $p = 0.4$;
  • Adding these together, $g_E = p\, g_C + (1-p)\, g_L \approx 0.84$, i.e. effective inputs growing ~2.3x/year.[2] 
  • Historical efficiency improvement of 3.5x/year gives $g_A = \ln 3.5 \approx 1.25$.
  • So $\lambda/\beta = g_A / g_E \approx 1.25 / 0.84 \approx 1.5$. [3]

Adjusting for labor-only scaling

But wait: we're not done yet! Under our Cobb-Douglas assumption, scaling labor by a factor of 2 isn't as good as scaling all research inputs by a factor of 2; it's only $2^{1-p}$ as good. 

Plugging in Equation 3 (which describes research input $E$ in terms of compute and labor) to Equation 1 (which estimates AI progress $\dot{A}/A$ based on research), our adjusted form of the Epoch model is $\frac{\dot{A}}{A} = k \, C^{p\lambda} L^{(1-p)\lambda} A^{-\beta}$.

Under a software-only singularity, we hold compute constant while scaling labor with AI efficiency, so $\frac{\dot{A}}{A}$ is proportional to $L^{(1-p)\lambda} A^{-\beta}$ multiplied by a fixed compute term. Since labor scales as $A$, we have $\frac{\dot{A}}{A} \propto A^{(1-p)\lambda - \beta}$. By the same analysis as in our first section, we can see $A$ grows exponentially if $(1-p)\lambda/\beta = 1$, and grows superexponentially if this ratio is >1. So our key ratio $\lambda/\beta$ just gets multiplied by $(1-p)$, and it wasn't a waste to find it, phew! 

Now we get the true form of our equation: we get a software-only foom iff $(1-p)\,\lambda/\beta > 1$, or (via equation 2) iff we see empirically that $(1-p)\, g_A/g_E > 1$. Call this the takeoff ratio: it corresponds to a) how much AI progress scales with inputs and b) how much of a penalty we take for not scaling compute.

Result: Above, we got $\lambda/\beta \approx 1.5$, so our takeoff ratio is $0.6 \times 1.5 = 0.9$. That's quite close to 1! If we think it's more reasonable to assume a historical efficiency growth rate of 4x/year instead of 3.5x/year, we'd increase our takeoff ratio by a factor of $\ln 4 / \ln 3.5 \approx 1.11$, to a ratio of $\approx 1.0$, right on the knife edge of FOOM. [4] [note: I previously had the wrong numbers here: I had $\lambda/\beta = 1.6$, which would mean the 4x/year case has a takeoff ratio of 1.05, putting it into FOOM land] 
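These numbers are easy to check end-to-end. A minimal Python sketch, plugging in the empirical estimates above:

```python
import math

# Empirical estimates from above.
g_C = math.log(4)      # compute: 4x/year
g_L = math.log(1.6)    # labor: 1.6x/year
p = 0.4                # compute elasticity on research outputs

# Equation 4: effective research input growth is the weighted average.
g_E = p * g_C + (1 - p) * g_L
print(f"effective inputs grow {math.exp(g_E):.2f}x/year")  # ~2.31x/year

for eff_mult in (3.5, 4.0):            # efficiency growth scenarios
    g_A = math.log(eff_mult)
    ratio = g_A / g_E                  # equation 2: lambda/beta
    takeoff = (1 - p) * ratio          # adjusted for labor-only scaling
    print(f"{eff_mult}x/year: lambda/beta = {ratio:.2f}, "
          f"takeoff ratio = {takeoff:.2f}")
```

The 3.5x case gives a takeoff ratio of ~0.9, and the 4x case lands right around 1, matching the knife-edge result above.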

So this isn't too far off from your results in terms of implications, but it is somewhat different (no FOOM for 3.5x, less sensitivity to the exact historical growth rate).

 

Analyzing your approach:

Tweaking alpha:

Your estimate of $\alpha$ is in fact similar in form to my ratio $\lambda/\beta$ -- but what you're calculating instead is $e^{g_A}/e^{g_E}$, a ratio of one year's multipliers rather than of the underlying growth rates.

One indicator that something's wrong is that your result involves checking whether $\alpha \geq 2$, or equivalently whether $e^{g_A - g_E} \geq 2$, or equivalently whether $g_A - g_E \geq \ln 2$. But the choice of 2 is arbitrary -- conceptually, you just want to check if scaling software by a factor $n$ increases outputs by a factor $n$ or more. Yet the analogous $\alpha_n$, computed over the time it takes inputs to grow by a factor $n$, clearly varies with $n$. 


One way of parsing the problem is that alpha is (implicitly) time dependent -- it is equal to $e^{g_A \cdot 1\text{ yr}} / e^{g_E \cdot 1\text{ yr}}$, a ratio of progress vs. inputs over a one-year window. If you calculated alpha over a different window, you'd get a different value. By contrast, $g_A/g_E$ is a ratio of rates, so it stays the same regardless of what timeframe you use to measure it.[5]

Maybe I'm confused about what your Cobb-Douglas function is meant to be calculating - is it E within an Epoch-style takeoff model, or something else?


Nuances: 

Does Cobb-Douglas make sense?

The geometric average of rates thing makes sense, but it feels weird that that simple intuitive approach leads to a functional form (Cobb-Douglas) that also has other implications.

Wikipedia says Cobb-Douglas functions can have the exponents not add to 1 (while both being between 0 and 1). Maybe this makes sense here? Not an expert. 

How seriously should we take all this?

This whole thing relies on...

  • Assuming smooth historical trends
  • Assuming those trends continue in the future
  • And those trends themselves are based on functional fits to rough / unclear data.

It feels like this sort of thing is better than nothing, but I wish we had something better. 

I really like the various nuances you're adjusting for, like parallel vs serial scaling, and especially distinguishing algorithmic improvement from labor efficiency. [6] Thinking those things through makes this stuff feel less insubstantial and approximate...though the error bars still feel quite large. 

 

  1. ^

    Actually there's a complexity here, which is that scaling labor alone may be less efficient than scaling "research inputs" which include both labor and compute. We'll come to this in a few paragraphs.

  2. ^

    This is only coincidentally similar to your figure of 2.3 :)

  3. ^

    I originally had 1.6 here, but as Ryan points out in a reply it's actually 1.5. I've tried to reconstruct what I could have put into a calculator to get 1.6 instead, and I'm at a loss!

  4. ^

    I was curious how aggressive the superexponential growth curve would be with a takeoff ratio of a mere 1.05. A couple of Claude queries gave me different answers (maybe because the growth is so extreme that different solvers give meaningfully different approximations?), but they agreed that growth is fairly slow in the first year (~5x) and then hits infinity by the end of the second year.  I wrote this comment with the wrong numbers (0.96 instead of 0.9), so it doesn't accurately represent what you get if you plug in 4x capability growth per year. Still cool to get a sense of what these curves look like, though.
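    The takeoff ratio alone doesn't pin down how fast the blowup is: writing the labor-only model as $\dot{A}/A = k A^{\beta(s-1)}$ with takeoff ratio $s$, a solver also needs $k$ and $\beta$, which the ratio analysis leaves free. A rough Euler sketch with assumed values for those (so the specific numbers here are illustrative and needn't match the Claude answers above):

```python
import math

# Euler sketch of dA/dt = k * A**(1 + beta*(s - 1)), the labor-only model
# rewritten in terms of the takeoff ratio s = (1-p)*lambda/beta.
# Only s is pinned down by the analysis; k and beta are assumptions.
s = 1.05             # takeoff ratio (>1 means finite-time blowup)
beta = 0.5           # assumed
k = math.log(4)      # assumed: ~4x/year growth while A is near 1

def years_to_blowup(s, beta, k, dt=1e-4, t_max=40.0, cap=1e12):
    """Integrate until A exceeds `cap` (a stand-in for 'infinity') or t_max."""
    A, t = 1.0, 0.0
    while t < t_max and A < cap:
        A += dt * k * A ** (1 + beta * (s - 1))
        t += dt
    return t, A

t_blow, A_end = years_to_blowup(s, beta, k)
print(f"A = {A_end:.2e} at t = {t_blow:.1f} years")
```

    With these assumptions growth looks nearly exponential (~4x) in year one and only diverges after a decade or so; a larger $k$, $\beta$, or $s$ front-loads the blowup, which presumably explains the much faster curves above.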

  5. ^

    I think this can be understood in terms of the alpha-being-implicitly-a-timescale-function thing -- if you compare an alpha value with the ratio of growth you're likely to see during the same time period, e.g. alpha(1 year) and n = one doubling, you probably get reasonable-looking results.

  6. ^

    I find it annoying that people conflate "increased efficiency of doing known tasks" with "increased ability to do new useful tasks". It seems to me that these could be importantly different, although it's hard to even settle on a reasonable formalization of the latter. Some reasons this might be okay:
     

    • There's a fuzzy conceptual boundary between the two: if GPT-n can do the task at 0.01% success rate, does that count as a "known task?" what about if it can do each of 10 components at 0.01% success, so in practice we'll never see it succeed if run without human guidance, but we know it's technically possible?
    • Under a software singularity situation, maybe the working hypothesis is that the model can do everything necessary to improve itself a bunch, maybe just not very efficiently yet. So we only need efficiency growth, not to increase the task set. That seems like a stronger assumption than most make, but maybe a reasonable weaker assumption is that the model will 'unlock' the necessary new tasks over time, after which point they become subject to rapid efficiency growth.
    • And empirically, we have in fact seen rapid unlocking of new capabilities, so it's not crazy to approximate "being able to do new things" as a minor but manageable slowdown to the process of AI replacing human AI R&D labor.