When AI 10x's AI R&D, What Do We Do?
post by Logan Riggs (elriggs) · 2024-12-21T23:56:11.069Z · LW · GW · 6 comments
Note: below is a hypothetical future written in strong terms and does not track my actual probabilities.
Throughout 2025, a huge amount of compute is spent on producing data in verifiable tasks, such as math[1] (w/ "does it compile as a proof?" being the ground truth label) and code (w/ "does it compile and pass unit tests?" being the ground truth label).
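As a toy illustration of the kind of ground-truth signal this relies on for code (a minimal sketch, not any lab's actual pipeline; it assumes pytest is available and skips the sandboxing you'd obviously want for untrusted code):

```python
import os
import subprocess
import tempfile

def code_reward(candidate_solution: str, unit_tests: str) -> float:
    """Return 1.0 if an LLM-written solution passes the unit tests, else 0.0.

    This is the whole "ground truth label" for the code task: no human
    judgment, just "does it run and pass the tests?".
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "solution_test.py")
        with open(path, "w") as f:
            f.write(candidate_solution + "\n\n" + unit_tests)
        result = subprocess.run(
            ["python", "-m", "pytest", path, "-q"],
            capture_output=True,
            timeout=60,
        )
        return 1.0 if result.returncode == 0 else 0.0
```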
In 2026, when the next giant compute clusters w/ their GB200's are built, labs train the next larger model over 100 days, then some extra RL(H/AI)F and whatever else they've cooked up by then.
By mid-2026, we have a model that is very generally intelligent and superhuman at coding and math proofs.
Naively, 10x-ing research means releasing 10x the number of same-quality papers in a year; however, these new LLMs have a different skill profile, allowing different types of research and workflows.
If AI R&D is just (1) Code (2) Maths & (3) Idea generation/understanding, then LLMs will have 1 & 2 covered and top-researchers will have (3).[2]
In practice, a researcher tells the LLM "Read the maths sections of arxiv and my latest research on optimizers, and sort by most relevant", then sorts through the top 20 results and selects a project to pursue. They then dictate which code or math proofs to focus on (or ask the LLM to suggest potential paths) and let the LLM run.
Throughout the day, each researcher supervises 2-10 different projects, reviewing the results from LLMs as they come in.
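A minimal sketch of that division of labor, with `LLMClient`, `rank_projects`, and `run_project` as hypothetical stand-ins for whatever agent scaffolding a lab actually uses:

```python
class LLMClient:
    """Hypothetical wrapper around a superhuman-coding/math model."""

    def rank_projects(self, prompt: str) -> list[str]:
        # Would call the model to scan arXiv + internal research and rank ideas;
        # stubbed here for illustration.
        return [f"project idea {i}" for i in range(100)]

    def run_project(self, spec: str) -> str:
        # Would autonomously write the code / proofs for the project; stubbed here.
        return f"results for: {spec}"

def daily_loop(llm: LLMClient, researcher_selects) -> list[str]:
    # The LLM handles the breadth-heavy literature scan and ranking.
    candidates = llm.rank_projects(
        "Read the maths sections of arXiv and my latest research on optimizers, "
        "and sort by most relevant."
    )[:20]
    # (3): the human researcher supplies the idea selection / understanding.
    selected = [c for c in candidates if researcher_selects(c)][:10]
    # (1) & (2): the LLM writes the code and proofs for each project;
    # the researcher reviews results as they come in.
    return [llm.run_project(spec) for spec in selected]
```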
GPU's are still a constraint, but research projects can be filtered by compute: fundamental research into optimizers, compute-cluster protocols, or interpreting GPT-2 can be done w/ a small number of GB200's (or the previous compute cluster).
With these generally intelligent LLMs w/ superhuman coding & math proofs, how can we...
Scale Capabilities Safely
Having a plan for the order of research to pursue gives researchers & labs a way to coordinate scaling capabilities. RSPs are an example of this; however, they currently don't specify what research to pursue once LLMs can automate AI R&D.
Ideally we can produce a plan that you would trust (with your life) any lab to honestly pursue. This is my version 1 of such a plan:
Step 1: Hardening Defenses & More Control
LLMs can now produce larger amounts of high-quality code and proofs. It might be practical now to write a secure OS or do massive amounts of white-hat hacking to patch vulnerabilities before they're found, including social hacking w/ AI voice and video.
Additionally, more work on the control agenda would ideally increase the amount of useful work you can get out of potentially misaligned LLMs. However, control fails when there's too large of a gap between your most trusted model and your most capable model. This leads to:
Step 2: Automate Interp
I'm expecting that in the 18 months before this super-LLM is made, we'll have already done more basic interp, such as finding features for deception/sycophancy/etc. through model-organism research, and figuring out parts of the circuits for personality traits & personas.[3]
But how do we win?
If we can reverse engineer all the algorithms that LLMs learned when predicting text, then we can build larger and larger trust in more capable models. Understanding the algorithms that a model runs is a way to scale capabilities safely.
For example, only scale past [specific capabilities] once you understand the entire model up to that level.
What do you mean by "understand"? There are a million facts about LLMs, but not all facts are useful for goals we care about (e.g. "neuron 3030 is >4 10% of the time" is technically an "understanding", but not a useful one).
By "understand", I mean we should be able to replace each part of the model w/ our hypothesis of it. For example, if we say part of the model is used to increase the entropy of the output distribution, then we should be able to replace it w/ code that explicitely does that.
Just because every single part is interpretable, doesn't mean you understand the whole or understand its dynamics in long-contexts.
I agree that focusing on a single part misses its interactions w/ other parts. It's easier to predict how a feature generalizes if you understand how it interacts w/ earlier/later functions, in the same way that understanding a variable in a program is easier when you know how it was made and how it's used in downstream functions.
In fact, suppose we interpret layer 0 features and how those compute layer 1 features (ie Z = 3A + 4B); then we understand the layer 1 features too, and can keep building up layer by layer.
Okay, but it does sound like a whole lot of work and will be intractable.
This is why it's Step 2 and not 1. Step 1 is all the reasonable, practical things; the minimal viable product that we can get working in time. Step 2 is intended as a solution that scales to even higher capabilities.
Although I am curious what other research agendas we could rely on that allow safe scaling (I mostly work in interp, so that's what I know).
But let's think through the benefits of this. Suppose we have an alternative computation graph (ACG) of e.g. GPT-2, where (some better version of) SAE features are the nodes and the edges are how earlier features compute later features (a toy sketch of such an ACG follows this list). Then:
- We can get guarantees about generalization behavior similar to those for normal, large-scale coding projects (which still mess up, but which we do have experience with).
- We can remove all backdoors by simply running computation through this ACG (alternative computation graph).
- If this is more expensive, we should be able to translate this back into a normal NN, leaving out backdoors which weren't captured in the ACG.
- We can do massive model distillation: keep the circuits you need for e.g. coding, and remove everything else.
- Although this might be outcompeted by simply training another model on only coding data generated by an LLM
- Jailbreaks might be solved by default: suppose Z = 3A + 4B, and a jailbreak activates Z in a weird way, e.g. Z = 10C + 20D (where C & D don't co-occur in distribution, but do on the jailbreak prompt). Since the ACG only computes Z from A & B, that spurious path simply isn't there when running through the ACG.
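Here's the toy ACG sketch promised above: nodes are interpreted features, edges are the explicit functions we believe compute later features from earlier ones. The numbers just follow the Z = 3A + 4B example; nothing here is a real interp result.

```python
from typing import Callable

class ACG:
    """Alternative computation graph: run computation only through interpreted edges."""

    def __init__(self):
        # feature name -> (parent feature names, explicit function of the parents);
        # features must be added in topological order for `run` to work.
        self.edges: dict[str, tuple[list[str], Callable[..., float]]] = {}

    def add_feature(self, name: str, parents: list[str], fn: Callable[..., float]):
        self.edges[name] = (parents, fn)

    def run(self, inputs: dict[str, float]) -> dict[str, float]:
        values = dict(inputs)
        for name, (parents, fn) in self.edges.items():
            # Each feature is computed only from its interpreted parents.
            values[name] = fn(*(values[p] for p in parents))
        return values

acg = ACG()
acg.add_feature("Z", ["A", "B"], lambda a, b: 3 * a + 4 * b)

# In-distribution prompt: A and B are active, Z fires as expected.
print(acg.run({"A": 1.0, "B": 1.0, "C": 0.0, "D": 0.0})["Z"])  # 7.0

# "Jailbreak" prompt: C and D are active in a weird combination. The original
# network might spuriously activate Z here, but the ACG has no C/D -> Z edge,
# so Z simply doesn't fire.
print(acg.run({"A": 0.0, "B": 0.0, "C": 1.0, "D": 1.0})["Z"])  # 0.0
```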
Currently we do not have a good enough ontology / unit of computation for LLMs. SAEs are a solid v1, but are not good enough. I'm hoping that in the next year we can solve this; then automating the interp work can just scale w/ [GPT-5]. If not, then pursuing fundamental interp research is still valuable, but requires in-house interp researchers to lead it.
As an example, trying to fully interpret small LLMs w/ the current best techniques (SAEs), finding the holes in that, and iterating is a solid research direction that current researchers (and those in the future w/ super-coding/math LLMs) can work on.
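For concreteness, the current-best-technique baseline being iterated on is roughly a sparse autoencoder over a model's activations; a minimal sketch (shapes and hyperparameters are illustrative only):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Encode residual-stream activations into an overcomplete, sparse feature
    basis and reconstruct them; the learned features are the candidate
    'units of computation'."""

    def __init__(self, d_model: int = 768, d_features: int = 768 * 16):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstructed activations
        return recon, features

def sae_loss(acts, recon, features, l1_coeff: float = 1e-3) -> torch.Tensor:
    # Reconstruction fidelity plus an L1 sparsity penalty on feature activations.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
```

"Finding the holes" then roughly amounts to measuring how much of the model's behavior you lose when you route computation through these features instead of the raw activations.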
Conclusion
It's valuable to have a plan to coordinate the order of research to pursue when AIs can 10x research (whether that's in 1.5 years or 5). RSPs are an attempt at this kind of coordination mechanism, but they currently don't specify any plans for what to research w/ accelerated AI R&D.
Above is my current best guess at a v1 of what an ideal world would do. There are most likely alternative research directions that would also be desirable, and I'd appreciate comments that pitch them (e.g. SLT for converting data into algorithms, automating meta-philosophy somehow, etc.).
Additionally, what's missing is any sort of enforcement mechanism (e.g. self-enforcement by actually convincing people of the risks, government agencies, etc.).
1. ^ The verifiable part is going from e.g. Lean code to a verified proof in Lean, not from words to Lean code. However, there is already work on converting papers into Lean code, which might already be good enough to rely on.
2. ^ Although I do gain more understanding of what I want & mean by coding, so it might be the case that researchers who stop coding will lose their touch. However, there might be enough verification methods (e.g. sanity-checking research ideas w/ graphs and corollaries, and talking to an LLM w/ interruptible voice chat about the results) that researchers can stay grounded.
3. ^ If you decide to work on this, I can give feedback or mentoring.
6 comments
comment by ryan_greenblatt · 2024-12-22T18:09:17.312Z · LW(p) · GW(p)
This plan seems to underemphasize security. I expect that for 10x AI R&D[1], you strongly want fully state-proof security (SL5) against weight exfiltration, and then quickly after that you want this level of security for algorithmic secrets and unauthorized internal usage[2].
Things don't seem to be on track for this level of security so I expect a huge scramble to achieve this.
10x AI R&D could refer to "software progress in AI is now 10x faster" or "the labor input to software progress is now 10x faster". If it's the second one, it is plausible that fully state-proof security isn't the most important thing, if AI progress is mostly bottlenecked by other factors. However, 10x labor acceleration is pretty crazy and I think you want SL5. ↩︎
Unauthorized internal usage includes stuff like foreign adversaries doing their weapons R&D on your cluster using your model weights. ↩︎
comment by Logan Riggs (elriggs) · 2024-12-21T23:58:19.661Z · LW(p) · GW(p)
This is mostly a "reeling from o3" post. If anyone is doom/anxiety-reading these posts, well, I've been doing that too! At least we're in this together :)
comment by jacquesthibs (jacques-thibodeau) · 2024-12-22T05:57:41.088Z · LW(p) · GW(p)
Hey Logan, thanks for writing this!
We talked about this recently, but for others reading this: given that I'm working on building an org focused on this kind of work and wrote a relevant shortform lately [LW(p) · GW(p)], I wanted to ping anybody reading this to send me a DM if you are interested in either making this happen (I'm looking for a cracked CTO atm and will be entering phase 2 of Catalyze Impact in January) or providing feedback on an internal vision doc.
comment by Chris_Leong · 2024-12-22T14:20:12.078Z · LW(p) · GW(p)
I've said this elsewhere, but I think we need to also be working on training wise AI advisers in order to help us navigate these situations.
comment by CBiddulph (caleb-biddulph) · 2024-12-22T07:34:45.895Z · LW(p) · GW(p)
I really like your vision for interpretability!
I've been a little pessimistic about mech interp as compared to chain of thought, since CoT already produces very understandable end-to-end explanations for free (assuming faithfulness etc.).
But I'd be much more excited if we could actually understand circuits to the point of replacing them with annotated code or perhaps generating on-demand natural language explanations in an expandable tree. And as long as we can discover the key techniques for doing that, the fiddly cognitive work that would be required to scale it across a whole neural net may be feasible with AI only as capable as o3.
↑ comment by Logan Riggs (elriggs) · 2024-12-22T11:48:45.369Z · LW(p) · GW(p)
Thanks!
I forgot about faithful CoT and definitely think that should be a "Step 0". I'm also concerned here that AGI labs just won't do the reasonable things (e.g. they'll train for brevity, making the CoT more steganographic).
For Mech-interp, ya, we're currently bottlenecked by:
- Finding a good enough unit-of-computation (which would enable most of the higher-guarantee research)
- Computing Attention_in --> Attention_out (Keith got the QK-circuit [LW · GW] -> attention pattern working a while ago, but it hasn't been hooked up w/ the OV-circuit yet)