When AI 10x's AI R&D, What Do We Do?
post by Logan Riggs (elriggs) · 2024-12-21T23:56:11.069Z · LW · GW · 6 comments
Note: below is a hypothetical future written in strong terms and does not track my actual probabilities.
Throughout 2025, a huge amount of compute is spent on producing data in verifiable tasks, such as math[1] (w/ "does it compile as a proof?" being the ground truth label) and code (w/ "does it compile and pass unit tests?" being the ground truth label).
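As a toy illustration of the kind of ground-truth signal this relies on for code (a minimal sketch, not any lab's actual pipeline; it assumes pytest is available and skips the sandboxing you'd obviously want for untrusted code):

```python
import os
import subprocess
import tempfile

def code_reward(candidate_solution: str, unit_tests: str) -> float:
    """Return 1.0 if an LLM-written solution passes the unit tests, else 0.0.

    This is the whole "ground truth label" for the code task: no human
    judgment, just "does it run and pass the tests?".
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "solution_test.py")
        with open(path, "w") as f:
            f.write(candidate_solution + "\n\n" + unit_tests)
        result = subprocess.run(
            ["python", "-m", "pytest", path, "-q"],
            capture_output=True,
            timeout=60,
        )
        return 1.0 if result.returncode == 0 else 0.0
```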
In 2026, when the next giant compute clusters w/ their GB200's are built, labs train the next larger model over 100 days, then some extra RL(H/AI)F and whatever else they've cooked up by then.
By mid-2026, we have a model that is very generally intelligent and superhuman at coding and math proofs.
Naively, 10x-ing research means releasing 10x the number of same-quality papers in a year; however, these new LLMs have a different skill profile, allowing different types of research and workflows.
If AI R&D is just (1) Code (2) Maths & (3) Idea generation/understanding, then LLMs will have 1 & 2 covered and top-researchers will have (3).[2]
In practice, a researcher tells the LLM "Read the maths sections of arxiv and my latest research on optimizers, and sort by most relevant", then sorts through the top 20 results and selects a project to pursue. They then dictate which code or math proofs to focus on (or ask the LLM to suggest potential paths) and let the LLM run.
Throughout the day, each researcher supervises 2-10 different projects, reviewing the results from LLMs as they come in.
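A minimal sketch of that division of labor, with `LLMClient`, `rank_projects`, and `run_project` as hypothetical stand-ins for whatever agent scaffolding a lab actually uses:

```python
class LLMClient:
    """Hypothetical wrapper around a superhuman-coding/math model."""

    def rank_projects(self, prompt: str) -> list[str]:
        # Would call the model to scan arXiv + internal research and rank ideas;
        # stubbed here for illustration.
        return [f"project idea {i}" for i in range(100)]

    def run_project(self, spec: str) -> str:
        # Would autonomously write the code / proofs for the project; stubbed here.
        return f"results for: {spec}"

def daily_loop(llm: LLMClient, researcher_selects) -> list[str]:
    # The LLM handles the breadth-heavy literature scan and ranking.
    candidates = llm.rank_projects(
        "Read the maths sections of arXiv and my latest research on optimizers, "
        "and sort by most relevant."
    )[:20]
    # (3): the human researcher supplies the idea selection / understanding.
    selected = [c for c in candidates if researcher_selects(c)][:10]
    # (1) & (2): the LLM writes the code and proofs for each project;
    # the researcher reviews results as they come in.
    return [llm.run_project(spec) for spec in selected]
```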
GPU's are still a constraint, but research projects can be filtered by compute: fundamental research into optimizers, compute-cluster protocols, or interpreting GPT-2 can be done w/ a small number of GB200's (or the previous compute cluster).
With these generally intelligent LLMs w/ superhuman coding & math proofs, how can we...
Scale Capabilities Safely
Having a plan for the order of research to pursue gives researchers & labs a way to coordinate scaling capabilities. RSPs are an example of this; however, they currently don't specify what research to pursue once LLMs can automate AI R&D.
Ideally we can produce a plan that you would trust (with your life) any lab to honestly pursue. This is my version 1 of such a plan:
Step 1: Hardening Defenses & More Control
LLMs can now produce larger amounts of high-quality code and proofs. It might be practical now to write a secure OS or do massive amounts of white-hat hacking to patch vulnerabilities before they're found, including social hacking w/ AI voice and video.
Additionally, more work on the control agenda would ideally increase the amount of useful work you can get out of potentially misaligned LLMs. However, control fails when there's too large of a gap between your most trusted model and your most capable model. This leads to:
Step 2: Automate Interp
I'm expecting that in the 18 months before this super-LLM is made, we'll have already done more basic interp, such as finding features for deception/sycophancy/etc. through model-organism research, and figuring out parts of the circuits for personality traits & personas.[3]
But how do we win?
If we can reverse engineer all the algorithms that LLMs learned when predicting text, then we can build larger and larger trust in more capable models. Understanding the algorithms that a model runs is a way to scale capabilities safely.
For example, only scale past [specific capabilities] once you understand the entire model up to that level.
What do you mean by "understand"? There are a million facts about LLMs, but not all facts are useful for goals we care about (e.g. "neuron 3030 is >4 10% of the time" is technically an "understanding", but not a useful one).
By "understand", I mean we should be able to replace each part of the model w/ our hypothesis of it. For example, if we say part of the model is used to increase the entropy of the output distribution, then we should be able to replace it w/ code that explicitely does that.
Just because every single part is interpretable, doesn't mean you understand the whole or understand its dynamics in long-contexts.
I agree that focusing on a single part misses its interactions w/ other parts. It's easier to predict how a feature generalizes if you understand how it interacts w/ earlier/later functions, in the same way that understanding a variable in a program is easier when you know how it was made and how it's used in downstream functions.
In fact, suppose we interpret layer 0 features and how those compute layer 1 features (ie Z = 3A + 4B); then we understand the layer 1 features too, and can keep building up layer by layer.
Okay, but it does sound like a whole lot of work and will be intractable.
This is why it's Step 2 and not 1. Step 1 is all the reasonable, practical things; the minimal viable product that we can get working in time. Step 2 is intended as a solution that scales to even higher capabilities.
Although I am curious what other research agendas we could rely on that allow safe scaling (I mostly work in interp, so that's what I know).
But let's think through the benefits of this. Suppose we have an alternative computation graph (ACG) of e.g. GPT-2, where (some better version of) SAE features are the nodes and the edges are how earlier features compute later features (a toy sketch of such an ACG follows this list). Then:
- We can get guarantees about generalization behavior similar to those for normal, large-scale coding projects (which still mess up, but which we do have experience with).
- We can remove all backdoors by simply running computation through this ACG (alternative computation graph).
- If this is more expensive, we should be able to translate this back into a normal NN, leaving out backdoors which weren't captured in the ACG.
- We can do massive model distillation: keep the circuits you need for e.g. coding, and remove everything else.
- Although this might be outcompeted by simply training another model on only coding data generated by an LLM
- Jailbreaks might be solved by default: suppose Z = 3A + 4B, and a jailbreak activates Z in a weird way, e.g. Z = 10C + 20D (where C & D don't co-occur in distribution, but do on the jailbreak prompt). Since the ACG only computes Z from A & B, that spurious path simply isn't there when running through the ACG.
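Here's the toy ACG sketch promised above: nodes are interpreted features, edges are the explicit functions we believe compute later features from earlier ones. The numbers just follow the Z = 3A + 4B example; nothing here is a real interp result.

```python
from typing import Callable

class ACG:
    """Alternative computation graph: run computation only through interpreted edges."""

    def __init__(self):
        # feature name -> (parent feature names, explicit function of the parents);
        # features must be added in topological order for `run` to work.
        self.edges: dict[str, tuple[list[str], Callable[..., float]]] = {}

    def add_feature(self, name: str, parents: list[str], fn: Callable[..., float]):
        self.edges[name] = (parents, fn)

    def run(self, inputs: dict[str, float]) -> dict[str, float]:
        values = dict(inputs)
        for name, (parents, fn) in self.edges.items():
            # Each feature is computed only from its interpreted parents.
            values[name] = fn(*(values[p] for p in parents))
        return values

acg = ACG()
acg.add_feature("Z", ["A", "B"], lambda a, b: 3 * a + 4 * b)

# In-distribution prompt: A and B are active, Z fires as expected.
print(acg.run({"A": 1.0, "B": 1.0, "C": 0.0, "D": 0.0})["Z"])  # 7.0

# "Jailbreak" prompt: C and D are active in a weird combination. The original
# network might spuriously activate Z here, but the ACG has no C/D -> Z edge,
# so Z simply doesn't fire.
print(acg.run({"A": 0.0, "B": 0.0, "C": 1.0, "D": 1.0})["Z"])  # 0.0
```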
Currently we do not have a good enough ontology / unit of computation for LLMs. SAEs are a solid v1, but are not good enough. I'm hoping that in the next year we can solve this; then automating the interp work can just scale w/ [GPT-5]. If not, then pursuing fundamental interp research is still valuable, but requires in-house interp researchers to lead it.
As an example, trying to fully interpret small LLMs w/ the current best techniques (SAEs), finding the holes in that, and iterating is a solid research direction that current researchers (and those in the future w/ super-coding/math LLMs) can work on.
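For concreteness, the current-best-technique baseline being iterated on is roughly a sparse autoencoder over a model's activations; a minimal sketch (shapes and hyperparameters are illustrative only):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Encode residual-stream activations into an overcomplete, sparse feature
    basis and reconstruct them; the learned features are the candidate
    'units of computation'."""

    def __init__(self, d_model: int = 768, d_features: int = 768 * 16):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstructed activations
        return recon, features

def sae_loss(acts, recon, features, l1_coeff: float = 1e-3) -> torch.Tensor:
    # Reconstruction fidelity plus an L1 sparsity penalty on feature activations.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
```

"Finding the holes" then roughly amounts to measuring how much of the model's behavior you lose when you route computation through these features instead of the raw activations.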
Conclusion
It's valuable to have a plan to coordinate the order of research to pursue when AIs can 10x research (whether that's in 1.5 years or 5). RSPs are an attempt at this kind of coordination mechanism, but they currently don't specify any plans for what to research w/ accelerated AI R&D.
Above is my current best guess at a v1 of what an ideal world would do. There are most likely alternative research directions that would also be desirable, and I'd appreciate comments that pitch them (e.g. SLT for converting data into algorithms, automating meta-philosophy somehow, etc.).
Additionally, what's missing is any sort of enforcement mechanism (e.g. self-enforcement by actually convincing people of the risks, government agencies, etc.).
1. ^ The verifiable part is going from e.g. Lean code to a verified proof in Lean, not from words to Lean code. However, there is already work on converting papers into Lean code, which might already be good enough to rely on.
2. ^ Although I do gain more understanding of what I want & mean by coding, so it might be the case that researchers who stop coding will lose their touch. However, there might be enough verification methods (e.g. sanity-checking research ideas w/ graphs and corollaries, and talking to an LLM w/ interruptible voice chat about the results) that researchers can stay grounded.
3. ^ If you decide to work on this, I can give feedback or mentoring.
6 comments
comment by ryan_greenblatt · 2024-12-22T18:09:17.312Z · LW(p) · GW(p)
This plan seems to underemphasize security. I expect that for 10x AI R&D[1], you strongly want fully state-proof security (SL5) against weight exfiltration, and then quickly after that you want this level of security for algorithmic secrets and unauthorized internal usage[2].
Things don't seem to be on track for this level of security so I expect a huge scramble to achieve this.
10x AI R&D could refer to "software progress in AI is now 10x faster" or "the labor input to software progress is now 10x faster". If it's the second one, it is plausible that fully state-proof security isn't the most important thing, if AI progress is mostly bottlenecked by other factors. However, 10x labor acceleration is pretty crazy and I think you want SL5. ↩︎
Unauthorized internal usage includes stuff like foreign adversaries doing their weapons R&D on your cluster using your model weights. ↩︎
comment by Logan Riggs (elriggs) · 2024-12-21T23:58:19.661Z · LW(p) · GW(p)
This is mostly a "reeling from o3" post. If anyone is doom/anxiety-reading these posts, well, I've been doing that too! At least we're in this together :)
comment by jacquesthibs (jacques-thibodeau) · 2024-12-22T05:57:41.088Z · LW(p) · GW(p)
Hey Logan, thanks for writing this!
We talked about this recently, but for others reading this: given that I'm working on building an org focused on this kind of work and wrote a relevant shortform lately [LW(p) · GW(p)], I wanted to ping anybody reading this to send me a DM if you are interested in either making this happen (I'm looking for a cracked CTO atm and will be entering phase 2 of Catalyze Impact in January) or providing feedback on an internal vision doc.
comment by Chris_Leong · 2024-12-22T14:20:12.078Z · LW(p) · GW(p)
I've said this elsewhere, but I think we need to also be working on training wise AI advisers in order to help us navigate these situations.
comment by CBiddulph (caleb-biddulph) · 2024-12-22T07:34:45.895Z · LW(p) · GW(p)
I really like your vision for interpretability!
I've been a little pessimistic about mech interp as compared to chain of thought, since CoT already produces very understandable end-to-end explanations for free (assuming faithfulness etc.).
But I'd be much more excited if we could actually understand circuits to the point of replacing them with annotated code or perhaps generating on-demand natural language explanations in an expandable tree. And as long as we can discover the key techniques for doing that, the fiddly cognitive work that would be required to scale it across a whole neural net may be feasible with AI only as capable as o3.
↑ comment by Logan Riggs (elriggs) · 2024-12-22T11:48:45.369Z · LW(p) · GW(p)
Thanks!
I forgot about faithful CoT and definitely think that should be a "Step 0". I'm also concerned here that AGI labs just won't do the reasonable things (e.g. they'll train for brevity, making the CoT more steganographic).
For Mech-interp, ya, we're currently bottlenecked by:
- Finding a good enough unit-of-computation (which would enable most of the higher-guarantee research)
- Computing Attention_in --> Attention_out (Keith got the QK-circuit [LW · GW] -> attention pattern working a while ago, but it hasn't been hooked up w/ the OV-circuit yet)