A better strategy: tool AI
The (obvious) counter is that this doesn't seem competitive, especially in the long run, and plausibly not even today. E.g. where would o1[-preview] fit? It doesn't seem obvious how to build very confident quantitative safety guarantees for it (and especially for successors, for which incapability arguments will stop holding), so should it be banned / should this tech tree be abandoned (e.g. by the West)? OTOH, putting it under the 'tool AI' category seems like a bit of a stretch.
In conclusion, the potential of tool AI is absolutely stunning and, in my opinion, dramatically underrated. In contrast, AGI does not add much value at the present time beyond what tool AI will be able to deliver
This seems very unlikely to me, e.g. if LLMs/LM agents are excluded from the tool AI category.
I predict that if the pro-replacement school ran a global referendum on this question, they would be disappointed by the result.
I'm not sure this line of reasoning has the force some people seem to assume. What do you expect the results of similar hypothetical referendums would have been on, e.g., the industrial revolution or the agricultural revolution, before those changes happened?
If you disagree with my assertion, I challenge you to cite or openly publish an actual plan for aligning or controlling a hybrid AGI system.
Not claiming to speak on behalf of the relevant authors/actors, but quite a few (sketches of) such plans have been proposed, e.g. The Checklist: What Succeeding at AI Safety Will Involve, The case for ensuring that powerful AIs are controlled.
I don't know; intuitively it would seem suboptimal to put very little of the research portfolio into preparing the scaffolding, since somebody else who isn't that far behind on the base model (e.g. another lab, maybe even the open-source community) might figure out the scaffolding (and perhaps not even make anything public) and get ahead overall.
Indeed the leadership of all the major AI corporations are actively excited about, and gunning for, an intelligence explosion. They are integrating AI into their AI R&D as fast as they can.
Can you expand on this? My rough impression (without having any inside knowledge) is that auto AI R&D is probably very much underelicited, including e.g. in this recent OpenAI auto ML evals paper; which might suggest they're not gunning for it as hard as they could?
Seems a bit in the spirit of what's called 'high-level cross-validation' here, between SAEs + auto-interp on the one hand, and activation engineering on the other.
For similar reasons to the discussion here about why individuals and small businesses might be expected to be able to (differentially) contribute to LM scaffolding (research), I expect them to be able to differentially contribute to [generalized] inference scaling [research]; plausibly also the case for automated ML research agents. Also relevant: Before smart AI, there will be many mediocre or specialized AIs.
(cross-posted from https://x.com/BogdanIonutCir2/status/1844787728342245487)
the plausibility of this strategy, to 'endlessly trade computation for better performance' and then have very long/parallelized runs, is precisely one of the scariest aspects of automated ML; even more worrying that it's precisely what some people in the field are gunning for, and especially when they're directly contributing to the top auto ML scaffolds; although, everything else equal, it might be better to have an early warning sign, than to have a large inference overhang: https://x.com/zhengyaojiang/status/1844756071727923467
(cross-posted from https://x.com/BogdanIonutCir2/status/1844451247925100551, among others)
I'm concerned things might move much faster than most people expected, because of automated ML (figure from OpenAI's recent MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering, basically showing automated ML engineering performance scaling as more inference compute is used):
Figures 3 and 4 from MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering seem like some amount of evidence for this view:
I haven't thought about this much, but I do want to echo calls for more white-box work on control.
Agree, I had actually kind of made a similar call a while ago.
(still) speculative, but I think the pictures of Shard Theory, activation engineering and Simulators (and e.g. Bayesian interpretations of in-context learning) are looking increasingly similar: https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed?commentId=qX4k7y2vymcaR6eio
https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#SfPw5ijTDi6e3LabP
Here's some intuition of why I think this could be a pretty big deal, straight from Claude, when prompted with 'come up with an explanation/model, etc. that could unify the findings in these 2 papers' and fed Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition and Understanding and Controlling a Maze-Solving Policy Network:
Based on the findings from these two papers, I propose the following unified explanation/model for how goal representations and task superposition work in neural networks:
Key Unified Concepts:
- Distributed and Redundant Goal Representations: Both papers demonstrate that goals and tasks are represented in a distributed and redundant manner across multiple network components. The maze-solving policy has 11 channels that track the cheese location, while the language models exhibit task superposition across multiple network components.
- Context-Dependent Goal Activation: Goals and tasks are activated in a context-dependent manner. The maze policy pursues different goals based on maze features, while language models activate different tasks based on in-context examples.
- Continuous Goal/Task Spaces: Rather than discrete goals/tasks, networks seem to operate in continuous goal/task spaces. This is evidenced by the ability to interpolate between tasks in language models and steer the maze policy to arbitrary locations.
- Separable Goal/Task Vectors: Goals and tasks can be represented as separable vectors in activation space. This allows for manipulation through vector addition/subtraction to modify network behavior.
- Emergence of Goal/Task Representations: These goal and task representations emerge naturally from training, even when not explicitly trained for multi-task capabilities. Both papers show this emergent behavior in models trained on single objectives.
Unified Model:
I propose that neural networks develop an internal "goal/task space" during training. This space has the following properties:
- It is continuous and high-dimensional, allowing for representation of many possible goals/tasks.
- Different regions of this space correspond to different goals/tasks.
- The network's activations at certain layers map to points in this space, representing the current goal/task.
- Multiple redundant mappings exist between network activations and the goal/task space, providing robustness.
- The mapping from inputs to points in the goal/task space is context-dependent and learned during training.
- The goal/task point then influences subsequent network computations to pursue that goal/task.
- Vector operations in activation space correspond to movements in the goal/task space, allowing for controlled modification of network behavior.
This unified model explains the observed phenomena in both papers:
- Distributed and redundant goal representations
- Context-dependent goal/task selection
- Ability to interpolate and combine goals/tasks
- Steering of behavior through activation engineering
It also provides a framework for understanding how complex goal-directed and multi-task behaviors can emerge in neural networks trained on seemingly simpler objectives. This model could guide future research into understanding and controlling the goal/task representations in AI systems.
When further prompted with 'what about superposition and task vectors?':
You're right to point out superposition and task vectors - these are indeed key concepts that I should have emphasized more explicitly in the unified model. Let me expand on how these fit into and enhance the unified explanation:
Superposition:
- Task Superposition: Both papers demonstrate that neural networks can represent and execute multiple tasks or goals simultaneously. This is a core aspect of superposition.
- In language models, this manifests as the ability to perform multiple in-context learning tasks in a single forward pass.
- In the maze-solving policy, superposition is evident in the network's ability to pursue multiple context-dependent goals (e.g., navigating to cheese or to the top-right corner).
- Superposition allows networks to efficiently represent multiple tasks/goals within the same set of parameters, rather than requiring separate networks for each task.
Task Vectors:
- Task Vector Representation: Both papers show that tasks or goals can be represented as vectors in the network's activation space.
- In language models, task vectors are extracted from the difference in activations between task-specific and baseline contexts.
- For the maze-solving policy, "cheese vectors" and "top-right vectors" are computed similarly, representing the difference in activations with and without specific goals.
- These task vectors can be added or subtracted from network activations to modify behavior, demonstrating their role in representing and controlling task execution.
- The ability to combine task vectors (e.g., subtracting the cheese vector while adding the top-right vector) shows how networks can blend multiple task representations.
Enhanced Unified Model:
Incorporating these concepts more explicitly, we can enhance the unified model as follows:
- The internal "goal/task space" of the network supports superposition, allowing multiple goals/tasks to be represented and pursued simultaneously.
- Points in this space can be thought of as weighted combinations of task vectors, with the weights determining the relative importance of each task.
- The network's activations at certain layers encode these superposed task representations.
- Task vectors provide a mechanism for moving through the goal/task space, allowing for fine-grained control over network behavior.
- The superposition property allows the network to efficiently represent a vast range of possible task combinations within its fixed parameter space.
- Context-dependent inputs determine how these superposed task representations are constructed and weighted during inference.
- The network's subsequent computations operate on this superposed representation, effectively pursuing multiple goals/tasks in parallel.
This enhanced model better captures the key insights from both papers, explaining how networks can efficiently represent and execute multiple tasks, and how this capability can be analyzed and controlled through the lens of task vectors and superposition. It provides a powerful framework for understanding the emergent multi-task capabilities of neural networks and suggests methods for steering their behavior through targeted interventions in activation space.
Then, when fed Alex's shortform comment and prompted with 'how well would this fit this 'semi-formalization of shard theory'?':
This semi-formalization of shard theory aligns well with our unified model of goal representations and task superposition in neural networks. Let's analyze how it fits and where it might extend our understanding:
- Multiple Motivational Circuits: The idea of multiple "motivational circuits" or shards aligns perfectly with our observation of distributed and redundant goal representations. In the maze-solving policy, we identified multiple channels tracking the cheese location, which could be seen as components of a "cheese-seeking shard."
- Independent Activation: The ability of shards to activate independently matches our finding of context-dependent goal activation. Different maze features activate different goals, just as different contexts activate different shards.
- Compositional Representation: While the semi-formalization suggests compositional representation, it acknowledges that this is unrealistic due to superposition. This aligns with our model's emphasis on superposition and continuous goal/task spaces.
- Implementation-Level Definition: The focus on implementation rather than behavior fits well with our approach of examining internal network representations rather than just output behavior.
- Subjective Granularity: The point about the subjective nature of shard definitions resonates with our observation of redundant and distributed representations. We could choose to model the 11 cheese-tracking channels as separate shards or as components of a single "cheese-seeking shard."
- Steering Vectors: The connection between shard theory and steering vectors is particularly relevant. Our unified model uses task vectors, which are essentially steering vectors, to modify network behavior.
Extensions and Refinements:
- Motivational Circuits: We could refine the definition of "motivational circuits" using our understanding of goal-tracking channels and task vectors. A motivational circuit could be defined as a set of network components that, when activated, consistently bias the network's behavior towards a specific goal or task.
- Superposition: Our model explicitly incorporates superposition, addressing one of the limitations in the semi-formalization. We can represent shards as vectors in a continuous space, allowing for non-orthogonal combinations.
- Granularity and Hierarchy: The subjective nature of shard definitions could be formalized by considering hierarchical representations in the goal/task space. Lower-dimensional subspaces could represent more general shards, while higher-dimensional representations could capture finer distinctions.
- Activation Patterns: We could extend the model to consider not just the presence of shards, but their activation patterns across different contexts. This could help differentiate between systems with similar shards but different activation dynamics.
In conclusion, this semi-formalization of shard theory fits well with our unified model and provides a valuable perspective on understanding complex, multi-goal behaviors in neural networks. It offers a framework for thinking about how multiple motivations can coexist and interact within a single system, which is crucial for understanding and controlling advanced AI systems.
(I guess there might also be a meta-point here about augmented/automated safety research, though I was only using Claude for convenience. Notice that I never fed it my own comment and I only fed it Alex's at the end, after the 'unifying theory' had already been proposed. Also note that my speculation successfully predicted the task vector mechanism before the paper came out; and before the senior author's post/confirmation.)
Edit: and throwback to an even earlier speculation, with arguably at least some predictive power: https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector?commentId=wHeawXzPM3g9xSF8P.
Intuitively, AFAICT, this kind of capability doesn't seem that (strictly) necessary for almost anything we might want to do with powerful models. Do you have thoughts about the promise of using unlearning to try to remove it as much as possible (and about the tractability of designing relevant benchmarks/datasets)?
Maybe somewhat oversimplifying, but this might suggest non-trivial similarities to Simulators and having [the representations of] multiple tasks in superposition (e.g. during in-context learning). One potential operationalization/implementation mechanism (especially in the case of in-context learning) might be task vectors in superposition.
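To make this a bit more concrete, here's a rough, toy sketch of what 'task vectors in superposition' could look like mechanically (GPT-2 as a stand-in model; the prompts, layer choice, and coefficients are all arbitrary/hypothetical, and this is not the exact method of either paper):

```python
# Toy sketch: extract two "task vectors" as activation differences and inject their
# weighted sum ("superposition") into the residual stream at one layer.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary choice of residual-stream layer

def last_token_resid(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the last token of `prompt`, after block LAYER."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1]  # hidden_states[0] is the embedding layer

# Toy "task" prompts vs. neutral baselines; the differences act as crude task vectors.
v_antonym = last_token_resid("hot -> cold, big -> small, up ->") - last_token_resid("hot, big, up,")
v_french = last_token_resid("dog -> chien, cat -> chat, house ->") - last_token_resid("dog, cat, house,")

def make_hook(vec: torch.Tensor, coeff: float):
    def hook(module, inputs, output):
        hidden = output[0] + coeff * vec  # add the steering vector at every position
        return (hidden,) + output[1:]
    return hook

# Inject a weighted combination of both task vectors (two tasks "in superposition").
handle = model.transformer.h[LAYER].register_forward_hook(
    make_hook(0.5 * v_antonym + 0.5 * v_french, coeff=1.0)
)
with torch.no_grad():
    ids = model.generate(**tok("good ->", return_tensors="pt"), max_new_tokens=5)
print(tok.decode(ids[0]))
handle.remove()
```

If the steered completions shift toward a blend of the two toy tasks, that would be (weak) evidence for the task-vectors-in-superposition picture.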
On a related note, it might be interesting to try SAEs / other forms of disentanglement to see if there's actually something like superposition going on in the representations of the maze-solving policy. Something like 'not enough dimensions' + ambiguity in the reward specification sounds like it might be a pretty attractive explanation for the potential goal misgeneralization.
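For illustration, a minimal sparse autoencoder of the kind I have in mind (all shapes, hyperparameters, and the stand-in 'activations' below are placeholders; one would instead cache activations from the maze-solving policy):

```python
# Minimal sparse autoencoder sketch for probing whether goal features are superposed.
import torch
import torch.nn as nn

d_act, d_dict = 256, 2048          # activation dim, (overcomplete) dictionary size
acts = torch.randn(10_000, d_act)  # stand-in for cached policy-network activations

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_act, d_dict)
        self.dec = nn.Linear(d_dict, d_act)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder(d_act, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

for step in range(1_000):
    batch = acts[torch.randint(0, len(acts), (256,))]
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, inspect which dictionary features co-activate with cheese-seeking vs.
# top-right-seeking behaviour, to see whether the goal representation looks superposed.
```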
Edit 1: more early evidence.
Edit 2: the full preprint referenced in tweets above is now public.
on unsupervised learning/clustering in the activation space of multiple systems as a potential way to deal with proxy problems when searching for some concepts (e.g. embeddings of human values): https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default#8CngPZyjr5XydW4sC
Supervised/Reinforcement: Proxy Problems
Another plausible approach to deal with the proxy problems might be to do something like unsupervised clustering/learning on the representations of multiple systems that we'd have good reasons to believe encode the (same) relevant values - e.g. when exposed to the same stimulus (including potentially multiple humans and multiple AIs). E.g. for some relevant recent proof-of-concept works: Identifying Shared Decodable Concepts in the Human Brain Using Image-Language Foundation Models, Finding Shared Decodable Concepts and their Negations in the Brain, AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space, Rosetta Neurons: Mining the Common Units in a Model Zoo, Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences between Pretrained Generative Models, Quantifying stimulus-relevant representational drift using cross-modality contrastive learning. Automated interpretability (e.g. https://multimodal-interpretability.csail.mit.edu/maia/) could also be useful here. This might also work well with concepts like corrigibility/instruction following and arguments about the 'broad basin of attraction' and convergence for corrigibility.
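A very rough sketch of the kind of cross-system analysis I have in mind (random placeholder 'representations', with CCA as just one simple choice of shared projection; not the method of any of the cited works):

```python
# Sketch: cluster stimuli in a space shared across two systems' representations, to look
# for concepts encoded by both systems rather than idiosyncratic proxies of either one.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.cluster import KMeans

n_stimuli = 500
reps_system_a = np.random.randn(n_stimuli, 768)   # placeholder for system A's embeddings
reps_system_b = np.random.randn(n_stimuli, 1024)  # placeholder for system B's embeddings

# Project both systems into a shared low-dimensional space.
cca = CCA(n_components=20)
shared_a, shared_b = cca.fit_transform(reps_system_a, reps_system_b)

# Cluster stimuli in the shared space; clusters that are coherent across both projections
# are candidates for concepts (e.g. values) represented by both systems.
shared = np.concatenate([shared_a, shared_b], axis=1)
labels = KMeans(n_clusters=10, n_init=10).fit_predict(shared)
print(np.bincount(labels))
```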
(cross-posted from https://x.com/BogdanIonutCir2/status/1842914635865030970)
maybe there is some politically feasible path to ML training/inference killswitches in the near-term after all: https://www.datacenterdynamics.com/en/news/asml-adds-remote-kill-switch-to-tsmcs-euv-machines-in-case-china-invades-taiwan-report/
I can then share my attempts with someone at Anthropic.
Alternatively, collaborating/sharing with e.g. METR or UK AISI auto ML evals teams might be interesting. Maybe even Palisade or similar orgs, from a 'scary demo' perspective? @jacquesthibs might also be interested. I might also get to work on this or something related, depending on how some applications go.
I also expect Sakana, Jeff Clune's group and some parts of the open-source ML community will try to push this, but I'm more uncertain at least in some of these cases about the various differential acceleration tradeoffs.
But states seem quite likely to fall under 6e, no?
What did you mean exactly in 2016 by the scaling hypothesis?
Something like 'we could have AGI just by scaling up deep learning / deep RL, without any need for major algorithmic breakthroughs'.
Having past-2035 timelines and believing in the pure scaling maximalist hypothesis (which fwiw I don't believe in, for reasons I have explained elsewhere) are in direct conflict, so I'd be curious if you could more exactly detail your beliefs back then.
I'm not sure this is strictly true, though I agree with the 'vibe'. I think there were probably a couple of things in play:
- I still only had something like 20% on scaling, and I expected much more compute would likely be needed, especially in that scenario, but also more broadly (e.g. maybe something like the median in 'bioanchors' - 35 OOMs of pretraining-equivalent compute, if I don't misremember; though I definitely hadn't thought very explicitly about how many OOMs of compute at that time) - so I thought it would probably take decades to get to the required amount of compute.
- I very likely hadn't thought hard and long enough to necessarily integrate/make coherent my various beliefs.
- Probably at least partly because there seemed to be a lot of social pressure from academic peers against even something like '20% on scaling', and even against taking AGI and AGI safety seriously at all. This likely made it harder to 'viscerally feel' what some of my beliefs might imply, and especially that it might happen very soon (which also had consequences in delaying when I'd go full-time into working on AI safety; along with thinking I'd have more time to prepare for it, before going all in).
Fwiw, in 2016 I would have put something like 20% probability on what became known as 'the scaling hypothesis'. I still had past-2035 median timelines, though.
Thanks a lot for posting these!
I wrote some agenda-setting and brainstorming docs (#10 and #11 in this list) which people are welcome to read and comment on if interested.
Unknowingly, I happen to have worked on some very related topics during the Astra Fellowship winter '24 with @evhub (and also later). Most of it is still unpublished, but this is the doc (draft) of the short presentation I gave at the end; and I mention some other parts in this comment.
(I'm probably biased and partial but) I think the rough plan laid out in #10 and #11 in the list is among the best and most tractable I've ever seen. I really like the 'core system - amplified system' framework and have had some related thoughts during Astra (comment; draft). I also think there's been really encouraging recent progress on using trusted systems (in Redwood's control framework terminology; often by differentially 'turning up' the capabilities on the amplification part of the system [vs. those of the core system]) to (safely) push forward automated safety work on the core system; e.g. A Multimodal Automated Interpretability Agent. And I could see some kind of safety case framework where, as we gain confidence in the control/alignment of the amplified system and as the capabilities of the systems increase, we move towards increasingly automating the safety research applied to the (increasingly 'interior' parts of the) core system. [Generalized] Inference scaling laws also seem pretty good w.r.t. this kind of plan, though worrying in other ways.
'Krenn thinks that o1 will accelerate science by helping to scan the literature, seeing what’s missing and suggesting interesting avenues for future research. He has had success looping o1 into a tool that he co-developed that does this, called SciMuse. “It creates much more interesting ideas than GPT-4 or GPT-4o,” he says.' (source; related: current underelicitation of auto ML capabilities)
I might go so far as to say it Pareto dominates most people’s agendas on importance and tractability. While being pretty neglected.
(Others had already written on these/related topics previously, e.g. https://www.lesswrong.com/posts/r3xwHzMmMf25peeHE/the-translucent-thoughts-hypotheses-and-their-implications, https://www.lesswrong.com/posts/FRRb6Gqem8k69ocbi/externalized-reasoning-oversight-a-research-direction-for, https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model#Accelerating_LM_agents_seems_neutral__or_maybe_positive_, https://www.lesswrong.com/posts/dcoxvEhAfYcov2LA6/agentized-llms-will-change-the-alignment-landscape, https://www.lesswrong.com/posts/ogHr8SvGqg9pW5wsT/capabilities-and-alignment-of-llm-cognitive-architectures, https://intelligence.org/visible/.)
I've made some attempts over the past year to (at least somewhat) raise the profile of this kind of approach and its potential, especially when applied to automated safety research (often in conversation with / prompted by Ryan Greenblatt; and a lot of it stemming from my winter '24 Astra Fellowship with @evhub):
I might write a high-level post at some point (which should hopefully help with the visibility of this kind of agenda more than the separate comments on various other [not-necessarily-that-related] posts).
(crossposted from https://x.com/BogdanIonutCir2/status/1840775662094713299)
I really wish we'd have some automated safety research prizes similar to https://aimoprize.com/updates/2024-09-23-second-progress-prize. Some care would have to be taken to not advance capabilities [differentially], but at least some targeted areas seem quite robustly good, e.g. https://multimodal-interpretability.csail.mit.edu/maia/.
This work seems potentially relevant: Interpreting Learned Feedback Patterns in Large Language Models.
It might be interesting to develop/put out RFPs for some benchmarks/datasets for unlearning of ML/AI knowledge (and maybe also ARA-relevant knowledge), analogously to WMDP for CBRN. This might be somewhat useful e.g. in a case where we might want to use powerful (e.g. human-level) AIs for cybersecurity, but we don't fully trust them.
From Understanding and steering Llama 3:
A further interesting direction for automated interpretability would be to build interpreter agents: AI scientists which given an SAE feature could create hypotheses about what the feature might do, come up with experiments that would distinguish between those hypotheses (for instance new inputs or feature ablations), and then repeat until the feature is well-understood. This kind of agent might be the first automated alignment researcher. Our early experiments in this direction have shown that we can substantially increase automated interpretability performance with an iterative refinement step, and we expect to be able to push this approach much further.
Seems like it would be great if inference scaling laws worked for this particular application.
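For concreteness, a rough sketch of the loop described in the quote above, where `query_llm` and `run_feature_experiment` are hypothetical stubs standing in for an LLM API call and for actually intervening on the SAE feature in the model under study:

```python
# Sketch of an "interpreter agent" loop: hypothesize, experiment, refine until confident.
from typing import Callable

def interpret_feature(feature_id: int,
                      query_llm: Callable[[str], str],
                      run_feature_experiment: Callable[[int, str], str],
                      max_rounds: int = 5) -> str:
    hypothesis = query_llm(f"Propose a hypothesis for what SAE feature {feature_id} encodes.")
    for _ in range(max_rounds):
        experiment = query_llm(
            f"Hypothesis: {hypothesis}\n"
            "Design one input or ablation experiment that would distinguish this hypothesis "
            "from plausible alternatives."
        )
        result = run_feature_experiment(feature_id, experiment)
        verdict = query_llm(
            f"Hypothesis: {hypothesis}\nExperiment: {experiment}\nResult: {result}\n"
            "Is the hypothesis well-supported? Answer 'yes' or propose a revised hypothesis."
        )
        if verdict.strip().lower() == "yes":
            break
        hypothesis = verdict  # the iterative refinement step
    return hypothesis
```

Inference scaling would then amount to spending more rounds (or more parallel hypotheses) per feature, which is why better automatic proxies for "well-understood" matter so much here.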
Success at currently-researched generalized inference scaling laws might risk jeopardizing some of the fundamental assumptions of current regulatory frameworks.
- the o1 results have illustrated specialized inference scaling laws, for model capabilities in some specialized domains (e.g. math); notably, these don't seem to generally hold for all domains - e.g. o1 doesn't seem better than GPT-4o at writing;
- there's ongoing work at OpenAI to make generalized inference scaling work;
- e.g. this could perhaps (though maybe somewhat overambitiously) be framed, in the language of https://epochai.org/blog/trading-off-compute-in-training-and-inference, as there no longer being an upper bound in how many OOMs of inference compute can be traded for equivalent OOMs of pretraining;
- to the best of my awareness, current regulatory frameworks/proposals (e.g. the EU AI Act, the Executive Order, SB 1047) frame the capabilities of models in terms of (pre)training compute and maybe fine-tuning compute (e.g. if (pre)training FLOP > 1e26, the developer needs to take various measures), without any similar requirements framed in terms of inference compute; so current regulatory frameworks seem unprepared for translating between indefinite amounts of inference compute and pretraining compute (capabilities);
- this would probably be less worrying in the case of models behind APIs, since it should be feasible to monitor for misuse (e.g. CBRN), especially since current inference scaling methods (e.g. CoT) seem relatively transparent; but this seems different for open-weights or leaked models;
- notably, it seems like Llama-3 405b only requires 8 good GPUs to run on one's own hardware (bypassing any API), apparently costing <100k$; the relatively low costs could empower a lot of bad actors, especially since guardrails are easy to remove and potentially more robust methods like unlearning still seem mostly in the research phase;
- this would also have consequences for (red-teaming) evals; if running a model for longer potentially leads to more (misuse) uplift, it becomes much harder to upper-bound said uplift;
- also, once a model's weights are out, it might be hard to upper-bound all the potential inference uplift that future fine-tunes might provide; e.g. suppose it became possible to fine-tune 4o into something even better than o1, capable of almost boundless inference scaling; somebody could then apply the same fine-tuning to Llama-3 405b and release that model open-weights, and others might then be able to misuse it in very dangerous ways using a lot of inference compute (but still comparatively little vs. pretraining compute);
- OTOH, I do buy the claims from https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/ that until now open-weights have probably been net beneficial for safety, and I expect this could easily keep holding in the near-term (especially as long as open-weights models stay somewhat behind the SOTA); so I'm conflicted about what should be done, especially given the uncertainty around the probability of making generalized inference scaling work and over how much more dangerous this could make current or future open-weights releases; one potentially robust course of action might be to keep doing research on unlearning methods, especially ones robust to fine-tuning, like in Tamper-Resistant Safeguards for Open-Weight LLMs, which might be sufficient to make misuse (especially in some specialized domains, e.g. CBRN) as costly as needing to train from scratch
Algorithmic progress can use 1. high iteration speed 2. well-defined success metrics (scaling laws) 3. broad knowledge of the whole stack (CUDA to optimization theory to test-time scaffolds) 4. ...
Alignment broadly construed is less engineering and a lot more blue skies, long horizon, and under-defined (obviously for engineering heavy alignment sub-tasks like jailbreak resistance, and some interp work this isn't true).
Generally agree, but I do think prosaic alignment has quite a few advantages vs. prosaic capabilities (e.g. in the extra slides here) and this could be enough to result in aligned (-enough) automated safety researchers which can be applied to the more blue skies parts of safety research. I would also very much prefer something like a coordinated pause around the time when safety research gets automated.
This line of reasoning is part of why I think it's valuable for any alignment researcher (who can) to focus on bringing the under-defined into a well-defined framework. Shovel-ready tasks will be shoveled much faster by AI shortly anyway.
Agree, I've written about (something related to) this very recently.
This relies on an assumption that you can make up for lack-of-intelligence by numbers or speed. Without that assumption, you could expect that AI research will be dominated by humans until AIs finally “get it”, after which they’ll take over with a huge margin.
I interpret the research program described here as aiming to make this assumption true.
I don’t think that at present conditions are good for automating R&D.
I kind of wish this was true, because it would likely mean longer timelines, but my expectation is that the incoming larger LMs + better scaffolds + more inference-time compute could quite easily pass the threshold of significant algorithmic progress speedup from automation (e.g. 2x).
Agree with a lot of this, but scaffolds still seem to me pretty good, for reasons largely similar to those in https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model#Accelerating_LM_agents_seems_neutral__or_maybe_positive_.
Some quick thoughts:
- automating research that 'tackles the hard bits' seems to probably be harder (happen chronologically later) and like it might be more bottlenecked by good (e.g. agent foundations) human researcher feedback - which does suggest it might be valuable to recruit more agent foundations researchers today
- but my impression is that recruiting/forming agent foundations researchers itself is much harder and has much worse feedback loops
- I expect if the feedback loops are good enough, the automated safety researchers could make enough prosaic safety progress, feeding into agendas like control and/or conditioning predictive models (+ extensions to e.g. steering using model internals, + automated reviewing, etc.), to allow the use of automated agent foundations researchers with more confidence and less human feedback
- the hard bits might also become much less hard - or prove to be irrelevant - if for at least some of them empirical feedback could be obtained by practicing on much more powerful models
Overall, my impression is that focusing on the feedback loops seems probably significantly more scalable today, and the results would probably be enough to allow mostly deferring the hard bits to automated research.
(Edit: but also, I think if funders and field builders 'went hard', it might be much less necessary to choose.)
Quick take: on the margin, a lot more research should probably be going into trying to produce benchmarks/datasets/evaluation methods for safety (than more directly doing object-level safety research).
Some past examples I find valuable - in the case of unlearning: WMDP, Eight Methods to Evaluate Robust Unlearning in LLMs; in the case of mech interp - various proxies for SAE performance, e.g. from Scaling and evaluating sparse autoencoders, as well as various benchmarks, e.g. FIND: A Function Description Benchmark for Evaluating Interpretability Methods. Prizes and RFPs seem like a potentially scalable way to do this - e.g. https://www.mlsafety.org/safebench - and I think they could be particularly useful on short timelines.
Including differentially vs. doing the full stack of AI safety work - because I expect a lot of such work could be done by automated safety researchers soon, with the right benchmarks/datasets/evaluation methods. This could also make it much easier to evaluate the work of automated safety researchers and to potentially reduce the need for human feedback, which could be a significant bottleneck.
Better proxies could also make it easier to productively deploy more inference compute - e.g. from Large Language Monkeys: Scaling Inference Compute with Repeated Sampling: 'When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.' Similar findings in other references, e.g. in Trading Off Compute in Training and Inference.
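As a toy illustration (not from the paper) of that coverage-vs-selection gap, where wrong samples cluster on a common mistake so majority voting plateaus while coverage keeps climbing:

```python
# Toy simulation: with enough samples, *some* sample is correct far more often than
# majority voting manages to pick it out.
import random
from collections import Counter

def fake_model_answer(correct: str, p_correct: float = 0.05) -> str:
    # Stand-in for sampling a solution; wrong answers cluster on a common mistake.
    return correct if random.random() < p_correct else random.choice(["wrong_A", "wrong_A", "wrong_B"])

def evaluate(n_samples: int, n_problems: int = 1_000):
    coverage = majority_acc = 0
    for _ in range(n_problems):
        samples = [fake_model_answer("correct") for _ in range(n_samples)]
        coverage += "correct" in samples  # an oracle verifier would find it
        majority_acc += Counter(samples).most_common(1)[0][0] == "correct"
    return coverage / n_problems, majority_acc / n_problems

for n in (1, 10, 100, 1_000):
    cov, maj = evaluate(n)
    print(f"{n:>5} samples: coverage={cov:.2f}, majority-vote accuracy={maj:.2f}")
```

Better proxy metrics would close exactly this gap: they would let you cash in the extra coverage that repeated sampling buys.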
Of course, there's also the option of using automated/augmented safety research to produce such benchmarks, but it seems potentially particularly useful to have them ahead of time.
This also seems like an area where we can't piggyback off of work from capabilities researchers, like will probably be significantly the case for applying automated researchers (to safety).
It does not produce anything close to fully formed scientific papers. Its output is really not better than just prompting o1 yourself. Of course, o1 and even Sonnet and GPT-4 are very impressive, but there is no update to be made after you've played around with that.
(Again) I think this is missing the point that we've now (for the first time, to my knowledge) observed an early demo of the full research workflow being automatable, as flawed as the outputs might be.
(This also has implications for automating AI safety research).
To spell it out more explicitly, the current way of scaling inference (CoT) seems pretty good vs. some of the most worrying threat models, which often depend on opaque model internals.
A fairer comparison would probably be to actually try hard at building the kind of scaffold which could use ~10k$ in inference costs productively. I suspect the resulting agent would probably not do much better than with 100$ of inference, but it seems hard to be confident.
Also, there are notable researchers and companies working on developing 'a truly general way of scaling inference compute' right now and I think it would be cautious to consider what happens if they succeed.
(This also has implications for automating AI safety research).
I suspect current approaches probably significantly or even drastically under-elicit automated ML research capabilities.
I'd guess the average cost of producing a decent ML paper is at least 10k$ (in the West, at least) and probably closer to 100k's $.
In contrast, Sakana's AI Scientist cost on average 15$/paper and 0.50$/review. PaperQA2, which claims superhuman performance at some scientific Q&A and lit review tasks, costs something like 4$/query. Other papers with claims of human-range performance on ideation or reviewing also probably have costs of <10$/idea or review.
Even the auto ML R&D benchmarks from METR or UK AISI don't give me at all the vibes of coming anywhere near close enough to e.g. what a 100-person team at OpenAI could accomplish in 1 year, if they tried really hard to automate ML.
A fairer comparison would probably be to actually try hard at building the kind of scaffold which could use ~10k$ in inference costs productively. I suspect the resulting agent would probably not do much better than with 100$ of inference, but it seems hard to be confident. And it seems harder still to be confident about what will happen even in just 3 years' time, given that pretraining compute seems like it will probably grow about 10x/year and that there might be stronger pushes towards automated ML.
This seems pretty bad both w.r.t. underestimating the probability of shorter timelines and faster takeoffs, and in more specific ways too. E.g. we could be underestimating by a lot the risks of open-weights Llama-3 (or 4 soon) given all the potential under-elicitation.
'ChatGPT4 generates social psychology hypotheses that are rated as original as those proposed by human experts' https://x.com/BogdanIonutCir2/status/1836720153444139154
I think this one (and perhaps better operationalizations) should probably have many eyes on it:
And more explicitly, from GDM's Frontier Safety Framework, pages 5-6:
Related:
Depending on how far inference scaling laws go, the situation might be worse still. Picture Llama-4-o1 scaffolds that anyone can run for indefinite amounts of time (as long as they have the money/compute) to autonomously do ML research on various ways to improve Llama-4-o1 and its open-weights descendants, with those improvements potentially again applicable to autonomous ML research. Fortunately, lack of access to enough compute for pretraining the next-gen model is probably a barrier for most actors, but this still seems like a pretty (and increasingly, with every open-weights improvement) scary situation to be in.
A list of rough ideas of things I find potentially promising to do/fund:
- RFP(s) for the control agenda (e.g. for the long list here: https://lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled#Appendix__A_long_list_of_control_techniques)
- Scale up and decentralize some of the grantmaking, e.g. through regrantors
- RFP(s) for potentially very scalable agendas, e.g. applying/integrating automated research to various safety research agendas
- RFPs / direct funding for coming up with concrete plans for what to do (including how to deploy funding) in very short timelines (e.g. couple of years); e.g. like https://sleepinyourhat.github.io/checklist/, but ideally made even more concrete (to the degree it’s not info-hazardous)
- Offer more funding to help scale up MATS, Astra, etc., both in terms of number of mentors and of mentees per mentor (if mentors are up for it); RFPs for building more similar programs
- RFPs for entrepreneurship/building scalable orgs, and potentially even incubators for building such orgs, e.g. https://catalyze-impact.org/apply
- Offer independent funding, but no promise of mentorship to promising-enough candidates coming out of field-building pipelines (e.g. MATS, Astra, AGISF, ML4Good, ARENA, AI safety camp, Athena).
And for now, it's fortunate that inference scaling means CoT and similarly differentially transparent (vs. model internals) intermediate outputs, which probably makes it a safer way of eliciting capabilities.
Also, I think it's much harder to pause/slow down capabilities than to accelerate safety, so I think more of the community focus should go to the latter.
Related, I expect the delay from needing to build extra infrastructure to train much larger LLMs will probably differentially affect capabilities progress more, by acting somewhat like a pause/series of pauses; which it would be nice to exploit by enrolling more safety people to try to maximally elicit newly deployed capabilities - and potentially be uplifted - for safety research. (Anecdotally, I suspect I'm already being uplifted at least a bit as a safety researcher by using Sonnet.)
I suspect (though with high uncertainty) there are some factors differentially advantaging safety research, especially prosaic safety research (often in the shape of fine-tuning/steering/post-training), which often requires OOMs less compute and more broadly seems like it should have faster iteration cycles; and likely even 'human task time' (e.g. see figures in https://metr.github.io/autonomy-evals-guide/openai-o1-preview-report/) probably being shorter for a lot of prosaic safety research, which makes it likely it will on average be automatable sooner. The existence of cheap, accurate proxy metrics might point the opposite way, though in some cases there seem to exist good, diverse proxies - e.g. Eight Methods to Evaluate Robust Unlearning in LLMs. More discussion in the extra slides of this presentation.
A potential alternative to this may be the use of machine unlearning so as to selectively unlearn data or capabilities.
Any thoughts on how useful existing benchmarks like WMDP-cyber would be; or, alternatively, on how difficult it would be to develop a similar benchmark, but tailored more specifically to secret collusion?
I think it's quite likely we're already in crunch time (e.g. in a couple years' time we'll see automated ML accelerating ML R&D algorithmic progress 2x) and AI safety funding is *severely* underdeployed. We could soon see many situations where automated/augmented AI safety R&D is bottlenecked by lack of human safety researchers in the loop. Also, relying only on the big labs for automated safety plans seems like a bad idea, since the headcount of their safety teams seems to grow slowly (and I suspect much slower than the numbers of safety researchers outside the labs). Related: https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/.