Here's a somewhat wild idea to have a 'canary in a coalmine' when it comes to steganography and non-human (linguistic) representations: monitor for very sharp drops in BrainScores (linear correlations between LM activations and brain measurements, on the same inputs) - e.g. like those calculated in Scaling laws for language encoding models in fMRI. (Ideally using larger, more diverse, higher-resolution brain data.)
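A minimal sketch of what such a monitor could look like, assuming a BrainScore is computed as the mean held-out Pearson correlation of a ridge-regression encoding model (LM activations to brain measurements), and assuming scores are tracked across checkpoints. The function names, the ridge penalty, the train/test split and the 50% relative-drop threshold are all illustrative assumptions, not taken from the cited paper.

```python
import numpy as np

def brain_score(activations, brain, alpha=1.0, train_frac=0.8):
    """Mean held-out Pearson r of a ridge encoding model (hypothetical helper).

    activations: (n_samples, n_features) LM activations on some inputs
    brain:       (n_samples, n_voxels) brain measurements on the same inputs
    """
    split = int(activations.shape[0] * train_frac)
    X_tr, X_te = activations[:split], activations[split:]
    Y_tr, Y_te = brain[:split], brain[split:]
    # Closed-form ridge regression: W = (X^T X + alpha I)^-1 X^T Y
    d = X_tr.shape[1]
    W = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(d), X_tr.T @ Y_tr)
    pred = X_te @ W
    # Pearson correlation per voxel between predicted and measured responses
    pc = pred - pred.mean(axis=0)
    yc = Y_te - Y_te.mean(axis=0)
    denom = np.linalg.norm(pc, axis=0) * np.linalg.norm(yc, axis=0)
    r = (pc * yc).sum(axis=0) / np.where(denom == 0, 1.0, denom)
    return float(r.mean())

def sharp_drops(scores, rel_threshold=0.5):
    """Indices of checkpoints whose score fell by more than rel_threshold
    relative to the previous checkpoint (threshold is an assumption)."""
    flagged = []
    for i in range(1, len(scores)):
        prev = scores[i - 1]
        if prev > 0 and (prev - scores[i]) / prev > rel_threshold:
            flagged.append(i)
    return flagged
```

In use, one would call `brain_score` once per checkpoint (or per fine-tuning stage) on a fixed stimulus set and pass the resulting list to `sharp_drops`; what counts as "very sharp" would need to be calibrated empirically.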
Mostly the same, perhaps a minor positive update on the technical side (basically, from systems getting somewhat stronger - so e.g. closer to automating AI safety research - while still not showing very dangerous capabilities, like ASL-3, prerequisites to scheming, etc.). My views are even more uncertain / unstable on the governance side though, which probably makes my overall p(doom) (including e.g. stable totalitarianism, s-risks, etc.) more like 20% than 5% (I was probably mostly intuitively thinking of extinction risk only when giving the 5% figure a year ago; overall my median probably hasn't changed much, but I have more variance, coming from the governance side).
Proposal part 3: Use Paraphraser: One more complication. The outputs of the Shoggoth? Paraphrase them. For example, after each line of reasoning CoT is complete, pause the Shoggoth and have a different specialized paraphraser model generate 10 syntactically distinct but semantically equivalent ways of saying the same thing. Then pick one at random and replace the original line of reasoning with it. Then boot up the Shoggoth again and have it continue the CoT.
Any thoughts on how much of a (computational) alignment tax this would impose? Related, thoughts on how big of a worry scheming/opaque cognitions in the paraphrasers would be? (e.g. one intuition here might be that the paraphrasers might be 'trusted' in control terminology - incapable of scheming because too weak; in this case the computational alignment tax might also be relatively low, if the paraphrasers are much smaller than the Face and the Shoggoth).
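To make the tax question concrete, here is a rough sketch of the quoted paraphrasing loop. `next_line` (the Shoggoth producing one reasoning line from the context so far) and `paraphrase` (the paraphraser returning n candidate rewrites) are hypothetical callables standing in for real model calls, and the empty-string stop signal is an assumption for illustration only.

```python
import random
from typing import Callable, List, Optional

def paraphrased_cot(
    next_line: Callable[[List[str]], str],
    paraphrase: Callable[[str, int], List[str]],
    max_lines: int,
    n_paraphrases: int = 10,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Run a CoT where each line is replaced by a random paraphrase of itself."""
    rng = rng or random.Random()
    context: List[str] = []
    for _ in range(max_lines):
        # The Shoggoth only ever sees the paraphrased history.
        line = next_line(context)
        if not line:  # assumed stop signal: empty string ends the CoT
            break
        # Generate semantically equivalent rewrites and keep one at random.
        candidates = paraphrase(line, n_paraphrases)
        context.append(rng.choice(candidates))
    return context
```

Under this sketch, each reasoning line costs one Shoggoth generation plus `n_paraphrases` paraphraser generations of similar length. So if the paraphraser were, say, k times cheaper per token than the Shoggoth, the overhead would be roughly `n_paraphrases / k` times the base CoT cost, ignoring any re-encoding of the context.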
'China hawk and influential Trump AI advisor Jacob Helberg asserted to Reuters that “China is racing towards AGI," but I couldn't find any evidence in the report to support that claim.' https://x.com/GarrisonLovely/status/1859022323799699474
AFAICT, there seems to be quite heavy overlap between the proposal and Daniel's motivation for it and safety case (sketch) #3 in https://alignment.anthropic.com/2024/safety-cases/.
'The report doesn't go into specifics but the idea seems to be to build / commandeer the computing resources to scale to AGI, which could include compelling the private labs to contribute talent and techniques.
DX rating is the highest priority DoD procurement standard. It lets DoD compel companies, set their own price, skip the line, and do basically anything else they need to acquire the good in question.' https://x.com/hamandcheese/status/1858902373969564047
(screenshot in post from PDF page 39 of https://www.uscc.gov/sites/default/files/2024-11/2024_Annual_Report_to_Congress.pdf)
'🚨 The annual report of the US-China Economic and Security Review Commission is now live. 🚨
Its top recommendation is for Congress and the DoD to fund a Manhattan Project-like program to race to AGI.
Buckle up...'
And the space of interventions will likely also include using/manipulating model internals, e.g. https://transluce.org/observability-interface, especially since (some kinds of) automated interpretability seem cheap and scalable, e.g. https://transluce.org/neuron-descriptions estimated a cost of < 5 cents / labeled neuron. LM agents have also previously been shown able to do interpretability experiments and propose hypotheses: https://multimodal-interpretability.csail.mit.edu/maia/, and this could likely be integrated with the above. The auto-interp explanations also seem roughly human-level in the references above.
As well as (along with in-context mechanisms like prompting) potentially model internals mechanisms to modulate how much the model uses in-context vs. in-weights knowledge, like in e.g. Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models. This might also work well with potential future advances in unlearning, e.g. of various facts, as discussed in The case for unlearning that removes information from LLM weights.
Any thoughts on potential connections with task arithmetic? (later edit: in addition to footnote 2)
Would the prediction also apply to inference scaling (laws) - and maybe more broadly various forms of scaling post-training, or only to pretraining scaling?
Epistemic status: at least somewhat rant-mode.
I find it pretty ironic that many in AI risk mitigation would make asks for if-then commitments/RSPs from the top AI capabilities labs, but they won't make the same asks for AI safety orgs/funders. E.g.: if you're an AI safety funder, what kind of evidence ('if') will make you accelerate how much funding you deploy per year ('then')?
A few additional relevant recent papers: Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models, Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures.
Similarly, the argument in this post and e.g. in Robust agents learn causal world models seem to me to suggest that we should probably also expect something like universal (approximate) circuits, which it might be feasible to automate the discovery of using perhaps a similar procedure to the one demo-ed in Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.
Later edit: And I expect unsupervised clustering/learning could help in a similar fashion to the argument in the parent comment (applied to features), when applied to the feature circuits(/graphs).
So, to the extent that the chain-of-thought helps produce a better answer in the end, we can conclude that this is "basically" improved due to the actual semantic reasoning which the chain-of-thought apparently implements.
I like the intuition behind this argument, which I don't remember seeing spelled out anywhere else before.
I wonder how much hope one should derive from the fact that, intuitively, RL seems like it should be relatively slow at building new capabilities from scratch / significantly changing model internals, so there might be some way to buy some safety from also monitoring internals (both for dangerous capabilities already existent after pretraining, and for potentially new ones [slowly] built through RL fine-tuning). Related passage with an at least somewhat similar intuition, from https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit#How_to_Catch_an_LLM_in_the_Act_of_Repurposing_Deceitful_Behaviors (the post also discusses how one might go about monitoring for dangerous capabilities already existent after pretraining):
If you are concerned about the possibility the model might occasionally instead reinvent deceit from scratch, rather than reusing the copy already available (which would have to be by chance rather than design, since it can't be deceitful before acquiring deceit), then the obvious approach would be to attempt to devise a second Interpretability (or other) process to watch for that during training, and use it alongside the one outlined here. Fortunately that reinvention is going to be a slower process, since it has to do more work, and it should at first be imperfect, so it ought to be easier to catch in the act before the model gets really good at deceit.
Anyhow I totally agree on the urgency, tractability, and importance of faithful CoT research. I think that if we can do enough of that research fast enough, we'll be able to 'hold the line' at stage 2 for some time, possibly long enough to reach AGI.
Do you have thoughts on how much it helps that autointerp seems to now be roughly human-level on some metrics and can be applied cheaply (e.g. https://transluce.org/neuron-descriptions), so perhaps we might have another 'defensive line' even past stage 2 (e.g. in the case of https://transluce.org/neuron-descriptions, corresponding to the level of 'granularity' of autointerp applied to the activations of all the MLP neurons inside an LLM)?
Later edit: Depending on research progress (especially w.r.t. cost effectiveness), other levels of 'granularity' might also become available (fully automated) soon for monitoring, e.g. sparse (SAE) feature circuits (of various dangerous/undesirable capabilities), as demo-ed in Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.
News about an apparent shift in focus to inference scaling laws at the top labs: https://www.reuters.com/technology/artificial-intelligence/openai-rivals-seek-new-path-smarter-ai-current-methods-hit-limitations-2024-11-11/
- Power Concentration Risk
- This involves AI systems giving already-powerful actors dramatically more power over others
- Examples could include:
- Government leaders using AI to stage a self-coup then install a permanent totalitarian regime, using AI to maintain a regime with currently impossible levels of surveillance.
- AI company CEOs using advanced AI systems to become world dictator.
- The key risk here is particular already-powerful people getting potentially unassailable advantages
Maybe somewhat of a tangent, but I think this might be a much more legible/better reason to ask for international coordination than the more speculative-seeming (and sometimes, honestly, wildly overconfident IMO) arguments about the x-risks coming from the difficulty of (technically) aligning superintelligence.
Sam Altman says AGI is coming in 2025 (and he is also expecting a child next year) https://x.com/tsarnick/status/1854988648745517297
Something like a safety case for automated safety research (but I'm biased)
Summary threads of two recent papers which seem like significant evidence in favor of the Simulators view of LLMs (especially after just pretraining): https://x.com/aryaman2020/status/1852027909709382065 https://x.com/DimitrisPapail/status/1844463075442950229
I think the portfolio of safety effort should include some BRT, but the returns to effort are such that other AI control techniques should have three times as much effort put into them.
Would this factor (and maybe even the conclusions of the whole post) change with (roughly) human-level automated (including multi-turn) red-teaming, as e.g. claimed here: https://blog.haizelabs.com/posts/cascade/?
Here's a potentially relevant reference, though I've only skimmed - Encoding innate ability through a genomic bottleneck:
Significance
Our manuscript formulates and provides a solution to a central problem in computing with neural circuits: How can a complex neural circuit, with trillions of individual connections, arise from a comparatively simple genome? What makes this problem challenging is the largely overlooked fact that these circuits, at or soon after birth and with minimal learning, are able to specify a tremendously rich repertoire of innate behaviors. The fact that animals are endowed with such sophisticated and diverse innate behaviors is obvious to anyone who has seen a spider spin a web. We formulate the question in terms of artificial networks, which allows us a rigorous and quantitative framework for assessing our ideas.
Abstract
Animals are born with extensive innate behavioral capabilities, which arise from neural circuits encoded in the genome. However, the information capacity of the genome is orders of magnitude smaller than that needed to specify the connectivity of an arbitrary brain circuit, indicating that the rules encoding circuit formation must fit through a “genomic bottleneck” as they pass from one generation to the next. Here, we formulate the problem of innate behavioral capacity in the context of artificial neural networks in terms of lossy compression of the weight matrix. We find that several standard network architectures can be compressed by several orders of magnitude, yielding pretraining performance that can approach that of the fully trained network. Interestingly, for complex but not for simple test problems, the genomic bottleneck algorithm also captures essential features of the circuit, leading to enhanced transfer learning to novel tasks and datasets. Our results suggest that compressing a neural circuit through the genomic bottleneck serves as a regularizer, enabling evolution to select simple circuits that can be readily adapted to important real-world tasks. The genomic bottleneck also suggests how innate priors can complement conventional approaches to learning in designing algorithms for AI.
Claim of roughly human-level automated multi-turn red-teaming: https://blog.haizelabs.com/posts/cascade/
It is pretty plausible to me that AI control is quite easy, because you actually can remove affordances from the AIs that are doing AI R&D such that it’s hard for them to cause problems.
Here's one (somewhat handwavy) reason for optimism w.r.t. automated AI safety research: most safety research has probably come from outside the big labs (see e.g. https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/) and thus has likely mostly used significantly sub-SOTA models. It seems quite plausible then that we could have the vast majority of (controlled) automated AI safety research done on much smaller and less dangerous (e.g. trusted) models only, without this leading to intolerably-large losses in productivity; and perhaps have humans only/strongly in the loop when applying the results of that research to SOTA, potentially untrusted, models.
from https://jack-clark.net/2024/08/18/import-ai-383-automated-ai-scientists-cyborg-jellyfish-what-it-takes-to-run-a-cluster/…, commenting on https://arxiv.org/abs/2408.06292: 'Why this matters – the taste of automated science: This paper gives us a taste of a future where powerful AI systems propose their own ideas, use tools to do scientific experiments, and generate results. At this stage, what we have here is basically a ‘toy example’ with papers of dubious quality and insights of dubious import. But you know where we were with language models five years ago? We had things that could barely write a paragraph. Now they can do this. I predict that by the summer of 2026 we will have seen at least one genuinely interesting research paper that was soup-to-nuts generated via a tool-using generative AI system.'
Potentially also relevant - Contrastive Preference Learning: Learning from Human Feedback without RL, TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space, Bridging Associative Memory and Probabilistic Modeling.
And I could see some kind of safety case framework where, as we gain confidence in the control/alignment of the amplified system and as the capabilities of the systems increase, we move towards increasingly automating the safety research applied to the (increasingly 'interior' parts of the) core system.
E.g. I would interpret the results from https://transluce.org/neuron-descriptions as showing that we can now get 3-minute-human-level automated interpretability on all the MLP neurons of a LLM ('core system'), for about 5 cents / neuron (using sub-ASL-3 models and very unlikely to be scheming because bad at prerequisites).
Under the assumptions here (including Chinchilla scaling laws), depth wouldn't increase by more than about 3x before the utilization rate starts dropping (because depth would increase with exponent about 1/6 of the total increase in FLOP); this seems like great news for the legibility of CoT outputs and the like, as opposed to opaque reasoning in models: https://lesswrong.com/posts/HmQGHGCnvmpCNDBjc/current-ais-provide-nearly-no-data-relevant-to-agi-alignment#mcA57W6YK6a2TGaE2
Speculatively, this might also differentially incentivize (research on generalized) inference scaling, with various potential strategic implications, including for AI safety (current inference scaling methods tend to be tied to CoT and the like, which are quite transparent) and for regulatory frameworks/proliferation of dangerous capabilities.
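A quick worked version of the depth arithmetic above, assuming depth d follows a pure power law in training compute C with the exponent of about 1/6 mentioned in the comment (the exact power-law form is an idealization):

```latex
\[
d \propto C^{1/6}
\quad\Longrightarrow\quad
\frac{d_2}{d_1} = \left(\frac{C_2}{C_1}\right)^{1/6},
\qquad
\frac{d_2}{d_1} = 3
\;\Longrightarrow\;
\frac{C_2}{C_1} = 3^{6} = 729 .
\]
```

So, under that assumption, a 3x increase in depth would correspond to roughly a 700x (nearly three orders of magnitude) increase in training FLOP.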
Fragment of a chat with Claude on All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling and Shared computational principles for language processing in humans and deep language models (my prompt first):
'in 'all or none...' they mention 'the general form of next-token predictors used in our analysis.'; if we assume that the language network (in biological brains) is indeed a next-token predictor (as e.g. the next-word prediction results in 'shared principles...' suggest), shouldn't this suggest that the linear properties should then also apply to biological brains, not just LLMs?
This is a very insightful observation! Let's analyze this:
The key logical chain would be:
1. The "All or None" paper's results apply to next-token predictors of the form:
```
p(x_{t+1} | x_{1:t}) = exp(f(x_{1:t})ᵀ g(x_{t+1})) / Z(x_{1:t})
```
where f maps sequences to embeddings and g maps tokens to unembeddings
2. The "Shared Computational Principles" paper provides evidence that the human brain:
- Engages in next-word prediction before word onset
- Uses context to predict next tokens
- Shows surprise signals consistent with prediction error
- Represents words in ways that can be mapped to vector spaces
3. If the brain's language network is implementing a next-token predictor of this general form, then theoretically:
- The same mathematical framework should apply
- The same identifiability results should hold
- The same "all or none" properties of linear features should be present
However, there are some important considerations:
1. Implementation Level:
- While the brain might implement this type of predictor at a computational level, the neural implementation could add constraints or properties not captured by the mathematical framework
- The brain might use a more complex or different form of next-token prediction
2. Verification Challenges:
- It's harder to verify if the brain's implementation exactly matches this mathematical form
- We can't directly observe all components like f and g in biological systems
3. Scale/Scope:
- The brain might implement multiple prediction systems operating at different scales
- The mathematical framework might apply to some but not all aspects of neural language processing
So while your reasoning seems sound - if the brain implements this type of next-token predictor, the mathematical results should apply - verifying this empirically would be challenging. It's a fascinating theoretical bridge between the mathematical analysis and biological implementation.'
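For concreteness, a small sketch of the predictor form quoted in the chat above, with f(x_{1:t}) given as a context embedding vector and g as a matrix of per-token unembeddings. The function name and array shapes are assumptions for illustration. The check at the end shows one standard reason why such models are only determined up to a linear transformation: replacing f with A f and g with A^{-T} g leaves the predicted distribution unchanged.

```python
import numpy as np

def next_token_probs(f_context: np.ndarray, g_tokens: np.ndarray) -> np.ndarray:
    """p(x_{t+1} | x_{1:t}) = exp(f(x_{1:t})^T g(x_{t+1})) / Z(x_{1:t}).

    f_context: shape (d,), embedding of the context x_{1:t}
    g_tokens:  shape (V, d), one unembedding vector per vocabulary token
    """
    logits = g_tokens @ f_context
    # Subtracting the max is for numerical stability; it cancels in the ratio.
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()  # the sum plays the role of Z(x_{1:t})

# Invariance under an invertible linear map A (illustrative random values):
rng = np.random.default_rng(0)
d, V = 4, 6
f = rng.normal(size=d)
g = rng.normal(size=(V, d))
A = rng.normal(size=(d, d)) + 3.0 * np.eye(d)
p_original = next_token_probs(f, g)
# Rows of g @ inv(A) are (A^{-T} g_i)^T, so logits are unchanged.
p_transformed = next_token_probs(A @ f, g @ np.linalg.inv(A))
same_distribution = np.allclose(p_original, p_transformed)
```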
'Data movement bottlenecks limit LLM scaling beyond 2e28 FLOP, with a "latency wall" at 2e31 FLOP. We may hit these in ~3 years. Aggressive batch size scaling could potentially overcome these limits.' https://epochai.org/blog/data-movement-bottlenecks-scaling-past-1e28-flop
Some related cool-seeming recent results: All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling, Cross-Entropy Is All You Need To Invert the Data Generating Process.
Now suppose SFF & LTFF's budget suddenly & magically scaled up 10x. Would this dramatically change the state of the field? I think so!
I do think so, especially if they also increased/decentralized more their grantmaking capacity, and perhaps increased the field-building capacity earlier in the pipeline (e.g. AGISF, ML4G, etc., though I expect those programs to mostly be doing differentially quite well and not to be the main bottlenecks).
No, I mean that EA + AI safety funders probably would have a lot of money earmarked for AI risk mitigation, but they don't seem able/willing to deploy it fast enough (according to my timelines, at least, but probably also according to many of theirs).
Governments mostly just don't seem to be on the ball at all w.r.t. AI, even despite the recent progress (e.g. the AI safety summits, establishment of AISIs, etc.).
So why not that funding model? Mostly a money thing.
*seems like mostly a funding deployment issue, probably due to some structural problems, AFAICT, without having any great inside info (within the traditional AI safety funding space; the rest of the world seems much less on the ball than the traditional AI safety funding space).
how recent reports of OpenAI’s o1 being deceptive have been questioned.
This seems to be confusing a dangerous capability eval (of being able to 'deceive' in a visible scratchpad) with an assessment of alignment, which seems like exactly what the 'questioning' was about.
This kind of possibility forces AI systems to push back bad behavior into cases where they are more and more confident that it's never going to be noticed. But the space of interventions we get to try, if we subject this issue to rigorous scientific scrutiny, includes significantly modifying the AI's training data and limiting information about the world. So "super confident that humans will never notice" is a very high bar.
And the space of interventions will likely also include using/manipulating model internals, e.g. https://transluce.org/observability-interface, especially since (some kinds of) automated interpretability seem cheap and scalable, e.g. https://transluce.org/neuron-descriptions estimated a cost of < 5 cents / labeled neuron. LM agents have also previously been shown able to do interpretability experiments and propose hypotheses: https://multimodal-interpretability.csail.mit.edu/maia/, and this could likely be integrated with the above. The auto-interp explanations also seem roughly human-level in the references above.
Later edit: maybe also relevant - claim of roughly human-level automated multi-turn red-teaming: https://blog.haizelabs.com/posts/cascade/. Also a demo of integrating mech interp with red-teaming: https://blog.haizelabs.com/posts/steering/.
Indirectly, because 90 papers seems like a tiny number vs. what got published on arXiv during that same time interval. (Depending on how one counts) I wouldn't be surprised if there were > 90 papers from outside the labs even looking only at the unlearning category.
My main takeaway would be that this seems like quite strong evidence towards the view expressed in https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/, that most safety research doesn't come from the top labs.
For similar arguments, I think it's gonna be very hard/unlikely to stop China from having AGI within a couple of years of the US (and most relevant AI chips currently being produced in Taiwan should probably further increase the probability of this). So taking on a lot more x-risk to try and race hard vs. China doesn't seem like a good strategy from this POV.
Whatever innovations smaller companies make can be easily stolen by tech giants.
And they / their basic components are probably also published by academia, though the precise hyperparameters, etc. might still matter and be non-trivial/costly to find.
Information you should probably remove from the weights
Perhaps it might also be useful to remove information which might reduce the likelihood that 'A TAI which kills all humans might also doom itself' (especially in short timelines/nearcast scenarios).
This would seem like a great benchmark/dataset/eval to apply automated research to. Would you have thoughts/recommendations on that? E.g. how worried might/should one be about risks of/from Goodharting?
Later edit: I guess it's kind of already been tried by e.g. Tamper-Resistant Safeguards for Open-Weight LLMs and other approaches combining unlearning with meta-learning, though not necessarily with exactly the same motivation.
Here's a recent paper which might provide [inspiration for] another approach: Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts (though it seems at least somewhat related to the tamper-resistant paper mentioned in another comment).
Edit: I'd also be curious to see if editing-based methods, potentially combined with interp techniques (e.g. those in Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization and in Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces), might fare better, and there might also be room for cross-pollination of methodologies.
OTOH, Challenges in Explaining Representational Similarity through Identifiability presents good counter-points. Also, Position: Understanding LLMs Requires More Than Statistical Generalization.
It's not just MedARC, there's e.g. an entire subfield at the intersection of neuroscience and AI, see e.g. The neuroconnectionist research programme. Hundreds, if not thousands, of papers have been published at this point. You can critique their methodologies, assumptions, etc. (indeed, there are lively debates within the subfield itself), but pointing at MedARC and trying to infer how promising the whole endeavour is just from that is kind of unserious/unaware.
Also A Timeline and Analysis for Representation Plasticity in Large Language Models.
What feels underexplored to me is: If we can control roughly human-level AI systems, what do we DO with them?
Automated/strongly-augmented AI risk mitigation research, among various other options that Redwood discusses in some of their posts/public appearances.
(I might expand on this comment later but for now) I'll point out that there are some pretty large literatures out there which seem at least somewhat relevant to these questions, including on causal models, identifiability and contrastive learning, and on neuroAI - for some references and thoughts see e.g.:
https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default?commentId=8CngPZyjr5XydW4sC
And for some very recent potentially relevant work, using SAEs:
Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Some past examples I find valuable - in the case of unlearning: WMDP, Eight Methods to Evaluate Robust Unlearning in LLMs; in the case of mech interp - various proxies for SAE performance, e.g. from Scaling and evaluating sparse autoencoders, as well as various benchmarks, e.g. FIND: A Function Description Benchmark for Evaluating Interpretability Methods. Prizes and RFPs seem like a potentially scalable way to do this - e.g. https://www.mlsafety.org/safebench - and I think they could be particularly useful on short timelines.
Potentially also https://www.lesswrong.com/posts/yxdHp2cZeQbZGREEN/improving-model-written-evals-for-ai-safety-benchmarking.