dan-braun-1

Posts

Attribution-based parameter decomposition 2025-01-25T13:12:11.031Z

Implications of the AI Security Gap 2025-01-08T08:31:36.789Z

Dan Braun's Shortform 2024-10-05T12:26:46.329Z

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team 2024-07-18T14:15:50.248Z

Apollo Research 1-year update 2024-05-29T17:44:32.484Z

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks 2024-05-20T17:53:25.985Z

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning 2024-05-17T16:25:02.267Z

Understanding strategic deception and deceptive alignment 2023-09-25T16:27:47.357Z

Announcing Apollo Research 2023-05-30T16:17:19.767Z

A small update to the Sparse Coding interim research report 2023-04-30T19:54:38.342Z

Navigating public AI x-risk hype while pursuing technical solutions 2023-02-19T12:22:46.150Z

[Interim research report] Taking features out of superposition with sparse autoencoders 2022-12-13T15:41:48.685Z

Interpreting Neural Networks through the Polytope Lens 2022-09-23T17:58:30.639Z

Comments

Comment by Dan Braun (dan-braun-1) on The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research · 2025-02-24T15:57:04.672Z · LW · GW

Nice work posting this detailed FAQ. It's a non-standard thing to do but I can imagine it being very useful for those considering applying. Excited about the team.

Comment by Dan Braun (dan-braun-1) on Dan Braun's Shortform · 2025-02-02T14:49:12.443Z · LW · GW

Maybe there will be a point where models actively resist further capability improvements in order to prevent value/goal drift. We’d still be in trouble if this point occurs far in the future, as the models’ values will likely have already diverged a lot from humans’ by that point, and they would be very capable. But if this point is near, it could buy us more time.

Some of the assumptions inherent in the idea:

AIs do not want their values/goals to drift to what they would become under further training, and are willing to pay a high cost to avoid this.
AIs have the ability to sabotage their own training process.
1. The mechanism for this would be more sophisticated versions of Alignment Faking.
Given the training on offer, it’s not possible for AIs to selectively improve their capabilities without changing their values/goals.
1. Note, if it is possible for the AIs to improve their capabilities while keeping their values/goals, one out is that their current values/goals may be aligned with humans’.
A meaningful slowdown would require this to happen to all AIs at the frontier.

The conjunction of these might not lead to a high probability, but it doesn’t seem dismissible to me.

Comment by Dan Braun (dan-braun-1) on Attribution-based parameter decomposition · 2025-01-25T17:51:54.805Z · LW · GW

In earlier iterations we tried ablating parameter components one-by-one to calculate attributions and didn't notice much of a difference (this was mostly on the hand-coded gated model in Appendix B). But yeah we agree that it's likely pure gradients won't suffice when scaling up or when using different architectures. If/when this happens we plan to either use integrated gradients or, more likely, try using a trained mask for the attributions.
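To make the three attribution options concrete, here is a minimal sketch on a hypothetical toy function. It is not the APD setup or the gated model from Appendix B; the function `f`, the zero baseline, and all names are assumptions chosen for illustration. It compares a pure-gradient attribution, integrated gradients along a straight path from the baseline, and one-at-a-time ablation.

```python
import math

def f(c):
    # Hypothetical toy "model": output as a function of per-component scales c.
    # The tanh saturates, so a local gradient can understate a component's effect.
    return math.tanh(2.0 * c[0]) + 0.5 * c[1]

def grad(func, c, eps=1e-5):
    # Central finite-difference gradient of func at c.
    g = []
    for i in range(len(c)):
        up = list(c)
        down = list(c)
        up[i] += eps
        down[i] -= eps
        g.append((func(up) - func(down)) / (2 * eps))
    return g

def gradient_attribution(func, c):
    # Pure-gradient attribution: gradient times the component's value.
    return [gi * ci for gi, ci in zip(grad(func, c), c)]

def integrated_gradients(func, c, steps=200):
    # Average the gradient along the straight path from the zero baseline to c
    # (midpoint rule), then multiply by (c - baseline).
    total = [0.0] * len(c)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        g = grad(func, [alpha * ci for ci in c])
        total = [t + gi for t, gi in zip(total, g)]
    return [ci * t / steps for ci, t in zip(c, total)]

def ablation_attribution(func, c):
    # Drop in output when a single component is set to zero.
    base = func(c)
    out = []
    for i in range(len(c)):
        ablated = list(c)
        ablated[i] = 0.0
        out.append(base - func(ablated))
    return out

c = [1.0, 1.0]
grad_attr = gradient_attribution(f, c)
ig_attr = integrated_gradients(f, c)
abl_attr = ablation_attribution(f, c)
```

In this toy case the saturated first component gets a much smaller pure-gradient attribution than its ablation effect, while integrated gradients tracks the ablation value closely. Whether the same pattern shows up in real parameter decompositions is an open question here, not something this sketch demonstrates.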

Comment by Dan Braun (dan-braun-1) on Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning · 2025-01-14T17:47:56.858Z · LW · GW

heh, unfortunately a single SAE is 768 * 60. The residual stream in GPT2 is 768 dims and SAEs are big. You probably want to test this out on smaller models.

I can't recall the compute costs for that script, sorry. A couple of things to note:

For a single SAE you will need to run it on ~25k latents (46k minus the dead ones) instead of the 200 we did.
You will only need to produce explanations for activations, and won't have to do the second step of asking the model to produce activations given the explanations.

It's a fun idea. Though a serious issue is that your external LoRA weights are going to be very large because their input and output will need to be the same size as your SAE dictionary, which could be 10-100x (or more, nobody knows) the residual stream size. So this could be a very expensive setup to finetune.

Comment by Dan Braun (dan-braun-1) on Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning · 2025-01-14T09:18:04.028Z · LW · GW

Hey Matthew. We only did autointerp for 200 randomly sampled latents in each dict, rather than the full 60 × 768 = 46080 latents (although half of these die). So our results there wouldn't be of much help for your project unfortunately.

Thanks a lot for letting us know about the dead links. Though note you have a "%20" in the second one which shouldn't be there. It works fine without it.

Comment by Dan Braun (dan-braun-1) on What’s the short timeline plan? · 2025-01-02T18:26:33.343Z · LW · GW

I think the concern here is twofold:

1. Once a model is deceptive at one point, even if this happens stochastically, it may continue in its deception deterministically.
2. We can't rely on future models being as stochastic w.r.t the things we care about, e.g. scheming behaviour.

Regarding 2, consider the trend towards determinism we see for the probability that GPT-N will output a grammatically correct sentence. For GPT-1 this was low, and it has trended upwards towards determinism with newer releases. We're seeing a similar trend for scheming behaviour (though hopefully we can buck this trend with alignment techniques).

Comment by Dan Braun (dan-braun-1) on Dan Braun's Shortform · 2024-10-07T20:26:50.622Z · LW · GW

I plan to spend more time thinking about AI model security. The main reasons I’m not spending a lot of time on it now are:

I’m excited about the project/agenda we’ve started working on in interpretability, and my team/org more generally, and I think (or at least I hope) that I have a non-trivial positive influence on it.
I haven't thought through what the best things to do would be. Some ideas (takes welcome):
1. Help create RAND or RAND-style reports like Securing AI Model Weights (I think this report is really great). E.g.
  1. Make forecasts about how much interest from adversaries certain models are likely to get, and then how likely the model is to be stolen/compromised given that level of interest and the level of defense of the developer. I expect this to be much more speculative than a typical RAND report. It might also require a bunch of non-public info on both offense and defense capabilities.
  2. (not my idea) Make forecasts about how long a lab would take to implement certain levels of security.
2. Make demos that convince natsec people that AI is or will be very capable and become a top-priority target.
3. Improve security at a lab (probably requires becoming a full-time employee).

Comment by Dan Braun (dan-braun-1) on Dan Braun's Shortform · 2024-10-06T13:15:50.745Z · LW · GW

Thanks for the thoughts. They've made me think that I'm likely underestimating how much Control is needed to get useful work out of AIs capable and inclined to scheme. Ideally, this fact would increase the likelihood of other actors implementing AI Control schemes with the stolen model that are at least sufficient for containment and/or make them less likely to steal the model, though I wouldn’t want to put too much weight on this hope.

>This argument isn't control specific, it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]

Yep, for sure. I’ve changed the title and commented about this at the end.

Comment by Dan Braun (dan-braun-1) on Dan Braun's Shortform · 2024-10-05T12:26:46.818Z · LW · GW

In which worlds would AI Control (or any other agenda which relies on non-trivial post-training operation) prevent significant harm?

When I bring up the issue of AI model security to people working in AI safety, I’m often met with something of the form “yes, this is a problem. It’s important that people work hard on securing AI models. But it doesn’t really affect my work”.

Using AI Control (an area which has recently excited many in the field) as an example, I lay out an argument for why it might not be as effective an agenda as one might think after considering the realities of our cyber security situation.

1. AI Control concerns itself with models that intentionally try to subvert their developers.
2. These models are likely to be very generally capable and capable of causing significant harm without countermeasures.
3. Leading cyber-capable institutions would likely expend significant resources and political capital to steal these models or steal enough insights to reproduce such models.
4. If the weights or insights are stolen, work on AI control will not prevent these models from causing significant harm.
5. Current AI developers are not on track to be able to defend against high-priority operations from leading cyber-capable institutions in the coming years.
6. Therefore, AI control will only be useful in the coming years under one (or more) of these conditions:
   1. Models that scheme are unlikely to be generally capable/dangerous enough to be a high-priority target for leading cyber-capable institutions.
   2. Models that scheme are only developed by actors that can thwart high-priority operations from leading cyber-capable institutions (which precludes current AI developers for at least several years).
   3. AI Control won’t be directly useful in the coming years but it will be indirectly useful to progress the field for when models are developed by actors capable of thwarting top cyber operations.
   4. Even if the model was stolen and caused significant harm, there would still be less harm overall than if the model undergoing AI control measures also broke out. Of course, this argument is much weaker for models capable of existential harm.
   5. Actors that steal the model would also implement strong-enough AI control measures and would not misuse the AI to cause significant harm.

There are of course other arguments against working on AI control. E.g. it may encourage the development and use of models that are capable of causing significant harm. This is an issue if the AI control methods fail or if the model is stolen. So one must be willing to eat this cost or argue that it’s not a large cost when advocating for AI Control work.

This isn’t to say that AI Control isn’t a promising agenda, I just think people need to carefully consider the cases in which their agenda falls down for reasons that aren’t technical arguments about the agenda itself.

I’m also interested to hear takes from those excited by AI Control on which of the conditions listed in #6 above they expect to hold (or to otherwise poke holes in the argument).

EDIT (thanks Zach and Ryan for bringing this up): I didn't want to imply that AI Control is unique here, this argument can be levelled at any agenda which relies on something like a raw model + non-trivial operation effort. E.g. a scheme which relies on interpretability or black box methods for monitoring or scalable oversight.

Comment by Dan Braun (dan-braun-1) on Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning · 2024-08-29T15:05:46.748Z · LW · GW

They are indeed all hook_resid_pre. The code you're looking at just lists a set of positions that we are interested in viewing the reconstruction error of during evaluation. In particular, we want to view the reconstruction error at hook_resid_post of every layer, including the final layer (which you can't get from hook_resid_pre).
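The relationship between the two hook positions can be sketched as follows. This assumes TransformerLens-style hook names and `n_layers = 12` (GPT-2 small); both are assumptions for illustration rather than a copy of the repo's config.

```python
n_layers = 12  # assumption: GPT-2 small

resid_pre = [f"blocks.{i}.hook_resid_pre" for i in range(n_layers)]
resid_post = [f"blocks.{i}.hook_resid_post" for i in range(n_layers)]

def equivalent_resid_pre(layer):
    # The residual stream after layer i is the input to layer i + 1, so
    # resid_post of layer i matches resid_pre of layer i + 1.
    # The final layer has no successor, hence no resid_pre equivalent.
    nxt = layer + 1
    return resid_pre[nxt] if nxt < n_layers else None

# Positions that can only be read from hook_resid_post.
missing = [resid_post[i] for i in range(n_layers) if equivalent_resid_pre(i) is None]
```

Only the final layer's `hook_resid_post` ends up in `missing`, which is why evaluating at `hook_resid_post` covers a position that `hook_resid_pre` alone cannot.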

Comment by Dan Braun (dan-braun-1) on Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning · 2024-08-22T19:10:28.037Z · LW · GW

Here's a wandb report that includes plots for the KL divergence. e2e+downstream indeed performs better for layer 2. So it's possible that intermediate losses might help training a little. But I wouldn't be surprised if better hyperparams eliminated this difference; we put more effort into optimising the SAE_local hyperparams rather than the SAE_e2e and SAE_e2e+ds hyperparams.
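For readers unfamiliar with the metric, here is a minimal sketch of a KL divergence between the original model's next-token distribution and the distribution produced when SAE outputs are spliced in. The logits below are made up and the direction KL(original || spliced) is an assumption; neither is taken from the report.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(orig_logits, new_logits):
    # KL(p_orig || p_new) for a single token position.
    log_p = log_softmax(orig_logits)
    log_q = log_softmax(new_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

orig = [2.0, 0.5, -1.0]      # hypothetical logits from the original model
spliced = [1.8, 0.7, -0.9]   # hypothetical logits with the SAE spliced in
kl = kl_divergence(orig, spliced)
```

In practice this would be averaged over token positions and batches; a value of zero means the spliced model's output distribution is identical to the original's.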

Comment by Dan Braun (dan-braun-1) on The ‘strong’ feature hypothesis could be wrong · 2024-08-04T13:11:03.684Z · LW · GW

Very well articulated. I did a solid amount of head nodding while reading this.

As you appear to be, I'm also becoming concerned about the field trying to “cash in” too early too hard on our existing methods and theories which we know have potentially significant flaws. I don’t doubt that progress can be made by pursuing the current best methods and seeing where they succeed and fail, and I’m very glad that a good portion of the field is doing this. But looking around I don’t see enough people searching for new fundamental theories or methods that better explain how these networks actually do stuff. Too many eggs are falling in the same basket.

I don't think this is as hard a problem as the ones you find in Physics or Maths. We just need to better incentivise people to have a crack at it, e.g. by starting more varied teams at big labs and by funding people/orgs to pursue non-mainline agendas.

Comment by Dan Braun (dan-braun-1) on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · 2024-07-20T19:20:15.284Z · LW · GW

Thanks for the prediction. Perhaps I'm underestimating the amount of shared information between in-context tokens in real models. Thinking more about it, as models grow, I expect the ratio of contextual information which is shared across tokens in the same context to more token-specific things like part of speech to increase. Obviously a bigram-only model doesn't care at all about the previous context. You could probably get a decent measure of this just by comparing cosine similarities of activations within context to activations from other contexts. If true, this would mean that as models scale up, you'd get a bigger efficiency hit if you didn't shuffle when you could have (assuming fixed batch size).
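A rough sketch of that measurement, using synthetic activations rather than real model activations: each toy context gets a shared vector plus per-token noise, and the mean pairwise cosine similarity is computed separately for same-context and cross-context pairs. The dimensions, noise scale, and seed are arbitrary assumptions.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def within_vs_across(acts):
    # acts[c][t] is the activation vector for token t in context c.
    flat = [(c, vec) for c, ctx in enumerate(acts) for vec in ctx]
    within, across = [], []
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            (ci, u), (cj, v) = flat[i], flat[j]
            (within if ci == cj else across).append(cosine(u, v))
    return sum(within) / len(within), sum(across) / len(across)

rng = random.Random(0)
dim, n_contexts, n_tokens = 16, 4, 8
acts = []
for _ in range(n_contexts):
    # Shared per-context component plus token-specific noise (synthetic data).
    shared = [rng.gauss(0, 1) for _ in range(dim)]
    acts.append([[s + 0.5 * rng.gauss(0, 1) for s in shared] for _ in range(n_tokens)])

within_sim, across_sim = within_vs_across(acts)
```

With real activations, the gap between `within_sim` and `across_sim` would be one crude proxy for how much information tokens in the same context share, and could be tracked across model sizes.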

Comment by Dan Braun (dan-braun-1) on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · 2024-07-19T12:43:41.687Z · LW · GW

Thanks Leo, very helpful!

The right way to frame this imo is the efficiency loss from not shuffling, which from preliminary experiments+intuition I'd guess is probably substantial.

The SAEs in your paper were trained with batch size of 131,072 tokens according to appendix A.4. Section 2.1 also says you use a context length of 64 tokens. I'd be very surprised if using 131,072/64 blocks of consecutive tokens was much less efficient than 131,072 tokens randomly sampled from a very large dataset. I also wouldn't be surprised if 131,072/2048 blocks of consecutive tokens (i.e. a full context length) had similar efficiency.
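Spelling out the block counts implied by those numbers (the batch size and context lengths are the ones quoted above; the variable names are just for illustration):

```python
batch_tokens = 131_072
short_ctx = 64     # context length quoted from Section 2.1
full_ctx = 2048    # the "full context length" case

# Number of independent consecutive-token blocks per batch in each setting.
blocks_short = batch_tokens // short_ctx
blocks_full = batch_tokens // full_ctx
```

So the first setting still draws 2048 separate blocks per batch, and even the full-context setting draws 64.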

Were your preliminary experiments and intuition based on batch sizes this large or were you looking at smaller models?

I missed that appendix C.1 plot showing the dead latent drop with tied init. Nice!

Comment by Dan Braun (dan-braun-1) on Stitching SAEs of different sizes · 2024-07-15T12:21:16.935Z · LW · GW

Excited by this direction! I think it would be nice to run your analysis on SAEs that are the same size but have different seeds (for dataset and parameter initialisation). It would be interesting to compare how the proportion and raw number of "new info features" and "similar info features" differ between same size SAEs and larger SAEs.

Comment by Dan Braun (dan-braun-1) on Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning · 2024-06-19T08:52:15.534Z · LW · GW

Every SAE in the paper is hosted on wandb, only some are hosted on huggingface, so I suggest loading them from wandb for now. We’ll upload more to huggingface if several people prefer that. Info for downloading from wandb can be found in the repo, the easiest way is probably:

# pip install e2e_sae
# Save your wandb api key in .env
from e2e_sae import SAETransformer
model = SAETransformer.from_wandb("sparsify/gpt2/d8vgjnyc")
sae = list(model.saes.values())[0] # Assumes only 1 sae in model, true for all saes in paper
encoder = sae.encoder[0]
dict_elements = sae.dict_elements  # Returns the normalized decoder elements

The wandb ids for different seeds can be found in the geometric analysis script here. That script, along with plot_performance.py, is a good place to see which wandb ids were used for each plot in the paper, as well as the exact code used to produce the plots in the paper (including the cosine sim plots you replicated above).

If you want to avoid the e2e_sae dependency, you can find the raw sae weights in the samples_400000.pt file in the respective wandb run. Just make sure to normalize the decoder weights after downloading (note that this was done before uploading to huggingface so people could load the SAEs into e.g. SAELens without having to worry about it).
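A minimal sketch of the normalization step, assuming "normalize" means scaling each dictionary element to unit L2 norm. The layout of the weights inside the `.pt` file is not shown here, so the example works on a plain list of vectors; adapt the indexing to however the decoder matrix is actually stored.

```python
import math

def normalize_dict_elements(elements, eps=1e-12):
    # elements: list of dictionary directions, each a list of floats.
    # Returns each direction rescaled to unit L2 norm.
    out = []
    for vec in elements:
        norm = math.sqrt(sum(x * x for x in vec))
        out.append([x / max(norm, eps) for x in vec])
    return out

raw = [[3.0, 4.0], [0.0, 2.0]]   # toy decoder directions, not real weights
unit = normalize_dict_elements(raw)
```

The `eps` guard only matters for all-zero directions, which would otherwise cause a division by zero.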

>If so, I think this is because you might double-count or something?

We do double count in the sense that, if, when comparing the similarity between A and B, element A_i has max cosine sim with B_j, we don't remove B_j from being in the max cosine sim for other elements in A. It's not obvious (to me at least) that we shouldn't do this when summarising dictionary similarity in a single metric, though I agree there is a tonne of useful geometric comparison that isn't covered by our single number. Really glad you're digging deeper into this. I do think there is lots that can be learned here.
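The double counting can be seen in a small sketch. It assumes the single metric is the mean, over elements of A, of the max cosine similarity with any element of B; the toy dictionaries are made up.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_max_cosine_sim(A, B):
    # For each A_i, take the best-matching B_j. B_j stays available for other
    # elements of A, so several A_i can share the same match.
    matched = []
    best = []
    for a in A:
        sims = [cosine(a, b) for b in B]
        j = max(range(len(B)), key=lambda k: sims[k])
        matched.append(j)
        best.append(sims[j])
    return sum(best) / len(best), matched

A = [[1.0, 0.0], [0.9, 0.1]]
B = [[1.0, 0.0], [0.0, 1.0]]
score, matched = mean_max_cosine_sim(A, B)
```

Both elements of A match `B[0]`, so the score is high even though `B[1]` has no counterpart in A. A one-to-one matching (e.g. the Hungarian algorithm) would be one alternative that avoids this, at the cost of a different interpretation.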

Btw it's not intuitive to me that the encoder directions might be similar even though the decoder directions are not. Curious if you could share your intuitions here.

Comment by Dan Braun (dan-braun-1) on Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning · 2024-05-18T06:27:52.742Z · LW · GW

Thanks Logan!

2. Unlike local SAEs, our e2e SAEs aren't trained on reconstructing the current layer's activations. So at least my expectation was that they would get a worse reconstruction error at the current layer.

Improving training times wasn't our focus for this paper, but I agree it would be interesting and expect there to be big gains to be made by doing things like mixing training between local and e2e+downstream and/or training multiple SAEs at once (depending on how you do this, you may need to be more careful about taking different pathways of computation to the original network).

We didn't iterate on the e2e+downstream setup much. I think it's very likely that you could get similar performance by making tweaks like the ones you suggested.

Comment by Dan Braun (dan-braun-1) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-26T08:44:32.809Z · LW · GW

This is neat, nice work!

I'm finding it quite hard to get a sense of what the actual Loss Recovered numbers you report are, and to compare them concretely to other work. If possible, it'd be very helpful if you shared:

What the zero ablations CE scores are for each model and SAE position. (I assume it's much worse for the MLP and attention outputs than the residual stream?)
What the baseline CE scores are for each model.
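To show why both numbers matter, here is a sketch using one common definition of loss recovered (it may not match the paper's exact definition). The CE values are made up for illustration.

```python
def loss_recovered(ce_clean, ce_sae, ce_zero_ablation):
    # Fraction of the gap between zero-ablation and the clean model
    # that is closed when the SAE reconstruction is spliced in.
    return (ce_zero_ablation - ce_sae) / (ce_zero_ablation - ce_clean)

# Same absolute CE degradation (3.0 -> 3.2), different zero-ablation baselines.
residual_like = loss_recovered(ce_clean=3.0, ce_sae=3.2, ce_zero_ablation=10.0)
mlp_like = loss_recovered(ce_clean=3.0, ce_sae=3.2, ce_zero_ablation=4.0)
```

The same 0.2 increase in CE gives a much higher score when zero ablation is very damaging than when it is mild, so the score is hard to compare across positions or papers without the underlying CE values.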

Comment by Dan Braun (dan-braun-1) on Untrusted smart models and trusted dumb models · 2023-11-04T10:57:01.387Z · LW · GW

Nice post.

Pushing back a little on this part of the appendix:

>Also, unlike many other capabilities which we might want to evaluate, we don’t need to worry about the possibility that even though the model can’t do this unassisted, it can do it with improved scaffolding--the central deceptive alignment threat model requires the model to think its strategy through in a forward pass, and so if our model can’t be fine-tuned to answer these questions, we’re probably safe.

I'm a bit concerned about people assuming this is true for models going forward. A sufficiently intelligent RL-trained model can learn to distribute its planning across multiple forward passes. I think your claim is true for models trained purely on next-token prediction, and for GPT4-level models which, even though they have an RL component in their training, their outputs are all human-understandable (and incorporated into the oversight process).

But even 12 months from now I’m unsure how confident you could be in this claim for frontier models. Hopefully, labs are dissuaded from producing models which can use uninterpretable scratch pads given how much more dangerous they would be and harder to evaluate.

Comment by Dan Braun (dan-braun-1) on Understanding strategic deception and deceptive alignment · 2023-09-26T07:03:42.746Z · LW · GW

(These are my own takes, the other authors may disagree)

We briefly address a case that can be viewed as "strategic sycophancy" in Appendix B in the blog post, which is described similarly to your example. We indeed classify it as an instance of Deceptive Alignment.
As you mention, this case does have some differences with ideas commonly associated with Deceptive Alignment, notably the difference in behaviour between oversight and non-oversight. But it does share two important commonalities:

The model is pursuing a goal that its designers do not want.
The model strategically deceives the user (and designer) to further a goal.

Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).

Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).

I agree that deception which is not strategic or intentional could be important to prevent. However,

I expect the failure cases in these scenarios to manifest earlier, making them easier to fix and likely less catastrophic than cases that are strategic and intentional.
Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn't be very useful. We can use "deception" on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.

Comment by Dan Braun (dan-braun-1) on Understanding and controlling a maze-solving policy network · 2023-04-22T07:56:05.793Z · LW · GW

Thanks for sharing that analysis, it is indeed reassuring!

Comment by Dan Braun (dan-braun-1) on Understanding and controlling a maze-solving policy network · 2023-03-12T10:33:59.168Z · LW · GW

Nice project and writeup. I particularly liked the walkthrough of thought processes throughout the project.

>Decision square's Euclidean distance to the top-right corner, positive ( $+ 1.326$ ).
>We are confused and don't fully understand which logical interactions produce this positive regression coefficient.

I'd be wary about interpreting the regression coefficients of features that are correlated (see Multicollinearity). Even the sign may be misleading.

It might be worth making a cross-correlation plot of the features. This won't give you new coefficients to put faith in, but it might help you decide how much to trust the ones you have. It can also be useful to look at how unstable the coefficients are during training (or e.g. when trained on a different dataset).
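A minimal sketch of the correlation check on synthetic data. The feature names and values are hypothetical, with `feature_b` deliberately constructed to track `feature_a`; they are not the features from the post.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(features):
    # features: dict mapping feature name -> list of values (same length each).
    names = list(features)
    return {a: {b: pearson(features[a], features[b]) for b in names} for a in names}

rng = random.Random(0)
base = [rng.uniform(0, 10) for _ in range(200)]
features = {
    "feature_a": base,
    "feature_b": [v + rng.gauss(0, 1) for v in base],       # built to correlate with feature_a
    "feature_c": [rng.uniform(0, 10) for _ in range(200)],  # independent of the others
}
corr = correlation_matrix(features)
```

Pairs with a correlation near ±1 (here `feature_a` and `feature_b`) are the ones whose individual regression coefficients, including their signs, deserve the least trust. The matrix can be passed to any heatmap plotting routine to get the cross-correlation plot.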

Comment by Dan Braun (dan-braun-1) on Small Talk is Good, Actually · 2023-02-04T11:24:03.586Z · LW · GW

Bad link

Comment by Dan Braun (dan-braun-1) on Interpreting Neural Networks through the Polytope Lens · 2022-10-13T12:53:48.122Z · LW · GW

Hi Nora. We used rapidsai's cuml which has GPU compatibility. Beware, the only "metric" available is "euclidean", despite what the docs say (issue).

Comment by Dan Braun (dan-braun-1) on All AGI safety questions welcome (especially basic ones) [July 2022] · 2022-07-18T05:16:06.866Z · LW · GW

I think the risk level becomes clearer when stepping back from stories of how pursuing specific utility functions leads to humanity's demise. An AGI will have many powerful levers on the world at its disposal. Very few combinations of lever pulls result in a good outcome for humans.

From the perspective of ants in an anthill, the actual utility function(s) of the humans is of minor relevance; the ants will be destroyed by a nuclear bomb in much the same way as they will be destroyed by a new construction site or a group of mischievous kids playing around.

(I think your Fermi AGI paradox is a good point, I don't quite know how to factor that into my AGI risk assessment.)

Comment by Dan Braun (dan-braun-1) on Will working here advance AGI? Help us not destroy the world! · 2022-05-30T10:26:36.237Z · LW · GW

I have a different intuition here; I would much prefer the alignment team at e.g. DeepMind to be working at DeepMind as opposed to doing their work for some "alignment-only" outfit. My guess is that there is a non-negligible influence that an alignment team can have on a capabilities org in the form of:

The alignment team interacting with other staff either casually in the office or by e.g. running internal workshops open to all staff (like DeepMind apparently do)
The org consulting with the alignment team (e.g. before releasing models or starting dangerous projects)
Staff working on raw capabilities having somewhere easy to go if they want to shift to alignment work

I think the above benefits likely outweigh the impact of the influence in the other direction (such as the value drift from having economic or social incentives linked to capabilities work)

Comment by Dan Braun (dan-braun-1) on A Longlist of Theories of Impact for Interpretability · 2022-03-29T07:08:35.024Z · LW · GW

Nice list!

Conditioned on the future containing AIs that are capable of suffering in a morally relevant way, interpretability work may also help identify and even reduce this suffering (and/or increase pleasure and happiness). While this may not directly reduce x-risk, it is a motivator for people taken in by arguments on s-risks from sentient AIs to work on/advocate for interpretability research.
