Posts

Understanding Counterbalanced Subtractions for Better Activation Additions 2023-08-17T13:53:37.813Z
Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic 2023-07-28T19:43:12.235Z
UK Foundation Model Task Force - Expression of Interest 2023-06-18T09:43:27.734Z
ojorgensen's Shortform 2023-05-04T13:51:33.152Z
(Extremely) Naive Gradient Hacking Doesn't Work 2022-12-20T14:35:33.591Z
Which Issues in Conceptual Alignment have been Formalised or Observed (or not)? 2022-11-01T22:32:25.243Z
Strange Loops - Self-Reference from Number Theory to AI 2022-09-28T14:10:00.106Z
Evaluating OpenAI's alignment plans using training stories 2022-08-25T16:12:38.924Z
Disagreements about Alignment: Why, and how, we should try to solve them 2022-08-09T18:49:57.710Z

Comments

Comment by ojorgensen on Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic · 2024-01-28T19:29:07.188Z · LW · GW

Yeah I think we have the same understanding here (in hindsight I should have made this more explicit in the post / title).

I would be excited to see someone empirically try to answer the question you mention at the end. In particular, given some direction and a LayerNormed vector, one might try to quantify how smoothly rotating from the vector towards that direction changes the output of the MLP layer. This seems like a good test of whether the Polytope Lens is helpful / necessary for understanding the MLPs of Transformers (with smooth changes corresponding to your 'random jostling cancels out', and hence to not needing to worry about Polytope Lens style issues).
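
A rough sketch of what such an experiment could look like, purely as an illustration: the MLP here is a hypothetical random ReLU network (not GPT-2's actual weights or activation function), and `slerp`, `toy_mlp`, `layer_norm` and `smoothness_ratio` are names made up for this example. It rotates a LayerNormed vector towards a direction at constant norm and records how much the MLP output moves at each step.

```python
import numpy as np

def slerp(a, b, t):
    """Rotate from a towards b by a fraction t of the angle between them, keeping a's norm."""
    norm = np.linalg.norm(a)
    ua, ub = a / norm, b / np.linalg.norm(b)
    omega = np.arccos(np.clip(ua @ ub, -1.0, 1.0))  # angle between the two vectors
    return norm * (np.sin((1 - t) * omega) * ua + np.sin(t * omega) * ub) / np.sin(omega)

def toy_mlp(x, w_in, w_out):
    """Hypothetical stand-in for an MLP layer: ReLU(x @ w_in) @ w_out."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def layer_norm(v):
    """LayerNorm without learned scale/bias: zero mean, norm sqrt(len(v))."""
    v = v - v.mean()
    return v / v.std()

rng = np.random.default_rng(0)
d_model, d_hidden, steps = 64, 256, 50

x = layer_norm(rng.normal(size=d_model))          # the LayerNormed vector
direction = layer_norm(rng.normal(size=d_model))  # the direction, rescaled to the same norm
w_in = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
w_out = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)

# Walk along the rotation and measure how far the MLP output moves per step.
path = [slerp(x, direction, t) for t in np.linspace(0.0, 1.0, steps + 1)]
outputs = np.array([toy_mlp(p, w_in, w_out) for p in path])
step_sizes = np.linalg.norm(np.diff(outputs, axis=0), axis=1)

# A ratio near 1 means the output changes at a steady rate along the rotation.
smoothness_ratio = step_sizes.max() / step_sizes.mean()
```

A real version would swap in the trained MLP weights and a direction of interest; the max-to-mean ratio is just one assumed way to summarise "smoothness".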

Comment by ojorgensen on Open Thread – Winter 2023/2024 · 2023-12-11T12:54:25.200Z · LW · GW

It would save me a fair amount of time if all lesswrong posts had an "export BibTex citation" button, exactly like the feature on arxiv.  This would be particularly useful for alignment forum posts!

Comment by ojorgensen on Against Almost Every Theory of Impact of Interpretability · 2023-11-10T15:52:05.574Z · LW · GW

One central criticism of this post is its pessimism towards enumerative safety (i.e. finding all features in the model, or at least all important features). I would be interested to hear how the author / others have updated on the potential of enumerative safety in light of recent progress on dictionary learning, and finding features which appear to correspond to high-level concepts like truth, utility and sycophancy. It seems clear that there should be some positive update here, but I would be interested in understanding issues which these approaches will not contribute to solving.

Comment by ojorgensen on Some ML-Related Math I Now Understand Better · 2023-08-29T12:02:14.870Z · LW · GW

But this does not hold for tiny cosine similarities (e.g. 0.01 for , which gives a lower bound of 2 using the formula above). I'm not aware of a lower bound better than  for tiny angles.

Unless I'm misunderstanding, a better lower bound for almost orthogonal vectors when cosine similarity is approximately 0 is just the dimension of the space, obtained by taking an orthogonal basis.

My guess for why the formula doesn't give this is that it is derived by covering a sphere with non-intersecting spherical caps, which is sufficient for almost orthogonality but not necessary.  This is also why the lower bound of 2 vectors makes sense when we require cosine similarity to be approximately 0, since then the only way you can fit two spherical caps onto the surface of a sphere is by dividing it into 2 hemispheres.

This doesn't change the headline result (still exponentially much room for almost orthogonal vectors), but the actual numbers might be substantially larger thanks to almost orthogonal vectors being a weaker condition than spherical cap packing.
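
As a quick sanity check of the "more room than the bound suggests" intuition, here is a small sketch (not the construction behind the post's formula). It samples random unit vectors, with the dimension, vector count and seed all chosen arbitrarily for illustration, and measures the largest pairwise cosine similarity when there are twice as many vectors as dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors = 1000, 2000  # twice as many vectors as dimensions

vectors = rng.normal(size=(n_vectors, dim))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # rescale rows to unit length

cosines = vectors @ vectors.T
np.fill_diagonal(cosines, 0.0)  # ignore each vector's similarity with itself
max_abs_cosine = np.abs(cosines).max()
```

Random sampling is only a convenient way to exhibit such a set; it says nothing about how tight any particular bound is.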

Comment by ojorgensen on Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo · 2023-07-17T12:08:26.154Z · LW · GW

(Potential spoilers!)

There is some relevant literature which explores this phenomenon, also looking at the cosine similarity between words across layers of transformers. I think the most relevant is (Cai et al. 2021), where they also find this higher than expected cosine similarity between residual stream vectors in some layer for BERT, D-BERT, and GPT. (Note that they use some somewhat confusing terminology: they define inter-type cosine similarity to be cosine similarity between embeddings of different tokens in the same input; and intra-type cosine similarity to be cosine similarity of the same token in different inputs. Inter-type cosine similarity is the one that is most relevant here).

They find that the residual stream vectors for GPT-2 small tend to lie in two distinct clusters. Once you re-centre these clusters, the average cosine similarity between residual stream vectors falls to close to 0 throughout the layers of the model, as you might expect.
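
To illustrate the re-centring effect on synthetic data (not on real GPT-2 activations), here is a small sketch. The cluster centres, noise scale and sizes are all made-up assumptions, and `mean_pairwise_cosine` is a helper defined just for this example.

```python
import numpy as np

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all pairs of distinct rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    gram = unit @ unit.T
    n = len(vectors)
    return (gram.sum() - n) / (n * (n - 1))  # drop the n self-similarities (each 1)

rng = np.random.default_rng(0)
dim, per_cluster = 64, 200

common = 4.0 * np.ones(dim)    # offset shared by both clusters (assumed)
delta = rng.normal(size=dim)   # separates the two cluster centres
centres = np.stack([common + delta, common - delta])
labels = np.repeat([0, 1], per_cluster)
vectors = centres[labels] + rng.normal(size=(2 * per_cluster, dim))

# Re-centre: subtract each cluster's own mean from its members.
recentred = vectors.copy()
for k in (0, 1):
    recentred[labels == k] -= vectors[labels == k].mean(axis=0)

before = mean_pairwise_cosine(vectors)   # dominated by the shared offset
after = mean_pairwise_cosine(recentred)  # only the noise is left
```

In this toy setup the high similarity comes entirely from the cluster means, so removing them leaves vectors that point in essentially unrelated directions.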

Comment by ojorgensen on ojorgensen's Shortform · 2023-05-04T13:51:33.242Z · LW · GW

Problem: we want to make it hard for ML systems (trained via SGD) to perform naive gradient hacking. By naive gradient hacking, I mean "being able to keep some weights of the network constant for an arbitrary step of SGD".

Solution: do "stochastic" regularisation, e.g. sample the amount of regularisation we perform randomly (could use quantum stuff if we want true randomness). This seems like it should make naive gradient hacking almost impossible - in order to keep some target weights unchanged, you'd have to match their +ve contribution to the loss to the degree of regularisation. If the degree of regularisation is stochastic, the loss contribution must also be stochastic, which is not possible with a deterministic NN!

The question of how useful this is for preventing any gradient hacking depends on how stable the loss landscape around some "deceptive / gradient hacking" minimum is. Seems possible the surrounding loss landscape could be pretty unstable to random perturbations?
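
A toy sketch of the stochastic regularisation idea, under heavy simplifying assumptions: a single scalar weight, L2 regularisation, and a "gradient hacker" modelled as a loss gradient that exactly cancels the expected regularisation pull. The function names and constants are hypothetical, chosen only for illustration.

```python
import random

def sgd_step(w, grad, lr, decay):
    """One SGD step on loss + (decay / 2) * w**2 for a single scalar weight."""
    return w - lr * (grad + decay * w)

def max_drift(sample_decay, steps=100, lr=0.1, mean_decay=0.05):
    """Largest distance the 'protected' weight moves from its starting value."""
    w, start, worst = 1.0, 1.0, 0.0
    for _ in range(steps):
        # Assumption: a deterministic network can at best produce a loss gradient
        # that cancels the expected regularisation pull on the weight.
        counter_grad = -mean_decay * w
        w = sgd_step(w, counter_grad, lr, sample_decay())
        worst = max(worst, abs(w - start))
    return worst

rng = random.Random(0)
fixed = max_drift(lambda: 0.05)                        # deterministic regularisation
stochastic = max_drift(lambda: rng.uniform(0.0, 0.1))  # coefficient resampled every step
```

With a fixed coefficient the counter-gradient cancels the regularisation exactly and the weight never moves; with a resampled coefficient the cancellation only holds on average, so the weight drifts.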

Comment by ojorgensen on Excessive AI growth-rate yields little socio-economic benefit. · 2023-04-04T20:01:45.514Z · LW · GW

Just a nit-pick but to me "AI growth-rate" suggests economic growth due to progress in AI, as opposed to simply technical progress in AI. I think "Excessive AI progress yields little socio-economic benefit" would make the argument more immediately clear.

Comment by ojorgensen on EIS XI: Moving Forward · 2023-03-09T11:00:26.176Z · LW · GW

Rando et al. (2022)

This link is broken btw!

Comment by ojorgensen on Abuse in LessWrong and rationalist communities in Bloomberg News · 2023-03-08T18:21:14.493Z · LW · GW

Didn't get that impression from your previous comment, but this seems like a good strategy!

Comment by ojorgensen on Abuse in LessWrong and rationalist communities in Bloomberg News · 2023-03-08T16:44:36.475Z · LW · GW

This seems like a bad rule of thumb. If your social circle is largely comprised of people who have chosen to remain within the community, ignoring information from "outsiders" seems like a bad strategy for understanding issues with the community.

Comment by ojorgensen on Bing Chat is blatantly, aggressively misaligned · 2023-02-15T16:58:24.713Z · LW · GW

Even if OpenAI don't have the option to stop Bing Chat being released now, this would surely have been discussed during investment negotiations. It seems very unlikely this is being released without approval from decision-makers at OpenAI in the last month or so. If they somehow didn't foresee that something could go wrong and had no mitigations in place in case Bing Chat started going weird, that's pretty terrible planning.

Comment by ojorgensen on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2023-02-01T17:51:19.153Z · LW · GW

This seems very similar to work that has come out of the Stanford AI Lab recently, linked to here.

Comment by ojorgensen on Gradient hacking is extremely difficult · 2023-01-25T08:28:58.023Z · LW · GW

Great post! This helps to clarify and extend lots of fuzzy intuitions I had around gradient hacking, so thanks! If anyone is interested in a different perspective / set of intuitions for how some properties of gradient descent affect gradient hacking, I wrote a small post about this here: https://www.lesswrong.com/posts/Nnb5AqcunBwAZ4zac/extremely-naive-gradient-hacking-doesn-t-work

I’d expect this to mainly be of use if the properties of gradient descent labelled 1, 4, 5 were not immediately obvious to you.

Comment by ojorgensen on Disagreements about Alignment: Why, and how, we should try to solve them · 2022-12-29T20:24:26.987Z · LW · GW

Hey! Not currently working on anything related to this, but would be excited to read anything related to this you are writing :))

Comment by ojorgensen on A Walkthrough of A Mathematical Framework for Transformer Circuits · 2022-10-26T13:46:42.555Z · LW · GW

Understanding Infra-Bayesianism :))

Comment by ojorgensen on A Walkthrough of A Mathematical Framework for Transformer Circuits · 2022-10-26T07:11:10.000Z · LW · GW

I went through the paper for a reading group the other day, and I think the video really helped me to understand what is going on in the paper. Parts I found most useful were indications of which parts of the paper / maths were most important to be able to understand, and which were not (tensor products).

I had made some effort to read the paper before with little success, but now feel like I understand the overall results of the paper pretty well. I’m very positive about this video, and similar things like this being made in the future!

Personal context: I also found the intro to IB video series similarly useful. I’m an AI masters student who has some pre-existing knowledge about AI alignment. I have a maths background.

Comment by ojorgensen on Strange Loops - Self-Reference from Number Theory to AI · 2022-09-28T20:55:36.327Z · LW · GW

Firstly, thanks for reading the post! I think you're referring mainly to realisability here which I'm not that clued up on tbh, but I'll give you my two cents because why not. 

I'm not sure to what extent we should focus on unrealisability when aligning systems. I think I have a similar intuition to you that the important question is probably "how can we get good abstractions of the world, given that we cannot perfectly model it". However, I think better arguments for why unrealisability is a core problem in alignment than I have laid out probably do exist; I just haven't read that much into it yet. I'll link again to this video series on IB (which I'm yet to finish) as I think there are probably some good arguments here.

Comment by ojorgensen on Strange Loops - Self-Reference from Number Theory to AI · 2022-09-28T20:37:11.064Z · LW · GW

I'm not sure if this is what you're looking for, but Hofstadter gives a great analogy using record players which I find useful in terms of thinking about how changing the situation changes our results (which is paraphrased here).

  • A (hi-fi) record player that tries to play every possible sound can't actually play its own self-breaking sound, so it is incomplete by virtue of its strength.
  • A (low-fi) record player that refuses to play all sounds (in order to avoid destruction from its self-breaking sound) is incomplete by virtue of its weakness. 

We may think of the hi-fi record player as a formal system like Peano Arithmetic: the incompleteness arises precisely because it is strong enough to be able to capture number theory. This is what allows us to use Gödel Numbering, which then allows PA to do meta-reasoning about itself.

The only way to fix it is to make a system that is weaker than PA, so that we cannot do Gödel Numbering. But then we have a system that isn't even trying to express what we mean by number theory. This is the low-fi record player: as soon as we fix the one issue of self-reference, we fail to capture the thing we care about (number theory).

I think an example of a weaker formal system is Propositional Calculus. Here we do actually have completeness, but that is only because Propositional Calculus is too weak to be able to capture number theory.
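
For anyone who hasn't seen Gödel Numbering concretely, here is a minimal sketch of the standard prime-power encoding. The symbol table is an arbitrary assumption made for this example, and `first_primes`, `godel_encode` and `godel_decode` are illustrative names. The point is just that a string of symbols becomes a single number, which is what lets arithmetic "talk about" formulas.

```python
def first_primes(n):
    """Return the first n primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def godel_encode(codes):
    """Encode positive integer symbol codes c_1..c_n as 2**c_1 * 3**c_2 * 5**c_3 * ..."""
    number = 1
    for prime, code in zip(first_primes(len(codes)), codes):
        number *= prime ** code
    return number

def godel_decode(number):
    """Recover the symbol codes by reading off the exponent of each successive prime."""
    codes = []
    candidate = 2
    while number > 1:
        if all(candidate % p for p in range(2, int(candidate ** 0.5) + 1)):  # candidate is prime
            exponent = 0
            while number % candidate == 0:
                number //= candidate
                exponent += 1
            codes.append(exponent)
        candidate += 1
    return codes

# Hypothetical symbol table: any assignment of distinct positive codes would do.
SYMBOLS = {"0": 1, "S": 2, "=": 3}
formula = "S0=S0"
codes = [SYMBOLS[ch] for ch in formula]
encoded = godel_encode(codes)
```

Decoding works because prime factorisation is unique, so `godel_decode(encoded)` gives back the original list of codes.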

Comment by ojorgensen on How likely is deceptive alignment? · 2022-09-12T13:12:51.630Z · LW · GW

I found this post really interesting, thanks for sharing it!

It doesn’t seem obvious to me that the methods of understanding a model given a high path-dependence world become significantly less useful if we are in a low path-dependence world. I think I see why low path-dependence would give us the opportunity to use different methods of analysis, but I don’t see why the high path-dependence ones would no longer be useful.

For example, here is the reasoning behind “how likely is deceptive alignment” in a high path-dependence world (quoted from the slide). 

  1. We start with a proxy-aligned model
  2. In early training, SGD jointly focuses on improving the model's understanding of the world along with improving its proxies
  3. The model learns about the training process from its input data
  4. SGD makes the model's proxies into more long-term goals, resulting in it instrumentally optimizing for the training objective for the purposes of staying around
  5. The model's proxies "crystallize", as they are no longer relevant to performance, and we reach an equilibrium

Let's suppose that this reasoning, and the associated justification of why this is likely to arise due to SGD seeking the largest possible marginal performance improvements, are sound for a high path-dependence world.  Why does it no longer hold in a low path-dependence world?

Comment by ojorgensen on My emotional reaction to the current funding situation · 2022-09-09T23:17:02.367Z · LW · GW

I really like this post! I can’t see whether you’ve already cross posted this to the EA forum, but it seems valuable to have this there too (as it is focussed on the EA community).