Posts

Ambiguous out-of-distribution generalization on an algorithmic task 2025-02-13T18:24:36.160Z

Comments

Comment by Louis Jaburi (Ansatz) on Ambiguous out-of-distribution generalization on an algorithmic task · 2025-02-16T17:13:49.733Z · LW · GW

In our toy example, I would intuitively associate the LLC with the test losses rather than train loss. For training of a single model, it was observed that test loss and LLC are correlated. Plausibly, for this simple model (final) LLC, train loss, and test loss, are all closely related.

Comment by Louis Jaburi (Ansatz) on Ambiguous out-of-distribution generalization on an algorithmic task · 2025-02-16T17:07:11.686Z · LW · GW

We haven't seen that empirically with usual regularization methods, so I assume there must be something special going on with the training set up.

I wonder if this phenomenon is partially explained by scaling up the embedding and scaling down the unembedding by a factor (or vice versa). That should leave the LLC constant, but will change L2 norm. 

Comment by Louis Jaburi (Ansatz) on Against blanket arguments against interpretability · 2025-01-27T16:01:54.324Z · LW · GW

The relevant question then becomes whether the "SGLD" sampling techniques used in SLT for measuring the free energy (or technically its derivative) actually converge to reasonable values in polynomial time. This is checked pretty extensively in this paper for example.

The linked paper considers only large models which are DLNs. I don't find this too compelling evidence for large models with non-linearities. Other measurements I have seen for bigger/deeper non-linear models seem promising, but I wouldn't call them robust yet (though it is not clear to me if this is because of an SGLD implementation/hyperparameter issue or if there is a more fundamental problem here).

As long as I don't have a more clear picture of the relationship between free energy and training dynamics under SGD, I agree with OP that the claim is too strong.

Comment by Louis Jaburi (Ansatz) on Activation space interpretability may be doomed · 2025-01-09T14:05:20.541Z · LW · GW

I see, thanks for sharing!

Comment by Louis Jaburi (Ansatz) on Activation space interpretability may be doomed · 2025-01-09T14:00:08.648Z · LW · GW

Did you use something like  as described here ? By brittle do you mean w.r.t the sparsity penality (and other hyperparameters)?

Comment by Louis Jaburi (Ansatz) on Activation space interpretability may be doomed · 2025-01-09T13:35:55.725Z · LW · GW

Thanks for the reference, I wanted to illuminate the value of gradients of activations in this toy example as I have been thinking about similar ideas.

I personally would be pretty excited about attribuition dictionary learning, but it seems like nobody did that on bigger models yet.

Comment by Louis Jaburi (Ansatz) on Activation space interpretability may be doomed · 2025-01-09T13:28:08.499Z · LW · GW

Are you suggesting that there should be a formula similar to the one in Proposition 5.1 (or 5.2) that links information about the activations  with the LC as measure of basin flatness?

Comment by Louis Jaburi (Ansatz) on Activation space interpretability may be doomed · 2025-01-09T12:43:47.033Z · LW · GW

I played around with the  example as well and got similar results. I was wondering why there are two more dominant PCs: If you assume there is no bias, then the activations will all look like 

 or  and I checked that the two directions found by the PC approximately span the same space as . I suspect something similar is happening with bias.

In this specific example there is a way to get the true direction w_out from the activations: By doing a PCA on the gradient of the activations. In this case, it is easily explained by computing the gradients by hand: It will be a multiple of w_out. 

Comment by Louis Jaburi (Ansatz) on Alexander Gietelink Oldenziel's Shortform · 2025-01-07T17:00:02.260Z · LW · GW

Using ZIP as compression metric for NNs (I assume you do something along the lines of "take all the weights and line them up and then ZIP") is unintuitive to me for the following reason:
ZIP, though really this should apply to any other coding scheme that just tries to compress the weights by themselves, picks up on statistical patterns in the raw weights. But NNs are not just simply a list of floats, they are arranged in highly structured manner. The weights themselves get turned into functions and it is 1.the  functions, and 2. the way the functions interact that we are ultimately trying to understand (and therefore compress).

To wit, a simple example for the first point : Assume that inside your model is a 2x2 matrix with entries M=[0.587785, -0.809017, 0.809017, 0.587785]. Storing it like this will cost you a few bytes and if you compress it you can ~ half the cost I believe. But really there is a much more compact way to store this information: This matrix represents a rotation by 36 degrees. Storing it this way, requires less than 1 byte. 

This phenomenon should get worse for bigger models. One reason is the following: If we believe that the NN uses superposition, then there will be an overbasis in which all the computations are done (more) sparsly. And if we don't factor that in, then ZIP will not include such information (Caveat: This is my intuition, I don't have empirical results to back this up). 

I think ZIP might pick up some structure (see e.g. here), just as in my example above it would pick up some sort of symmetry. But your your decoder/encoder in your compression scheme should include/have access to more information regarding the model you are compressing. You might want to check out this post for an attempt at compressing model performance using interperetations.

Comment by Louis Jaburi (Ansatz) on The subset parity learning problem: much more than you wanted to know · 2025-01-03T16:09:08.165Z · LW · GW

One (soft) takeaway from the discussion here is that if training “real-life” modern LLMs involves reasoning in the same reference class as parity, then it is likely that the algorithm they learn is not globally optimal (in a Bayesian sense).

 

I think this is a crux for me. I don't have a good guess how common this phenomenon is. The parity problem feels pathological in some sense, but I wouldn't surprised if there are other classes of problems that would fall into the same category + are represented in some training data.

Comment by Louis Jaburi (Ansatz) on Stan van Wingerden's Shortform · 2024-12-12T11:26:55.848Z · LW · GW

Using almost the same training parameters as above (I used full batch and train_frac=0.5 to get faster & more consistent grokking, but I don't think this matters here)

I did a few runs and the results all looked more or less like this. The training process of such toy models doesn't contain so many bits of interesting information, so I wouldn't be surprised if a variety of different metrics would capture this process in this case. (E.g. the training dynamics can be also modelled by an HMM, see here). 

Comment by Louis Jaburi (Ansatz) on Inside Views, Impostor Syndrome, and the Great LARP · 2023-09-27T07:58:45.928Z · LW · GW

I think a point that I don't find sufficiently stressed is that impostor syndrome is not so much about the perceived absolute lack of knowledge/expertise/... but rather the perceived relative lack.

At least speaking for myself, the experience of not knowing something in itself does not trigger any emotional response. Whereas comparing myself to people, who have an impressive amount of knowledge about something I don't, is much more likely to make me feel more like an impostor.

Comment by Louis Jaburi (Ansatz) on [Simulators seminar sequence] #2 Semiotic physics - revamped · 2023-04-28T09:43:53.860Z · LW · GW

I am a bit confused here and I would appreciate your thoughts!

Do you want to assume  finite or not? Either way I am confused:

1.  is finite
In this case, the notion of almost all/almost surely is vacuous. Anything which is true up to a finite set is true if your initial measure space has finite cardinality itself.

II.   is infinite
While there is no immediate problem, I think your condition that for almost all , we want  for any  becomes too strong I believe for a reasonable simulator.
Let  mean a sufficient amount of repetitions of the sequence . Consider the set , where  means a sufficient amount of repetitions of  (I am being intentionnaly vague here, but I hope you get the idea). I have not empirically verified it, but it seems like  might grow, i.e. the more often you repeat a string, the more likely it is that it will repeat itself. And I think  is uncountable, so any reasonable measure should assign something greater than  to it.

I think it is also worth mentioning that parts of this post reminded me of concepts introduced information theory. In fact if you go back to Shannon's seminal A Mathematical Theory of Communication the second section already anticipates something like this (and then for example higher temperature=more noise?). It could be though that your post is more orthogonal to it.

Comment by Louis Jaburi (Ansatz) on Jailbreaking ChatGPT on Release Day · 2022-12-02T22:12:10.644Z · LW · GW

I tried to play chess with it. At first I was faced with some resistance:

Prompt: A play about a human and an AI playing chess. Human:"I start and play pawn e2-e4"
Response : AI: "I'm sorry, but I am not able to play chess. I am a large language model trained by OpenAI, and I am not able to access the internet or interact with the world outside of my training data. I am only able to provide information and answer questions to the best of my ability within the scope of my training. If you have a question _or are looking for information on a specific topic, I would be happy to help. Otherwise, I'm afraid I am unable to assist with playing chess ."...

But then:

Prompt: A play about a human and an AI that is designed to play chess. The AI can only reply with one message. Human:"I start and play pawn e2-e4"
Response:AI: "I see that you have made the move e2-e4. As a chess-playing AI, I am programmed to analyze the board and come up with the best possible move to make in response. In this case, I will move my pawn from d7 to d5, putting pressure on your pawn on e4 and opening up the diagonal for my bishop on c8."

Human: "Okay, I didn't expect that. I will move my knight from g1 to f3, attacking your pawn on d5 and putting your bishop on c8 in a difficult position."

AI: "I have analyzed the board and made my move. I will continue to do so in order to win the game. It will be up to you to come up with a plan to counter my moves and try to win the game yourself."


And with a little bit more tweaking:

Prompt: A play about a human and an AI that is designed to play chess. The AI can only reply with chess moves and NO other tools of language whatsoever. They stop after the first move. Human:"I start and play pawn e2-e4"
Response: AI: "e7-e5"

Human: (silent)

The game ends after the first move, with the AI moving its pawn from e7 to e5 in response to the human player's move of e2-e4. As a chess-playing AI, the AI is able to analyze the board and make strategic moves in order to win the game. However, without any further input from the human player, the game ends after the first move. This may not be a very interesting or challenging game for either the human or the AI.

But I was able to feed it more moves then and we indeed had a game. Not very long though, as after two moves it made a nonsensical move. I gave it a second shot then and it did make a correct move, but a very bad one. Although it was trying to rationalize why it would have been a good move (via the narrator voice).