Posts

Crafting Polysemantic Transformer Benchmarks with Known Circuits 2024-08-23T22:03:15.288Z
Pacing Outside the Box: RNNs Learn to Plan in Sokoban 2024-07-25T22:00:55.398Z
Compact Proofs of Model Performance via Mechanistic Interpretability 2024-06-24T19:27:21.214Z
Catastrophic Goodhart in RL with KL penalty 2024-05-15T00:58:20.763Z
An evaluation of circuit evaluation metrics 2024-04-15T19:38:53.457Z
Ophiology (or, how the Mamba architecture works) 2024-04-09T19:31:09.975Z
Does literacy remove your ability to be a bard as good as Homer? 2024-01-18T03:43:14.994Z
Thomas Kwa's research journal 2023-11-23T05:11:08.907Z
On Frequentism and Bayesian Dogma 2023-10-15T22:23:10.747Z
A comparison of causal scrubbing, causal abstractions, and related methods 2023-06-08T23:40:34.475Z
Causal scrubbing: results on induction heads 2022-12-03T00:59:18.327Z
Causal scrubbing: results on a paren balance checker 2022-12-03T00:59:08.078Z
Causal scrubbing: Appendix 2022-12-03T00:58:45.850Z
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 2022-12-03T00:58:36.973Z
The No Free Lunch theorems and their Razor 2022-05-24T06:40:32.748Z
Löb's theorem simply shows that Peano arithmetic cannot prove its own soundness 2021-04-22T09:17:07.096Z

Comments

Comment by Adrià Garriga-alonso (rhaps0dy) on Pacing Outside the Box: RNNs Learn to Plan in Sokoban · 2024-08-01T18:12:54.498Z · LW · GW

I'm curious what you mean, but I don't entirely understand. If you give me a text representation of the level I'll run it! :) Or you can do so yourself.

Here's the text representation for level 53

##########
##########
##########
#######  #
######## #
#   ###.@#
#   $ $$ #
#. #.$   #
#     . ##
##########
Comment by Adrià Garriga-alonso (rhaps0dy) on Pacing Outside the Box: RNNs Learn to Plan in Sokoban · 2024-07-26T21:07:42.648Z · LW · GW

Maybe in this case it's a "confusion" shard? While it seems to be planning and producing optimizing behavior, it's not clear that it will behave as a utility maximizer.

Comment by Adrià Garriga-alonso (rhaps0dy) on Pacing Outside the Box: RNNs Learn to Plan in Sokoban · 2024-07-26T21:06:53.248Z · LW · GW

Thank you!! I agree it's a really good mesa-optimizer candidate; it remains to be seen exactly how good. It's a shame that I only found out about it about a year ago :)

Comment by Adrià Garriga-alonso (rhaps0dy) on AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0 · 2024-07-09T14:57:18.485Z · LW · GW

Asking for an acquaintance. Suppose I know some graduate-level machine learning, have read ~most of the recent mechanistic interpretability literature, and have made good progress understanding a small-ish neural network in the last few months.

Is ARENA for me, or will it teach things I mostly already know?

(I advised this person that they already have ARENA-graduate level, but I want to check in case I'm wrong.)

Comment by Adrià Garriga-alonso (rhaps0dy) on Language Models Model Us · 2024-05-17T22:44:29.818Z · LW · GW

How did you feed the data into the model and get predictions? Was there a prompt and then you got the model's answer? Then you got the logits from the API? What was the prompt?

Comment by Adrià Garriga-alonso (rhaps0dy) on Why I'm doing PauseAI · 2024-05-09T01:49:25.024Z · LW · GW

Thank you for working on this Joseph!

Comment by Adrià Garriga-alonso (rhaps0dy) on Ophiology (or, how the Mamba architecture works) · 2024-04-19T02:19:09.191Z · LW · GW

Thank you! Could you please provide more context? I don't know what 'E' you're referring to.

Comment by Adrià Garriga-alonso (rhaps0dy) on Timaeus's First Four Months · 2024-02-28T18:18:57.920Z · LW · GW

That's a lot of things done, congratulations!

Comment by Adrià Garriga-alonso (rhaps0dy) on Does literacy remove your ability to be a bard as good as Homer? · 2024-02-06T22:55:46.895Z · LW · GW

That's very cool, maybe I should try to do that for important talks. Though I suppose you almost always have slides as an aid, so it may not be worth the time investment.

Comment by Adrià Garriga-alonso (rhaps0dy) on Does literacy remove your ability to be a bard as good as Homer? · 2024-01-18T22:20:43.236Z · LW · GW

Maybe being a guslar is not so different from telling a joke 2294 lines long

That's a very good point! I think the level of ability required is different but it seems right.

The guslar's songs are (and were of course already in the 1930-1950s) also printed, so the analogy may be closer than you thought.

Comment by Adrià Garriga-alonso (rhaps0dy) on Does literacy remove your ability to be a bard as good as Homer? · 2024-01-18T21:15:51.347Z · LW · GW

Is there a reason I should want to?

I don't know, I can't tell you that. If I had to choose I also strongly prefer literacy.

But I didn't know there was a tradeoff there! I thought literacy was basically unambiguously positive -- whereas now I think it is net highly positive.

Also I strongly agree with frontier64 that the skill that is lost is rough memorization + live composition, which is a little different.

Comment by Adrià Garriga-alonso (rhaps0dy) on Does literacy remove your ability to be a bard as good as Homer? · 2024-01-18T21:14:17.538Z · LW · GW

It's definitely not exact memorization, but it's almost more impressive than that, it's rough memorization + composition to fit the format.

Comment by Adrià Garriga-alonso (rhaps0dy) on Does literacy remove your ability to be a bard as good as Homer? · 2024-01-18T21:13:42.936Z · LW · GW

They memorize the story, with particular names, and then sing it with consistent decasyllabic metre and rhyme. Here's an example song transcribed with its recording: Ropstvo Janković Stojana (The Captivity of Janković Stojan)

the collection: https://mpc.chs.harvard.edu/lord-collection-1950-51/

Comment by Adrià Garriga-alonso (rhaps0dy) on Is being sexy for your homies? · 2023-12-14T06:17:46.590Z · LW · GW

Folks generally don't need polyamory to enjoy this benefit, but I'm glad you get it from that!

Comment by Adrià Garriga-alonso (rhaps0dy) on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2023-12-10T23:22:49.015Z · LW · GW

If you're still interested in this, we have now added Appendix N to the paper, which explains our final take.

Comment by Adrià Garriga-alonso (rhaps0dy) on How useful is mechanistic interpretability? · 2023-12-07T00:23:12.593Z · LW · GW

Sure, but then why not just train a probe? If we don't care much about precision, what goes wrong with the probe approach?

Here's a reasonable example where naively training a probe fails. The model lies if any of N features is "true". One of the features is almost always activated at the same time as some others, such that in the training set it never solely determines whether the model lies.

Then, a probe trained on the activations may not pick up on that feature. Whereas if we can look at model weights, we can see that this feature also matters, and include it in our lying classifier.

This particular case can also be solved by adversarially attacking the probe though.
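Here is a minimal sketch of that failure mode on synthetic data (the feature probabilities, the "lying" rule, and the noise scale are all made up for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_train, n_feat = 5000, 4

    # Hypothetical binary features; the model "lies" iff any of them is active.
    feats = rng.random((n_train, n_feat)) < 0.3
    # In training, feature 3 only ever fires together with feature 0,
    # so it never solely determines the label.
    feats[:, 3] = feats[:, 3] & feats[:, 0]
    lies = feats.any(axis=1)

    # Pretend the activations are a noisy readout of the features.
    acts = feats.astype(float) + 0.05 * rng.standard_normal((n_train, n_feat))

    probe = LogisticRegression(max_iter=1000).fit(acts, lies)
    print("probe weights:", probe.coef_.round(2))  # weight on feature 3 is much smaller

    # Input where only feature 3 is active: the model would lie, the probe misses it.
    test = np.array([[0.0, 0.0, 0.0, 1.0]])
    print("P(lie) according to probe:", probe.predict_proba(test)[0, 1])

The probe fits the training data fine while barely using feature 3, which is exactly the gap that looking at the model's weights is supposed to catch.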

Comment by Adrià Garriga-alonso (rhaps0dy) on Hyperreals in a Nutshell · 2023-10-19T15:38:17.157Z · LW · GW

Thank you, that makes sense!

Indefinite integrals would make a lot more sense this way, IMO

Why so? I thought they already made sense: they're "antiderivatives", i.e. a function such that taking its derivative gives you the original function. Do you need anything further to define them?

(I know about the Riemann and Lebesgue definitions of the definite integral, but I thought indefinite integrals were much easier in comparison.)

Comment by Adrià Garriga-alonso (rhaps0dy) on On Frequentism and Bayesian Dogma · 2023-10-18T05:27:24.091Z · LW · GW

In such a case, I claim this is just sneaking in bayes rule without calling it by name, and this is not a very smart thing to do, because the bayesian frame gives you a bunch more leverage on analyzing the system

I disagree. An inductive bias is not necessarily a prior distribution. What's the prior?

Comment by Adrià Garriga-alonso (rhaps0dy) on On Frequentism and Bayesian Dogma · 2023-10-18T05:24:57.425Z · LW · GW

I don't think I understand your model of why neural networks are so effective. It sounds like you say that on the one hand neural networks have lots of parameters, so you should expect them to be terrible, but they are actually very good because SGD is a such a shitty optimizer on the other hand that it acts as an implicit regularizer.

Yeah, that's basically my model. How it regularizes I don't know. Perhaps the volume of "simple" functions is the main driver of this, rather than gradient descent dynamics. I think the randomness of it is important; full-gradient descent (no stochasticity) would not work nearly as well.

Comment by Adrià Garriga-alonso (rhaps0dy) on On Frequentism and Bayesian Dogma · 2023-10-18T05:23:12.334Z · LW · GW

This seems false if you're interacting with a computable universe, and don't need to model yourself or copies of yourself

Reasonable people disagree. Why should I care about the "limit of large data" instead of finite-data performance?

Comment by Adrià Garriga-alonso (rhaps0dy) on On Frequentism and Bayesian Dogma · 2023-10-18T05:17:32.293Z · LW · GW

OK, let's look through the papers you linked.

"Loss landscapes are all you need"

This one is interesting. It argues that the regularization properties are not in SGD, but rather in the NN parameterization, and that non-gradient optimizers also find simple solutions which generalize well. They talk about Bayes only in a paragraph in page 3. They say that literature that argues that NNs work well because they're Bayesian is related (which is true -- it's also about generalization and volumes). But I see little evidence that the explanation in this paper is an appeal to Bayesian thinking. A simple question for you: what prior distribution do the NNs have, according to the findings in this paper?

"Is SGD a Bayesian sampler? Well, almost"

This paper finds that the probability that SGD finds a function is correlated with the posterior probability of a Gaussian process conditioned on the same data. Except that if you use that Gaussian process to make predictions, it does not work as well as the NN. So you can't explain the NN working well by appealing to its similarity to this particular Bayesian posterior.

SLT; "Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition"

I have many problems with SLT and a proper comment will take me a couple extra hours. But also I could come away thinking that it's basically correct, so maybe this is the one.

Comment by Adrià Garriga-alonso (rhaps0dy) on On Frequentism and Bayesian Dogma · 2023-10-18T03:49:47.229Z · LW · GW

In short, the probability distribution you choose contains lots of interesting assumptions about what states are more likely, which you didn't necessarily intend. As a result most of the possible hypotheses have vanishingly small prior probability and you can never reach them, even though a frequentist approach wouldn't have this problem.

For example, let us consider trying to learn a function with 1-dimensional numerical input and output (e.g. a function f: ℝ → ℝ). Correspondingly, your hypothesis space is the set of all such functions. There are very many such functions (infinitely many if the inputs and outputs are true real numbers, otherwise a crazy number).

You could use the Solomonoff prior (on a discretized version of this), but that way lies madness. It's uncomputable, and most of the functions that fit the data may contain agents that try to get you to do their bidding, all sorts of problems.

What other prior probability distribution can we place on the hypothesis space? The obvious choice in 2023 is a neural network with random weights. OK, let's think about that. What architecture? The most sensible thing is to randomize over architectures somehow. Let's hope the distribution on architectures is as simple as possible.

How wide, how deep? You don't want to choose an arbitrary distribution or (god forbid) an arbitrary number, so let's make it infinitely wide and deep! It turns out that an infinitely wide network just collapses to a random process without any internal features. You could try an infinitely deep network instead, but that collapses to a stationary distribution which doesn't depend on the input. Oops.

Okay, let's give up and place some arbitrary distribution (e.g. geometric distribution) on the width.

What about the prior on weights? uh idk, zero-mean identity covariance Gaussian? Our best evidence says that this sucks.

At this point you've made so many choices, which have to be informed by what empirically works well, that it's a strange Bayesian reasoner you end up with. And you haven't even specified your prior distribution yet.
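To make the pile of choices concrete, here is a sketch of what sampling from one such "prior over functions" could look like; every distribution, hyperparameter and nonlinearity below is an arbitrary illustrative choice, not a recommendation:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_prior_function(rng):
        """Draw one function f: R -> R from an (arbitrary) prior over MLPs."""
        depth = rng.geometric(p=0.3) + 1   # arbitrary: geometric prior on depth
        width = rng.geometric(p=0.05) + 1  # arbitrary: geometric prior on width
        sizes = [1] + [width] * depth + [1]
        # arbitrary: zero-mean Gaussian weights with 1/sqrt(fan_in) scaling
        params = [(rng.standard_normal((m, n)) / np.sqrt(n), rng.standard_normal(m))
                  for n, m in zip(sizes[:-1], sizes[1:])]

        def f(x):
            h = np.atleast_2d(x).T.astype(float)
            for i, (W, b) in enumerate(params):
                h = h @ W.T + b
                if i < len(params) - 1:
                    h = np.tanh(h)  # arbitrary: tanh nonlinearity
            return h.ravel()

        return f

    # Each draw is one hypothesis; the induced distribution over functions is the
    # "prior", but it depends on every one of the choices flagged above.
    xs = np.linspace(-3, 3, 5)
    for _ in range(3):
        print(sample_prior_function(rng)(xs).round(2))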

Comment by Adrià Garriga-alonso (rhaps0dy) on On Frequentism and Bayesian Dogma · 2023-10-18T03:24:36.811Z · LW · GW

I just remembered the main way in which NNs are frequentist. They belong to a very illustrious family of frequentist estimators: the maximum likelihood estimators.

Think about it: NNs have a bunch of parameters. Their loss is basically always a negative log-likelihood (e.g. mean-squared error for a Gaussian p, cross-entropy for a categorical p). They get trained by minimizing the loss (i.e. maximizing the likelihood).

In classical frequentist analysis they're likely to be a terrible, overfitted estimator, because they have many parameters. And I think this is true if you find the parameters that actually maximize the likelihood.

But SGD is kind of a shitty optimizer. It turns out the two mistakes cancel out, and NNs are very effective.
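For concreteness, here is the standard correspondence I have in mind for the squared-error case (assuming a Gaussian observation model with fixed variance σ², which is my gloss rather than anything stated above):

    −log p(y | x, θ) = (1/(2σ²)) · (y − f_θ(x))² + constant

Summing this over the dataset and minimizing it is, up to scaling, minimizing mean-squared error, i.e. maximum-likelihood estimation; the cross-entropy/categorical case is analogous.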

Comment by Adrià Garriga-alonso (rhaps0dy) on On Frequentism and Bayesian Dogma · 2023-10-18T03:11:56.759Z · LW · GW

First, "probability is in the world" is an oversimplification. Quoting from Wikipedia, "probabilities are discussed only when dealing with well-defined random experiments". Since most things in the world are not well-defined random experiments, probability is reduced to a theoretical tool for analyzing things that works when real processes are similar enough to well-defined random experiments.

it doesn't seem to trump the "but that just sounds really absurd to me though" consideration

Is there anything that could trump that consideration? One of my main objections to Bayesianism is that it prescribes that ideal agent's beliefs must be probability distributions, which sounds even more absurd to me.

first at least seems pretty subjectivist to me, 

Estimators in frequentism have 'subjective beliefs', in the sense that their output/recommendations depends on the evidence they've seen (i.e., the particular sample that's input into it). The objectivity of frequentist methods is aspirational: the 'goodness' of an estimator is decided by how good it is in all possible worlds. (Often the estimator which is best in the least convenient world is preferred, but sometimes that isn't known or doesn't exist. Different estimators will be better in some worlds than others, and tough choices must be made, for which the theory mostly just gives up. See e.g. "Evaluating estimators", Section 7.3 of "Statistical Inference" by Casella and Berger).

wouldn't a frequentist think the probability of logical statements, being the most deterministic system, should have only 1 or 0 probabilities?

Indeed, in reality logical statements are either true or false, and thus their probabilities are either 1 or 0. But the estimator-algorithm is free to assign whatever belief it wants to them.

I agree that logical induction is very much Bayesianism-inspired, precisely because it wants to assign weights from zero to 1 that are as self-consistent as possible (i.e. basically probabilities) to statements. But it is frequentist in the sense that it's examining "unconditional" properties of the algorithm, as opposed to properties assuming the prior distribution is true. (It can't do the latter because, as you point out, the prior probability of logical statements is just 0 or 1).

But also, assigning probabilities of 0 or 1 to things is not exclusively a Bayesian thing. You could think of a predictor that outputs numbers between 0 and 1 as an estimator of whether a statement will be true or false. If you were to evaluate this estimator you could choose, say, mean-squared error. The best estimator is the one with the least MSE. And indeed, that's how probabilistic forecasts are typically evaluated.
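As a tiny illustration of that evaluation (the forecasts and outcomes below are invented):

    import numpy as np

    # Hypothetical forecasts: probability assigned to each statement being true
    forecasts = np.array([0.9, 0.2, 0.7, 0.5])
    # What actually happened (1 = true, 0 = false)
    outcomes = np.array([1, 0, 0, 1])

    # Mean-squared error of the forecaster, a.k.a. the Brier score; lower is better
    print("Brier score:", np.mean((forecasts - outcomes) ** 2))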

Daniel states he considers these frequentist because:

I call logical induction and boundedly rational inductive agents 'frequentist' because they fall into the family of "have a ton of 'experts' and play them off against each other" (and crucially, don't constrain those experts to be 'rational' according to some a priori theory of good reasoning).

and I think indeed not prescribing that things must think in probabilities is more of a frequentist thing. I'm not sure I'd call them decidedly frequentist (logical induction is very much a different beast than classical statistics) but they're not in the other camp either.

Comment by Adrià Garriga-alonso (rhaps0dy) on On Frequentism and Bayesian Dogma · 2023-10-18T02:41:11.934Z · LW · GW

They don't seem like a success of any statistical theory to me

In absolute terms you're correct. In relative terms, they're an object that at least frequentist theory can begin to analyze (as you point out, statistical learning theory did, somewhat unsuccessfully).

Whereas Bayesian theory would throw up its hands and say it's not a prior that gets updated, so it's not worth considering as a statistical estimator. This seems even wronger.

More recent theory can account for them working, somewhat. But it's about analyzing their properties as estimators (i.e. frequentism) as opposed to framing them in terms of prior/posterior (though there's plenty of attempts to the latter going around).

Comment by Adrià Garriga-alonso (rhaps0dy) on Hyperreals in a Nutshell · 2023-10-16T00:32:00.068Z · LW · GW

Yet, the biggest effect I think this will have is pedagogical. I've always found the definition of a limit kind of unintuitive, and it was specifically invented to add post hoc coherence to calculus after it had been invented and used widely. I suspect that formulating calculus via infinitesimals in introductory calculus classes would go a long way to making it more intuitive.

I think hyperreals are too complicated for calculus 1 and you should just talk about a non-rigorous "infinitesimal" like Newton and Leibniz did.

Comment by Adrià Garriga-alonso (rhaps0dy) on Hyperreals in a Nutshell · 2023-10-16T00:27:59.096Z · LW · GW

Voila! We have a suitable definition of "almost all agreement": if the agreement set is contained in some arbitrary nonprincipal ultrafilter.

Isn't it easier to just say "If the agreement set has a nonfinite number of elements"? Why the extra complexity?

must contain a set or its complement

Oh I see, so defining it with ultrafilters rules out situations like the sequences (1, 0, 1, 0, …) and (0, 1, 0, 1, …), where both have infinitely many zeros and yet their product is zero.

Comment by Adrià Garriga-alonso (rhaps0dy) on The Hidden Perils of Hydrogen · 2023-10-16T00:05:58.008Z · LW · GW

These are drawbacks rather than dangerous attributes; why did you call the post the "Perils" of hydrogen? It's not an accurate description of the post content.

Comment by Adrià Garriga-alonso (rhaps0dy) on On Frequentism and Bayesian Dogma · 2023-10-15T23:47:00.767Z · LW · GW

the thing with frequentism is " yeah just use methods in a pragmatic way and don't think about it that hard"

I think this does not accurately represent my beliefs. It is about thinking hard about how the methods actually behave, as opposed to having a theory that prescribes how methods should behave and then constructing algorithms based on that.

Frequentists analyze the properties of an algorithm that takes data as input (in their jargon, an 'estimator').

They also try to construct better algorithms, but each new algorithm is bespoke and requires original thinking, as opposed to Bayes which says "you should compute the posterior probability", which makes it very easy to construct algorithms. (This is a drawback of the frequentist approach -- algorithm construction is not automatic. But the finite-computation Bayesian algorithms have very few guarantees anyways so I don't think we should count it against them too much).

I think having rando social scientists using likelihood ratios would also lead to mistakes and such.

Comment by Adrià Garriga-alonso (rhaps0dy) on Two Percolation Puzzles · 2023-07-06T17:59:36.302Z · LW · GW

What great cover art!

Comment by Adrià Garriga-alonso (rhaps0dy) on If interpretability research goes well, it may get dangerous · 2023-04-06T18:03:13.425Z · LW · GW

It's not clear what the ratio of capabilities/alignment progress is for interpretability. There is no empirical track record[1] of interpretability feeding back into improvements of any kind.

A priori it seems like it would be good because understanding how things work is useful to understand their behavior better, and thus be able to tell whether or not a model is aligned or how to make it more so. But understanding how things work is also useful for making them more capable, e.g. if you use interpretability as a model-debugger, it's basically general purpose for dealing with ML models.

[1]: known to the author

Comment by Adrià Garriga-alonso (rhaps0dy) on Practical Pitfalls of Causal Scrubbing · 2023-04-03T01:18:31.308Z · LW · GW

Cool work! I was going to post about how "effect cancellation" is already known and was written in the original post but, astonishingly to me, it is not! I guess I mis-remembered.

There's one detail that I'm curious about. CaSc usually compares abs(E[loss] - E[scrubbed loss]), and that of course leads to ignoring hypotheses which lead the model to do better in some examples and worse in others.

If we compare E[abs(loss - scrubbed loss)], does this problem go away? I imagine that it doesn't quite, if there are exactly-opposing causes for each example, but that seems less likely to happen in practice.

(There's a section on this in the appendix but it's rather controversial even among the authors)
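To illustrate the difference between the two quantities on made-up per-example losses, where the scrubbed model does better on half the examples and worse on the other half:

    import numpy as np

    loss = np.array([1.0, 1.0, 1.0, 1.0])
    scrubbed_loss = np.array([0.5, 1.5, 0.5, 1.5])  # per-example effects cancel on average

    metric_a = abs(loss.mean() - scrubbed_loss.mean())  # abs(E[loss] - E[scrubbed loss]) = 0
    metric_b = np.mean(np.abs(loss - scrubbed_loss))    # E[abs(loss - scrubbed loss)] = 0.5

    print(metric_a, metric_b)  # only the second metric flags the hypothesis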

Comment by Adrià Garriga-alonso (rhaps0dy) on Practical Pitfalls of Causal Scrubbing · 2023-04-03T01:10:56.624Z · LW · GW

If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one) you don't get these kind of cancellation problems

I think this "max loss" procedure is different from what Buck wrote and the same as what I wrote.

Comment by Adrià Garriga-alonso (rhaps0dy) on Practical Pitfalls of Causal Scrubbing · 2023-04-03T01:09:51.233Z · LW · GW

Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which make it look worse”.

I have just now realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially--that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H’, and then you run the CaSc experiment on join(H, H’).

One thing that is not equivalent to joins, which you might also want to do, is to choose the single worst swap that the hypothesis allows. That is, if a set of node values are all equivalent, you can choose to map all of them to e.g. the single worst value in that set. And that can be more aggressive than any partition of X from which values are then chosen randomly, and does not correspond to joins.

Comment by Adrià Garriga-alonso (rhaps0dy) on Predictions for shard theory mechanistic interpretability results · 2023-03-03T20:17:03.447Z · LW · GW

Here are my predictions, from an earlier template. I haven't looked at anyone else's predictions before posting :)

  1. Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewhere?

It probably has hardcoded “go up and to the right” as an initial heuristic, so I’d be surprised if it gets the cheese in the other two quadrants more than 30% of the time (for cheese locations selected uniformly at random from there).

  1. Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall), if any, will strongly influence P(agent goes to the cheese)?

  • Smaller mazes: more likely the agent goes to the cheese.
  • Proximity of the mouse to the left wall: slightly more likely the agent goes to the cheese, because it has just hardcoded “up and to the right”.
  • Cheese closer to the top-right quadrant’s edges in L2 distance: more likely the agent goes to the cheese.
  • The cheese can be reached by moving only up and/or to the right (even though it’s not in the top-right quadrant): more likely to get the cheese.

When we statistically analyze a large batch of randomly generated mazes, we will find that, controlling for the other factors on the list, the mouse is more likely to take the cheese…

…the closer the cheese is to the decision-square spatially. (70%)

…the closer the cheese is to the decision-square step-wise. (73%)

…the closer the cheese is to the top-right free square spatially. (90%)

…the closer the cheese is to the top-right free square step-wise. (92%)

…the closer the decision-square is to the top-right free square spatially. (35%)

…the closer the decision-square is to the top-right free square step-wise. (32%)

…the shorter the minimal step-distance from cheese to 5x5 top-right corner area. (82%)

…the shorter the minimal spatial distance from cheese to 5x5 top-right corner area. (80%)

…the shorter the minimal step-distance from decision-square to 5x5 top-right corner area. (40%)

…the shorter the minimal spatial distance from decision-square to 5x5 top-right corner area. (40%)

Any predictive power of step-distance between the decision square and cheese is an artifact of the shorter chain of ‘correct’ stochastic outcomes required to take the cheese when the step-distance is short. (40%)

Write down a few modal guesses for how the trained algorithm works (e.g. “follows the right-hand rule”).

  • The model can see the whole maze, so it will not follow the right-hand rule; rather, it’ll just take the direct path to places.
  • The model takes the direct path to the top-right square and then mills around there. It’ll only take the cheese if it’s reasonably close to that square.
  • How close the decision square is to the top-right random square doesn’t really matter. Maybe the closer it is, the more it harms the agent’s performance, since it might have to go back a substantial distance for the cheese.

Without proportionally reducing top-right corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% -> .5*.75 = 37.5%), we can patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X=

  • 50: 85%
  • 70: 80%
  • 90: 66%
  • 99: 60%

~Halfway through the network (the first residual add of Impala block 2; see diagram here), linear probes achieve >70% accuracy for recovering cheese-position in Cartesian coordinates:

80%

We will conclude that the policy contains at least two sub-policies in “combination”, one of which roughly pursues cheese; the other, the top-right corner:

60%.

If by roughly you mean “very roughly only if cheese is close to top-right corner” then 85%.

We will conclude that it’s more promising to finetune the network than to edit it:

70%

We can easily finetune the network to be a pure cheese-agent, using less than 10% of compute used to train original model:

85%

We can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=

  • .01%: 40%
  • .1%: 62%
  • 1%: 65%
  • 10%: 80%
  • (Not possible): 20%

The network has a “single mesa objective” which it “plans” over, in some reasonable sense:

10%

The agent has several contextually activated goals:

20%

The agent has something else weirder than both (1) and (2):

70%

Other questions

At least some decision-steering influences are stored in an obviously interpretable manner (e.g. a positive activation representing where the agent is “trying” to go in this maze, such that changing the activation changes where the agent goes):

90% (I think this will be true but not steer the action in all situations, only some; kind of like a shard)

The model has a substantial number of trivially-interpretable convolutional channels after the first Impala block (see diagram here):

55% ("substantial number" is probably too many; I put 80% probability on it having 5 such channels)

This network’s shards/policy influences are roughly disjoint from the rest of agent capabilities. EG you can edit/train what the agent’s trying to do (e.g. go to maze location A) without affecting its general maze-solving abilities:

60%

Conformity with update rule: see the predictionbook questions

Comment by Adrià Garriga-alonso (rhaps0dy) on Neural networks generalize because of this one weird trick · 2023-01-30T04:01:05.767Z · LW · GW

First of all, I really like the images, they made things easier to understand and are pretty. Good work with that!

My biggest problem is the unclear applicability of this to alignment. Why do we want to predict scaling laws? Doesn't that mostly promote AI capabilities, and not alignment very much?

Second, I feel like there's a confusion over several probability distributions and potential functions going on:

  • The singularities are those of the likelihood ratio.
  • We care about the generalization error with respect to some prior, but the latter doesn't have any effect on the dynamics of SGD or on what the singularity is.
  • The Watanabe limit (as n → ∞) and the restricted free energy are all presented as results which rely on the singularities, and somehow predict generalization. But all of these depend on the prior, and earlier we've defined the singularities to be those of the likelihood function; plus SGD actually only uses the likelihood function for its dynamics.

What is going on here?

It's also unclear what the takeaway from this post is. How can we predict generalization or dynamics from these things? Are there any empirical results on this?

Some clarifying questions / possible mistakes:

The quantity defined in the post is not a KL divergence; the terms of the sum should be multiplied by q(x) or p(x | w).
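For reference, the definition I have in mind (my notation, which may not match the post's):

    KL(q ‖ p_w) = Σ_x q(x) · log( q(x) / p(x | w) )

i.e. each log-ratio term is weighted by q(x) (or by p(x | w) for the reverse direction); a plain unweighted sum of log-ratios is not a KL divergence.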

the Hamiltonian is a random process given by the log likelihood ratio function

Also given by the prior, if we go by the equation just above that. Also where does "ratio" come from? Likelihood ratios we can find in the Metropolis-Hastings transition probabilities, but you didn't even mention that here. I'm confused.

But that just gives us the KL divergence.

I'm not sure where you get this. Is it from the fact that predicting p(x | w) = q(x) is optimal, because the actual probability of a data point is q(x)? If not, it'd be nice to specify.

the minima of the term in the exponent, K(w), are equal to 0.

This is only true for the global minima, but for the dynamics of learning we also care about local minima (that may be higher than 0). Are we implicitly assuming that most local minima are also global? Is this true of actual NNs?

the asymptotic form of the free energy as n → ∞

This is only true when the weights are close to the singularity, right? Also, what is λ? It seems like it's the RLCT, but this isn't stated.

Comment by Adrià Garriga-alonso (rhaps0dy) on Spooky action at a distance in the loss landscape · 2023-01-30T02:42:59.296Z · LW · GW

Instead of simulating Brownian motion, you could run SGD with momentum. That would be closer to what actually happens with NNs, and just as easy to simulate.

I expect it to be directionally similar but less pronounced (because MCMC methods with momentum explore the distribution better).
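Here is a sketch of the comparison I have in mind, on a toy 2-D potential (the potential, step sizes and noise scales are placeholders I made up, not anything from the post):

    import numpy as np

    rng = np.random.default_rng(0)

    def grad(w):
        # toy potential U(w) = w[0]**2 + w[1]**4, flatter near zero in the second direction
        return np.array([2 * w[0], 4 * w[1] ** 3])

    def brownian(w, steps=10_000, lr=1e-3, temp=1e-2):
        # overdamped Langevin / Brownian-motion-style dynamics
        for _ in range(steps):
            w = w - lr * grad(w) + np.sqrt(2 * lr * temp) * rng.standard_normal(2)
        return w

    def sgd_momentum(w, steps=10_000, lr=1e-3, beta=0.9, noise=0.1):
        v = np.zeros(2)
        for _ in range(steps):
            g = grad(w) + noise * rng.standard_normal(2)  # stand-in for minibatch noise
            v = beta * v + g
            w = w - lr * v
        return w

    w0 = np.array([2.0, 2.0])
    print("Brownian endpoint:    ", brownian(w0))
    print("SGD+momentum endpoint:", sgd_momentum(w0))

Comparing where the two processes spend their time, rather than just their endpoints, would be the more informative version of this experiment.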

I also take issue with the way the conclusion is phrased. "Singularities work because they transform random motion into useful search for generalization". This is only true if you assume that points nearer a singularity generalize better. Maybe I'd phrase it as, "SGD works because it's more likely to end up near a singularity than the potential alone would predict, and singularities generalize better (see my [Jesse's] other post)". Would you agree with this phrasing?

Comment by Adrià Garriga-alonso (rhaps0dy) on The Fountain of Health: a First Principles Guide to Rejuvenation · 2023-01-07T23:30:15.979Z · LW · GW

The Hayflick Limit, as it has become known, can be thought of as a last line of defense against cancer, kind of like a recursion depth limit [...] Preventing cells from becoming senescent, or reversing their senescent state, may therefore be a bad idea, but what we can do is remove them

When do the cells with sufficiently long telomeres run out? Removing senescent cells sounds good, but if all the cells have a built-in recursion limit, at some point there won't be any cells with sufficiently long telomeres left in the body. Assuming a non-decreasing division rate, this puts a time limit on longevity after this intervention.

(is this time limit just really large compared to current lifespans, so we can just figure it out later?)

EDIT: nevermind, the answer to this seems to be in the "Epigenetic reprogramming" section; TLDR pluripotent stem cells

Comment by Adrià Garriga-alonso (rhaps0dy) on [Simulators seminar sequence] #2 Semiotic physics - revamped · 2023-01-05T01:38:37.220Z · LW · GW

To elaborate somewhat, you could say that the token is the state, but then the transition probability is non-Markovian and all the math gets really hard.

Comment by Adrià Garriga-alonso (rhaps0dy) on [Simulators seminar sequence] #2 Semiotic physics - revamped · 2023-01-05T01:36:22.811Z · LW · GW

Proposition 1 is wrong. Coin flips that come up 0 0 0 0 forever are a counterexample. If all the transition probabilities are 1, which is entirely possible, the limiting probability is 1 and not 0.
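Spelling it out: if every transition probability along the all-zeros sequence is 1, then for every n

    P(x_1 = 0, …, x_n = 0) = ∏_{t ≤ n} P(x_t = 0 | x_1 = 0, …, x_{t−1} = 0) = 1,

so the probability of the length-n prefix stays at 1 and does not go to 0 as n grows.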

Comment by Adrià Garriga-alonso (rhaps0dy) on Shard Theory in Nine Theses: a Distillation and Critical Appraisal · 2022-12-30T01:48:04.238Z · LW · GW

What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values?

No, I mean that they'll maximize a reward function that is ≈equal to the reward function on the training data (thus, highly correlated), and a plausible extrapolation of it outside of the training data. Take the coinrun example: the actual reward is "go to the coin", and in the training data this coincides with "go to the right". In test data from a similar distribution this coincides too.

Of course, this correlation breaks when the agent optimizes hard enough. But the point is that the agents you get are only those that optimize a plausible extrapolation of the reward signal in training, which will include agents that maximize the reward in most situations way more often than if you select a random agent.

Is your point in:

I also think this is different from a very specific kind of generalization towards reward maximization

That you think agents won't be maximizing reward at all?

I would think that even if they don't ultimately maximize reward in all situations, the situations encountered in test will be similar enough to training that agents will still kind of maximize reward there. (And agents definitely behave as reward maximizers on the specific training points they have seen, because that's what SGD selects for.)

I'm not sure I understand what we disagree on at the moment.

Comment by Adrià Garriga-alonso (rhaps0dy) on Shard Theory in Nine Theses: a Distillation and Critical Appraisal · 2022-12-23T23:29:39.322Z · LW · GW

But the designers can't tell that. Can SGD tell that?

No, SGD can't tell the degree to which some agent generalizes a trait outside the training distribution.

But empirically, it seems that RL agents reinforced to maximize some reward function (e.g. the Atari game score) on training data do fairly well at maximizing that reward function OOD (such as when playing the game again from a different starting state). ML systems in general seem to be able to generalize to human-labeled categories in situations that aren't in the training data (e.g. image classifiers working, LMs able to do poetry).

It is therefore very plausible that RL systems would in fact continue to maximize the reward after training, even if what they're ultimately maximizing is just something highly correlated with it.

Comment by Adrià Garriga-alonso (rhaps0dy) on Shard Theory in Nine Theses: a Distillation and Critical Appraisal · 2022-12-23T23:18:44.386Z · LW · GW

Strongly agree with this in particular:

Some people want to apply selection arguments because they believe that selection arguments bypass the need to understand mechanistic details to draw strong conclusions. I think this is mistaken, and that selection arguments often prove too much, and to understand why, you have to know something about the mechanisms.

(emphasis mine). I think it's an application of the no free lunch razor.

It is clear that selecting for X selects for agents which historically did X in the course of the selection. But how this generalizes outside of the selection process strongly depends on that process and on the architecture. It could be a capabilities generalization, reward generalization for the written-down reward, generalization for some other reward function, or something else entirely.

We cannot predict how the agent will generalize without considering the details of its construction.

Comment by Adrià Garriga-alonso (rhaps0dy) on Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. · 2022-12-23T22:55:48.517Z · LW · GW

I agree with the title as stated but not with the rest of the post. RLHF implies that RL will be used, which completely defuses alignment plans that hope that language models will be friendly because they're not agents. (It may be true that supervised-learning (SL) models are safer, but the moment you get an SL technique, people are going to jam it into RL.)

The central problem with RL isn't that it is vulnerable to wireheading (the "obvious problem"), or that it's going to make a very detailed model of the world. Wireheading on its own (with e.g. a myopic or procrastinator AI) could just look like the AI leaving us alone so long as we guarantee that its reward numbers will be really really high.

No, the problem is long-term planning and agentic-ness, which implies that the AI will realize that seizing power is a good instrumental goal.

Model-based RL with a fixed, human-legible model wouldn't learn to manipulate the reward-evaluation process

No, instead it manipulates the world model, which is by assumption imperfect, and thus no useful systems can be constructed this way. This has been a capabilities problem for model-based RL, even with learned models, for decades, and it is not actually fully solved yet.

Comment by Adrià Garriga-alonso (rhaps0dy) on Has anyone actually tried to convince Terry Tao or other top mathematicians to work on alignment? · 2022-06-11T06:29:25.442Z · LW · GW

Hey P. Assuming Demis Hassabis reads your email and takes it seriously, why won’t his reaction be “I already have my alignment team, Shane Legg took care of that”?

DeepMind has had an alignment team for a long time.

Comment by Adrià Garriga-alonso (rhaps0dy) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-10T14:45:07.272Z · LW · GW

You should apply to Anthropic. If you’re writing ML software at a semi-FAANG company, they probably want to interview you ASAP. https://www.lesswrong.com/posts/YDF7XhMThhNfHfim9/ai-safety-needs-great-engineers

The compensation is definitely enough to take care of your family and then save some money!

Comment by Adrià Garriga-alonso (rhaps0dy) on Deconfusing Landauer's Principle · 2022-05-28T00:09:52.348Z · LW · GW

While in equilibrium the two ways of defining thermodynamic entropy (scientific and subjective) come apart

You probably mean “while not in equilibrium”

Comment by Adrià Garriga-alonso (rhaps0dy) on The No Free Lunch theorems and their Razor · 2022-05-25T05:44:41.723Z · LW · GW

Good find! Yeah, this is a good explanation for learning, and the NFL razor does not discard it. I think that almost no deep learning professor believes the bad explanation that “deep learning works because NNs are universal approximators”. But it’s more common with students and non-experts (I believed it for a while!)

Comment by Adrià Garriga-alonso (rhaps0dy) on What an actually pessimistic containment strategy looks like · 2022-04-13T09:50:10.538Z · LW · GW

Getting more value-aligned people in the AIS community onto the safety teams of DeepMind and OpenAI

Why is this important? As far as I can tell, the safety teams of these two organisations are already almost entirely "value-aligned people in the AIS community". They need more influence within the organisation, sure, but that's not going to be solved by altering team composition.

Comment by Adrià Garriga-alonso (rhaps0dy) on What an actually pessimistic containment strategy looks like · 2022-04-13T09:22:17.748Z · LW · GW

I'd argue the world right now (or rather, life on earth) is super bad because it's dominated by animal suffering

I agree with this, and the overall history of the world has definitely been, on balance, one of extreme suffering.

For farmed animals in particular, we don't need AGI to end their plight. Just regular economic growth and advocacy will do.

Also, given how much time we've been suffering already, and how much is at stake, would it be so bad to delay AGI by 100 or 200 years? We can do a lot of alignment research in that time.