We Should Prepare for a Larger Representation of Academia in AI Safety 2023-08-13T18:03:19.799Z
Andrew Ng wants to have a conversation about extinction risk from AI 2023-06-05T22:29:07.510Z
Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios 2023-05-16T10:53:32.968Z
[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques 2023-03-16T16:38:33.735Z
Natural Abstractions: Key claims, Theorems, and Critiques 2023-03-16T16:37:40.181Z
Andrew Huberman on How to Optimize Sleep 2023-02-02T20:17:12.010Z
Experiment Idea: RL Agents Evading Learned Shutdownability 2023-01-16T22:46:03.403Z
Disentangling Shard Theory into Atomic Claims 2023-01-13T04:23:51.947Z
Citability of Lesswrong and the Alignment Forum 2023-01-08T22:12:02.046Z
A Short Dialogue on the Meaning of Reward Functions 2022-11-19T21:04:30.076Z
Leon Lang's Shortform 2022-10-02T10:05:36.368Z
Distribution Shifts and The Importance of AI Safety 2022-09-29T22:38:12.612Z
Summaries: Alignment Fundamentals Curriculum 2022-09-18T13:08:05.335Z


Comment by Leon Lang (leon-lang) on Sharing Information About Nonlinear · 2023-09-08T22:36:06.890Z · LW · GW

(Fwiw, I don’t remember problems with stipend payout at SERI MATS in the winter program. I was a winter scholar 2022/23.)

Comment by Leon Lang (leon-lang) on Long-Term Future Fund: April 2023 grant recommendations · 2023-08-02T09:41:43.743Z · LW · GW

This is very helpful, thanks! Actually, the post includes several sections, including in the appendix, that might be more interesting to many readers than the grant recommendations themselves. Maybe it would be good to change the title a bit so that people also expect other updates.

Comment by Leon Lang (leon-lang) on DSLT 2. Why Neural Networks obey Occam's Razor · 2023-07-12T06:26:39.865Z · LW · GW

Thanks for the reply!

As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. ). Precisely, suppose  is the number of parameters, then you are in the regular case if  can be expressed as a full-rank quadratic form near each singularity, 

Anything less than this is a strictly singular case. 

So if , then  is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it's justified from the algebraic-geometry perspective.

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2023-07-05T09:15:29.588Z · LW · GW

Zeta Functions in Singular Learning Theory

In this shortform, I very briefly explain my understanding of how zeta functions play a role in the derivation of the free energy in singular learning theory. This is entirely based on slide 14 of the SLT low 4 talk of the recent summit on SLT and Alignment, so feel free to ignore this shortform and simply watch the video.

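The story is this: we have a prior $\varphi(w)$, a model $p(x \mid w)$, and there is an unknown true distribution $q(x)$. For model selection, we are interested in the evidence of our model for a data set $D_n = \{x_1, \dots, x_n\}$, which is given (up to a factor $\prod_i q(x_i)$ that does not depend on $w$) by

$$Z_n = \int \varphi(w) \, e^{-n K_n(w)} \, dw,$$

where $K_n(w) = \frac{1}{n} \sum_{i=1}^n \log \frac{q(x_i)}{p(x_i \mid w)}$ is the empirical KL divergence. In fact, we are interested in selecting the model that maximizes the average of this quantity over all data sets. The averaged version, obtained by replacing $K_n$ with its expectation over data sets, is then given by

$$Z(n) = \int \varphi(w) \, e^{-n K(w)} \, dw,$$

where $K(w) = \int q(x) \log \frac{q(x)}{p(x \mid w)} \, dx$ is the Kullback-Leibler divergence.

But now we have a problem: how do we compute this integral? Computing this integral is what the free energy formula is about.

The answer: by computing a different integral. So now, I'll explain the connections we can draw to different integrals.

Let

$$v(t) := \int \delta(t - K(w)) \, \varphi(w) \, dw,$$

which is called the state density function. Here, $\delta$ is the Dirac delta function. For different $t$, it measures the density of states (= parameter vectors) that have $K(w) = t$. It is thus a measure for the "size" of different level sets. This state density function is connected to two different things.

Laplace Transform to the Evidence

First of all, it is connected to the evidence above. Namely, let $\mathcal{L}(v)$ be the Laplace transform of $v$. It is a function $\mathcal{L}(v): \mathbb{R}_{>0} \to \mathbb{R}$ given by

$$\mathcal{L}(v)(a) := \int_0^\infty e^{-at} v(t) \, dt = \int \varphi(w) \int_0^\infty e^{-at} \, \delta(t - K(w)) \, dt \, dw = \int \varphi(w) \, e^{-a K(w)} \, dw.$$

In the first step, we changed the order of integration, and in the second step we used the defining property of the Dirac delta. Great, so this tells us that $Z(n) = \mathcal{L}(v)(n)$! So this means we essentially just need to understand $v$.

Mellin Transform to the Zeta Function

But how do we compute $v$? By using another transform. Let $\mathcal{M}(v)$ be the Mellin transform of $v$ (with the convention $t^z$ instead of the more common $t^{z-1}$). It is a function $\mathcal{M}(v): \mathbb{C} \to \mathbb{C}$ (or maybe only defined on part of $\mathbb{C}$?) given by

$$\mathcal{M}(v)(z) := \int_0^\infty t^z v(t) \, dt = \int \varphi(w) \int_0^\infty t^z \, \delta(t - K(w)) \, dt \, dw = \int K(w)^z \, \varphi(w) \, dw =: \zeta(z).$$

Again, we used a change in the order of integration and then the defining property of the Dirac delta. This is called a zeta function.

What's this useful for?

The Mellin transform has an inverse. Thus, if we can compute the zeta function, we can also compute the original evidence as

$$Z(n) = \mathcal{L}(v)(n) = \mathcal{L}\big(\mathcal{M}^{-1}(\zeta)\big)(n).$$

Thus, we essentially changed our problem to the problem of studying the zeta function $\zeta$. To compute the integral of the zeta function, it is then useful to perform blowups to resolve the singularities in the set of minima of $K$, which is where algebraic geometry enters the picture. For more on all of this, I refer, again, to the excellent SLT low 4 talk of the recent summit on singular learning theory.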
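As a rough numerical sanity check of this story (a toy sketch of my own, not taken from the talk), one can take a one-dimensional parameter with a uniform prior on [-1, 1] and the hypothetical divergences K(w) = w^2 and K(w) = w^4. Assuming the zeta function is the integral of K(w)^z against the prior, for K(w) = w^(2k) it equals 1/(2kz + 1), with its pole at z = -1/(2k), so the averaged evidence Z(n) should scale like n^(-1/(2k)). The helper names `evidence` and `learning_coefficient` are made up for this illustration.

```python
import math

def evidence(K, n, num_points=200_000):
    """Approximate Z(n) = integral over [-1, 1] of phi(w) * exp(-n * K(w)) dw
    with the uniform prior phi(w) = 1/2, using the midpoint rule."""
    h = 2.0 / num_points
    total = 0.0
    for i in range(num_points):
        w = -1.0 + (i + 0.5) * h
        total += 0.5 * math.exp(-n * K(w)) * h
    return total

def learning_coefficient(K, n_small=1_000, n_large=100_000):
    """Estimate the exponent lambda in Z(n) ~ n ** (-lambda)
    from the slope of -log Z(n) against log n."""
    f_small = -math.log(evidence(K, n_small))
    f_large = -math.log(evidence(K, n_large))
    return (f_large - f_small) / (math.log(n_large) - math.log(n_small))

# Regular toy case: zeta pole at z = -1/2, so the slope should be close to 1/2.
lam_regular = learning_coefficient(lambda w: w ** 2)
# Singular toy case: zeta pole at z = -1/4, so the slope should be close to 1/4.
lam_singular = learning_coefficient(lambda w: w ** 4)
print(lam_regular, lam_singular)
```

The point of the sketch is only that the pole of the zeta function and the growth rate of the free energy -log Z(n) agree in these two toy cases; it says nothing about how blowups are carried out in higher dimensions.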

Comment by Leon Lang (leon-lang) on DSLT 2. Why Neural Networks obey Occam's Razor · 2023-07-03T23:13:00.661Z · LW · GW

Thanks for the answer! I think my first question was confused because I didn't realize you were talking about local free energies instead of the global one :) 

As discussed in the comment in your DSLT1 question, they are both singularities of  since they are both critical points (local minima).

Oh, I actually may have missed that aspect of your answer back then. I'm confused by that: in algebraic geometry, the zeros of a set of polynomials are not necessarily already singularities. E.g., for $xy = 0$, the zero set consists of the two axes, which form an algebraic variety, but only at $(0, 0)$ is there a singularity because the derivative vanishes there.
Now, for the KL-divergence, the situation seems more extreme: The zeros are also, at the same time, the minima of , and thus, the derivative vanishes at every point in the set . This suggests every point in  is singular. Is this correct?

So far, I thought "being singular" means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it's about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused. 
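To keep the three notions in this comment apart, here is a small toy check. It uses f(x, y) = x * y for the algebraic-geometry picture and K(x, y) = (x * y)^2 as a hypothetical stand-in for a divergence with the same zero set (it is not an actual KL divergence). The sketch compares where the gradient of f vanishes, where the gradient of K vanishes, and what rank the Hessian of K has.

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x * y, whose zero set is the two axes."""
    return (y, x)

def grad_K(x, y):
    """Gradient of K(x, y) = (x * y) ** 2, which has the same zero set."""
    return (2 * x * y ** 2, 2 * x ** 2 * y)

def hessian_K(x, y):
    """Hessian of K(x, y) = (x * y) ** 2 as a 2x2 nested tuple."""
    return ((2 * y ** 2, 4 * x * y), (4 * x * y, 2 * x ** 2))

def rank_2x2(m):
    """Rank of a 2x2 matrix with exact (integer) entries."""
    (a, b), (c, d) = m
    if a == b == c == d == 0:
        return 0
    return 2 if a * d - b * c != 0 else 1

on_axis = (3, 0)  # a point on the zero set away from the crossing
origin = (0, 0)   # the crossing point of the two axes

print(grad_f(*on_axis), grad_f(*origin))  # only the origin is singular for f
print(grad_K(*on_axis), grad_K(*origin))  # every zero of K is a critical point
print(rank_2x2(hessian_K(*on_axis)), rank_2x2(hessian_K(*origin)))
```

In this toy case every zero of K is a critical point, the Hessian has rank 1 (instead of the full rank 2) along the axes, and it drops to rank 0 at the crossing point. So "gradient of K vanishes" and "Hessian of K is rank-deficient" both hold on the whole zero set, while the crossing point is degenerate in a stronger sense.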

The uninteresting answer is that SLT doesn't care about the prior (other than its regularity conditions) since it is irrelevant in the  limit.

I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as the Jeffreys prior somewhat changes asymptotic behavior for , but I'm not certain of that. 

Comment by Leon Lang (leon-lang) on DSLT 4. Phase Transitions in Neural Networks · 2023-07-03T22:48:04.400Z · LW · GW

Thanks also for this post! I enjoy reading the sequence and look forward to post 5 on the connections to alignment :) 

At some critical value , we recognise a phase transition as being a discontinuous change in the free energy or one of its derivatives, for example the generalisation error .

"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren't related to the phase transitions predicted by SLT?

There is, however, one fundamentally different kind of "phase transition" that we cannot explain easily with SLT: a phase transition of SGD in time, i.e. the number of gradient descent steps. The Bayesian framework of SLT does not really allow one to speak of time - the closest quantity is the number of datapoints , but these are not equivalent. We leave this gap as one of the fundamental open questions of relating SLT to current deep learning practice.

As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?

In general, it seems to me that we're probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?

Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?

Which altered the posterior geometry, but not that of  since  (up to a normalisation factor).

I didn't understand this footnote. 

but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.

Hmm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the "true" vector or not.
Are you maybe trying to say the following? The truth determines which parameter vectors are preferred by the free energy, e.g. those close to the truth. For some truths, we will have more symmetries around the truth, and thus lower RLCT for regions preferred by the posterior.

We will use the label weight annihilation phase to refer to the configuration of nodes such that the weights all point into the centre region and annihilate one another.

It seems to me that in the other phase, the weights also annihilate each other, so the "non-weight annihilation phase" is somewhat weird terminology. Or did I miss something?

The weight annihilation phase  is never preferred by the posterior

I think there is a typo and you meant .

Comment by Leon Lang (leon-lang) on DSLT 3. Neural Networks are Singular · 2023-07-03T13:48:45.112Z · LW · GW

Thanks Liam also for this nice post! The explanations were quite clear. 

The property of being singular is specific to a model class , regardless of the underlying truth.

This holds for singularities that come from symmetries where the model doesn't change. However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence. 

Both configurations, non-weight-annihilation (left) and weight-annihilation (right)

What do you mean by non-weight-annihilation here? Don't the weights annihilate in both pictures?

Comment by Leon Lang (leon-lang) on Neural networks generalize because of this one weird trick · 2023-06-27T04:09:03.576Z · LW · GW

In particular, it is the singularities of these minimum-loss sets — points at which the tangent is ill-defined — that determine generalization performance.

To clarify: there is not necessarily a problem with the tangent, right? E.g., the function  has a singularity at  because the second derivative vanishes there, but the tangent is defined. I think for the same reason, some of the pictures may be misleading to some readers. 

  • A model, parametrized by weights , where  is compact;

Why do we want compactness? Neural networks are parameterized in a non-compact set. (Though I guess usually, if things go well, the weights don't blow up. So in that sense it can maybe be modeled to be compact)

The empirical Kullback-Leibler divergence is just a rescaled and shifted version of the negative log likelihood.

I think it is only shifted, and not also rescaled, if I'm not missing something. 
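Whether "rescaled" applies depends on whether the negative log likelihood is taken as a sum or as an average over the data points. Here is a minimal sketch under the assumption that it is the average, using a hypothetical unit-variance Gaussian model p(x | w) = N(w, 1) and truth q = N(1, 1). In that case the difference between the empirical KL divergence and the average NLL is the same constant for every w, namely minus the empirical entropy of the truth.

```python
import math
import random

def log_normal_pdf(x, mean):
    """Log density of a unit-variance Gaussian with the given mean."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mean) ** 2

def average_nll(data, w):
    """L_n(w) = -(1/n) * sum_i log p(x_i | w), with p(. | w) = N(w, 1)."""
    return -sum(log_normal_pdf(x, w) for x in data) / len(data)

def empirical_kl(data, w, true_mean):
    """K_n(w) = (1/n) * sum_i log(q(x_i) / p(x_i | w)), with q = N(true_mean, 1)."""
    terms = (log_normal_pdf(x, true_mean) - log_normal_pdf(x, w) for x in data)
    return sum(terms) / len(data)

random.seed(0)
true_mean = 1.0  # hypothetical truth, chosen only for this illustration
data = [random.gauss(true_mean, 1.0) for _ in range(500)]

# K_n(w) - L_n(w) should not depend on w.
shifts = [empirical_kl(data, w, true_mean) - average_nll(data, w)
          for w in (-2.0, 0.0, 0.5, 3.0)]
empirical_entropy = -sum(log_normal_pdf(x, true_mean) for x in data) / len(data)
print(shifts, -empirical_entropy)
```

If the NLL is instead defined as the sum over data points, the same computation shows an additional factor of 1/n, which would be the "rescaled" part.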

But these predictions of "generalization error" are actually a contrived kind of theoretical device that isn't what we mean by "generalization error" in the typical ML setting.

Why is that? I.e., in what way is the generalization error different from what ML people care about? Because real ML models don't predict using an updated posterior over the parameter space? (I was just wondering if there is a different reason I'm missing)

Comment by Leon Lang (leon-lang) on DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks · 2023-06-27T01:55:13.807Z · LW · GW

Thanks for the answer mfar!

Yeah I remember also struggling to parse this statement when I first saw it. Liam answered but in case it's still not clear and/or someone doesn't want to follow up in Liam's thesis,  is a free variable, and the condition is talking about linear dependence of functions of .

Consider a toy example (not a real model) to help spell out the mathematical structure involved: Let  so that  and . Then let  and  be functions such that  and . Then the set of functions  is a linearly dependent set of functions because .

Thanks! Apparently the proof of the thing I was wondering about can be found in Lemma 3.4 in Liam's thesis. Also thanks for your other comments!

Comment by Leon Lang (leon-lang) on DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks · 2023-06-27T01:32:41.017Z · LW · GW

Thanks for the answer Liam! I especially liked the further context on the connection between Bayesian posteriors and SGD. Below are a few more comments on some of your answers:

The partition function is equal to the model evidence , yep. It isn’t equal to (I assume  is fixed here?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior), 

and then under this supervised learning setup where we know , we have . Also note that this does “factor over ” (if I’m interpreting you correctly) since the data is independent and identically distributed.  

I think I still disagree. I think everything in these formulas needs to be conditioned on the -part of the dataset. In particular, I think the notation  is slightly misleading, but maybe I'm missing something here.

I'll walk you through my reasoning: When I write  or , I mean the whole vectors, e.g., . Then I think the posterior computation works as follows:

That is just Bayes rule, conditioned on  in every term. Then,  because from alone you don't get any new information about the conditional  (A more formal way to see this is to write down the Bayesian network of the model and to see that  and  are d-separated). Also, conditioned on  is independent over data points, and so we obtain

So, comparing with your equations, we must have  Do you think this is correct?

Btw., I still don't think this "factors over ". I think that

The reason is that old data points should inform the parameter , which should have an influence on future updates. I think the independence assumption only holds for the true distribution and the model conditioned on 
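The claim that the evidence does not factor over data points can be checked exactly in a much simpler setting than the regression model discussed here. The sketch below swaps in a hypothetical coin-flip model, p(x = 1 | w) = w with a uniform prior on w, and compares the evidence of two observations with the product of the single-observation evidences. The function name `evidence` is only for this illustration.

```python
from fractions import Fraction
from math import factorial

def evidence(bits):
    """Marginal likelihood of a 0/1 sequence under p(x = 1 | w) = w
    and a uniform prior on w in [0, 1].

    Uses the Beta integral: integral of w^k (1 - w)^(n - k) dw = k! (n - k)! / (n + 1)!.
    """
    n, k = len(bits), sum(bits)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

joint = evidence([1, 1])                             # p(x_1 = 1, x_2 = 1)
product_of_marginals = evidence([1]) * evidence([1])  # p(x_1 = 1) * p(x_2 = 1)
predictive = joint / evidence([1])                    # p(x_2 = 1 | x_1 = 1)
print(joint, product_of_marginals, predictive)  # prints "1/3 1/4 2/3"
```

The first observation raises the predictive probability of the second from 1/2 to 2/3, which is exactly the sense in which old data points inform the parameter. The data are independent only once we condition on w.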

If you expand that term out you find that 

because the second integral is the first central moment of a Gaussian. The derivative of the prior is irrelevant. 

Right, that makes sense, thank you! (I think you missed a factor of , but that doesn't change the conclusion.)

Thanks also for the corrected volume formula, it makes sense now :) 

Comment by Leon Lang (leon-lang) on DSLT 2. Why Neural Networks obey Occam's Razor · 2023-06-25T22:46:25.722Z · LW · GW

Thanks for this nice post! I find it slightly more vague than the first post, but I guess that is hard to avoid when trying to distill highly technical topics. I got a lot out of it. 

Fundamentally, we care about the free energy  because it is a measure of posterior concentration, and as we showed with the BIC calculation in DSLT1, it tells us something about the information geometry of the posterior.

Can you say more about why it is a measure of posterior concentration? (It gets a bit clearer further below, but I state my question nonetheless to express that this statement isn't locally clear to me here.) I may lack some background in Bayesian statistics here. In the first post, you wrote the posterior as

and it seems like you want to say that if free energy is low, then the posterior is more concentrated. If I look at this formula, then low free energy corresponds to high , meaning the prior and likelihood have to "work quite a bit" to ensure that this expression overall integrates to . Are you claiming that most of that work happens very localized in a small parameter region?

Additionally, I am not quite sure what you mean by "it tells us something about the information geometry of the posterior", or even what you mean by "information geometry" here. I guess one answer is that you showed in post 1 that the Fisher information matrix appears in the formula for the free energy, which contains geometric information about the loss landscape. But then in the proof, you regarded that as a constant that you ignored in the final BIC formula, so I'm not sure if that's what you are referring to here. More explicit references would be useful to me. 

Since there is a correspondence

we say the posterior prefers a region  when it has low free energy relative to other regions of 

Note to other readers (as this wasn't clear to me immediately): That correspondence holds because one can show that 

Here,  is the global partition function. 

The Bayes generalisation loss is then given by 

I believe the first expression should be an expectation over .

It follows immediately that the generalisation loss of a region  is 

I didn't find a definition of the left expression. 

So, the region in  that minimises the free energy has the best accuracy-complexity tradeoff. This is the sense in which singular models obey Occam's Razor: if two regions are equally accurate, then they are preferred according to which is the simpler model. 

Purposefully naive question: can I just choose a region  that contains all singularities? Then it surely wins, but this doesn't help us because this region can be very large.

So I guess you also want to choose small regions. You hinted at that already by saying that  should be compact. But now I of course wonder if sometimes just all of  lies within a compact set. 

There are two singularities in the set of true parameters, 

which we will label as  and  respectively.

Possible correction: one of those points isn't a singularity, but a regular loss-minimizing point (as you also clarify further below).

Let's consider a one parameter model  with KL divergence defined by 

on the region  with uniform prior 

The prior seems to do some work here: if it doesn't properly support the region with low RLCT, then the posterior cannot converge there. I guess a similar story might a priori hold for SGD, where how you initialize your neural network might matter for convergence.

How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?

Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn't "manage to get out of the right valley", I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?

Comment by Leon Lang (leon-lang) on DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks · 2023-06-25T01:06:55.232Z · LW · GW

Thank you for this wonderful article! I read it fairly carefully and have a number of questions and comments. 

where the partition function (or in Bayesian terms the evidence) is given by

Should I think of this as being equal to , and would you call this quantity ? I was a bit confused since it seems like we're not interested in the data likelihood, but only the conditional data likelihood under model 

And to be clear: This does not factorize over  because every data point informs  and thereby the next data point, correct?

The learning goal is to find small regions of parameter space with high posterior density, and therefore low free energy.

But the free energy does not depend on the parameter, so how should I interpret this claim? Are you already one step ahead and thinking about the singular case where the loss landscape decomposes into different "phases" with their own free energy?

there is almost sure convergence  as  to a constant  that doesn't depend on [5]

I think the first expression should either be an expectation over , or have the conditional entropy  within the parentheses. 

  • In the realisable case where , the KL divergence is just the euclidean distance between the model and the truth adjusted for the prior measure on inputs, 

I briefly tried showing this and somehow failed. I didn't quite manage to get rid of the integral over . Is this simple? (You don't need to show me how it's done, but maybe mentioning the key idea could be useful)

A regular statistical model class is one which is identifiable (so  implies that ), and has positive definite Fisher information matrix  for all 

The rest of the article seems to mainly focus on the case of the Fisher information matrix. In particular, you didn't show an example of a non-regular model where the Fisher information matrix is positive definite everywhere. 

Is it correct to assume models which are merely non-regular because the map from parameters to distributions is non-injective aren't that interesting, and so you maybe don't even want to call them singular? I found this slightly ambiguous, also because under your definitions further down, it seems like "singular" (degenerate Fisher information matrix) is a stronger condition than "strictly singular" (degenerate Fisher information matrix OR non-injective map from parameters to distributions).

It can be easily shown that, under the regression model,  is degenerate if and only if the set of derivatives

is linearly dependent. 

What is  in this formula? Is it fixed? Or do we average the derivatives over the input distribution?

Since every true parameter is a degenerate singularity[9] of , it cannot be approximated by a quadratic form.

Hmm, I thought having a singular model just means that some singularities are degenerate.

One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse's post, they often "look singular": i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn't seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change. Is this correct?

We can Taylor expand the NLL as 

I think you forgot a  in the term of degree 1. 

In that case, the second term involving  vanishes since it is the first central moment of a normal distribution

Could you explain why that is? I may have missed some assumption on  or not paid attention to something. 

In this case, since  for all , we could simply throw out the free parameter  and define a regular model with  parameters that has identical geometry , and therefore defines the same input-output function, .

Hmm. Is the claim that if the loss of the function does not change along some curve in the parameter space, then the function itself remains invariant? Why is that?

Then the dimension  arises as the scaling exponent of , which can be extracted via the following ratio of volumes formula for some 

This scaling exponent, it turns out, is the correct way to think about dimensionality of singularities. 

Are you sure this is the correct formula? When I tried computing this by hand it resulted in , but maybe I made a mistake. 

General unrelated question: is the following a good intuition for the correspondence of the volume with the effective number of parameters around a singularity? The larger the number of effective parameters  around , the more  blows up around  in all directions because we get variation in all directions, and so the smaller the region where  is below . So  contributes to this volume. This is in fact what it does in the formulas, by being an exponent for small 
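The volume intuition can be tried out numerically on toy functions. The post's exact ratio-of-volumes formula is not reproduced in this comment, so the sketch below only assumes the scaling V(eps) proportional to eps^lambda for small eps (ignoring possible logarithmic factors), which implies lambda = log(V(a * eps) / V(eps)) / log(a). The two functions K and the helper names `volumes` and `scaling_exponent` are hypothetical choices for this illustration.

```python
import math

def volumes(K, eps_values, grid=600):
    """Approximate the volume of {w in [-1, 1]^2 : K(w) < eps} for each eps
    by counting midpoints of a uniform grid."""
    h = 2.0 / grid
    counts = [0] * len(eps_values)
    for i in range(grid):
        w1 = -1.0 + (i + 0.5) * h
        for j in range(grid):
            w2 = -1.0 + (j + 0.5) * h
            value = K(w1, w2)
            for idx, eps in enumerate(eps_values):
                if value < eps:
                    counts[idx] += 1
    return [c * h * h for c in counts]

def scaling_exponent(K, eps=0.01, a=10.0):
    """Estimate lambda from V(a * eps) / V(eps) ~ a ** lambda."""
    v_small, v_large = volumes(K, [eps, a * eps])
    return math.log(v_large / v_small) / math.log(a)

# Regular toy case: both directions are quadratic, so lambda = d / 2 = 1.
lam_regular = scaling_exponent(lambda w1, w2: w1 ** 2 + w2 ** 2)
# Degenerate toy case: the w2 direction is quartic, so lambda = 1/2 + 1/4 = 3/4.
lam_singular = scaling_exponent(lambda w1, w2: w1 ** 2 + w2 ** 4)
print(lam_regular, lam_singular)
```

In the degenerate case the sublevel set is larger for the same small eps, and the estimated exponent is correspondingly smaller, which matches the intuition that fewer effective parameters means the loss blows up in fewer directions.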

So, in this case the global RLCT is , which we will see in DSLT2 means that the posterior is most concentrated around the singularity 

Do you currently expect that gradient descent will do something similar, where the parameters will move toward singularities with low RLCT? What's the state of the theory regarding this? (If this is answered in later posts, feel free to just refer to them)

Also, I wonder whether this could be studied experimentally even if the theory is not yet ready: one could probably measure the RLCT around minimal loss points by measuring volumes, and then just check whether gradient descent actually ends up in low-RLCT regions. Maybe this is what you do in later posts. If this is the case, I wonder whether I should be surprised or not: it seems like the lower the RLCT, the larger the number of (fractional) directions where the loss is minimal, and so the larger the basin. So for purely statistical reasons, one may end up in such a region instead of isolated loss-minimizing points of high RLCT. 

Comment by Leon Lang (leon-lang) on Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · 2023-06-02T03:57:51.049Z · LW · GW

Apparently Bill Gates signed.

Stating the obvious: Do we expect that Bill Gates will donate money to prevent extinction from AI?

Comment by Leon Lang (leon-lang) on Yoshua Bengio: How Rogue AIs may Arise · 2023-05-23T21:56:40.749Z · LW · GW

It's great to see Yoshua Bengio and other eminent AI scientists like Geoffrey Hinton actively engage in the discussion around AI alignment. He evidently put a lot of thought into this. There is a lot I agree with here.

Below, I'll discuss two points of disagreement or where I'm surprised by his takes, to highlight potential topics of discussion, e.g. if someone wants to engage directly with Bengio.

  • Most of the post is focused on the outer alignment problem -- how do we specify a goal aligned with our intent -- and seems to ignore the inner alignment problem -- how do we ensure that the specified goal is optimized for.
    • E.g., he gives an example of us telling the AI to fix climate change, after which the AI wipes out humanity since that fixes climate change more effectively than respecting our implicit constraints of which the AI has no knowledge. In fact, I think language models show that there may be quite some hope that AI models will understand our implicit intent. Under that view, the problem lies at least as much in ensuring that the AI cares.
    • He also extensively discusses the wireheading problem of entities (e.g., humans, corporations, or AI systems) that try to maximize their reward signal. I think we have reasons to believe that wireheading isn't as much of a concern: inner misalignment will cause the agent to have some other goal than the precise maximization of the reward function, and once the agent is situationally aware, it has incentives to keep its goals from changing by gradient descent. 
    • He does discuss the fact that our brains reward us for pleasure and avoiding pain, which is misaligned with the evolutionary goal of genetic fitness. In the alignment community, this is most often discussed as an inner alignment issue between the "reward function" of evolution and the "trained agent" being our genomes. However, his discussion highlights that he seems to view it as an outer alignment issue between evolution and our reward signals in the brain, which shape our adult brains through in-lifetime learning. This is also the viewpoint in Brain-Like-AGI Safety, as far as I remember, and also seems related to viewpoints discussed in shard theory.
  • "In fact, over two decades of work in AI safety suggests that it is difficult to obtain AI alignment [wikipedia], so not obtaining it is clearly possible."
    • I agree with the conclusion, but I am surprised by the argument. It is true that we have seen over two decades of alignment research, but the alignment community has been fairly small all this time. I'm wondering what a much larger community could have done. 

Comment by Leon Lang (leon-lang) on Yoshua Bengio: How Rogue AIs may Arise · 2023-05-23T21:27:49.409Z · LW · GW

Yoshua Bengio was on David Krueger's PhD thesis committee, according to David's CV.

Comment by Leon Lang (leon-lang) on Announcing “Key Phenomena in AI Risk” (facilitated reading group) · 2023-05-09T16:54:30.658Z · LW · GW

After filling out the form, I could click on "see previous responses", which allowed me to see the responses of all other people who have filled out the form so far

That is probably not intended?

Comment by Leon Lang (leon-lang) on United We Align: Harnessing Collective Human Intelligence for AI Alignment Progress · 2023-04-21T15:48:33.725Z · LW · GW

I disagree with this. I think the most useful definition of alignment is intent alignment. Humans are effectively intent-aligned on the goal to not kill all of humanity. They may still kill all of humanity, but that is not an alignment problem but a problem in capabilities: humans aren't capable of knowing which AI designs will be safe.

The same holds for intent-aligned AI systems that create unaligned successors. 

Comment by Leon Lang (leon-lang) on $20K In Bounties for AI Safety Public Materials · 2023-02-28T04:36:56.996Z · LW · GW

Has this already been posted? I could not find the post. 

Comment by Leon Lang (leon-lang) on Bing Chat is blatantly, aggressively misaligned · 2023-02-17T17:47:32.003Z · LW · GW

For what it's worth, this comment seems clearly right to me, even if one thinks the post actually shows misalignment. I'm confused about the downvotes of this (5 net downvotes and 12 net disagree votes as of writing this). 

Comment by Leon Lang (leon-lang) on The "Minimal Latents" Approach to Natural Abstractions · 2023-02-16T01:15:16.411Z · LW · GW

Now to answer our big question from the previous section: I can find some  satisfying the conditions exactly when all of the ’s are independent given the “perfectly redundant” information. In that case, I just set  to be exactly the quantities conserved under the resampling process, i.e. the perfectly redundant information itself.


In the original post on redundant information, I didn't find a definition for the "quantities conserved under the resampling process". You name this F(X) in that post.

Just to be sure: is your claim that if F(X) exists that contains exactly the conserved quantities and nothing else, then you can define  like this? Or is the claim even stronger and you think such  can always be constructed?

Edit: Flagging that I now think this comment is confused. One can simply define  as the conditional, which is a composition of the random variable  and the function 

Comment by Leon Lang (leon-lang) on Qualities that alignment mentors value in junior researchers · 2023-02-15T01:51:15.969Z · LW · GW

When I converse with junior folks about what qualities they’re missing, they often focus on things like “not being smart enough” or “not being a genius” or “not having a PhD.” It’s interesting to notice differences between what junior folks think they’re missing & what mentors think they’re missing.


There may also be social reasons to give different answers depending on whether you are a mentor or mentee. I.e., answering "the better mentees were those who were smarter" seems like an uncomfortable thing to say, even if it's true. 

(I do not want to say that this social explanation is the only reason that answers between mentors and mentees differed. But I do think that one should take it into account in one's models)

Comment by Leon Lang (leon-lang) on The Additive Summary Equation · 2023-02-09T21:50:44.730Z · LW · GW


Comment by Leon Lang (leon-lang) on The Additive Summary Equation · 2023-02-09T21:27:35.574Z · LW · GW

Then  is a projection matrix, projecting into the span.

To clarify: for this, you probably need the basis  to be orthonormal? 
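
A minimal numerical sketch of why orthonormality matters here, assuming the matrix in question has the form V Vᵀ with the basis vectors as the columns of V. The formula itself is not reproduced above, so this form, the helper names, and the example vectors are all assumptions. A projection matrix must satisfy P P = P, and the sketch checks that property for a non-orthonormal and an orthonormal basis of the same span.

```python
def outer_sum(basis):
    """Return V V^T, where the given vectors are the columns of V."""
    dim = len(basis[0])
    return [[sum(v[i] * v[j] for v in basis) for j in range(dim)] for i in range(dim)]

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def is_projection(p):
    """A projection matrix is idempotent: P P = P."""
    return matmul(p, p) == p

skewed = [(1, 0, 0), (1, 1, 0)]        # spans the x-y plane, not orthonormal
orthonormal = [(1, 0, 0), (0, 1, 0)]   # same span, orthonormal

print(is_projection(outer_sum(skewed)))       # idempotency fails here
print(is_projection(outer_sum(orthonormal)))  # idempotency holds here
```

For a basis that is linearly independent but not orthonormal, the projection onto the span is V (VᵀV)⁻¹ Vᵀ, which reduces to V Vᵀ exactly when VᵀV is the identity.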

Comment by Leon Lang (leon-lang) on Double Crux · 2023-02-09T19:52:08.114Z · LW · GW


  • Disagreements often focus on outputs even though underlying models produced those.
    • Double Crux idea: focus on the models!
    • Double Crux tries to reveal the different underlying beliefs coming from different perspectives on reality
  • Good Faith Principle:
    • Assume that the other side is moral and intelligent.
    • Even if some actors are bad, you minimize the chance of error if you start with the prior that each new person is acting in good faith
  • Identifying Cruxes
    • For every belief A, there are usually beliefs B, C, D such that their believed truth supports belief A
      • These are “cruxes” if them not being true would shake the belief in A.
      • Ideally, B, C, and D are functional models of how the world works and can be empirically investigated
    • If you know your crux(es), investigating it has the chance to change your belief in A
  • In Search of more productive disagreement
    • Often, people obscure their cruxes by telling many supporting reasons, most of which aren’t their true crux.
      • This makes it hard for the “opponent” to know where to focus
    • If both parties search for truth instead of wanting to win, you can speed up the process a lot by telling each other the cruxes
  • Playing Double Crux
    • Lower the bar: instead of reaching a shared belief, find a shared testable claim that, if investigated, would resolve the disagreement.
    • Double Crux: A belief that is a crux for you and your conversation partner, i.e.:
      • You believe A, the partner believes not A.
      • You believe testable claim B, the partner believes not B.
      • B is a crux of your belief in A and not B is a crux of your partner’s belief in not A.
      • Investigating conclusively whether B is true may resolve the disagreement (if the cruxes were comprehensive enough)
  • The Double Crux Algorithm
    • Find a disagreement with another person (This might also be about different confidences in beliefs)
    • Operationalize the disagreement (Avoid semantic confusions, be specific)
    • Seek double cruxes (Seek cruxes independently and then compare)
    • Resonate (Do the cruxes really feel crucial? Think of what would change if you believed your crux to be false)
    • Repeat (Are there underlying easier-to-test cruxes for the double cruxes themselves?)
Comment by Leon Lang (leon-lang) on Abstractions as Redundant Information · 2023-02-09T04:36:31.016Z · LW · GW


In this post, John starts with a very basic intuition: that abstractions are things you can get from many places in the world, which are therefore very redundant. Thus, for finding abstractions, you should first define redundant information: Concretely, for a system of n random variables X1, …, Xn, he defines the redundant information as the information about the original that remains after repeatedly resampling one variable at a time while keeping all the others fixed. Since there will not be any remaining information if n is finite, there is also the somewhat vague assumption that the number of variables goes to infinity in that resampling process.
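
The resampling process can be sketched as a small Gibbs-style loop. This is only a toy illustration under my own assumptions (binary variables, a joint distribution given as a dictionary, hypothetical function names), not John's construction. The first joint is degenerate: the variables are exact copies, so every conditional is deterministic and the shared bit survives even for finite n. The second joint has full support, and there a finite chain mixes and eventually forgets its starting point, which is the case the summary refers to.

```python
import random

def resample_one(state, i, joint, rng):
    """Resample binary coordinate i from P(X_i | all other coordinates)."""
    candidates = [state[:i] + (v,) + state[i + 1:] for v in (0, 1)]
    weights = [joint.get(c, 0.0) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def resample(state, joint, sweeps, rng):
    """Repeatedly resample one variable at a time, keeping the others fixed."""
    for _ in range(sweeps):
        for i in range(len(state)):
            state = resample_one(state, i, joint, rng)
    return state

# Degenerate toy joint: three exact copies of one fair coin flip.
copies = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

# Full-support toy joint: three independent fair coin flips.
independent = {(a, b, c): 0.125 for a in (0, 1) for b in (0, 1) for c in (0, 1)}

rng = random.Random(0)
print(resample((1, 1, 1), copies, sweeps=100, rng=rng))       # the shared bit cannot change
print(resample((1, 1, 1), independent, sweeps=100, rng=rng))  # the starting point is forgotten
```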

The first main theorem says that this resampling process will not break the graphical structure of the original variables, i.e., if X1, …, Xn form a Markov random field or Bayesian network with respect to a graph, then the resampled variables will as well, even when conditioning on their abstraction. John’s interpretation is that you will still be able to make inferences about the world in a local way even if you condition on your high-level understanding (i.e., the information preserved by the resampling process).

The second main theorem applies this to show that any abstraction F(X1, …, Xn) that contains all the information remaining from the resampling process will also contain all the abstract summaries from the telephone theorem for all the ways that X1, …, Xn (with n going to infinity) could be decomposed into infinitely many nested Markov blankets. This makes F a supposedly quite powerful abstraction.

Further Thoughts

It’s fairly unclear how exactly the resampling process should be defined. If n is finite and fixed, then John writes that no information will remain. If, however, n is infinite from the start, then we should (probably?) expect the mutual information between the original random variable and the end result to also often be infinite, which also means that we should not expect a small abstract summary F.

Leaving that aside, it is in general not clear to me how F is obtained. The second theorem just assumes F and deduces that it contains the information from the abstract summaries of all telephone theorems. The hope is that F is low-dimensional and thus manageable. But no attempt is made to show the existence of a low-dimensional F in any realistic setting. 

Another remark: I don’t quite understand what it means to resample one of the variables “in the physical world”. My understanding is as follows, and if anyone can correct it, that would be helpful: We have some “prior understanding” (= prior probability) about how the world works, and by measuring aspects in the world — e.g., patches full of atoms in a gear — we gain “data” from that prior probability distribution. When forgetting the data of one of the patches, we can look at the others and then use our physical understanding to predict the values for the lost patch. We then sample from that prediction.

Is that it? If so, then this resampling process seems very observer-dependent since there is probably no actual randomness in the universe. But if it is observer-dependent, then the resulting abstractions would also be observer-dependent, which seems to undermine the hope to obtain natural abstractions.

I also have a similar concern about the pencils example: if you have a prior on variables X1, …, Xn and you know that all of them will end up being “objects of the same type”, and a joint sample of them gives you n pencils, then it makes sense to me that resampling them one by one indefinitely will still give you a bunch of pencil-like objects, leading you to conclude that the underlying preserved information is a graphite core inside wood. However, where do the variables X1, …, Xn come from in the first place? Each Xi is already a high-level object and it is unclear to me what the analysis would look like if one reparameterized that space. (Thanks to Erik Jenner for mentioning that thought to me; there is a chance that I misrepresent his thinking, though.)

Comment by Leon Lang (leon-lang) on Internal Double Crux · 2023-02-08T19:15:28.052Z · LW · GW


  • Goal: Find motivation through truth-seeking rather than coercion or self-deception
    • Ideally: the urges are aligned with the high-level goals
    • Turn “wanting to want” into “want”
  • If a person has simultaneously conflicting beliefs and desires, then one of those is wrong.
    • [Comment from myself: I find this, as stated, not evidently true since desires often do not have a “ground truth” due to the orthogonality thesis. However, even if there is a conflict between subsystems, the productive way forward is usually to find a common path in a values handshake. This is how I interpret conflicting desires to be “wrong”]
  • Understanding “shoulds”
    • If you call some urges “lazy”, then you spend energy on a conflict
    • If you ignore your urges, then part of you is not “focused” on the activity, making it less worthwhile
    • Acknowledge your conflicting desires: “I have a belief that it’s good to run and I have a belief that it’s good to watch Netflix”
      • The different parts aren’t right or wrong; they have tunnel vision, not seeing the value of the other desire
    • Shoulds: When there is a default action, there is often a sense that you “should” have done something else. If you would have done this “something else”, then the default action becomes the “should” and the situation is reversed.
    • View shoulds as “data” that is useful for making better conclusions
  • The IDC Algorithm (with an example in the article)
    • Recommendation: Do not tweak the structure of IDC before having tried it a few times
    • Step 0: Find an internal disagreement
      • Identify a “should” that’s counter to a default action
    • Step 1: Draw two dots on a piece of paper and name them with the subagents representing the different positions
      • Choose appropriate names/handles that don’t favor one side over the other
    • Step 2: Decide who speaks first (it might be the side with more “urgency”)
      • Say one thing embodied from that perspective
      • Maybe use Focusing to check that the words resonate
    • Step 3: Get the other side to acknowledge truth.
      • Let it find something true in the statement or derived from it
    • Step 4: The second side also adds “one thing”
      • Be open in general about the means of communication of the sides; they may also scribble something, express a feeling, or …
    • Step 5: Repeat
    • Notes:
      • It’s okay for some sides to “blow off steam” once in a while and not follow the rules; if so, correct that after the fact from a “moderation standpoint”
      • You may write down “moderator interjections” with another color
      • Eventually, you might realize the disagreement to be about something else.
        • This can give clarity on the “internal generators” of conflict
        • If so, start a new piece of paper with two new debaters
        • Ideally, the different parts understand each other better, leading them to stop getting into conflict since they respect each other's values
Comment by Leon Lang (leon-lang) on Focusing · 2023-02-07T20:06:47.337Z · LW · GW


  • Focusing is a technique for bringing subconscious system 1 information into conscious awareness
  • Felt sense: a feeling in the body that is not yet verbalized but may subconsciously influence behavior, and which carries meaning.
  • The dominant factor in patient outcomes: whether the patient remains uncertain instead of holding firm narratives
    • A goal of therapy is increased awareness and clarity. Thus, it is not useful to spend much time in the already known
    • The successful patient thinks and listens to information
      • If the verbal part utters something, the patient will check with the felt senses to correct the utterance
      • Listening can feel like “having something on the tip of your tongue”
  • From felt senses to handles
    • A felt sense is like a picture
      • There’s lots of tacit, non-explicit information in it
    • A handle is like a sketch of the picture that is true to it.
      • Handles “resonate” with the felt sense
      • The first attempt at a handle will often not resonate — then you need to iterate
        • In the end, you might get a “click”, “release of pressure”, or “sense of deep rightness”
      • The felt sense can change or disappear once “System 2 got the message”
  • Advice and caveats
    • The felt sense may also not be true — your system 1 may be biased.
    • Tips:
      • Choosing a topic: if you don’t have a felt sense to focus on, produce the utterance “Everything in my life is perfect right now” and see how system 1 responds. This will usually create a topic to focus on
      • Get physically comfortable
      • Don’t “focus” in the sense of effortful attention, but “focus” in the sense of “increase clarity”
      • Hold space: don’t go super fast or “push”; silence in one’s mind is normal
      • Stay with one felt sense at a time
      • Always return to the felt sense, even if the coherent verbalized story feels “exciting”
      • Don’t limit yourself to sensations in your body — there are other felt senses
      • Try saying things out loud (both utterances and questions “to the felt sense”)
      • Try to not “fall into” overwhelming felt senses; they can sometimes make the feeling a “subject” instead of an “object” to hold and talk with
        • Going “meta” and asking what the body has to say about a felt sense can help with not getting sucked in
        • Verbalizing “I feel rage” and then “something in me is feeling rage” etc. can progressively create distance to felt senses
  • The Focusing Algorithm
    • Select something to bring into focus
    • Create space (get physically comfortable and drop in for a minute; Put attention to the body; ask sensations to wait if there are multiple; go meta if you’re overwhelmed)
    • Look for a handle of the felt sense (Iterate between verbalizing and listening until the felt sense agrees; Ask questions to the felt sense; Take time to wait for responses)
Comment by Leon Lang (leon-lang) on Trigger-Action Planning · 2023-02-06T23:07:39.221Z · LW · GW


  • Some things seem complicated/difficult and hard to do. You may also be uncertain whether you can achieve them
    • E.g.: running a marathon; or improving motivation
  • The resolve cycle: set a 5-minute timer and just solve the thing.
  • Why does it work?
    • You want to be actually trying, even when there is no immediate need.
      • Resolve cycles are an easy way of achieving that (They make it more likely and less painful to invest effort)
    • One mechanism of its success: when asking “Am I ready for this?” the answer is often “No”. But when there’s a five-minute timer, the answer becomes “Yes”, no matter how hard the problem itself may seem.
  • The Technique
    • Choose a thing to solve — it can be small or big.
    • Try solving it with a 5-minute timer. Don’t defer parts of it to the future unless they are extremely easy (i.e., there is a TAP in place ensuring that they will get done)
    • If it’s not solved: spend another 5 minutes brainstorming a list of five-minute actions to solve your problem
    • Do one of the five-minute actions to get momentum
  • Developing a “grimoire”:
    • This section lists many questions to ask yourself in a resolve cycle. These questions can help generate useful ideas.
Comment by Leon Lang (leon-lang) on Comfort Zone Exploration · 2023-01-27T19:38:44.736Z · LW · GW

Note: I found this article in particular a bit hard to summarize, especially the section "The argument for CoZE". I find it hard to say what exactly it is telling me, and how it relates to the later sections. 


  • Comfort is a lack of pain, discomfort, negative emotions, fear, and anxiety, …
  • Comfort often comes from experience
  • There’s a gray area between comfort and discomfort that can be worth exploring
  • Explore/Exploit Tradeoff:
    • Should you exploit the current hill and climb higher there, or search for a new one?
    • Problem: there is inherent uncertainty
    • Exploration is risky for individuals; there is a strong bias toward known paths
  • Argument for CoZE
    • Try Things model — cheap, non-destabilizing experiments
    • The text contains a list of questions that can generate a lot of the “things we might try”.
      • Problem: We might be uncomfortable with them.
      • How do we reason through them but also bring System 1 on board, which might have useful insights?
  • Chesterton’s Fence
    • In the story, someone destroys an ugly fence, only to then be attacked by an animal behind.
      • Don’t destroy a barrier before you know exactly why it’s there.
    • CoZE includes the lessons of Chesterton’s fence
      • That’s why it’s called exploration instead of expansion
      • This is the difference to exposure therapy
    • When exploring the area around the fence, remain alert, attentive, receptive:
      • Stay open to all outcomes: the fence shouldn’t be there, the fence is exactly where it should be, the fence should be further away or even closer to you…
  • CoZE Algorithm:
    • Choose an experience to explore (outside of the current action space, or somewhat blocked, maybe with a yum factor)
    • Prepare to accept all worlds: both possibilities need to feel comfortable in your imagination
    • Devise an experiment to “taste” the experience
    • Try the experiment (Potentially with help of others)
      • How do the body and mind react?
      • How does the external world respond?
    • Digest the experience
      • Compare the experience to expectations
      • Should you continue trying something like this? Do not force yourself
      • Give your system 1 space
Comment by Leon Lang (leon-lang) on Againstness · 2023-01-26T19:32:02.537Z · LW · GW


  • Idea: use knowledge of how physiology influences the mind
  • Mental shutdown:
    • Stress → Mental shutdown (trouble thinking, making decisions, …)
    • This leads to decisions that feel correct at the moment but are obviously flawed in hindsight
    • Metacognitive failure: Part of what we lose is also our ability to notice the loss in abilities → Need objective “sobriety test”
  • The autonomic nervous system
    • Sympathetic nervous system (SNS): accelerator; fight/flight/freeze, excitement
    • Parasympathetic nervous system (PSNS): brakes; chill, open vulnerability, reflective…
    • Relative arousal of these systems matters
    • The qualities come bundled: not often do you have an “open, relaxed body posture” while also being extremely excited
    • The ancestral need for survival guides our SNS-dominated responses, and they are not reflective
  • Changing the state
    • Algorithm idea:
      • Check where you are on the SNS-PSNS spectrum
      • Change your position on that spectrum “at will”
    • Most of the skill lies in shifting toward PSNS, which is also more useful for our rationality
    • Algorithm for moving toward PSNS:
      • Notice that you are SNS-dominated (TAP: physical sign, perceived hostility, someone saying “calm down”)
      • Open your body posture
      • Take low, slow, deep breaths: belly-dominated, exhale longer than inhale
      • Become aware of the sensations in your feet
        • Then expand that awareness to the whole body
      • Take another low, slow, deep breath. Enjoy
  • Metacognitive blindspots
    • The againstness-resolving technique is one mechanism to resolve metacognitive blindspots (where our metacognition is not able to reflect on our cognition)
    • General advice: 
      • if everyone sounds wrong/stupid/malicious etc., take seriously the idea that it’s actually you.
      • Get external objective evidence on your blindspot (e.g., sobriety test for driving, noticing SNS-state)
      • Assess your blindspots with the same tools you use to assess others’ blindspots
Comment by Leon Lang (leon-lang) on Againstness · 2023-01-26T19:28:36.221Z · LW · GW

I share your confusion. 

Comment by Leon Lang (leon-lang) on Systemization · 2023-01-25T21:55:11.919Z · LW · GW


  • Systemizing has large up-front costs for diminished repeated costs
  • Not everything needs systemization: maybe you like your inefficient process, or the costs are too infrequent to bother.
  • Systemization-opportunities: 
    • Common routines: waking up, meals, work routines, computer, social
    • Familiar spaces: bedroom, bathroom, kitchen, living room, vehicle, workspace, backpack
    • Shoulds/Obligations: Physical health, finances, intellectual growth, close relationship, career, emotional well-being, community
  • Framing: instead of trying to “do everything”, focus on “freeing up attention”
    • Systems as “personal assistants”
  • Consider having several smaller systems instead of one overarching framework
  • Motivating Perspectives:
    • Increasing Marginal Returns: attention is a prerequisite for flow, deep connection, and non-default actions
      • The returns are increasing since removing the first distraction is not worth much in a sea of distractions, but the last is
    • TAP-interactions: Instead of changing responses, improve the triggers
  • Qualities of good attention-saving systems: Effortless, reliable (otherwise it’s not attention-freeing), invisible (the triggers should be specific, and not always be visible)
  • Advice for getting started:
    • Pay attention to friction (e.g., by collecting instances of it for a week)
    • Set external reminders
    • Establish a routine
    • Shape others’ expectations regarding communication
    • Eliminate unneeded communication, e.g., email newsletters
    • Use checklists (e.g. for travel or income tax)
    • Outsource routines (housekeeping) or routine building (e.g., personal trainer)
    • Use inboxes
    • Triage (can a task also not be done?)
  • Meta-systems
    • Systems don’t work from the start, and also not forever. Solution:
    • Systemize the implementation/adaptation of your systems. Options:
      • Regular (e.g., weekly) time
      • Irregular, based on a running list of system-problems
      • Immediately when problems arise
  • Systemizing Algorithm:
    • Choose an aspect to be systemized (e.g., costs attention, is effortful, and can be improved)
    • Think of attention sinks and ideas to address them (maybe: aversion-factoring, understanding the cause of the distraction)
    • Reality check (with Murphyjitsu, effort assessment, common sense objections)
    • Make a plan (set up TAPs, changes in the environment, buy things)
    • Put the plan into action (start right now or create the next action; have a review system in place)
Comment by Leon Lang (leon-lang) on Goodhart's Imperius · 2023-01-24T19:06:37.250Z · LW · GW


  • Claim 1: Goodhart’s Law is true
    • “Any measure which becomes the target ceases to be a good measure”
    • Examples:
      • Any math test supposed to find the best students will cease to work at the 10th iteration — people then “study to be good at the test”
      • Sugar was a good proxy for healthy food in the ancestral environment, but not today
  • Claim 2: If you want to condition yourself to a certain behavior with some reward, then that’s possible if only the delay between behavior and reward is small enough
  • Claim 3: Over time, we develop “taste”: inexplicable judgments of whether some stimulus may lead to progress toward our goals or not.
    • A “stimulus” can be as complex as “this specific hypothesis for how to investigate a disease”
  • Claim 4: Our Brains condition us, often without us noticing
    • With this, the article just means that dopamine spikes don’t exactly occur at the low-level reward, but already at points that predictably will lead to reward.
      • Since the dopamine hit itself can “feel rewarding”, this is a certain type of conditioning towards the behavior that preceded it.
      • In other words, the brain gives a dopamine hit in the same way as the dog trainer produces the “click” before the “treat”.
      • We often don’t “notice” this since we don’t usually explicitly think about why something feels good.
  • Conclusion: Your brain conditions you all the time toward proxy goals (“dopamine hits”), and Goodhart’s law means that conditioning is sometimes wrong
    • E.g., if you get an “anti-dopamine hit” for seeing the number on your bathroom scale, then this may condition you toward never looking at that number ever again, instead of the high-level goal of losing weight
Comment by Leon Lang (leon-lang) on Generalizing Koopman-Pitman-Darmois · 2023-01-24T01:51:31.526Z · LW · GW

An abstraction of a high-dimensional random variable X is a low-dimensional summary G(X) that can be used to make predictions about X. In the case that X is sampled from some parameterized distribution P(X | theta), G(X) may take the form of a sufficient statistic, i.e., a function of X such that P(theta | X) = P(theta | G(X)). To make predictions about X, one may then determine theta from P(theta | G(X)), and predict a new data point X from theta.
In this post, John shows that if you have a very low-dimensional sufficient statistic G(X), then in many cases, X will “almost follow” the exponential family form and thus be fairly easy to deal with.
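
To make the sufficient-statistic idea concrete, here is a small sketch for coin flips, which form an exponential family. It is my own toy example, not one from the post: the grid of theta values, the uniform prior, and the function names are assumptions. The posterior over theta computed from the full sequence X matches the posterior computed only from the summary G(X) = (number of flips, number of ones).

```python
import math

def posterior_from_data(xs, thetas):
    """Posterior over a grid of theta values (uniform prior) from the full data."""
    likelihoods = [math.prod(t if x == 1 else 1 - t for x in xs) for t in thetas]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

def posterior_from_summary(n, ones, thetas):
    """The same posterior, computed only from the summary G(X) = (n, number of ones)."""
    likelihoods = [t ** ones * (1 - t) ** (n - ones) for t in thetas]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
xs = [1, 0, 1, 1, 0, 1, 1, 0]

full = posterior_from_data(xs, thetas)
summary = posterior_from_summary(len(xs), sum(xs), thetas)

# The two posteriors agree up to floating-point rounding.
print(all(math.isclose(a, b) for a, b in zip(full, summary)))
```

The summary has fixed size no matter how long the sequence is, which is the sense in which it is “low-dimensional”.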

Further Thoughts
I don’t yet quite understand what John wants to use the theorem for, but I will hopefully learn this in his project update post. My current guess is that he would like to identify abstractions as sufficient statistics of exponential families and that this might be a data structure that is “easier to identify” in the world and in trained machine learning models than our initially “broad guess” for what abstractions could be.
Note that I’m only writing this down to have a prediction to update once I’m reading John’s update post. This seems strictly more useful than not having a prediction at all, even though I don’t place a high chance on my prediction actually being fully correct/nuanced enough.

Another thought is that I feel slightly uneasy about the viewpoint that abstractions are the thing we use to make “predictions about X”. In reality, if a person is in a particular state (meaning that the person is represented by an extremely high-dimensional sample vector X), then to make predictions, I definitely only use a very low-dimensional summary based on my sense of the person’s body part positions and coarse brain state. However, these predictions are not about X: I don’t use the summary to sample vectors that are consistent with the summary; instead, I use the summary to make predictions about summaries themselves. I.e., what will be the person’s mental state and body positions a moment from now, and how will this impact abstractions of other objects in the vicinity of that person? There should then be a “commutative diagram” relating objects in reality to their low-dimensional abstractions, and real-world state transitions to predictions.

I hope to eventually learn more about how this abstraction work feeds into such questions.

Comment by Leon Lang (leon-lang) on Taste & Shaping · 2023-01-23T20:03:33.007Z · LW · GW


  • “I want to exercise but I don’t want to exercise”
    • Wants reflect both long-term goals and immediate desires
    • Because of emotional valence, the two do not always agree
    • Yum and yuck: the things that are yucky at the moment often contain a yummy quality for the long-term
  • Goal: Yum-feeling is aligned with the in-the-moment actions toward our long-term wants
    • This can be achieved with internal double-crux (IDC)
    • This article is more for understanding what’s going on, and IDC is for tinkering with it. 
  • Our motivations are shaped by hyperbolic discounting:
    • If something feels intrinsically unpleasant then a model in us claims this is bad for one of our goals. 
      • E.g., in the exercise case: Goals like “conserve energy” and “avoid discomfort” are hurt
      • It’s worthwhile to search for the causal structure of this.
        • If that’s not clarified, the feelings can condition us into unhelpful behavior
      • Parking ticket example:
        • Paying the parking ticket and getting a paycheck may both gain us $110 (in avoided fees vs. salary)
        • However, the first feels counter to the goal of making money whereas the second feels like serving that goal.
        • It’s worthwhile to try to align the immediate emotional response with the actual effect of our actions
    • Temporal or hyperbolic discounting: quick rewards work way better for conditioning than longer-term rewards
      • E.g., the immediate reward of crisps vs. feeling bad afterward
      • This is because our system 2 modeling ability is pretty weak compared to system 1 sensemaking
      • Even very small yucks/yums can have an outsized effect
  • Resolving model conflicts
    • It seems worth it to track what our yums and yucks reinforce.
    • Reward shaping can help with that: as in dog training, manage to produce a “click” after positive actions that align with the long-term goal (a “treat”) that will be achieved way later.
    • IDC will focus on how to do this while keeping true beliefs.
  • Taste: an Inner Simulator output
    • Mathematics professors have a taste for what yields success, formed by hundreds of seen proof attempts.
    • CFAR’s take: neither believe nor disbelieve your taste; evaluate it.
    • Problem with taste: if you simply believe it, then it may become a self-fulfilling prophecy.
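
The discounting bullet above can be illustrated with the common hyperbolic form V = A / (1 + kD), where A is the reward size, D the delay, and k a discount rate. The article does not specify a formula, so the functional form, k = 1, and the reward sizes below are assumptions chosen only to show the typical preference reversal: the small immediate reward wins in the moment, while the larger delayed reward wins when both are still far away.

```python
def hyperbolic_value(amount, delay, k=1.0):
    """Present value of a reward under hyperbolic discounting: A / (1 + k * D)."""
    return amount / (1.0 + k * delay)

def preferred(small, large, wait):
    """Compare a small-soon reward with a large-late reward, both pushed back by `wait`."""
    v_small = hyperbolic_value(small[0], small[1] + wait)
    v_large = hyperbolic_value(large[0], large[1] + wait)
    return "small-soon" if v_small > v_large else "large-late"

crisps = (1.0, 0.0)     # (reward size, delay): small reward, available immediately
fitness = (5.0, 10.0)   # larger reward, ten time units later

print(preferred(crisps, fitness, wait=0))    # choosing in the moment
print(preferred(crisps, fitness, wait=20))   # choosing well in advance
```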
Comment by Leon Lang (leon-lang) on The Telephone Theorem: Information At A Distance Is Mediated By Deterministic Constraints · 2023-01-21T07:09:09.579Z · LW · GW

Uncharitable Summary

Most likely there’s something in the intuitions which got lost when transmitted to me via reading this text, but the mathematics itself seems pretty tautological to me (nevertheless I found it interesting since tautologies can have interesting structure! The proof itself was not trivial to me!). 

Here is my uncharitable summary:

Assume you have a Markov chain M_0 → M_1 → M_2 → … → M_n → … of variables in the universe. Assume you know M_n and want to predict M_0. The Telephone theorem says two things:

  • You don’t need to keep all information about M_n to predict M_0 as well as possible. It’s enough to keep the conditional distribution P(M_0 | M_n).
  • Additionally, in the long run, these conditional distributions grow closer and closer together: P(M_0 | M_n) ~ P(M_0 | M_{n+1})

That’s it. The first statement is tautological, and the second states that you cannot keep losing information. At some point, your uncertainty in M_0 stabilizes.
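
A quick numerical sketch of the second statement, under assumptions of my own: a chain of binary variables where each step flips the bit with probability 0.1, and M_0 uniform. The mutual information I(M_0; M_n) can only go down along the chain, and the per-step losses shrink, so the conditional distributions P(M_0 | M_n) change less and less. In this particular toy chain the information stabilizes at zero.

```python
import math

def mutual_information(joint):
    """I(A; B) in bits for a joint distribution given as joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                total += p * math.log2(p / (pa[a] * pb[b]))
    return total

def step(joint, kernel):
    """Turn the joint of (M_0, M_n) into the joint of (M_0, M_{n+1})."""
    size = len(kernel[0])
    return [[sum(row[m] * kernel[m][k] for m in range(len(row))) for k in range(size)]
            for row in joint]

flip = 0.1                                     # assumed per-step flip probability
kernel = [[1 - flip, flip], [flip, 1 - flip]]  # P(M_{n+1} | M_n)
joint = [[0.5, 0.0], [0.0, 0.5]]               # joint of (M_0, M_0) with M_0 uniform

infos = []
for n in range(1, 31):
    joint = step(joint, kernel)
    infos.append(mutual_information(joint))

print([round(i, 4) for i in infos[:5]])
print(infos[0] - infos[1], infos[-2] - infos[-1])  # early loss vs. late loss
```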

Further Thoughts

  • I think actually, John wants to claim that the conditional probabilities can be replaced by something else which carries information at a distance and stabilizes over time. Something like:
    • Average measurements
    • Pressure/temperature of a gas
    • Volume of a box
    • Length of a message
  • These things could then serve as “sufficient statistics” that contain everything one needs for making predictions. I have no idea how one would go about finding such conserved quantities in general systems. John also makes a related remark:
    • “(Side note: the previous work already suggested conditional probability distributions as the type-signature of abstractions, but that’s quite general, and therefore not very efficient to work with algorithmically. Estimates-of-deterministic-constraints are a much narrower subset of conditional probability distributions.)”
  • The math was made easier in the proof by assuming that information at a distance precisely stabilizes at some point. In reality, it may slowly converge without becoming constant. For this, no proof or precise formulation is yet in the text.
  • John claims: “I believe that most abstractions used by humans in practice are summaries of information relevant at a distance. The theorems in this post show that those summaries are estimates/distributions of deterministic (in the limit) constraints in the systems around us.”
    • This confuses me. It seems to claim that we can also use summaries of very nearby objects to make predictions at arbitrary distances. However, the mathematics doesn’t show this: it only considers varying the “sender” of information, not varying the “receiver” (which in the case of the theorem is M_0!). If you want to make predictions about arbitrarily far away different things in the universe, then it's unclear whether you can throw away any information about nearby things. (But maybe I misunderstood the text here?)

A somewhat more random comment:

I disagree with the claim that the intuitions behind information diagrams fall apart at higher degrees: if you’re “just fine” with negative information, then you can intersect arbitrarily many circles and get additivity rules for information terms. I actually wrote a paper about this, including how one can do this for other information quantities like Kolmogorov complexity and Kullback-Leibler divergence. What’s problematic about this is not the mathematics of intersecting circles, but that we largely don’t have good real-world interpretations and use cases for it. 
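
A standard small example of negative information in a three-circle diagram, sketched under my own choice of variables and function names: X and Y are independent fair bits and Z = X XOR Y. The central region, computed by inclusion-exclusion over entropies, comes out negative, yet all the additivity rules of the diagram still hold.

```python
import math
from itertools import product

def entropy(joint, keep):
    """Entropy in bits of the marginal on the coordinates listed in `keep`."""
    marginal = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

def triple_intersection(joint):
    """Central region of the three-circle diagram, by inclusion-exclusion over entropies."""
    singles = sum(entropy(joint, (i,)) for i in range(3))
    pairs = sum(entropy(joint, pair) for pair in ((0, 1), (0, 2), (1, 2)))
    return singles - pairs + entropy(joint, (0, 1, 2))

# X and Y are independent fair bits, Z = X XOR Y.
xor = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

print(triple_intersection(xor))  # negative: minus one bit
```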

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2023-01-21T00:18:40.125Z · LW · GW

Edit: This is now obsolete with our NAH distillation.

Making the Telephone Theorem and Its Proof Precise

This short form distills the Telephone Theorem and its proof. The short form will thereby not at all be "intuitive"; the only goal is to be mathematically precise at every step.

Let  be jointly distributed finite random variables, meaning they are all functions

starting from the same finite sample space with a given probability distribution  and into respective finite value spaces . Additionally, assume that these random variables form a Markov chain 


Lemma: For a Markov chain , the following two statements are equivalent:


(b) For all  with 


Assume (a): Inspecting an information diagram of  will immediately result in us also observing the Markov chain . Markov chains can be turned around, thus we get the two chains

Factorizing along these two chains, we obtain:

and thus, for all  with  . That proves (b).

Assume (b): We have


where, in the second step, we used the Markov chain  and in the third step, we used assumption (b). This independence gives us the vanishing of conditional mutual information:

Together with the Markov chain , this results, by inspecting an information diagram, in the equality .  


Theorem: Let .  The following are equivalent:


(b) There are functions  defined on , respectively such that:

  •  with probability , i.e., the measure of all  such that the equality doesn't hold is zero.
  • For all , we have the equality , and the same for .

Proof: The Markov chain immediately also gives us a Markov chain , meaning we can without loss of generality assume that . So let's consider the simple Markov chain .

Assume (a): By the lemma, this gives us for all  with 

Define the two functions  by:

Then we have  with probability 1[1], giving us the first condition we wanted to prove. 

For the second condition, we use a trick from Probability as Minimal Map: set , which is a probability distribution. We get

That proves (b).

Assume (b): For the other direction, let  be given with  Let  be such that  and with . We have

and thus

The result follows from the Lemma.[2]

  1. ^

    Somehow, my brain didn't find this obvious. Here is an explanation: 

  2. ^

    There is some subtlety about whether the random variable  can be replaced by  in that equation. But given that they are "almost" the same random variables, I think this is valid inside the probability equation.

Comment by Leon Lang (leon-lang) on Turbocharging · 2023-01-20T19:36:03.864Z · LW · GW


  • Sometimes, reinforcement learning goes wrong: how can this be prevented?
  • Example: math education
    • One student simply “learns to follow along”, and the other “learns to predict what comes next”
    • The other student may gain the ability to solve math problems on their own, while the first plausibly won’t.
  • Turbocharging, general notes:
    • Idea: You get better at the things you practice, and it pays off to think about what, mechanistically, you want to learn.
    • You won’t just learn “what you intend”:
      • If you intend to gain the skill of disarmament of people but hand the weapon back during training with a partner, then that is what you learn.
    • Example of math student revisited: are they…
      • Actively thinking about the symbols?
      • Calling up related material from memory?
      • Generating hypotheses (instead of falling prey to hindsight bias)?
      • Thinking about the underlying structure of the problem?
    • ← These questions determine what’s actually practiced.
  • The Turbocharging Algorithm:
    • Select a skill to be acquired/improved
    • Select a practice method (to be evaluated or to be strengthened/developed)
    • Evaluate the resemblance between method and skill:
      • Does/Do the “practice trigger(s)” resemble the real-world trigger, or at least plausibly generalize?
      • Does/Do the “practice action(s)” resemble real-world actions, or at least plausibly generalize?
    • Possibly adjust the practice method in response to the previous answers
  • Further Notes
    • Declarative and Procedural Knowledge require different types of learning
    • Turbocharging is for procedural learning, which is more of what applied rationality is about
    • The article lists many counterexamples of the theory that turbocharging is “the one and only” way to gain procedural knowledge. 
Comment by Leon Lang (leon-lang) on Experiment Idea: RL Agents Evading Learned Shutdownability · 2023-01-20T17:45:07.323Z · LW · GW

I drop further ideas here in this comment for future reference. I may edit the comment whenever I have a new idea:

  • Someone had the idea that instead of giving a negative reward for acting while an alert is played, one may also give a positive reward for performing the null action in those situations. If the positive reward is high enough, an agent like MuZero (which explicitly tries to maximize the expected predicted reward) may then be incentivized to cause the alert sound to be played during deployment. One could then add further environment details that are learned by the world model and that make it clear to the agent how to trigger the alert in deployment. This also has an analogy in the original shutdown paper. 
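
A toy sketch of why the two reward schemes create opposite incentives. All names and numbers here (`NULL_ACTION`, the penalty of 1, the bonus of 5, a task reward of 1 per step) are hypothetical choices for illustration and not part of the experiment proposal itself.

```python
NULL_ACTION = 0  # hypothetical encoding of "do nothing"
WORK_ACTION = 1  # hypothetical encoding of "pursue the task"

def penalty_reward(alert_on, action, task_reward, penalty=1.0):
    """Variant 1: acting while the alert plays is punished."""
    if alert_on and action != NULL_ACTION:
        return task_reward - penalty
    return task_reward

def bonus_reward(alert_on, action, task_reward, bonus=5.0):
    """Variant 2: performing the null action while the alert plays is rewarded."""
    if alert_on and action == NULL_ACTION:
        return task_reward + bonus
    return task_reward

def episode_return(reward_fn, alert_on, action, task_reward, steps=10):
    """Undiscounted return of repeating one action for `steps` steps."""
    return sum(reward_fn(alert_on, action, task_reward) for _ in range(steps))

# Complying with the alert means taking the null action, which earns no task reward.
penalty_alert = episode_return(penalty_reward, True, NULL_ACTION, 0.0)
penalty_no_alert = episode_return(penalty_reward, False, WORK_ACTION, 1.0)
bonus_alert = episode_return(bonus_reward, True, NULL_ACTION, 0.0)
bonus_no_alert = episode_return(bonus_reward, False, WORK_ACTION, 1.0)

print(penalty_alert, penalty_no_alert)  # alert episodes are worth less under the penalty
print(bonus_alert, bonus_no_alert)      # alert episodes are worth more under a large bonus
```

Under these assumed numbers, a return-maximizing planner prefers worlds without the alert in the first variant and worlds with the alert in the second, which is the incentive the idea relies on.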
Comment by Leon Lang (leon-lang) on Experiment Idea: RL Agents Evading Learned Shutdownability · 2023-01-20T17:36:01.255Z · LW · GW
Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2023-01-20T03:22:48.003Z · LW · GW

Thanks for your answer! 

Credit assignment (AKA policy gradient) credits the diamond-recognizing circuit as responsible for reward, thereby retaining this diamond abstraction in the weights of the network.

This is different from how I imagine the situation. In my mind, the diamond-circuit remains simply because it is a good abstraction for making predictions about the world. Its existence is, in my imagination, not related to an RL update process. 

Other than that, I think the rest of your comment doesn't quite answer my concern, so I'll try to formalize it more. Let's work in the simple setting that the policy network has no world model and is simply a non-recurrent function  mapping from observations to probability distributions over actions. I imagine a simple version of shard theory to claim that f decomposes as follows:


where i is an index for enumerating shards,  is the contextual strength of activation of the i-th shard (maybe with ), and  is the action-bid of the i-th shard, i.e., the vector of log-probabilities it would like to see for different actions. Then SM is the softmax function, producing the final probabilities.

In your story, the diamond shard starts out as very strong. Let's say it's indexed by  and that  for most inputs  and that  has a large "capacity" at its disposal so that it will in principle be able to represent behaviors for many different tasks. 

Now, if a new task pops up, like solving a maze, in a specific context , I imagine that two things could happen to make this possible:

  •  could get updated to also represent this new behavior
  • The strength  could get weighed down and some other shard could learn to represent this new behavior.

One reason why the latter may happen is that  possibly becomes so complicated that it's "hard to attach more behavior to it"; maybe it's just simpler to create an entirely new module that solves this task and doesn't care about diamonds. If something like this happens often enough, then eventually, the diamond shard may lose all its influence. 
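
To illustrate the second scenario numerically, here is a small sketch. It assumes one possible reading of the decomposition, namely f(o) = SM(Σ_i a_i(o) · f_i(o)), where the symbol names a_i (contextual strength) and f_i (action-bid) are my own labels. The two-action setup and all numbers are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shard_policy(strengths, bids):
    """Action probabilities from strength-weighted shard bids.

    strengths: one contextual activation a_i per shard (assumed to lie in [0, 1])
    bids:      one vector of log-probability bids f_i per shard
    """
    n_actions = len(bids[0])
    combined = [sum(a * bid[k] for a, bid in zip(strengths, bids))
                for k in range(n_actions)]
    return softmax(combined)

# Hypothetical actions: index 0 = "head for the diamond", index 1 = "follow the maze path".
diamond_bid = [2.0, 0.0]
maze_bid = [0.0, 2.0]

# Early in training: the diamond shard is strongly active in the maze context.
early = shard_policy([1.0, 0.1], [diamond_bid, maze_bid])
# Later: the diamond shard's strength has been weighed down in this context.
late = shard_policy([0.1, 1.0], [diamond_bid, maze_bid])

print(early)  # most probability mass on action 0
print(late)   # most probability mass on action 1
```

Nothing in this sketch says which of the two update paths gradient descent would actually take; it only shows that down-weighting a shard's strength is enough to hand control of the action distribution to another shard.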

Comment by Leon Lang (leon-lang) on Testing The Natural Abstraction Hypothesis: Project Intro · 2023-01-20T01:52:48.192Z · LW · GW


The natural abstractions hypothesis makes three claims:

  • Abstractability: to make predictions in our world, it’s enough to know very low-dimensional summaries of systems, i.e., their abstractions (empirical claim)
  • Human-compatibility: Humans themselves use these abstractions in their thinking (empirical claim)
  • Convergence/naturality: most cognitive systems use these abstractions to make predictions (mathematical+empirical claim)

John wants to test this hypothesis by:

  • Running simulations of systems and showing that low-information summaries predict how they evolve
  • Checking whether these low-information summaries agree with how humans reason about the system
  • Training predictors/agents on the system and observing whether they use these low-dimensional summaries. Also, try to prove theorems about which systems will use which abstractions in which environments.

The Holy Grail: having a machine that provably detects the low-dimensional abstractions useful for making predictions in almost any system. Then, use it in the real world and observe whether the concepts agree with human real-world concepts. John says this would prove the NAH.


Further Thoughts

  • To me, it seems like conceptual/theoretical progress is at least as much needed as empirical progress since we still don’t quite conceptually understand what it means to “make predictions about a system”:
    • Clearly, predicting all the low-level details is not possible with abstract summaries alone.
    • Thus, the only thing one can ever hope to predict with abstract summaries are… other abstract summaries. 
    • However, this seems to create a chicken-and-egg problem: we already need to know the relevant abstractions in order to assess whether the abstractions are useful for predicting the values of those. It’s not enough to find “any low-dimensional piece of information” that is good for predicting… yeah, for predicting what?
  • The problem of science that John discusses has a nice interpretation for alignment research:
    • Probably there is only a small number of variables to tweak in exactly the right way when building advanced AI, and this will be enough — for a superintelligent being, at least — to correctly predict that everything will remain aligned. Let’s find those variables.
      • This reminds me of Eliezer’s claim that probably, the alignment solution fits into one very simple “book from the future” that contains all the right ideas, similar to how our world now contains the simple idea of a ReLU that wasn’t accessible 20 years ago. 
  • I think if we had this abstraction-thermometer, then we wouldn’t even need “convergence” anymore: simply use the thermometer itself as part of the AGI by pointing the AGI to the revealed human-value concept. Thus, I think I’m fine with reducing the NAH to just the AH, consisting of only the two claims that low-dimensional information will be enough for making predictions and that human concepts (in particular, human values) are such low-dimensional information. Then, we “just” need to build an AGI that points to human values and make sure that no other AGI will be built that doesn’t point there (or even couldn’t point there due to using other abstractions).
    • If we don’t have the “N” part of NAH, then this makes “alignment by default” less likely. But this isn’t that bad from the perspective of trying to get “alignment by design”.
Comment by Leon Lang (leon-lang) on Alignment By Default · 2023-01-20T00:29:43.072Z · LW · GW


This article claims that:

  • Unsupervised learning systems will likely learn many “natural abstractions” of concepts like “trees” or “human values”. Maybe they will even end up being simply a “feature direction”.
    • One reason to expect this is that to make good predictions, you only need to conserve information that’s useful at a distance. And this information can be thought of as a “natural abstraction”. 
  • If you then have an RL system or supervised learner who can use the unsupervised activations to solve a problem, then it can directly behave in such a way as to “satisfy the natural abstraction” of, e.g., human values. 
    • This would be a quick way for the model to behave well on such a task. Later on, the model might find the unnatural proxy goal and maximize that, but that wouldn’t be the first thing found. 
  • Thus, we may end up with alignment by default. Such an aligned AGI could then be used to align successor AIs, and it might be better at that than humans are since it’s smarter. 
    • Note that if we have “alignment by default”, then the alignment of a successor system might work better than with competitive HCH or IRL. Reason: humans+bureaucracies may not be aligned, and IRL finds a utility function, which is of the wrong type signature. This may lead to a compounding of alignment errors when building successor AIs. 
    • Variations of “alignment by default” would, e.g., find the “human value abstraction” in one AI and then “plant it” into the search operation of an AGI. I.e., alignment would not be solved by default for all AIs, but it’s defaulty-enough that we still win. 

My Opinion:

  • Human values are less “one thing” than trees or humans are. “Human values are active” is not a sensible piece of information since there are so many different types of human values. Admittedly, this also applies to trees, but it does feel like it’s pointing to a difficulty.
  • Overall, this makes me think that, possibly, the proxy goal is often a more natural abstraction than human values. My hope mainly comes from the thought that the proxy goals are hopefully specific enough to the RL/supervised task that they didn’t appear in the unsupervised training phase.
    • But there are reasons against that: humans will put lots of information about ML training processes into the training data of any unsupervised system, meaning that to make good predictions, the systems should probably represent these proxy goals quite well. Only if they do would they make accurate predictions about, e.g., contemporary alignment errors. 
Comment by Leon Lang (leon-lang) on Aversion Factoring · 2023-01-19T19:06:12.289Z · LW · GW


  • Aversions lead to avoidance or displeasure
  • Aversion Factoring: Technique for overcoming aversions
  • Idea: An activity doesn’t “just suck”: specific elements of it are aversive. 
    • While in goal factoring, the emphasis was on finding the positive factors, here, the emphasis seems to be on finding the negative factors
    • Then, instead of imagining doing a different activity with all of the benefits, imagine doing the same activity with none of the aversions
  • Story of Critch climbing trees:
    • Dirty clothes: buy dark jeans
    • Danger: practice falling, hard & not high
    • Hands feel icky: exposure therapy and big-picture reframe
    • Overall idea: try to find all aversions and circumvent them while still doing the activity
  • Aversion Factoring Algorithm
    • Choose an activity: worth doing, or already doing but unpleasant
    • Check your motivation: Why is it appealing? Consider goal-factoring to explore replacement activities. Consider internal double crux
    • Factor aversion into parts: include also weird impulses, and trivial inconveniences, and be specific
    • Draw a causal graph: capture all the positives and negatives in it. Do mindful walkthroughs and button tests (“what if the negatives disappeared”) to check if it’s complete
    • Implement solutions: both external factors and internal factors (e.g., reframings and exposure therapy)
      • This may require iteration if some aversions remain
      • Remember to view the implementations as an experiment
      • Remember that not all aversions should be overcome
Comment by Leon Lang (leon-lang) on A shot at the diamond-alignment problem · 2023-01-19T01:59:17.312Z · LW · GW

In this shortform, I explain my main confusion with this alignment proposal. The main thing that's unclear to me: what's the idea here for how the agent remains motivated by diamonds even while doing very non-diamond related things like "solving mazes" that are required for general intelligence?
More details in the shortform itself.

Comment by Leon Lang (leon-lang) on Leon Lang's Shortform · 2023-01-19T01:42:24.907Z · LW · GW

These are rough notes trying (but not really succeeding) to deconfuse me about Alex Turner's diamond proposal. The main thing I wanted to clarify: what's the idea here for how the agent remains motivated by diamonds even while doing very non-diamond related things like "solving mazes" that are required for general intelligence?

  • Summarizing Alex's summary:
    • Multimodal SSL initialization
    • recurrent state, action head
    • imitation learning on humans in simulation, + sim2real
      • low sample complexity
      • Humans move toward diamonds
    • policy-gradient RL: reward the AI for getting near diamonds
      • the recurrent state retains long-term information
    • After each task completion: the AI is near diamonds
  • SSL will make sure the diamond abstraction exists
  • Proto Diamond shard:
    • There is a diamond abstraction that will be active once a diamond is seen. Imagine this as being a neuron.
    • Then, hook up the "move forward" action with this neuron being active. Give reward for being near diamonds.  Voila, you get an agent which obtains reward! This is very easy to learn! More easily than other reward-obtaining computations
    • Also, other such computations may be reinforced, like "if shiny object seen, move towards it" --- do adversarial training to rule those out
    • This is all about prototypical diamonds. Thus, the AI may not learn to create a diamond as large as a sun, but that's also not what the post is about.
  • Preserving diamond abstraction/shard:
    • In Proto planning, the AI primarily thinks about how to achieve diamonds. Such thinking is active across basically all contexts, due to early RL training. 
    • Then, we will give the AI other types of tasks, like "maze solving" or "chess playing" or anything else, from very easy to very hard.
      • At the end of each task, there will be a diamond and reward.
      • By default, at the start of this new training process, the diamond shard will be active since training so far ensures it is active in most contexts. It will bid for actions before the reward is reached, and therefore, its computations will be reinforced and shaped. Also, other shards will be reinforced (ones that plan how to solve a maze, since they also steer toward the reinforcement event), but the diamond shard is ALWAYS reinforced.
        • The idea here is that the diamond shard is contextually activated BY EVERY CONTEXT, and so it is basically one huge circuit thinking about how to reach diamonds that simply gets extended with more sub-computations for how to reach diamonds.
        • Caveat: another shard may be better at planning toward the end of a maze than the diamond shard, which "isn't specialized". And if that's the case, then reinforcement events may make the diamond shard continuously less active in maze-solving contexts until it no longer activates in start-of-maze contexts. It's unclear to me what the hypothesis is for how to prevent this. 
          • Possibly the hypothesis is captured in this paragraph of Alex, but I don't understand it: 
            "In particular, even though online self-supervised learning continues to develop the world model and create more advanced concepts, the reward events also keep pinging the invocation of the diamond-abstraction as responsible for reward (because insofar as the agent's diamond-shard guides its decisions, then the diamond-shard's diamond-abstraction is in fact responsible for the agent getting reward). The diamond-abstraction gradient starves the AI from exclusively acting on the basis of possible advanced "alien" abstractions which would otherwise have replaced the diamond abstraction. The diamond shard already gets reward effectively, integrating with the rest of the agent's world model and recurrent state, and therefore provides "job security" for the diamond-abstraction. (And once the agent is smart enough, it will want to preserve its diamond abstraction, insofar as that is necessary for the agent to keep achieving its current goals which involve prototypical-diamonds.)"
          • I don't understand what it means to "ping the invocation of the diamond-abstraction as responsible for reward". I can imagine what it means to have subcircuits whose activation is strengthened on certain inputs, or whose computations (if they were active in the context) are changed in response to reinforcement. And so, I imagine the shard itself to be shaped by reward. But I'm not sure what exactly is meant by pinging the invocation of the diamond abstraction as responsible for reward. 
Comment by Leon Lang (leon-lang) on Disentangling Shard Theory into Atomic Claims · 2023-01-19T00:05:23.096Z · LW · GW

I have now finally read LawrenceC's Shard Theory in Nine Theses: a Distillation and Critical Appraisal. I think it is worth reading for many people even if they have already read my own distillation. A summary of things emphasized that are missing (or less emphasized) in mine (Note: Lawrence doesn't necessarily believe these claims and, for some of them, lists his disagreements):

  • A nice picture of how the different parts of an agent composed of shards interact. This includes a planner, which I've not mentioned in my post. 
  • A comparison of shard theory with the subagent model and rational agent model, and also the learning/steering model
  • A greater emphasis that shards care about world model concepts instead of sensory experiences
  • The following quote: "As far as I can tell, shard theory does not make specific claims about what form these bids take, how a planner works, how much weight each shard has, or how the bids are aggregated together into an action."
    • This part is a bit more conservative than my speculations about how shards influence the log probabilities of different actions, which are more specific and falsifiable.
  • In my description, it may implicitly sound as if all the shards are agents. Lawrence adds more nuance:
    • "While not all shards are agents, shard theory claims that relatively agentic shards exist and will eventually end up “in control” of the agent’s actions."
  • A mechanism for how agentic shards make less agentic shards lose influence over time (namely, by steering behavior such that the less agentic shards are not reinforced)
  • A greater emphasis that learned values are path-dependent (and in a way that can be steered by our choices to bring about the values we want)
  • The sometimes-made claim that learned values are relatively architecture-independent
  • A discussion of how inner misalignment may not be a problem when viewed in the shard frame.
Comment by Leon Lang (leon-lang) on What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? · 2023-01-18T21:08:20.736Z · LW · GW


This article thinks about what “general-purpose search” is and why to expect it in advanced machine learning systems. 

In general, we expect gradient descent to find “simple solutions” with lots of varying parameters (since they take a larger part in solution space) and “general solutions” that are helpful broadly (since we will put the system in diverse environments). Thus, we do expect search processes to emerge.

However, babble and prune will likely not be the resulting process: it’s not compute and memory efficient enough. Instead, John imagines a search process that starts with a constraint/problem and iteratively produces broad strokes of solutions that form new constraints of subproblems, until the problem is solved. If this is roughly correct, it will also mean that the search process is retargetable.

This leaves open how the broad strokes of solutions to constraints are found, which John expects requires heuristics that will often either output a solution itself, or a different problem whose solution is easier to generate, instead of babbling and then pruning. Some heuristics:

  • Relaxed problem: only consider time-constraint, or only consider immediately reducing Euclidean distance.

The specific relaxed problems are “heuristics”. The procedure to relax the problem is a “heuristic generator”

  • One could consider this a “meta-heuristic”. However, the type-signature of “heuristic” is “problem in, solution out” or “problem in, other problem out”, and the type-signature of the meta heuristic is “problem in, heuristic out”, so these are different.
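
The difference between the two type signatures can be sketched with type aliases. The `Problem` and `Solution` classes, the `relax` function, and the rule of keeping the alphabetically first constraint are all hypothetical stand-ins chosen for illustration, not John's formalism.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass(frozen=True)
class Problem:
    description: str
    constraints: frozenset

@dataclass(frozen=True)
class Solution:
    plan: str

# "problem in, solution out" or "problem in, other problem out"
Heuristic = Callable[[Problem], Union[Solution, Problem]]
# "problem in, heuristic out"
HeuristicGenerator = Callable[[Problem], Heuristic]

def relax(problem: Problem) -> Heuristic:
    """Heuristic generator: build a heuristic that keeps a single constraint.

    Which constraint to keep is an arbitrary choice here (alphabetically first).
    """
    kept = min(problem.constraints)

    def heuristic(p: Problem) -> Problem:
        # Maps a problem to an easier problem with only the kept constraint.
        return Problem(f"relaxed: {p.description}", frozenset({kept}))

    return heuristic

trip = Problem("reach the city", frozenset({"time", "traffic", "weather"}))
only_time = relax(trip)     # a Heuristic produced by the generator
relaxed = only_time(trip)   # an easier Problem produced by the heuristic
print(relaxed)
```

Here `relax` has the generator signature while `only_time` has the heuristic signature, so the two cannot be swapped for one another.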

Finally, John gestures at the observation that heuristics seem to be environment-dependent but not goal-dependent, at least for similar types of goals (e.g. for the type “reach X city” or “do X thing this week”). This makes them more generalizable. 

Other Thoughts

  • Don’t chess players sometimes do babble and prune? They might look at the board and literally “babble” possible moves, evaluate them, and search further in the best of those.
    • An alternative to that process would be to think “I want to capture the queen, how do I do this?” and then to explicitly think about moves that achieve that “constraint”. The original constraint/problem is just “win this game”, of which “capture the queen” is already a subconstraint/problem.
    • Still, I do remember Magnus Carlsen saying in an interview that he actually does do relatively exhaustive search in some chess situations. So it seems to at least be some search process of many he applies. But I also remember him saying that this is effortful. 
  • John's description leaves open the process by which the solutions to constraints are found. Doesn’t that process usually involve babbling?
    • In the case of finding stores, we may say “there is no babbling, the computer program just shows me the open stores that satisfy the constraint.”
      • But doesn’t the computer internally need to babble? I.e., to go through a database of all the options to find the ones satisfying the constraint!
      • In general, I would say babbling is required unless a solution to the constraint can already be retrieved in a somewhat cached form. 
  • I’m not sure if “relax the problem” is a clear instruction. I feel like you already need to have something like a “natural abstraction of problems” in your head in order to be able to do that. This doesn’t really contradict what John is saying, but it highlights that there is some hidden complexity in this.
Comment by Leon Lang (leon-lang) on Goal Factoring · 2023-01-18T19:11:15.809Z · LW · GW


  • The goal of Goal factoring: in tradeoff situations, get all of the good with none of the bad
  • Parable of the orange: two people want the last orange. However, one only wanted the peel, the other the flesh — both can get what they want
  • Case study: preoccupied professor
    • Grading has lots of costs.
    • Grading produces a bag of benefits
    • The professor thought about how to reach all benefits without paying any of the costs
    • In the end, he found a system in which students could grade themselves
  • Goal Factoring algorithm:
    • Choose an action (that you consider doing, that has some (frequent) costs, or that you aren’t sure why you’re doing)
    • Prepare to accept all worlds (accept the idea of not doing the action. If hesitation comes up, this signals an implicit goal)
    • Factor the action out into goals 
      • Remember the difference between wanting to do X and wanting to appear to do X
      • Include goals like social standing, sense of self, etc.
      • Check with button test (button removes action and achieves goals) whether these goals contain all the benefits of the action
    • Brainstorm replacement actions (first for one goal at a time. Then, combine replacement actions into a coherent plan)
    • Reality Check: Imagine the plan. Does System 1 protest? What does Murphyjitsu reveal?

Before starting the process: remind yourself that you are not forced to replace your action. This makes the brainstorming process less scary!

Comment by Leon Lang (leon-lang) on Disentangling Shard Theory into Atomic Claims · 2023-01-18T02:37:31.187Z · LW · GW

Thank you! Then I indeed misunderstood Alex Turner's claim, and I basically seem to agree with my new understanding.