Posts

Scaling and evaluating sparse autoencoders 2024-06-06T22:50:39.440Z
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision 2023-12-16T05:39:10.558Z
Shapley Value Attribution in Chain of Thought 2023-04-14T05:56:18.208Z
[ASoT] Some thoughts on human abstractions 2023-03-16T05:42:12.595Z
Clarifying wireheading terminology 2022-11-24T04:53:23.925Z
Scaling Laws for Reward Model Overoptimization 2022-10-20T00:20:06.920Z
How many GPUs does NVIDIA make? 2022-10-08T17:54:35.466Z
Towards deconfusing wireheading and reward maximization 2022-09-21T00:36:43.244Z
Humans Reflecting on HRH 2022-07-29T21:56:53.561Z
leogao's Shortform 2022-05-24T20:08:32.928Z
[ASoT] Consequentialist models as a superset of mesaoptimizers 2022-04-23T17:57:40.130Z
[ASoT] Some thoughts about imperfect world modeling 2022-04-07T15:42:09.589Z
[ASoT] Some thoughts about LM monologue limitations and ELK 2022-03-30T14:26:15.381Z
[ASoT] Some thoughts about deceptive mesaoptimization 2022-03-28T21:14:27.217Z
[ASoT] Searching for consequentialist structure 2022-03-27T19:09:13.370Z
[ASoT] Some ways ELK could still be solvable in practice 2022-03-27T01:15:16.607Z
[ASoT] Observations about ELK 2022-03-26T00:42:20.540Z
What do paradigm shifts look like? 2022-03-16T19:17:37.586Z
EleutherAI's GPT-NeoX-20B release 2022-02-10T06:56:41.155Z
NFTs, Coin Collecting, and Expensive Paintings 2022-01-24T01:01:48.117Z
Retail Investor Advantages 2021-12-07T02:08:20.694Z
Behavior Cloning is Miscalibrated 2021-12-05T01:36:01.802Z
Quadratic Voting and Collusion 2021-11-17T00:19:15.737Z
In Defence of Optimizing Routine Tasks 2021-11-09T05:09:41.595Z
Towards Deconfusing Gradient Hacking 2021-10-24T00:43:32.916Z
Dissolving the Experience Machine Objection 2021-10-03T16:56:28.312Z
Gradient descent is not just more efficient genetic algorithms 2021-09-08T16:23:46.996Z
Obstacles to gradient hacking 2021-09-05T22:42:22.876Z
Thoughts on the Alignment Implications of Scaling Language Models 2021-06-02T21:32:08.555Z
Building AGI Using Language Models 2020-11-09T16:33:25.864Z
GPT-3: A Summary 2020-06-02T18:14:54.380Z

Comments

Comment by leogao on Anna and Oliver discuss Children and X-Risk · 2024-07-25T21:58:04.183Z · LW · GW

Wanting to raise kids/have what would normally be considered a lineage is importantly different from IGF (inclusive genetic fitness); most people would not consider sperm bank donation to satisfy their child-having goals very well, despite it being very good for IGF.

Comment by leogao on You should go to ML conferences · 2024-07-25T04:11:28.161Z · LW · GW

For what it's worth, my view is that a huge proportion of the value of conferences comes from meeting new people, catching up with people who live elsewhere, having research discussions, etc. I've occasionally found out about a paper I wouldn't have otherwise, but this is a smaller fraction of the value for me. Published language model research is generally behind the state of the art of what's available inside labs, and a huge fraction of conference papers won't replicate, are too toy, or otherwise end up never becoming useful.

Comment by leogao on leogao's Shortform · 2024-07-24T06:52:44.288Z · LW · GW

tourist traps are the lubricant between the expectations and realities of travel

Comment by leogao on A simple model of math skill · 2024-07-24T05:20:20.575Z · LW · GW

How do you sample uniformly from the integers?

Comment by leogao on JumpReLU SAEs + Early Access to Gemma 2 SAEs · 2024-07-20T01:18:47.505Z · LW · GW

Very exciting that JumpReLU works well with STE gradient estimation! I think this fixes one of the biggest flaws with TopK, which is that having a fixed number of latents k on each token is kind of wonky. I also like the argument in section 4 a lot - in particular the point about how this works because we're optimizing the expectation of the loss. Because of how sparse the features are, I wonder if it would reduce gradient noise substantially to use a KDE with state persisting across a few recent steps.

Comment by leogao on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · 2024-07-19T17:04:19.632Z · LW · GW

I'm 80% confident that with optimal hyperparameters for both (you need to retune hparams when you change batch size), a batch size of 131072/64 is substantially less efficient than 131072.

We find that at a batch size of 131072, when hyperparameters are tuned, the training curves as a function of number of tokens are roughly the same as with a batch size of 4096 (see appendix A.4). So it is not the case that 131072 is in a degenerate large-batch regime where efficiency is substantially degraded by batch size.

When your batch is not fully iid, this is like effectively having a smaller batch size of iid data (in the extreme, if your batch contains 64 copies of the same data, this is obviously the same as a 64x smaller batch size), but you still pay the compute cost of putting all 131072 tokens through the model.
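
To illustrate the intuition (toy setup, made-up sizes): with a mean-reduced loss, a batch that just repeats the same examples 64 times gives exactly the same gradient as the unique examples alone, so the duplicated tokens cost compute without adding any signal.

```python
import torch

# Toy illustration (made-up model and data): duplicating examples in a batch
# does not change the gradient of a mean-reduced loss.
torch.manual_seed(0)
model = torch.nn.Linear(16, 1)
unique_x, unique_y = torch.randn(8, 16), torch.randn(8, 1)

def grads(x, y):
    model.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

g_unique = grads(unique_x, unique_y)
g_duplicated = grads(unique_x.repeat(64, 1), unique_y.repeat(64, 1))
print(torch.allclose(g_unique, g_duplicated, atol=1e-6))  # True
```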

Comment by leogao on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · 2024-07-19T09:19:06.183Z · LW · GW

Some takes on some of these research questions:

Looking for opposing feature directions in SAEs

I checked a top-k SAE with 256k features and k=256 trained on GPT-4 and found only 286 features that had any other feature with cosine similarity < -0.9, and 1314 with cosine sim < -0.7.
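
For reference, a rough sketch of how I'd run this kind of check without materializing the full latent-by-latent similarity matrix (W_dec here is an assumed (n_latents, d_model) decoder weight, not the exact code we used):

```python
import torch

# Hedged sketch: count latents whose decoder direction is strongly anti-aligned
# with at least one other latent's direction. W_dec is assumed to have one row
# per latent; blocking keeps memory manageable for ~256k latents.
def count_opposing_latents(W_dec: torch.Tensor, threshold: float = -0.9, block: int = 4096) -> int:
    dirs = W_dec / W_dec.norm(dim=-1, keepdim=True)
    has_opposite = torch.zeros(dirs.shape[0], dtype=torch.bool, device=dirs.device)
    for start in range(0, dirs.shape[0], block):
        sims = dirs[start:start + block] @ dirs.T  # cosine similarities for this block of latents
        has_opposite[start:start + block] = (sims < threshold).any(dim=-1)
    return int(has_opposite.sum())
```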

SAE/Transcoder activation shuffling

I'm confident that when learning rate and batch size are tuned properly, not shuffling eventually converges to the same thing as shuffling. The right way to frame this imo is the efficiency loss from not shuffling, which from preliminary experiments+intuition I'd guess is probably substantial.

How much does initializing the encoder to be the transpose of the decoder (as done here and here) help for SAEs and transcoders?

It helps tremendously for SAEs by very substantially reducing dead latents; see appendix C.1 in our paper.

Comment by leogao on Stitching SAEs of different sizes · 2024-07-13T21:02:07.698Z · LW · GW

Cool work - figuring out how much of scaling up autoencoders is discovering new features vs splitting existing ones feels quite important. Especially since for any one scale of autoencoder there are simultaneously features which are split too finely and features which are too rare to yet be discovered, it seems quite plausible that the most useful autoencoders will be ones with features stitched together from multiple scales.

Some minor nitpicks: I would recommend always thinking of MSE/L0 in terms of the frontier between the two, rather than either alone; in my experiments I found it very easy to misjudge at a glance whether a run with better MSE but worse L0 was better or worse than the frontier.

Comment by leogao on Daniel Kokotajlo's Shortform · 2024-07-11T19:31:29.125Z · LW · GW

I think something based on prediction markets can counteract this kind of war-of-attrition strategy. There are two main advantages of this solution: (a) it requires users to stake their reputation on their claims, and so if you ever double down really really hard on something that's obviously wrong, it will cost you a lot, and (b) in general prediction markets solve the problem of providing a cheap way to approximate a very expensive process if it's obvious to everyone what the output of the very expensive process will be, which nullifies an entire swathe of bad-faith arguing techniques.

To avoid the Arbital failure mode, I think the right strategy is to (i) start simple and implement one feature at a time and see how it interacts with actual conversations (every successful complex system grows out of a simple one - maybe we can start with literally just a LW clone but the voting algorithm is entirely using the community notes algorithm), and (ii) for the people implementing the ideas to be basically the same people coming up with the ideas.

Comment by leogao on leogao's Shortform · 2024-07-11T07:23:58.470Z · LW · GW

VAEs:

a normal autoencoder decodes single latents z to single images (or whatever other kind of data) x, and also encodes single images x to single latents z. 

with VAEs, we want our decoder (p(x|z)) to take single latents z and output a distribution over x's. for simplicity we generally declare that this distribution is a gaussian with identity covariance, and we have our decoder output a single x value that is the mean of the gaussian.

because each x can be produced by multiple z's, to run this backwards you also need a distribution of z's for each single x. we call the ideal encoder p(z|x) - the thing that would perfectly invert our decoder p(x|z). unfortunately, we obviously don't have access to this thing. so we have to train an encoder network q(z|x) to approximate it. to make our encoder output a distribution, we have it output a mean vector and a stddev vector for a gaussian. at runtime we sample a random vector eps ~ N(0, 1), multiply it by the stddev vector, and add the mean vector to get a sample from N(mu, std).

to train this thing, we would like to optimize the following loss function:

-log p(x) + KL(q(z|x)||p(z|x))

where the terms optimize the likelihood (how good is the VAE at modelling data, assuming we have access to the perfect z distribution) and the quality of our encoder (how good is our q(z|x) at approximating p(z|x)). unfortunately, neither term is tractable - the former requires marginalizing over z, which is intractable, and the latter requires p(z|x) which we also don't have access to. however, it turns out that the following is mathematically equivalent and is tractable:

-E z~q(z|x) [log p(x|z)] + KL(q(z|x)||p(z))

the former term is just the log-likelihood of the real data under the decoder distribution given z drawn from the encoder distribution (which is equivalent to the MSE up to constants, because it's the log of a gaussian pdf). the latter term can be computed analytically, because both distributions are gaussians with known mean and std. (the distribution p is determined in part by the decoder p(x|z), but that doesn't pin down the entire distribution; we still have a degree of freedom in how we pick p(z). so we typically declare by fiat that p(z) is a N(0, 1) gaussian. then, p(z|x) is implied to be equal to p(x|z) p(z) / ∫ p(x|z') p(z') dz')
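
a minimal code sketch of the above (arbitrary sizes, single linear encoder/decoder just to show the loss structure; the reconstruction term is the gaussian log-likelihood up to constants, i.e. MSE, and the KL term is the analytic KL between N(mu, std) and N(0, 1)):

```python
import torch
import torch.nn as nn

# minimal sketch (arbitrary sizes, linear encoder/decoder) of the loss described above
class VAE(nn.Module):
    def __init__(self, d_x: int = 784, d_z: int = 32):
        super().__init__()
        self.enc = nn.Linear(d_x, 2 * d_z)  # outputs [mu, log_std] of q(z|x)
        self.dec = nn.Linear(d_z, d_x)      # outputs the mean of p(x|z)

    def loss(self, x):
        mu, log_std = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * log_std.exp()  # reparameterized sample from q(z|x)
        x_hat = self.dec(z)
        # -E_{z~q(z|x)} [log p(x|z)]: gaussian log-likelihood with identity covariance = MSE up to constants
        recon = 0.5 * ((x_hat - x) ** 2).sum(dim=-1)
        # KL(q(z|x) || p(z)) with p(z) = N(0, 1), computed analytically
        kl = (0.5 * (mu ** 2 + (2 * log_std).exp() - 1) - log_std).sum(dim=-1)
        return (recon + kl).mean()
```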

Comment by leogao on leogao's Shortform · 2024-07-11T06:11:50.632Z · LW · GW

learning thread for taking notes on things as i learn them (in public so hopefully other people can get value out of it)

Comment by leogao on Daniel Kokotajlo's Shortform · 2024-07-10T22:51:27.886Z · LW · GW

I'd be excited about a version of Wikipedia that is built from the ground up to operate in an environment where truth is difficult to find and there is great incentive to shape the discourse. Perhaps there are new epistemic technologies similar to community notes that are yet to be invented.

Comment by leogao on JanPro's Shortform · 2024-06-29T03:09:51.018Z · LW · GW

One problem is that the outcome may itself be strongly correlated with the S&P 500, which would mess up the probabilities

Comment by leogao on Daniel Kokotajlo's Shortform · 2024-06-25T04:07:34.219Z · LW · GW

I like the aesthetics of this idea and am vaguely interested in the idea of making this happen (though most likely I will never get around to doing it)

Comment by leogao on Interpreting and Steering Features in Images · 2024-06-21T06:45:10.607Z · LW · GW

I've been told by Gabriel Goh that the CLIP neurons are apparently surprisingly sparse, so the neuron baseline might be stronger relative to SAEs than in e.g LLMs. (This is also why the Multimodal Neurons work was possible without SAEs)

Comment by leogao on Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT. · 2024-06-15T01:38:24.447Z · LW · GW

Did you use the initialization scheme in our paper where the decoder is initialized to the transpose of the encoder (and then columns unit normalized)? There should not be any dead latents with topk at small scale with this init.
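
For reference, a rough sketch of the kind of initialization I mean (shapes are assumed conventions, not the exact code from the paper):

```python
import torch

# Rough sketch (assumed shapes: W_enc is (d_model, n_latents), W_dec is (n_latents, d_model)).
# Initialize the encoder randomly, set the decoder to its transpose, then normalize
# each latent's decoder direction to unit norm.
d_model, n_latents = 768, 16384
W_enc = torch.randn(d_model, n_latents) / d_model ** 0.5
W_dec = W_enc.T.clone()
W_dec /= W_dec.norm(dim=-1, keepdim=True)
```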

Also, if I understand correctly, leaky topk is similar to the multi-topk method in our paper. I'd be interested in a comparison of the two methods.

Comment by leogao on My AI Model Delta Compared To Yudkowsky · 2024-06-13T03:38:41.663Z · LW · GW

(In fact, we know that the fraction of features extracted is probably quite small - for example, the 16M latent GPT-4 autoencoder only captures 10% of the downstream loss in terms of equivalent pretraining compute.)

Comment by leogao on My AI Model Delta Compared To Yudkowsky · 2024-06-13T03:34:36.022Z · LW · GW

(I'm the first author of the linked paper on GPT-4 autoencoders.)

I think many people are heavily overrating how human-explainable SAEs today are, because it's quite subtle to determine whether a feature is genuinely explainable. SAE features today, even in the best SAEs, are generally not explainable with simple human-understandable explanations. By "explainable," I mean there is a human-understandable procedure for labeling whether the feature should activate on a given token (and also how strong the activation should be, but I'll ignore that for now), such that your procedure predicts an activation if and only if the latent actually activates.

There are a few problems with interpretable-looking features:

  • it is insufficient that latent-activating samples have a common explanation. You also need the opposite direction: things that match the explanation should actually activate the latent. For example, we found a neuron in GPT-2 that appears to activate on the word "stop," but actually most instances of the word "stop" don't activate the neuron. It turns out that this was not really a "stop" neuron, but rather a "don't stop/won't stop" neuron. While in this case there was a different but still simple explanation, it's entirely plausible that many features just cannot be explained with simple explanations. This problem gets worse as autoencoders scale, because their explanations will get more and more specific.
  • People often look at the top activating examples of a latent, but this provides a heavily misleading picture of how monosemantic the latent is even just in the one direction. It's very common for features to have extremely good top activations but then terrible nonzero activations. This is why our feature visualizer shows random nonzero activations before the top activations.
  • Oftentimes, it is actually harder to simulate a latent than it looks. For example, we often find latents that activate on words in a specific context - say, financial news articles - but they seem to activate on random words inside those contexts, and we don't have a good explanation for why they activate on some words but not others.

We also discuss this in the evaluation section of our paper on GPT-4 autoencoders. The ultimate metric we introduce for whether the features are explainable is the following: simulate each latent with your best explanation of the latent, then run the simulated values through the decoder and the rest of the model, and look at the downstream loss. This procedure is very expensive, so making it feasible to run is a nontrivial research problem, but I predict basically all existing autoencoders will score terribly on this metric.
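
A very rough sketch of the shape of this metric (every helper here is a hypothetical stand-in; the expensive part is the explanation-based simulation itself):

```python
import torch

# Hypothetical sketch of the metric described above. `simulate` maps
# (explanation, tokens) -> predicted activation values for one latent, and
# `run_from_layer` runs the rest of the model from the SAE's layer on a
# replacement residual stream and returns the LM loss; both are stand-ins.
def explanation_downstream_loss(sae, explanations, tokens, simulate, run_from_layer):
    # simulate every latent from its best natural-language explanation
    simulated = torch.stack([simulate(e, tokens) for e in explanations], dim=-1)  # (n_tokens, n_latents)
    # decode the simulated activations and splice them back into the model
    reconstruction = simulated @ sae.W_dec + sae.b_dec
    return run_from_layer(sae.layer, reconstruction, tokens)
```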

Comment by leogao on Scaling and evaluating sparse autoencoders · 2024-06-09T06:00:52.447Z · LW · GW

We had done very extensive ablations at small scale where we found TopK to be consistently better than all of the alternatives we iterated through, and by the time we launched the big run we had already worked out how to scale all of the relevant hyperparameters, so we were decently confident.

One reason we might want a progressive code is that it would basically let you train one autoencoder and use it for any k you want at test time (which is nice because we don't really know exactly how to set k for maximum interpretability yet). Unfortunately, this is somewhat worse than training for the specific k you want to use, so our recommendation for now is to train multiple autoencoders.

Also, even with a progressive code, the activations on the margin would not generally be negative (we actually apply a ReLU to make sure that the activations are definitely non-negative, but almost always the (k+1)th value is still substantially positive)
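
For concreteness, a minimal sketch of a TopK encoder (illustrative names and shapes, not our exact implementation); evaluating with a larger k just keeps more of the same sorted pre-activations, which is why a progressive code would be convenient, and the ReLU is what guarantees the kept activations are non-negative:

```python
import torch

# Minimal sketch (illustrative names/shapes): TopK keeps only the k largest
# pre-activations per token and applies a ReLU so kept values are non-negative.
def topk_encode(x, W_enc, b_enc, k):
    pre = x @ W_enc + b_enc                      # (n_tokens, n_latents)
    vals, idx = torch.topk(pre, k, dim=-1)       # k largest pre-activations per token
    latents = torch.zeros_like(pre)
    latents.scatter_(-1, idx, torch.relu(vals))  # zero everything else
    return latents
```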

Comment by leogao on Scaling and evaluating sparse autoencoders · 2024-06-08T21:10:11.019Z · LW · GW

To add some more concreteness: suppose we open up the model and find that it's basically just a giant k-nearest-neighbors (it obviously can't be literally this, but this is easiest to describe as an analogy). Then this would explain why current alignment techniques work and dissolve some of the mystery of generalization. Then suppose we create AGI and we find that it does something very different internally that is more deeply entangled and we can't really make sense of it because it's too complicated. Then this would imo also provide strong evidence that we should expect our alignment techniques to break.

In other words, a load bearing assumption is that current models are fundamentally simple/modular in some sense that makes interpretability feasible, and that observing this breaking in the future is probably important evidence that will hopefully come before those future systems actually kill everyone.

Comment by leogao on Scaling and evaluating sparse autoencoders · 2024-06-08T11:53:07.969Z · LW · GW

Thanks for your kind words!

My views on interpretability are complicated by the fact that I think it's quite probable there will be a paradigm shift between current AI and the thing that is actually AGI like 10 years from now or whatever. So I'll describe first a rough sketch of what I think within-paradigm interp looks like and then what it might imply for 10 year later AGI. (All these numbers will be very low confidence and basically made up)

I think the autoencoder research agenda is currently making significant progress on item #1. The main research bottlenecks here are (a) SAEs might not be able to efficiently capture every kind of information we care about (e.g circular features) and (b) residual stream autoencoders are not exactly the right thing for finding circuits. Probably this stuff will take a year or two to really hammer out. Hopefully our paper helps here by giving a recipe to push autoencoders really quickly so we bump into the limitations faster and with less second guessing about autoencoder quality.

Hopefully #4 can be done in large part in parallel with #1; there's a whole bunch of engineering needed to e.g take autoencoders and scale them up to capture all the behavior of the model (which was also a big part of the contribution of this paper). I'm pretty optimistic that if we have a recipe for #1 that we trust, the engineering (and efficiency improvements) for scaling up is doable. Maybe this adds another year of serial time. The big research uncertainty here fmpov is how hard it is to actually identify the structures we're looking for, because we'll probably have a tremendously large sparse network where each node does some really boring tiny thing.

However, I mostly expect that GPT-4 (and probably 5) is just not doing anything super spicy/stabby. So I think most of the value of doing this interpretability will be to pull back the veil, so to speak, on how these models are doing all the impressive stuff. Some theories of impact:

  • Maybe we'll become less confused about the nature of intelligence in a way that makes us just have better takes about alignment (e.g there will be many mechanistic theories of what the heck GPT-4 is doing that will have been conclusively ruled out)
  • Maybe once the paradigm shift happens, we will be better prepared to identify exactly what interpretability assumptions it broke (or even just notice whether some change is causing a mechanistic paradigm shift)

Unclear what timeline these later things happen on; probably depends a lot on when the paradigm shift(s) happen.

Comment by leogao on How to Better Report Sparse Autoencoder Performance · 2024-06-03T17:43:14.798Z · LW · GW

I usually look at log(downstream loss - original LM loss). But more broadly, there's nothing wrong with looking at log of some LM loss based term - all the scaling laws stuff does it.

Comment by leogao on When is Goodhart catastrophic? · 2024-06-03T09:57:14.460Z · LW · GW

For what it's worth, it seems much more likely to me for catastrophic Goodhart to happen because the noise isn't independent from the thing we care about, rather than the noise being independent but heavy tailed.

Comment by leogao on How to Better Report Sparse Autoencoder Performance · 2024-06-03T03:53:02.073Z · LW · GW

I've found the MSE-L0 (or downstream loss-L0) frontier plot to be much easier to interpret when both axes are in log space.
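
Something like this (made-up numbers, just to show the axis choice):

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy illustration with made-up numbers: the sparsity/reconstruction frontier
# is much easier to read with both axes on a log scale.
l0 = np.array([8, 16, 32, 64, 128, 256])
mse = np.array([0.40, 0.27, 0.18, 0.12, 0.08, 0.055])
plt.loglog(l0, mse, marker="o")
plt.xlabel("L0 (mean active latents per token)")
plt.ylabel("MSE (or downstream loss increase)")
plt.show()
```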

Comment by leogao on Please stop publishing ideas/insights/research about AI · 2024-05-03T00:37:23.851Z · LW · GW

I'm very sympathetic to the idea of being careful about publishing things that could spread capabilities ideas. However, I think there are several important things missing from your world model, which cause me to believe that following your advice would substantially hurt alignment progress.

(To be clear, none of this applies to alignment people working directly on capabilities, who should, like, not. Rather, this is about alignment researchers accidentally advancing capabilities by talking to capabilities people)

  • It's genuinely hard to come up with ideas that help capabilities a lot. I think you are severely underestimating how hard it is, and how much insight is required. I think one issue here is that most papers on arxiv are garbage and don't actually make any progress, but those papers are not the ones that are pushing AGI forward anyways.
  • Even if you try very hard to do so, it's still very hard to convince people that you're right if you don't have a ton of clout via a legible reputation of being right a lot. Everyone has an agenda they're convinced will solve AGI and is too busy trying to convince everyone else to work on their agenda.
  • High level ideas are generally not that valuable in and of themselves. People generally learn to ignore ideas unless they have strong empirical evidence of correctness (or endorsement of highly respected researchers) because there are simply too many ideas. The valuable thing is not the idea itself, but the knowledge of which ideas are actually correct.
  • I think deeply understanding top tier capabilities researchers' views on how to achieve AGI is actually extremely valuable for thinking about alignment. Even if you disagree on object level views, understanding how very smart people come to their conclusions is very valuable.
  • I think alignment discourse is greatly harmed by people being too scared to say things. When it bleeds over to being too scared to think about capabilities related topics for fear of accidentally generating something dangerous, I think this is even more harmful.

Comment by leogao on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-05-01T01:08:21.989Z · LW · GW

It doesn't seem like a huge deal to depend on the existence of smaller LLMs - they'll be cheap compared to the bigger one, and many LM series already contain smaller models. Not transferring between sites seems like a problem for any kind of reconstruction based metric because there's actually just differently important information in different parts of the model.

Comment by leogao on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-05-01T01:01:34.818Z · LW · GW

Sorry, I meant the Anthropic-like neuron resampling procedure.

I think I misread Neel's comment; I thought he was saying that 131k was chosen because larger autoencoders would have too many dead latents (as opposed to this only being for Pythia residual).

Comment by leogao on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-05-01T00:04:50.628Z · LW · GW

Another question: any particular reason to expect ablate-to-zero to be the most relevant baseline? In my experiments, I find ablate-to-zero to completely destroy the loss. So it's unclear whether 90% recovered on this metric actually means that much - GPT-2 probably recovers 90% of the loss of GPT-4 under this metric, but obviously GPT-2 only explains a tiny fraction of GPT-4's capabilities. I feel like a more natural measure may be, for example, the equivalent compute efficiency hit.

Comment by leogao on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-05-01T00:01:29.377Z · LW · GW

Got it - do you think with a bit more tuning the feature death at larger scale could be eliminated, or would it be tough to manage with the reinitialization approach?

Comment by leogao on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-05-01T00:00:05.505Z · LW · GW

Makes sense that the shift would be helpful

Comment by leogao on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-30T23:59:28.040Z · LW · GW

Thanks, that makes sense

Comment by leogao on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-26T01:28:42.734Z · LW · GW

Great paper! The gating approach is an interesting way to learn the JumpReLU threshold and it's exciting that it works well. We've been working on some related directions at OpenAI based on similar intuitions about feature shrinking.

Some questions:

  • Is b_mag still necessary in the gated autoencoder?
  • Did you sweep learning rates for the baseline and your approach?
  • How large is the dictionary of the autoencoder?

Comment by leogao on leogao's Shortform · 2024-03-27T20:37:39.281Z · LW · GW

philosophy: while the claims "good things are good" and "bad things are bad" at first appear to be compatible with each other, actually we can construct a weird hypothetical involving exact clones that demonstrates that they are fundamentally inconsistent with each other

law: could there be ambiguity in "don't do things that are bad as determined by a reasonable person, unless the thing is actually good?" well, unfortunately, there is no way to know until it actually happens

Comment by leogao on Modern Transformers are AGI, and Human-Level · 2024-03-27T01:42:26.391Z · LW · GW

I believe that the important part of generality is the ability to handle new tasks. In particular, I disagree that transformers are actually as good at handling new tasks as humans are. My mental model is that modern transformers are not general tools, but rather an enormous Swiss army knife with billions of specific tools that compose together to only a limited extent. (I think human intelligence is also a Swiss army knife and not the One True Tool, but it has many fewer tools that are each more general and more compositional with the other tools.)

I think this is heavily confounded because the internet is so huge that it's actually quite hard to come up with things that are not already on the internet. Back when GPT-3 first came out, I used to believe that widening the distribution to cover every task ever was a legitimate way to solve the generality problem, but I no longer believe this. (In particular, I think that view would have overestimated the trajectory of AI over the past 4 years.)

One way to see this is that the most interesting tasks are ones that nobody has ever done before. You can't just widen the distribution to include discovering the cure for cancer, or solving alignment. To do those things, you actually have to develop general cognitive tools that compose in interesting ways.

We spend a lot of time thinking about how human cognitive tools are flawed, which they certainly are compared to the true galaxy brain superintelligence. But while humans certainly don't generalize perfectly and there isn't a sharp line between "real reasoning" and "mere memorization", it's worth keeping in mind that we're literally pretrained on surviving in the wilderness and those cognitive tools can still adapt to pushing buttons on a keyboard to write code.

I think this effect is also visible on a day to day basis. When I learn something new - say, some unfamiliar new piece of math - I generally don't immediately fully internalize it. I can recall some words to describe it and maybe apply it in some very straightforward cases where it obviously pattern matches, but I don't really fully grok its implications and connections to other knowledge. Then, after simmering on it for a while, and using it to bump into reality a bunch, I slowly begin to actually fully internalize the core intuition, at which point I can start generating new connections and apply it in unusual ways.

(From the inside, the latter feels like fully understanding the concept. I think this is at least partly the underlying reason why lots of ML skeptics say that models "don't really understand" - the models do a lot of pattern matching things straightforwardly.)

To be clear, I agree with your argument that there is substantial overlap between the most understanding language models and the least understanding humans. But I think this is mostly not the question that matters for thinking about AI that can kill everyone (or prevent that).

Comment by leogao on All About Concave and Convex Agents · 2024-03-25T04:41:12.207Z · LW · GW

Well, if you make a convex misaligned AI, it will play the (metaphorical) lottery over and over again until 99.9999%+ of the time it has no power and resources left whatsoever. The smarter it is, the faster and more efficient it will be at achieving this outcome.

So unless the RNG gods are truly out to get you, in the long run you are exceedingly unlikely to actually encounter a convex misaligned AI that has accumulated any real amount of power.

Comment by leogao on All About Concave and Convex Agents · 2024-03-24T21:58:45.030Z · LW · GW

Thankfully, almost all of the time the convex agents end up destroying themselves by taking insane risks to concentrate their resources into infinitesimally likely worlds, so you will almost never have to barter with a powerful one.

(why not just call them risk seeking / risk averse agents instead of convex/concave?)

Comment by leogao on More people getting into AI safety should do a PhD · 2024-03-17T01:46:36.399Z · LW · GW

My personal anecdote as one of the no-undergrad people: I got into ML research on my own and published papers without much research mentorship, and then joined OpenAI. My background is definitely more in engineering than research, but I've spent a substantial amount of time exploring my own research directions. I get direct mentorship from my manager, but I also seek out advice from many other researchers in the organization, which I've found to be valuable.

My case is quite unusual, so I would caution about drawing generalized conclusions about what to do based on my experience.

Comment by leogao on leogao's Shortform · 2024-03-10T18:05:42.018Z · LW · GW

it's often stated that believing that you'll succeed actually causes you to be more likely to succeed. there are immediately obvious explanations for this - survivorship bias. obviously most people who win the lottery will have believed that buying lottery tickets is a good idea, but that doesn't mean we should take that advice. so we should consider the plausible mechanisms of action.

first, it is very common for people with latent ability to underestimate their latent ability. in situations where the cost of failure is low, it seems net positive to at least take seriously the hypothesis that you can do more than you think you can. (also keeping in mind that we often overestimate the cost of failure). there are also deleterious mental health effects to believing in a high probability of failure, and then bad mental health does actually cause failure - it's really hard to give something your all if you don't really believe in it.

belief in success also plays an important role in signalling. if you're trying to make some joint venture happen, you need to make people believe that the joint venture will actually succeed (opportunity costs exist). when assessing the likelihood of success of the joint venture, people will take many pieces of information into account: your track record, the opinions of other people with a track record, object level opinions on the proposal, etc.

being confident in your own venture is an important way of putting your "skin in the game" to vouch that it will succeed. specifically, the way this is supposed to work is that you get punished socially for being overconfident, so you have an incentive to only really vouch for things that really will work. in practice, in large parts of the modern world overconfidence is penalized less than we're hardwired to expect. sometimes this is due to regions with cultural acceptance and even embrace of risky bets (SV), or because of atomization of modern society making the effects of social punishment less important.

this has both good and bad effects. it's what enables innovation, because that fundamentally requires a lot of people to play the research lottery. if you're not willing to work on something that will probably fail but also will pay out big if it succeeds, it's very hard to innovate. research consists mostly of people who are extremely invested in some research bet, to the point where it's extremely hard to convince them to pivot if it's not working out. ditto for startups, which are probably the archetypal example of both innovation and also of catastrophic overconfidence.

this also creates problems - for instance, it enables grifting because you don't actually have to be correct if you just claim that your idea will work, and then when it inevitably fails you can just say that this is par for the course. also, being systematically overconfident can cause suboptimal decision making where calibration actually is important.

because many talented people are underequipped with confidence (there is probably some causal mechanism here - technical excellence often requires having a very mechanistic mental model of the thing you're doing, rather than just yoloing it and hoping it works), it also creates a niche for middlemen to supply confidence as a service, aka leadership. in the ideal case, this confidence is supplied by people who are calibratedly confident because of experience, but the market is inefficient enough that even people who are not calibrated can supply confidence. another way to view this is that leaders deliver the important service of providing certainty in the face of an uncertain world.

(I'm using the term middleman here in a sense that doesn't necessarily imply that they deliver no value - in fact, causing things to happen can create lots of value, and depending on the specifics this role can be very difficult to fill. but they aren't the people who do the actual technical work. it is of course also valuable for the leader to e.g be able in theory to fill any of the technical roles if needed, because it makes them more able to spend their risk budget on the important technical questions, it creates more slack and thereby increases the probability of success, and the common knowledge of the existence of this slack itself also increases the perceived inevitability of success)

a similar story also applies at the suprahuman level, of tribes or ideologies. if you are an ideology, your job is unfortunately slightly more complicated. on the one hand, you need to project the vibe of inevitable success so that people in other tribes feel the need to get in early on your tribe, but on the other hand you need to make your tribe members feel like every decision they make is very consequential for whether the tribe succeeds. if you're merely calibrated, then only one of the two can be true. different social technologies are used by religions, nations, political movements, companies, etc to maintain this paradox.

Comment by leogao on leogao's Shortform · 2024-03-10T17:02:06.153Z · LW · GW

I make no claim to fungibility or lack of value created by middlemen.

Comment by leogao on leogao's Shortform · 2024-03-09T16:34:57.032Z · LW · GW

an example: open source software produces lots of value. this value is partly captured by consumers who get better software for free, and partly by businesses that make more money than they would otherwise.

the most clear cut case is that some businesses exist purely by wrapping other people's open source software, doing advertising and selling it for a handsome profit; this makes the analysis simpler, though to be clear the vast majority of cases are not this egregious.

in this situation, the middleman company is in fact creating value (if a piece of software is created in a forest with no one around to use it, does it create any value?) by using advertising to cause people to get value from software. in markets where there are consumers clueless enough to not know about the software otherwise (e.g legacy companies), this probably does actually create a lot of counterfactual value. however, most people would agree that the middleman getting 90% of the created value doesn't satisfy our intuitive notion of fairness. (open source developers are more often trying to have the end consumers benefit from better software, not for random middlemen to get rich off their efforts)

and if advertising is commoditized, then this problem stops existing (you can't extract that much value as an advertising middleman if there is an efficient market with 10 other competing middlemen), and so most of the value does actually accrue to the end user.

Comment by leogao on Vote on Anthropic Topics to Discuss · 2024-03-09T13:25:54.698Z · LW · GW

[meta comment] maybe comments that are also poll options should be excluded from popular comments, visibly differently on profile pages, etc to remove the need to say things like "[This comment is present for voting purposes, it does not represent my opinions, see the OP for context.]"

Comment by leogao on leogao's Shortform · 2024-03-09T12:51:36.974Z · LW · GW

of course, this is more a question about equilibria than literal transactions. suppose you capture most of the value and then pay it back out to users as a dividend: the users now have more money with which they could pay a middleman, and a middleman that could have extracted some amount of value originally can still extract that amount of value in this new situation.

we can model this as a game of ultimatum between the original value creator and the middlemen. if the participation of the OVC and middleman are both necessary, the OVC can bargain for half the value in an iterated game / as FDT agents. however, we usually think of the key differentiating factor between the OVC and middlemen as the middlemen being more replaceable, so the OVC should be able to bargain for a lot more. (see also: commoditizing your complement)

so to ensure that the end users get most of the value, you need to either ensure that all middleman roles are commoditized, or precommit to only provide value in situations where the end user can actually capture most of the value

Comment by leogao on leogao's Shortform · 2024-03-09T12:31:58.539Z · LW · GW

any time someone creates a lot of value without capturing it, a bunch of other people will end up capturing the value instead. this could be end consumers, but it could also be various middlemen. it happens not infrequently that someone decides not to capture the value they produce in the hopes that the end consumers get the benefit, but in fact the middlemen capture the value instead

Comment by leogao on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-07T00:20:50.036Z · LW · GW

In my experiments log L0 vs log unexplained variance should be a nice straight line. I think your autoencoders might be substantially undertrained (especially given that training longer moves off the frontier a lot). Scaling up the data by 10x or 100x wouldn't be crazy. 

(Also, I think L0 is more meaningful than L0 / d_hidden for comparing across different d_hidden (I assume that's what "percent active features" is))
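
(For the y axis, one common definition - and my assumption of what's meant by unexplained variance here - is the fraction of variance unexplained:)

```python
import torch

# One common way (my assumption of what's meant here) to compute the unexplained
# variance of an SAE reconstruction: residual variance divided by the variance of
# the mean-centered activations, aka fraction of variance unexplained (FVU).
def fraction_variance_unexplained(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(dim=0)) ** 2).sum()
    return (residual / total).item()
```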

Comment by leogao on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-05T23:51:43.187Z · LW · GW

Fwiw, I find it's much more useful to have (log) active features on the x axis, and (log) unexplained variance on the y axis. (if you want you can then also plot the L1 coefficient above the points, but that seems less important)

Comment by leogao on Can we get an AI to do our alignment homework for us? · 2024-02-27T00:04:53.883Z · LW · GW

My mental model is that there is an entire space of possible AIs, each with some capability level and alignability level. Given the state of the alignment field, there is some alignability ceiling, below which we can reliably align AIs. Right now, this ceiling is very low, but we can push it higher over time.

At some capability level, the AI is powerful enough to solve alignment of a more capable AI, which can then solve alignment for even more capable AI, etc all the way up. However, even the most alignable AI capable of this is still potentially very hard to align. There will of course be more alignable and less capable AIs too, but they will not be capable enough to actually kick off this bucket chain.

Then the key question is whether there will exist an AI that is both alignable and capable enough to start the bucket chain. This is a function of both (a) the shape of the space of AIs (how quickly do models become unalignable as they become more capable?) and (b) how good we become at solving alignment. Opinions differ on this - my personal opinion is that probably this first AI is pretty hard to align, so we're pretty screwed, though it's still worth a try.

Comment by leogao on Do sparse autoencoders find "true features"? · 2024-02-23T04:47:19.033Z · LW · GW

In the limit of infinite SAE width and infinite (iid) training data, you can get perfect reconstruction and perfect sparsity (both L0 and L1). We can think of this as maximal feature splitting. Obviously, this is undesirable, because you've discarded all of the structure present in your data.

Therefore, reconstruction and sparsity aren't exactly the thing we most fundamentally care about. It just happens to do something reasonable at practical scales. However, that doesn't mean we have to throw it out - we might hope that it gives us enough of a foothold in practice.

In particular, the maximal feature splitting case requires exponentially many latents. We might believe that in practice, on the spectrum from splitting too little (polysemanticity) to splitting too much, erring on the side of splitting too much is preferable, because we can still do circuit finding and so on if we artificially cut some existing features into smaller pieces.

Comment by leogao on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-14T22:23:00.201Z · LW · GW

For the dashboards, did you filter out the features that fire less frequently? I looked through a few and didn't notice any super low density ones.

Comment by leogao on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-14T10:17:41.764Z · LW · GW

For your dashboards, how many tokens are you retrieving the top examples from?

Comment by leogao on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-10T03:07:45.498Z · LW · GW

Why do you scale your MSE by 1/(x_centred**2).sum(dim=-1, keepdim=True).sqrt() ? In particular, I'm confused about why you have the square root. Shouldn't it just be 1/(x_centred**2).sum(dim=-1, keepdim=True)?
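
To spell out the two options being compared (dummy tensors and my reading of the intent, not the repo's actual code):

```python
import torch

# Dummy tensors just to show the two normalizations being compared (my reading
# of the intent; not the repo's actual code).
x_centred = torch.randn(4, 768)          # activations minus their mean
x_hat = torch.randn(4, 768)              # SAE reconstruction
mse = ((x_hat - x_centred) ** 2).sum(dim=-1, keepdim=True)

scale_with_sqrt = (x_centred ** 2).sum(dim=-1, keepdim=True).sqrt()  # what the code currently does
scale_without_sqrt = (x_centred ** 2).sum(dim=-1, keepdim=True)      # what I'd have expected
loss_a = (mse / scale_with_sqrt).mean()
loss_b = (mse / scale_without_sqrt).mean()
```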