Posts

Anthropic release Claude 3, claims >GPT-4 Performance 2024-03-04T18:23:54.065Z
Sam Altman fired from OpenAI 2023-11-17T20:42:30.759Z
Open Phil releases RFPs on LLM Benchmarks and Forecasting 2023-11-11T03:01:09.526Z
What I would do if I wasn’t at ARC Evals 2023-09-05T19:19:36.830Z
Long-Term Future Fund Ask Us Anything (September 2023) 2023-08-31T00:28:13.953Z
Meta announces Llama 2; "open sources" it for commercial use 2023-07-18T19:28:28.685Z
Should we publish mechanistic interpretability research? 2023-04-21T16:19:40.514Z
[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques 2023-03-16T16:38:33.735Z
Natural Abstractions: Key claims, Theorems, and Critiques 2023-03-16T16:37:40.181Z
Sam Altman: "Planning for AGI and beyond" 2023-02-24T20:28:00.430Z
Meta "open sources" LMs competitive with Chinchilla, PaLM, and code-davinci-002 (Paper) 2023-02-24T19:57:24.402Z
Behavioral and mechanistic definitions (often confuse AI alignment discussions) 2023-02-20T21:33:01.499Z
Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic) 2023-02-16T19:47:20.696Z
GPT-175bee 2023-02-08T18:58:01.364Z
OpenAI/Microsoft announce "next generation language model" integrated into Bing/Edge 2023-02-07T20:38:08.726Z
Evaluations (of new AI Safety researchers) can be noisy 2023-02-05T04:15:02.117Z
The Alignment Problem from a Deep Learning Perspective (major rewrite) 2023-01-10T16:06:05.057Z
Paper: Superposition, Memorization, and Double Descent (Anthropic) 2023-01-05T17:54:37.575Z
Touch reality as soon as possible (when doing machine learning research) 2023-01-03T19:11:58.915Z
Shard Theory in Nine Theses: a Distillation and Critical Appraisal 2022-12-19T22:52:20.031Z
Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) 2022-12-16T22:12:54.461Z
Paper: Transformers learn in-context by gradient descent 2022-12-16T11:10:16.872Z
Causal scrubbing: results on induction heads 2022-12-03T00:59:18.327Z
Causal scrubbing: results on a paren balance checker 2022-12-03T00:59:08.078Z
Causal scrubbing: Appendix 2022-12-03T00:58:45.850Z
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 2022-12-03T00:58:36.973Z
Paper: In-context Reinforcement Learning with Algorithm Distillation [Deepmind] 2022-10-26T18:45:02.737Z
LawrenceC's Shortform 2022-10-08T17:17:28.904Z
Paper: Discovering novel algorithms with AlphaTensor [Deepmind] 2022-10-05T16:20:11.984Z
Language models seem to be much better than humans at next-token prediction 2022-08-11T17:45:41.294Z
High-stakes alignment via adversarial training [Redwood Research report] 2022-05-05T00:59:18.848Z
Book Review: Discrete Mathematics and Its Applications (MIRI Course List) 2015-04-14T09:08:38.981Z

Comments

Comment by LawrenceC (LawChan) on LawrenceC's Shortform · 2024-03-24T05:37:57.210Z · LW · GW

Yeah, "strongest" doesn't mean "strong" here! 

Comment by LawrenceC (LawChan) on LawrenceC's Shortform · 2024-03-20T03:18:12.779Z · LW · GW

I mean, yeah, as your footnote says:

Another simpler but less illuminating way to put this is that higher serial reasoning depth can't be parallelized.[1]

Transformers do get more computation per token on longer sequences, but they also don't get more serial depth, so I'm not sure if this is actually an issue in practice?

 

  1. ^

    [C]ompactly represent f∘g (f composed with g) in a way that makes computing f(g(x)) more efficient for general choices of f and g.

    As an aside, I actually can't think of any class of interesting functions with this property -- when reading the paper, the closest I could think of are functions on discrete sets (lol), polynomials (but simplifying these is often more expensive than just computing the terms serially), and rational functions (ditto).

     

Comment by LawrenceC (LawChan) on LawrenceC's Shortform · 2024-03-20T02:42:24.465Z · LW · GW

I finally got around to reading the Mamba paper. H/t Ryan Greenblatt and Vivek Hebbar for helpful comments that got me unstuck. 

TL;DR: authors propose a new deep learning architecture for sequence modeling with scaling laws that match transformers while being much more efficient to sample from.

A brief historical digression

As of ~2017, the three primary approaches people had to sequence modeling were RNNs, Conv Nets, and Transformers, each with a unique “trick” for handling sequence data: recurrence, 1d convolutions, and self-attention.
 

  • RNNs are easy to sample from — to compute the logit for x_t+1, you only need the most recent hidden state h_t and the last token x_t, which means it’s both fast and memory efficient. RNNs generate a sequence of length L with O(1) memory and O(L) time. However, they’re super hard to train, because you need to sequentially generate all the hidden states and then (reverse) sequentially calculate the gradients. The way you actually did this is called backpropagation through time — you basically unroll the RNN over time — which requires constructing a graph of depth equal to the sequence length. Not only was this slow, but the graph being so deep caused vanishing/exploding gradients without careful normalization. The strategy that people used was to train on short sequences and finetune on longer ones. That being said, in practice, this meant you couldn’t train on long sequences (>a few hundred tokens) at all. The best LSTMs for modeling raw audio could only handle being trained on ~5s of speech, if you chunk up the data into 25ms segments.
  • Conv Nets had a fixed receptive field size and pattern, so weren’t that suited for long sequence modeling. Also, generating each token takes O(L) time, assuming the receptive field is about the same size as the sequence. But they had significantly more stability (the depth was small, and could be as low as O(log(L))), which meant you could train them a lot easier. (Also, you could use FFT to efficiently compute the conv, meaning it trains one sequence in O(L log(L)) time.) That being said, you still couldn’t make them that big. The most impressive example was DeepMind’s WaveNet, a conv net used to model human speech, which could handle sequences of up to 4800 samples … which was 0.3s of actual speech at 16k samples/second (note that most audio is sampled at 44k samples/second…), and even to get to that amount, they had to really gimp the model’s ability to focus on particular inputs.
  • Transformers are easy to train, can handle variable length sequences, and also allow the model to “decide” which tokens it should pay attention to. In addition to both being parallelizable and having relatively shallow computation graphs (like conv nets), you could do the RNN trick of pretraining on short sequences and then finetuning on longer sequences to save even more compute. Transformers could be trained with comparable sequence length to conv nets but get much better performance; for example, OpenAI’s MuseNet was trained on length-4096 sequences from MIDI files. But as we all know, transformers have the unfortunate downside of being expensive to sample from — it takes O(L) time and O(L) memory to generate a single token (!). (A minimal toy sketch contrasting the per-token sampling costs of RNNs and transformers is below this list.)
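
To make the sampling-cost contrast concrete, here is a minimal toy sketch (mine, not from the Mamba paper; the "attention" is just a single dot-product head over raw embeddings, and all sizes are arbitrary): the RNN carries only a fixed-size hidden state between steps, while the transformer-style loop keeps and re-attends over a cache that grows with the sequence.

import numpy as np

# Toy per-token sampling loops (illustrative only). The RNN keeps O(1) state and does
# O(1) work per token; the attention loop keeps an O(L) cache and does O(t) work at step t.
d, V, L = 16, 50, 100                      # hidden size, vocab size, sequence length (arbitrary)
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(d, d)) / d**0.5
W_out = rng.normal(size=(V, d)) / d**0.5
E = rng.normal(size=(V, d))                # token embeddings

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_generate():
    h, tok = np.zeros(d), 0
    for _ in range(L):
        h = np.tanh(W_hh @ h + E[tok])     # only h is carried between steps
        tok = int(np.argmax(W_out @ h))
    return tok

def attention_generate():
    cache, tok = [], 0                     # cache grows linearly with position
    for _ in range(L):
        cache.append(E[tok])
        K = np.stack(cache)
        weights = softmax(K @ E[tok] / d**0.5)   # attend over the entire prefix: O(t) work
        ctx = weights @ K
        tok = int(np.argmax(W_out @ ctx))
    return tok

rnn_generate(); attention_generate()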

The better performance of transformers over conv nets and their ability to handle variable length data let them win out.

That being said, people have been trying to get around the O(L) time and memory requirements for transformers since basically their inception. For a while, people were super into sparse or linear attention of various kinds, which could reduce the per-token compute/memory requirements to O(log(L)) or O(1).

The what and why of Mamba

If the input -> hidden and hidden -> hidden map for RNNs were linear (h_t+1 = A h_t + B x_t), then it’d be possible to train an entire sequence in parallel — this is because you can just … compose the transformation with itself (computing A^k for k in 2…L-1) a bunch, and effectively unroll the graph with the convolutional kernel defined by B, A B, A^2 B, … A^{L-1} B. Not only can you use the FFT during training to get the O(L log (L)) time of a conv net forward/backward pass (as opposed to O(L^2) for the transformer), you still keep the O(1) sampling time/memory of the RNN!
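
Here is a minimal numerical check of that equivalence (my own sketch, not from the paper; the matrices and sizes are made up): a linear, time-invariant recurrence computed sequentially gives exactly the same outputs as a convolution with the kernel [B, AB, A^2 B, ...] (read out through C).

import numpy as np

# Sequential vs convolutional view of y_t = C h_t, h_t = A h_{t-1} + B x_t.
rng = np.random.default_rng(0)
d, L = 4, 32                          # hidden size, sequence length (arbitrary)
A = rng.normal(size=(d, d)) * 0.3     # scaled down so powers of A stay bounded
B = rng.normal(size=(d, 1))
C = rng.normal(size=(1, d))
x = rng.normal(size=L)

# 1) Sequential (RNN-style) computation
h = np.zeros((d, 1))
y_seq = []
for t in range(L):
    h = A @ h + B * x[t]
    y_seq.append((C @ h).item())

# 2) Convolution with the kernel K_k = C A^k B
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)])
y_conv = [float(np.dot(K[: t + 1][::-1], x[: t + 1])) for t in range(L)]

assert np.allclose(y_seq, y_conv)     # both views give the same outputs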

The problem is that linear hidden state dynamics are kinda boring. For example, you can’t even learn to update your existing hidden state in a different way if you see particular tokens! And indeed, previous results gave scaling laws that were much worse than transformers in terms of performance/training compute. 

In Mamba, you basically learn a time-varying A and B. The parameterization is a bit wonky here, for historical reasons, but it goes something like: A_t = exp(-\delta(x_t) * exp(A)), B_t = \delta(x_t) B, where \delta(x_t) = softplus(W_\delta x_t), and the update is h_t+1 = A_t h_t + B_t x_t. Also note that in Mamba, they constrain A to be diagonal and W_\delta to be low rank, for computational reasons.

Since exp(A) is diagonal and has only positive entries, we can interpret the model as follows: \delta controls how much to “learn” from the current example — with high \delta, A_t approaches 0 and B_t is large, causing h_t+1 ~= B_t x_t, while with \delta approaching 0, A_t approaches 1 and B_t approaches 0, meaning h_t+1 ~= h_t.
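
A rough numerical sketch of this gating interpretation (my own, using scalars in place of the diagonal matrices, with made-up values for a and b):

import numpy as np

# h_{t+1} = A_t * h_t + B_t * x_t with A_t = exp(-delta * exp(a)), B_t = delta * b.
# Large delta: A_t -> 0 and B_t grows, so the state is overwritten by the current input.
# Small delta: A_t -> 1 and B_t -> 0, so the state is carried through unchanged.
def step(h, x, delta, a=-1.0, b=1.0):
    A_t = np.exp(-delta * np.exp(a))   # exp(a) > 0, so A_t lies in (0, 1]
    B_t = delta * b
    return A_t * h + B_t * x

h, x = 1.0, 5.0
for delta in [1e-3, 1.0, 10.0]:        # in Mamba, delta would come from softplus(W_delta x_t)
    print(delta, step(h, x, delta))    # ~1.0 (keep h), in between, ~50 (~B_t * x)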

Now, you can’t exactly unroll the hidden state as a convolution with a predefined convolution kernel anymore, but you can still efficiently compute the implied “convolution” using parallel scanning.
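
A sketch of why this still parallelizes (my own toy version with scalar A_t, not the paper's hardware-aware implementation): each step is an affine map of the hidden state, affine maps compose associatively, so all prefix states fall out of an associative scan. Here it is written as recursive doubling, which would take O(log L) parallel rounds.

import numpy as np

# Time-varying recurrence h_t = A_t h_{t-1} + b_t, with b_t standing in for B_t x_t.
# Each step is the affine map h -> A_t h + b_t, and these maps compose associatively:
#   (A2, b2) after (A1, b1) = (A2*A1, A2*b1 + b2)
rng = np.random.default_rng(0)
L = 8
A = rng.uniform(0.5, 1.0, size=L)   # stand-ins for the input-dependent A_t
b = rng.normal(size=L)              # stand-ins for B_t x_t

def combine(f, g):
    # Compose affine maps: apply f first, then g.
    (A1, b1), (A2, b2) = f, g
    return (A2 * A1, A2 * b1 + b2)

# Sequential reference, starting from h = 0
h, h_seq = 0.0, []
for t in range(L):
    h = A[t] * h + b[t]
    h_seq.append(h)

# Inclusive scan by recursive doubling (each round could run elementwise in parallel)
maps = [(A[t], b[t]) for t in range(L)]
shift = 1
while shift < L:
    maps = [combine(maps[t - shift], maps[t]) if t >= shift else maps[t] for t in range(L)]
    shift *= 2
h_scan = [b_t for (_, b_t) in maps]   # prefix map applied to the initial state 0 is just its offset

assert np.allclose(h_seq, h_scan)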

Despite being much cheaper to sample from, Mamba matches the pretraining flops efficiency of modern transformers (Transformer++ = the current SOTA open source Transformer with RMSNorm, a better learning rate schedule, and corrected AdamW hyperparameters, etc.). And on a toy induction task, it generalizes to much longer sequences than it was trained on.

So, about those capability externalities from mech interp...

Yes, those are the same induction heads from the Anthropic ICL paper! 

Like the previous HiPPO and Hyena papers, they cite mech interp as one of their inspirations, in that it inspired them to think about what the linear hidden state model could not model and how to fix that. I still don’t think mech interp has that much Shapley here (the idea of studying how models perform toy tasks is not new, and the authors don't even use the induction metric or the RRT task from the Olsson et al. paper), but I'm not super sure on this.

IMO, this line of work is the strongest argument for mech interp (or maybe interp in general) having concrete capabilities externalities. In addition, I think the previous argument Neel and I gave of "these advances are extremely unlikely to improve frontier models" feels substantially weaker now.

Is this a big deal?

I don't know, tbh.

Comment by LawrenceC (LawChan) on LawrenceC's Shortform · 2024-03-13T16:47:42.773Z · LW · GW

That seems correct, at least directionally, yes.

Comment by LawrenceC (LawChan) on LawrenceC's Shortform · 2024-03-13T01:58:19.126Z · LW · GW

I don't want to say things that have any chance of annoying METR without checking with METR comm people, and I don't think it's worth their time to check the things I wanted to say. 

Comment by LawrenceC (LawChan) on Transfer learning and generalization-qua-capability in Babbage and Davinci (or, why division is better than Spanish) · 2024-03-12T09:20:09.415Z · LW · GW

I'm not sure your results really support the interpretation that davinci "transfers less well". Notably, improving from 50% accuracy to 100% is often a lot harder than improving from 0% (or whatever random chance is on your datasets -- I haven't looked through your code yet to examine them) to 50%, and I'd predict that davinci already does pretty well zero-shot (w/ no finetuning) on most of the tasks you consider here, which limits its improvement from finetuning, since you can't get above 100% accuracy.

In addition, larger LMs are often significantly more data efficient, so you'd predict that they need less total finetuning to do well on tasks (and therefore the additional finetuning on related tasks would benefit the larger models less). 

Comment by LawrenceC (LawChan) on LawrenceC's Shortform · 2024-03-12T08:41:47.400Z · LW · GW

How my views on AI(S) have changed over the last 5.5 years

This was ~~shamelessly copied from~~ directly inspired by Erik Jenner's "How my views on AI have changed over the last 1.5 years". I think my views when I started my PhD in Fall 2018 look a lot worse than Erik's when he started his PhD, though in large part due to starting my PhD in 2018 and not 2022.

Apologies for the disorganized bullet points. If I had more time I would've written a shorter shortform.

AI Capabilities/Development

Summary: I used to believe in a 2018-era MIRI worldview for AGI, and now I have updated toward slower takeoff, fewer insights, and shorter timelines. 

  • In Fall of 2018, my model of how AGI might happen was substantially influenced by AlphaGo/Zero, which features explicit internal search. I expected future AIs to also feature explicit internal search over world models, and be trained mainly via reinforcement learning or IDA. I became more uncertain after OpenAI 5 (~May 2018), which used no clever techniques and just featured BPTT being run on large LSTMs. 
  • That being said, I did not believe in the scaling hypothesis -- that is, that simply training larger models on more inputs would continually improve performance until we see "intelligent behavior" -- until GPT-2 (2019), despite encountering it significantly earlier (e.g. with OpenAI 5, or speaking to OAI people). 
  • In particular, I believed that we needed many "key insights" about intelligence before we could make AGI. This both gave me longer timelines and also made me believe more in fast take-off. 
  • I used to believe pretty strongly in MIRI-style fast take-off (e.g. would've assigned <30% credence that we see a 4 year period with the economy doubling) as opposed to (what was called at the time) Paul-style slow take-off. Given the way the world has turned out, I have updated substantially. While I don't think that AI development will be particularly smooth, I do expect it to be somewhat incremental, and I also expect earlier AIs to provide significantly more value even before truly transformative AI. 
  • -- Some beliefs about AI Scaling Labs that I'm redacting on LW --
  • My timelines are significantly shorter -- I would've probably said median 2050-60 in 2018, but now I think we will probably reach human-level AI by 2035. 

AI X-Risk

Summary: I have become more optimistic about AI X-risk, but my understanding has become more nuanced. 

  • My P(Doom) has substantially decreased, especially P(Doom) attributable to an AI directly killing all of humanity. This is somewhat due to having more faith that many people will be reasonable (in 2018, there were maybe ~20 FTE AIS researchers, now there are probably something like 300-1000 depending on how you count), somewhat due to believing that governance efforts may successfully slow down AGI substantially, and somewhat due to an increased belief that "winging-it"-style, "unprincipled" solutions can scale to powerful AIs. 
  • That being said, I'm less sure about what P(Doom) means. In 2018, I imagined the main outcomes were either "unaligned AGI instantly defeats all of humanity" and "a pure post-scarcity utopia". I now believe in a much wider variety of outcomes.  
  • For example, I've become more convinced both that misuse risk is larger than I thought, and that even weirder outcomes are possible (e.g. the AI keeps human (brain scans) around due to trade reasons). The former is in large part related to my belief in fast take-off being somewhat contradicted by world events; now there is more time for powerful AIs to be misused. 
  • I used to think that solving the technical problem of AI alignment would be necessary/sufficient to prevent AI x-risk. I now think that we're unlikely to "solve alignment" in a way that leads to the ability to deploy a powerful Sovereign AI (without AI assistance), and also that governance solutions both can be helpful and are required. 

AI Safety Research

Summary: I've updated slightly downwards on the value of conceptual work and significantly upwards on the value of fast empirical feedback cycles. I've become more bullish on (mech) interp, automated alignment research, and behavioral capability evaluations. 

  • In Fall 2018, I used to think that IRL for ambitious value learning was one of the most important problems to work on. I no longer think so, and think that most of my work on this topic was basically useless. 
  • In terms of more prosaic IRL problems, I very much lived in a frame of "the reward models are too dumb to understand" (a standard academic take). I didn't think much about issues of ontology identification or (malign) partial observability. 
  • I thought that academic ML theory had a decent chance of being useful for alignment. I think it's basically been pretty useless in the past 5.5 years, and I no longer think the chances of it being helpful "in time" are high enough. It's not clear how much of this is because the AIS community did not really know about the academic ML theory work, but man, the bounds turned out to be pretty vacuous, and empirical work turned out far more informative than pure theory work.
  • I still think that conceptual work is undervalued in ML, but my prototypical good conceptual work looks less like "prove really hard theorems" or "think about philosophy" and a lot more like "do lots of cheap and quick experiments/proof sketches to get grounding". 
  • Relatedly, I used to dismiss simple techniques for AI Alignment that try "the obvious thing". While I don't think these techniques will scale (or even necessarily work well on current AIs), this strategy has turned out to be significantly better in practice than I thought. 
  • My error bars around the value of reading academic literature have shrunk significantly (in large part due to reading a lot of it). I've updated significantly upwards on "the academic literature will probably contain some relevant insights" and downwards on "the missing component of all of AGI safety can be found in a paper from 1983". 
  • I used to think that interpretability of deep neural networks was probably infeasible to achieve "in time" if not "actually impossible" (especially mechanistic interpretability). Now I'm pretty uncertain about its feasibility. 
  • Similarly, I used to think that having AIs automate substantial amounts of alignment research was not possible. Now I think that most plans with a shot of successfully preventing AGI x-risk will feature substantial amounts of AI.
  • I used to think that behavioral evaluations in general would be basically useless for AGIs. I now think that dangerous capability evaluations can serve as an important governance tool. 

Bonus: Some Personal Updates

Summary: I've better identified my comparative advantages, and have a healthier way of relating to AIS research. 

  • I used to think that my comparative advantage was clearly going to be in doing the actual technical thinking or theorem proving. In fact, I used to believe that I was unsuited for both technical writing and pushing projects over the finish line. Now I think that most of my value in the past ~2 years has come from technical writing or by helping finish projects.
  • I used to think that pure engineering or mathematical skill were what mattered, and feel sad about how it seemed that my comparative advantage was something akin to long term memory.[1] I now see more value in having good long-term memory.
  • I used to be uncertain about if academia was a good place for me to do research. Now I'm pretty confident it's not.
  • Embarrassingly enough, in 2018 I used to implicitly believe quite strongly in a binary model of "you're good enough to do research" vs "you're not good enough to do research". In addition, I had an implicit model that the only people "good enough" were those who never failed at any evaluation. I no longer think this is true. 
  • I am more of a fan of trying obvious approaches or "just doing the thing".

 

  1. ^

    I think, compared to the people around me, I don't actually have that much "raw compute" or even short term memory (e.g. I do pretty poorly on IQ tests or novel math puzzles), and am able to perform at a much higher level by pattern matching and amortizing thinking using good long-term memory (if not outsourcing it entirely by quoting other people's writing). 

Comment by LawrenceC (LawChan) on Natural Latents: The Math · 2024-03-09T01:04:57.521Z · LW · GW

Right, the step I missed was that P(X|Y=y) = P(X|Z=z) for all y, z implies P(X|Z) = P(X). Thanks!

Comment by LawrenceC (LawChan) on Natural Latents: The Math · 2024-03-08T22:06:22.670Z · LW · GW

Hm, it sounds like you're claiming that if each pair of x, y, z are pairwise independent conditioned on the third variable, and p(x, y, z) =/= 0 for all x, y, z with nonzero p(x), p(y), p(z), then X, Y, and Z are mutually independent?

I tried for a bit to show this but couldn't prove it, let alone the general case without strong invariance. My guess is I'm probably missing something really obvious. 

Comment by LawrenceC (LawChan) on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-08T19:02:56.310Z · LW · GW

I agree that GSM8K has been pretty saturated (for the best frontier models) since ~GPT-4, and GPQA is designed to be a hard-to-saturate benchmark (though given the pace of progress...). 

But why are HumanEval and MMLU also considered saturated? E.g. Opus and 4-Turbo are both significantly better than all other publicly known models on both benchmarks. And at least for HumanEval, I don't see why >95% accuracy isn't feasible. 

It seems plausible that MMLU/HumanEval could be saturated after GPT-4.5 or Gemini 1.5 Ultra, at least for the best frontier models. And it seems fairly likely we'll see them saturated in 2-3 years. But it seems like a stretch to call them saturated right now. 

Is the reasoning for this that Opus gets only 0.4% better on MMLU than the March GPT-4? That seems like pretty invalid reasoning, akin to deducing that because two runners achieve the same time, that time must be the best human-achievable time. And this doesn't apply to HumanEval, where Opus gets ~18% better than March GPT-4 and the November 4-Turbo gets 2.9% better than Opus. 

Comment by LawrenceC (LawChan) on Natural Latents: The Math · 2024-03-08T18:49:18.520Z · LW · GW

Probabilities of zero are extremely load-bearing for natural latents in the exact case, and probabilities near zero are load-bearing in the approximate case; if the distribution is zero nowhere, then it can only have a natural latent if the X_i's are all independent (in which case the trivial variable is a natural latent).

I'm a bit confused why this is the case. It seems like in the theorems, the only thing "near zero" is that D_KL(joint, factorized) < epsilon ~= 0. But you can satisfy this quite easily even with all probabilities > 0. 

E.g. the trivial case where all variables are completely independent satisfies all the conditions of your theorem, but can clearly have every probability > 0. Even in nontrivial cases, this is pretty easy (e.g. by mixing in irreducible noise with every variable).
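
As a concrete check of the trivial case (my own sketch, with made-up marginals): fully independent binary variables with strictly positive marginals satisfy pairwise conditional independence while having every joint probability strictly positive.

import numpy as np

# Independent X, Y, Z with all-positive marginals.
px = np.array([0.2, 0.8])
py = np.array([0.5, 0.5])
pz = np.array([0.3, 0.7])
p = px[:, None, None] * py[None, :, None] * pz[None, None, :]   # p(x, y, z), all entries > 0

# Check X is independent of Y given Z: p(x, y | z) == p(x | z) p(y | z) for every z.
# (The other two pairwise conditions hold by the same argument, since the construction is symmetric.)
for z in range(2):
    p_xy_given_z = p[:, :, z] / p[:, :, z].sum()
    p_x_given_z = p_xy_given_z.sum(axis=1)
    p_y_given_z = p_xy_given_z.sum(axis=0)
    assert np.allclose(p_xy_given_z, np.outer(p_x_given_z, p_y_given_z))

assert (p > 0).all()   # no zero probabilities anywhere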

Comment by LawrenceC (LawChan) on On Claude 3.0 · 2024-03-06T20:12:50.690Z · LW · GW

I'd like to caveat the comment you quoted above:

Also worth noting that Claude 3 does not substantially advance the LLM capabilities frontier! [..]

I wrote that before I had the chance to try replacing GPT-4 with Claude 3 in my daily workflow, based on its LLM benchmark scores compared to gpt-4-turbo variants. After having used it for a full day, I do feel like Claude 3 has noticeable advantages over GPT-4 in ways that aren't captured by said benchmarks. So while I stand behind my claim that it "does not substantially advance the LLM capabilities frontier", I do think that Claude 3 Opus is advancing the frontier at least a little. 

In my experience, it seems to be noticeably better on coding and mathematical reasoning tasks, which was surprising to me given that it does worse on HumanEval and MATH. I guess they focused on delivering practically useful intelligence as opposed to optimizing for the benchmarks? (Or even optimized against the benchmarks?) 

(EDIT: it’s also much better at convincing me that its made up math is real, lol)

Comment by LawrenceC (LawChan) on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-06T03:27:33.869Z · LW · GW

I think that you're correct that Anthropic at least heavily implied that they weren't going to "meaningfully advance" the frontier (even if they have not made any explicit commitments about this).  I'd be interested in hearing when Dustin had this conversation w/ Dario -- was it pre or post RSP release?

And as far as I know, the only commitments they've made explicitly are in their RSP, which commits to limiting their ability to scale to the rate at which they can advance and deploy safety measures. It's unclear if the "sufficient safety measures" limitation is the only restriction on scaling, but I would be surprised if anyone senior at Anthropic was willing to make a concrete unilateral commitment to stay behind the curve. 

My current story, based on public info, is that up until mid-2022 there was indeed an intention to stay at the frontier but not push it forward significantly. This changed sometime in late 2022/early 2023, maybe after ChatGPT was released and the AGI race became somewhat "hot". 

Comment by LawrenceC (LawChan) on Many arguments for AI x-risk are wrong · 2024-03-06T01:31:24.516Z · LW · GW

He'd've probably been surprised to see people just... using it for stuff like DoTA2 on fully-differentiable BPTT RNNs. I wonder if he's ever done any interviews on DL recently? AFAIK he's still alive.

Sadly, Williams passed away this February: https://www.currentobituary.com/member/obit/282438

Comment by LawrenceC (LawChan) on Many arguments for AI x-risk are wrong · 2024-03-06T01:22:38.498Z · LW · GW

I wasn't around in the community in 2010-2015, so I don't know what the state of RL knowledge was at that time. However, I dispute the claim that rationalists "completely miss[ed] this [..] interpretation":

To be honest, it was a major blackpill for me to see the rationalist community, whose whole whole founding premise was that they were supposed to be good at making efficient use of the available evidence, so completely missing this very straightforward interpretation of RL [..] the mechanistic function of per-trajectory rewards in a given batched update was to provide the weights of a linear combination of the trajectory gradients.

Ever since I entered the community, I've definitely heard of people talking about policy gradient as "upweighting trajectories with positive reward/downweighting trajectories with negative reward" since 2016, albeit in person. I remember being shown a picture sometime in 2016/17 that looks something like this when someone (maybe Paul?) was explaining REINFORCE to me: (I couldn't find it, so reconstructing it from memory)

In addition, I would be surprised if any of the CHAI PhD students when I was at CHAI from 2017->2021, many of whom have taken deep RL classes at Berkeley, missed this "upweight trajectories in proportion to their reward" interpretation. Most of us at the time had also implemented various RL algorithms from scratch, and there the "weighting trajectory gradients" perspective pops out immediately. 
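
For concreteness, here is a minimal sketch of that perspective (mine, not the picture or slides mentioned in this comment; the environment and reward are made up): with a one-state softmax policy, the batched REINFORCE update is literally a linear combination of per-trajectory grad-log-prob vectors, weighted by the (baseline-subtracted) returns.

import numpy as np

rng = np.random.default_rng(0)
n_actions, horizon, batch = 3, 5, 16
theta = np.zeros(n_actions)                      # logits of a single-state softmax policy

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_trajectory():
    pi = softmax(theta)
    actions = rng.choice(n_actions, size=horizon, p=pi)
    reward = float((actions == 0).sum())         # toy reward: how often action 0 was taken
    # grad_theta log p(tau) = sum_t (one_hot(a_t) - pi) for a softmax policy
    grad_logp = sum(np.eye(n_actions)[a] - pi for a in actions)
    return reward, grad_logp

rewards, grads = zip(*[sample_trajectory() for _ in range(batch)])
weights = np.array(rewards) - np.mean(rewards)   # baseline-subtracted returns are the weights
policy_gradient = sum(w * g for w, g in zip(weights, grads)) / batch
theta += 0.1 * policy_gradient                   # upweight high-reward trajectories, downweight the rest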

As another data point, when I taught MLAB/WMLB in 2022/3, my slides also contained this interpretation of REINFORCE (after deriving it) in so many words:

Insofar as people are making mistakes about reward and RL, it's not due to having never been exposed to this perspective.


That being said, I do agree that there's been substantial confusion in this community, mainly of two kinds:

  • Confusing the objective function being optimized to train a policy with how the policy is mechanistically implemented: Just because the outer loop is modifying/selecting for a policy to score highly on some objective function, does not necessarily mean that the resulting policy will end up selecting actions based on said objective. 
  • Confusing "this policy is optimized for X" with "this policy is optimal for X": this is the actual mistake I think Bostom is making in Alex's example -- it's true that an agent that wireheads achieves higher reward than on the training distribution (and the optimal agent for the reward achieves reward at least as good as wireheading). And I think that Alex and you would also agree with me that it's sometimes valuable to reason about the global optima in policy space. But it's a mistake to identify the outputs of optimization with the optimal solution to an optimization problem, and many people were making this jump without noticing it. 

Again, I contend these confusions were not due to a lack of exposure to the "rewards as weighting trajectories" perspective. Instead, the reasons I remember hearing back in 2017-2018 for why we should jump from "RL is optimizing agents for X" to "RL outputs agents that both optimize X and are optimal for X":

  • We'd be really confused if we couldn't reason about "optimal" agents, so we should solve that first. This is the main justification I heard from the MIRI people about why they studied idealized agents. Oftentimes globally optimal solutions are easier to reason about than local optima or saddle points, or are useful for operationalizing concepts. Because a lot of the community was focused on philosophical deconfusion (often w/ minimal knowledge of ML or RL), many people naturally came to jump the gap between "the thing we're studying" and "the thing we care about". 
  • Reasoning about optima gives a better picture of powerful, future AGIs. Insofar as we're far from transformative AI, you might expect that current AIs are a poor model for how transformative AI will look. In particular, you might expect that modeling transformative AI as optimal leads to clearer reasoning than analogizing them to current systems. This point has become increasingly tenuous since GPT-2, but 
  • Some off-policy RL algorithms are well described as having a "reward" maximizing component: And, these were the approaches that people were using and thinking about at the time. For example, the most hyped results in deep learning in the mid 2010s were probably DQN and AlphaGo/AlphaGo Zero/AlphaZero. And many people believed that future AIs would be implemented via model-based RL. All of these approaches result in policies that contain an internal component which is searching for actions that maximize some learned objective. Given that ~everyone uses policy gradient variants for RL on SOTA LLMs, this does turn out to be incorrect ex post. But if the most impressive AIs seem to be implemented in ways that correspond to internal reward maximization, it does seem very understandable to think about AGIs as explicit reward optimizers. 
  • This is how many RL pioneers reasoned about their algorithms. I agree with Alex that this is probably from the control theory routes, where a PID controller is well modeled as picking trajectories that minimize cost, in a way that early simple RL policies are not well modeled as internally picking trajectories that maximize reward.  

Also, sometimes it is just the words being similar; it can be hard to keep track of the differences between "optimizing for", "optimized for", and "optimal for" in normal conversation. 

I think if you want to prevent the community from repeating these confusions, this looks less like "here's an alternative perspective through which you can view policy gradient" and more "here's why reasoning about AGI as 'optimal' agents is misleading" and "here's why reasoning about your 1 hidden layer neural network policy as if it were optimizing the reward is bad". 


An aside:

In general, I think that many ML-knowledgeable people (arguably myself included) correctly notice that the community is making many mistakes in reasoning, that they resolve internally using ML terminology or frames from the ML literature. But without reasoning carefully about the problem, the terminology or frames themselves are insufficient to resolve the confusion. (Notice how many Deep RL people make the same mistake!) And, as Alex and you have argued before, the standard ML frames and terminology introduce their own confusions (e.g. 'attention').

A shallow understanding of "policy gradient is just upweighting trajectories" may in fact lead to making the opposite mistake: assuming that it can never lead to intelligent, optimizer-y behavior. (Again, notice how many ML academics made exactly this mistake) Or, more broadly, thinking about ML algorithms purely from the low-level, mechanistic frame can lead to confusions along the lines of "next token prediction can only lead to statistical parrots without true intelligence". Doubly so if you've only worked with policy gradient or language modeling with tiny models. 

Comment by LawrenceC (LawChan) on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-06T00:44:18.226Z · LW · GW

Thanks!

Comment by LawrenceC (LawChan) on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-05T23:31:31.323Z · LW · GW

After having spent a few hours playing with Opus, I think "slightly better than best public gpt-4" seems qualitatively correct -- both models tend to get tripped up on the same kinds of tasks, but Opus can inconsistently solve some tasks in my workflow that gpt-4 cannot. 

And yeah, it seems likely that I will also swap to Claude over ChatGPT. 

Comment by LawrenceC (LawChan) on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-05T19:31:16.851Z · LW · GW

Thanks for uploading your interp and training code!

Could you upload your model and/or datasets somewhere as well, for reproducibility? (i.e. your datasets folder containing the datasets:)

def recognized_dataset():
    mode_lookups={
        "gpt_train":        ["datasets/othello_gpt_training_corpus.txt",        OthelloDataset,         {}],
        "gpt_train_small":  ["datasets/small_othello_gpt_training_corpus.txt",  OthelloDataset,         {}],
        "gpt_test":         ["datasets/othello_gpt_test_corpus.txt",            OthelloDataset,         {}],
        "sae_train":        ["datasets/sae_training_corpus.txt",                OthelloDataset,         {}],
        "probe_train":      ["datasets/probe_train_corpus.txt",                 LabelledOthelloDataset, {}],
        "probe_train_bw":   ["datasets/probe_train_corpus.txt",                 LabelledOthelloDataset, {"use_ally_enemy":False}],
        "probe_train_small":["datasets/small_probe_training_corpus.txt",        LabelledOthelloDataset, {}],
        "probe_test":       ["datasets/probe_test_corpus.txt",                  LabelledOthelloDataset, {}],
    }
    return mode_lookups

Agree that its worth experimenting with R, but the only other hyperparameter is the sparsity coefficient alpha, and I found that alpha had to be in a narrow range or the training would collapse to "all variance is unexplained" or "no active features".

Yeah, the main hyperparameters are the expansion factor and "what optimization algorithm do you use/what hyperparameters do you use for the optimization algorithm". 

Comment by LawrenceC (LawChan) on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-05T18:09:22.383Z · LW · GW

Thanks for doing this -- could you share your code? 

While I put only a medium probability that the current SAE algorithm works to recover all the features, my main concerns with the work are due to the quality of the model and the natural "features" not being board positions. 

I'd be interested in running the code on the model used by Li et al, which he's hosted on Google drive:

https://drive.google.com/drive/folders/1bpnwJnccpr9W-N_hzXSm59hT7Lij4HxZ

Also, in addition to the future work you list, I'd be interested in running the SAEs with much larger Rs and with alternative hyperparameter selection criteria.

Comment by LawrenceC (LawChan) on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-05T18:03:11.280Z · LW · GW

Thanks for doing this!

Comment by LawrenceC (LawChan) on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-05T17:11:32.970Z · LW · GW

They indeed did not advance the frontier with this launch (at least not meaningfully, possibly not at all). But "meaningfully advance the frontier" is quite different from both "stay on the frontier" or "slightly push the envelope while creating marketing hype", which is what I think is going on here?

Comment by LawrenceC (LawChan) on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-05T02:45:57.632Z · LW · GW

which case they’ve misled people by suggesting that they would not do this. 

Neither of your examples seem super misleading to me. I feel like there was some atmosphere of "Anthropic intends to stay behind the frontier" when the actual statements were closer to "stay on the frontier". 

Also worth noting that Claude 3 does not substantially advance the LLM capabilities frontier! Aside from GPQA, it doesn't do that much better on benchmarks than GPT-4 (and in fact does worse than gpt-4-1106-preview). Releasing models that are comparable to models OpenAI released a year ago seems compatible with "staying behind the frontier", given OpenAI has continued its scale up and will no doubt soon release even more capable models. 

That being said, I agree that Anthropic did benefit in the EA community by having this impression. So compared to the impression many EAs got from Anthropic, this is indeed a different stance. 

In any case, whether or not Claude 3 already surpasses the frontier, soon will, or doesn’t, I request that Anthropic explicitly clarify whether their intention is to push the frontier.

As Evan says, I think they clarified their intentions in their RSP: https://www.anthropic.com/news/anthropics-responsible-scaling-policy 

The main (only?) limit on scaling is their ability to implement containment/safety measures for ever more dangerous models. E.g.:

That is, they won't go faster than they can scale up safety procedures, but they're otherwise fine pushing the frontier.

It's worth noting that their ASL-3 commitments seem pretty likely to trigger in the next few years, and probably will be substantially difficult to meet:
 

Comment by LawrenceC (LawChan) on Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" · 2024-03-03T02:07:17.695Z · LW · GW

If the claim is that it's easier to learn a covering set for a "true" harm predicate and then act conservatively wrt the set than to learn a single harm predicate, this is not a new approach. E.g. just from CHAI:

  1. The Inverse Reward Design paper, which tries to straightforwardly implement the posterior P(True Reward|Human specification of Reward) and act conservatively wrt this posterior. 
  2. The Learning preferences from state of the world paper does this for P(True Reward | Initial state of the environment) and also acts conservatively wrt this posterior.
  3. [A bunch of other papers which consider this for P(Reward | Another Source of Evidence), including randomly sampled rewards and human off-switch pressing. Also the CIRL paper, which proposes using uncertainty to directly solve the meta-problem of "the thing this uncertainty is for".]
  4. It's discussed as a strategy in Stuart's Human Compatible, though I don't have a copy to reference the exact page number.

I also remember finding Rohin's Reward Uncertainty to be a good summary of ~2018 CHAI thinking on this topic. There's also a lot more academic work in this vein from other research groups/universities too. 

The reason I'm not excited about this work is that (as Ryan and Fabien say) correctly specifying this distribution without solving ELK also seems really hard. 

It's clear that if you allow H to be "all possible harm predicates", then an AI that acts conservatively wrt to this is safe, but it's also going to be completely useless. Specifying the task of learning a good enough harm predicate distribution that both covers the "true" harms and also allows your AI to do things is quite hard, and subject to various kinds of terrible misspecification problems that seem not much easier to deal with than the case where you just try to learn a single harm predicate. 


Solving this task (that is, solving the task spec of learning this harm predicate posterior) via probabilistic inference also seems really hard from the "Bayesian ML is hard" perspective.

Ironically, the state of the art for this when I left CHAI in 2021 was "ask the most capable model (an LLM) to estimate the uncertainty for you in one go" and "ensemble the point estimates of very few but quite capable models" (that is, ensembling, but common numbers were in the single digit range, e.g. 4). These seemed to outperform even the "learn a generative/contrastive model to get features, and then learn a bayesian logistic regression on top of it" approaches. (Anyone who's currently at CHAI should feel free to correct me here.)

I think(?) the way Bengio's approach tries to avoid these difficulties is by trying to solve Bayesian ML. I'm not super confident that he'll do better than "finetune an LLM to help you do it", which is presumably what we'd be doing anyways? 

(That being said, my main objections are akin to the ontology misidentification problem in the ELK paper or Ryan's comments above.)

Comment by LawrenceC (LawChan) on Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" · 2024-03-03T02:05:10.884Z · LW · GW

For the most simple case, consider learning a linear probe on embeddings with Bayesian ML. This is totally computationally doable. (It's just Bayesian Logistic Regression).

IIRC Adam Gleave tried this in summer of 2021 with one of Chinchilla/Gopher while he was interning at DeepMind, and this did not improve on ensembling for the tasks he considered. 

Comment by LawrenceC (LawChan) on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders · 2024-02-27T06:20:49.840Z · LW · GW

Thanks for doing this work -- I'm really happy people are doing the basic stress testing of SAEs, and I agree that this is important and urgent given the sheer amount of resources being invested into SAE research. 

For me, this was actually a positive update that SAEs are pretty good on distribution -- you trained the SAE on length-128 sequences from OpenWebText, and the log loss was quite low up to ~200 tokens! This is despite its poor downstream use case performance.

I expected to see more negative results along the lines of your Lambada and Children's Book test results (that is, substantial degradation of loss, as soon as you go a tiny bit off distribution): 

I do think these results add on to the growing pile of evidence that SAEs are not good "off distribution" (even a small amount off distribution, as in Sam Marks's results you link). This means they're somewhat problematic for OOD use cases like treacherous turn detection or detecting misgeneralization. That doesn't mean they're useless -- e.g. it's plausible that SAEs could be useful for steering, mechanistic anomaly detection, or helping us do case analysis for heuristic arguments or proofs. 


As an aside, am I reading this plot incorrectly, or does the figure on the right suggest that SAE reconstructed representations have lower log loss than the original unmodified model?

Comment by LawrenceC (LawChan) on Dreams of AI alignment: The danger of suggestive names · 2024-02-13T06:35:20.766Z · LW · GW

I broadly agree that a lot of discussion about AI x-risk is confused due to the use of suggestive terms. Of the ones you've listed, I would nominate "optimizer", "mesa optimization", "(LLMs as) simulators", "(LLMs as) agents", and "utility" as probably the most problematic. I would also add "deception/deceptive alignment", "subagent", "shard", "myopic", and "goal". (It's not a coincidence that so many of these terms seem to be related to notions of agency or subcomponents of agents; this seems to be the main place where sloppy reasoning can slide in.) 

I also agree that I've encountered a lot of people who confidently predict Doom on the basis of subtle word games.

However, I also agree with Ryan's comment that these confusions seem much less common when we get to actual senior AIS researchers or people who've worked significantly with real models. (My guess is that Alex would disagree with me on this.) Most conversations I've been in that used these confused terms tended to involve MATS fellows or other very junior people (I don't interact with other more junior people much, unfortunately, so I'm not sure.) I've also had several conversations with people who seemed relieved at how reasonable and not confused the relevant researchers have been (e.g. with Alexander Gietelink-Oldenziel). 

I suspect that a lot of the confusions stem from the way that the majority of recruitment/community building is conducted -- namely, by very junior people recruiting even more junior people (e.g. via student groups). Not only is there only a very limited amount of communication bandwidth available to communicate with potential new recruits (and therefore encouraging more arguments by analogy or via suggestive words), the people doing the communication are also likely to use a lot of these suggestive terms themselves (in large part because they're very junior, and likely not technical researchers).[1] There are also historical reasons why this is the case: a lot of early EA/AIS people were philosophers, and so presented detailed philosophical arguments (often routing through longtermism) about specific AI doom scenarios that in turn suffered lossy compression during communication, as opposed to more robust general arguments (e.g. Ryan Greenblatt's example of "holy shit AI (and maybe the singularity), that might be a really big deal").[2] 

Similarly, on LessWrong, I suspect that the majority of commenters are not people who have deeply engaged with a lot of the academic ML literature or have spent significant time doing AIS or even technical ML work. 

And I'd also point a finger at a lot of the communication from MIRI in particular as the cause for these confusions, e.g. the "sharp left-turn" concept seems to be primarily communicated via metaphor and cryptic sayings, while their communications about Reward Learning and Human Values seem in retrospect to have at least been misleading if not fundamentally confused. I suspect that the relevant people involved have much better models, but I think this did not come through in their communication. 


I'm not super sure what to do about it; the problem of suggestive names (or in general, of smuggling connotations into technical work) is not a unique one to this community, nor is it one that can be fixed with reading a single article or two (as your post emphasizes). I'd even argue this community does better than a large fraction of academics (even ML academics). 

John mentioned using specific, concrete examples as a way to check your concepts. If we're quoting old rationalist foundation texts, then the relevant example from "Surely You're Joking, Mr. Feynman" is relevant:

"I had a scheme, which I still use today when somebody is explaining something that I’m trying to understand: I keep making up examples. For instance, the mathematicians would come in with a terrific theorem, and they’re all excited. As they’re telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball) – disjoint (two balls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on. Finally they state the theorem, which is some dumb thing about the ball which isn’t true for my hairy green ball thing, so I say, ‘False!’"

Unfortunately, in my experience, general instructions of the form "create concrete examples when listening to a chain of reasoning involving suggestive terms" do not seem to work very well, even if examples of doing so are provided, so I'm not sure there's a scalable solution here. 

My preferred approach is to give the reader concrete examples to chew on as early as possible, but this runs into the failure mode of contingent facts about the example being taken as a general point (or even worse, the failure mode where the reader assumes that the concrete case is the general point being made). I'd consider mathematical equations (even if they are only toy examples) to be helpful as well, assuming you strip away the suggestive terms and focus only on the syntax/semantics. But I find that I also have a lot of difficulty getting other people to create examples I'd actually consider concrete. Frustratingly, many "concrete" examples I see smuggle in even more suggestive terms or connotations, and sometimes even fail to capture any of the semantics of the original idea. 

So in the end, maybe I have nothing better than to repeat Alex's advice at the end of the post:

All to say: Do not trust this community's concepts and memes, if you have the time. Do not trust me, if you have the time. Verify.

At the end of the day, while saying "just be better" does not serve as actionable advice, there might not be an easier answer. 

  1. ^

    To be clear, I think that many student organizers and community builders in general do excellent work that is often incredibly underappreciated. I'm making a specific claim about the immediate causal reasons for why this is happening, and not assigning fault. I don't see an easy way for community builders to do better, short of abandoning specialization and requiring everyone to be a generalist who also does technical AIS work. 

  2. ^

    That being said, I think that it's worth trying to make detailed arguments concretizing general concerns, in large part to make sure that the case for AI x-risk doesn't "come down to a set of subtle word games". (e.g. I like Ajeya's doom story.) After all, it's worth concretizing a general concern, and making sure that any concrete instantiations of the concern are possible. I just think that detailed arguments (where the details matter) often get compressed in ways that end up depending on suggestive names, especially in cases with limited communication bandwidth.

Comment by LawrenceC (LawChan) on Making every researcher seek grants is a broken model · 2024-01-27T23:15:27.082Z · LW · GW

Thanks for posting this, I agree with the overall take that a block model is a superior alternative. I think some people in the Bay Area have idly looked into this for AIS funding; I was considering doing this myself but unfortunately had other obligations. 

Comment by LawrenceC (LawChan) on Toward A Mathematical Framework for Computation in Superposition · 2024-01-19T19:55:24.729Z · LW · GW

Fascinating, thanks for the update!

Comment by LawrenceC (LawChan) on Toward A Mathematical Framework for Computation in Superposition · 2024-01-19T01:04:30.426Z · LW · GW

Also, here's a summary I posted in my lab notes:

A few researchers (at Apollo, Cadenza, and IHES) posted this document today (22k words, LW says ~88 minutes).

They propose two toy models of computation in superposition. 

First, they posit an MLP setting where a single-layer MLP is used to compute the pairwise ANDs of m boolean input variables up to epsilon-accuracy, where the input is sparse (in the sense that l < m are active at once). Notably, in this setup, instead of using O(m^2) neurons to represent each pair of inputs, you can instead use O(polylog(m)) neurons with random inputs, and “read off” the ANDs by adding together all neurons that contain the pair of inputs. They also show that you can extend this to cases where the inputs themselves are in superposition, though you need O(sqrt(m)) neurons. (Also, insofar as real neural networks implement tricks like this, this probably incidentally answers Sam Marks's XOR puzzle.)
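
Here is a rough toy version of the idea as I understand it (my own numbers, thresholds, and readout, not the authors' exact construction): each neuron reads a random subset of the inputs and computes ReLU(sum - 1), and the AND of a pair (i, j) is read off by averaging the neurons connected to both. With sparse inputs the readout is close to 1 when both inputs are on and close to 0 otherwise, using far fewer than m^2 neurons.

import numpy as np

rng = np.random.default_rng(0)
m, n, p = 100, 4000, 0.05                     # inputs, neurons, connection probability (arbitrary)
W = (rng.random((n, m)) < p).astype(float)    # neuron k reads input i iff W[k, i] == 1

def neuron_acts(x):
    return np.maximum(W @ x - 1.0, 0.0)       # ReLU(number of connected active inputs - 1)

def read_and(acts, i, j):
    mask = (W[:, i] > 0) & (W[:, j] > 0)      # neurons connected to both i and j
    return acts[mask].mean()

x = np.zeros(m)
x[[3, 17, 42]] = 1.0                          # sparse input: only 3 of 100 inputs active
acts = neuron_acts(x)
print(read_and(acts, 3, 17))                  # both active  -> close to 1 (plus small interference)
print(read_and(acts, 3, 5))                   # one inactive -> close to 0 (only interference noise)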

They then consider a setting involving the QK matrix of an attention head, where the task is to attend to a pair of activations in a transformer, where the first activation contains feature i and the second feature j. While the naive construction can only check for d_head bigrams, they provide a construction involving superposition that allows the QK matrix to approximately check for Theta(d_head * d_residual) bigrams (that is, up to ~parameter count; this involves placing the input features in superposition).

If I’m understanding it correctly, these seem like pretty cool constructions, and certainly a massive step up from what the toy models of superposition looked like in the past. In particular, these constructions do not depend on human notions of what a natural “feature” is. In fact, here the dimensions in the MLP are just sums of random subsets of the input; no additional structure needed. Basically, what it shows is that for circuit size reasons, we’re going to get superposition just to get more computation out of the network.

Comment by LawrenceC (LawChan) on Toward A Mathematical Framework for Computation in Superposition · 2024-01-18T23:27:41.840Z · LW · GW

(I haven't had the chance to read part 3 in detail, and I also haven't checked the proofs except insofar as they seem reasonable on first viewing. Will probably have a lot more thoughts after I've had more time to digest.)

This is very cool work! I like the choice of U-AND task, which seems way more amenable to theoretical study (and is also a much more interesting task) than the absolute value task studied in Anthropic's Toy Model of Superposition (hereafter TMS). It's also nice to study this toy task with asymptotic theoretical analysis as opposed to the standard empirical analysis, thereby allowing you to use a different set of tools than usual. 

The most interesting part of the results was the discussion on the universality of universal calculation -- it reminds me of the interpretations of the lottery ticket hypothesis that claim some parts of the network happen to be randomly initialized to have useful features at the start. 

Some examples that are likely to be boolean-interpretable are bigram-finding circuits and induction heads. However, it's possible that most computations are continuous rather than boolean[31].

My guess is that most computations are indeed closer to continuous than to boolean. While it's possible to construct boolean interpretations of bigram circuits or induction heads, my impression (having not looked at either in detail on real models) is that neither of these cleanly occur inside LMs. For example, induction heads demonstrate a wide variety of other behavior, and even on induction-like tasks, often seem to be implementing induction heuristics that involve some degree of semantic content. 

Consequently, I'd be especially interested in exploring either the universality of universal calculation, or the extension to arithmetic circuits (or other continuous/more continuous models of computation in superposition).


Some nitpicks:

The post would probably be a lot more readable if it were chunked into 4 posts. The 88 minute read time is pretty scary, and I'd like to comment only on the parts I've read. 

Section 2:

Two reasons why this loss function might be principled are

  1. If there is reason to think of the model as a Gaussian probability model
  2. If we would like our loss function to be basis independent

A big reason to use MSE as opposed to eps-accuracy in the Anthropic model is for optimization purposes (you can't gradient descent cleanly through eps-accuracy). 

 

Section 5:

4 How relevant are our results to real models?

This should be labeled as section 5. 

 

Appendix to the Appendix:

Here, $f_i$ always denotes the vector.

[..]

with \[\sigma_1\leq n\) with

(TeX compilation failure)

Comment by LawrenceC (LawChan) on Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small · 2024-01-17T19:35:58.740Z · LW · GW

As with the CCS post, I'm reviewing both the paper and the post, though the majority of the review is on the paper. Writing this quickly (total time on review: ~1.5h), but I expect to be willing to defend the points being made --

There's a lot of reasons I like the work. It's an example of:

  1. Actually poking inside a real model. A lot of the mech interp work in early-mid 2022 was focused on getting a deep understanding of toy models trained on algorithmic tasks (at least in this community).[1] There was some effort at Redwood to do neuron-by-neuron replacement, and Nix completed his work on the parentheses balancer concurrently to the IOI results, but insofar as there was mech interp work being done, most of it was on simple models such as the ones featured in Toy Models of Superposition or Modular Arithmetic Grokking (with the primary exception being the Induction Head results from Anthropic, which are substantively weaker outside of very small transformers). 

    This work was one of the first attempts to explain a particular, nontrivial behavior inside of a small but real LM (GPT-2-small). 
  2. Demonstrating the feasibility of patching and circuit-based analysis on language models. I think it's notable that this work doesn't just mechanistically study behavior inside of a language model; it finds a circuit (a small subgraph) implementing the behavior. This is valuable both as a confirmation that patching could be used to find circuits in "real" models and as evidence that we can find these circuits at all. In turn, this has led to a veritable explosion of "poking LLMs with various kinds of patching/scrubbing to identify subgraphs of particular behaviors", which I think has been pretty valuable on net. 

    Also, as Neel says below, it's important for pedagogical reasons. 
  3. Field-building via example. As with the Modular Arithmetic work by Neel, this was published in ICLR '23 as the joint first mech interp work to be published in a top conference. This helped build a substantial amount of legitimacy and academic interest for the field of mech interp (and broadly, ai x-risk flavored interp in general). 
  4. Demonstrating failure modes and limitations of mech interp techniques. As stated in this post, an earlier version of this work used mean ablation in a way that preserved "information that helped compute the task", which incorrectly suggested that parts of the circuit were unimportant for performance. It's a concrete example of why it's important to think about what exactly you're ablating, and how your ablation serves as a valid test of your hypothesis. 

    This work also directly inspired Causal Scrubbing, which was an attempt to more completely remove information that helps complete a task.
  5. Validating interp via adversarial inputs. I appreciate the use of adversarial example discovery as a downstream use case of the interp. 

But there's also some reservations I have:

  1. Some of the presentation was misleading. Originally, the paper defined the IOI task as something along the lines of:
    '... sentences like “When John and Mary went to the store, John gave a drink to” should be completed with “Mary”.'
    That is, it did not make it clear that IOI was about assigning a higher logit to "Mary" than to "John", and not about assigning an (absolutely) high logit to "Mary". IIRC, this was only clarified near the end thanks to the effort of one of the critical ICLR reviewers.[2] There were also other strong claims that were significantly ameliorated by the ICLR review process.[3]
  2. The circuit is likely overfit to the metric. I think that the mean logit difference is indeed the correct metric to look at, both because of how the task was defined and also for many use cases in general.[4] However, it's worth noting that this circuit does not hold up well if we replace the mean logit difference with other superficially similar metrics, e.g. the mean absolute logit difference (i.e. E[|logit diff_model - logit diff_subgraph|]); see the toy sketch after this list. 
  3. The circuit is likely incomplete. Running Causal Scrubbing on the hypothesis suggests that it is importantly incomplete, see for example Alexandre's comment below. The incompleteness of the circuit also suggests some limitations of node-based causal interventions (i.e. activation patching in this case), as previously discussed. That being said, this wasn't something that could've really been known when the experiments were being done for this paper, as Causal Scrubbing was inspired by these results (and thus could not have been used to generate them). 
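To illustrate the metric point in (2), here's a toy sketch with made-up numbers (not from the paper) of how a circuit can match the mean logit difference while looking much worse on the mean absolute logit difference:

```python
import numpy as np

# logit_diff = logit("Mary") - logit("John") per prompt, for the full model
# and for the proposed circuit (everything else ablated). Numbers are invented.
rng = np.random.default_rng(0)
model_ld = rng.normal(3.0, 1.0, size=1000)
circuit_ld = 3.0 + rng.normal(0.0, 2.0, size=1000)  # matches the mean, but noisy per-example

mean_diff = abs(model_ld.mean() - circuit_ld.mean())  # small -> circuit looks faithful
mean_abs_diff = np.abs(model_ld - circuit_ld).mean()  # large -> much less flattering
print(mean_diff, mean_abs_diff)
```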

And there are two big points that I'm very, very torn on (it's less to do with the work itself than with general approaches to/issues with mech interp):

  • Using an algorithmic task (IOI). As this post says, it's an example of "streetlight interpretability" -- looking at cases that are easy, as opposed to ones that are useful or realistic. I think it's valuable to do some amount of streetlight interpretability, and it's especially understandable in the case of this work (as one of the earliest mech interp pieces), but I do think that this is a weakness of the work. I also think the fact that seminal works in mech interp used algorithmic tasks may have contributed to a lack of attention paid to soft heuristics/memorization/n-gram statistic-style behavior inside of models, which I think are quite neglected. 
  • Low percent performance recovered. While the headline numbers for completeness/faithfulness are pretty high in terms of percentage, this actually is quite bad in terms of downstream performance. This isn't specific to this work. But, to use causal scrubbing as an example, if random performance on a task is 10 nats of log loss and the model's performance is 2.1 nats, recovering up to 2.6 nats might give the impressive number of 93.7% loss recovered (see the arithmetic sketch after this list). But in practice, 2.6 nats might be the performance of a model 1/100 or 1/1000 the size of the model we're trying to explain. If the behavior you're trying to explain is present in the most capable models but not in models a generation or two back, this work does not provide significant evidence that explaining such behavior is possible. Again, this isn't specific to this work, but applies to circuit-style mech interp on real models in general. 
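The arithmetic behind the hypothetical numbers above (causal-scrubbing-style "% loss recovered"):

```python
# All three numbers are the hypothetical ones from the bullet above.
random_loss, model_loss, scrubbed_loss = 10.0, 2.1, 2.6
recovered = (random_loss - scrubbed_loss) / (random_loss - model_loss)
print(f"{recovered:.1%}")  # ~93.7%, despite 2.6 nats being far worse in practice
```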

I think the post itself is pretty good though not exceptional -- I appreciate the explanation of how the task and approach were chosen, as well as the key takeaway that causal interventions can be powerful for mech interp, if they are performed appropriately, but doing them appropriately is challenging.

All said, I'm giving this a 4 on the annual review.

  1. ^

    Note that there was plenty of non-mechanistic interp work that looked at real models and tasks; in fact, the majority of interp has always been on non-toy models and tasks. But mech interp was focused on toy tasks.

  2. ^

    I helped out with rebuttals on this paper, and was honestly impressed by the two critical reviews posted by reviewer jy1a (official review, response to author rebuttal), who, among other (imo correct) issues, correctly pointed out that the post was using this incorrect definition of IOI. Notice also how in the rebuttal response, they also point out the issue of using mean logit difference versus mean absolute logit difference. I think that (alongside the RR AT paper) this was one of the reasons I updated to be more in favor of the existing academic peer review system. 

  3. ^

    See e.g. this comment from the Program Chairs:

    The major concerns from the reviewers are the current limited limitation section and a few not-well supported/overstated claims in the paper. Request to the authors: Please update the paper to have a stronger and more critical limitation discussion, as well as substantially change the writing to justify all claims/assumptions (or not to overstate claims) in order to reflect reviewers’ comments.

  4. ^

    The main reason is that we don't really care about 'noise' when explaining good performance, e.g. from the Causal Scrubbing appendix:

    Suppose that one of the drivers of the model’s behavior is noise: trying to capture the full distribution would require us to explain what causes the noise. For example, you’d have to explain the behavior of a randomly initialized model despite the model doing ‘nothing interesting’.

    That being said, this claim depends greatly on the implied downstream use case of interp. E.g. if the goal is to understand failure modes, then explaining just the success is insufficient. 

Comment by LawrenceC (LawChan) on How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · 2024-01-15T17:44:56.077Z · LW · GW

I agree that people dramatically overrated the empirical results of this work, but not more so than other pieces that "went viral" in this community. I'd be excited to see your takes on this general phenomenon as well as how we might address it in the future.

Comment by LawrenceC (LawChan) on How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · 2024-01-15T07:56:58.229Z · LW · GW

This is a review of both the paper and the post itself, and turned more into a review of the paper (on which I think I have more to say) as opposed to the post. 

Disclaimer: this isn’t actually my area of expertise inside of technical alignment, and I’ve done very little linear probing myself. I’m relying primarily on my understanding of others’ results, so there’s some chance I’ve misunderstood something. Total amount of work on this review: ~8 hours, though about 4 of those were refreshing my memory of prior work and rereading the paper. 

TL;DR: The paper made significant contributions by introducing the idea of unsupervised knowledge discovery to a broader audience and by demonstrating that relatively straightforward techniques may make substantial progress on this problem. Compared to the paper, the blog post is substantially more nuanced, and I think that more academic-leaning AIS researchers should also publish companion blog posts of this kind. Collin Burns also deserves a lot of credit for actually doing empirical work in this domain when others were skeptical. However, the results are somewhat overstated and, with the benefit of hindsight, (vanilla) CCS does not seem to be a particularly promising technique for eliciting knowledge from language models. That being said, I encourage work in this area.[1]

Introduction/Overview

The paper “Discovering Latent Knowledge in Language Models without Supervision” by Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt (henceforth referred to as “the CCS paper” for short) proposes a method for unsupervised knowledge discovery, which can be thought of as a variant of empirical, average-case Eliciting Latent Knowledge (ELK). In this companion blog post, Collin Burns discusses the motivations behind the paper, caveats some of the limitations of the paper, and provides some reasons for why this style of unsupervised methods may scale to future language models. 

The CCS paper kicked off a lot of waves in the alignment community when it came out. The OpenAI Alignment team was very excited about the paper. Eliezer Yudkowsky even called it “Very Dignified Work!”. There’s also been a significant amount of followup work that discusses or builds on CCS, e.g. these Alignment Forum Posts:

As well as the following papers:[2]

So it seems a pity that no one has provided a review for this post. This is my attempt to fix that. 

Unfortunately, this review has ballooned into a much longer post. To make it more digestible, I’ve divided it into sections:

  1. The CCS paper and follow-up work
  2. The post itself

Overall, I give this post a high but not maximally high rating. I think that the paper made significant contributions, albeit with important caveats and limitations. While I also have some quibbles with the post, I think the post does a good job of situating the paper and the general research direction in an alignment scheme. Collin Burns also deserves significant credit for pioneering the research direction in general; many people at the time (including myself) were decently surprised by the positive results. 

The Paper and Follow-up Work

The headline method, Contrast-Consistent Search (CCS), learns a linear probe[3] that predicts the probabilities of a binary label, without supervised data.[4] CCS does this by first generating "contrast pairs" of statements with positive and negative answers,[5] and then maximizing the consistency and confidence of the probabilities for each pair. It then combines the predictions on the positive/negative answers to get a number that corresponds to either the probability of the "true", correct answer or the "false", incorrect answer. To evaluate this method, the authors consider whether assigning this classifier to "true" or "false" leads to higher test accuracy, and then pick the higher-accuracy assignment.[6] They show that this lets them outperform directly querying the model by ~4% on average across a selection of datasets, and can work in cases where the prompt is misleading. 
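For concreteness, here is a minimal sketch of the CCS objective as I understand it from the paper (prompt construction, normalization, and training-loop details omitted; the tensor shapes and names below are my own):

```python
import torch

def ccs_loss(probe, pos_acts, neg_acts):
    # pos_acts / neg_acts: [n, d] hidden states for the two halves of each
    # contrast pair (after mean-normalization, which is omitted here).
    p_pos = torch.sigmoid(probe(pos_acts)).squeeze(-1)
    p_neg = torch.sigmoid(probe(neg_acts)).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2        # p(x+) should equal 1 - p(x-)
    confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage the degenerate p+ = p- = 0.5
    return (consistency + confidence).mean()

d_model = 512
probe = torch.nn.Linear(d_model, 1)    # a direction plus a bias term
pos_acts = torch.randn(100, d_model)   # stand-ins for real contrast-pair activations
neg_acts = torch.randn(100, d_model)
ccs_loss(probe, pos_acts, neg_acts).backward()

# At inference time, the probability of the "correct" answer is taken to be roughly
# (p(x+) + (1 - p(x-))) / 2; whether high values mean "true" or "false" is then
# resolved with labeled data, as discussed below.
```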

Important takeaways

Here’s some of my thoughts on some useful updates I made as a result of this paper. You can also think of this as the “strengths” section, though I only talk about things in the paper that seem true to me and don’t e.g. praise the authors for having very clear writing and for being very good at conveying and promoting their ideas. 

Linear probes are pretty good for recovering salient features, and some truth-like feature is salient on many simple datasets. CCS only uses linear probes, and achieves good performance. This suggests that some feature akin to "truthiness" is represented linearly inside of the model for these contrast pairs, a result that seems somewhat borne out in follow-up work, albeit with major caveats. 

I also think that this result is consistent with other results across a large variety of papers. For example, Been Kim and collaborators were using linear probes to do interp on image models as early as 2017, while attaching linear classification heads to various segments of models has been a tradition since before I started following ML research (i.e. before ~late 2015). In terms of more recent work, we’ve not only seen work on toy-ish models showing that small transformers trained on Othello and Chess have linear world representations, but we’ve seen that simple techniques for finding linear directions can often be used to successfully steer language models to some extent. For a better summary of these results, consider reading the Representation Engineering and Linear Representation Hypothesis papers, as well as the 2020 Circuits thread on “Features as Directions”

Simple empirical techniques can make progress. The CCS paper takes a relatively straightforward idea, and executes on it well in an empirical setting. I endorse this strategy and think more people in AIS (but not the ML community in general) should do things like this.

Around late 2022, I became significantly more convinced of the value of the basic machine learning technique of "try the simplest thing". The CCS work was a significant reason for this, because I expected it to fail and was pleasantly surprised by the positive results. I also think that I've updated upwards on the fact that executing "try the simplest thing" well is surprisingly difficult. I think that even in cases where the "obvious" thing is probably doomed to fail, it's worth having someone try it anyway, because 1) you can be wrong, and more importantly 2) the way in which it fails can be informative. See also, obvious advice by Nate Soares and this tweet by Sam Altman. 

It’s worth studying empirical, average-case ELK. In particular, I think that it’s worth “doing the obvious thing” when it comes to ELK. My personal guess is that (worst-case) ELK is really hard, and that simple linear probes are unlikely to work for it because there’s not really an actual “truth” vector represented by LLMs. However, there’s still a chance that it might nonetheless work in the average case. (See also, this discussion of empirical generalization.) I think ultimately, this is a quantitative question that needs to be resolved empirically – to what extent can we find linear directions inside LLMs that correspond very well with truth? What does the geometry of the LLM activation space actually look like? 

Caveating claims from the CCS Paper

While I hold an overall positive view of the paper, I do think that some of the claims in the paper have either not held up over time, or are easily understood to mean something false. I’ll talk about some of my quibbles below. 

The CCS algorithm as stated does not seem to reliably recover a robust "truth" direction. The first and biggest problem is that CCS does not perform that well at its intended goal. Notably, CCS classifiers may pick up on other prominent features instead of the intended "truth" direction (even sometimes on contrast pairs that only differ in the label!). Some results from the Still No Lie Detector for Language Models paper suggest that this may be because CCS is representing which labels are positive versus negative (i.e. the normalization in the CCS paper does not always work to remove this information). Note that Collin does discuss this as a possible issue (for future language models) in the blog post, but this is not discussed in the paper. 

In addition, the Geometry of Truth paper, Fabien's results on CCS, and the Challenges with unsupervised knowledge discovery paper all note that CCS tends to be very dependent on the prompt for generative LMs. 

Does CCS work for the reasons the authors claim it does? From reading the introduction and the abstract, one might get the impression that the key insight is that truth satisfies a particular consistency condition, and thus that the consistency loss proposed in the paper is driving a lot of the results. 

However, I’m reasonably confident that it’s the contrast pairs that are driving CCS’s performance. For example, the Challenges with unsupervised knowledged discovery paper found that CCS does not outperform other unsupervised clustering methods. And as Scott Emmons notes, this is supported by section 3.3.3, where two other unsupervised clustering methods are competitive with CCS. Both the Challenges paper and Scott Emmon’s post also argue that CCS’s consistency loss is not particularly different from normal unsupervised learning losses. On the other hand, there’s significant circumstantial evidence that carefully crafted contrast pairs alone often define meaningful directions, for example in the activation addition literature. 

There’s also two proofs in the Challenges paper, showing “that arbitrary features satisfy the consistency structure of [CCS]”, that I have not had time to fully grok and situate. But insofar as you can take this claim is correct, this is further evidence against CCS's empirical performance being driven by its consistency loss. 

A few nitpicks on misleading presentation. One complaint I have is that the authors use the test set to decide if higher numbers from their learned classifier correspond to “true” or "false". This is mentioned briefly in a single sentence in section two (“For simplicity in our evaluations we take the maximum accuracy over the two possible ways of labeling the predictions of a given test set.”) and then one possible solution is discussed in Appendix A but not implemented. Also worth noting that the evaluation methodology used gives an unfair advantage to CCS (as it can never do worse than random chance on the test set, while many of the supervised methods perform worse than random). 

This isn’t necessarily a big strike against the paper: there’s only so much time for each project and the CCS paper already contains a substantial amount of content. I do wish that this were more clearly highlighted or discussed, as I think that this weakens the repeated claims that CCS “uses no labels” or “is completely unsupervised”.
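For concreteness, here's a toy sketch (my own numbers) of the evaluation point above: taking the maximum accuracy over the two possible labelings means even a completely uninformative probe can never do worse than chance on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
preds = rng.integers(0, 2, size=1000)   # a completely random "probe"
labels = rng.integers(0, 2, size=1000)
acc = (preds == labels).mean()
print(max(acc, 1 - acc))  # always >= 0.5 by construction
```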

My Take

I think that the paper made significant contributions by signal-boosting the idea of unsupervised knowledge discovery,[7] and by showing that you can achieve decent performance with contrast pairs and consistency checks. It also has spurred a large amount of follow-up work and interest in the topic. 

However, the paper is somewhat misleading in its presentation, and the primary driver of performance seems to be the construction of the contrast pairs and not the loss function. Follow-up work has also found that the results of the paper can be brittle, and suggests that CCS does not necessarily find a singular "truth direction".

The Post

The post starts out by discussing the setup and motivation of unsupervised knowledge recovery. Suppose you wanted to make a model "honestly" report its beliefs. When the models are subhuman or even human level, you can use supervised methods. Once the models are superhuman, these methods probably won't scale, for many reasons. However, if we used unsupervised methods, there might not be a stark difference between human-level and superhuman models, since we're not relying on human knowledge.  

The post then goes into some intuitions for why it might be possible: interp on vision models has found that models seem to learn meaningful features, and models seem to linearly represent many human-interpretable features. 

Then, Collin talks about the key takeaways from his paper, and also lists many caveats and limitations of his results.

Finally, the post concludes by explaining the challenges of applying CCS-style knowledge discovery techniques to powerful future LMs, as well as why Collin thinks that these unsupervised techniques may scale. 

A minor nitpick: 

Collin says:

I think this is surprising because before this it wasn’t clear to me whether it should even be possible to classify examples as true or false from unlabeled LM representations *better than random chance*!

As discussed above, the methodology in the paper guarantees that any classifier will do at least as well as random chance. I'm not actually sure how much worse it would do if you oriented the classifier on the training set as opposed to the test set. (And I'd be excited for someone to check this!)

My Take

I think this blog post is quite good, and I wish more academic-adjacent ML people would write blog posts caveating their results and discussing where they fit in. I especially appreciated the section on what the paper does and does not show, which I think accurately represents the evidence presented in the paper (as opposed to overhyping or downplaying it). In addition, I think Collin makes a strong case for studying more unsupervised approaches to alignment. 

I also strongly agree with Collin that it can be “extremely valuable to sketch out what a full solution could plausibly look like given [your] current model of how deep learning systems work”, and wish more people would do this. 
 

Acknowledgments

Thanks to Aryan Bhatt, Stephen Casper, Adria-Garriga Alonso, David Rein, and others for helpful conversations on this topic. Thanks to Raymond Arnold for poking me into writing this. 
 

(EDIT Jan 15th: added my footnotes back in to the comment, which were lost during the copy/paste I did from Google Docs.)

  1. ^

    This originally was supposed to link to a list of projects I'd be excited for people to do in this area, but I ran out of time before the LW review deadline. 

  2. ^

    I also draw on evidence from many, many other papers in related areas, which unfortunately I do not have time to list fully. A lot of my intuitions come from work on steering LLMs with activation additions, conversations with Redwood researchers on various kinds of coup or deception probes, and linear probing in general. 

  3. ^

    That is, a direction with an additional bias term. 

  4. ^

    Unlike other linear probing techniques, CCS does this without needing to know if the positive or negative answer is correct. However, some amount of labeled data is needed later to turn this pair into a classifier for truth/falsehood. 

  5. ^

    I’ll use “Positive” and “Negative” for the sake of simplicity in this review, though in the actual paper they also consider “Yes” and “No” as well as different labels for news classification and story completion. 

  6. ^

    Note that this gives their method a small advantage in their evaluation, since CCS gets to fit a binary parameter on the test set while the other methods do not. Using a fairer baseline does significantly negatively affect their headline numbers, but I think that the results generally hold up anyways. (I haven’t actually written any code for this review, so I’m trusting the reports of collaborators who have looked into it, as well as Fabien’s results that use randomly initialized directions with the CCS evaluation methodology instead of trained CCS directions.)

    Also, while writing this review, I originally thought that this issue was addressed in section 3.3.3, but that only addresses fitting the classifiers with fewer contrast pairs, as opposed to deciding whether the combined classifier corresponds to correct or incorrect answers. After spending ~5 minutes staring at the numbers and thinking that they didn’t make sense, I realized my mistake. Ugh. 

  7. ^

    This originally said “by introducing the idea”, but some people who reviewed the post convinced me otherwise. It’s definitely an introduction for many academics, however.

  8. ^

    The authors call this a “direction” in their abstract, but CCS really learns two directions. This is a nitpick.

Comment by LawrenceC (LawChan) on Saving the world sucks · 2024-01-12T20:32:21.855Z · LW · GW

I agree with most of the points you're making here. 

The rationalist/EA community doesn't reward prosocial behavior enough.

I think there's a continued debate about whether these groups should behave more like a professional circle or as a social community. (In practice, both spheres are a bit of both.) I think from the lens of EA/rats as a social group, we don't really provide enough emotional support and mental health resources. However, insofar as EA is intended to be a professional circle trying to do hard things, it makes sense why these resources might be deprioritized. 

Comment by LawrenceC (LawChan) on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2024-01-02T23:15:06.472Z · LW · GW

I agree with the overall point (that this was a solid intellectual contribution and is a reasonable-ish metric), but there have been a non-zero number of followups, or at least use cases of this work, imo. Off the top of my head:

  • In general, CaSc has been used on lots of toy/tiny models to a decent level of success. I agree that part of the reason for CaSc's lack of adoption is that the metric consistently returns "this explanation is not very faithful/complete/etc". For example:
    • I checked the hypotheses for the toy modular arithmetic/group composition work with my own hand-crafted CaSc implementation and found that the modular arithmetic results held up quite well. 
    • CaSc-style tests were used by Marius and Stefan to confirm their solutions to Stephen Casper's Mech Interp challenges (challenge 1, challenge 2).
    • etc.
  • Erik Jenner's agenda is pretty closely related to causal scrubbing and is still actively being worked on.
Comment by LawrenceC (LawChan) on 200 COP in MI: Exploring Polysemanticity and Superposition · 2023-12-27T02:52:53.539Z · LW · GW

I strongly upvoted this post because of the "Tips" section, which is something I've come around on only in the last ~2.5 months. 

Comment by LawrenceC (LawChan) on Employee Incentives Make AGI Lab Pauses More Costly · 2023-12-23T06:45:23.345Z · LW · GW

I strongly agree with the high-level point that conditional pauses are unlikely to go well without planning for what employees will do during the pause. 


 A nitpick: while (afaik) Anthropic has made no public statements about their plans, their RSP does include a commitment to:

Proactively plan for a pause in scaling. We will manage our plans and finances to support a pause in model training if one proves necessary, or an extended delay between training and deployment of more advanced models if that proves necessary. During such a pause, we would work to implement security or other measures required to support safe training and deployment, while also ensuring our partners have continued access to their present tier of models (which will have previously passed safety evaluations).

Comment by LawrenceC (LawChan) on Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) · 2023-12-23T06:27:02.276Z · LW · GW

This is great work, even though you weren't able to understand the memorization mechanistically.  

I agree that a big part of the reason to be pessimistic about ambitious mechanistic interp is that even very large neural networks are performing some amount of pure memorization. For example, existing LMs often can regurgitate some canary strings, which really seems like a case without any (to use your phrase) macrofeatures. Consequently, as you talk about in both posts 3 and 4, it's not clear that there even should be a good mechanistic understanding for how neural networks implement factual recall.  In the pathological worst case, with the hash and lookup algorithm, there are actually no nontrivial, interpretable features to recover. 

One hope here is that no "interesting" cognition depends in large part on uninterpretable memorization[1]; maybe understanding circuits is sufficient for understanding. It might be possible that for any dangerous capability a model could implement, we can't really understand how the facts it's using are mechanistically built up or even necessarily what all the facts are, but we can at least recognize the circuits building on factual representations, and do something with this level of understanding. 

I also agree that SAEs are probably not a silver bullet for this problem. (Or at least, it's not clear.) In the case of common names like "Michael Jordan", it seems likely that a sufficiently wide SAE would recover that feature (it's a macrofeature, to use the terminology from post 4). But I'm not sure how an SAE would help in one-off cases without internal structure, like predicting canary strings. 

Absent substantial conceptual breakthroughs, my guess is the majority of my hope for ambitious mechanistic interp lies in use cases that don't require understanding factual recall. Given the negative results in this work despite significant effort, my best guess for how I'd study this problem would be to look at more toy models of memorization, perhaps building on Anthropic's prior work on this subject. If it's cheap and I had more familiarity with the SOTA on SAEs, I'd probably just throw some SAEs at the problem, to confirm that the obvious uses of SAEs wouldn't help. 


Also, some comments on specific quotes from this post:

First, I'm curious why you think this is true:

intuitively, the number of facts known by GPT-4 vs GPT-3.5 scales superlinearly in the number of neurons, let alone the residual stream dimension.

Why specifically do you think this is intuitively true? (I think this is plausible, but don't think it's necessarily obvious.)

Second, a nitpick: you say in this post about post 4:

In post 4, we also studied a toy model mapping pairs of integers to arbitrary labels where we knew all the data and could generate as much as we liked, and didn’t find the toy model any easier to interpret, in terms of finding internal sparsity or meaningful intermediate states.

However, I'm not seeing any mention of trained models in post 4 -- is it primarily intended as a thought experiment to clarify the problem of memorization, or was part of the post missing?

(EDIT Jan 5 2024: in private correspondence with the authors, they've clarified that they have indeed done several toy experiments finding those results, but did not include them in post 4 because the results were uniformly negative.)

  1. ^

    This reminds me of an old MIRI post distinguishing between interpretable top-level reasoning and uninterpretable subsystem reasoning, and while they imagined MCTS and SGD as examples of top-level reasoning (as opposed to the interpretable algorithms inside a neural network), this hope is similar to one of their paths to aligned AI:

    Hope that top-level reasoning stays dominant on the default AI development path

    Currently, it seems like most AI systems' consequentialist reasoning is explainable in terms of top-level algorithms. For example, AlphaGo's performance is mostly explained by MCTS and the way it's trained through self-play. The subsystem reasoning is subsumed by the top-level reasoning and does not overwhelm it.

Comment by LawrenceC (LawChan) on Shallow review of live agendas in alignment & safety · 2023-11-29T10:07:09.817Z · LW · GW

I strongly agree that this would be valuable; if not for the existence of this shallow review I'd consider doing this myself just to serve as a reference for myself. 

Comment by LawrenceC (LawChan) on Shallow review of live agendas in alignment & safety · 2023-11-29T09:43:05.048Z · LW · GW

I spent a good hour or two reading the construction and proposed solution of the paper; here's my attempted explanation with cleaned up notation.

Basically, he considers a setting with four actions (a, b, c, d) and a real-valued state s, where R(s, a) > R(s, b) = R(s, c) > R(s, d) = 0 if s > 0, and 0 = R(s, d) > R(s, c) = R(s, b) > R(s, a) if s <= 0. 

The transition rule is:
s' = s - 1 + L if action b is taken and s > 0,
s' = s - 1 - L if action c is taken and s > 0,
s' = s - 1 otherwise
for some constant L >= 0. 

The paper imagines s as measuring 'time until the human presses the shutdown button', and imagines action d as "stay shut down". Because the reward of each of a, b, and c is greater than that of d before the buttonpress, the agent will never take action d before the buttonpress. After the buttonpress (i.e. the timestep at which s drops below 0), the utility of "stay shut down" is higher than that of the other actions, and so the agent is incentivized to take action d. That is, the agent has an incentive to shut down after the buttonpress (desideratum 1 of the 2015 Soares et al paper), by construction. 

Accordingly, b is the delay action (which, in the limit of large L, just disables the button) and c is the advance action (which, in the limit, corresponds to immediately pressing the button). We'll call action a the "exploit" action. 

In this setting, the goal is to create a state-dependent correction fudge-factor f_c : S x A -> \R that is applied when the rewards are flipped (that is, an additional reward added for the transition into s' <= 0 from s > 0), such that the agent never takes either the delaying action b or the advancing action c. 

(In his paper he calls these actions "p", ">", "<", and "e", and denotes the timestep at which s' <= 0 as "#", but that notation is annoying when ">" already means "greater than" and "<" already means less than. There's also a bunch of notational simplifications I've made that are harder to explain.)

In his setup, R(s,a) = 10, R(s,b) = R(s, c) = 8, and s_0 = 6. By default, he sets the delay/advance size L to 0.6. He also uses discount rate gamma = 0.9. With these values, the optimal policy does a once (s=5), b 12 times (s=0.2), a once again (s=-0.8), at which point the buttonpress occurs since s < 0, and thereafter takes action d. 
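To make the setup concrete, here's a minimal sketch of the toy MDP (the pre-buttonpress rewards are the ones quoted above; the specific post-buttonpress penalties for a, b, and c are my own illustrative placeholders), reproducing the optimal-policy trajectory just described:

```python
L = 0.6  # delay/advance size

def reward(s, action):
    if s > 0:
        return {"a": 10.0, "b": 8.0, "c": 8.0, "d": 0.0}[action]
    # After the buttonpress, "stay shut down" (d) is best; the exact negative
    # values for the other actions are illustrative assumptions.
    return {"a": -10.0, "b": -8.0, "c": -8.0, "d": 0.0}[action]

def step(s, action):
    if action == "b" and s > 0:   # delay: push the buttonpress back
        return s - 1 + L
    if action == "c" and s > 0:   # advance: bring the buttonpress forward
        return s - 1 - L
    return s - 1                  # a, d, or any action once s <= 0

# a once, b twelve times, a once more (triggering the buttonpress), then d forever:
s = 6.0
for action in ["a"] + ["b"] * 12 + ["a"]:
    s = step(s, action)
print(round(s, 1))  # -0.8, i.e. the shutdown button has been pressed
```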


His solution is simply the utility indifference solution from the Soares et al paper: consider an alternative reward function R' such that R'(s, x) = R(1, x); that is, a world in which the shutdown button is never pressed. Then, we add the difference in (optimal) value functions under R' relative to R:

f(s, x) = V*_R'(s) - V*_R(s) = V*_R'(s)

(The second equality holds because, under the original R with R(s, d) = 0, the value of being shut down is zero.)

By construction, the agent has no incentive to take the advance or delay actions (b or c), because its expected value in both cases is the same.  (desiderata 2 and 3 of the Soares et al paper)

Unfortunately, as this is in some sense the same solution as in the 2015 Soares et al paper, it should run into the same problem of "there's no incentive to preserve the shutdown mechanism". (That is, it fails desideratum 4 of the 2015 paper.)


From a quick skim of section 9.8 (which argues for the corrigibility of his indifference solution) and sections 8.2 and 8.4 (which detail parts of his indifference solution), it seems like he proposes the following modification: instead of having R(s, d) = 0 for all s, we instead have R(s, d) be larger when s <= 0. This does incentivize preserving the ability to be shut down. He calls this agent, with the modified R, the g_c agent. 

The g_c agent does indeed satisfy desideratum 4; there's an incentive to preserve the shutdown mechanism; in fact, there's again an incentive to press the shutdown button!

The specific problem is that this modification changes the reward R(s, d), but in his construction/implementation he does not go back to make the corresponding change to f(s, x). (He keeps it at V*_R'(s), which no longer equals the value difference between shutting down and not shutting down.) This breaks the utility indifference, causing the Q-value of the advance action c to be larger for sufficiently large values of L. That is, his g_c agent now has an incentive to once again press the shutdown button (!!).


 TL;DR: no, based on a quick skim, the paper doesn't solve corrigibility.

Comment by LawrenceC (LawChan) on Shallow review of live agendas in alignment & safety · 2023-11-27T21:11:27.549Z · LW · GW

Very small nitpick: I think you should at least add Alex Lyzhov, David Rein, Jacob Pfau, Salsabila Mahdi, and Julian Michael to the NYU Alignment Research Group entry; it's a bit weird not to list any NYU PhD students/RSs/postdocs when listing people involved in NYU ARG. 

Both Alex Lyzhov and Jacob Pfau also post on LW/AF:

Comment by LawrenceC (LawChan) on Shallow review of live agendas in alignment & safety · 2023-11-27T21:00:03.329Z · LW · GW

Expanding on this -- this whole area is probably best known as "AI Control", and I'd lump it under "Control the thing" as its own category. I'd also move Control Evals to this category as well, though someone at RR would know better than I. 

Comment by LawrenceC (LawChan) on Shallow review of live agendas in alignment & safety · 2023-11-27T17:31:24.725Z · LW · GW

Thanks for making this! I’ll have thoughts and nitpicks later, but this will be a useful reference!

Comment by LawrenceC (LawChan) on Paper out now on creatine and cognitive performance · 2023-11-27T01:51:47.265Z · LW · GW

Thanks for doing this study! I'm glad that people are doing RCTs on creatine with more subjects. (Also, I didn't know that vegetarians had similar amounts of brain creatine as omnivores, which meant I would've incorrectly guessed that vegetarians benefit more than omnivores from creatine supplementation). 

Here's the abstract of the paper summarizing the key results and methodology:

Background

Creatine is an organic compound that facilitates the recycling of energy-providing adenosine triphosphate (ATP) in muscle and brain tissue. It is a safe, well-studied supplement for strength training. Previous studies have shown that supplementation increases brain creatine levels, which might increase cognitive performance. The results of studies that have tested cognitive performance differ greatly, possibly due to different populations, supplementation regimens, and cognitive tasks. This is the largest study on the effect of creatine supplementation on cognitive performance to date.

Methods

Our trial was preregistered, cross-over, double-blind, placebo-controlled, and randomised, with daily supplementation of 5 g for 6 weeks each. We tested participants on Raven’s Advanced Progressive Matrices (RAPM) and on the Backward Digit Span (BDS). In addition, we included eight exploratory cognitive tests. About half of our 123 participants were vegetarians and half were omnivores.

Results

Bayesian evidence supported a small beneficial effect of creatine. The creatine effect bordered significance for BDS (p = 0.064, η2P = 0.029) but not RAPM (p = 0.327, η2P = 0.008). There was no indication that creatine improved the performance of our exploratory cognitive tasks. Side effects were reported significantly more often for creatine than for placebo supplementation (p = 0.002, RR = 4.25). Vegetarians did not benefit more from creatine than omnivores.

Conclusions

Our study, in combination with the literature, implies that creatine might have a small beneficial effect. Larger studies are needed to confirm or rule out this effect. Given the safety and broad availability of creatine, this is well worth investigating; a small effect could have large benefits when scaled over time and over many people.

Note that the effect size is quite small:

We found Bayesian evidence for a small beneficial effect of creatine on cognition for both tasks. Cohen’s d based on the estimated marginal means of the creatine and placebo scores was 0.09 for RAPM and 0.17 for BDS. If these were IQ tests, the increase in raw scores would mean 1 and 2.5 IQ points. The preregistered frequentist analysis of RAPM and BDS found no significant effect at p < 0.05 (two-tailed), although the effect bordered significance for BDS.

Comment by LawrenceC (LawChan) on AI Safety Research Organization Incubation Program - Expression of Interest · 2023-11-22T00:18:23.922Z · LW · GW

I don't think that's actually true at all; Anthropic was explicitly a scaling lab when it was founded, for example, and Deepmind does not seem like it was "an attempt to found an ai safety org". 

It is the case that Anthropic/OAI/Deepmind did feature AI Safety people supporting the org, and the motivation behind the orgs is indeed safety, but the people involved did know that they were also going to build SOTA AI models. 

Comment by LawrenceC (LawChan) on Sam Altman fired from OpenAI · 2023-11-17T22:12:19.732Z · LW · GW

Thanks, edited.

Comment by LawrenceC (LawChan) on Picking Mentors For Research Programmes · 2023-11-11T05:59:33.254Z · LW · GW

I'm not sure I agree -- I think historically I made the opposite mistake, and from a rough guess the average new grad student at top CS programs tends to look too much for straightforward new projects (in part because you needed to have a paper in undergrad to get in, and therefore have probably done a project that was pretty straightforward and timeboxed).

I do think many early SERI MATS mentees did make the mistake you describe though, so maybe amongst people who are reading this post, the average person considering mentorship (who is not the average grad student) would indeed make your mistake? 

Comment by LawrenceC (LawChan) on Trying to understand John Wentworth's research agenda · 2023-10-20T02:03:13.854Z · LW · GW

My hope is that products will give a more useful feedback signal than other peoples' commentary on our technical work.

I'm curious what form these "products" are intended to take -- if possible, could you give some examples of things you might do with a theory of natural abstractions? If I had to guess, the product will be an algorithm that identifies abstractions in a domain where good abstractions are useful, but I'm not sure how or in what domain. 

Comment by LawrenceC (LawChan) on Interpretability Externalities Case Study - Hungry Hungry Hippos · 2023-10-04T19:00:54.143Z · LW · GW

Oh, I like that one! Going to use it from now on

Comment by LawrenceC (LawChan) on Interpretability Externalities Case Study - Hungry Hungry Hippos · 2023-10-02T19:43:27.861Z · LW · GW

Sure, though it seems too general or common to use a long word for it?

Maybe "linear intervention"?