Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping 2023-07-20T09:56:05.574Z
Causal confusion as an argument against the scaling hypothesis 2022-06-20T10:54:05.623Z
Sparsity and interpretability? 2020-06-01T13:25:46.557Z
How can Interpretability help Alignment? 2020-05-23T16:16:44.394Z
What is Interpretability? 2020-03-17T20:23:33.002Z


Comment by RobertKirk on TurnTrout's shortform feed · 2024-01-18T10:21:23.265Z · LW · GW

I think this and other potential hypotheses can potentially be tested empirically today rather than only being distinguishable close to AGI

How would you imagine doing this? I understand your hypothesis to be "If a model generalises as if it's a mesa-optimiser, then it's better-described as having simplicity bias". Are you imagining training systems that are mesa-optimisers (perhaps explicitly using some kind of model-based RL/inference-time planning and search/MCTS), and then trying to see if they tend to learn simple cross-episode inner goals, which would be implied by a stronger simplicity bias?

Comment by RobertKirk on Steering Llama-2 with contrastive activation additions · 2024-01-10T16:22:26.161Z · LW · GW

A quick technical question: In the comparison to fine-tuning results in Section 6 where you stack CAA with fine-tuning, do you find a new steering vector after each fine-tune, or are you using the same steering vector for all fine-tuned models? My guess is you're doing the former as it's likely to be more performant, but I'd be interested to see what happens if you try to do the latter.

Comment by RobertKirk on Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping · 2024-01-08T09:16:03.453Z · LW · GW

I think a point of no return exists if you only use small LRs. I think if you can use any LR (or any LR schedule) then you can definitely jump out of the loss basin. You could imagine just choosing a really large LR to basically reset to a random init and then starting again.

I do think that if you want to utilise the pretrained model effectively, you likely want to stay in the same loss basin during fine-tuning.

Comment by RobertKirk on Measuring and Improving the Faithfulness of Model-Generated Reasoning · 2023-07-31T13:35:22.384Z · LW · GW

To check I understand this, is another way of saying this that in the scaling experiments, there's effectively a confounding variable which is model performance on the task:

  • Improving zero-shot performance decreases deletion-CoT-faithfulness
    • The model will get the answer right with no CoT, so adding a prefix of correct reasoning is unlikely to change the output, hence a decrease in faithfulness
  • Model scale is correlated with model performance.
  • So the scaling experiments show model scale is correlated with less faithfulness, but probably via the correlation with model performance.

If you had a way of measuring the faithfulness conditioned on a given performance for a given model scale then you could measure how scaling up changes faithfulness. Maybe for a given model size you can plot performance vs faithfulness (with each dataset being a point on this plot), measure the correlation for that plot and then use that as a performance-conditioned faithfulness metric? Or there's likely some causally correct way of measuring the correlation between model scale and faithfulness while removing the confounder of model performance.
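One crude way to implement the "measure the correlation while removing the confounder" idea is a partial correlation: residualise both scale and faithfulness on performance, then correlate the residuals. A sketch, where the residualisation approach and all the numbers are hypothetical, just to illustrate the computation:

```python
import numpy as np

def partial_corr(x, y, z):
    # Correlation between x and y after linearly regressing z out of both:
    # a crude way to correlate scale with faithfulness while controlling
    # for the confounder (task performance).
    x, y, z = map(np.asarray, (x, y, z))
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical per-(scale, dataset) measurements, for illustration only.
scale = np.array([1, 1, 2, 2, 4, 4, 8, 8], dtype=float)    # model size
perf = np.array([.50, .55, .60, .65, .70, .75, .80, .85])   # zero-shot accuracy
faith = np.array([.90, .88, .80, .78, .70, .68, .60, .58])  # CoT faithfulness

print(partial_corr(scale, faith, perf))
```

This only removes a linear effect of performance, so it's at best a first approximation to the causally correct version.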

Comment by RobertKirk on QAPR 5: grokking is maybe not *that* big a deal? · 2023-07-25T17:31:57.766Z · LW · GW

If you train on infinite data, I assume you'd not see a delay between training and testing, but you'd expect a non-monotonic accuracy curve that looks kind of like the test accuracy curve in the finite-data regime? So I assume infinite data is also cheating?

Comment by RobertKirk on QAPR 5: grokking is maybe not *that* big a deal? · 2023-07-24T10:39:34.013Z · LW · GW

I've been using "delayed generalisation", which I think is more precise than "grokking", places the emphasis on the delay rather the speed of the transition, and is a short phrase.

Comment by RobertKirk on Maze-solving agents: Add a top-right vector, make the agent go to the top-right · 2023-04-02T14:47:08.768Z · LW · GW

I think the hyperlink for "conv nets without residual streams" is wrong? It points somewhere else for me.

Comment by RobertKirk on Existential AI Safety is NOT separate from near-term applications · 2022-12-21T13:39:54.168Z · LW · GW

This feels kind of like a semantic disagreement to me. To ground it, it's probably worth considering whether further research on the CCS-style work I posted would also be useful for self-driving cars (or other applications). I think that would depend on whether the work improves the robustness of the contrastive probing regardless of what is being probed for (which would be generically useful), or whether it improves the probing specifically for truthfulness in systems that have a conception of truthfulness, possibly by improving the constraints or adding additional constraints (less useful for other systems). I think both would be good, but I'm uncertain which would be more useful to pursue if one is motivated by reducing x-risk from misalignment.

Comment by RobertKirk on Positive values seem more robust and lasting than prohibitions · 2022-12-21T13:35:21.904Z · LW · GW

I think that "don't kill humans" can't chain into itself because there's not a real reason for its action-bids to systematically lead to future scenarios where it again influences logits and gets further reinforced, whereas "drink juice" does have this property.


I'm trying to understand why the juice shard has this property. Which of these (if any) is the explanation for this:

  • Bigger juice shards will bid on actions which will lead to juice multiple times over time, as they push the agent towards juice from quite far away (both temporally and spatially), and hence will be strongly reinforced when the reward comes, even though it's only a single reinforcement event (actually getting the juice).
  • Juice will be acquired more with stronger juice shards, leading to a kind of virtuous cycle, assuming that getting juice is always positive reward (or positive advantage/reinforcement, to avoid zero-point issues)

The first seems at least plausible to also apply to "avoid moldy food", if it requires multiple steps of planning to avoid moldy food (throwing out moldy food, buying fresh ingredients and then cooking them, etc.)

The second does seem to be more specific to juice than mold, but it seems to me that's because getting juice is rare, and is something we can get better and better at, whereas avoiding moldy food is fairly easy to learn, and past that there's not much reinforcement to happen. If that's the case, then I kind of see that as being covered by the rare-states explanation in my previous comment, or maybe an extension of that to "rare states and skills in which improvement leads to more reward".

Having just read tailcalled's comment, I think that is in some sense another way of phrasing what I was trying to say, where rare (but not too rare) states are likely to mean that policy-caused variance is high on those decisions. Probably policy-caused variance is more fundamental/closer as an explanation to what's actually happening in the learning process, but maybe states of a certain rarity which are high-reward/reinforcement are one possible environmental feature that produces policy-caused variance.

Comment by RobertKirk on Existential AI Safety is NOT separate from near-term applications · 2022-12-20T15:47:48.685Z · LW · GW

So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others.

One example: A tool that is designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness, doesn't seem like it would be useful for improving the reliability of self-driving cars, as self-driving cars likely aren't misaligned in the sense that they could drive perfectly safely but choose not to, but rather are just unable to drive perfectly safely because some of their internal (learned) systems aren't sufficiently robust.

Comment by RobertKirk on Existential AI Safety is NOT separate from near-term applications · 2022-12-19T15:57:27.413Z · LW · GW

Not Paul, but some possibilities why ARC's work wouldn't be relevant for self-driving cars:

  • The stuff Paul said about them aiming at understanding quite simple human values (don't kill us all, maintain our decision-making power) rather than subtle things. It's likely for self-driving cars we're more concerned with high reliability and hence would need to be quite specific. E.g., maybe ARC's approach could discern whether a car understands whether it's driving on the road or not (seems like a fairly simple concept), but not whether it's driving in a riskier way than humans in specific scenarios.
  • One of the problems that I think ARC is worried about is ontology identification, which seems like a meaningfully different problem for sub-human systems (whose ontologies are worse than ours, so in theory could be injected into ours) than for human-level or super-human systems (where that may not hold). Hence focusing on the super-human case would look weird and possibly not helpful for the subhuman case, although it would be great if they could solve all the cases in full generality.
  • Maybe once it works ARC's approach could inform empirical work which helps with self-driving cars, but if you were focused on actually doing the thing for cars you'd just aim directly at that, whereas ARC's approach would be a very roundabout and needlessly complex and theoretical way of solving the problem (this may or may not actually be the case, maybe solving this for self-driving cars is actually fundamentally difficult in the same way as for ASI, but it seems less likely).
Comment by RobertKirk on Positive values seem more robust and lasting than prohibitions · 2022-12-19T15:46:33.696Z · LW · GW

I found it useful to compare a shard that learns to pursue juice (positive value) to one that avoids eating mouldy food (prohibition), just so they're on the same kind of framing/scale.

It feels like a possible difference between prohibitions and positive values is that positive values specify a relatively small portion of the state space that is good/desirable (there are not many states in which you're drinking juice), and hence possibly activate less frequently, or only when parts of the state space like that are accessible. Prohibitions, by contrast, specify a large part of the state space that is bad, but not so large that the complement is small: there are perhaps many potential states where you eat mouldy food, but the complement of that set is still nowhere near as small as the set of states where you're drinking juice. The first feels more suited to forming longer-term plans towards the small part of the state space (cf this definition of optimisation), whereas the second is less so. Then shards that start doing optimisation like this are hence more likely to become agentic/self-reflective/meta-cognitive etc.

In effect, positive values are more likely/able to self-chain because they actually (kind of, implicitly) specify optimisation goals, and hence shards can optimise them, and hence grow and improve that optimisation power, whereas prohibitions specify a much larger desirable state set, and so don't require or encourage optimisation as much.

As an implication of this, I could imagine that in most real-world settings "don't kill humans" would act as you describe, but in environments where it's very easy to accidentally kill humans, such that states where you don't kill humans are actually very rare, then the "don't kill humans" shard could chain into itself more, and hence become more sophisticated/agentic/reflective. Does that seem right to you?

Comment by RobertKirk on Trying to disambiguate different questions about whether RLHF is “good” · 2022-12-14T22:55:16.864Z · LW · GW

Thanks for the answer! I feel uncertain whether that suggestion is an "alignment" paradigm/method though - either these formally specified goals don't cover most of the things we care about, in which case this doesn't seem that useful, or they do, in which case I'm pretty uncertain how we can formally specify them - that's kind of the whole outer alignment problem. Also, there is still (weaker) pressure to produce outputs that look good to humans, if humans are searching over goals to find those that produce good outputs. I agree it's further away, but that seems like it could also be a bad thing, if it makes it harder to pressure the models to actually do what we want in the first place.

Comment by RobertKirk on Trying to disambiguate different questions about whether RLHF is “good” · 2022-12-14T17:02:07.838Z · LW · GW

I still don't think you've proposed an alternative to "training a model with human feedback". "maintaining some qualitative distance between the optimisation target for an AI model and the human “does this look good?” function" sounds nice, but how do we even do that? What else should we optimise the model for, or how should we make it aligned? If you think the solution is use AI-assisted humans as overseers, then that doesn't seem to be a real difference with what Buck is saying. So even if he actually had written that he's not aware of an alternative to "training a model with human/overseer feedback", I don't think you've refuted that point.

Comment by RobertKirk on Deconfusing Direct vs Amortised Optimization · 2022-12-13T15:10:00.599Z · LW · GW

An existing example of something like the difference between amortised and direct optimisation is doing RLHF (w/o KL penalties to make the comparison exact) vs doing rejection sampling (RS) with a trained reward model. RLHF amortises the cost of directly finding good outputs according to the reward model, such that at evaluation the model can produce good outputs with a single generation, whereas RS requires no training on top of the reward model, but uses lots more compute at evaluation by generating and filtering with the RM. (This case doesn't exactly match the description in the post as we're using RL in the amortised optimisation rather than SL. This could be adjusted by gathering data with RS, and then doing supervised fine-tuning on that RS data, and seeing how that compares to RS).
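A minimal sketch of the rejection-sampling side of this comparison, with toy stand-ins for the generator and the reward model (the names and the length-based scoring rule are made up purely for illustration):

```python
import random

def reward_model(text):
    # Stand-in for a trained reward model; here, longer outputs score higher.
    return len(text)

def generate(prompt, rng):
    # Stand-in for sampling a completion from the base LM.
    completion = "".join(rng.choice("ab") for _ in range(rng.randint(1, 10)))
    return prompt + " " + completion

def rejection_sample(prompt, n, rng):
    # Best-of-n: spend n generations plus n RM calls at evaluation time,
    # instead of amortising that search into the policy with RLHF.
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=reward_model)

best = rejection_sample("Q: why?", n=16, rng=random.Random(0))
```

The amortised (RLHF) policy would instead pay a large one-off training cost so that a single generation at evaluation time scores comparably to this best-of-n search.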

Given we have these two types of optimisation, I think two key things to consider are how each type of optimisation interacts with Goodhart's Law, and how they both generalise (kind of analogous to outer/inner alignment, etc.):

  • The work on overoptimisation scaling laws in this setting shows that, at least on distribution, there does seem to be a meaningful difference to the over-optimisation behaviour between the two types of optimisation - as shown by the different functional forms for RS vs RLHF.
  • I think the generalisation point is most relevant when we consider that the optimisation process used (either in direct optimisation to find solutions, or in amortised optimisation to produce the dataset to amortise) may not generalise perfectly. In the setting above, this corresponds to the reward model not generalising perfectly. It would be interesting to see a similar investigation as the overoptimisation work but for generalisation properties - how does the generalisation of the RLHF policy relate to the generalisation of the RM, and similarly to the RS policy? Of course, over-optimisation and generalisation probably interact, so it may be difficult to disentangle whether poor performance under distribution shift is due to over-optimisation or misgeneralisation, unless we have a gold RM that also generalises perfectly.
Comment by RobertKirk on Concept extrapolation for hypothesis generation · 2022-12-13T14:36:17.014Z · LW · GW

Instead, Aligned AI used its technology to automatically tease out the ambiguities of the original data.

Could you provide any technical details about how this works? Otherwise I don't know what to take from this post.

Comment by RobertKirk on Alignment allows "nonrobust" decision-influences and doesn't require robust grading · 2022-11-29T13:27:43.909Z · LW · GW

Question: How do we train an agent which makes lots of diamonds, without also being able to robustly grade expected-diamond-production for every plan the agent might consider?

I thought you were about to answer this question in the ensuing text, but it didn't feel like to me you gave an answer. You described the goal (values-child), but not how the mother would produce values-child rather than produce evaluation-child. How do you do this?

Comment by RobertKirk on Engineering Monosemanticity in Toy Models · 2022-11-23T16:42:38.470Z · LW · GW

You might well expect that features just get ignored below some threshold and monosemantically represented above it, or it could be that you just always get a polysemantic morass in that limit

I guess the recent work on Polysemanticity and Capacity seems to suggest the latter case, especially in sparser settings, given the zone where multiple features are represented polysemantically, although I can't remember if they investigate power-law feature frequencies or just uniform frequencies.

were a little concerned about going down a rabbit hole given some of the discussion around whether the results replicated, which indicated some sensitivity to optimizer and learning rate.

My impression is that that discussion was more about whether the empirical results (i.e. do ResNets have linear mode connectivity?) held up, rather than whether the methodology used and present in the code base could be used to find whether linear mode connectivity is present between two models (up to permutation) for a given dataset. I imagine you could take the code and easily adapt it to check for LMC between two trained models pretty quickly (it's something I'm considering trying to do as well, hence the code requests).

I think (at least in our case) it might be simpler to get at this question, and I think the first thing I'd do to understand connectivity is ask "how much regularization do I need to move from one basin to the other?" So for instance suppose we regularized the weights to directly push them from one basin towards the other, how much regularization do we need to make the models actually hop?

That would definitely be interesting to see. I guess this is kind of presupposing that the models are in different basins (which I also believe, but it hasn't yet been verified). I also think looking at basins and connectivity would be more interesting in the case where there was more noise, either from initialisation, inherently in the data, or by using a much lower batch size so that SGD was noisy. In this case it's less likely that the same configuration results in the same basin, but if your interventions are robust to these kinds of noise then it's a good sign.
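To make the "how much regularisation do I need to hop basins?" idea concrete, here's a toy 1-D sketch: a double-well loss with two basins, plus an L2 pull towards the other solution, sweeping the pull strength until the optimiser hops. Everything here (the loss, the sweep, the thresholds) is an illustrative assumption, not the actual experiment:

```python
import numpy as np

def double_well(w):
    # Toy loss with two basins, minima at w = -1 and w = +1.
    return (w ** 2 - 1) ** 2

def minimise(loss, w0, lr=0.01, steps=5000):
    # Plain gradient descent with a finite-difference gradient.
    w = w0
    for _ in range(steps):
        eps = 1e-5
        grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
        w -= lr * grad
    return w

def hop_threshold(w_start=-1.0, w_target=1.0):
    # Smallest pull strength lam that makes the optimiser leave the
    # w_start basin and end up on w_target's side.
    for lam in np.linspace(0, 2, 41):
        total = lambda w, lam=lam: double_well(w) + lam * (w - w_target) ** 2
        if minimise(total, w_start) > 0:
            return float(lam)
    return None

print(hop_threshold())
```

The analogous experiment on real models would regularise the weights towards the other trained model's weights and record the smallest coefficient at which fine-tuning actually crosses over.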

Good question! We haven't tried that precise experiment, but have tried something quite similar. Specifically, we've got some preliminary results from a prune-and-grow strategy (holding sparsity fixed, pruning smallest-magnitude weights, enabling non-sparse weights) that does much better than a fixed sparsity strategy.

I'm not quite sure how to interpret these results in terms of the lottery ticket hypothesis though. What evidence would you find useful to test it?

That's cool, looking forward to seeing more detail. I think these results don't seem that related to the LTH (if I understand your explanation correctly), as the LTH involves finding sparse subnetworks in dense ones. Possibly it only actually holds in models with many more parameters; I haven't seen it investigated in models that aren't overparametrised in a classical sense.

I think if iterative magnitude pruning (IMP) on these problems produced much sparser subnetworks that also maintained the monosemanticity levels, then that would suggest that sparsity doesn't penalise monosemanticity (or polysemanticity) in this toy model, and also (much more speculatively) that the sparse well-performing subnetworks that IMP finds in other networks possibly also maintain their levels of poly/mono-semanticity. If we also think these networks are favoured towards poly or mono, then that hints at how the overall learning process is favoured towards poly or mono.

Comment by RobertKirk on Engineering Monosemanticity in Toy Models · 2022-11-19T17:01:15.696Z · LW · GW

This work looks super interesting, definitely keen to see more! Will you open-source your code for running the experiments and producing plots? I'd definitely be keen to play around with it. (They already did here: I just missed it. Thanks! Although it would be useful to have the plotting code as well, if that's easy to share?)

Note that we primarily study the regime where there are more features than embedding dimensions (i.e. the sparse feature layer is wider than the input) but where features are sufficiently sparse that the number of features present in any given sample is smaller than the embedding dimension. We think this is likely the relevant limit for e.g. language models, where there are a vast array of possible features but few are present in any given sample.

I agree that N (true feature dimension) > d (observed dimension), and that sparsity will be high, but I'm uncertain whether the other part of the regime (that you don't mention here), that k (model latent dimension) > N, is likely to be true. Do you think that is likely to be the case? As an analogy, I think the intermediate feature dimensions in MLP layers in transformers (analogously k) are much lower dimension than the "true intrinsic dimension of features in natural language" (analogously N), even if it is larger than the input dimension (embedding dimension * num_tokens, analogously d). So I expect k < N, whereas in your regime k > N. Do you think you'd be able to find monosemantic networks for k < N? Did you try out this regime at all? (I don't think I could find it in the paper.)

In the paper you say that you weakly believe that monosemantic and polysemantic network parametrisations are likely in different loss basins, given they're implementing very different algorithms. I think (given the size of your networks) it should be easy to test for at least linear mode connectivity with something like git re-basin. Have you tried doing that? I think there are also algorithms for finding non-linear (e.g. quadratic) mode connectivity, although I'm less familiar with them. If it is the case that they're in different basins, I'd be curious to see whether there are just two basins (poly vs mono), or a basin for each level of monosemanticity, or if even within a level of polysemanticity there are multiple basins. If it's one of the former cases, it'd be interesting to do something like the connectivity-based fine-tuning talked about here (in effect, optimise for a new parametrisation that is linearly disconnected from the previous one), and see if doing that from a polysemantic initialisation can produce a more monosemantic one, or if it just becomes polysemantic in a different way.
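The linear-mode-connectivity check itself is cheap for networks this size: evaluate the loss along the straight line between the two parameter vectors and measure the barrier above the endpoints. A sketch (assuming you've already aligned the two parametrisations, e.g. by a re-basin permutation):

```python
import numpy as np

def loss_barrier(theta_a, theta_b, loss_fn, n_points=11):
    # Max height of the loss above the straight line between the two
    # endpoint losses, along the linear path between the two solutions.
    # A barrier near zero suggests linear mode connectivity.
    alphas = np.linspace(0, 1, n_points)
    losses = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    baseline = np.linspace(losses[0], losses[-1], n_points)
    return float(np.max(losses - baseline))
```

Here `theta_a`/`theta_b` would be the flattened weights of the mono and poly networks, and `loss_fn` the training loss reassembled from a flat vector.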

You also mentioned your initial attempts at sparsity through a hard-coded initially sparse matrix failed; I'd be very curious to see whether a lottery ticket-style iterative magnitude pruning was able to produce sparse matrices from the high-latent-dimension monosemantic networks that are still monosemantic, or more broadly how the LTH interacts with polysemanticity - are lottery tickets less polysemantic, or more, or do they not really change the monosemanticity?
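The IMP loop I have in mind is roughly the following (treating training as a black box; the 20%-per-round schedule is just the usual lottery-ticket default, and this is a sketch rather than your setup):

```python
import numpy as np

def imp_masks(train_fn, init_w, prune_frac=0.2, rounds=5):
    # Iterative magnitude pruning sketch: train, prune the smallest
    # surviving weights, rewind to the original init, and repeat.
    # train_fn maps a (masked) init to trained weights of the same shape.
    mask = np.ones_like(init_w)
    for _ in range(rounds):
        w = train_fn(init_w * mask) * mask
        cutoff = np.quantile(np.abs(w[mask == 1]), prune_frac)
        mask = mask * (np.abs(w) > cutoff)
    return mask
```

One could then measure the mono/polysemanticity of the network restricted to the final mask, versus the dense network it came from.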

If my understanding of the bias decay method is correct, is a large initial part of training only reducing the bias (through weight decay) until certain neurons start firing? If that's the case, could you calculate the maximum output in the latent dimension on the dataset at the start of training (say B), and then initialise the bias to be just below -B, so that you skip almost all of the portion of training that's only moving the bias term? You could do this per-neuron or just maxing over neurons. Or is this portion of training relatively small compared to the rest of training, and the slower convergence more due to fewer neurons getting gradients even when some of them are outputting higher than the bias?
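Concretely, the per-neuron version of the initialisation I'm suggesting would be something like this (assuming a simple linear pre-activation; the names and the margin are mine, not from the paper):

```python
import numpy as np

def init_bias_from_data(W, X, margin=0.01):
    # Per-neuron bias initialised just below minus the max pre-activation
    # seen on the data, so training can skip the phase where weight decay
    # slowly drags the bias down to where neurons first start firing.
    pre = X @ W.T              # (n_samples, n_neurons) pre-activations
    B = pre.max(axis=0)        # max pre-activation per neuron
    return -(B + margin)
```

The max-over-neurons variant would just replace `pre.max(axis=0)` with a scalar `pre.max()`.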

Comment by RobertKirk on Current themes in mechanistic interpretability research · 2022-11-16T15:27:40.296Z · LW · GW

Thanks for writing the post, and it's great to see that (at least implicitly) lots of the people doing mechanistic interpretability (MI) are talking to each other somewhat.

Some comments and questions:

  • I think "science of deep learning" would be a better term than "deep learning theory" for what you're describing, given that I think all the phenomena you list aren't yet theoretically grounded or explained in a mathematical way, and are rather robust empirical observations. Deep learning theory could be useful, especially if it had results concerning the internals of the network, but I think that's a different genre of work to the science of DL work.
  • In your description of the relevance of the lottery ticket hypothesis (LTH), it feels like a bit of a non-sequitur to immediately discuss removing dangerous circuits at initialisation. I guess you think this is because lottery tickets are in some way about removing circuits at the beginning of training (although currently we only know how to find out which circuits by getting to the end of training)? I think the LTH potentially has broader relevance for MI, i.e.:  if lottery tickets do exist and are of equal performance, then it's possible they'd be easier to interpret (due to increased sparsity); or just understanding what the existence of lottery tickets means for what circuits are more likely to emerge during neural network training.
  • When you say "Automating Mechanistic Interpretability research", do you mean automating (1) the task of interpreting a given network (automating MI), or automating (2) the research of building methods/understanding/etc. that enable us to better-interpret neural networks (automating MI Research)? I realise that a lot of current MI research, even if the ultimate goal is (2), is mostly currently doing (1) as a first step.
    Most of the text in that section implies automating (1) to me, but "Eventually, we might also want to automate the process of deciding which interventions to perform on the model to improve AI safety" seems to lean more towards automating (2), which comes under the general approach of automating alignment research. Obviously it would be great to be able to do both of them, but automating (1) seems both much more tractable, and also probably necessary to enable scalable interpretability of large models, whereas (2) is potentially less necessary for MI research to be useful for AI safety.
Comment by RobertKirk on Sticky goals: a concrete experiment for understanding deceptive alignment · 2022-10-03T13:06:28.380Z · LW · GW

I've now had a conversation with Evan where he's explained his position here and I now agree with it. Specifically, this is in low path-dependency land, and just considering the simplicity bias. In this case it's likely that the world model will be very simple indeed (e.g. possibly just some model of physics), and all the hard work is done at run time in activations by the optimisation process. In this setting, Evan argued that the only very simple goal we can encode (that we know of) that would produce behaviour that maximises the training objective is any arbitrary simple objective, combined with the reasoning that deception and power are instrumentally useful for that objective. To encode an objective that would reliably produce maximisation of the training objective always (e.g. internal or corrigible alignment), you'd need to fully encode the training objective.

Comment by RobertKirk on Causal confusion as an argument against the scaling hypothesis · 2022-09-20T15:49:59.866Z · LW · GW

Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.

The main argument of the post isn't "ASI/AGI may be causally confused, what are the consequences of that" but rather "Scaling up static pretraining may result in causally confused models, which hence probably wouldn't be considered ASI/AGI". I think in practice if we get AGI/ASI, then almost by definition I'd think it's not causally confused.

OOD misgeneralisation is absolutely inevitable, due to Gödel's incompleteness of the universe and the fact that all the systems that evolve on Earth generally climb up in complexity

In a theoretical sense this may be true (I'm not really familiar with the argument), but in practice OOD misgeneralisation is probably a spectrum, and models can be more or less causally confused about how the world works. We're arguing here that static training, even when scaled up, plausibly doesn't lead to a model that isn't causally confused about a lot of how the world works.

Did you use the term "objective misgeneralisation" rather than "goal misgeneralisation" on purpose? "Objective" and "goal" are synonyms, but "objective misgeneralisation" is hardly used, "goal misgeneralisation" is the standard term.

No reason, I'll edit the post to use goal misgeneralisation. Goal misgeneralisation is the standard term but hasn't been so for very long (see e.g. this tweet:

Maybe I miss something obvious, but this argument looks wrong to me, or it assumes that the learning algorithm is not allowed to discover additional (conceptual, abstract, hidden, implicit) variables in the training data, but this is false for deep neural networks

Given that the model is trained statically, while it could hypothesise about additional variables of the kinds you listed, it can never know which variables or which values for those variables are correct without domain labels or interventional data. Specifically, while "Discovering such hidden confounders doesn't give interventional capacity" is true, to discover these confounders the model would need interventional capacity.

I don't understand the italicised part of this sentence. Why will P(shorts, ice cream) be a reliable guide to decision-making?

We're not saying that P(shorts, ice cream) is good for decision-making, but P(shorts, do(ice cream)) is useful insofar as the goal is to make someone wear shorts, and providing ice cream is one of the possible actions (as the causal model will demonstrate that providing ice cream isn't useful for making someone wear shorts).
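A tiny simulation of the gap between the two quantities, with hot weather as the hidden confounder (the probabilities are arbitrary illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hot weather is a hidden confounder: it causes both shorts and ice cream.
hot = rng.random(n) < 0.5
shorts = rng.random(n) < np.where(hot, 0.8, 0.1)
ice_cream = rng.random(n) < np.where(hot, 0.7, 0.1)

# Observational: P(shorts | ice cream) is high, purely via the confounder.
p_obs = shorts[ice_cream].mean()

# Interventional: do(ice cream) cuts the weather -> ice cream edge, and
# since ice cream doesn't cause shorts, P(shorts | do(ice cream)) = P(shorts).
p_do = shorts.mean()

print(p_obs, p_do)  # roughly 0.71 vs 0.45
```

An agent planning from the observational quantity would wrongly conclude that handing out ice cream makes people wear shorts; the interventional quantity shows the action does nothing.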

What do these symbols in parens before the claims mean?

They are meant to be referring to the previous parts of the argument, but I've just realised that this hasn't worked as the labels aren't correct. I'll fix that.

Comment by RobertKirk on Path dependence in ML inductive biases · 2022-09-19T11:48:03.937Z · LW · GW

When you talk about whether we're in a high or low path-dependence "world", do you think that there is a (somewhat robust) answer to this question that holds across most ML training processes? I think it's more likely that some training processes are highly path-dependent and some aren't. We definitely have evidence that some are path-dependent, e.g. Ethan's comment and other examples, and almost any RL paper where different random seeds of the training process often result in quite different results. Arguably we don't have conclusive evidence of any particular existing training process being low path-dependence, because the burden of proof is heavy for proving that two models are basically equivalent on basically all inputs (given that they're very unlikely to literally have identical weights, so the equivalence would have to be at a high level of abstraction).

Reasoning about the path dependence of a training process specifically, rather than whether all of the ML/AGI development world is path dependent, seems more precise, and also allows us to reason about whether we want a high or low path-dependence training process, and considering that as an intervention, rather than a state of the world we can't change.

Comment by RobertKirk on Sticky goals: a concrete experiment for understanding deceptive alignment · 2022-09-08T08:52:12.060Z · LW · GW

When you say "the knowledge of what our goals are should be present in all models", by "knowledge of what our goals are" do you mean that a pointer to our goals (given that there are probably multiple goals which are combined in some way) is in the world model? If so, this seems to contradict what you said earlier:

The deceptive model has to build such a pointer [to the training objective] at runtime, but it doesn't need to have it hardcoded, whereas the corrigible model needs it to be hardcoded

I guess I don't understand what it would mean for the deceptive AI to have the knowledge of what our goals are (in the world model) without that amounting to a hard-coded pointer to what our goals are. I'd imagine that what it means for the world model to capture our goals is exactly having such a pointer to them.

(I realise I've been failing to do this, but it might make sense to use AI when we mean the outer system and model when we mean the world model. I don't think this is the core of the disagreement, but it could make the discussion clearer. For example, when you say the knowledge is present in the model, do you mean the world model or the AI more generally? I assumed the former above.)

To try and run my (probably inaccurate) simulation of you: I imagine you don't think the above is a contradiction. So you'd think that "knowledge of what our goals are" doesn't mean a pointer to our goals in the AI's world model, but something simpler, which the deceptive AI can use to figure out what our goals are (e.g. in its optimisation process), but which the aligned AI couldn't use as a simpler objective pointer; instead the aligned AI would have to hard-code the full pointer to our goals (where the pointer would point into its world model, probably making use of this simpler information about our goals in some way). I'm struggling to imagine what that would look like.

Comment by RobertKirk on Sticky goals: a concrete experiment for understanding deceptive alignment · 2022-09-07T10:44:27.769Z · LW · GW

Even agreeing that no additional complexity is required to rederive that it should try to be deceptive (assuming it has situational awareness of the training process and long-term goals which aren't aligned with ours), to be deceptive successfully it then needs to rederive what our goals are, so that it can pursue them instrumentally. I'm arguing that the ability to do this would require additional complexity in the AI compared to an AI that doesn't need to rederive the content of this goal (that is, our goals) at every decision.

Alternatively, the aligned model could use the same derivation process to be aligned: the deceptive model has some long-term goal and, in pursuing it, rederives the content of the instrumental goal "do what the training process incentivises", while the aligned model has the long-term goal "do what the training process incentivises" (as a pointer/de dicto), and also rederives it with the same level of complexity. I think "do what the training process incentivises" (as a pointer/de dicto) isn't a very complex long-term goal, and it feels likely to be as complex as the deceptive AI's arbitrary crystallised internal goal, assuming both models have full situational awareness of the training process and hence such a pointer is possible, which we're assuming they do.

(ETA/Meta point: I do think deception is a big issue that we definitely need more understanding of, and I definitely put weight on it being an alignment failure that occurs in practice, but I think I'm less sure it'll emerge (or less sure that your analysis demonstrates it will). I'm trying to understand where we disagree, and whether you've considered the doubts I have and have good arguments against them, rather than to convince you that deception isn't going to happen.)

Comment by RobertKirk on Sticky goals: a concrete experiment for understanding deceptive alignment · 2022-09-07T08:51:10.226Z · LW · GW

It seems a lot more computationally expensive to derive/build/construct such a pointer at every forward pass/decision. If the deceptive model is going to be doing this every time, it seems like it would be more efficient to have a dedicated part of the network that calculates it (i.e. have it in the weights).

Separately, for more complex goals this procedure is also going to be more complex, and the network probably needs to be more complex to support constructing it in the activations at every forward pass, compared to the corrigible model that doesn't need to do such a construction (because it has it hard-coded, as you say). I guess I'm arguing that the additional complexity in the deceptive model that allows it to rederive our goals at every forward pass compensates for the additional complexity in the corrigible model that has our goals hard-coded.

whereas the corrigible model needs it to be hardcoded

The corrigible model needs to be able to robustly point to our goals, in a way that doesn't change. One way of doing this is having the goals hard-coded. Another way might be to have a pointer to the output of a procedure, executed at runtime, that always constructs our goals in the activations. If the deceptive model can reliably construct in its activations something that actually points towards our goals, then the corrigible model could also have such a procedure, and make its goal a pointer to the output of that procedure. Then the only difference in model complexity is that the deceptive model points to some arbitrary attribute of the world model (or whatever), while the aligned model points to the output of this computation, which both models possess.

I think at a high level I'm trying to say that any way in which the deceptive model can robustly point at our goals such that it can pursue them instrumentally, the aligned model can use to robustly point at them and pursue them terminally. SGD+DL+whatever may favour one way or another of robustly pointing at such goals (either in the weights, or through a procedure that robustly outputs them in the activations), but both deceptive and aligned models could make use of that.

Comment by RobertKirk on Sticky goals: a concrete experiment for understanding deceptive alignment · 2022-09-05T15:57:32.670Z · LW · GW

Now, one thing I will say is that, since the difficulty of changing the proxies into long-term goals seems to be essentially constant regardless of the complexity of the training objective—whereas the difficulty of creating a pointer to the training objective scales with that complexity—I think that, if we try to train models on increasingly complex goals, eventually deception will win almost regardless of the baseline “stickiness” level. But understanding that baseline could still be quite important, and it’s something that I think we can quite readily experiment with right now.

But the deceptively aligned model also needs "a pointer to the training objective" in order to optimise it instrumentally/deceptively, so there doesn't seem to be a complexity penalty for training on complex goals.

This is similar to my comment on the original post about the likelihood of deceptive alignment, but reading that made it slightly clearer exactly what I disagreed with, hence writing the comment here.

Comment by RobertKirk on Common misconceptions about OpenAI · 2022-09-03T14:02:04.697Z · LW · GW

I think people who value empirical alignment work now probably think that (to some extent) we can predict at a high level what future problems we might face (contrasting with "there'd have been no hope whatsoever of identifying all the key problems in advance just based on theory"). Obviously this is a spectrum, but the chip fab analogy is, I think, further towards believing there are unknown unknowns in the problem space than people at OpenAI are (e.g. OpenAI people possibly think outer alignment and inner alignment capture all of the kinds of problems we'll face).

However, they probably don't believe you can work on solutions to those problems without being able to empirically demonstrate the problems and hence iterate on solutions (and again one could probably appeal to a track record here, of most proposed solutions not working unless they were developed by iterating on the actual problem). We can maybe vaguely postulate what the solutions could look like (they would say), but it's going to be much better to try to actually implement solutions on versions of the problem we can demonstrate, and iterate from there. (Note that they probably also try to produce demonstrations of the problems so that they can then work on those solutions, but this is still all empirical.)

Otherwise I do think your ITT does seem reasonable to me, although I don't think I'd put myself in the class of people you're trying to ITT, so that's not much evidence.

Comment by RobertKirk on How likely is deceptive alignment? · 2022-09-03T12:54:52.502Z · LW · GW

Thanks for writing this post, it's great to see explicit (high-level) stories for how and why deceptive alignment would arise! Some comments/disagreements:

(Note I'm using "AI" instead of  "model" to avoid confusing myself between "model" and "world model", e.g. "the deceptively aligned AI's world model" instead of "the deceptively-aligned model's world model").

Making goals long-term might not be easy

You say

Furthermore, this is a really short and simple modification. All gradient descent has to do in order to hook up the model’s understanding of the thing that we want it to do to its actions here is just to make its proxies into long term goals

However, this doesn't necessarily seem all that simple. The world model and internal optimisation process need to be able to plan into the "long term", or even have the conception of the "long term", for the proxy goals to be long-term; this seems to heavily depend on how much the world model and internal optimisation process are capturing this.

Conditional on the world model and internal optimisation process capturing this concept, it's still not necessarily easy to convert proxies into long-term goals if the proxies are time-dependent in some way, as they might be - if tasks or episodes are of similar lengths, then a proxy like "start wrapping up my attempt at this task to present it to the human" is only useful if it's conditioned on a time near the end of the episode. My argument here seems much sketchier, but I think this might be because I can't come up with a good example. It seems like it's not necessarily the case that "making goals long-term" is easy; that claim seems to rest mostly on an intuition that I don't think I share.

Relatedly, it seems that conditioning on the capabilities of the world model and internal optimisation process changes the path somewhat, in a way that isn't captured by your analysis. That is, it might be easier to achieve corrigible or internal alignment with a less capable world model/internal optimisation process (i.e. earlier in training), as it doesn't require the world model/internal optimisation process to plan over the longer time horizons and with the greater situational awareness required to still perform well in the deceptive alignment case. Do you think that is the case?

On the overhang from throwing out proxies

In the high path-dependency world, you mention an overhang several times. If I understand correctly, what you're referring to is that, as the world model increases in capabilities, it will start modelling things that are useful as internal optimisation targets for maximising the training objective, and at some point SGD could just throw away the AI's internal goals (which we see as proxies) and instead point to these parts of the world model as the target, which would result in a large increase in the training objective, as these are much better targets. (This is the description of what would happen in the internally aligned case, but the same mechanism seems present in the other cases, as you mention.)

However, it seems like the main reason the world model would capture these parts of the world is if they were useful (for maximising the training objective) as internal optimisation targets, and so if they're emerging and improving, it's likely because there's pressure for them to improve as they are being used as targets. This would mean there wasn't an overhang of the sort described above.

Another way of phrasing this might be that the internal goals (proxies) the AIs have will be part of the world model/in the same ontology/using the same representations, they won't be separate (as your story seems to imply?), and hence there won't be something to switch them to inside the world model that provides a bump in the training objective; or if there is, this will happen smoothly as the things to switch to are better-modelled such that they become useful targets.

I think this affects the analysis as follows: as the AI learns more about its training process, it learns more about the training objective, and if it's doing this, it would be very easy for the internal goals to shift to pointing at this understanding of the training objective (if it's already there). This would result in a higher likelihood of corrigible alignment. Specifically, in the case where the AI has a full understanding of the training process, including a full understanding of the training objective (such that it models all parts of it, and there's a single pointer that points to all these parts and is hence easily referenced), it seems (a priori) equally easy to shift the internal proxy objectives to point at this training-objective pointer as it does to shift them to long-term goals which then require optimising for the training-objective pointer instrumentally. In fact the first might even be simpler, as it may need a less complex/less long-horizon-capable world model/internal optimiser than the second.

Simplicity Bias Argument

(Building on the ideas above about how the world model could contain a pointer to the training objective, and would need this for all model classes considered): in the "slightly more formal analysis" of the simplicity bias in the low-path-dependence world, you decompose the complexity of the model into the sum of its world-model complexity, optimisation-process complexity and mesa-objective complexity, and argue that the mesa-objective complexity is the key thing that varies between the model classes being considered. However, I don't think this argument holds.

Specifically, for the deceptively-aligned AI to perform well during training, it needs a full specification of the training objective inside its world model, and some pointer P to it, such that the optimisation process can say "we want {mesa_objective}, but we need to do well in training instrumentally, so plan to make the value of pointer P go higher". Either this makes the deceptively-aligned AI's world model more complex than the world model of the internally-aligned AI (which instead has the pointer to the full specification of the training objective outside its world model, in the mesa-objective component), or it's the same complexity as the internally-aligned model, in which case the internally-aligned AI's mesa objective can just reference that pointer, and hence is likely the same complexity as the simplest mesa objective for the deceptively-aligned AI (which likely also just references a pointer to some attribute of the world which the world model captures).

Phrased a different way, I'm imagining that the world model, optimisation process and mesa objective are all in a shared (internal to the model) ontology, and hence the mesa objective specification isn't going to be a copy of the parts of the world model that are the objective (which would entail also copying all the machinery necessary to actually specify this from observations), but instead just a (weighted combination of) concept(s) in the internal ontology, which will be very simple to specify.
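To make the accounting explicit (my notation, not the post's), the comparison I have in mind is roughly:

$$
\begin{align*}
C(\text{deceptive}) &\approx C(\mathcal{W} \cup \{P\}) + C(\text{opt}) + C(\text{"maximise } G\text{"}) \\
C(\text{aligned})   &\approx C(\mathcal{W} \cup \{P\}) + C(\text{opt}) + C(\text{"maximise } P\text{"})
\end{align*}
$$

where $\mathcal{W}$ is the shared world model, $P$ the training-objective pointer it contains, and $G$ some arbitrary crystallised proxy goal. Since both final terms are single references into the shared ontology, the two totals come out approximately equal.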


Overall, these considerations suggest that deceptive alignment is less likely than the analysis in this post implies. It does still seem very possible that deceptive alignment occurs, and I still agree that we need transparency tools to address these problems, but perhaps I think we're less underwater than Evan does (to use the terminology from the Conclusion).

Comment by RobertKirk on Will Capabilities Generalise More? · 2022-08-30T12:09:48.788Z · LW · GW

I think perhaps a lot of work is being done by "if your optimiser worked". This might also be where there's a disanalogy between humans<->evolution and AIs<->SGD+PPO (or whatever RL algorithm you're using to optimise the policy). Maybe evolution is actually a very weak optimiser that doesn't really "work", compared to SGD+RL.

Comment by RobertKirk on The alignment problem from a deep learning perspective · 2022-08-21T10:09:46.045Z · LW · GW

Me, modelling skeptical ML researchers who may read this document:

It felt to me that Large-scale goals are likely to incentivize misaligned power-seeking and AGIs’ behavior will eventually be mainly guided by goals they generalize to large scales were the least well-argued sections (in that while reading them I felt less convinced, and the arguments were more hand-wavy than before).

In particular, the argument that we won't be able to use other AGIs to help with supervision because of collusion is contained entirely in footnote 22, and doesn't feel that robust to me - or at least it seems easy for a skeptical reader to dismiss it, and hence not think the rest of section 3 is well-founded. Maybe it's worth adding another argument for why we probably can't just use other AGIs to help with alignment, or at least noting that we don't currently have good proposals for doing so that we're confident will work (e.g. how do we know the other AGIs are aligned and hence actually helping?).


Positive goals are unlikely to generalize well to larger scales, because without the constraint of obedience to humans, AGIs would have no reason to let us modify their goals to remove (what we see as) mistakes. So we’d need to train them such that, once they become capable enough to prevent us from modifying them, they’ll generalize high-level positive goals to very novel environments in desirable ways without ongoing corrections, which seems very difficult. Even humans often disagree greatly about what positive goals to aim for, and we should expect AGIs to generalize in much stranger ways than most humans.

seems to be saying that positive goals won't generalise correctly because we need to get them exactly right on the first try. I'm not sure that's exactly an argument for why positive goals won't generalise correctly. It feels like this paragraph is trying to preempt a counterargument to this section along the lines of "Why wouldn't we just interactively adjust the objective if we see bad behaviour?", by justifying why we would need to get it right robustly, on the first try and throughout training, because the AGI will stop us making such modifications later on. Maybe it would be better to frame it that way, if that was the intention.


Note that I agree with the document and I'm in favour of producing more ML-researcher-accessible descriptions of and motivations for the alignment problem, hence this effort to make the document more robust to skeptical ML researchers.

Comment by RobertKirk on Externalized reasoning oversight: a research direction for language model alignment · 2022-08-08T10:59:00.930Z · LW · GW

First condition: assess reasoning authenticity

Being able to do this step in the most general setting seems to capture the entire difficulty of interpretability - if we could assess whether a model's outputs faithfully reflect its internal "thinking", and hence that all of its reasoning is what we're seeing, then that would be a huge jump forwards (and would perhaps be equivalent to solving something like ELK). Given that that problem is known to be quite difficult, and we currently don't have solutions for it, I'm uncertain whether this reduction of aligning a language model - first verify that all its visible reasoning is complete, correct and faithful, then do other steps (i.e. actively optimise against our measures of correct reasoning) - makes the problem easier. Do you think it's meaningfully different (e.g. easier) to solve "assess reasoning authenticity" completely than to solve ELK, or another hard interpretability problem?

Comment by RobertKirk on Circumventing interpretability: How to defeat mind-readers · 2022-07-28T09:50:55.610Z · LW · GW

If, instead of using interpretability tools in the loss function, we merely use it as a ‘validation set’ instead of the training set (i.e. using it as a ‘mulligan’), we might have better chances of picking up dangerous cognition before it gets out of hand so we can terminate the model and start over. We’re therefore still using interpretability in model selection, but the feedback loop is much less tight, so it’d be harder to Goodhart.

While only using the interpretability-tool-based filter for model selection is much weaker optimisation pressure than using it in the loss function, and hence makes Goodharting harder and slower, it's not clear that this solves the problem in the long run. If the interpretability-tool-based filter captures everything we currently know to capture, and we don't gain new insights during the iterated process of model training and selection, then it's possible we'll eventually end up Goodharting the model-selection process in the same way as SGD would Goodhart the interpretability tool in the loss function.

I think it's likely that we would gain more insights, or have more time, if we were to use the interpretability tool as a mulligan, and it's possible that the way we as AI builders optimise for producing a model that passes the interpretability filters is qualitatively different from the way SGD (or whatever training algorithm is being used) would optimise the interpretability-filter loss function. However, in the spirit of paranoia/security mindset/etc., it's worth pointing out that using the tool as a model-selection filter doesn't guarantee that an AGI which passes the filter is safer than if we had used the interpretability tool as a training signal, in the limit of iterating to pass the filter.
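As a toy illustration of the worry (entirely made-up quantities): model each candidate's filter score as true quality plus a "gaming" component the filter can't distinguish, and vary how hard we select on it.

```python
import random

# Toy sketch: a candidate model's filter score is true_quality + gaming, where
# "gaming" is the part of the score that fools the interpretability filter.
# Selecting the best of n candidates on the filter score is weak optimisation
# pressure for small n (the mulligan) and strong pressure for large n
# (closer to putting the tool in the loss function).
random.seed(0)

def best_of(n):
    # Each candidate: (true_quality, gaming), both standard normal.
    cands = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
    return max(cands, key=lambda c: c[0] + c[1])  # select on the filter score

def means(n, trials=2000):
    picked = [best_of(n) for _ in range(trials)]
    true_q = sum(t for t, _ in picked) / trials
    gaming = sum(g for _, g in picked) / trials
    return true_q, gaming

weak_true, weak_gaming = means(2)       # mulligan-style, weak selection
strong_true, strong_gaming = means(500)  # heavy selection on the same filter
# Selection buys some true quality either way, but the amount of "gaming"
# (Goodharting) grows just as fast as the selection pressure increases.
print(weak_true, weak_gaming, strong_true, strong_gaming)
```

The point isn't the specific numbers, just that the gaming component of the selected models grows with the strength of selection, even though each round only ever picks "the best-looking model".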

Comment by RobertKirk on A note about differential technological development · 2022-07-18T09:30:24.619Z · LW · GW

Suppose that aligning an AGI requires 1000 person-years of research.

  • 900 of these person-years can be done in parallelizable 5-year chunks (e.g., by 180 people over 5 years — or, more realistically, by 1800 people over 10 years, with 10% of the people doing the job correctly half the time).
  • The remaining 100 of these person-years factor into four chunks that take 25 serial years apiece (so that you can't get any of those four parts done in less than 25 years).


Do you have a similar model for just building (unaligned) AGI? Or is the model meaningfully different? On a similar model for just building AGI, timelines would mostly be shortened by progressing through the serial research-person-years rather than the parallelisable ones. If researchers who are advancing both capabilities and alignment are doing both in the parallelisable part, then this would be less worrying, as they're not actually shortening timelines meaningfully.


Unfortunately I imagine you think that building (unaligned) AGI quite probably doesn't have many more serial person-years of research required, if any. This is possibly another way of framing the prosaic AGI claim: "we expect we can get to AGI without any fundamentally new insights on intelligence, using (something like) current methods."

Comment by RobertKirk on Causal confusion as an argument against the scaling hypothesis · 2022-06-26T09:14:35.830Z · LW · GW

I expect that these kinds of problems could mostly be solved by scaling up data and compute (although I haven't read the paper). However, the argument in the post is that even if we did scale up, we couldn't solve the OOD generalisation problems.

Comment by RobertKirk on Causal confusion as an argument against the scaling hypothesis · 2022-06-26T09:05:36.406Z · LW · GW

Here we're saying that the continual fine-tuning might not necessarily resolve causal confusion within the model; instead, it will help the model learn the (new) spurious correlations so that it still performs well on the test data. This is assuming that continual fine-tuning is using a similar ERM-based method (e.g. the same pretraining objective but on the new data distribution). In hindsight, we probably should have written "continual training" rather than specifically "continual fine-tuning". If you could continually train online in the deployment environment then that would be better, and whether it's enough is very related to whether online training is enough, which is one of the key open questions we mention.

Comment by RobertKirk on A transparency and interpretability tech tree · 2022-06-20T13:22:12.249Z · LW · GW

The ability to go 1->4 or 2->5 via the behavioural-cloning approach assumes that the difficulty of interpreting all parts of the model is fairly similar, and that it just takes time for humans to interpret all of them, so we can automate that by imitating the humans. But if understanding the worst-case stuff is significantly harder than the best-case stuff (which seems likely to me), then I wouldn't expect the behaviourally-cloned interpretation agent to generalise to correctly interpreting the worst-case stuff.

Comment by RobertKirk on A transparency and interpretability tech tree · 2022-06-20T13:19:32.195Z · LW · GW

Another point worth making here is why I haven’t separated out worst-case inspection transparency for deceptive models vs. worst-case training process transparency for deceptive models there. That’s because, while technically the latter is strictly more complicated than the former, I actually think that they’re likely to be equally difficult. In particular, I suspect that the only way that we might actually have a shot at understanding worst-case properties of deceptive models is through understanding how they’re trained.

I'd be curious to hear a bit more justification for this. It feels like resting on this intuition for a reason not to include worst-case inspection transparency for deceptive models as a separate node is a bit of a brittle choice (i.e. makes it more likely the tech tree would change if we got new information). You write

That is, if our ability to understand training dynamics is good enough, we might be able to make it impossible for a deceptive model to evade us by always being able to see its planning for how to do so during training.

which to me is a justification that worst-case inspection transparency for deceptive models is solved if we solve worst-case training process transparency for deceptive models, but not a justification that that's the only way to solve it.

Comment by RobertKirk on Epistemological Vigilance for Alignment · 2022-06-20T09:40:46.420Z · LW · GW

This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.

It seems to me that you want a word for whatever the opposite of a complex/chaotic system is, right? Although obviously "Simple" is probably not the best word (as it's very generic). It could be "Simple Dynamics" or "Predictable Dynamics"?

Comment by RobertKirk on Epistemological Vigilance for Alignment · 2022-06-06T13:57:22.691Z · LW · GW

Newtonian: complex reactions

So please suggest alternative names and characterizations, or ask questions to pinpoint what I’m describing.

Are you pointing here at the fact that the AI training process and the world will form a complex system, and as such it is hard to predict the outcomes of interventions, so the obvious first-order outcomes of interventions may not occur, or may be dominated by higher-order outcomes? That's what "complex reactions" and some of the references point at, but in the description you seem to be talking about a more specific case: strong optimisation will always find a path if one exists, so patching some but not all paths isn't useful, and could in fact have weird counter-productive effects if the remaining paths the strong optimisation takes are actually worse in some other ways than the ones you patched.

Other possible names would then be either leaning into the complex systems view, so the (possibly incorrect) assumption is something like "non-complexity" or "linear/predictable responses"; or leaning into the optimisation paths analogy which might be something like "incremental improvement is ok" although that is pretty bad as a name.

Comment by RobertKirk on RL with KL penalties is better seen as Bayesian inference · 2022-06-02T17:46:05.308Z · LW · GW

So I think it mostly comes down to a philosophical difference. Do you want your LM to be a decision-maker acting in a world or a model of a some probability distribution over texts? If you want a decision-maker and training on language is just a scaffolding to get you there, maybe indeed staying close to the original distribution only has instrumental value?

But what if what you want is just an oracle-type conversational AI: a knowledge base and a common-sense reasoner. Maybe in this case staying close to human knowledge and inference rules represented in language is of inherent value?

I feel like it wasn't made clear in the post/paper that this was the motivation. You state that distribution collapse is bad without really justifying it (on my reading). Clearly distributional collapse to a degenerate bad output is bad, and will often also stall learning, so it's bad from an optimisation perspective as well (as it makes exploration much less likely), but this seems different from distributional collapse to the optimal output. For example, when you say

if x∗ is truly the best thing, we still wouldn’t want the LM to generate only x∗

I think I just disagree, in the case where we're considering LLMs as a model for future agent-like systems that we will have to align, which to me is the reason they're useful for alignment research. If there's a normative claim that diversity is important, then you should just have that in your objective/algorithm.

I think the reason KL divergence is included in RLHF is as an optimisation hack to make sure it works well. Maybe that reveals that for alignment we actually wanted the Bayesian posterior distribution you describe, rather than just the optimal distribution according to the reward function (i.e. a hardmax rather than a softmax on the reward over trajectories), although it seems to be an empirical question whether that was our preference all along or whether it's just useful in the current regime.
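Concretely (a toy numerical sketch of my own, not the paper's code): with a KL penalty of strength beta towards a reference distribution pi0, the optimal policy is the softmax posterior pi*(x) ∝ pi0(x)·exp(r(x)/beta), and the hardmax is recovered as beta → 0.

```python
import math

# Reference LM distribution over three possible texts, and their rewards
# (made-up numbers, purely illustrative).
pi0 = {"a": 0.5, "b": 0.3, "c": 0.2}
reward = {"a": 1.0, "b": 2.0, "c": 0.0}

def kl_optimal(beta):
    # Optimal policy of the KL-regularised objective:
    # pi*(x) proportional to pi0(x) * exp(reward(x) / beta).
    unnorm = {x: pi0[x] * math.exp(reward[x] / beta) for x in pi0}
    z = sum(unnorm.values())
    return {x: v / z for x, v in unnorm.items()}

soft = kl_optimal(beta=1.0)    # stays spread over the support: a "generative model"
hard = kl_optimal(beta=0.01)   # beta -> 0 recovers the collapsed argmax "policy"
print(soft)  # all three texts keep non-trivial mass
print(hard)  # essentially all mass on "b"
```

So the KL term is exactly what stands between "distribution over texts" and "argmax policy", which is why whether we want it seems to turn on which of those two we think we're building.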

Comment by RobertKirk on Richard Ngo's Shortform · 2022-05-30T13:51:55.880Z · LW · GW

Another possible way to provide pressure towards using language in a human-sense way is some form of multi-tasking/multi-agent scenario, inspired by this paper: Multitasking Inhibits Semantic Drift. They show that if you pretrain multiple instructors and instruction executors to understand language in a human-like way (e.g. with supervised labels), and then during training mix the instructors and instruction executors, it makes it difficult to drift from the original semantics, as all the instructors and instruction executors would need to drift in the same direction; equivalently, any local change in semantics would be sub-optimal compared to using language in the semantically correct way. The examples in the paper are on quite toy problems, but I think in principle this could work.

Comment by RobertKirk on RL with KL penalties is better seen as Bayesian inference · 2022-05-30T13:44:48.525Z · LW · GW

Do you think these insights would generalise to the case where the language model may be interacting with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or has dialogue with a human, then the optimal policy is no longer equivalent to just generating the correct output distribution, as it now also involves environment observations. This setting makes the distinction between a generative model and a policy clearer, and maybe changes the relevance of this statement:

> The problem with the RL objective is that it treats the LM as a policy, not as a generative model. While a generative model is supposed to capture a diverse distribution of samples, a policy is supposed to choose the optimal action.

That is, in these settings we do want an optimal policy and not a generative model. This is quite similar to what Paul was saying as well.

Further, if we see aligning language models as a proxy for aligning future strong systems, it seems likely that these systems will be taking multiple steps of interaction in some environment, rather than just generating one (or several) sequences without feedback.

Comment by RobertKirk on Reshaping the AI Industry · 2022-05-30T13:37:00.608Z · LW · GW

That's my guess also, but I'm more asking just in case that's not the case, and he disagrees with (for example) the Pragmatic AI Safety sequence, in which case I'd like to know why.

Comment by RobertKirk on Reshaping the AI Industry · 2022-05-30T09:42:14.025Z · LW · GW

I'd be curious to hear what your thoughts are on the other conversations, or at least specifically which conversations you're not a fan of?

Comment by RobertKirk on How can Interpretability help Alignment? · 2020-05-25T11:33:58.490Z · LW · GW

(Note: you're quoting your response as well as the sentence you've meant to be quoting (and responding to), which makes it hard to see which part is your writing. I think you need 2 newlines to break the quote formatting).

> Do you see a way of incentivizing the RL community to change this? (If possible, that would seem like a more effective approach than doing it "ourselves".)

I think this is basically the same question as how we incentivise the wider ML community to think safety is important. I don't know of anything specific about the RL community that makes it a different case.

> There is some work in DeepMind's safety team on this, isn't there? (Not to dispute the overall point though, "a part of DeepMind's safety team" is rather small compared to the RL community :-).)

I think there is too, and I think there's more of this research in general than there used to be. The field of interpretability (and especially RL interpretability) is very new and pre-paradigmatic, which can make some of the research not seem useful or relevant.

> It was a bit hard to understand what you mean by the "research questions vs tasks" distinction. (And then I read the bullet point below it and came, perhaps falsely, to the conclusion that you are only after "reusable piece of wisdom" vs "one-time thing" distinction.)

I'm still uncertain whether "tasks" is the best word. I think we want reusable pieces of wisdom as well as one-time things, so I don't know whether that's the distinction I was aiming for. It's more like "answer this question once, and then we have the answer forever" vs "answer this question again and again, with different inputs each time". In the first case, interpretability tools might make it easier for researchers to answer the question. In the second, our interpretability tool might have to answer the question directly, in an automated way.

> If we believe a particular proposal is more or less likely than others to produce aligned AI, then we would preferentially work on interpretability research which we believe will help this proposal over research which wouldn't, as it wouldn't be as useful.

I have changed the sentence; I had "other" instead of "over".

Comment by RobertKirk on Resources for AI Alignment Cartography · 2020-04-10T16:46:15.136Z · LW · GW

No worries. Although I think less has been written on debate than on amplification (Paul has a lot of blog posts on IDA), it seems to me that most of the work Paul's team at OpenAI is doing is on debate rather than IDA.

Comment by RobertKirk on Resources for AI Alignment Cartography · 2020-04-08T13:14:05.315Z · LW · GW

I don't know whether this is on purpose, but I'd think that AI Safety via Debate (both the original paper and the recent progress report) should get a mention, probably in the "Technical agendas focused on possible solutions" section? I'd argue it's different enough from IDA to warrant its own subtitle.