[AN #150]: The subtypes of Cooperative AI research 2021-05-12T17:20:27.267Z
[AN #149]: The newsletter's editorial policy 2021-05-05T17:10:03.189Z
[AN #148]: Analyzing generalization across more axes than just accuracy or loss 2021-04-28T18:30:03.066Z
FAQ: Advice for AI Alignment Researchers 2021-04-26T18:59:52.589Z
[AN #147]: An overview of the interpretability landscape 2021-04-21T17:10:04.433Z
[AN #146]: Plausible stories of how we might fail to avert an existential catastrophe 2021-04-14T17:30:03.535Z
[AN #145]: Our three year anniversary! 2021-04-09T17:48:21.841Z
Alignment Newsletter Three Year Retrospective 2021-04-07T14:39:42.977Z
[AN #144]: How language models can also be finetuned for non-language tasks 2021-04-02T17:20:04.230Z
[AN #143]: How to make embedded agents that reason probabilistically about their environments 2021-03-24T17:20:05.166Z
[AN #142]: The quest to understand a network well enough to reimplement it by hand 2021-03-17T17:10:04.180Z
[AN #141]: The case for practicing alignment work on GPT-3 and other large models 2021-03-10T18:30:04.004Z
[AN #140]: Theoretical models that predict scaling laws 2021-03-04T18:10:08.586Z
[AN #139]: How the simplicity of reality explains the success of neural nets 2021-02-24T18:30:04.038Z
[AN #138]: Why AI governance should find problems rather than just solving them 2021-02-17T18:50:02.962Z
[AN #137]: Quantifying the benefits of pretraining on downstream task performance 2021-02-10T18:10:02.561Z
[AN #136]: How well will GPT-N perform on downstream tasks? 2021-02-03T18:10:03.856Z
[AN #135]: Five properties of goal-directed systems 2021-01-27T18:10:04.648Z
[AN #134]: Underspecification as a cause of fragility to distribution shift 2021-01-21T18:10:06.783Z
[AN #133]: Building machines that can cooperate (with humans, institutions, or other machines) 2021-01-13T18:10:04.932Z
[AN #132]: Complex and subtly incorrect arguments as an obstacle to debate 2021-01-06T18:20:05.694Z
[AN #131]: Formalizing the argument of ignored attributes in a utility function 2020-12-31T18:20:04.835Z
[AN #130]: A new AI x-risk podcast, and reviews of the field 2020-12-24T18:20:05.289Z
[AN #129]: Explaining double descent by measuring bias and variance 2020-12-16T18:10:04.840Z
[AN #128]: Prioritizing research on AI existential safety based on its application to governance demands 2020-12-09T18:20:07.910Z
[AN #127]: Rethinking agency: Cartesian frames as a formalization of ways to carve up the world into an agent and its environment 2020-12-02T18:20:05.196Z
[AN #126]: Avoiding wireheading by decoupling action feedback from action effects 2020-11-26T23:20:05.290Z
[AN #125]: Neural network scaling laws across multiple modalities 2020-11-11T18:20:04.504Z
[AN #124]: Provably safe exploration through shielding 2020-11-04T18:20:06.003Z
[AN #123]: Inferring what is valuable in order to align recommender systems 2020-10-28T17:00:06.053Z
[AN #122]: Arguing for AGI-driven existential risk from first principles 2020-10-21T17:10:03.703Z
[AN #121]: Forecasting transformative AI timelines using biological anchors 2020-10-14T17:20:04.918Z
[AN #120]: Tracing the intellectual roots of AI and AI alignment 2020-10-07T17:10:07.013Z
The Alignment Problem: Machine Learning and Human Values 2020-10-06T17:41:21.138Z
[AN #119]: AI safety when agents are shaped by environments, not rewards 2020-09-30T17:10:03.662Z
[AN #118]: Risks, solutions, and prioritization in a world with many AI systems 2020-09-23T18:20:04.779Z
[AN #117]: How neural nets would fare under the TEVV framework 2020-09-16T17:20:14.062Z
[AN #116]: How to make explanations of neurons compositional 2020-09-09T17:20:04.668Z
[AN #115]: AI safety research problems in the AI-GA framework 2020-09-02T17:10:04.434Z
[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents 2020-08-26T17:20:04.960Z
[AN #113]: Checking the ethical intuitions of large language models 2020-08-19T17:10:03.773Z
[AN #112]: Engineering a Safer World 2020-08-13T17:20:04.013Z
[AN #111]: The Circuits hypotheses for deep learning 2020-08-05T17:40:22.576Z
[AN #110]: Learning features from human feedback to enable reward learning 2020-07-29T17:20:04.369Z
[AN #109]: Teaching neural nets to generalize the way humans would 2020-07-22T17:10:04.508Z
[AN #107]: The convergent instrumental subgoals of goal-directed agents 2020-07-16T06:47:55.532Z
[AN #108]: Why we should scrutinize arguments for AI risk 2020-07-16T06:47:38.322Z
[AN #106]: Evaluating generalization ability of learned reward models 2020-07-01T17:20:02.883Z
[AN #105]: The economic trajectory of humanity, and what we might mean by optimization 2020-06-24T17:30:02.977Z
[AN #104]: The perils of inaccessible information, and what we can learn about AI alignment from COVID 2020-06-18T17:10:02.641Z


Comment by rohinmshah on Is driving worth the risk? · 2021-05-11T22:21:23.059Z · LW · GW

Sure, but then shouldn't you be dividing by distance / time traveled by the average American per year to get risk per mile / hour of driving?

Like, take your $25,000/year estimate, divide by 300 hours for a typical American, and you get ~$80 per hour of driving, which might start to look more worth it. (Again, I recommend finding a better version of the "300" number.)

(Another plausibly important correction would be the proportion of driving that happens at high speed vs. low speed.)
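The back-of-the-envelope above, as a quick sketch (both numbers are the rough estimates already in the thread, not data):

```python
# Annual risk cost divided by annual hours of driving gives a cost per hour,
# which is easier to weigh against the value of a trip.
annual_risk_cost = 25_000       # $/year, the parent comment's estimate
hours_driving_per_year = 300    # rough Googled figure for a typical American

cost_per_hour = annual_risk_cost / hours_driving_per_year
print(f"~${cost_per_hour:.0f} per hour of driving")   # ~$83/hour
```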

Comment by rohinmshah on Is driving worth the risk? · 2021-05-11T18:33:13.176Z · LW · GW

You should probably reduce your estimate of the risk by some factor to account for the fact that you will be in a car a lot less than the average American. 1 minute of Googling suggests that it's ~300 hours per year for the average American, though I'm sure there are lots of problems with that number (e.g. I think that is the number for typical drivers, rather than typical Americans).

Comment by rohinmshah on [AN #149]: The newsletter's editorial policy · 2021-05-11T02:07:39.772Z · LW · GW

> I really like the long summaries, and would be sad to see them go

Fwiw I still expect to do them; this is an "on the margin" thing. Like, I still would do a long summary for bio anchors, but maybe I do something shorter for infra-Bayesianism.

> Frame this as a 'request for summaries', link to the papers you won't get round to, but offer to publish any sufficiently good summaries of those papers that someone sends you in a future newsletter.

Hmm, intriguing. That might be worth trying.

Comment by rohinmshah on [AN #149]: The newsletter's editorial policy · 2021-05-10T20:28:48.454Z · LW · GW

Other results from the survey:

There were 66 responses, though at least one was a duplicate. (I didn't deduplicate in any of the analyses below; I doubt it will make a big difference.) Looking at names (when provided), it looks like people in the field were quite a bit more likely to respond than the typical reader. Estimating 5 min on average per response (since many provided qualitative feedback as well), that's 5.5 hours of person-time answering the survey.

My main takeaways (described in more detail below):

  • The newsletter is still useful to people.
  • Long summaries are not as useful as I thought.
  • On the current margin I should move my focus away from the Alignment Forum, since the most involved readers seem to read most of the Alignment Forum already.
  • It would be nice to do more "high-level opinions": if you imagine a tree where the root node is "did we build safe / beneficial AI" and lower nodes delve into subproblems, it would be useful for opinions to talk about how the current paper / article relates to the top-level node. I don't think I'll make a change of this form right now, but I might in the future.

I think these takeaways are probably worth 5-10 hours of time? It's close though.


Average rating of various components of the newsletter (arranged in ascending order of popularity):

3.88 Long summaries (full newsletter dedicated to one topic)
3.91 Source of interesting things to read
3.95 Opinions
4.02 Highlights
4.27 Regular summaries
4.47 Newsletter overall

(The question was a five-point scale: "Useless, Meh, Fairly useful, Keep doing this, This is amazing!", which I then converted to 1-5 and averaged across respondents.)
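A minimal sketch of that conversion (the response lists here are made-up examples, not the actual survey data):

```python
# Map the five-point labels to 1-5 and average per component.
scale = {"Useless": 1, "Meh": 2, "Fairly useful": 3,
         "Keep doing this": 4, "This is amazing!": 5}

def average_rating(responses: list[str]) -> float:
    return sum(scale[r] for r in responses) / len(responses)

print(average_rating(["Keep doing this", "This is amazing!", "Fairly useful"]))
# -> 4.0
```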

The newsletter overall is far more popular than any of the individual components. In hindsight this makes sense -- different people will find different components valuable, but people will probably subscribe as long as just one or two components are valuable to them. So everyone will rate the newsletter highly, but only a subset will rate any given component highly.

I was surprised to see the long summaries were least popular, since people have previously explicitly said that they especially liked the long summaries without any prompting from me. I will probably be less likely to do long summaries in the future.


In the "value of the newsletter" qualitative section, the most common thing by far was people saying that it helped them stay abreast of the field -- especially for articles that are not on the Alignment Forum.


One or two people suggested adding links to interesting papers that I wouldn't have time to summarize. I actually used to do this when the newsletter first started, but it seemed like no one was clicking on those links, so I stopped. I'm pretty sure that would still be the case now, so I'm not planning to restart that practice.

Comment by rohinmshah on Pitfalls of the agent model · 2021-05-10T20:03:18.328Z · LW · GW

Planned summary for the Alignment Newsletter:

It is common to view AI systems through the “agent lens”, in which the AI system implements a fixed, unchanging policy that, given some observations, takes some actions. This post points out several ways in which this “fixed, unchanging policy” assumption can lead us astray.

For example, AI designers may assume that the AI systems they build must have unchanging decision algorithms, and therefore believe that there will be a specific point at which influence is “handed off” to the AI system, before which we have to solve a wide array of philosophical and technical problems.

Comment by rohinmshah on [AN #139]: How the simplicity of reality explains the success of neural nets · 2021-05-05T05:44:14.722Z · LW · GW

Hmm, I think you're right. I'm not sure what I was thinking when I wrote that. (Though I give it like 50% that if past-me could explain his reasons, I'd agree with him.)

Possibly I was thinking of epochal double descent, but that shouldn't matter because we're comparing the final outcome of SGD to random sampling, so epochal double descent doesn't come into the picture.

Comment by rohinmshah on Announcing The Inside View Podcast · 2021-05-04T21:20:18.867Z · LW · GW

Fyi, I personally dislike audio as a means of communicating information, and so I probably won't be summarizing these for the Alignment Newsletter while they don't have transcripts.

This is not a request for transcripts. Treat it more like an external constraint of the world, that the Alignment Newsletter happens to have a strong bias against audio- or video-only content. This is also not a guarantee that I will summarize it if it does have a transcript.

Fyi, my guess is that even if it did have transcripts I would usually not summarize it, because I personally am not that interested in forecasting timelines.

Comment by rohinmshah on Low-stakes alignment · 2021-05-04T18:46:12.142Z · LW · GW

Yeah, all of that seems right to me (and I feel like I have a better understanding of why assumptions on inputs are better than assumptions on outputs, which was more like a vague intuition before). I've changed the opinion to:

I like the low-stakes assumption as a way of saying "let's ignore distributional shift for now". Probably the most salient alternative is something along the lines of "assume that the AI system is trying to optimize the true reward function". The main way that low-stakes alignment is cleaner is that it uses an assumption on the _environment_ (an input to the problem) rather than an assumption on the _AI system_ (an output of the problem). This seems to be a lot nicer because it is harder to "unfairly" exploit a not-too-strong assumption on an input rather than on an output. See [this comment thread]( for more discussion.

Comment by rohinmshah on Mundane solutions to exotic problems · 2021-05-04T18:34:25.385Z · LW · GW

Planned summary for the Alignment Newsletter:

The author’s goal is to find “mundane” or simple algorithms that solve even “exotic” problems in AI alignment. Why should we expect this is possible? If an AI system is using powerful, exotic capabilities to evade detection, shouldn’t we need powerful, exotic algorithms to fight that? The key idea here is that we can instead have a mundane algorithm that leverages the exotic capabilities of the AI system to produce an exotic oversight process. For example, we could imagine that a mundane algorithm could be used to create a question-answerer that knows everything the model knows. We could then address <@gradient hacking@>(@Gradient hacking@) by asking the question “what should the loss be?” In this case, our model has an exotic capability: very strong introspective access to its own reasoning and the training process that modifies it. (This is what is needed to successfully hack gradients). As a result, our question answerer should be able to leverage this capability to assign high loss (low reward) to cases where our AI system tries to hack gradients, even if our normal hardcoded loss would not do so.

Comment by rohinmshah on Low-stakes alignment · 2021-05-04T17:49:34.013Z · LW · GW

> I guess the natural definition is ...

I was imagining a Cartesian boundary, with a reward function that assigns a reward value to every possible state in the environment (so that the reward function is “bigger” than the environment). So, embeddedness problems are simply assumed away, in which case there is only one correct generalization.

It feels like the low-stakes setting is also mostly assuming away embeddedness problems? I suppose it still includes e.g. cases where the AI system subtly changes the designer's preferences over the course of training, but it excludes e.g. direct modification of the reward, taking over the training process, etc.

I agree that "actually trying" is still hard to define, though you could avoid that messiness by saying that the goal is to provide a reward such that any optimal policy for that reward would be beneficial / aligned (and then the assumption is that a policy that is "actually trying" to pursue the objective would not do as well as the optimal policy but would not be catastrophically bad).

Just to reiterate, I agree that the low-stakes formulation is better; I just think that my reasons for believing that are different from "it's a clean subproblem". My reason for liking it is that it doesn't require you to specify a perfect reward function upfront, only a reward function that is "good enough", i.e. it incentivizes the right behavior on the examples on which the agent is actually trained. (There might be other reasons too that I'm failing to think of now.)

Comment by rohinmshah on Low-stakes alignment · 2021-05-03T23:38:36.938Z · LW · GW

Planned summary for the Alignment Newsletter:

We often split AI alignment into two parts: outer alignment, or “finding a good reward function”, and inner alignment, or “robustly optimizing that reward function”. However, these are not very precise terms, and they don't form clean subproblems. In particular, for outer alignment, how good does the reward function have to be? Does it need to incentivize good behavior in all possible situations? How do you handle the no free lunch theorem? Perhaps you only need to handle the inputs in the training set? But then what specifies the behavior of the agent on new inputs?

This post proposes an operationalization of outer alignment that admits a clean subproblem: _low stakes alignment_. Specifically, we are given as an assumption that we don't care much about any small number of decisions that the AI makes -- only a large number of decisions, in aggregate, can have a large impact on the world. This prevents things like quickly seizing control of resources before we have a chance to react. We do not expect this assumption to be true in practice: the point here is to solve an easy subproblem, in the hopes that the solution is useful in solving the hard version of the problem.

The main power of this assumption is that we no longer have to worry about distributional shift. We can simply keep collecting new data online and training the model on the new data. Any decisions it makes in the interim period could be bad, but by the low-stakes assumption, they won't be catastrophic. Thus, the primary challenge is in obtaining a good reward function that incentivizes the right behavior after the model is trained. We might also worry about whether gradient descent will successfully find a model that optimizes the reward even on the training distribution -- after all, gradient descent has no guarantees for non-convex problems -- but it seems like to the extent that gradient descent doesn't do this, it will probably affect aligned and unaligned models equally.

Note that this subproblem is still non-trivial, and existential catastrophes still seem possible if we fail to solve it. For example, one way that the low-stakes assumption could be made true would be if we had a lot of bureaucracy and safeguards that the AI system had to go through before making any big changes to the world. It still seems possible for the AI system to cause lots of trouble if none of the bureaucracy or safeguards can understand what the AI system is doing.

Planned opinion:

I like the low-stakes assumption as a way of saying "let's ignore distributional shift for now". However, I think that's more because it agrees with my intuitions about how you want to carve up the problem of alignment, rather than because it feels like an especially clean subproblem. It seems like there are other ways to get similarly clean subproblems, like "assume that the AI system is trying to optimize the true reward function".

That being said, one way that low-stakes alignment is cleaner is that it uses an assumption on the _environment_ (an input to the problem) rather than an assumption on the _AI system_ (an output of the problem). Plausibly that type of cleanliness is correlated with good decompositions of the alignment problem, and so it is not a coincidence that my intuitions tend to line up with "clean subproblems".

Comment by rohinmshah on AMA: Paul Christiano, alignment researcher · 2021-05-01T18:48:11.716Z · LW · GW

I think Neel is using this in the sense I use the phrase, where you carve up the space of threats in some way, and then a "threat model" is one of the pieces that you carved up, rather than the way in which you carved it up.

This is meant to be similar to how in security there are many possible kinds of risks you might be worried about, but then you choose a particular set of capabilities that an attacker could have and call that a "threat model" -- this probably doesn't capture every setting you care about, but does capture one particular piece of it.

(Though maybe in security the hope is to choose a threat model that actually contains all the threats you expect in reality, so perhaps this analogy isn't the best.)

(I think "that's a thing that people make for themselves" is also a reasonable response for this meaning of "threat model".)

Comment by rohinmshah on Announcing the Technical AI Safety Podcast · 2021-05-01T17:49:09.735Z · LW · GW

Fyi, I personally dislike audio as a means of communicating information, and so I probably won't be summarizing these for the Alignment Newsletter unless they have transcripts.

(This is not a request for transcripts -- I usually don't get that much out of podcasts like this, because I've usually already spent a bunch of time understanding the papers they're based on. Treat it more like an external constraint of the world, that the Alignment Newsletter happens to have a strong bias against audio- or video-only content. This is also not a guarantee that I will summarize it if it does have a transcript.)

Comment by rohinmshah on [AN #148]: Analyzing generalization across more axes than just accuracy or loss · 2021-05-01T03:19:22.762Z · LW · GW

Huh, weird. Probably a data entry error on my part. Fixed, thanks for catching it.

Comment by rohinmshah on [AN #148]: Analyzing generalization across more axes than just accuracy or loss · 2021-04-28T18:36:16.426Z · LW · GW

I asked the authors for feedback on my summary of the distributional generalization paper, and Preetum responded with the following (copied with his permission):

I agree with everything you've said in this summary, so my feedback below is mostly commentary / minor points.

- One intuitive way to think about Feature Calibration is that f(x) is "close to" a sample from p(y|x), where the quality of the "closeness" depends on the power of the classifier.

- Re. "classifiers which do not fit their train set": As you say, our paper mostly focuses on Distributional Generalization (DG) for interpolating models. But I am hopeful that DG actually holds much more generally, and we should really be thinking of generalization as saying "test and train behaviors are close *as distributions*".
Though we don't formalize this yet for non-interpolating models, there are some suggestive experiments in Section 7 (eg: the confusion matrix of a model on the test set remains close to its confusion matrix on the train set, throughout the training process.  As you start to fit noise on the train set, you see exactly this noise start to appear on the test set. Regularization which prevents fitting noise on the train set, also prevents this noise from appearing at test time).

- For me, one of the most interesting implications of DG/feature-calibration is that it gives a separation between overparameterized and underparameterized regimes (in the scaling limits of large models/data). With enough data, large enough underparameterized models will converge to Bayes-optimal classifiers, whereas overparameterized models will not (assuming DG). That is, interpolation is not always "benign", it can actually hurt.

- You may like the discussion we added on these issues in the short version of our paper: Section 1.3 ("Related Work and Significance") here:
(there is no new material in this pdf, outside the Related Work).

- Also, we have a number of supporting experiments for Feature Calibration in the appendix that didn't make it into the body (eg: more tasks for decision trees, and experiments with "bad" image classifiers like MLPs and RBF kernels).

- Sidenote: The "agreement property" has been bugging me for a while since it seems kind of magical. My current view is that "agreement" may be a special case of a stronger (but less magical) property: The joint distribution (f(x), y) is statistically close to (f(x), f'(x)) on the test set, where f' is an independently-trained classifier. This can also be seen as an instance of DG, and it implies the agreement property. I sketched this conjecture in this tweet:
(But this is speculative -- not in the paper and hasn't been rigorously tested.)

- I included this figure in a talk on DG recently -- point being that DG is a general definition, which includes both classical generalization and our new conjectures as special cases (and could include other yet-undiscovered behaviors).

- As mentioned at the end of our paper, there are *many* open questions remaining (and I would be very happy to see more work in this area).

Comment by rohinmshah on The Many Faces of Infra-Beliefs · 2021-04-28T17:53:33.015Z · LW · GW

Ah right, that makes sense. That was a mistake on my part, my bad.

Comment by rohinmshah on FAQ: Advice for AI Alignment Researchers · 2021-04-28T04:20:09.520Z · LW · GW

Yeah that's definitely the one on the list that I think would be most useful.

I may also be understating how much I know about it; I've picked up some over time, e.g. linear programming, minimax, some kinds of duality, mirror descent, Newton's method.

Comment by rohinmshah on The Many Faces of Infra-Beliefs · 2021-04-27T17:13:30.286Z · LW · GW

Thanks for checking! I've changed point 3 to:

Finally, rather than have an environment E that (when combined with a policy π) generates a world history (oa)*, you could have the state s directly be the world history (oa)*, _without_ including the policy π. In normal Bayesianism, using (oa)* as states would be equivalent to using environments E as states (since we could construct a belief over E that implies the given belief over (oa)*), but in the case of infra-Bayesianism it is not. (Roughly speaking, the differences occur when you use a “belief” that isn’t just a claim about reality, but also a claim about which parts of reality you “care about”.) This ends up allowing some but not all flavors of acausal influence, and so the authors call this setup “pseudocausal”.


> In normal bayesianism, you do not have a pseudocausal-causal equivalence. Every ordinary environment is straight-up causal.

What I meant was that if you define a Bayesian belief over world-histories (oa)*, that is equivalent to having a Bayesian belief over environments E, which I think you agree with. I've edited slightly to make this clearer.

Comment by rohinmshah on The Many Faces of Infra-Beliefs · 2021-04-25T20:41:43.755Z · LW · GW

Planned summary for the Alignment Newsletter:

When modeling an agent that acts in a world <@that contains it@>(@@), there are different ways that we could represent what a “hypothesis about the world” should look like. (We’ll use <@infra-Bayesianism@>(@Infra-Bayesianism sequence@) to allow us to have hypotheses over environments that are “bigger” than the agent, in the sense of containing the agent.) In particular, hypotheses can vary along two axes:

1. **First-person vs. third-person:** In a first-person perspective, the agent is central. In a third-person perspective, we take a “birds-eye” view of the world, of which the agent is just one part.

2. **Static vs. dynamic:** In a dynamic perspective, the notion of time is explicitly present in the formalism. In a static perspective, we instead have beliefs directly about entire world-histories.

To get a tiny bit more concrete, let the world have states S and the agent have actions A and observations O. The agent can implement policies Π. I will use ΔX to denote a belief over X (this is a bit handwavy, but gets the right intuition, I think). Then the four views are:

1. First-person static: A hypothesis specifies how policies map to beliefs over observation-action sequences, that is, Π → Δ(O × A)*.

2. First-person dynamic: This is the typical POMDP framework, in which a hypothesis is a belief over initial states and transition dynamics, that is, ΔS and S × A → Δ(O × S).

3. Third-person static: A hypothesis specifies a belief over world histories, that is, Δ(S*).

4. Third-person dynamic: A hypothesis specifies a belief over initial states, and over the transition dynamics, that is, we have ΔS and S → ΔS. Notice that despite having “transitions”, actions do not play a role here.
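As a rough illustration (my own construction, not from the post), the four signatures can be written out as type aliases, with a belief ΔX crudely collapsed to a plain list of outcomes:

```python
from typing import Callable

# Crude stand-ins: S, A, O, Pi are just strings here, and Dist ("a belief
# over X") is a bare list. This only mirrors the shapes of the four views.
S = A = O = Pi = str
Hist = list[tuple[str, str]]     # an observation-action history (oa)*
Dist = list                      # stand-in for a belief over some X

# 1. First-person static: Π → Δ((O × A)*)
FirstPersonStatic = Callable[[Pi], Dist]

# 2. First-person dynamic (POMDP): ΔS together with S × A → Δ(O × S)
FirstPersonDynamic = tuple[Dist, Callable[[S, A], Dist]]

# 3. Third-person static: Δ(S*)
ThirdPersonStatic = Dist

# 4. Third-person dynamic: ΔS together with S → ΔS (no actions involved)
ThirdPersonDynamic = tuple[Dist, Callable[[S], Dist]]

# A toy hypothesis in view 1: every policy is believed to yield the same
# one-step history with certainty.
toy_hypothesis: FirstPersonStatic = lambda pi: [[("o0", "a0")]]
```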

Given a single “reality”, it is possible to move between these different views on reality, though in some cases this requires making assumptions on the starting view. For example, under regular Bayesianism, you can only move from third-person static to third-person dynamic if your belief over world histories Δ(S*) satisfies the Markov condition (future states are conditionally independent of past states given the present state); if you want to make this move even when the Markov condition isn’t satisfied, you have to expand your belief over initial states to be a belief over “initial” world histories.
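The Markov condition can be checked mechanically from a belief over world histories; here is a toy sketch with hypothetical numbers (length-3 histories over two states, with the joint built from an actual Markov chain, so the check passes):

```python
from itertools import product

# A belief over world histories Δ(S*), built from an initial belief ΔS and
# transitions S -> ΔS (made-up numbers).
p0 = {0: 0.6, 1: 0.4}
T = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}

joint = {(a, b, c): p0[a] * T[a][b] * T[b][c]
         for a, b, c in product((0, 1), repeat=3)}

def cond_s3(s1, s2, s3):
    """P(s3 | s1, s2), computed directly from the joint over histories."""
    return joint[(s1, s2, s3)] / sum(joint[(s1, s2, c)] for c in (0, 1))

# Markov condition: the future is conditionally independent of the past given
# the present, i.e. P(s3 | s1, s2) must not depend on s1.
is_markov = all(abs(cond_s3(0, s2, s3) - cond_s3(1, s2, s3)) < 1e-12
                for s2, s3 in product((0, 1), repeat=2))
print(is_markov)   # -> True
```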

You can then define various flavors of (a)causal influence by saying which types of states S you allow:

1. If a state s consists of a policy π and a world history (oa)* that is consistent with π, then the environment transitions can depend on your choice of π, leading to acausal influence. This is the sort of thing that would be needed to formalize Newcomb’s problem.

2. In contrast, if a state s consists only of an environment E that responds to actions but _doesn’t_ get to see the full policy, then the environment cannot depend on your policy, and there is only causal influence. You’re implicitly claiming that Newcomb’s problem cannot happen.

3. Finally, rather than have an environment E that (when combined with a policy π) generates a world history (oa)*, you could have the state s directly be the world history (oa)*, _without_ including the policy π. This still precludes acausal influence. In normal Bayesianism, this would be equivalent to the previous case (since we could construct a belief over E that implies the given belief over (oa)*), but in the case of infra-Bayesianism it is not, for reasons I won’t go into. (Roughly speaking, the differences occur when you use a “belief” that isn’t just a claim about reality, but also a claim about which parts of reality you “care about”.) Since the existence of E isn’t required, but we do still preclude policy-dependent influence, the authors call this setup “pseudocausal”.

In all three versions, you can define translations between the four different views, such that following any path of translations will always give you the same final output (that is, translating from A to B to C has the same result as A to D to C). This property can be used to _define_ “acausal”, “causal”, and “pseudocausal” as applied to belief functions in infra-Bayesianism. (I’m not going to talk about what a belief function is; see the post for details.)

Comment by rohinmshah on Three reasons to expect long AI timelines · 2021-04-23T22:40:56.909Z · LW · GW

Planned summary for the Alignment Newsletter:

This post outlines and argues for three reasons to expect long AI timelines that the author expects are not taken into account in current forecasting efforts:

1. **Technological deployment lag:** Most technologies take decades between when they're first developed and when they become widely impactful.
2. **Overestimating the generality of AI technology:** Just as people in the 1950s and 1960s overestimated the impact of solving chess, it seems likely that current people are overestimating the impact of recent progress, and how far it can scale in the future.
3. **Regulation will slow things down,** as with [nuclear energy](, for example.

You might argue that the first and third points don’t matter, since what we care about is when AGI is _developed_, as opposed to when it becomes widely deployed. However, it seems that we continue to have the opportunity to intervene until the technology becomes widely impactful, and that seems to be the relevant quantity for decision-making. You could have some specific argument like “the AI goes FOOM and very quickly achieves all of its goals” that then implies that the development time is the right thing to forecast, but none of these seem all that obvious.

Planned opinion:

I broadly agree that (1) and (3) don’t seem to be discussed much during forecasting, despite being quite important. (Though see e.g. [value of the long tail]( I disagree with (2): while it is obviously possible that people are overestimating recent progress, or overconfident about how useful scaling will be, there has at least been a lot of thought put into that particular question -- it seems like one of the central questions tackled by <@bio anchors@>(@Draft report on AI timelines@). See more discussion in this [comment thread](

Comment by rohinmshah on Three reasons to expect long AI timelines · 2021-04-23T22:00:27.325Z · LW · GW

> I think it will be hard to figure out how to actually make models do stuff we want. Insofar as this is simply a restatement of the alignment problem, I think this assumption will be fairly uncontroversial around here.

Fwiw, the problem I think is hard is "how to make models do stuff that is actually what we want, rather than only seeming like what we want, or only initially what we want until the model does something completely different like taking over the world".

I don't expect that it will be hard to get models that look like they're doing roughly the thing we want; see e.g. the relative ease of prompt engineering or learning from human preferences. If I thought that were hard, I would agree with you.

I would guess that this is relatively uncontroversial as a view within this field? Not sure though.

(One of my initial critiques of bio anchors was that it didn't take into account the cost of human feedback, except then I actually ran some back-of-the-envelope calculations and it turned out it was dwarfed by the cost of compute; maybe that's your crux too?)

Comment by rohinmshah on Testing The Natural Abstraction Hypothesis: Project Intro · 2021-04-23T21:39:27.739Z · LW · GW

Planned summary for the Alignment Newsletter:

We’ve previously seen some discussion about <@abstraction@>(@Public Static: What is Abstraction?@), and some [claims]( that there are “natural” abstractions, or that AI systems will <@tend@>(@Chris Olah’s views on AGI safety@) to <@learn@>(@Conversation with Rohin Shah@) increasingly human-like abstractions (at least up to a point). To make this more crisp, given a system, let’s consider the information (abstraction) of the system that is relevant for predicting parts of the world that are “far away”. Then, the **natural abstraction hypothesis** states that:

1. This information is much lower-dimensional than the system itself.

2. These low-dimensional summaries are exactly the high-level abstract objects/concepts typically used by humans.

3. These abstractions are “natural”, that is, a wide variety of cognitive architectures will learn to use approximately the same concepts to reason about the world.

For example, to predict the effect of a gas in a larger system, you typically just need to know its temperature, pressure, and volume, rather than the exact positions and velocities of each molecule of the gas. The natural abstraction hypothesis predicts that many cognitive architectures would all converge to using these concepts to reason about gases.

If the natural abstraction hypothesis were true, it could make AI alignment dramatically simpler, as our AI systems would learn to use approximately the same concepts as us, which can help us both to “aim” our AI systems at the right goal, and to peer into our AI systems to figure out what exactly they are doing. So, this new project aims to test whether the natural abstraction hypothesis is true.

The first two claims will likely be tested empirically. We can build low-level simulations of interesting systems, and then compute what summary is useful for predicting its effects on “far away” things. We can then ask how low-dimensional that summary is (to test (1)), and whether it corresponds to human concepts (to test (2)).

A [followup post]( illustrates this in the case of a linear-Gaussian Bayesian network with randomly chosen graph structure. In this case, we take two regions of 110 nodes each that are far apart, and operationalize the relevant information between them as the covariance matrix between the two regions. It turns out that this covariance matrix has about 3-10 “dimensions” (depending on exactly how you count), supporting claim (1). (And in fact, if you now compare to another neighborhood, two of the three “dimensions” remain the same!) Unfortunately, this doesn’t give much evidence about (2), since humans don’t have good concepts for parts of linear-Gaussian Bayesian networks with randomly chosen graph structure.
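As a toy illustration of the kind of dimension-counting involved (the matrix below is a synthetic low-rank stand-in that I made up, not the actual cross-region covariance from the followup post), one could count “dimensions” as the number of singular values needed to explain most of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the cross-region covariance matrix of a
# linear-Gaussian Bayesian network: a random 110x110 matrix that is
# low-rank plus a little noise, mimicking a low-dimensional summary.
n = 110
true_rank = 5
A = rng.normal(size=(n, true_rank)) @ rng.normal(size=(true_rank, n))
A += 1e-6 * rng.normal(size=(n, n))  # small noise

# Count "dimensions" as the number of singular values needed to
# capture (say) 99% of the total singular value mass.
s = np.linalg.svd(A, compute_uv=False)
explained = np.cumsum(s) / np.sum(s)
n_dims = int(np.searchsorted(explained, 0.99)) + 1

print(n_dims)  # far below n = 110
```

The 99% threshold is one of several reasonable ways to count, which is why the post reports a range (“3-10 dimensions, depending on exactly how you count”) rather than a single number.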

While (3) can also be tested empirically through simulation, we would hope that we can also prove theorems that state that nearly all cognitive architectures from some class of models would learn the same concepts in some appropriate types of environments.

To quote the author, “the holy grail of the project would be a system which provably learns all learnable abstractions in a fairly general class of environments, and represents those abstractions in a legible way. In other words: it would be a standardized tool for measuring abstractions. Stick it in some environment, and it finds the abstractions in that environment and presents a standard representation of them.”

Planned opinion:

The notion of “natural abstractions” seems quite important to me. There are at least some weak versions of the hypothesis that seem obviously true: for example, if you ask GPT-3 some new type of question it has never seen before, you can predict pretty confidently that it is still going to respond with real words rather than a string of random characters. This is effectively because you expect that GPT-3 has learned the “natural abstraction” of the words used in English and that it uses this natural abstraction to drive its output (leaving aside the cases where it must produce output in some other language).

The version of the natural abstraction hypothesis investigated here seems a lot stronger and I’m excited to see how the project turns out. I expect the author will post several short updates over time; I probably won’t cover each of these individually and so if you want to follow it in real time I recommend following it on the Alignment Forum.

Comment by rohinmshah on Three reasons to expect long AI timelines · 2021-04-23T19:31:44.494Z · LW · GW

I broadly agree with these points, and (1) and (3) in particular lead me to shade the bio anchors estimates upwards by ~5 years (note they are already shaded up somewhat to account for these kinds of effects).

I don't really agree on (2).

I see no strong reason to doubt the narrow version of this thesis. I believe it's likely that, as training scales, we'll progressively see more general and more capable machine learning models that can do a ton of impressive things, both on the stuff we expect them to do well on, and some stuff we didn't expect.

But no matter how hard I try, I don't see any current way of making some descendant of GPT-3, for instance, manage a corporation.

I feel like if you were applying this argument to evolution, you'd conclude that humans would be unable to manage corporations, which seems too much. Humans manage to do things that weren't in the ancestral environment; why couldn't GPTs, for the same reason?

You might say "okay, sure, at some level of scaling GPTs learn enough general reasoning that they can manage a corporation, but there's no reason to believe it's near". But one of the major points of the bio anchors framework is to give a reasonable answer to the question of "at what level of scaling might this work", so I don't think you can argue that current forecasts are ignoring (2).

Perhaps you just mean that most people aren't taking bio anchors into account and that's why (2) applies to them -- that seems plausible, I don't have strong beliefs about what other people are thinking.

Comment by rohinmshah on Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers · 2021-04-20T00:38:28.627Z · LW · GW

Sounds good, I've added a sixth bullet point. Fyi, I originally took that list of 5 bullet points verbatim from your post, so you might want to update that list in the post as well.

Comment by rohinmshah on Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers · 2021-04-19T22:53:05.466Z · LW · GW

Planned summary for the Alignment Newsletter:

This is basically 3 months worth of Alignment Newsletters focused solely on interpretability wrapped up into a single post. The authors provide summaries of 70 (!) papers on the topic, and include links to another 90. I’ll focus on their opinions about the field in this summary.

The theory and conceptual clarity of the field of interpretability has improved dramatically since its inception. There are several new or clearer concepts, such as simulatability, plausibility, (aligned) faithfulness, and (warranted) trust. This seems to have had a decent amount of influence over the more typical “methods” papers.

There have been lots of proposals for how to evaluate interpretability methods, leading to the [problem of too many standards]( The authors speculate that this is because both “methods” and “evaluation” papers don’t have sufficient clarity on what research questions they are trying to answer. Even after choosing an evaluation methodology, it is often unclear which other techniques you should be comparing your new method to.

For specific methods for achieving interpretability, at a high level, there has been clear progress. There are cases where we can:

1. identify concepts that certain neurons represent,

2. find feature subsets that account for most of a model's output,

3. find changes to data points that yield requested model predictions,

4. find training data that influences individual test time predictions,

5. generate natural language explanations that are somewhat informative of model reasoning, and

6. create somewhat competitive models that are inherently more interpretable.

There does seem to be a problem of disconnected research and reinventing the wheel. In particular, work at CV conferences, work at NLP conferences, and work at NeurIPS / ICML / ICLR form three clusters that for the most part do not cite each other.

Planned opinion:

This post is great. Especially to the extent that you like summaries of papers (and according to the survey I recently ran, you probably do like summaries), I would recommend reading through this post. You could also read through the highlights from each section, bringing it down to 13 summaries instead of 70.

Comment by rohinmshah on rohinmshah's Shortform · 2021-04-16T17:19:41.414Z · LW · GW

Let's say you're trying to develop some novel true knowledge about some domain. For example, maybe you want to figure out what the effect of a maximum wage law would be, or whether AI takeoff will be continuous or discontinuous. How likely is it that your answer to the question is actually true?

(I'm assuming here that you can't defer to other people on this claim; nobody else in the world has tried to seriously tackle the question, though they may have tackled somewhat related things, or developed more basic knowledge in the domain that you can leverage.)

First, you might think that the probability of your claims being true is linear in the number of insights you have, with some soft minimum needed before you really have any hope of being better than random (e.g. for maximum wage, you probably have ~no hope of doing better than random without Econ 101 knowledge), and some soft maximum where you almost certainly have the truth. This suggests that P(true) is a logistic function of the number of insights.

Further, you might expect that for every doubling of time you spend, you get a constant number of new insights (the logarithmic returns are because you have diminishing marginal returns on time, since you are always picking the low-hanging fruit first). So then P(true) is logistic in terms of log(time spent). And in particular, there is some soft minimum of time spent before you have much hope of doing better than random.

This soft minimum on time is going to depend on a bunch of things -- how "hard" or "complex" or "high-dimensional" the domain is, how smart / knowledgeable you are, how much empirical data you have, etc. But mostly my point is that these soft minimums exist.
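The toy model above can be sketched numerically. The constants here are arbitrary placeholders I chose for illustration, not claims about any real domain:

```python
import math

def p_true(time_spent, a=1.5, b=5.0):
    """P(claim is true) under the toy model: logistic in log(time spent).

    a and b are arbitrary illustrative constants: a controls how many
    insights you gain per doubling of time spent, and b sets the "soft
    minimum" number of insights needed before you beat chance.
    """
    insights = a * math.log2(time_spent)  # constant insights per doubling
    return 1.0 / (1.0 + math.exp(-(insights - b)))

# Below the soft minimum of time, P(true) is near zero; well above it,
# the answer is nearly certain; in between, returns are steep.
for hours in [1, 10, 100, 1000]:
    print(hours, round(p_true(hours), 3))
```

Note `time_spent` is in arbitrary units; the “soft minimum” shows up as the region where the curve is still flat near zero, which shifts with domain difficulty (here, with `b`).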

A common pattern in my experience on LessWrong is that people will take some domain that I think is hard / complex / high-dimensional, and will then make a claim about it based on some pretty simple argument. In this case my response is usually "idk, that argument seems directionally right, but who knows, I could see there being other things that have much stronger effects", without being able to point to any such thing (because I also have spent barely any time thinking about the domain). Perhaps a better way of saying it would be "I think you need to have thought about this for more time than you have before I expect you to do better than random".

Comment by rohinmshah on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2021-04-16T16:45:37.904Z · LW · GW

Well then, would you agree that Evan's position here:

By default, in the case of deception, my expectation is that we won't get a warning shot at all

is plausible and in particular doesn't depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely?

No, I don't agree with that.

Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for probability that we'll have warning shots of this kind.

One problem here is that my credences on warning shots are going to be somewhat lower just because I think there's some chance that we just solve the problem before we get warning shots, or there was never any problem in the first place.

I could condition on worlds in which an existential catastrophe occurs, but that will also make it somewhat lower because an existential catastrophe is more likely when we don't get warning shots.

So I think for each type of warning shot I'm going to do a weird operation where I condition on something like "by the time a significant amount of work is being done by AI systems that are sufficiently capable to deliberately cause <type of warning shot> level of damage, we have not yet solved the problem in practice".

I'm also going to assume no discontinuity, since that's the situation we seem to disagree about.

Then, some warning shots we could have:

Minor, leads to result "well of course that happened" without much increase in caution: has already happened

  • Reward gaming: Faulty reward functions in the wild
  • Deception: Robot hand moving in front of a ball to make it look like it is grasping it, even though it isn't (source)
  • Hidden capabilities: GPT-3 answering nonsense questions with "a straight face", except it can tell that the questions are nonsense, as you can see if you design a better prompt (source)

Minor, leads to some actual damage, but mostly PR / loss of trust: 95%

  • Lying / deception: A personal assistant agent, when asked to schedule a meeting by when2meet, insists upon doing it by email instead, because that's how it has always done things. It says "sorry, I don't know how to use when2meet" in order to get this to happen, but it "could" use when2meet if it "wanted" to.
  • Deception: A cleaning robot sweeps the dust under the rug, knowing full well that the user would disapprove if they knew.

Moderate, comparable to things that are punishable by law: 90%

  • Deception: An AI system in charge of a company embezzles money
  • Deception: An AI system runs a Ponzi scheme (that it knows is a Ponzi scheme) (and the designers of the AI system wouldn't endorse it running a Ponzi scheme)
  • Failure of constraints: An AI system helps minors find online stores for drugs and alcohol

Major, lots of damage, would be huge news: 60%

  • An AI system blows up an "enemy building"; it hides its plans from all humans (including users / designers) because it knows they will try to stop it.
  • An AI system captures employees from a rival corporation and tortures them until they give up corporate secrets.
  • (The specific examples I give feel somewhat implausible, but I think that's mostly because I don't know the best ways to achieve goals when you have no moral scruples holding you back.)

"Strong", tries and fails to take over the world: 20%

  • I do think it is plausible that multiple AI systems try to take over the world, and then some of them are thwarted by other AI systems. I'm not counting these, because it seems like humans have lost meaningful control in this situation, so this "warning shot" doesn't help.
  • I mostly assign 20% on this as "idk, seems unlikely, but I can't rule it out, and predicting the future is hard so don't assign an extreme value here"

Comment by rohinmshah on Old post/writing on optimization daemons? · 2021-04-15T23:41:32.026Z · LW · GW

This probably isn't the thing you mean, but your description kinda sounds like tessellating hills and its predecessor demons in imperfect search.

Comment by rohinmshah on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2021-04-15T23:35:08.451Z · LW · GW

I don't automatically exclude lab settings, but other than that, this seems roughly consistent with my usage of the term. (And in particular includes the "weak" warning shots discussed above.)

Comment by rohinmshah on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2021-04-15T17:54:53.666Z · LW · GW

perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we'd get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn't matter much.

Well, I think a case of an AI trying and failing to take over would provoke an even larger increase in caution, so I'd rephrase as

it would actually provoke a major increase in caution (assuming we weren't already being very cautious)

I suppose the distinction between "strong" and "weak" warning shots would matter if we thought that we were getting "strong" warning shots. I want to claim that most people (including Evan) don't expect "strong" warning shots, and usually mean the "weak" version when talking about "warning shots", but perhaps I'm just falling prey to the typical mind fallacy.

Comment by rohinmshah on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2021-04-14T18:02:08.611Z · LW · GW

If you think there's something we are not on the same page about here--perhaps what you were hinting at with your final sentence--I'd be interested to hear it.

I'm not sure. Since you were pushing on the claim about failing to take over the world, it seemed like you think (the truth value of) that claim is pretty important, whereas I see it as not that important, which would suggest that there is some underlying disagreement (idk what it would be though).

Comment by rohinmshah on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-13T22:44:30.494Z · LW · GW

Perhaps I should start saying "Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?", and maybe that will trigger less pushback of the form "No, alignment is the most important thing"... 

I think that probably would be true.

For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment (which I'm not), which triggers a "Yes, this is the most valuable thing" reply.

Fwiw my reaction is not "Critch thinks Rohin should do something else", it's more like "Critch is saying something I believe to be false on an important topic that lots of other people will read". I generally want us as a community to converge to true beliefs on important things (part of my motivation for writing a newsletter) and so then I'd say "but actually alignment still seems like the most valuable thing on the margin because of X, Y and Z".

(I've had enough conversations with you at this point to know the axes of disagreement, and I think you've convinced me that "which one is better on the margin" is not actually that important a question to get an answer to. So now I don't feel as much of an urge to respond that way. But that's how I started out.)

Comment by rohinmshah on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2021-04-13T22:34:34.922Z · LW · GW

Not sure why I didn't respond to this, sorry.

I agree with the claim "we may not have an AI system that tries and fails to take over the world (i.e. an AI system that tries but fails to release an engineered pandemic that would kill all humans, or arrange for simultaneous coups in the major governments, or have a robotic army kill all humans, etc) before getting an AI system that tries and succeeds at taking over the world".

I don't see this claim as particularly relevant to predicting the future.

Comment by rohinmshah on Another (outer) alignment failure story · 2021-04-12T21:20:38.804Z · LW · GW

Planned opinion (shared with What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs))

Both the previous story and this one seem quite similar to each other, and seem pretty reasonable to me as a description of one plausible failure mode we are aiming to avert. The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story; in other words, if we eliminated either of the two failures, or made it significantly less bad, then the story would no longer seem very plausible.

A natural next question is then which of the two failures would be best to intervene on; that is, is it more useful to work on intent alignment, or to work on coordination? I’ll note that my best guess is that for any given person, this effect is minor relative to “which of the two topics is the person more interested in?”, so it doesn’t seem hugely important to me. Nonetheless, my guess is that on the current margin, for technical research in particular, holding all else equal, it is more impactful to focus on intent alignment. You can see a much more vigorous discussion in e.g. [this comment thread](

Comment by rohinmshah on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-12T21:19:35.609Z · LW · GW

Planned summary for the Alignment Newsletter:

A robust agent-agnostic process (RAAP) is a process that robustly leads to an outcome, without being very sensitive to the details of exactly which agents participate in the process, or how they work. This is illustrated through a “Production Web” failure story, which roughly goes as follows:

A breakthrough in AI technology leads to a wave of automation of $JOBTYPE (e.g. management) jobs. Any companies that don’t adopt this automation are outcompeted, and so soon most of these jobs are completely automated. This leads to significant gains at these companies and higher growth rates. These semi-automated companies trade amongst each other frequently, and a new generation of “precision manufacturing” companies arise that can build almost anything using robots given the right raw materials. A few companies develop new software that can automate $OTHERJOB (e.g. engineering) jobs. Within a few years, nearly all human workers have been replaced.

These companies are now roughly maximizing production within their various industry sectors. Lots of goods are produced and sold to humans at incredibly cheap prices. However, we can’t understand how exactly this is happening. Even Board members of the fully mechanized companies can’t tell whether the companies are serving or merely appeasing humanity; government regulators have no chance.

We do realize that the companies are maximizing objectives that are incompatible with preserving our long-term well-being and existence, but we can’t do anything about it because the companies are both well-defended and essential for our basic needs. Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.

Notice that in this story it didn’t really matter what job type got automated first (nor did it matter which specific companies took advantage of the automation). This is the defining feature of a RAAP -- the same general story arises even if you change around the agents that are participating in the process. In particular, in this case competitive pressure to increase production acts as a “control loop” that ensures the same outcome happens, regardless of the exact details about which agents are involved.

Planned opinion (shared with Another (outer) alignment failure story):

Both the previous story and this one seem quite similar to each other, and seem pretty reasonable to me as a description of one plausible failure mode we are aiming to avert. The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story; in other words, if we eliminated either of the two failures, or made it significantly less bad, then the story would no longer seem very plausible.

A natural next question is then which of the two failures would be best to intervene on; that is, is it more useful to work on intent alignment, or to work on coordination? I’ll note that my best guess is that for any given person, this effect is minor relative to “which of the two topics is the person more interested in?”, so it doesn’t seem hugely important to me. Nonetheless, my guess is that on the current margin, for technical research in particular, holding all else equal, it is more impactful to focus on intent alignment. You can see a much more vigorous discussion in e.g. [this comment thread](

Comment by rohinmshah on Another (outer) alignment failure story · 2021-04-12T20:00:04.529Z · LW · GW

Planned summary for the Alignment Newsletter:

Suppose we train AI systems to perform task T by having humans look at the results that the AI system achieves and evaluating how well the AI has performed task T. Suppose further that AI systems generalize “correctly” such that even in new situations they are still taking those actions that they predict we will evaluate as good. This does not mean that the systems are aligned: they would still deceive us into _thinking_ things are great when they actually are not. This post presents a more detailed story for how such AI systems can lead to extinction or complete human disempowerment. It’s relatively short, and a lot of the force comes from the specific details that I’m not going to summarize, so I do recommend you read it in full. I’ll be explaining a very abstract version below.

The core aspects of this story are:
1. Economic activity accelerates, leading to higher and higher growth rates, enabled by more and more automation through AI.
2. Throughout this process, we see some failures of AI systems where the AI system takes some action that initially looks good but we later find out was quite bad (e.g. investing in a Ponzi scheme, that the AI knows is a Ponzi scheme but the human doesn’t).
3. Despite this failure mode being known and lots of work being done on the problem, we are unable to find a good conceptual solution. The best we can do is to build better reward functions, sensors, measurement devices, checks and balances, etc. in order to provide better reward functions for agents and make it harder for them to trick us into thinking their actions are good when they are not.
4. Unfortunately, since the proportion of AI work keeps increasing relative to human work, this extra measurement capacity doesn’t work forever. Eventually, the AI systems are able to completely deceive all of our sensors, such that we can’t distinguish between worlds that are actually good and worlds which only appear good. Humans are dead or disempowered at this point.

(Again, the full story has much more detail.)

Comment by rohinmshah on AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes · 2021-04-12T16:25:50.364Z · LW · GW

Planned summary for the Alignment Newsletter:

This podcast covers a bunch of topics, such as <@debate@>(@AI safety via debate@), <@cross examination@>(@Writeup: Progress on AI Safety via Debate@), <@HCH@>(@Humans Consulting HCH@), <@iterated amplification@>(@Supervising strong learners by amplifying weak experts@), and <@imitative generalization@>(@Imitative Generalisation (AKA 'Learning the Prior')@) (aka [learning the prior]( ([AN #109](, along with themes about <@universality@>(@Towards formalizing universality@). Recommended for getting a broad overview of this particular area of AI alignment.

Comment by rohinmshah on My research methodology · 2021-04-09T06:24:10.606Z · LW · GW

I agree this involves discretion [...] So instead I'm doing some in between thing

Yeah, I think I feel like that's the part where I don't think I could replicate your intuitions (yet).

I don't think we disagree; I'm just noting that this methodology requires a fair amount of intuition / discretion, and I don't feel like I could do this myself. This is much more a statement about what I can do, rather than a statement about how good the methodology is on some absolute scale.

(Probably I could have been clearer about this in the original opinion.)

Comment by rohinmshah on My research methodology · 2021-04-09T03:06:43.316Z · LW · GW

In some sense you could start from the trivial story "Your algorithm didn't work and then something bad happened." Then the "search for stories" step is really just trying to figure out if the trivial story is plausible. I think that's pretty similar to a story like: "You can't control what your model thinks, so in some new situation it decides to kill you."

To fill in the details more:

Assume that we're finding an algorithm to train an agent with a sufficiently large action space (i.e. we don't get safety via the agent having such a restricted action space that it can't do anything unsafe).

It seems like in some sense the game is in constraining the agent's cognition to be such that it is "safe" and "useful". The point of designing alignment algorithms is to impose such constraints, without requiring so much effort as to make the resulting agent useless / uncompetitive.

However, there are always going to be some plausible circumstances that we didn't consider (even if we're talking about amplified humans, which are still bounded agents). Even if we had maximal ability to place constraints on agent cognition, whatever constraints we do place won't have been tested in these unconsidered plausible circumstances. It is always possible that one misfires in a way that makes the agent do something unsafe.

(This wouldn't be true if we had some sort of proof against misfiring, that doesn't assume anything about what circumstances the agent experiences, but that seems ~impossible to get. I'm pretty sure you agree with that.)

More generally, this story is going to be something like:

  1. Suppose you trained your model M to do X using algorithm A.
  2. Unfortunately, when designing algorithm A / constraining M with A, you (or amplified-you) failed to consider circumstance C as a possible situation that might happen.
  3. As a result, the model learned heuristic H, that works in all the circumstances you did consider, but fails in circumstance C.
  4. Circumstance C then happens in the real world, leading to an actual failure.

Obviously, I can't usually instantiate M, X, A, C, and H such that the story works for an amplified human (since they can presumably think of anything I can think of). And I'm not arguing that any of this is probable. However, it seems to meet your bar of "plausible":

there is some way to fill in the rest of the details that's consistent with everything I know about the world.

EDIT: Or maybe more accurately, I'm not sure how exactly the stories you tell are different / more concrete than the ones above.


When I say you have "a better defined sense of what does and doesn't count as a valid step 2", I mean that there's something in your head that disallows the story I wrote above, but allows the stories that you generally use, and I don't know what that something is; and that's why I would have a hard time applying your methodology myself.


Possible analogy / intuition pump for the general story I gave above: human cognition is only competent in particular domains and must be relearned in new domains (like protein folding) or new circumstances (like when COVID-19 hits). Sometimes human cognition isn't up to the task (like when being teleported to a universe with different physics and immediately dying), or handles it in a way that other humans would reject (like how some humans would push a button that automatically wireheads everyone for all time, while others would find that abhorrent).

Comment by rohinmshah on Coherence arguments imply a force for goal-directed behavior · 2021-04-09T02:25:46.453Z · LW · GW

Looks good to me :)

Comment by rohinmshah on My research methodology · 2021-04-06T22:39:51.045Z · LW · GW

Planned summary for the Alignment Newsletter:

This post outlines a simple methodology for making progress on AI alignment. The core idea is to alternate between two steps:

1. Come up with some alignment algorithm that solves the issues identified so far

2. Try to find some plausible situation in which either a) the resulting AI system is misaligned or b) the AI system is not competitive.

This is all done conceptually, so step 2 can involve fairly exotic scenarios that probably won't happen. Given such a scenario, we need to argue why no failure in the same class as that scenario will happen, or we need to go back to step 1 and come up with a new algorithm.
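Schematically, the alternation could be sketched like this (my own rendering, with hypothetical function names; the real "steps" are conceptual arguments, not computations):

```python
def research_loop(propose_algorithm, find_failure_story, known_issues):
    """Alternate step 1 (propose) and step 2 (break) until no plausible
    failure story can be found for the current candidate algorithm."""
    while True:
        algorithm = propose_algorithm(known_issues)   # step 1
        story = find_failure_story(algorithm)         # step 2
        if story is None:
            return algorithm              # no plausible failure found
        known_issues.append(story)        # design against this failure class

# Toy instantiation with stand-in strings for algorithms and stories:
failure_stories = {
    "handcoded RL": "specification gaming",
    "RL from human feedback": "bad actions humans can't recognize as bad",
}
candidates = ["handcoded RL", "RL from human feedback", "iterated amplification"]

issues = []
result = research_loop(lambda known: candidates[len(known)],
                       failure_stories.get,
                       issues)
```

Note the asymmetry the post relies on: the loop only terminates when the search for a failure story comes up empty, which in practice means "we argued that no failure in that class will happen", not that we proved it.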

This methodology could play out as follows:

Step 1: RL with a handcoded reward function.

Step 2: This is vulnerable to <@specification gaming@>(@Specification gaming examples in AI@).

Step 1: RL from human preferences over behavior, or other forms of human feedback.

Step 2: The system might still pursue actions that are bad in ways humans can't recognize as bad. For example, it might write a well-researched report on whether fetuses are moral patients, which intuitively seems good (assuming the research is good). However, this would be quite bad if the AI wrote the report because it calculated that it would increase partisanship, leading to civil war.

Step 1: Use iterated amplification to construct a feedback signal that is "smarter" than the AI system it is training.

Step 2: The system might pick up on <@inaccessible information@>(@Inaccessible information@) that the amplified overseer cannot find. For example, it might be able to learn a language just by staring at a large pile of data in that language, and then seek power whenever working in that language, and the amplified overseer may not be able to detect this.

Step 1: Use <@imitative generalization@>(@Imitative Generalisation (AKA 'Learning the Prior')@) so that the human overseer can leverage facts that can be learned by induction / pattern matching, which neural nets are great at.

Step 2: Since imitative generalization ends up learning a description of facts for some dataset, it may learn low-level facts useful for prediction on the dataset, while not including the high-level facts that tell us how the low-level facts connect to things we care about. 

The post also talks about various possible objections you might have, which I’m not going to summarize here.

Planned opinion:

I'm a big fan of having a candidate algorithm in mind when reasoning about alignment. It is a lot more concrete, which makes it easier to make progress and not get lost, relative to generic reasoning from just the assumption that the AI system is superintelligent.

I'm less clear on how exactly you move between the two steps -- from my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learns, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2. Our algorithms are instead meant to chip away at the problem, by continually increasing our control over these patterns of thought. It seems like the author has a better defined sense of what does and doesn't count as a valid step 2, and that makes this methodology more fruitful for him than it would be for me. More discussion [here](

Comment by rohinmshah on How do scaling laws work for fine-tuning? · 2021-04-04T19:57:16.049Z · LW · GW

I don't think similarly-sized transformers would do much better and might do worse. Section 3.4 shows that large models trained from scratch massively overfit to the data. I vaguely recall the authors saying that similarly-sized transformers tended to be harder to train as well.

Comment by rohinmshah on How do scaling laws work for fine-tuning? · 2021-04-04T16:53:58.607Z · LW · GW

Does this mean that this fine-tuning process can be thought of as training a NN that is 3 OOMs smaller, and thus needs 3 OOMs fewer training steps according to the scaling laws?

My guess is that the answer is mostly yes (maybe not the exact numbers predicted by existing scaling laws, but similar ballpark).
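As a back-of-envelope check of the "3 OOMs" framing (my own arithmetic, with hypothetical numbers; this compares trainable-parameter counts only, not any actual scaling-law fit):

```python
import math

# Hypothetical: fine-tune 0.1% of the weights of a 1-billion-parameter model.
pretrained_params = 1e9
finetuned_fraction = 0.001
trainable = pretrained_params * finetuned_fraction   # 1e6 trainable weights

# "3 OOMs smaller" in the question's sense: the trainable-parameter count
# is what a smaller-effective-model reading of the scaling laws would use.
ooms_smaller = math.log10(pretrained_params / trainable)
```

The caveat from above still applies: existing scaling-law fits were measured for full training runs, so treating the trainable count as the "effective model size" gives a similar ballpark at best, not the exact predicted numbers.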

how does that not contradict the scaling laws for transfer described here and used in this calculation by Rohin?

I think this is mostly irrelevant to timelines / previous scaling laws for transfer:

  1. You still have to pretrain the Transformer, which will take the usual amount of compute (my calculation that you linked takes this into account).
  2. The models trained in the new paper are not particularly strong. They are probably equivalent in performance to models that are multiple orders of magnitude smaller trained from scratch. (I think when comparing against training from scratch, the authors did use smaller models because that was more stable, though with a quick search I couldn't find anything confirming that right now.) So if you think of the "default" as "train an X-parameter model from scratch", then to get equivalent performance you'd probably want to do something like "pretrain a 100X-parameter model, then finetune 0.1% of its weights". (Numbers completely made up.)
  3. I expect there are a bunch of differences in how exactly models are trained. For example, the scaling law papers work almost exclusively with compute-optimal training, whereas this paper probably works with models trained to convergence.

You probably could come to a unified view that incorporates both this new paper and previous scaling law papers, but I expect you'd need to spend a bunch of time getting into the minutiae of the details across the two methods. (Probably high tens to low hundreds of hours.)

Comment by rohinmshah on Coherence arguments imply a force for goal-directed behavior · 2021-03-29T21:47:22.884Z · LW · GW

Yes, that's basically right.

You think I take the original argument to be arguing from ‘has goals' to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.

Well, I do think it is an interesting/relevant argument (because as you say it explains how you get from "weakly has goals" to "strongly has goals"). I just wanted to correct the misconception about what I was arguing against, and I wanted to highlight the "intelligent" --> "weakly has goals" step as a relatively weak step in our current arguments. (In my original post, my main point was that that step doesn't follow from pure math / logic.)

In that case, my current understanding is that you are disagreeing with 2, and that you agree that if 2 holds in some case, then the argument goes through.

At least, the argument makes sense. I don't know how strong its effect is -- basically I agree with your phrasing here:

This force probably doesn’t exist out at the zero goal directness edges, but it unclear how strong it is in the rest of the space—i.e. whether it becomes substantial as soon as you move out from zero goal directedness, or is weak until you are in a few specific places right next to ‘maximally goal directed’.)

Comment by rohinmshah on Coherence arguments imply a force for goal-directed behavior · 2021-03-29T21:41:54.369Z · LW · GW

Thanks, that's helpful. I'll think about how to clarify this in the original post.

Comment by rohinmshah on Coherence arguments imply a force for goal-directed behavior · 2021-03-26T17:46:01.060Z · LW · GW

You're mistaken about the view I'm arguing against. (Though perhaps in practice most people think I'm arguing against the view you point out, in which case I hope this post helps them realize their error.) In particular:

Whatever things you care about, you are best off assigning consistent numerical values to them and maximizing the expected sum of those values

If you start by assuming that the agent cares about things, and your prior is that the things it cares about are "simple" (e.g. it is very unlikely to be optimizing the-utility-function-that-makes-the-twitching-robot-optimal), then I think the argument goes through fine. According to me, this means you have assumed goal-directedness in from the start, and are now seeing what the implications of goal-directedness are.

My claim is that if you don't assume that the agent cares about things, coherence arguments don't let you say "actually, principles of rationality tell me that since this agent is superintelligent it must care about things".

Stated this way it sounds almost obvious that the argument doesn't work, but I used to hear arguments that effectively assumed this pretty frequently. Those arguments usually go something like this:

  1. By hypothesis, we will have superintelligent agents.
  2. A superintelligent agent will follow principles of rationality, and thus will satisfy the VNM axioms.
  3. Therefore it can be modeled as an EU maximizer.
  4. Therefore it pursues convergent instrumental subgoals and kills us all.
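For reference, the move from step 2 to step 3 is the von Neumann–Morgenstern representation theorem (standard statement, not quoted from the talk): if an agent's preferences over lotteries satisfy completeness, transitivity, continuity, and independence, then

```latex
\exists\, u : \text{outcomes} \to \mathbb{R} \quad \text{such that} \quad
L \succeq M \;\iff\; \mathbb{E}_{L}[u] \ge \mathbb{E}_{M}[u]
```

i.e. the agent can be modeled as maximizing expected utility. Note that the theorem takes a preference relation over outcomes as given -- it says nothing about whether the agent cares about things in the first place, which is exactly where the goal-directedness assumption sneaks in.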

This talk for example gives the impression that this sort of argument works. (If you look carefully, you can see that it does state that the AI is programmed to have "objects of concern", which is where the goal-directedness assumption comes in, but you can see why people might not notice that as an assumption.)


You might think "well, obviously the superintelligent AI system is going to care about things, maybe it's technically an assumption but surely that's a fine assumption". I think on balance I agree, but it doesn't seem nearly so obvious to me, and seems to depend on how exactly the agent is built. For example, it's plausible to me that superintelligent expert systems would not be accurately described as "caring about things", and I don't think it was a priori obvious that expert systems wouldn't lead to AGI. Similarly, it seems at best questionable whether GPT-3 can be accurately described as "caring about things".


As to whether this argument is relevant for whether we will build goal-directed systems: I don't think that in isolation my argument should strongly change your view on the probability you assign to that claim. I see it more as a constraint on what arguments you can supply in support of that view. If you really were just saying "VNM theorem, therefore 99%", then probably you should become less confident, but I expect in practice people were not doing that and so it's not obvious how exactly their probabilities should change.


I'd appreciate advice on how to change the post to make this clearer -- I feel like your response is quite common, and I haven't yet figured out how to reliably convey the thing I actually mean.

Comment by rohinmshah on Introduction To The Infra-Bayesianism Sequence · 2021-03-25T21:25:40.071Z · LW · GW

But for more general infradistributions this need not be the case. For example, consider  and take the set of a-measures generated by  and . Suppose you start with  dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting  dollars on the outcome , with a value of  dollars.

I guess my question is more like: shouldn't there be some aspect of reality that determines what my set of a-measures is? It feels like here we're finding a set of a-measures that rationalizes my behavior, as opposed to choosing a set of a-measures based on the "facts" of the situation and then seeing what behavior that implies.

I feel like we agree on what the technical math says, and I'm confused about the philosophical implications. Maybe we should just leave the philosophy alone for a while.

Comment by rohinmshah on My research methodology · 2021-03-25T21:21:01.474Z · LW · GW

Cool, that makes sense, thanks!

Comment by rohinmshah on My AGI Threat Model: Misaligned Model-Based RL Agent · 2021-03-25T16:07:14.408Z · LW · GW

Planned summary for the Alignment Newsletter:

This post lays out a pathway by which an AI-induced existential catastrophe could occur. The author suggests that AGI will be built via model-based reinforcement learning: that is, given a reward function, we will learn a world model, a value function, and a planner / actor. These will learn online, that is, even after being deployed these learned models will continue to be updated by our learning algorithm (gradient descent, or whatever replaces it). Most research effort will be focused on learning these models, with relatively less effort applied to choosing the right reward function.

There are then two alignment problems: the _outer_ alignment problem is whether the reward function correctly reflects the designer's intent, and the _inner_ alignment problem is whether the value function accurately represents the expected reward obtained by the agent over the long term. On the inner alignment side, the value function may not accurately capture the reward for several reasons, including ambiguity in the reward signals (since you only train the value function in some situations, and many reward functions can then produce the same value function), manipulation of the reward signal, failures of credit assignment, ontological crises, and having mutually contradictory "parts" of the value function (similarly to humans). On the outer alignment side, we have the standard problem that the reward function may not reflect what we actually want (i.e. specification gaming or Goodhart's Law). In addition, it seems likely that many capability enhancements will be implemented through the reward function, e.g. giving the agent a curiosity reward, which increases outer misalignment.
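The decomposition can be sketched as a minimal skeleton (my own illustrative code, with hypothetical structure; the post does not specify an implementation):

```python
class ModelBasedAgent:
    def __init__(self, reward_fn):
        self.reward_fn = reward_fn   # fixed by designers (outer alignment)
        self.world_model = {}        # learned online: (state, action) -> next state
        self.value = {}              # learned online: state -> long-run reward estimate

    def observe(self, state, action, next_state):
        # Online learning: these components keep updating after deployment.
        self.world_model[(state, action)] = next_state
        self.value[next_state] = self.reward_fn(next_state)

    def act(self, state, actions):
        # Planner: pick the action whose predicted next state the value
        # function scores highest (unseen states default to 0).
        def score(a):
            next_state = self.world_model.get((state, a))
            return self.value.get(next_state, 0)
        return max(actions, key=score)
```

In this framing, outer alignment lives in `reward_fn`, while inner alignment is whether `value` actually tracks the long-run reward -- the failure modes listed above (ambiguous reward signals, credit-assignment errors, ontological crises) are all ways the `value` entries can come apart from `reward_fn`.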

Planned opinion:

While I disagree on some of the details, I think this is a good threat model to be thinking about. Its main virtue is that it has a relatively concrete model for what AGI looks like, and it provides a plausible story for both how that type of AGI could be developed (the development model) and how that type of AGI would lead to problems (the risk model). Of course, it is still worth clarifying the plausibility of the scenario, as updates to the story can have significant implications on what research we do. (Some of this discussion is happening in [this post](

Comment by rohinmshah on Against evolution as an analogy for how humans will create AGI · 2021-03-25T15:53:31.917Z · LW · GW

If an AGI learned the skill of speaking english during training, but then learned the skill of speaking french during deployment, then your hypotheses imply that the implementations of those two language skills will be totally different. And it then gets weirder if they overlap - e.g. if an AGI learns a fact during training which gets stored in its weights, and then reads a correction later on during deployment, do those original weights just stay there?

Idk, this just sounds plausible to me. I think the hope is that the weights encode more general reasoning abilities, and most of the "facts" or "background knowledge" gets moved into memory, but that won't happen for everything and plausibly there will be this strange separation between the two. But like, sure, that doesn't seem crazy.

I do expect we reconsolidate into weights through some outer algorithm like gradient descent (and that may not require any human input). If you want to count that as "autonomously editing its weights", then fine, though I'm not sure how this influences any downstream disagreement.

Similar dynamics in humans:

  1. Children are apparently better at learning languages than adults; it seems like adults are using some different process to learn languages (though probably not as different as editing memory vs. editing weights)
  2. One theory of sleep is that it is consolidating the experiences of the day into synapses, suggesting that any within-day learning is not relying as much on editing synapses.

Tbc, I also think explicitly meta-learned update rules are plausible -- don't take any of this as "I think this is definitely going to happen" but more as "I don't see a reason why this couldn't happen".

In fact, this seems like the most likely way in which Steve is right that evolution is a bad analogy.

Fwiw I've mostly been ignoring the point of whether or not evolution is a good analogy. If you want to discuss that, I want to know what specifically you use the analogy for. For example:

  1. I think evolution is a good analogy for how inner alignment issues can arise.
  2. I don't think evolution is a good analogy for the process by which AGI is made (if you think that the analogy is that we literally use natural selection to improve AI systems).

It seems like Steve is arguing the second, and I probably agree (depending on what exactly he means, which I'm still not super clear on).