## Posts

[AN #126]: Avoiding wireheading by decoupling action feedback from action effects 2020-11-26T23:20:05.290Z
[AN #125]: Neural network scaling laws across multiple modalities 2020-11-11T18:20:04.504Z
[AN #124]: Provably safe exploration through shielding 2020-11-04T18:20:06.003Z
[AN #123]: Inferring what is valuable in order to align recommender systems 2020-10-28T17:00:06.053Z
[AN #122]: Arguing for AGI-driven existential risk from first principles 2020-10-21T17:10:03.703Z
[AN #121]: Forecasting transformative AI timelines using biological anchors 2020-10-14T17:20:04.918Z
[AN #120]: Tracing the intellectual roots of AI and AI alignment 2020-10-07T17:10:07.013Z
The Alignment Problem: Machine Learning and Human Values 2020-10-06T17:41:21.138Z
[AN #119]: AI safety when agents are shaped by environments, not rewards 2020-09-30T17:10:03.662Z
[AN #118]: Risks, solutions, and prioritization in a world with many AI systems 2020-09-23T18:20:04.779Z
[AN #117]: How neural nets would fare under the TEVV framework 2020-09-16T17:20:14.062Z
[AN #116]: How to make explanations of neurons compositional 2020-09-09T17:20:04.668Z
[AN #115]: AI safety research problems in the AI-GA framework 2020-09-02T17:10:04.434Z
[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents 2020-08-26T17:20:04.960Z
[AN #113]: Checking the ethical intuitions of large language models 2020-08-19T17:10:03.773Z
[AN #112]: Engineering a Safer World 2020-08-13T17:20:04.013Z
[AN #111]: The Circuits hypotheses for deep learning 2020-08-05T17:40:22.576Z
[AN #110]: Learning features from human feedback to enable reward learning 2020-07-29T17:20:04.369Z
[AN #109]: Teaching neural nets to generalize the way humans would 2020-07-22T17:10:04.508Z
[AN #107]: The convergent instrumental subgoals of goal-directed agents 2020-07-16T06:47:55.532Z
[AN #108]: Why we should scrutinize arguments for AI risk 2020-07-16T06:47:38.322Z
[AN #106]: Evaluating generalization ability of learned reward models 2020-07-01T17:20:02.883Z
[AN #105]: The economic trajectory of humanity, and what we might mean by optimization 2020-06-24T17:30:02.977Z
[AN #103]: ARCHES: an agenda for existential safety, and combining natural language with deep RL 2020-06-10T17:20:02.171Z
[AN #102]: Meta learning by GPT-3, and a list of full proposals for AI alignment 2020-06-03T17:20:02.221Z
[AN #101]: Why we should rigorously measure and forecast AI progress 2020-05-27T17:20:02.460Z
[AN #100]: What might go wrong if you learn a reward function while acting 2020-05-20T17:30:02.608Z
[AN #99]: Doubling times for the efficiency of AI algorithms 2020-05-13T17:20:02.637Z
[AN #98]: Understanding neural net training by seeing which gradients were helpful 2020-05-06T17:10:02.563Z
[AN #97]: Are there historical examples of large, robust discontinuities? 2020-04-29T17:30:02.043Z
[AN #96]: Buck and I discuss/argue about AI Alignment 2020-04-22T17:20:02.821Z
[AN #95]: A framework for thinking about how to make AI go well 2020-04-15T17:10:03.312Z
[AN #94]: AI alignment as translation between humans and machines 2020-04-08T17:10:02.654Z
[AN #93]: The Precipice we’re standing at, and how we can back away from it 2020-04-01T17:10:01.987Z
[AN #92]: Learning good representations with contrastive predictive coding 2020-03-25T17:20:02.043Z
[AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement 2020-03-18T17:10:02.205Z
[AN #90]: How search landscapes can contain self-reinforcing feedback loops 2020-03-11T17:30:01.919Z
[AN #89]: A unifying formalism for preference learning algorithms 2020-03-04T18:20:01.393Z
[AN #88]: How the principal-agent literature relates to AI risk 2020-02-27T09:10:02.018Z
[AN #87]: What might happen as deep learning scales even further? 2020-02-19T18:20:01.664Z
[AN #86]: Improving debate and factored cognition through human experiments 2020-02-12T18:10:02.213Z
[AN #84] Reviewing AI alignment work in 2018-19 2020-01-29T18:30:01.738Z
AI Alignment 2018-19 Review 2020-01-28T02:19:52.782Z
[AN #83]: Sample-efficient deep learning with ReMixMatch 2020-01-22T18:10:01.483Z
rohinmshah's Shortform 2020-01-18T23:21:02.302Z
[AN #82]: How OpenAI Five distributed their training computation 2020-01-15T18:20:01.270Z
[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment 2020-01-08T18:00:01.566Z

Comment by rohinmshah on Risks from Learned Optimization: Introduction · 2020-12-03T00:34:20.173Z · LW · GW

As a preamble, I should note that I'm putting on my "critical reviewer" hat here. I'm not intentionally being negative -- I am reporting my inside-view beliefs on each question -- but as a general rule, I expect these to be biased negatively; someone looking at research from the outside doesn't have the same intuitions for its utility and so will usually inside-view underestimate its value.

This is also all things I'm saying with the benefit of hindsight, idk what I would have said at the time the sequence was published. I'm not trying to be "fair" to the sequence here, that is, I'm not considering what it would have been reasonable to believe at the time.

the paper overly focuses on models mechanistically implementing optimization

Yup, that's right.

I'm not sure what model of AI development it relies on that you don't think is accurate

There seems to be an implicit model that when you do machine learning you get out a complicated mess of a neural net that is hard to interpret, but at its core it still is learning something akin to a program, and hence concepts like "explicit (mechanistic) search algorithm" are reasonable to expect. (Or at least, that this will be true for sufficiently intelligent AI systems.)

I don't think this model (implicit claim?) is correct. (For comparison, I also don't think this model would be correct if applied to human cognition.)

worse research topics you think it has tended to encourage

A couple of examples:

• Attempting to create an example of a learned mechanistic search algorithm (I know of at least one proposal that was trying to do this)
• Of your concrete experiments, I don't expect to learn anything of interest from the first two (they aren't the sort of thing that would generalize from small environments to large environments); I like the third; the fourth and fifth seem like interesting AI research but I don't think they'd shed light on mesa-optimization / inner alignment or its solutions.

I think this paper should make you more excited about directions like transparency and robustness and less excited about directions involving careful incentive/environment design

I agree with this. Maybe people have gotten more interested in transparency as a result of this paper? That seems plausible.

I'm imagining you're referring to things like this post,

Actually, not that one. This is more like "why are you working on reward learning -- even if you solved it we'd still be worried about mesa optimization". Possibly no one believes this, but I often feel like this implication is present. I don't have any concrete examples at the moment; it's possible that I'm imagining it where it doesn't exist, or that this is only a fact about how I interpret other people rather than what they actually believe.

Comment by rohinmshah on Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists · 2020-12-02T19:03:52.353Z · LW · GW

But if you roll it 9 times and hide the 3 rolls in a certain direction, then you don't have log_2 of 6 = 2.6 bits. That would be true if you had 6 honest rolls (looking like 2:2:2) but 3:3:0 surely is not the same amount of evidence. I'm generally not sure how best to understand the effects of biases of this sort, and want to think about that more.

The general formula is $P(H \mid \text{obs}) \propto P(\text{obs} \mid H)\,P(H)$, where obs is the observation that you see and $H$ is a hypothesis. You need to calculate $P(\text{obs} \mid H)$ based on the problem setup; if you are given the ground truth of how the 9 rolls happen as well as the algorithm by which the 6 dice rolls to reveal are chosen, you can compute $P(\text{obs} \mid H)$ for each obs by brute force simulation of all possible worlds.
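The brute-force computation described above could be sketched roughly as follows. Everything specific here is an assumption made for illustration: a three-faced "die", a hypothetical reporter that hides three rolls (preferring face C, then B, then A), and two made-up hypotheses about the die. The update itself is just standard Bayes' rule.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

FACES = ("A", "B", "C")  # assumed three-outcome die

def reveal(rolls, hide_order=("C", "B", "A"), n_hidden=3):
    """Hypothetical reporter: hide n_hidden rolls, preferring faces earlier in hide_order.

    Returns the observation: counts of the revealed rolls for each face in FACES.
    """
    counts = {face: rolls.count(face) for face in FACES}
    remaining = n_hidden
    for face in hide_order:
        hidden = min(counts[face], remaining)
        counts[face] -= hidden
        remaining -= hidden
    return tuple(counts[face] for face in FACES)

def likelihoods(face_probs, n_rolls=9):
    """P(obs | hypothesis) for every obs, by enumerating all possible worlds."""
    table = defaultdict(Fraction)
    for rolls in product(FACES, repeat=n_rolls):
        p_world = Fraction(1)
        for face in rolls:
            p_world *= face_probs[face]
        table[reveal(rolls)] += p_world
    return table

def posterior(prior, hypotheses, obs):
    """Bayes' rule: P(H | obs) is proportional to P(obs | H) * P(H)."""
    unnormalized = {h: prior[h] * likelihoods(hypotheses[h])[obs] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Illustrative hypotheses only; neither comes from the original discussion.
HYPOTHESES = {
    "fair": {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)},
    "never_C": {"A": Fraction(1, 2), "B": Fraction(1, 2), "C": Fraction(0)},
}
PRIOR = {"fair": Fraction(1, 2), "never_C": Fraction(1, 2)}

print(posterior(PRIOR, HYPOTHESES, obs=(3, 3, 0)))
```

Exact fractions are used so that the likelihoods for each hypothesis sum to exactly 1, which makes the enumeration easy to sanity-check.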

Comment by rohinmshah on Reframing Impact · 2020-12-02T18:57:56.139Z · LW · GW

I'm nominating the entire sequence because it's brought a lot of conceptual clarity to the notion of "impact", and has allowed me to be much more precise in things I say about "impact".

Comment by rohinmshah on What failure looks like · 2020-12-02T18:55:46.862Z · LW · GW

As commenters have pointed out, the post is light on concrete details. Nonetheless, I found even the abstract stories much more compelling as descriptions-of-the-future (people usually focus on descriptions-of-the-world-if-we-bury-our-heads-in-the-sand). I think Part 2 in particular continues to be a good abstract description of the type of scenario that I personally am trying to avert.

Comment by rohinmshah on Risks from Learned Optimization: Introduction · 2020-12-02T18:46:18.305Z · LW · GW

I struggled a bit on deciding whether to nominate this sequence.

On the one hand, it brought a lot more prominence to the inner alignment problem by making an argument for it in a lot more detail than had been done before.

On the other hand, on my beliefs, the framework it presents has an overly narrow view of what counts as inner alignment, relies on a model of AI development that I do not think is accurate, causes people to say "but what about mesa optimization" in response to any advance that doesn't involve mesa optimization even if the advance is useful for other reasons, has led to significant confusion over what exactly does and does not count as mesa optimization, and tends to cause people to take worse steps in choosing future research topics. (I expect all of these claims will be controversial.)

Still, that the conversation is happening at all is a vast improvement over the previous situation of relative (public) silence on the problem. Saying a bunch of confused thoughts is often the precursor to an actual good understanding of a topic. As such I've decided to nominate it for that contribution.

Comment by rohinmshah on Introduction to Cartesian Frames · 2020-12-02T17:14:30.503Z · LW · GW

It should, good catch, thanks!

Comment by rohinmshah on AGI Predictions · 2020-12-01T17:01:37.030Z · LW · GW

Fwiw, I think the operationalization of the question is stronger than it appears at first glance, and that's why estimates are low.

Comment by rohinmshah on AGI Predictions · 2020-12-01T16:57:52.616Z · LW · GW

Some ways you might think scenario #1 won't happen:

Also: we solve alignment really well on paper, and that's why deception doesn't arise. (I assign non-trivial probability to this.)

Comment by rohinmshah on AGI Predictions · 2020-12-01T16:55:32.461Z · LW · GW

why is everyone so optimistic??

Some reasons.

Comment by rohinmshah on Committing, Assuming, Externalizing, and Internalizing · 2020-11-30T01:40:35.004Z · LW · GW

Oh yup I was misinterpreting how B was defined, and that would also mess up my proof. Thanks!

Comment by rohinmshah on Committing, Assuming, Externalizing, and Internalizing · 2020-11-29T23:49:59.538Z · LW · GW

Hmm, I'm not seeing it. Taking your example, let's say that , and , all in the obvious way.

Whether or not it rains would be formalized by the partition .

Plugging this in to the definition from worlds, I get that .

Plugging this in to the definition of a quotient, I get that  (the singleton containing the identity function).

Since , we get out a Cartesian frame whose agent has only one option, for which all properties are trivially observable.

Comment by rohinmshah on Introduction to Cartesian Frames · 2020-11-29T23:32:56.692Z · LW · GW

Planned summary (of the full sequence) for the Alignment Newsletter:

The <@embedded agency sequence@>(@Embedded Agents@) hammered in the fact that there is no clean, sharp dividing line between an agent and its environment. This sequence proposes an alternate formalism: Cartesian frames. Note this is a paradigm that helps us _think about agency_: you should not be expecting some novel result that, say, tells us how to look at a neural net and find agents within it.

The core idea is that rather than _assuming_ the existence of a Cartesian dividing line, we consider how such a dividing line could be _constructed_. For example, when we think of a sports team as an agent, the environment consists of the playing field and the other team; but we could also consider a specific player as an agent, in which case the environment consists of the rest of the players (on both teams) and the playing field. Each of these is a valid way of carving up what actually happens into an “agent” and an “environment”; they are _frames_ by which we can more easily understand what’s going on, hence the name “Cartesian frames”.

A Cartesian frame takes **choice** as fundamental: the agent is modeled as a set of options that it can freely choose between. This means that the formulation cannot be directly applied to deterministic physical laws. It instead models what agency looks like [“from the inside”](https://www.lesswrong.com/posts/yA4gF5KrboK2m2Xu7/how-an-algorithm-feels-from-inside). _If_ you are modeling a part of the world as capable of making choices, _then_ a Cartesian frame is appropriate to use to understand the perspective of that choice-making entity.

Formally, a Cartesian frame consists of a set of agent options A, a set of environment options E, a set of possible worlds W, and an interaction function that, given an agent option and an environment option, specifies which world results. Intuitively, the agent can “choose” an agent option, the environment can “choose” an environment option, and together these produce some world. You might notice that we’re treating the agent and environment symmetrically; this is intentional, and means that we can define analogs of all of our agent notions for environments as well (though they may not have nice philosophical interpretations).

The full sequence uses a lot of category theory to define operations on these sorts of objects and show various properties of the objects and their operations. I will not be summarizing this here; instead, I will talk about their philosophical interpretations.

First, let’s look at an example of using a Cartesian frame on something that isn’t typically thought of as an agent: the atmosphere, within the broader climate system. The atmosphere can “choose” whether to trap sunlight or not. Meanwhile, in the environment, either the ice sheets could melt or they could not. If sunlight is trapped and the ice sheets melt, then the world is Hot. If exactly one of these is true, then the world is Neutral. Otherwise, the world is Cool.

(Yes, this seems very unnatural. That’s good! The atmosphere shouldn’t be modeled as an agent! I’m choosing this example because its unintuitive nature makes it more likely that you think about the underlying rule, rather than just the superficial example. I will return to more intuitive examples later.)
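As a minimal sketch, the formal definition and the atmosphere example can be written down directly. The class name `CartesianFrame`, its field names, and the option labels (`trap`, `no_trap`, `melt`, `no_melt`) are illustrative choices, not notation from the sequence.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable

@dataclass(frozen=True)
class CartesianFrame:
    agent_options: FrozenSet[Hashable]                   # A
    env_options: FrozenSet[Hashable]                     # E
    worlds: FrozenSet[Hashable]                          # W
    interact: Callable[[Hashable, Hashable], Hashable]   # A x E -> W

    def world(self, a, e):
        """Return the world produced by agent option a and environment option e."""
        w = self.interact(a, e)
        assert w in self.worlds, "interaction must land in W"
        return w

def climate(a, e):
    """Hot if sunlight is trapped and the ice melts, Neutral if exactly one, else Cool."""
    trapped, melted = a == "trap", e == "melt"
    if trapped and melted:
        return "Hot"
    if trapped or melted:
        return "Neutral"
    return "Cool"

atmosphere = CartesianFrame(
    agent_options=frozenset({"trap", "no_trap"}),
    env_options=frozenset({"melt", "no_melt"}),
    worlds=frozenset({"Hot", "Neutral", "Cool"}),
    interact=climate,
)

for a in sorted(atmosphere.agent_options):
    for e in sorted(atmosphere.env_options):
        print(a, e, atmosphere.world(a, e))
```

Note that nothing in the structure distinguishes "agent" from "environment" beyond which argument slot they occupy, which is the symmetry mentioned above.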

**Controllables**

A _property_ of the world is something like “it is neutral or warmer”. An agent can _ensure_ a property if it has some option such that no matter what environment option is chosen, the property is true of the resulting world. The atmosphere could ensure the warmth property above by “choosing” to trap sunlight. Similarly the agent can _prevent_ a property if it can guarantee that the property will not hold, regardless of the environment option. For example, the atmosphere can prevent the property “it is hot”, by “choosing” not to trap sunlight. The agent can _control_ a property if it can both ensure and prevent it. In our example, there is no property that the atmosphere can control.
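The definitions of ensure, prevent, and control can be checked by brute force on the atmosphere example. This is only a sketch: a property is represented as a set of worlds, and the option labels are made up for illustration.

```python
from itertools import combinations

AGENT_OPTIONS = ("trap", "no_trap")
ENV_OPTIONS = ("melt", "no_melt")
WORLDS = ("Cool", "Neutral", "Hot")

def world(a, e):
    """Hot if both events happen, Neutral if exactly one does, Cool otherwise."""
    return WORLDS[(a == "trap") + (e == "melt")]

def can_ensure(prop):
    """Some agent option makes prop true whatever the environment does."""
    return any(all(world(a, e) in prop for e in ENV_OPTIONS) for a in AGENT_OPTIONS)

def can_prevent(prop):
    """Some agent option makes prop false whatever the environment does."""
    return any(all(world(a, e) not in prop for e in ENV_OPTIONS) for a in AGENT_OPTIONS)

def can_control(prop):
    return can_ensure(prop) and can_prevent(prop)

# Every property is a subset of WORLDS, so all of them can be enumerated.
all_properties = [set(c) for r in range(len(WORLDS) + 1) for c in combinations(WORLDS, r)]
controllable = [p for p in all_properties if can_control(p)]

print(can_ensure({"Neutral", "Hot"}), can_prevent({"Hot"}), controllable)
```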

**Coarsening or refining worlds**

We often want to describe reality at different levels of abstraction. Sometimes we would like to talk about the behavior of various companies; at other times we might want to look at an individual employee. We can do this by having a function that maps low-level (refined) worlds to high-level (coarsened) worlds. In our example above, consider the possible worlds {YY, YN, NY, NN}, where the first letter of a world corresponds to whether sunlight was trapped (Yes or No), and the second corresponds to whether the ice sheets melted. The worlds {Hot, Neutral, Cool} that we had originally are a coarsened version of this, where we map YY to Hot, YN and NY to Neutral, and NN to Cool.
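The coarsening map in this example is small enough to write out in full. The function names below are illustrative only.

```python
# Map from refined worlds (sunlight trapped?, ice melted?) to coarsened worlds.
COARSEN = {"YY": "Hot", "YN": "Neutral", "NY": "Neutral", "NN": "Cool"}

def refined_world(sunlight_trapped: bool, ice_melted: bool) -> str:
    """First letter: was sunlight trapped; second letter: did the ice sheets melt."""
    return ("Y" if sunlight_trapped else "N") + ("Y" if ice_melted else "N")

def coarse_world(sunlight_trapped: bool, ice_melted: bool) -> str:
    """Compute the refined world, then forget the detail the coarse model ignores."""
    return COARSEN[refined_world(sunlight_trapped, ice_melted)]

def refinements(coarse: str) -> list:
    """All refined worlds that collapse to the given coarse world."""
    return sorted(w for w, c in COARSEN.items() if c == coarse)

print(coarse_world(True, False), refinements("Neutral"))
```

The coarse world Neutral has two refinements, which is exactly the information lost by coarsening.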

**Interfaces**

A major upside of Cartesian frames is that given the set of possible worlds that can occur, we can choose how to divide it up into an “agent” and an “environment”. Most of the interesting aspects of Cartesian frames are in the relationships between different ways of doing this division, for the same set of possible worlds.

First, we have interfaces. Given two different Cartesian frames <A, E, W> and <B, F, W> with the same set of worlds, an interface allows us to interpret the agent A as being used in place of the agent B. Specifically, if A would choose an option a, the interface maps this to one of B’s options b. This is then combined with the environment option f (from F) to produce a world w.

A valid interface also needs to be able to map the environment option f to e, and then combine it with the agent option a to get the world. This alternate way of computing the world must always give the same answer.

Since A can be used in place of B, all of A’s options must have equivalents in B. However, B could have options that A doesn’t. So the existence of this interface implies that A is “weaker” in a sense than B. (There are a bunch of caveats here.)

(Relevant terms in the sequence: _morphism_)
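For small frames, the interface condition described above can be checked by exhaustive search. The sketch below assumes an interface is a pair of maps, g from A to B and h from F to E, such that both ways of computing the world agree. The two frames are hypothetical: a "team" that fully decides whether a button is pressed, and "Alice", for whom the environment might also press it.

```python
from itertools import product

def pressed(someone_presses):
    return "pressed" if someone_presses else "not_pressed"

# Each frame is (agent options, environment options, interaction function).
ALICE = (
    ("press", "wait"),
    ("other_presses", "other_waits"),
    lambda a, e: pressed(a == "press" or e == "other_presses"),
)
TEAM = (
    ("press", "wait"),
    ("nothing",),
    lambda b, f: pressed(b == "press"),
)

def find_interface(src, dst):
    """Search for maps g: A -> B and h: F -> E with g(a) * f == a . h(f) for all a, f."""
    (A, E, dot), (B, F, star) = src, dst
    for g_values in product(B, repeat=len(A)):
        g = dict(zip(A, g_values))
        for h_values in product(E, repeat=len(F)):
            h = dict(zip(F, h_values))
            if all(star(g[a], f) == dot(a, h[f]) for a in A for f in F):
                return g, h
    return None  # no valid interface exists

print(find_interface(ALICE, TEAM))  # Alice used in place of the team
print(find_interface(TEAM, ALICE))  # the team used in place of Alice
```

In this toy case an interface exists in only one direction, matching the intuition that Alice is the weaker of the two agents.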

**Decomposing agents into teams of subagents**

The first kind of subagent we will consider is a subagent that can control “part of” the agent’s options. Consider for example a coordination game, where there are N players who each individually can choose whether or not to press a Big Red Button. There are only two possible worlds: either the button is pressed, or it is not pressed. For now, let’s assume there are two players, Alice and Bob.

One possible Cartesian frame is the frame for the entire team. In this case, the team has perfect control over the state of the button -- the agent options are either to press the button or not to press the button, and the environment does not have any options (or more accurately, it has a single “do nothing” option).

However, we can also decompose this into separate Alice and Bob _subagents_. What does a Cartesian frame for Alice look like? Well, Alice also has two options -- press the button, or don’t. However, Alice does not have perfect control over the result: from her perspective, Bob is part of the environment. As a result, for Alice, the environment also has two options -- press the button, or don’t. The button is pressed if Alice presses it _or_ if the environment presses it. (The Cartesian frame for Bob is identical, since he is in the same position that Alice is in.)

Note however that this decomposition isn’t perfect: given the Cartesian frames for Alice and Bob, you cannot uniquely recover the original Cartesian frame for the team. This is because both Alice and Bob’s frames say that the environment has some ability to press the button -- _we_ know that this is just from Alice and Bob themselves, but given just the frames we can’t be sure that there isn’t a third person Charlie who also might press the button. So, when we combine Alice and Bob back into the frame for a two-person team, we don’t know whether the environment should have the ability to press the button. This makes the mathematical definition of this kind of subagent a bit trickier, though it still works out.

Another important note is that this is relative to how coarsely you model the world. We used a fairly coarse model in this example: only whether or not the button was pressed. If we instead used a finer model that tracked which subset of people pressed the button, then we _would_ be able to uniquely recover the team’s Cartesian frame from Alice and Bob’s individual frames.

(Relevant terms in the sequence: _multiplicative subagents, sub-tensors, tensors_)
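A rough way to see the ambiguity is to compare what Alice's environment looks like with and without a hypothetical third player. The sketch assumes that environment options which behave identically can be merged, so a frame is summarized by its set of distinct environment behaviours; the helper name `alice_columns` is made up.

```python
from itertools import product

def alice_columns(others, world_fn):
    """Distinct environment behaviours from Alice's perspective.

    Each behaviour is a pair: (world if Alice presses, world if Alice waits).
    """
    columns = set()
    for choices in product((True, False), repeat=len(others)):
        pressers = {name for name, presses in zip(others, choices) if presses}
        column = tuple(
            world_fn(pressers | ({"Alice"} if alice_presses else set()))
            for alice_presses in (True, False)
        )
        columns.add(column)
    return columns

def coarse(pressers):
    """Coarse world model: only whether the button was pressed."""
    return "pressed" if pressers else "not_pressed"

def fine(pressers):
    """Fine world model: which subset of people pressed the button."""
    return frozenset(pressers)

# Coarse worlds: Alice's frame cannot tell whether Charlie exists.
print(alice_columns(["Bob"], coarse) == alice_columns(["Bob", "Charlie"], coarse))
# Fine worlds: the two situations give different frames.
print(alice_columns(["Bob"], fine) == alice_columns(["Bob", "Charlie"], fine))
```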

**Externalizing and internalizing**

This decomposition isn’t just for teams of people: even a single “mind” can often be thought of as the interaction of various parts. For example, hierarchical decision-making can be thought of as the interaction between multiple agents at different levels of the hierarchy.

This decomposition can be done using _externalization_. Externalization allows you to take an existing Cartesian frame and some specific property of the world, and then construct a new Cartesian frame where that property of the world is controlled by the environment.

Concretely, let’s imagine a Cartesian frame for Alice, that represents her decision on whether to cook a meal or eat out. If she chooses to cook a meal, then she must also decide which recipe to follow. If she chooses to eat out, she must decide which restaurant to eat out at.

We can externalize the high-level choice of whether Alice cooks a meal or eats out. This results in a Cartesian frame where the environment chooses whether Alice is cooking or eating out, and the agent must then choose a restaurant or recipe as appropriate. This is the Cartesian frame corresponding to the low-level policy that must pursue whatever subgoal is chosen by the high-level planning module (which is now part of the environment). The agent of this frame is a subagent of Alice.

The reverse operation is called internalization, where some property of the world is brought under the control of the agent. In the above example, if we take the Cartesian frame for the low-level policy, and then internalize the cooking / eating out choice, we get back the Cartesian frame for Alice as a unified whole.
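The cooking example can be sketched in code as follows. This is a simplified illustration, not the formal definition from the sequence: the recipe and restaurant names are invented, Alice's original environment is taken to be trivial, and an option of the low-level policy is modeled as a (recipe, restaurant) pair saying what to do in each case.

```python
from itertools import product

RECIPES = ("pasta", "curry")           # invented for illustration
RESTAURANTS = ("diner", "sushi_bar")   # invented for illustration
HIGH_LEVEL_CHOICES = ("cook", "eat_out")

# Alice as a unified whole: she picks both the high-level plan and the detail.
ALICE_OPTIONS = [("cook", r) for r in RECIPES] + [("eat_out", s) for s in RESTAURANTS]

# Externalized frame: the environment picks the high-level choice, and each
# agent option is a low-level policy covering both cases.
LOW_LEVEL_OPTIONS = list(product(RECIPES, RESTAURANTS))

def low_level_world(policy, high_level):
    recipe, restaurant = policy
    return ("cook", recipe) if high_level == "cook" else ("eat_out", restaurant)

# Internalizing the high-level choice again: the agent now picks it as well.
INTERNALIZED_OPTIONS = list(product(HIGH_LEVEL_CHOICES, LOW_LEVEL_OPTIONS))

def internalized_world(option):
    high_level, policy = option
    return low_level_world(policy, high_level)

reachable = {internalized_world(o) for o in INTERNALIZED_OPTIONS}
print(reachable == set(ALICE_OPTIONS))

# The low-level policy alone cannot guarantee that Alice cooks.
print(any(all(low_level_world(p, h)[0] == "cook" for h in HIGH_LEVEL_CHOICES)
          for p in LOW_LEVEL_OPTIONS))
```

The internalized agent has redundant options (several of them produce the same world), so it matches the original Alice only in the sense that the achievable outcomes are the same.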

Note that in general externalization and internalization are _not_ inverses of each other. As a simple example, if you externalize something that is already “in the environment” (e.g. whether it is raining, in a frame for Alice), that does nothing, but when you then internalize it, that thing is now assumed to be under the agent’s control (e.g. now the “agent” in the frame can control whether or not it is raining). We will return to this point when we talk about observability.

**Decomposing agents into disjunctions of subagents**

Our subagents so far have been “team-based”: the original agent could be thought of as a supervisor that got to control all of the subagents together. (The team agent in the button-pressing game could be thought of as controlling both Alice and Bob’s actions; in the cooking / eating out example Alice could be thought of as controlling both the high-level subgoal selection as well as the low-level policy that executes on the subgoals.)

The sequence also introduces another decomposition into subagents, where the superagent can be thought of as a supervisor that gets to choose _which_ of the subagents gets to control the overall behavior. Thus, the superagent can do anything that either of the subagents could do.

Let’s return to our cooking / eating out example. We previously saw that we could decompose Alice into a high-level subgoal-choosing subagent that chooses whether to cook or eat out, and a low-level subgoal-execution subagent that then chooses which recipe to make or which restaurant to go to. We can also decompose Alice as being the choice of two subagents: one that chooses which restaurant to go to, and one that chooses which recipe to make. The union of these subagents is an agent that first chooses whether to go to a restaurant or to make a recipe, and then uses the appropriate subagent to choose the restaurant or recipe: this is exactly a description of Alice.

(Relevant terms in the sequence: _additive subagents, sub-sums, sums_)
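As a sketch of this kind of union, assume the combined agent's options are the tagged union of the two subagents' options, and that the combined environment must supply an option for each subagent. The recipe and restaurant names and the function name `frame_sum` are illustrative.

```python
RECIPES = ("pasta", "curry")
RESTAURANTS = ("diner", "sushi_bar")

# Each frame is (agent options, environment options, interaction function).
recipe_chooser = (RECIPES, ("nothing",), lambda a, e: ("cook", a))
restaurant_chooser = (RESTAURANTS, ("nothing",), lambda a, e: ("eat_out", a))

def frame_sum(left, right):
    """Agent picks which subagent acts; environment has a response ready for each."""
    (A, E, dot), (B, F, star) = left, right
    agent = [("left", a) for a in A] + [("right", b) for b in B]
    env = [(e, f) for e in E for f in F]

    def interact(option, env_option):
        side, choice = option
        e, f = env_option
        return dot(choice, e) if side == "left" else star(choice, f)

    return agent, env, interact

agent, env, interact = frame_sum(recipe_chooser, restaurant_chooser)
worlds = {interact(a, e) for a in agent for e in env}
print(sorted(worlds))
```

With trivial environments on both sides, the combined agent simply has every recipe and every restaurant available, which is the description of Alice given above.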

**Committing and assuming**

One way to think about the subagents of the previous example is that they are the result of Alice _committing_ to a particular subset of choices. If Alice commits to eating out (but doesn’t specify at what restaurant), then the resulting frame is equivalent to the restaurant-choosing subagent.

Similarly to committing, we can also talk about _assuming_. Just as commitments restrict the set of options available to the agent, assumptions restrict the set of options available to the environment.

Just as we can union two agents together to get an agent that gets to choose between two subagents, we can also union two environments together to get an environment that gets to choose between two subenvironments. (In this case the agent is more constrained: it must be able to handle the environment regardless of which way the environment chooses.)

(Relevant terms in the sequence: _product_)
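Committing and assuming can both be sketched as filters. The version below assumes a commitment keeps the agent options that guarantee a property against every environment option, and an assumption keeps the environment options that guarantee it against every agent option. The options and property names are made up.

```python
AGENT_OPTIONS = ("cook_pasta", "eat_at_diner", "eat_at_sushi_bar")
ENV_OPTIONS = ("rain", "sun")

def world(a, e):
    """The world simply records what Alice did and what the weather was."""
    return (a, e)

def commit(agent_options, env_options, prop):
    """Keep only the agent options that guarantee prop in every environment."""
    return tuple(a for a in agent_options if all(prop(world(a, e)) for e in env_options))

def assume(agent_options, env_options, prop):
    """Keep only the environment options that guarantee prop for every agent option."""
    return tuple(e for e in env_options if all(prop(world(a, e)) for a in agent_options))

def eats_out(w):
    return w[0].startswith("eat_at")

def is_sunny(w):
    return w[1] == "sun"

print(commit(AGENT_OPTIONS, ENV_OPTIONS, eats_out))  # restaurant-choosing subagent
print(assume(AGENT_OPTIONS, ENV_OPTIONS, is_sunny))  # environment restricted to sun
print(assume(AGENT_OPTIONS, ENV_OPTIONS, eats_out))  # no environment option guarantees this
```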

**Observables**

The most interesting (to me) part of this sequence was the various equivalent definitions of what it means for something to be observable. The overall story is similar to the one in [Knowledge is Freedom](https://www.alignmentforum.org/posts/b3Bt9Cz4hEtR26ANX/knowledge-is-freedom): an agent is said to “observe” a property P if it is capable of making different decisions based on whether P holds or not.

Thus we get our first definition of observability: **a property P of the world is _observable_ if, for any two agent options a and b, the agent also has an option that is equivalent to “if P then a else b”.**

Intuitively, this is meant to be similar to the notion of “inputs” to an agent. A neural net should be able to express arbitrary computations over its inputs, and so if we view the neural net as “choosing” what computation to do (by “choosing” what its parameters are), then the neural net can have its outputs (agent options) depend in arbitrary ways on the inputs. Thus, we say that the neural net “observes” its inputs, because what the neural net does can depend freely on the inputs.

Note that this is a very black-or-white criterion: we must be able to express _every_ conditional policy on the property for it to be observable; if even one such policy is not expressible then the property is not observable.
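The first definition can be tested by brute force on a toy frame. The sketch assumes that an option c counts as "if P then a else b" when, for every environment option, c produces the same world as a if that world satisfies P, and the same world as b otherwise. The weather and umbrella example is invented for illustration.

```python
from itertools import product

WEATHER = ("rain", "sun")
ACTIONS = ("umbrella", "no_umbrella")

def world(policy, weather):
    """A policy is a tuple of actions, one per weather; the world records both."""
    return (weather, policy[WEATHER.index(weather)])

def is_observable(agent_options, env_options, world_fn, prop):
    def implements(c, a, b):
        # c must match a wherever its own world satisfies prop, and b elsewhere.
        return all(
            world_fn(c, e) == (world_fn(a, e) if prop(world_fn(c, e)) else world_fn(b, e))
            for e in env_options
        )

    return all(
        any(implements(c, a, b) for c in agent_options)
        for a, b in product(agent_options, repeat=2)
    )

def raining(w):
    return w[0] == "rain"

ALL_POLICIES = tuple(product(ACTIONS, repeat=len(WEATHER)))
CONSTANT_POLICIES = tuple(p for p in ALL_POLICIES if p[0] == p[1])

print(is_observable(ALL_POLICIES, WEATHER, world, raining))       # can condition on rain
print(is_observable(CONSTANT_POLICIES, WEATHER, world, raining))  # one conditional policy is missing
```

Removing the two weather-dependent policies is enough to make the property unobservable, which illustrates how all-or-nothing the criterion is.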

One way to think about this is that an observable property needs to be completely under the control of the environment, that is, the environment option should completely determine whether the resulting world satisfies the property or not -- nothing the agent does can matter (for this property). To see this, suppose that there was some environment option e that didn’t fully determine a property P, so that there are agent options a and b such that the world corresponding to (a, e) satisfies P but the one corresponding to (b, e) does not. Then our agent cannot implement the conditional policy “if P then b else a”, because it would lead to a self-referential contradiction (akin to “this sentence is false”) when the environment chooses e. Thus, P cannot be observable.

This condition is necessary but not sufficient for observability: it is possible for the environment to fully control P, while the agent is still unable to always condition on P. So we do need something extra. Nevertheless, this intuition suggests a few other ways of thinking about observability. The key idea is to identify a decomposition of the agent based on P that should only work if the environment has all the control, and then to identify a union step that puts the agent back together, that automatically adds in all of the policies that are conditional on P. I’ll describe these definitions here; the sequence proves that they are in fact equivalent to the original definition above.

First, recall that externalization and internalization are methods that allow us to “transfer” control of some property from the agent to the environment and vice versa. Thus, if all the control of P is in the environment, one would hope that internalization followed by externalization just transfers the control back and forth. In addition, when we externalize P, the externalization process will enforce that the agent can condition on P arbitrarily (this is how it is defined). This suggests the definition: **P is observable if and only if internalizing P followed by externalizing P gives us back the original frame.**

Second, if the environment has all of the control over P, then we should be able to decompose the agent into two parts: one that decides what to do when P is true, and one that decides what to do when P is false. We can achieve this using _assumptions_, that is, the first agent is the original agent under the assumption that P is true, and the second is under the assumption that P is false. Note that if the environment didn’t have perfect control over P, this would not work, as the environment options where P is not guaranteed to be true or false would simply be deleted, and could not be reconstructed from the two new agents.

We now need to specify how to put the agents back together, in a way that includes all the conditional policies on P. There are actually two variants in how we can do this:

In the first case, we combine the agents by unioning the environments, which lets the environment choose whether P is true or not. Given how this union is defined, the new agent is able to specify both what to do given the environment’s choice, _as well as_ what it would have done in the counterfactual case where the environment had decided P differently. This allows it to implement all conditional policies on P. So, **P is observable if and only if decomposing the frame using assumptions on P, and then unioning the environments of the resulting frames gives back the original frame.**

In the second case, after getting agents via assumption on P, you extend each agent so that in the case where its assumption is false, it is as though it takes a noop action. Intuitively, the resulting agent is an agent that is hobbled so that it has no power in worlds where P comes out differently than was assumed. These agents are then combined into a team. Intuitively, the team selects an option of the form “the first agent tries to do X (which only succeeds when P is true) and the second agent tries to do Y (which only succeeds when P is false)”. Like the previous decomposition, this specifies both what to do in whatever actual environment results, as well as what would have been done in the counterfactual world where the value of P was reversed. Thus, this way of combining the agents once again adds in all conditional policies on P. So, **P is observable if and only if decomposing the frame using assumptions on P, then hobbling the resulting frames in cases where their assumptions are false, and then putting the agents back in a team, is equivalent to the original frame.**

**Time**

Cartesian frames do not have an intrinsic notion of time. However, we can still use them to model sequential processes, by having the agent options be _policies_ rather than actions, and having the worlds be histories or trajectories rather than states.

To say useful things about time, we need to broaden our notion of observables. So far I’ve been talking about whether you can observe binary properties P that are either true or false. In fact, all of the definitions can be easily generalized to N-ary properties P that can take on one of N values. We’ll be using this notion of observability here.

Consider a game of chess where Alice plays as white and Bob as black. Intuitively, when Alice is choosing her second move, she can observe Bob’s first move. However, the property “Bob’s first move” would not be observable in Alice’s Cartesian frame, because Alice’s _first_ move cannot depend on Bob’s first move (since Bob hasn’t made it yet), and so when deciding the first move we can’t implement policies that condition on what Bob’s first move is.

Really, we want some way to say “after Alice has made her first move, from the perspective of the rest of her decisions, Bob’s first move is observable”. But we know how to remove some control from the agent in order to get the perspective of “everything else” -- that’s externalization! In particular, in Alice’s frame, if we externalize the property “Alice’s first move”, then the property “Bob’s first move” _is_ observable in the new frame.

This suggests a way to define a sequence of frames that represent the passage of time: we define the Tth frame as “the original frame, but with the first T moves externalized”, or equivalently as “the T-1th frame, but with the Tth move externalized”. Each of these frames is a subagent of the original frame, since we can think of the full agent (Alice) as the team of “the agent that plays the first T moves” and “the agent that plays the T+1th move and onwards”. As you might expect, as “time” progresses, the agent loses controllables and gains observables. For example, by move 3 Alice can no longer control her first two moves, but she can now observe Bob’s first two moves, relative to Alice at the beginning of the game.

Planned opinion:

I like this way of thinking about agency: we’ve been talking about “where to draw the line around the agent” for quite a while in AI safety, but there hasn’t been a nice formalization of this until now. In particular, it’s very nice that we can compare different ways of drawing the line around the agent, and make precise various concepts around this, such as “subagent”.

I’ve also previously liked the notion that “to observe P is to be able to change your decisions based on the value of P”, but I hadn’t really seen much discussion about it until now. This sequence makes some real progress on conceptual understanding of this perspective: in particular, the notion that observability requires “all the control to be in the environment” is not one I had until now. (Though I should note that this particular phrasing is mine, and I’m not sure the author would agree with the phrasing.)

One of my checks for the utility of foundational theory for a particular application is to see whether the key results can be explained without having to delve into esoteric mathematical notation. I think this sequence does very well on this metric -- for the most part I didn’t even read the proofs, yet I was able to reconstruct conceptual arguments for many of the theorems that are convincing to me. (They aren’t and shouldn’t be as convincing as the proofs themselves.) However, not all of the concepts score so well on this -- for example, the generic subagent definition was sufficiently unintuitive to me that I did not include it in this summary.

Comment by rohinmshah on Committing, Assuming, Externalizing, and Internalizing · 2020-11-29T22:49:48.424Z · LW · GW

Random thing I wanted to check, figured I might as well write it up:

Claim:  is observable in the frame .

Proof sketch: Every column of  is of the form , and every world involving the same  has the same  by construction of . Thus, if our agent can condition on subsets of , then our agent can condition on  as well. We'll denote a subset of  by .

Given two agent options  in , we can implement the conditional policy "if  then  else " by defining . (This can easily be generalized to partitions.) Thus we can implement all conditional policies, and so  is observable.

I think this is correct, though I've done enough handwaving and skipping of proof steps that I'm not confident.

Comment by rohinmshah on Matt Botvinick on the spontaneous emergence of learning algorithms · 2020-11-24T17:21:16.256Z · LW · GW

It might well be that 1) people who already know RL shouldn't be much surprised by this result and 2) people who don't know much RL are justified in updating on this info (towards mesa-optimizers arising more easily).

I agree. It seems pretty bad if the participants of a forum about AI alignment don't know RL.

Comment by rohinmshah on AI safety via market making · 2020-11-20T02:25:31.770Z · LW · GW

If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose - the honest debater will pay to recurse until they get to a winning node.

This part makes sense.

If the dishonest debater makes some claim and plans to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn't pay to recurse, the judge will just see these two alternative answers and won't trust the dishonest answer.

So in this case it's a stalemate, presumably? If the two players disagree but neither pays to recurse, how should the judge make a decision?

Comment by rohinmshah on AI safety via market making · 2020-11-17T19:26:40.871Z · LW · GW

Hmm, I was imagining that the honest player would have to recurse on the statements in order to exhibit the circular argument, so it seems to me like this would penalize the honest player rather than the circular player. Can you explain what the honest player would do against the circular player such that this "payment" disadvantages the circular player?

EDIT: Maybe you meant the case where the circular argument is too long to exhibit within the debate, but I think I still don't see how this helps.

Comment by rohinmshah on Communication Prior as Alignment Strategy · 2020-11-14T17:11:49.704Z · LW · GW

If it were, then one of our first messages would be (a mathematical version of) "the behavior I want is approximately reward-maximizing".

Yeah, I agree that if we had a space of messages that was expressive enough to encode this, then it would be fine to work in behavior space.

Comment by rohinmshah on Communication Prior as Alignment Strategy · 2020-11-13T21:03:19.699Z · LW · GW

Yeah, this is a pretty common technique at CHAI (relevant search terms: pragmatics, pedagogy, Gricean semantics). Some related work:

I agree that it should be possible to do this over behavior instead of rewards, but behavior-space is much larger or more complex than reward-space and so it would require significantly more data in order to work as well.

Comment by rohinmshah on Misalignment and misuse: whose values are manifest? · 2020-11-13T20:42:20.534Z · LW · GW
• misalignment means the bad outcomes were wanted by AI (and not by its human creators), and
• accident means that the bad outcomes were not wanted by those in power but happened anyway due to error.

My impression was that accident just meant "the AI system's operator didn't want the bad thing to happen", so that it is a superset of misalignment.

Though I agree with the broader point that in realistic scenarios there is usually no single root cause to enable this sort of categorization.

Comment by rohinmshah on [AN #125]: Neural network scaling laws across multiple modalities · 2020-11-11T19:11:05.182Z · LW · GW

Known problem, should be fixed in the next few hours.

Comment by rohinmshah on Clarifying inner alignment terminology · 2020-11-10T06:49:13.043Z · LW · GW

Planned summary for the Alignment Newsletter:

This post clarifies the author’s definitions of various terms around inner alignment. Alignment is split into intent alignment and capability robustness, and then intent alignment is further subdivided into outer alignment and objective robustness. Inner alignment is one way of achieving objective robustness, in the specific case that you have a mesa optimizer. See the post for more details on the definitions.

Planned opinion:

I’m glad that definitions are being made clear, especially since I usually use these terms differently than the author. In particular, as mentioned in my opinion on the highlighted paper, I expect performance to smoothly go up with additional compute, data, and model capacity, and there won’t be a clear divide between capability robustness and objective robustness. As a result, I prefer not to divide these as much as is done in this post.

Comment by rohinmshah on Confucianism in AI Alignment · 2020-11-03T18:38:30.203Z · LW · GW

Changed second paragraph to:

This post suggests that in any training setup in which mesa optimizers would normally be incentivized, it is not sufficient to just prevent mesa optimization from happening. The fact that mesa optimizers could have arisen means that the incentives were bad. If you somehow removed mesa optimizers from the search space, there would still be a selection pressure for agents that, without any malicious intent, end up using heuristics that exploit the bad incentives. As a result, we should focus on fixing the incentives, rather than on excluding mesa optimizers from the search space.

How does that sound?

Comment by rohinmshah on Confucianism in AI Alignment · 2020-11-03T01:42:02.202Z · LW · GW

Planned summary for the Alignment Newsletter (note it's written quite differently from the post, and so I may have introduced errors, so please check more carefully than usual):

Suppose we trained our agent to behave well on some set of training tasks. <@Mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) suggests that we may still have a problem: the agent might perform poorly during deployment, because it ends up optimizing for some misaligned _mesa objective_ that only agrees with the base objective on the training distribution.

This post points out that this is not the only way systems can fail catastrophically during deployment: if the incentives were not designed appropriately, they may still select for agents that have learned heuristics that are not in our best interests, but nonetheless lead to acceptable performance during training. This can be true even if the agents are not explicitly “trying” to take advantage of the bad incentives, and thus can apply to agents that are not mesa optimizers.

Comment by rohinmshah on Confucianism in AI Alignment · 2020-11-02T23:32:04.427Z · LW · GW

Despite the fact that I commented on your previous post suggesting a different decomposition into "outer" and "inner" alignment, I strongly agree with the content of this post. I would just use different words to say it.

Comment by rohinmshah on "Inner Alignment Failures" Which Are Actually Outer Alignment Failures · 2020-11-02T21:38:50.516Z · LW · GW

Yeah I think that decomposition mostly makes sense and is pretty similar to mine.

My main quibble is that your definition of outer alignment seems to have a "for all possible distributions" because of the "limit of infinite data" requirement. (If it isn't all possible distributions and is just the training distribution, then in the IRD lava gridworld the reward that assigns +100 to lava and the policy that walks through lava when possible would be both outer and inner aligned, which seems bad.)

But then when arguing for the correctness of your outer alignment method, you need to talk about all possible situations that could come up, which seems not great. I'd rather have any "all possible situations" criteria be a part of inner alignment.

Another reason I prefer my decomposition is because it makes outer alignment a purely behavioral property, which is easier to check, much more conceptually grounded, and much more in line with what current outer alignment solutions guarantee.

Comment by rohinmshah on "Inner Alignment Failures" Which Are Actually Outer Alignment Failures · 2020-11-02T17:46:35.024Z · LW · GW

We could definitely break up what-I'm-calling-outer-alignment into one piece that says "this reward function would give good policies high reward if the policies were evaluated on the deployment distribution" (i.e. outer alignment) and a second piece which says "policies which perform well in training also perform well in deployment" (i.e. robustness).

Fwiw, I claim that this is the actually useful decomposition. (Though I'm not going to argue for it here -- I'm writing this comment mostly in the hopes that you think about this decomposition yourself.)

I'd say it slightly differently: outer alignment = "the reward function incentivizes good behavior on the training distribution" and robustness / inner alignment = "the learned policy has good behavior on the deployment distribution". Under these definitions, all you care about is inner alignment; outer alignment is instrumentally useful towards guaranteeing inner alignment, but if we got inner alignment some other way without getting outer alignment, that would be fine.

Another decomposition is: outer alignment = "good behavior on the training distribution", robustness / inner alignment = "non-catastrophic behavior on the deployment distribution". In this case, outer alignment is what tells you that your agent is actually useful, and inner alignment is what tells you it is safe.

I thought this was what the mesa optimizers paper was trying to point at, but I share your sense that the things people say and write are inconsistent with this decomposition. I mostly don't engage with discussion of mesa optimization on LW / AIAF because of this disconnect.

Comment by rohinmshah on Security Mindset and Takeoff Speeds · 2020-10-30T16:08:58.326Z · LW · GW

Well, it sounds like you were thinking that in the scenario I outlined, the previous largest system, 10x smaller, wasn't making much money?

No, I wasn't assuming that? I'm not sure why you think I was.

Tbc, given that you aren't arguing that we'd do 500,000x in one go, the second paragraph of my previous comment is moot.

Progress will be continuous; therefore we'll get warning shots; therefore if MIRI argues that a certain alignment problem may be present in a particular AI system, but thus far there hasn't been a warning shot for that problem, then MIRI is wrong.

Yes, as a prior. Obviously you'd want to look at the actual arguments they give and take that into account as well.

Comment by rohinmshah on Responses to Christiano on takeoff speeds? · 2020-10-30T16:01:31.906Z · LW · GW

No.

Comment by rohinmshah on Security Mindset and Takeoff Speeds · 2020-10-30T15:01:56.305Z · LW · GW

I meant that it would be a ~10x increase from what at the time was the previously largest system, not a 10x increase from GPT-3. I'm talking about the arguments I'd use given the evidence we'd have at that time, not the evidence we have now.

If you're arguing that a tech company would do this now before making systems in between GPT-3 and a human brain, I can't see how the path you outline is even remotely feasible -- you're positing a 500,000x increase in compute costs, which I think brings compute cost of the final training run alone to high hundreds of billions or low trillions of dollars, which is laughably far beyond OpenAI and DeepMind's budgets, and seems out of reach even for Google or other big tech companies.
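To make the order of magnitude concrete, here is a small back-of-the-envelope sketch in Python. It only inverts the claim above: given a 500,000x scale-up and a final bill somewhere in the "high hundreds of billions or low trillions", it computes what the starting training run would have to cost. The three totals are illustrative readings of that phrase, not figures from the comment, and the sketch assumes cost scales linearly with compute.

```python
SCALE_UP = 500_000  # multiplier quoted in the comment above

# Illustrative totals (assumptions chosen to span the quoted range).
illustrative_totals = {"$500B": 5e11, "$1T": 1e12, "$2T": 2e12}

# Assuming cost is proportional to compute, back out the baseline run cost.
implied_baseline = {
    label: total / SCALE_UP for label, total in illustrative_totals.items()
}

for label, baseline in implied_baseline.items():
    print(f"total {label} -> implied baseline run of ${baseline / 1e6:.0f}M")
```

Under these assumptions, the claim corresponds to a baseline training run costing on the order of a few million dollars.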

Comment by rohinmshah on Security Mindset and Takeoff Speeds · 2020-10-29T19:58:26.592Z · LW · GW

I don't mean a formal scaling law, just an intuitive "if we look at how much difference a 10x increase has made in the past to general cognitive ability, it seems extremely unlikely that this 10x increase will lead to an agent that is capable of taking over the world".

I don't expect that I would make this sort of argument against deception, just against existential catastrophe.

Comment by rohinmshah on Security Mindset and Takeoff Speeds · 2020-10-29T17:18:19.318Z · LW · GW

Yes, that scenario sounds quite likely to me, though I'd say the decision is made on the basis of belief in scaling laws / trend extrapolation rather than "slow takeoff".

I personally would probably make arguments similar to the ones you list for OpenSoft, and I do think MIRI would be wrong if they argued it was likely that the model was deceptive.

There's some discussion to be had about how risk-averse we should be given the extremely negative payoff of x-risk, and what that implies about deployment, which seems like the main thing I would be thinking about in this scenario.

Comment by rohinmshah on Security Mindset and Takeoff Speeds · 2020-10-29T17:10:12.312Z · LW · GW

I also agree that direct jumps in capability due to research insight are rare. But in part I think that's just because things get tried at small scale first, and so there's always going to be some scaling-up period where the new insight gets fed more and more resources, eventually outpacing the old state of the art. From a coarse-grained perspective GPT-2 relative to your favorite LSTM model from 2018 is the "jump in capability" due to research insight, it just got there in a not-so-discontinuous way.

Seems right to me.

if some company is partway through scaling up the hot new algorithm and (rather than training to completion) they trip the alarm that was searching for undesirable real-world behavior because of learned agent-like reasoning, what then?

(I'm not convinced this is a good tripwire, but under the assumption that it is:)

Ideally they have already applied safety solutions and so this doesn't even happen in the first place. But supposing this did happen, they turn off the AI system because they remember how Amabook lost a billion dollars through their AI system embezzling money from them, and they start looking into how to fix this issue.

Comment by rohinmshah on Security Mindset and Takeoff Speeds · 2020-10-28T21:03:47.037Z · LW · GW

And takeoff is highly likely to be slow enough that we'll get those sorts of real-world damages before it's too late.

I do also think that we could get warning shots in the more sped-up parts of the trajectory, and this could be helpful because we'll have adapted to the fact that we've sped up. It's just harder to tell a concrete story about what this looks like, because the world (or at least AI companies) will have changed so much.

I'd be interested to know if you think it's higher priority than the other things I'm working on.

If fast takeoff is plausible at all in the sense that I think people mean it, then it seems like by far the most important crux in prioritization within AI safety.

However, I don't expect to change my mind given arguments for fast takeoff -- I suspect my response will be "oh, you mean this other thing, which is totally compatible with my views", or "nope that just doesn't seem plausible given how (I believe) the world works".

MIRI's arguments for fast takeoff seem particularly important, given that a substantial fraction of all resources going into AI safety seem to depend on those arguments. (Although possibly MIRI believes that their approach is the best thing to do even in case of slow takeoff.)

I think overall that aggregates to "seems important, but not obviously the highest priority for you to write".

Comment by rohinmshah on Security Mindset and Takeoff Speeds · 2020-10-28T18:22:42.222Z · LW · GW

so then OpenAI comes along and dumps 100x or 10,000x the compute into something like it just to see what happens.

10000x would be unprecedented -- why wouldn't you first do a 100x run to make sure things work well before doing a 10000x run? (This only increases costs by 1%.)
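The 1% figure is just the ratio of the two run sizes. A minimal sketch, assuming cost is proportional to compute (the function name `pilot_overhead` is made up for illustration):

```python
def pilot_overhead(pilot_scale: float, full_scale: float) -> float:
    """Extra cost of a pilot run, as a fraction of the full run's cost.

    Assumes cost is proportional to compute.
    """
    return pilot_scale / full_scale


overhead = pilot_overhead(100, 10_000)
print(f"{overhead:.0%}")  # prints 1%
```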

Also, 10000x increase in compute corresponds to 100-1000x more parameters, which does not usually lead to things I would call "discontinuities" (e.g. GPT-2 to GPT-3 does not seem like an important discontinuity to me, even if we ignore the in-between models trained along the way). Put another way -- I'm happy to posit "sudden jumps" of size similar to the difference between GPT-3 and GPT-2 (they seem rare but possible); I don't think these should make us particularly pessimistic about engineering-style approaches to alignment.

I feel like I keep responding to this argument in the same way and I wish these predictions would be made in terms of $ spent and compared to current $ spent -- it just seems nearly impossible to have a discontinuity via compute at this point. Perhaps I should just write a post called "10,000x compute is not a discontinuity".

The story seems less obviously incorrect if we talk about discontinuity via major research insight, but historical track record seems to suggest this does not usually cause major discontinuities.

I'm concerned that either they won't bother to re-implement all the previous things that patched alignment problems, or that there won't be an obvious way to port some old patches to the new model (or that there will be an obvious way, but it doesn't work).

One assumes that they scale up the compute, notice some dangerous aspects, turn off the AI system, and then fix the problem. (Well, really, if we've already seen dangerous aspects from previous AI systems, one assumes they don't run it in the first place until they have ported the safety features.)

Comment by rohinmshah on Security Mindset and Takeoff Speeds · 2020-10-28T18:11:09.421Z · LW · GW

If this is what counts as a warning shot, then we've already encountered several warning shots already, right?

Kind of. None of the examples you mention have had significant real-world impacts (whereas in the example I give, people very directly lose millions of dollars). Possibly the Google gorilla example counts, because of the negative PR.

I do think that the boat race example has in fact been very influential and has the effects I expect of a "warning shot", but I usually try to reserve the term for cases with significant economic impact.

As a tangent, I notice you say "several months later." I worry that this is too long a time lag. I think slow takeoff is possible but so is fast takeoff

I'm on record in a few places as saying that a major crux for me is slow takeoff. I struggle to imagine a coherent world that matches what I think people mean by "fast takeoff"; I think most likely I don't understand what proponents mean by the phrase. When I ignore this fact and try to predict anyway using my best understanding of what they mean, I get quite pessimistic; iirc in my podcast with Buck I said something like 80% chance of doom.

and even on slow takeoff several months is a loooong time.

The system I'm describing is pretty weak and far from AGI; the world has probably not started accelerating yet (US GDP growth annually is maybe 4% at this point). Several months is still a short amount of time at this point in the trajectory.

I chose an earlier example because it's a lot easier to predict how we'll respond; as we get later in the trajectory I expect significant changes to how we do research and deployment that I can't predict ahead of time, and so the story has to get fuzzier.

Comment by rohinmshah on Security Mindset and Takeoff Speeds · 2020-10-27T17:19:47.451Z · LW · GW

Some day I will get around to doing this properly, but here's an example I've thought about before.

Opensoft finetunes their giant language model to work as a "cash register AI", that can take orders and handle payment at restaurants. (At deployment, the model is composed with speech-to-text and text-to-speech models so that people can just talk to it.)

Soon after deployment, someone figures out that they can get the AI system to give them their takeout order for free: when the cashier asks "How are you today?", they respond "Oh, you know, I just broke up with my partner of 10 years, but that's not your problem", to which the AI system responds "Oh no! I'm so sorry. Here, this one's on me."

Opensoft isn't worried: they know their AI system has a decent understanding of strategic interaction, and when it consistently loses because other agents change their behavior, it adapts to stop losing. However, two days later, this still works, and now millions of dollars are being lost. The system is taken down and humans take over the cash register role.

Opensoft engineers investigate what went wrong. After a few months, they have an answer internally: while the AI system was optimized to get money from the customers, since during training the AI system had effectively no control over the base amount of money charged but did have some control over the tip, the AI system had learned to value tips 100x as much as money (because doing so was instrumentally useful for getting behavior that properly optimized for tips). It turns out when people trick AI systems into giving them free food, many of them feel guilty and leave a tip, which is much larger than usual. So the AI system was perfectly happy to let this continue.

Safety researchers quickly connect this to the risk of capability generalization without objective generalization, and it quickly spreads as an example of potential risks either publicly or at least within the safety researchers at the top 10 AI companies.

----

I expect one response LW/AIAF readers would have to this is something like "but that isn't a warning shot for the real risks of AI systems like X, which only arises once the AI system is superintelligent", in which case I probably reply with one or more of the following:

• X is not likely
• X does arise before superintelligence
• There is a version X' that does happen before superintelligence, which can easily be extrapolated to X, and AI researchers will do so after seeing a warning shot for X'

Comment by rohinmshah on [AN #121]: Forecasting transformative AI timelines using biological anchors · 2020-10-24T16:23:14.657Z · LW · GW

I predict that this will not lead to transformative AI; I don't see how an algorithmic trading system leads to an impact on the world comparable to the industrial revolution.

You can tell a story where you get an Eliezer-style near-omniscient superintelligent algorithmic trading system that then reshapes the world because it is a superintelligence, and that the researchers thought that it was not a superintelligence and so assumed that the downside risk was bounded, but both clauses (Eliezer-style superintelligence and researchers being horribly miscalibrated) seem unlikely to me.

Comment by rohinmshah on rohinmshah's Shortform · 2020-10-24T16:19:10.791Z · LW · GW

In that example, X is "AI will not take over the world", so Y makes X more likely. So if someone comes to me and says "If we use <technique>, then AI will be safe", I might respond, "well, if we were using your technique, and we assume that AI does not have the ability to take over the world during training, it seems like the AI might still take over the world at deployment because <reason>".

I don't think this is a great example, it just happens to be the one I was using at the time, and I wanted to write it down. I'm explicitly trying for this to be a low-effort thing, so I'm not going to try to write more examples now.

EDIT: Actually, the double descent comment below has a similar structure, where X = "double descent occurs because we first fix bad errors and then regularize", and Y = "we're using an MLP / CNN with relu activations and vanilla gradient descent".

In fact, the AUP power comment does this too, where X = "we can penalize power by penalizing the ability to gain reward", and Y = "the environment is deterministic, has a true noop action, and has a state-based reward".

Maybe another way to say this is:

I endorse applying the "X proves too much" argument even to impossible scenarios, as long as the assumptions underlying the impossible scenarios have nothing to do with X. (Note this is not the case in formal logic, where if you start with an impossible scenario you can prove anything, and so you can never apply an "X proves too much" argument to an impossible scenario.)

Comment by rohinmshah on rohinmshah's Shortform · 2020-10-23T23:38:21.856Z · LW · GW

An argument form that I like:

You claim X. Surely assumption Y would not make it less likely that X under your argument for X. But from Y I can conclude not X.

I think this should be convincing even if Y is false, unless you can explain why your argument for X does not work under assumption Y.

An example: any AI safety story (X) should also work if you assume that the AI does not have the ability to take over the world during training (Y).

Comment by rohinmshah on GPT-X, Paperclip Maximizer? Analyzing AGI and Final Goals · 2020-10-23T19:49:28.356Z · LW · GW

You might be interested in Shaping Safer Goals.

Comment by rohinmshah on [AN #121]: Forecasting transformative AI timelines using biological anchors · 2020-10-23T15:39:56.347Z · LW · GW

Sure, I mean, logistic regression has had economic value and it doesn't seem meaningful to me to say whether it is "aligned" or "inner aligned". I'm talking about transformative AI systems, where downside risk is almost certainly not limited.

Comment by rohinmshah on A prior for technological discontinuities · 2020-10-16T14:35:17.056Z · LW · GW

Maybe GPT-3 isn't a discontinuity in perplexity, but is still a discontinuity in reasoning ability or common-sense understanding or wordsmithing or code-writing.

I was disagreeing with this statement in the OP:

GPT-3 was maybe a discontinuity for language models.

I agree that it "could have been" a discontinuity on those other metrics, and my argument doesn't apply there. I wasn't claiming it would.

I think Ajeya's report mostly assumes, rather than argues, that there won't be a discontinuity of resource investment. Maybe I'm forgetting something but I don't remember her analyzing the different major actors to see if any of them has shown signs of secretly running a Manhattan project or being open to doing so in the future.

It doesn't argue for it explicitly, but if you look at the section and the corresponding appendix, it just seems pretty infeasible for there to be a large discontinuity -- a Manhattan project in the US that had been going on for the last 5 years and finished tomorrow would cost ~$1T, while current projects cost ~$100M, and 4 orders of magnitude at the pace in AI and Compute would be a discontinuity of slightly under 4 years. This wouldn't be a large / robust discontinuity according to the AI Impacts methodology, and I think it wouldn't even pick this up as a "small" discontinuity?
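The "slightly under 4 years" figure can be reproduced with a short calculation. This is a sketch under one stated assumption: that "the pace in AI and Compute" means the roughly 3.4-month doubling time reported in OpenAI's "AI and Compute" post, which the comment does not spell out.

```python
import math

MANHATTAN_COST = 1e12  # ~$1T, from the comment above
CURRENT_COST = 1e8     # ~$100M, from the comment above
DOUBLING_MONTHS = 3.4  # assumed doubling time from "AI and Compute"

# Number of doublings needed to cover the 4 orders of magnitude gap.
doublings = math.log2(MANHATTAN_COST / CURRENT_COST)

# Time the trend would take to cover that gap on its own.
years_ahead = doublings * DOUBLING_MONTHS / 12

print(f"{doublings:.1f} doublings, {years_ahead:.2f} years ahead of trend")
```

Under that assumption the gap is about 13.3 doublings, or roughly 3.8 years of trend progress.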

Several of the discontinuities in the AI Impacts investigation were the result of discontinuities in resource investment, IIRC.

I didn't claim otherwise? I'm just claiming you should distinguish between them.

If anything this would make me update that discontinuities in AI are less likely, given that I can be relatively confident there won't be discontinuities in AI investment (at least in the near-ish future).

Comment by rohinmshah on Conditions for Mesa-Optimization · 2020-10-15T17:57:18.069Z · LW · GW

Yeah, you're right, fixed.

Comment by rohinmshah on A prior for technological discontinuities · 2020-10-14T15:48:48.127Z · LW · GW

Yes, but they spent more money and created a much larger model than other groups, sooner than I'd otherwise have expected.

My impression was that it followed existing trends pretty well, but I haven't looked into it deeply.

Comment by rohinmshah on A prior for technological discontinuities · 2020-10-14T15:47:01.637Z · LW · GW

I agree I haven't filled in all the details to argue for continuous progress (mostly because I don't know the exact numbers), but when you get better results by investing more resources to push forward on a predicted scaling law, if there is a discontinuity it comes from a discontinuity in resource investment, which feels quite different from a technological discontinuity (e.g. we can model it and see a discontinuity is unlikely). This was the case with AlphaGo for example.

Separately, I also predict GPT-3 was not an example of discontinuity on perplexity, because it did not constitute a discontinuity in resource investment. (There may have been a discontinuity from resource investment in language models earlier in 2018-19, though I would guess even that wasn't the case.)

Comment by rohinmshah on A prior for technological discontinuities · 2020-10-13T21:22:32.368Z · LW · GW

perhaps because GPT-3 was maybe a discontinuity for language models.

??? Wasn't GPT-3 the result of people at OpenAI saying "huh, looks like language models scale according to such-and-such law, let's see if that continues to hold", and that law did continue to hold? Seems like an almost central example of continuous progress if you're evaluating by typical language model metrics like perplexity.

Comment by rohinmshah on A prior for technological discontinuities · 2020-10-13T21:14:40.523Z · LW · GW

Of these 50 technologies, I think that 19 have a discontinuity, 23 might have one, and 18 probably don't.

19 + 23 + 18 = 60. What gives?

Comment by rohinmshah on Reviews of the book 'The Alignment Problem' · 2020-10-11T16:39:25.837Z · LW · GW

My review

Comment by rohinmshah on rohinmshah's Shortform · 2020-10-10T15:16:40.360Z · LW · GW

When you make an argument about a person or group of people, often a useful thought process is "can I apply this argument to myself or a group that includes me? If this isn't a type error, but I disagree with the conclusion, what's the difference between me and them that makes the argument apply to them but not me? How convinced am I that they actually differ from me on this axis?"

Comment by rohinmshah on rohinmshah's Shortform · 2020-10-10T15:11:05.028Z · LW · GW

"Minimize AI risk" is not the same thing as "maximize the chance that we are maximally confident that the AI is safe". (Somewhat related comment thread.)