[AN #106]: Evaluating generalization ability of learned reward models 2020-07-01T17:20:02.883Z · score: 14 (4 votes)
[AN #105]: The economic trajectory of humanity, and what we might mean by optimization 2020-06-24T17:30:02.977Z · score: 24 (7 votes)
[AN #104]: The perils of inaccessible information, and what we can learn about AI alignment from COVID 2020-06-18T17:10:02.641Z · score: 19 (7 votes)
[AN #103]: ARCHES: an agenda for existential safety, and combining natural language with deep RL 2020-06-10T17:20:02.171Z · score: 26 (9 votes)
[AN #102]: Meta learning by GPT-3, and a list of full proposals for AI alignment 2020-06-03T17:20:02.221Z · score: 38 (11 votes)
[AN #101]: Why we should rigorously measure and forecast AI progress 2020-05-27T17:20:02.460Z · score: 15 (6 votes)
[AN #100]: What might go wrong if you learn a reward function while acting 2020-05-20T17:30:02.608Z · score: 33 (8 votes)
[AN #99]: Doubling times for the efficiency of AI algorithms 2020-05-13T17:20:02.637Z · score: 30 (10 votes)
[AN #98]: Understanding neural net training by seeing which gradients were helpful 2020-05-06T17:10:02.563Z · score: 20 (5 votes)
[AN #97]: Are there historical examples of large, robust discontinuities? 2020-04-29T17:30:02.043Z · score: 15 (5 votes)
[AN #96]: Buck and I discuss/argue about AI Alignment 2020-04-22T17:20:02.821Z · score: 17 (7 votes)
[AN #95]: A framework for thinking about how to make AI go well 2020-04-15T17:10:03.312Z · score: 20 (6 votes)
[AN #94]: AI alignment as translation between humans and machines 2020-04-08T17:10:02.654Z · score: 11 (3 votes)
[AN #93]: The Precipice we’re standing at, and how we can back away from it 2020-04-01T17:10:01.987Z · score: 25 (6 votes)
[AN #92]: Learning good representations with contrastive predictive coding 2020-03-25T17:20:02.043Z · score: 19 (7 votes)
[AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement 2020-03-18T17:10:02.205Z · score: 16 (5 votes)
[AN #90]: How search landscapes can contain self-reinforcing feedback loops 2020-03-11T17:30:01.919Z · score: 12 (4 votes)
[AN #89]: A unifying formalism for preference learning algorithms 2020-03-04T18:20:01.393Z · score: 17 (5 votes)
[AN #88]: How the principal-agent literature relates to AI risk 2020-02-27T09:10:02.018Z · score: 20 (6 votes)
[AN #87]: What might happen as deep learning scales even further? 2020-02-19T18:20:01.664Z · score: 30 (11 votes)
[AN #86]: Improving debate and factored cognition through human experiments 2020-02-12T18:10:02.213Z · score: 16 (6 votes)
[AN #85]: The normative questions we should be asking for AI alignment, and a surprisingly good chatbot 2020-02-05T18:20:02.138Z · score: 16 (6 votes)
[AN #84] Reviewing AI alignment work in 2018-19 2020-01-29T18:30:01.738Z · score: 24 (10 votes)
AI Alignment 2018-19 Review 2020-01-28T02:19:52.782Z · score: 140 (39 votes)
[AN #83]: Sample-efficient deep learning with ReMixMatch 2020-01-22T18:10:01.483Z · score: 16 (7 votes)
rohinmshah's Shortform 2020-01-18T23:21:02.302Z · score: 14 (3 votes)
[AN #82]: How OpenAI Five distributed their training computation 2020-01-15T18:20:01.270Z · score: 20 (6 votes)
[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment 2020-01-08T18:00:01.566Z · score: 22 (8 votes)
[AN #80]: Why AI risk might be solved without additional intervention from longtermists 2020-01-02T18:20:01.686Z · score: 34 (16 votes)
[AN #79]: Recursive reward modeling as an alignment technique integrated with deep RL 2020-01-01T18:00:01.839Z · score: 12 (5 votes)
[AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison 2019-12-26T01:10:01.626Z · score: 26 (7 votes)
[AN #77]: Double descent: a unification of statistical theory and modern ML practice 2019-12-18T18:30:01.862Z · score: 21 (8 votes)
[AN #76]: How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations 2019-12-04T18:10:01.739Z · score: 14 (6 votes)
[AN #75]: Solving Atari and Go with learned game models, and thoughts from a MIRI employee 2019-11-27T18:10:01.332Z · score: 39 (11 votes)
[AN #74]: Separating beneficial AI into competence, alignment, and coping with impacts 2019-11-20T18:20:01.647Z · score: 19 (7 votes)
[AN #73]: Detecting catastrophic failures by learning how agents tend to break 2019-11-13T18:10:01.544Z · score: 11 (4 votes)
[AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety 2019-11-06T18:10:01.604Z · score: 28 (7 votes)
[AN #71]: Avoiding reward tampering through current-RF optimization 2019-10-30T17:10:02.211Z · score: 13 (5 votes)
[AN #70]: Agents that help humans who are still learning about their own preferences 2019-10-23T17:10:02.102Z · score: 18 (6 votes)
Human-AI Collaboration 2019-10-22T06:32:20.910Z · score: 39 (13 votes)
[AN #69] Stuart Russell's new book on why we need to replace the standard model of AI 2019-10-19T00:30:01.642Z · score: 64 (21 votes)
[AN #68]: The attainable utility theory of impact 2019-10-14T17:00:01.424Z · score: 19 (5 votes)
[AN #67]: Creating environments in which to study inner alignment failures 2019-10-07T17:10:01.269Z · score: 17 (6 votes)
[AN #66]: Decomposing robustness into capability robustness and alignment robustness 2019-09-30T18:00:02.887Z · score: 12 (6 votes)
[AN #65]: Learning useful skills by watching humans “play” 2019-09-23T17:30:01.539Z · score: 12 (4 votes)
[AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning 2019-09-16T17:10:02.103Z · score: 11 (5 votes)
[AN #63] How architecture search, meta learning, and environment design could lead to general intelligence 2019-09-10T19:10:01.174Z · score: 24 (8 votes)
[AN #62] Are adversarial examples caused by real but imperceptible features? 2019-08-22T17:10:01.959Z · score: 28 (11 votes)
Call for contributors to the Alignment Newsletter 2019-08-21T18:21:31.113Z · score: 39 (12 votes)
Clarifying some key hypotheses in AI alignment 2019-08-15T21:29:06.564Z · score: 72 (29 votes)


Comment by rohinmshah on Arguments against myopic training · 2020-07-10T06:53:20.058Z · score: 4 (2 votes) · LW · GW

First disagreement:

the main reason myopia is useful is because it removes the incentive for agents to steer towards incorrectly high-reward states (which I'll call "manipulative" states)

... There are a lot of ways that reward functions go wrong besides manipulation. I agree that if what you're worried about is manipulation in N actions, then you shouldn't let the trajectory go on for N actions before evaluating.

Consider the boat racing example. I'm saying that we wouldn't have had the boat going around in circles if we had used approval feedback, because the human wouldn't have approved of the actions where the boat goes around in a circle.

(You might argue that if a human had been giving the reward signal, instead of having an automated reward function, that also would have avoided the bad behavior. I basically agree with that, but then my point would just be that humans are better at providing approval feedback than reward feedback -- we just aren't very used to thinking in terms of "rewards". See the COACH paper.)

Second disagreement:

Okay, what about a simple approval function, which approves actions insofar as the supervisor expects them to lead to collecting more berries? Then the agent *also* learns convergent instrumental subgoals, because it learns to take whatever actions lead to collecting more berries (assuming the supervisor is right about that).

When the supervisor sees the agent trying to take over the world in order to collect more berries, the supervisor disapproves, and the agent stops taking that action. (I suspect this ends up being the same disagreement as the first one, where you'd say "but the supervisor can do that with rewards too", and I say "sure, but humans are better at giving approval feedback than reward feedback".)

Again, I do agree with you that myopic training is not particularly likely to lead to myopic cognition. It seems to me like this is creeping into your arguments somewhere, but I may be wrong about that.

Comment by rohinmshah on AI safety via market making · 2020-07-10T06:26:54.065Z · score: 4 (2 votes) · LW · GW

Hmm, this seems to rely on having the human trust the outputs of M on questions that the human can't verify. It's not obvious to me that this is an assumption you can make without breaking the training process. The basic intuition is that you are hugely increasing the likelihood of bad gradients, since Adv can point to some incorrect / garbage output from M, and the human gives feedback as though this output is correct.

It works in the particular case that you outlined because there is essentially a DAG of arguments -- every claim is broken down into "smaller" claims, that eventually reach a base case, and so everything eventually bottoms out in something the human can check. (In practice this will be built from the ground up during training, similar to the approach in Supervising strong learners by amplifying weak experts.)

However, in general it doesn't seem like you can guarantee that every argument that Adv gives will result in a "smaller" claim. You could get into cycles, where "8 - 5 = 2" would be justified by Adv saying that M("What is 2 + 5?") = 8, and similarly "2 + 5 = 8" would be justified by saying that M("What is 8 - 5?") = 2. (Imagine that these were much longer equations where the human can check the validity of the algebraic manipulation, but can't check the validity of the overall equation.)
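
To make the DAG-versus-cycle distinction concrete, here is a minimal sketch. It assumes arguments can be written down as a mapping from each claim to the claims used to justify it; the function name `bottoms_out` and the example claims are hypothetical, not part of the proposal.

```python
def bottoms_out(claim, support, checkable, visiting=frozenset()):
    """Return True if `claim` is ultimately grounded in human-checkable claims.

    support:   dict mapping a claim to the list of claims used to justify it
               (a hypothetical encoding of the arguments Adv gives)
    checkable: set of claims the human can verify directly
    visiting:  claims already on the current justification path
    """
    if claim in checkable:
        return True
    if claim in visiting or not support.get(claim):
        # Circular justification, or no justification at all.
        return False
    path = visiting | {claim}
    return all(bottoms_out(c, support, checkable, path) for c in support[claim])


# A DAG of arguments: every claim reduces to something the human can check.
checkable = {"1 + 1 = 2", "2 + 1 = 3"}
dag = {"1 + 1 + 1 = 3": ["1 + 1 = 2", "2 + 1 = 3"]}

# The cycle from the comment: each wrong claim is justified by the other.
cyclic = {"8 - 5 = 2": ["2 + 5 = 8"], "2 + 5 = 8": ["8 - 5 = 2"]}

grounded_ok = bottoms_out("1 + 1 + 1 = 3", dag, checkable)
grounded_cycle = bottoms_out("8 - 5 = 2", cyclic, checkable)
```

In the cyclic case nothing ever reaches a base case, so the human's trust in M's outputs is never actually earned.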

It might be that this is actually an unimportant problem, because in practice for every claim there are a huge number of ways to argue for the truth, and it's extraordinarily unlikely that all of them fail in the same way such that M would argue for the same wrong answer along all of these possible paths, and so eventually M would have to settle on the truth. I'm not sure, I'd be interested in empirical results here.

It occurs to me that the same problem can happen with iterated amplification, though it doesn't seem to be a problem with debate.


Also, echoing my other comment below, I'm not sure if this is an equilibrium in the general case where Adv can make many kinds of arguments that H pays attention to. Maybe once this equilibrium has been reached, Adv starts saying things like "I randomly sampled 2 of the 100 numbers, and they were 20 and 30, and so we should expect the sum to be 25 * 100 = 2500". (But actually 20 and 30 were some of the largest numbers and weren't randomly sampled; the true sum is ~1000.) If this causes the human to deviate even slightly from the previous equilibrium, Adv is incentivized to do it. While we could hope to avoid this in math / arithmetic, it seems hard to avoid this sort of thing in general.

For no pure equilibrium to exist, we just need that for every possible answer, there is something Adv can say that would cause the human to give some other answer (even if the original answer was the truth). This seems likely to be the case.

Comment by rohinmshah on AI safety via market making · 2020-07-09T21:01:24.080Z · score: 2 (1 votes) · LW · GW

Oh, another worry: there may not be a stable equilibrium to converge to -- every time M approximates the final result well, Adv may be incentivized to switch to making different arguments to make M's predictions wrong. (Or rather, maybe the stable equilibrium has to be a mixture over policies for this reason, and so you only get the true answer with some probability.)

Comment by rohinmshah on AI safety via market making · 2020-07-09T20:15:42.806Z · score: 8 (2 votes) · LW · GW

Nice idea! I like the simplicity of "find the equilibrium where the human no longer changes their mind" (though as Ofer points out below, you might worry that "doesn't change their mind" comes apart from "the answer is correct").

However, I disagree with you about competitiveness. Roughly speaking, at best M is incentivized to predict what the human will think after reading the most relevant arguments, without trusting the source of the arguments (in reality, it will be a bit worse, as Adv is finding not the most relevant arguments but the most persuasive arguments in a particular direction). However, with debate, if the human judge is looking at a transcript of a given length, then (the hope is that) the equilibrium is for M to argue for the answer that a human would come to when inspecting a tree of size exponential in that length. The key reason is that in debate, we only require the judge to be able to identify which of two arguments is better, whereas in market-making, we rely on the judge to be able to come to the right conclusion given some arguments.

In complexity theory analogy land, debate corresponds to PSPACE while market making corresponds to NP: as long as Adv can find a polynomial-length witness, that can be verified by the human to get the right answer.

As a concrete example, suppose we want to find the sum of a list of numbers, and each argument is only allowed to reference two numbers and make a claim about their sum. Debate can solve this with a transcript whose size is logarithmic in the length of the list. Market-making would require a transcript whose size is linear in the length of the list. (You can't use the trick of making claims about the sum of half of the list in market-making as you can in debate, because the human has no reason to trust Adv's claims about the sum of half the list, since the human can only verify the sum of two numbers.)
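
Here is a rough sketch of the halving trick for debate. It assumes a challenger who knows the true partial sums and always disputes a half that is wrong, and a judge who only ever adds two numbers. The function `debate_rounds` and the two example debaters are hypothetical illustrations, not an implementation of any actual debate protocol.

```python
def debate_rounds(nums, claimed_total, claim_sum):
    """Bisect a disputed claim about sum(nums) down to at most two numbers.

    claim_sum(lo, hi) is what the defending debater asserts sum(nums[lo:hi]) is.
    Returns (defender_wins, rounds), where rounds counts transcript steps.
    """
    lo, hi, claim, rounds = 0, len(nums), claimed_total, 0
    while hi - lo > 2:
        mid = (lo + hi) // 2
        left, right = claim_sum(lo, mid), claim_sum(mid, hi)
        rounds += 1
        if left + right != claim:
            # The judge can catch this directly: it is a sum of two numbers.
            return False, rounds
        # The challenger (assumed to know the truth) disputes a wrong half.
        if left != sum(nums[lo:mid]):
            hi, claim = mid, left
        else:
            lo, claim = mid, right
    # Base case: the judge adds at most two numbers and checks the claim.
    return claim == sum(nums[lo:hi]), rounds + 1


nums = list(range(1, 129))  # 128 numbers


def honest(lo, hi):
    return sum(nums[lo:hi])


def liar(lo, hi):
    # Inflates every range that starts at index 0, so sub-claims stay consistent.
    return sum(nums[lo:hi]) + (1 if lo == 0 else 0)


honest_result = debate_rounds(nums, sum(nums), honest)
liar_result = debate_rounds(nums, sum(nums) + 1, liar)
```

With 128 numbers the dispute is settled in 7 rounds in both cases, i.e. log2 of the list length, and the liar is caught at the final two-number check. In market-making the human cannot lean on unverified claims about half the list, so the analogous transcript has to walk through the numbers one pairwise sum at a time.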

I think this means that market-making is less competitive. If you compare debate with transcripts of a given length against market-making with transcripts of the same length, then I think market-making is less performance competitive. Alternatively, if you compare it against market-making with exponentially longer transcripts, then I think market-making is less training competitive.

Comment by rohinmshah on Arguments against myopic training · 2020-07-09T19:51:14.136Z · score: 4 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

Several (AN #34) <@proposals@>(@An overview of 11 proposals for building safe advanced AI@) involve some form of myopic training, in which an AI system is trained to take actions that only maximize the feedback signal in the **next timestep** (rather than e.g. across an episode, or across all time, as with typical reward signals). In order for this to work, the feedback signal needs to take into account the future consequences of the AI system’s action, in order to incentivize good behavior, and so providing feedback becomes more challenging.
This post argues that there don’t seem to be any major benefits of myopic training, and so it is not worth the cost we pay in having to provide more challenging feedback. In particular, myopic training does not necessarily lead to “myopic cognition”, in which the agent doesn’t think about long-term consequences when choosing an action. To see this, consider the case where we know the ideal reward function R*. In that case, the best feedback to give for myopic training is the optimal Q-function Q*. However, regardless of whether we do regular training with R* or myopic training with Q*, the agent would do well if it estimates Q* in order to select the right action to take, which in turn will likely require reasoning about long-term consequences of its actions. So there doesn’t seem to be a strong reason to expect myopic training to lead to myopic cognition, if we give feedback that depends on (our predictions of) long-term consequences. In fact, for any approval feedback we may give, there is an equivalent reward feedback that would incentivize the same optimal policy.
Another argument for myopic training is that it prevents reward tampering and manipulation of the supervisor. The author doesn’t find this compelling. In the case of reward tampering, it seems that agents would not catastrophically tamper with their reward “by accident”, as tampering is difficult to do, and so they would only do so intentionally, in which case it is important for us to prevent those intentions from arising, for which we shouldn’t expect myopic training to help very much. In the case of manipulating the supervisor, he argues that in the case of myopic training, the supervisor will have to think about the future outputs of the agent in order to be competitive, which could lead to manipulation anyway.

Planned opinion:

I agree with what I see as the key point of this post: myopic training does not mean that the resulting agent will have myopic cognition. However, I don’t think this means myopic training is useless. According to me, the main benefit of myopic training is that small errors in reward specification for regular RL can incentivize catastrophic outcomes, while small errors in approval feedback for myopic RL are unlikely to incentivize catastrophic outcomes. (This is because “simple” rewards that we specify often lead to <@convergent instrumental subgoals@>(@The Basic AI Drives@), which need not be the case for approval feedback.) More details in this comment.

Comment by rohinmshah on Arguments against myopic training · 2020-07-09T18:10:02.322Z · score: 6 (3 votes) · LW · GW

Things I agree with:

1. If humans could give correctly specified reward feedback, it is a significant handicap to have a human provide approval feedback rather than reward feedback, because that requires the human to compute the consequences of possible plans rather than offloading it to the agent.

2. If we could give perfect approval feedback, we could also provide perfect reward feedback (at least for a small action space), via your reduction.

3. Myopic training need not lead to myopic cognition (and isn't particularly likely to for generally intelligent systems).

But I don't think these counteract what I see as the main argument for myopic training:

While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes.

(I'm using "incentivize" here to talk about outer alignment and not inner alignment.)

In other words, the point is that humans are capable of giving approval / myopic feedback (i.e. horizon = 1) with not-terrible incentives, whereas humans don't seem capable of giving reward feedback (i.e. horizon = infinity) with not-terrible incentives. The main argument for this is that most "simple" reward feedback leads to convergent instrumental subgoals, whereas approval / myopic feedback almost never does unless that's what the human says is correct. (Also we can just look at the long list of specification gaming examples so far.)

I'll rephrase your objections and then respond:

Objection 1: This sacrifices competitiveness, because now the burden of predicting what action leads to good long-term effects falls to the human instead of the agent.

Response: Someone has to predict which action leads to good long-term effects, since we can't wait for 100 years to give feedback to the agent for a single action. In a "default" training setup, we don't want it to be the agent, because we can't trust that the agent selects actions based on what we think is "good". So we either need the human to take on this job (potentially with help from the agent), or we need to figure out some other way to trust that the agent selects "good" actions. Myopia / approval direction takes the first option. We don't really know of a good way to achieve the second option.

Objection 2: This sacrifices competitiveness, because now the human can't look at the medium-term consequences of actions before providing feedback.

Response: This doesn't seem to be true -- if you want, you can collect a full trajectory to see the consequences of the actions, and then provide approval feedback on each of the actions individually when computing gradients.

Objection 3: There's no difference between approval feedback and reward feedback, since perfect approval feedback can be turned into perfect reward feedback. So you might as well use the perfect reward feedback, since this is more competitive.

Response: I agree that if you take the approval feedback that a human would give, apply this transformation, and then train a non-myopic RL agent on it, that would also not incentivize catastrophic outcomes. But if you start out with approval feedback, why would you want to do this? With approval feedback, the credit assignment problem has already been solved for the agent, whereas with the equivalent reward feedback, you've just undone the credit assignment and the agent now has to redo it all over again. (Like, instead of doing Q-learning, which has a non-stationary target, you could just use supervised learning to learn the fixed approval signal, surely this would be more efficient?)
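
As a toy illustration of "undoing and redoing credit assignment", here is a minimal sketch. It assumes a small made-up deterministic MDP, and it assumes the approval signal is exactly the optimal Q-function (real human approval would not be). The names `NEXT`, `REWARD`, `q_iteration` and `greedy` are hypothetical.

```python
GAMMA = 0.9

# Hypothetical 3-state deterministic MDP; state 2 is absorbing with zero reward.
NEXT = {0: {"a": 1, "b": 2}, 1: {"a": 2, "b": 0}, 2: {"a": 2, "b": 2}}
REWARD = {0: {"a": 0.0, "b": 1.0}, 1: {"a": 5.0, "b": 0.0}, 2: {"a": 0.0, "b": 0.0}}


def q_iteration(reward, sweeps=200):
    """Plain Q-value iteration: this is the credit assignment the RL agent must do."""
    q = {s: {a: 0.0 for a in NEXT[s]} for s in NEXT}
    for _ in range(sweeps):
        q = {
            s: {a: reward[s][a] + GAMMA * max(q[NEXT[s][a]].values()) for a in NEXT[s]}
            for s in NEXT
        }
    return q


def greedy(q):
    """The myopic policy: pick the action with the highest immediate feedback."""
    return {s: max(q[s], key=q[s].get) for s in q}


# Stand-in for perfect approval feedback (assumed equal to Q* here).
approval = q_iteration(REWARD)

# The transformation: r(s, a) = M(s, a) - gamma * max_a' M(s', a').
derived_reward = {
    s: {a: approval[s][a] - GAMMA * max(approval[NEXT[s][a]].values()) for a in NEXT[s]}
    for s in NEXT
}

# A non-myopic agent trained on the derived reward has to rerun the whole
# iteration just to get back the numbers the approval signal already contained.
relearned = q_iteration(derived_reward)
same_policy = greedy(approval) == greedy(relearned)
```

Both routes end at the same greedy policy, but the myopic learner only needs to fit the fixed `approval` table, while the reward-based learner has to reconstruct it through bootstrapped updates.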

On the tampering / manipulation points, I think those are special cases of the general point that it's easier for humans to provide non-catastrophe-incentivizing approval feedback than to provide non-catastrophe-incentivizing reward feedback.

I want to reiterate that I agree with the point that myopic training probably does not lead to myopic cognition (though this depends on what exactly we mean by "myopic cognition"), and I don't think of that as a major benefit of myopic training.


$M(s,a) - \lambda \max_{a'} M(s',a')$

I think you mean γ instead of λ
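
To spell out why the discount factor is the natural symbol there: as a sketch, assume deterministic transitions and that the approval signal $M$ coincides with the optimal Q-function $Q^*$ for some reward $R$ (my assumption for illustration, the post's $M$ may be more general). Rearranging the Bellman optimality equation gives the same expression with $\gamma$:

```latex
\begin{align*}
Q^*(s,a) &= R(s,a) + \gamma \max_{a'} Q^*(s',a') \\
\Rightarrow\quad R(s,a) &= Q^*(s,a) - \gamma \max_{a'} Q^*(s',a')
\end{align*}
```

Substituting $M$ for $Q^*$ on the right-hand side recovers the quoted formula, with $\gamma$ where the post wrote $\lambda$.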

Comment by rohinmshah on The "AI Debate" Debate · 2020-07-09T16:01:25.743Z · score: 2 (1 votes) · LW · GW

How about a recommendation engine that accidentally learns to show depressed people sequences of videos that affirm their self-hatred and lead them to commit suicide? (It seems plausible that something like this has already happened, though idk if it has.)

I think the thing you actually want to talk about is an agent that "intentionally" deceives its operator / the state? I think even there I'd disagree with your prediction, but it seems more reasonable as a stance (mostly because depending on how you interpret the "intentionally" it may need to have human-level reasoning abilities). Would it count if a malicious actor successfully finetuned GPT-3 to e.g. incite violence while maintaining plausible deniability?

Comment by rohinmshah on AI Research Considerations for Human Existential Safety (ARCHES) · 2020-07-09T06:41:05.828Z · score: 12 (6 votes) · LW · GW

Highlighted in AN #103 with a summary, though it didn't go into the research directions (because it would have become too long, and I thought the intro + categorization was more important on average).

Comment by rohinmshah on The "AI Debate" Debate · 2020-07-08T23:24:22.327Z · score: 2 (1 votes) · LW · GW

Are you predicting there won't be any lethal autonomous weapons before AGI? It seems like if that ends up being true, it would only be because we coordinated well to prevent that. More generally, we don't usually try to kill people, whereas we do try to build AGI.

(Whereas I think at least Paul usually thinks about people not paying the "safety tax" because the unaligned AI is still really good at e.g. getting them money, at least in the short term.)

Comment by rohinmshah on The "AI Debate" Debate · 2020-07-08T05:45:46.970Z · score: 2 (1 votes) · LW · GW

nobody else has anything more valuable than an Amazon Mechanical Turk worker

Huh? Isn't the ML powering e.g. Google Search more valuable than an MTurk worker? Or Netflix's recommendation algorithm? (I think I don't understand what you mean by "value" here.)

Comment by rohinmshah on Idea: Imitation/Value Learning AIXI · 2020-07-04T17:52:53.313Z · score: 3 (2 votes) · LW · GW

If your conclusion is "value learning can never work and is risky", that seems fine (if maybe a bit strong). I agree it's not obvious that (ambitious) value learning can work.

Let's suppose you want to e.g. play Go, and so you use AIXIL on Lee Sedol's games. This will give you an agent that plays however Lee Sedol would play. In particular, AlphaZero would beat this agent handily (at the game of Go). This is what I mean when I say you're limited to human performance.

In contrast, the hope with value learning was that you can apply it to Lee Sedol's games, and get out the reward "1 if you win, 0 if you lose", which when optimized gets you AlphaZero-levels of capability (i.e. superhuman performance).

I think it's reasonable to say "but there's no reason to expect that value learning will infer the right reward, so we probably won't do better than imitation" (and I collated Chapter 1 of the Value Learning sequence to make this point). In that case, you should expect that imitation = human performance and value learning = subhuman / catastrophic performance.

According to me, the main challenge of AI x-risk is how to deal with superhuman AI systems, and so if you have this latter position, I think you should be pessimistic about both imitation learning and value learning (unless you combine it with something that lets you scale to superhuman, e.g. iterated amplification, debate or recursive reward modeling).

Comment by rohinmshah on Idea: Imitation/Value Learning AIXI · 2020-07-04T00:56:25.569Z · score: 2 (1 votes) · LW · GW

If you generate a dataset from a policy (e.g. human behavior) and then run and get policy , you can expect that . I think you could claim "as long as is big enough, the best compression is just to replicate the human decision process, and so we'll have ".

Alternatively, you could claim that you'll find an even better compression of than the human policy . In that case, you expect and is lower KL-complexity than . However, why should you expect to be a "better" policy than according to human values?


Comment by rohinmshah on Idea: Imitation/Value Learning AIXI · 2020-07-03T21:25:25.053Z · score: 2 (1 votes) · LW · GW

I'm not totally sure what you're asking, but some thoughts:

Yes, if your goal is to recover a policy (i.e. imitation learning), then value learning is only one approach.

Yes, you can recover a policy by supervised learning on a dataset of the policy's behavior. This could be done with neural nets, or it could be done with Bayesian inference with the Solomonoff prior. Either approach would work with enough data (we don't know how much data though), and neither of them inherently learns values (though they may do so as an instrumental strategy).

If you imitate a human policy, you are limiting yourself to human performance. The original hope of value learning was that if a more capable agent optimized the learned reward, you could get to superhuman performance, something that AIXIL would not do.

Comment by rohinmshah on [AN #102]: Meta learning by GPT-3, and a list of full proposals for AI alignment · 2020-07-03T17:33:15.999Z · score: 2 (1 votes) · LW · GW

That seems far too structured to me -- I seriously doubt GPT-3 is doing anything like "generating a large bunch of candidate algorithms", though maybe it has learned heuristics that approximate this sort of computation.

Comment by rohinmshah on Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI · 2020-07-02T22:38:57.473Z · score: 4 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This podcast covers a lot of topics, with special focus on <@Risks from Learned Optimization in Advanced Machine Learning Systems@> and <@An overview of 11 proposals for building safe advanced AI@>.

Planned opinion:

My summary is light on detail because many of the topics have been highlighted before in this newsletter, but if you aren’t familiar with them the podcast is a great resource for learning about them.

Comment by rohinmshah on Locality of goals · 2020-07-02T22:07:40.261Z · score: 2 (1 votes) · LW · GW

I ask myself if there's anything in particular I want to say about the post / paper that the author(s) didn't say, with an emphasis on ensuring that the opinion has content. If yes, then I write it.

(Sorry, that's not very informative, but I don't really have a system for it.)

Comment by rohinmshah on Goals and short descriptions · 2020-07-02T19:10:15.826Z · score: 3 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This post argues that a distinguishing factor of goal-directed policies is that they have low Kolmogorov complexity, relative to e.g. a lookup table that assigns a randomly selected action to each observation. It then relates this to quantilizers (AN #48) and <@mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@).

Planned opinion:

This seems reasonable to me as an aspect of goal-directedness. Note that it is not a sufficient condition. For example, the policy that always chooses action A has extremely low complexity, but I would not call it goal-directed.

Comment by rohinmshah on Locality of goals · 2020-07-02T19:09:59.713Z · score: 6 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This post introduces the concept of the _locality_ of a goal, that is, how “far” away the target of the goal is. For example, a thermostat’s “goal” is very local: it “wants” to regulate the temperature of this room, and doesn’t “care” about the temperature of the neighboring house. In contrast, a paperclip maximizer has extremely nonlocal goals, as it “cares” about paperclips anywhere in the universe. We can also consider whether the goal depends on the agent’s internals, its input, its output, and/or the environment.
The concept is useful because for extremely local goals (usually goals about the internals or the input) we would expect wireheading or tampering, whereas for extremely nonlocal goals, we would instead expect convergent instrumental subgoals like resource acquisition.

Comment by rohinmshah on [AN #106]: Evaluating generalization ability of learned reward models · 2020-07-02T18:08:07.739Z · score: 4 (2 votes) · LW · GW

If EPIC(R1, R2) is thought of as two functions f(g(R1), g(R2)), where g returns the optimal policy of its input, and f is a distance function for optimal policies, then f(OptimalPolicy1, OptimalPolicy2) is a metric?

The authors don't prove it, but I believe yes, as long as DS and DA put support over the entire state space / action space (maybe you also need DT to put support over every possible transition).

I usually think of this as "EPIC is a metric if defined over the space of equivalence classes of reward functions".

Can more than one DT be used, so there's more than one measure?


There's a maximum?

For finite, discrete state/action spaces, the uniform distribution over (s, a, s') tuples has maximal entropy. However, it's not clear that that's the worst case for EPIC.

Comment by rohinmshah on The ground of optimization · 2020-07-01T17:34:38.860Z · score: 4 (2 votes) · LW · GW

The whole system is an Optimizing AI, according to the definition given above, but neither of the two parts is by itself

Yeah, I'm talking about the whole system.

it doesn't seem to have the flavor of mesa-optimization

Yeah, I agree it doesn't fit the explanation / definition in Risks from Learned Optimization. I don't like that definition, and usually mean something like "running the model parameters instantiates a computation that does 'reasoning'", which I think does fit this example. I mentioned this a bit later in the comment:

I want to note that under this approach the notion of “search” and “mesa objective” are less natural, which I see as a pro of this approach [...]: the argument is that we’ll get a general inner optimizing AI, but it doesn’t say much about what task that AI will be optimizing for (and it could be an optimizing AI that is retargetable by human instructions).

Comment by rohinmshah on The ground of optimization · 2020-06-29T21:23:45.856Z · score: 4 (2 votes) · LW · GW

+1 to all of this.

We'll presumably need to give O some information about the goal / target configuration set for each task.

I was imagining that the tasks can come equipped with some specification, but some sort of counterfactual also makes sense. This also gets around issues of the AI system not being appropriately "motivated" -- e.g. I might be capable of performing the task "lock up puppies in cages", but I wouldn't do it, and so if you only look at my behavior you couldn't say that I was capable of doing that task.

But this doesn't really get at the spirit of Paul's idea, which I think is about really looking inside the AI and understanding its goals.

+1 especially to this

Comment by rohinmshah on Plausible cases for HRAD work, and locating the crux in the "realism about rationality" debate · 2020-06-29T02:48:33.865Z · score: 6 (3 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This post tries to identify the possible cases for highly reliable agent design (HRAD) work to be the main priority of AI alignment. HRAD is a category of work at MIRI that aims to build a theory of intelligence and agency that can explain things like logical uncertainty and counterfactual reasoning.
The first case for HRAD work is that by becoming less confused about these phenomena, we will be able to help AGI builders predict, explain, avoid, detect, and fix safety issues and help to conceptually clarify the AI alignment problem. For this purpose, we just need _conceptual_ deconfusion -- it isn’t necessary that there must be precise equations defining what an AI system does.
The second case is that if we get a precise, mathematical theory, we can use it to build an agent that we understand “from the ground up”, rather than throwing the black box of deep learning at the problem.
The last case is that understanding how intelligence works will give us a theory that allows us to predict how _arbitrary_ agents will behave, which will be useful for AI alignment in all the ways described in the first case and <@more@>(@Theory of Ideal Agents, or of Existing Agents?@).
Looking through past discussion on the topic, the author believes that people at MIRI primarily believe in the first two cases. Meanwhile, critics (particularly me) say that it seems pretty unlikely that we can build a precise, mathematical theory, and a more conceptual but imprecise theory may help us understand reasoning better but is less likely to generalize sufficiently well to say important and non-trivial things about AI alignment for the systems we are actually building.

Planned opinion:

I like this post -- it seems like an accessible summary of the state of the debate so far. My opinions are already in the post, so I don’t have much to add.
Comment by rohinmshah on Plausible cases for HRAD work, and locating the crux in the "realism about rationality" debate · 2020-06-29T02:10:00.661Z · score: 2 (1 votes) · LW · GW
One way to reject this case for HRAD work is by saying that imprecise theories of rationality are insufficient for helping to align AI systems. This is what Rohin does in this comment where he says imprecise theories cannot build things "2+ levels above".

I should note that there are some things in world 1 that I wouldn't reject this way -- e.g. one of the examples of deconfusion is “anyhow, we could just unplug [the AGI].” That is directly talking about AGI safety, and so deconfusion on that point is "1 level away" from the systems we actually build, and isn't subject to the critique. (And indeed, I think it is important and great that this statement has been deconfused!)

It is my impression though that current HRAD work is not "directly talking about AGI safety", and is instead talking about things that are "further away", to which I would apply the critique.

Comment by rohinmshah on Plausible cases for HRAD work, and locating the crux in the "realism about rationality" debate · 2020-06-29T01:57:21.163Z · score: 2 (1 votes) · LW · GW

I think that the plans you lay out are all directly talking about the AI system we eventually build, and as a result I'm more optimistic about them (and your work, as it's easy to see how it makes progress towards these plans) relative to HRAD.

In contrast, as far as I can tell, HRAD work does not directly contribute to any of these plans, and instead the case seems to rely on something more indirect where a better understanding of reasoning will later help us execute on one of these plans. It's this indirection that makes me worried.

Comment by rohinmshah on The ground of optimization · 2020-06-23T18:55:14.378Z · score: 6 (3 votes) · LW · GW
It didn't seem like you defined what it meant to evolve towards the target configuration set.

+1 for swapping out the target configuration set with a utility function, and looking for a robust tendency for the utility function to increase. This would also let you express mild optimization (see this thread).

Comment by rohinmshah on The ground of optimization · 2020-06-23T18:49:32.940Z · score: 2 (1 votes) · LW · GW

It sounds like you're assuming that the target configuration set is built into the AI system. According to me, a major point of this post / framework is to avoid that assumption altogether, and only describe problems in terms of the actual observed system behavior.

(This is why within this framework I couldn't formalize outer alignment, and why wireheading and the search / mesa-objective split is unnatural.)

Comment by rohinmshah on The ground of optimization · 2020-06-23T17:54:26.112Z · score: 8 (2 votes) · LW · GW

This makes sense, but I think you'd need a different notion of optimizing systems than the one used in this post. (In particular, instead of a target configuration set, you want a continuous notion of goodness, like a utility function / reward function.)

Comment by rohinmshah on Preparing for "The Talk" with AI projects · 2020-06-22T23:11:32.396Z · score: 4 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

At some point in the future, it seems plausible that there will be a conversation in which people decide whether or not to deploy a potentially risky AI system. So one class of interventions to consider is interventions that make such conversations go well. This includes raising awareness about specific problems and risks, but could also include identifying people who are likely to be involved in such conversations _and_ concerned about AI risk, and helping them prepare for such conversations through training, resources, and practice. This latter intervention hasn't been done yet: some simple examples of potential interventions would be generating official lists of AI safety problems and solutions which can be pointed to in such conversations, or doing "practice runs" of these conversations.

Planned opinion:

I certainly agree that we should be thinking about how we can convince key decision makers of the level of risk of the systems they are building (whatever that level of risk is). I think that on the current margin it's much more likely that this is best done through better estimation and explanation of risks with AI systems, but it seems likely that the interventions laid out here will become more important in the future.
Comment by rohinmshah on Public Static: What is Abstraction? · 2020-06-22T06:22:44.132Z · score: 4 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

If we are to understand embedded agency, we will likely need to understand abstraction (see <@here@>(@Embedded Agency via Abstraction@)). This post presents a view of abstraction in which we abstract a low-level territory into a high-level map that can still make reliable predictions about the territory, for some set of queries (whether probabilistic or causal).

For example, in an ideal gas, the low-level configuration would specify the position and velocity of _every single gas particle_. Nonetheless, we can create a high-level model where we keep track of things like the number of molecules, average kinetic energy of the molecules, etc., which can then be used to predict things like pressure exerted on a piston.
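
The ideal gas example can be sketched in a few lines. This is a toy under stated assumptions: the particle count, mass, volume and velocity distribution are hypothetical, and the pressure formula is the standard kinetic-theory relation P = 2N⟨KE⟩/(3V) for translational kinetic energy. It shows two different low-level states that share a high-level summary and therefore get the same pressure prediction.

```python
import random

def summarize(velocities, mass):
    """High-level summary: particle count and mean kinetic energy.

    Positions are dropped entirely; only speeds enter the summary.
    """
    n = len(velocities)
    mean_ke = sum(0.5 * mass * (vx * vx + vy * vy + vz * vz)
                  for vx, vy, vz in velocities) / n
    return n, mean_ke

def predicted_pressure(n, mean_ke, volume):
    """Kinetic theory of an ideal gas: P = 2 N <KE> / (3 V)."""
    return 2.0 * n * mean_ke / (3.0 * volume)

rng = random.Random(0)
mass, volume = 1.0, 10.0  # hypothetical units
velocities = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
              for _ in range(1000)]

# A different low-level state: every particle's velocity is reversed.
flipped = [(-vx, -vy, -vz) for vx, vy, vz in velocities]

p_original = predicted_pressure(*summarize(velocities, mass), volume)
p_flipped = predicted_pressure(*summarize(flipped, mass), volume)
```

The query "what is the pressure?" cannot distinguish the two microstates, which is the sense in which the summary keeps only the information the query needs.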

Given a low-level territory L and a set of queries Q that we’d like to be able to answer, the minimal-information high-level model stores P(Q | L) for every possible Q and L. However, in practice we don’t start with a set of queries and then come up with abstractions, we instead develop crisp, concise abstractions that can answer many queries. One way we could develop such abstractions is by only keeping information that is visible “far away”, and throwing away information that would be wiped out by noise. For example, when typing 3+4 into a calculator, the exact voltages in the circuit don’t affect anything more than a few microns away, except for the final result 7, which affects the broader world (e.g. via me seeing the answer).

If we instead take a systems view of this, where we want abstractions of multiple different low-level things, then we can equivalently say that two far-away low-level things should be independent of each other _when given their high-level summaries_, which are supposed to be able to quantify all of their interactions.

Planned opinion:

I really like the concept of abstraction, and think it is an important part of intelligence, and so I’m glad to get better tools for understanding it. I especially like the formulation that low-level components should be independent given high-level summaries -- this corresponds neatly to the principle of encapsulation in software design, and does seem to be a fairly natural and elegant description, though of course abstractions in practice will only approximately satisfy this property.
Comment by rohinmshah on My take on CHAI’s research agenda in under 1500 words · 2020-06-21T20:28:37.105Z · score: 2 (1 votes) · LW · GW

Now that I've read your post on optimization, I'd restate the first statement below as the second:

More generally, it seems like "help X" or "assist X" only means something when you view X as pursuing some goal.


More generally, it seems like "help X" or "assist X" only means something when you view X as an optimizing system.

Which I guess was your point in the first place, that we should view things as optimizing systems and not agents. (Whereas when I hear "agent" I usually think of something like what you call an "optimizing system".)

I think my main point is that "CHAI's agenda depends strongly on an agent assumption" seems only true of the specific mathematical formalization that currently exists; I would not be surprised if the work could then be generalized to optimizing systems instead of agents / EU maximizers in particular.

Comment by rohinmshah on The ground of optimization · 2020-06-21T20:06:15.081Z · score: 7 (4 votes) · LW · GW

Planned summary for the Alignment Newsletter:

Many arguments about AI risk depend on the notion of “optimizing”, but so far it has eluded a good definition. One natural approach is to say that an optimizer causes the world to have higher values according to some reasonable utility function, but this seems insufficient, as then a <@bottle cap would be an optimizer@>(@Bottle Caps Aren't Optimisers@) for keeping water in the bottle.
This post provides a new definition of optimization, by taking a page from <@Embedded Agents@> and analyzing a system as a whole instead of separating the agent and environment. An **optimizing system** is then one which tends to evolve toward some special configurations (called the **target configuration set**), when starting anywhere in some larger set of configurations (called the **basin of attraction**), _even if_ the system is perturbed.
For example, in gradient descent, we start with some initial guess at the parameters θ, and then continually compute loss gradients and move θ in the appropriate direction. The target configuration set is all the local minima of the loss landscape. Such a program has a very special property: while it is running, you can change the value of θ (e.g. via a debugger), and the program will probably _still work_. This is quite impressive: certainly most programs would not work if you arbitrarily changed the value of one of the variables in the middle of execution. Thus, this is an optimizing system that is robust to perturbations in θ. Of course, it isn’t robust to arbitrary perturbations: if you change any other variable in the program, it will probably stop working. In general, we can quantify how powerful an optimizing system is by how robust it is to perturbations, and how small the target configuration set is.
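
A minimal sketch of the robustness property just described. The quadratic loss, learning rate and perturbation are chosen purely for illustration and are not taken from the post. Overwriting θ partway through the run stands in for editing it in a debugger.

```python
def grad(theta):
    """Gradient of the toy loss (theta - 3)^2, whose only minimum is theta = 3."""
    return 2.0 * (theta - 3.0)

def run(steps=200, lr=0.1, perturb_at=None, perturb_to=None):
    """Plain gradient descent, optionally overwriting theta at one step."""
    theta = 0.0  # initial guess
    for step in range(steps):
        if step == perturb_at:
            theta = perturb_to  # stand-in for changing theta in a debugger
        theta -= lr * grad(theta)
    return theta

undisturbed = run()
perturbed = run(perturb_at=50, perturb_to=-40.0)
# Both runs end up essentially at the minimum, theta = 3.
```

The same loop would not survive an arbitrary change to `lr` (a negative value makes it move away from the minimum), which matches the point that the system is robust to perturbations in θ but not to perturbations in other variables.
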
The bottle cap example is _not_ an optimizing system because there is no broad basin of configurations from which we get to the bottle being full of water. The bottle cap doesn’t cause the bottle to be full of water when it didn’t start out full of water.
Optimizing systems are a superset of goal-directed agentic systems, which require a separation between the optimizer and the thing being optimized. For example, a tree is certainly an optimizing system (the target is to be a fully grown tree, and it is robust to perturbations of soil quality, or if you cut off a branch, etc). However, it does not seem to be a goal-directed agentic system, as it would be hard to separate into an “optimizer” and a “thing being optimized”.
This does mean that we can no longer ask “what is doing the optimization” in an optimizing system. This is a feature, not a bug: if you expect to always be able to answer this question, you typically get confusing results. For example, you might say that your liver is optimizing for making money, since without it you would die and fail to make money.
The full post has several other examples that help make the concept clearer.

Planned opinion:

I’ve <@previously argued@>(@Intuitions about goal-directed behavior@) that we need to take generalization into account in a definition of optimization or goal-directed behavior. This definition achieves that by primarily analyzing the robustness of the optimizing system to perturbations. While this does rely on a notion of counterfactuals, it still seems significantly better than any previous attempt to ground optimization.
I particularly like that the concept doesn’t force us to have a separate agent and environment, as that distinction does seem quite leaky upon close inspection. I gave a shot at explaining several other concepts from AI alignment within this framework in this comment, and it worked quite well. In particular, a computer program is a goal-directed AI system if there is an environment such that adding the computer program to the environment transforms it into an optimizing system for some “interesting” target configuration states (with one caveat explained in the comment).
Comment by rohinmshah on The ground of optimization · 2020-06-21T20:03:12.849Z · score: 37 (14 votes) · LW · GW

This is excellent, it feels way better as a definition of optimization than past attempts :) Thanks in particular for the academic style, specifically relating it to previous work, it made it much more accessible for me.

Let's try to build up some core AI alignment arguments with this definition.

Task: A task is simply an “environment” along with a target configuration set. Whenever I talk about a “task” below, assume that I mean an “interesting” task, i.e. something like “build a chair”, as opposed to “have the air molecules be in one of these particular configurations”.

Solving a task: An object O solves a task T if adding O to T’s environment transforms it into an optimizing system for T’s target configuration set.

Performance on the task: If O solves task T, its performance is quantified by how quickly it reaches the target configuration set, and how robust it is to perturbations.
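
To make "solving a task" and "performance" concrete, here is a toy sketch. Everything in it is a hypothetical stand-in: the room dynamics, the thermostat, the tested starting temperatures and the single shock. Checking a handful of starts and one perturbation is only evidence for the property, not a proof of it.

```python
def in_target(temp):
    """Target configuration set: room temperature within 19 to 21 degrees."""
    return 19.0 <= temp <= 21.0

def simulate(start_temp, controller=None, steps=300, shock_at=100, shock=-15.0):
    """Toy environment: heat leaks toward 5 degrees outside; a shock hits mid-run."""
    temp, history = start_temp, []
    for step in range(steps):
        if step == shock_at:
            temp += shock  # perturbation, e.g. someone opens a window
        heat = controller(temp) if controller else 0.0
        temp += 0.1 * (5.0 - temp) + heat
        history.append(temp)
    return history

def thermostat(temp):
    """Hypothetical object O: a heater/cooler steering toward 20 degrees."""
    return 1.5 + 0.4 * (20.0 - temp)

basin = [-10.0, 0.0, 15.0, 40.0]  # starting temperatures we test from

def solves_task(controller):
    """O 'solves' the task if every tested start still ends in the target set."""
    return all(in_target(simulate(t, controller)[-1]) for t in basin)

def steps_to_target(start_temp, controller):
    """Crude performance measure: first step at which the target set is reached."""
    history = simulate(start_temp, controller, shock_at=None)
    return next(i for i, temp in enumerate(history) if in_target(temp))

with_thermostat = solves_task(thermostat)   # environment plus O
without_thermostat = solves_task(None)      # environment alone
```

Without the thermostat the room simply settles at the outside temperature, so the environment alone is not an optimizing system for this target set; adding O is what creates one.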

Generality of intelligence: The generality of O’s intelligence is a function of the number and diversity of tasks T that it can solve, as well as its performance on those tasks.

Optimizing AI: A computer program for which there exists an interesting task such that the computer program solves that task.

This isn’t exactly right, as it includes e.g. accounting programs or video games, which when paired with a human form an optimizing system for correct financials and winning the game, respectively. You might be able to fix this by saying that the optimizing system has to be robust to perturbations in any human behavior in the environment.

AGI: An optimizing AI whose generality of intelligence is at least as great as that of humans.

Argument for AI risk: As optimizing AIs become more and more general, we will apply them to more economically useful tasks T. However, they also become more and more robust to perturbations, possibly including perturbations such as “we try to turn off the AI”. As a result, we might eventually have AIs that form strong optimizing systems for some task T that isn’t the one we actually wanted, which tends to be bad due to fragility of value.

Deep learning AGI implies mesa optimization: Since deep learning is so sample inefficient, it cannot reach human levels of performance if we apply deep learning directly to each possible task T. (For example, it has to relearn how the world works separately for each task T.) As a result, if we do get AGI primarily via deep learning, it must be that we used deep learning to create a new optimizing AI system, and that system was the AGI.

Argument for mesa optimization: Due to the complexity and noise in the real world, most economically useful tasks require setting up a robust optimizing system, rather than directly creating the target configuration state. (See also the importance of feedback for more on this intuition.) It seems likely that humans will find it easier to create algorithms that then find AGIs that can create these robust optimizing systems, rather than creating an algorithm that is directly an AGI.

(The previous argument also applies: this is basically just a generalization of the previous point to arbitrary AI systems, instead of only deep learning.)

I want to note that under this approach the notion of “search” and “mesa objective” are less natural, which I see as a pro of this approach (see also here): the argument is that we’ll get a general inner optimizing AI, but it doesn’t say much about what task that AI will be optimizing for (and it could be an optimizing AI that is retargetable by human instructions).

Outer alignment: ??? Seems hard to formalize in this framework. This makes me feel like outer alignment is less important as a concept. (I also don’t particularly like formalizations outside of this framework.)

Inner alignment: Ensuring that (conditional on mesa optimization occurring) the inner AGI is aligned with the operator / user, that is, combined with the user it forms an optimizing system for “doing what the user wants”. (Note that this is explicitly not intent alignment, as it is hard to formalize intent alignment in this framework.)

Intent alignment: ??? As mentioned above, it’s hard to formalize in this framework, as intent alignment really does require some notion of “motivation”, “goals”, or “trying”, which this framework explicitly leaves out. I see this as a con of this framework.

Expected utility maximization: One particular architecture that could qualify as an AGI (if the utility function is treated as part of the environment, and not part of the AGI). I see the fact that EU maximization is no longer highlighted as a pro of this approach.

Wireheading: Special case of the argument for AI risk with a weird task of “maximize the number in this register”. Unnatural in this framing of the AI risk problem. I see this as a pro of this framing of the problem, though I expect people disagree with me on this point.

Comment by rohinmshah on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-21T15:51:29.716Z · score: 2 (1 votes) · LW · GW

I'd do the same thing for the version about religion (infinite utility from heaven / infinite disutility from hell), where I'm not being exploited, I simply have different beliefs from the person making the argument.

(Note also that the non-exploitability argument isn't sufficient.)

Comment by rohinmshah on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-20T17:30:37.646Z · score: 3 (2 votes) · LW · GW
"It's already happened at least once, at a major AI company, for an important AI system, yes in the future people will be paying more attention probably but that only changes the probability by an order of magnitude or so."

Tbc, I think it will happen again; I just don't think it will have a large impact on the world.

Isn't the cheap solution just... being more cautious about our programming, to catch these bugs before the code starts running? And being more concerned about these signflip errors in general?

If you're writing the AGI code, sure. But in practice it won't be you, so you'd have to convince other people to do this. If you tried to do that, I think the primary impact would be "ML researchers are more likely to think AI risk concerns are crazy" which would more than cancel out the potential benefit, even if I believed the risk was 1 in 30,000.

Comment by rohinmshah on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-20T17:25:04.973Z · score: 2 (1 votes) · LW · GW

If a mugger actually came up to me and said "I am God and will torture 3^^^3 people unless you pay me $5", if you then forced me to put a probability on it, I would in fact say something like 1 in a million. I still wouldn't pay the mugger.

Like, can I actually make a million statements of the same type as that one, and be correct about all but one of them? It's hard to get that kind of accuracy.

(Here I'm trying to be calibrated with my probabilities, as opposed to saying the thing that would reflect my decision process under expected utility maximization.)

Comment by rohinmshah on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-19T21:01:48.707Z · score: 4 (2 votes) · LW · GW

I don't think you should act on probabilities of 1 in a million when the reason for the probability is "I am uncomfortable using smaller probabilities than that in general"; that seems like a Pascal's mugging.

Mainly I think that the solution to this problem is very cheap to implement, and thus we do lots of good in expectation by raising more awareness of this problem.

Huh? What's this cheap solution?

Comment by rohinmshah on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-19T17:23:44.112Z · score: 4 (2 votes) · LW · GW

Where did your credence start out at?

If we're talking about a blank-slate AI system that doesn't yet know anything, that then is trained on the negative of the objective we meant, I give it under one in a million that the AI system kills us all before we notice something wrong. (I mean, in all likelihood this would just result in the AI system failing to learn at all, as has happened the many times I've done this myself.) The reason I don't go lower is something like "sufficiently small probabilities are super weird and I should be careful with them".

Now if you're instead talking about some AI system that already knows a ton about the world and is very capable and now you "slot in" a programmatic version of the goal and the AI system interprets it literally, then this sort of bug seems possible. But I seriously doubt we're in that world. And in any case, in that world you should just be worried about us not being able to specify the goal, with this as a special case of that circumstance.

Comment by rohinmshah on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-18T19:28:24.273Z · score: 2 (5 votes) · LW · GW

If that sort of thing happens, you would turn off the AI system (as OpenAI did in fact do). The AI system is not going to learn so fast that it prevents you from doing so.

Comment by rohinmshah on [AN #104]: The perils of inaccessible information, and what we can learn about AI alignment from COVID · 2020-06-18T19:23:50.243Z · score: 2 (1 votes) · LW · GW

This paper didn't check that, but usually when you train sparse networks you get worse performance than if you train dense networks and then prune them to be sparse.

Comment by rohinmshah on Image GPT · 2020-06-18T19:22:54.979Z · score: 4 (2 votes) · LW · GW

Ah I see, that makes more sense, thanks!

Comment by rohinmshah on Possible takeaways from the coronavirus pandemic for slow AI takeoff · 2020-06-18T19:21:01.974Z · score: 4 (2 votes) · LW · GW

Changed narrow/general to weak/strong in the LW version of the newsletter (unfortunately the newsletter had already gone out when your comment was written).

I wouldn't say that there were many novel problems with covid. The supply chain problem for PPE seems easy enough to predict and prepare for given the predicted likelihood of a global respiratory pandemic. Do you have other examples of novel problems besides the supply chain problem?

There was some worry about supply chain problems for food. Perhaps that didn't materialize, or it did materialize and it was solved without me noticing.

I expect that this was the first extended shelter-in-place order for most if not all of the US, and this led to a bunch of problems in deciding what should and shouldn't be included in the order, how stringent to make it, etc.

More broadly, I'm not thinking of any specific problem, but the world is clearly very different than it was in any recent epidemic (at least in the US), and I would be shocked if this did not bring with it several challenges that we did not anticipate ahead of time (perhaps someone somewhere had anticipated it, but it wasn't widespread knowledge).

I don't agree that we can't prevent problems from arising with pandemics - e.g. we can decrease the interactions with wild animals that can transmit viruses to humans, and improve biosecurity standards to prevent viruses escaping from labs.

I definitely agree that we can decrease the likelihood of pandemics arising, but we can't really hope to eliminate them altogether (with current technology). But really I think this was not my main point, and I summarized my point badly: the point was that given that alignment is about preventing misalignment from arising, the analogous thing for pandemics would be about preventing pandemics from arising; it is unclear to me whether civilization was particularly inadequate along this axis ex ante (i.e. before we knew that COVID was a thing).

Comment by rohinmshah on Image GPT · 2020-06-18T16:30:44.524Z · score: 7 (4 votes) · LW · GW

Consider the two questions:

1. Does GPT-3 have "reasoning" and "understanding of the world"?

2. Does iGPT have "reasoning" and "understanding of the world"?

According to me, these questions are mostly separate, and answering one doesn't much help you answer the other.


However there were some people (and some small probability mass remaining in myself) saying that even GPT-3 wasn't doing any sort of reasoning, didn't have any sort of substantial understanding of the world, etc. Well, this is another nail in the coffin of that idea, in my opinion. Whatever this architecture is doing on the inside, it seems to be pretty capable and general.

... I don't understand what you mean here. The weights of image GPT are different from the weights of regular GPT-3, only the architecture is the same. Are you claiming that just the architecture is capable of "reasoning", regardless of the weights?

Or perhaps you're claiming that for an arbitrary task, we could take the GPT-3 architecture and apply it to that task and it would work well? But it would require a huge dataset and lots of training -- it doesn't seem like that should be called "reasoning" and/or "general intelligence".

Yeah I guess I'm confused what you're claiming here.

Comment by rohinmshah on Inaccessible information · 2020-06-18T06:55:14.210Z · score: 2 (1 votes) · LW · GW

Planned summary for the Alignment Newsletter:

One way to think about the problem of AI alignment is that we only know how to train models on information that is _accessible_ to us, but we want models that leverage _inaccessible_ information.

Information is accessible if it can be checked directly, or if an ML model would successfully transfer to provide the information when trained on some other accessible information. (An example of the latter would be if we trained a system to predict what happens in a day, and it successfully transfers to predicting what happens in a month.) Otherwise, the information is inaccessible: for example, “what Alice is thinking” is (at least currently) inaccessible, while “what Alice will say” is accessible. The post has several other examples.

Note that while an ML model may not directly say exactly what Alice is thinking, if we train it to predict what Alice will say, it will probably have some internal model of what Alice is thinking, since that is useful for predicting what Alice will say. It is nonetheless inaccessible because there’s no obvious way of extracting this information from the model. While we could train the model to also output “what Alice is thinking”, this would have to be training for “a consistent and plausible answer to what Alice is thinking”, since we don’t have the ground truth answer. This could incentivize bad policies that figure out what we would most believe, rather than reporting the truth.

The argument for risk is then as follows: we care about inaccessible information (e.g. we care about what people _actually_ experience, rather than what they _say_ they experience) but can’t easily make AI systems that optimize for it. However, AI systems will be able to infer and use inaccessible information, and would outcompete ones that don’t. AI systems will be able to plan using such inaccessible information for at least some goals. Then, the AI systems that plan using the inaccessible information could eventually control most resources. Key quote: “The key asymmetry working against us is that optimizing flourishing appears to require a particular quantity to be accessible, while danger just requires anything to be accessible.”

The post then goes on to list some possible angles of attack on this problem. Iterated amplification can be thought of as addressing gaps in speed, size, experience, algorithmic sophistication etc. between the agents we train and ourselves, which can limit what inaccessible information our agents can have that we won’t. However, it seems likely that amplification will eventually run up against some inaccessible information that will never be produced. As a result, this could be a “hard core” of alignment.

Planned opinion:

I think the idea of inaccessible information is an important one, but it’s one that feels deceptively hard to reason about. For example, I often think about solving alignment by approximating “what a human would say after thinking for a long time”; this is effectively a claim that human reasoning transfers well when iterated over long periods of time, and “what a human would say” is at least somewhat accessible. Regardless, it seems reasonably likely that AI systems will inherit the same property of transferability that I attribute to human reasoning, in which case the argument for risk applies primarily because the AI system might apply its reasoning towards a different goal than the ones we care about, which leads us back to the <@intent alignment@>(@Clarifying "AI Alignment"@) formulation.

This response views this post as a fairly general argument against black box optimization, where we only look at input-output behavior, as then we can’t use inaccessible information. It suggests that we need to understand how the AI system works, rather than relying on search, to avoid these problems.
Comment by rohinmshah on Possible takeaways from the coronavirus pandemic for slow AI takeoff · 2020-06-18T06:01:44.275Z · score: 2 (1 votes) · LW · GW

I am mostly confused, but I expect that if I learned more I would say that it wasn't a fair characterization of the FDA.

Comment by rohinmshah on Possible takeaways from the coronavirus pandemic for slow AI takeoff · 2020-06-18T05:13:27.065Z · score: 2 (1 votes) · LW · GW

Yeah, these sorts of stories seem possible, and it also seems possible that institutions try some terrible policies, notice that they're terrible, and then fix them. Like, this description:

But there's a decent chance that some other regulatory body will be involved, which is following the underlying FDA impulse of "Wield the one hammer we know how to wield to justify our jobs." (In a large company, it's possible that regulatory body could be a department inside the org, rather than a government agency.)

just doesn't seem to match my impression of non-EAs-or-rationalists working on AI governance. It's possible that people in government are much less competent than people at think tanks, but this would be fairly surprising to me. In addition, while I can't explain FDA decisions, I still pretty strongly penalize views that ascribe huge very-consequential-by-their-goals irrationality to small groups of humans working full time on something.

(Note that I would defend the claim that institutions work well enough that in a slow takeoff world the probability of extinction is < 80%, and probably < 50%, just on the basis that if AI alignment turned out to be impossible, we could coordinate not to build powerful AI.)

Comment by rohinmshah on Sparsity and interpretability? · 2020-06-18T04:46:49.357Z · score: 2 (1 votes) · LW · GW

Planned summary for the Alignment Newsletter:

If you want to visualize exactly what a neural network is doing, one approach is to visualize the entire computation graph of multiplies, additions, and nonlinearities. While this is extremely complex even on MNIST, we can make it much simpler by making the networks _sparse_, since any zero weights can be removed from the computation graph. Previous work has shown that we can remove well over 95% of weights from a model without degrading accuracy too much, so the authors do this to make the computation graph easier to understand.

They use this to visualize an MLP model for classifying MNIST digits and a DQN agent trained to play Cartpole. In the MNIST case, the computation graph can be drastically simplified by visualizing the first layer of the net as a list of 2D images, where the kth activation is given by the dot product of the 2D image with the input image. This deals with the vast majority of the weights in the neural net.
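
To make the two ideas in this summary concrete, here is a minimal sketch in plain Python. It is not the authors' code: magnitude pruning is just one assumed way of getting a sparse layer, the 4x4 "images" and 3 hidden units are toy stand-ins for MNIST, and every function name (`prune_by_magnitude`, `row_as_image`, `image_dot`, `count_edges`) and the `keep_fraction` parameter are hypothetical.

```python
import random


def prune_by_magnitude(weights, keep_fraction):
    """Zero out all but the largest-magnitude entries of a 2D weight list.

    Assumption: simple global magnitude pruning. Exact ties at the
    threshold would all be kept, which is negligible for random floats.
    """
    magnitudes = sorted((abs(w) for row in weights for w in row), reverse=True)
    n_keep = max(1, int(len(magnitudes) * keep_fraction))
    threshold = magnitudes[n_keep - 1]
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def count_edges(weights):
    """Number of nonzero weights, i.e. edges left in the computation graph."""
    return sum(1 for row in weights for w in row if w != 0.0)


def row_as_image(row, height, width):
    """Reshape one hidden unit's flat weight row into a 2D image."""
    return [row[r * width:(r + 1) * width] for r in range(height)]


def image_dot(image_a, image_b):
    """Elementwise product of two equally sized 2D images, summed."""
    return sum(a * b
               for row_a, row_b in zip(image_a, image_b)
               for a, b in zip(row_a, row_b))


random.seed(0)
HEIGHT, WIDTH, HIDDEN = 4, 4, 3  # toy sizes, not MNIST's 28x28

# One row of weights per hidden unit, one column per input pixel.
first_layer = [[random.gauss(0, 1) for _ in range(HEIGHT * WIDTH)]
               for _ in range(HIDDEN)]
sparse_layer = prune_by_magnitude(first_layer, keep_fraction=0.25)

input_image = [[random.random() for _ in range(WIDTH)] for _ in range(HEIGHT)]
flat_input = [pixel for row in input_image for pixel in row]

for k, weight_row in enumerate(sparse_layer):
    # The kth pre-activation computed two ways: as an image dot product
    # and as the usual flat weight-vector dot product.
    via_image = image_dot(row_as_image(weight_row, HEIGHT, WIDTH), input_image)
    via_vector = sum(w * x for w, x in zip(weight_row, flat_input))
    assert abs(via_image - via_vector) < 1e-9

print(count_edges(first_layer), "edges before pruning,",
      count_edges(sparse_layer), "after")
```

Each row of `sparse_layer`, reshaped by `row_as_image`, is one of the "2D images" the summary refers to; the zeros are the edges that drop out of the graph.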

Planned opinion:

This method has the nice property that it visualizes exactly what the neural net is doing -- it isn’t “rationalizing” an explanation, or eliding potentially important details. It is possible to gain interesting insights about the model: for example, the logit for digit 2 is always -2.39, implying that everything else is computed relative to -2.39. Looking at the images for digit 7, it seems like the model strongly believes that sevens must have the top few rows of pixels be blank, which I found a bit surprising. (I chose to look at the digit 7 somewhat arbitrarily.)

Of course, since the technique doesn’t throw away any information about the model, it becomes very complicated very quickly, and wouldn’t scale to larger models.

Comment by rohinmshah on More on disambiguating "discontinuity" · 2020-06-18T02:03:32.898Z · score: 3 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This post considers three different kinds of “discontinuity” that we might imagine with AI development. First, there could be a sharp change in progress or the rate of progress that breaks with the previous trendline (this is the sort of thing <@examined@>(@Discontinuous progress in history: an update@) by AI Impacts). Second, the rate of progress could either be slow or fast, regardless of whether there is a discontinuity in it. Finally, the calendar time could either be short or long, regardless of the rate of progress.

The post then applies these categories to three questions. Will we see AGI coming before it arrives? Will we be able to “course correct” if there are problems? Is it likely that a single actor obtains a decisive strategic advantage?

Comment by rohinmshah on Possible takeaways from the coronavirus pandemic for slow AI takeoff · 2020-06-18T01:19:54.802Z · score: 2 (1 votes) · LW · GW

Ah, I see. I agree with this and do think it cuts against my point #1, but not points #2 and #3. Edited the top-level comment to note this.

I'm sort of hesitant to jump into the "why covid obviously looks like mass institutional failure, given a very straightforward, well understood scenario" argument because I feel like it's been hashed out a lot in the past 3 months and I'm not sure where to go with it – I'm assuming you've read the relevant arguments and didn't find them convincing.

Tbc, I find it quite likely that there was mass institutional failure with COVID; I'm mostly arguing that soft takeoff is sufficiently different from COVID that we shouldn't necessarily expect the same mass institutional failure in the case of soft takeoff. (This is similar to Matthew's argument that the pandemic shares more properties with fast takeoff than with slow takeoff.)

Comment by rohinmshah on Possible takeaways from the coronavirus pandemic for slow AI takeoff · 2020-06-17T22:46:57.278Z · score: 6 (3 votes) · LW · GW

Perhaps you mean "AI alignment in the slow takeoff frame", where 'narrow' is less a binary judgment and more of a continuous judgment

I do mean this.

This also jumped out at me as being only a subset of what I think of as "AI alignment"; like, ontological collapse doesn't seem to have been a failure of narrow AI systems.

I'd predict that either ontological collapse won't be a problem, or we'll notice it in AI systems that are less general than humans. (After all, humans have in fact undergone ontological collapse, so presumably AI systems will also have undergone it by the time they reach human level generality.)

I still think the baseline prediction should be doom if we can only ever solve problems after encountering them.

This depends on what you count as "encountering a problem".

At one extreme, you might look at Faulty Reward Functions in the Wild and this counts as "encountering" the problem "If you train using PPO with such-and-such hyperparameters on the score reward function in the CoastRunners game then on this specific level the boat might get into a cycle of getting turbo boosts instead of finishing the race". If this is what it means to encounter a problem, then I agree the baseline prediction should be doom if we only solve problems after encountering them.

At the other extreme, maybe you look at it and this counts as "encountering" the problem "sometimes AI systems are not beneficial to humans". So, if you solve this problem (which we've already encountered), then almost tautologically you've solved AI alignment.

I'm not sure how to make further progress on this disagreement.

Comment by rohinmshah on Possible takeaways from the coronavirus pandemic for slow AI takeoff · 2020-06-17T22:33:03.275Z · score: 2 (1 votes) · LW · GW

you think there will be fewer novel problems arising during AI (a completely unprecedented phenomenon) than in Covid?

Relative to our position now, there will be more novel problems from powerful AI systems than for COVID.

Relative to our position e.g. two years before the "point of no return" (perhaps the deployment of the AI system that will eventually lead to extinction), there will be fewer novel problems than for COVID, at least if we are talking about the underlying causes of misalignment.

(The difference is that with AI alignment we're trying to prevent misaligned powerful AI systems from being deployed, whereas with pandemics we don't have the option of preventing "powerful diseases" from arising; we instead have to mitigate their effects.)

I agree that powerful AI systems will lead to more novel problems in their effects on society than COVID did, but that's mostly irrelevant if your goal is to make sure you don't have a superintelligent AI system that is trying to hurt you.

I'm also somewhat confused what facts you think we didn't know about covid that prevented us from preparing

I think it is plausible that we "could have" completely suppressed COVID, and that mostly wouldn't have required facts we didn't know, and the fact that we didn't do that is at least a weak sign of inadequacy.

I think given that we didn't suppress COVID, mitigating its damage probably involved new problems that we didn't have solutions for before. As an example, I would guess that in past epidemics the solution to "we have a mask shortage" would have been "buy masks from <country without the epidemic>", but that no longer works for COVID. But really the intuition is more like "life is very different in this pandemic relative to previous epidemics; it would be shocking if this didn't make the problem harder in some way that we failed to foresee".