[AN #73]: Detecting catastrophic failures by learning how agents tend to break 2019-11-13T18:10:01.544Z · score: 10 (3 votes)
[AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety 2019-11-06T18:10:01.604Z · score: 28 (7 votes)
[AN #71]: Avoiding reward tampering through current-RF optimization 2019-10-30T17:10:02.211Z · score: 11 (3 votes)
[AN #70]: Agents that help humans who are still learning about their own preferences 2019-10-23T17:10:02.102Z · score: 18 (6 votes)
Human-AI Collaboration 2019-10-22T06:32:20.910Z · score: 39 (13 votes)
[AN #69] Stuart Russell's new book on why we need to replace the standard model of AI 2019-10-19T00:30:01.642Z · score: 56 (18 votes)
[AN #68]: The attainable utility theory of impact 2019-10-14T17:00:01.424Z · score: 19 (5 votes)
[AN #67]: Creating environments in which to study inner alignment failures 2019-10-07T17:10:01.269Z · score: 17 (6 votes)
[AN #66]: Decomposing robustness into capability robustness and alignment robustness 2019-09-30T18:00:02.887Z · score: 12 (6 votes)
[AN #65]: Learning useful skills by watching humans “play” 2019-09-23T17:30:01.539Z · score: 12 (4 votes)
[AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning 2019-09-16T17:10:02.103Z · score: 11 (5 votes)
[AN #63] How architecture search, meta learning, and environment design could lead to general intelligence 2019-09-10T19:10:01.174Z · score: 24 (8 votes)
[AN #62] Are adversarial examples caused by real but imperceptible features? 2019-08-22T17:10:01.959Z · score: 28 (11 votes)
Call for contributors to the Alignment Newsletter 2019-08-21T18:21:31.113Z · score: 39 (12 votes)
Clarifying some key hypotheses in AI alignment 2019-08-15T21:29:06.564Z · score: 68 (28 votes)
[AN #61] AI policy and governance, from two people in the field 2019-08-05T17:00:02.048Z · score: 11 (5 votes)
[AN #60] A new AI challenge: Minecraft agents that assist human players in creative mode 2019-07-22T17:00:01.759Z · score: 25 (10 votes)
[AN #59] How arguments for AI risk have changed over time 2019-07-08T17:20:01.998Z · score: 43 (9 votes)
Learning biases and rewards simultaneously 2019-07-06T01:45:49.651Z · score: 43 (12 votes)
[AN #58] Mesa optimization: what it is, and why we should care 2019-06-24T16:10:01.330Z · score: 50 (13 votes)
[AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming 2019-06-05T23:20:01.202Z · score: 28 (9 votes)
[AN #56] Should ML researchers stop running experiments before making hypotheses? 2019-05-21T02:20:01.765Z · score: 22 (6 votes)
[AN #55] Regulatory markets and international standards as a means of ensuring beneficial AI 2019-05-05T02:20:01.030Z · score: 18 (6 votes)
[AN #54] Boxing a finite-horizon AI system to keep it unambitious 2019-04-28T05:20:01.179Z · score: 21 (6 votes)
Alignment Newsletter #53 2019-04-18T17:20:02.571Z · score: 22 (6 votes)
Alignment Newsletter One Year Retrospective 2019-04-10T06:58:58.588Z · score: 93 (27 votes)
Alignment Newsletter #52 2019-04-06T01:20:02.232Z · score: 20 (5 votes)
Alignment Newsletter #51 2019-04-03T04:10:01.325Z · score: 28 (5 votes)
Alignment Newsletter #50 2019-03-28T18:10:01.264Z · score: 16 (3 votes)
Alignment Newsletter #49 2019-03-20T04:20:01.333Z · score: 26 (8 votes)
Alignment Newsletter #48 2019-03-11T21:10:02.312Z · score: 31 (13 votes)
Alignment Newsletter #47 2019-03-04T04:30:11.524Z · score: 21 (5 votes)
Alignment Newsletter #46 2019-02-22T00:10:04.376Z · score: 18 (8 votes)
Alignment Newsletter #45 2019-02-14T02:10:01.155Z · score: 27 (9 votes)
Learning preferences by looking at the world 2019-02-12T22:25:16.905Z · score: 47 (13 votes)
Alignment Newsletter #44 2019-02-06T08:30:01.424Z · score: 20 (6 votes)
Conclusion to the sequence on value learning 2019-02-03T21:05:11.631Z · score: 48 (11 votes)
Alignment Newsletter #43 2019-01-29T21:10:02.373Z · score: 15 (5 votes)
Future directions for narrow value learning 2019-01-26T02:36:51.532Z · score: 12 (5 votes)
The human side of interaction 2019-01-24T10:14:33.906Z · score: 18 (5 votes)
Alignment Newsletter #42 2019-01-22T02:00:02.082Z · score: 21 (7 votes)
Following human norms 2019-01-20T23:59:16.742Z · score: 27 (10 votes)
Reward uncertainty 2019-01-19T02:16:05.194Z · score: 20 (6 votes)
Alignment Newsletter #41 2019-01-17T08:10:01.958Z · score: 23 (4 votes)
Human-AI Interaction 2019-01-15T01:57:15.558Z · score: 27 (8 votes)
What is narrow value learning? 2019-01-10T07:05:29.652Z · score: 21 (9 votes)
Alignment Newsletter #40 2019-01-08T20:10:03.445Z · score: 21 (4 votes)
Reframing Superintelligence: Comprehensive AI Services as General Intelligence 2019-01-08T07:12:29.534Z · score: 94 (37 votes)
AI safety without goal-directed behavior 2019-01-07T07:48:18.705Z · score: 53 (17 votes)
Will humans build goal-directed agents? 2019-01-05T01:33:36.548Z · score: 43 (13 votes)


Comment by rohinmshah on AI alignment landscape · 2019-11-17T20:51:05.114Z · score: 2 (1 votes) · LW · GW

Summary for the Alignment Newsletter: Basically just pasting in the image.


Here are a few points about this decomposition that were particularly salient or interesting to me.
First, at the top level, the problem is decomposed into alignment, competence, and coping with the impacts of AI. The "alignment tax" (extra technical cost for safety) is only applied to alignment, and not competence. While there isn't a tax in the "coping" section, I expect that is simply due to a lack of space; I expect that extra work will be needed for this, though it may not be technical. I broadly agree with this perspective: to me, it seems like the major technical problem which differentially increases long-term safety is to figure out how to get powerful AI systems that are trying to do what we want, i.e. they have the right motivation. Such AI systems will hopefully make sure to check with us before taking unusual irreversible actions, making e.g. robustness and reliability less important. Note that techniques like verification, transparency, and adversarial training may still be needed to ensure that the alignment itself is robust and reliable (see the inner alignment box); the claim is just that robustness and reliability of the AI's capabilities is less important.
Second, strategy and policy work here is divided into two categories: improving our ability to pay technical taxes (extra work that needs to be done to make AI systems better), and improving our ability to handle impacts of AI. Often, generically improving coordination can help with both categories: for example, the publishing concerns around GPT-2 have allowed researchers to develop synthetic text detection (the first category) as well as to coordinate on when not to release models (the second category).
Third, the categorization is relatively agnostic to the details of the AI systems we develop -- these only show up in level 4, where Paul specifies that he is mostly thinking about aligning learning, and not planning and deduction. It's not clear to me to what extent the upper levels of the decomposition make as much sense if considering other types of AI systems: I wouldn't be surprised if I thought the decomposition was not as good for risks from e.g. powerful deductive algorithms, but it would depend on the details of how deductive algorithms become so powerful. I'd be particularly excited to see more work presenting more concrete models of powerful AGI systems, and reasoning about risks in those models, as was done in Risks from Learned Optimization.
Comment by rohinmshah on Will transparency help catch deception? Perhaps not · 2019-11-17T20:45:56.014Z · score: 2 (1 votes) · LW · GW

My summary for the Alignment Newsletter:

Recent posts have been optimistic about using transparency tools to detect deceptive behavior. This post argues that we may not want to use transparency tools, because then the deceptive model can simply adapt to fool the transparency tools. Instead, we need something more like an end-to-end trained deception checker that's about as smart as the deceptive model, so that the deceptive model can't fool it.

My opinion:

In a comment, Evan Hubinger makes a point I agree with: the transparency tools don't need to be able to detect all deception; they just need to prevent the model from developing deception. If deception gets added slowly (i.e. the model doesn't "suddenly" become perfectly deceptive), then this can be way easier than detecting deception in arbitrary models, and could be done by tools.
Comment by rohinmshah on More variations on pseudo-alignment · 2019-11-17T20:44:07.643Z · score: 4 (2 votes) · LW · GW

Nicholas's summary for the Alignment Newsletter:

This post identifies two additional types of pseudo-alignment not mentioned in Risks from Learned Optimization. Corrigible pseudo-alignment is a new subtype of corrigible alignment. In corrigible alignment, the mesa optimizer models the base objective and optimizes that. Corrigible pseudo-alignment occurs when the model of the base objective is a non-robust proxy for the true base objective. Suboptimality deceptive alignment is when deception would help the mesa-optimizer achieve its objective, but it does not yet realize this. This is particularly concerning because even if AI developers check for and prevent deception during training, the agent might become deceptive after it has been deployed.

Nicholas's opinion:

These two variants of pseudo-alignment seem useful to keep in mind, and I am optimistic that classifying risks from mesa-optimization (and AI more generally) will make them easier to understand and address.
Comment by rohinmshah on Robin Hanson on the futurist focus on AI · 2019-11-14T18:12:21.904Z · score: 13 (7 votes) · LW · GW

Robin, I still don't understand why economic models predict only modest changes in agency problems, as you claimed here, when the principal is very limited and the agent is fully rational. I attempted to look through the literature, but did not find any models of this form. This is very likely because my literature search was not very good, as I am not an economist, so I would appreciate references.

That said, I would be very surprised if these references convinced me that a strongly superintelligent expected utility maximizer with a misaligned utility function (like "maximize the number of paperclips") would not destroy almost all of the value from our perspective (assuming the AI itself is not valuable). To me, this is the extreme example of a principal-agent problem where the principal is limited and the agent is very capable. When I hear "principal-agent problems are not much worse with a smarter agent", I hear "a paperclip maximizer wouldn't destroy most of the value", which seems crazy. Perhaps that is not what you mean though.

(Of course, you can argue that this scenario is not very likely, and I agree with that. I point to it mainly as a crystallization of the disagreement about principal-agent problems.)

Comment by rohinmshah on Robin Hanson on the futurist focus on AI · 2019-11-14T18:11:25.793Z · score: 3 (2 votes) · LW · GW

I was struck by how much I broadly agreed with almost everything Robin said. ETA: The key points of disagreement are a) I think principal-agent problems with a very smart agent can get very bad, see comment above, and b) on my inside view, timelines could be short (though I agree from the outside timelines look long).

To answer the questions:

Setting aside everything you know except what this looks like from the outside, would you predict AGI happening soon?


Should reasoning around AI risk arguments be compelling to outsiders outside of AI?

Depends on which arguments you're talking about, but I don't think it justifies devoting lots of resources to AI risk, if you rely just on the arguments / reasoning (as opposed to e.g. trusting the views of people worried about AI risk).

What percentage of people who agree with you that AI risk is big, agree for the same reasons that you do?

Depending on the definition of "big", I may or may not think that long-term AI risk is big. I do think AI risk is worthy of more attention than most other future scenarios, though 100 people thinking about it seems quite reasonable to me.

I think most people who agree do so for a similar broad reason, which is that agency problems can get very bad when the agent is much more capable than you. However, the details of the specific scenarios they are worried about tend to be different.

Comment by rohinmshah on The Credit Assignment Problem · 2019-11-14T06:26:33.690Z · score: 2 (1 votes) · LW · GW
I'm not sure how to further convey my sense that this is all very interesting. My model is that you're like "ok sure" but don't really see why I'm going on about this.

Yeah, I think this is basically right. For the most part though, I'm trying to talk about things where I disagree with some (perceived) empirical claim, as opposed to the overall "but why even think about these things" -- I am not surprised when it is hard to convey why things are interesting in an explicit way before the research is done.

Here, I was commenting on the perceived claim of "you need to have two-level algorithms in order to learn at all; a one-level algorithm is qualitatively different and can never succeed", where my response is "but no, REINFORCE would do okay, though it might be less sample-efficient". But it seems like you aren't claiming that, just claiming that two-level algorithms do quantitatively but not qualitatively better.

Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2019-11-13T02:55:48.744Z · score: 2 (1 votes) · LW · GW

That would probably be part of my response, but I think I'm also considering a different argument.

The thing that I was arguing against was "(c): agents that we build are optimizing some objective function". This is importantly different from "mesa-optimisers [would] end up being approximately optimal for some objective/utility function" when you consider distributional shift.

It seems plausible that the agent could look like it is "trying to achieve" some simple utility function, and perhaps it would even be approximately optimal for that simple utility function on the training distribution. (Simple here is standing in for "isn't one of the weird meaningless utility functions in Coherence arguments do not imply goal-directed behavior, and looks more like 'maximize happiness' or something like that".) But if you then take this agent and place it in a different distribution, it wouldn't do all the things that an EU maximizer with that utility function would do; it might only do some of them, because it isn't internally structured as a search process over sequences of actions that lead to high utility. It is instead structured as a bunch of heuristics that were selected for high utility in the training environment, which may or may not work well in the new setting.

(In my head, the Partial Agency sequence is meandering towards this conclusion, though I don't think that's actually true.)

(I think people have overupdated on "what Rohin believes" from the coherence arguments post -- I do think that powerful AI systems will be agent-ish, and EU maximizer-ish, I just don't think that it is going to be a 100% EU maximizer that chooses actions by considering reasonable sequences of actions and doing the one with the best predicted consequences. With that post, I was primarily arguing against the position that EU maximization is required by math.)

Comment by rohinmshah on Full toy model for preference learning · 2019-11-11T05:44:36.433Z · score: 4 (2 votes) · LW · GW

Planned summary:

This post applies Stuart's general preference learning algorithm to a toy environment in which a robot has a mishmash of preferences about how to classify and bin two types of objects.

Planned opinion:

This is a nice illustration of the very abstract algorithm proposed before; I'd love it if more people illustrated their algorithms this way.
Comment by rohinmshah on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-11T02:36:02.646Z · score: 5 (3 votes) · LW · GW

Nicholas's summary, that I'm copying over on his behalf:

This post argues that while it is impressive that AlphaStar can build up concepts complex enough to win at StarCraft, it is not actually developing reactive strategies. Rather than scouting what the opponent is doing and developing a new strategy based on that, AlphaStar just executes one of a predetermined set of strategies. This is because AlphaStar does not use causal reasoning and that keeps it from beating any of the top players.

Nicholas's opinion:

While I haven’t watched enough of the games to have a strong opinion on whether AlphaStar is empirically reacting to its opponent's strategies, I agree with Paul Christiano’s comment that in principle causal reasoning is just one type of computation that should be learnable.
This discussion also highlights the need for interpretability tools for deep RL so that we can have more informed discussions on exactly how and why strategies are decided on.
Comment by rohinmshah on Chris Olah’s views on AGI safety · 2019-11-10T18:05:53.268Z · score: 4 (2 votes) · LW · GW
Do you mean something like, "operating within the worldview"?

Basically yes. Longer version: "Suppose we were in scenario X. Normally, in such a scenario, I would discard this worldview, or put low weight on it, because reason Y. But suppose by fiat that I continue to use the worldview, with no other changes made to scenario X. Then ..."

It's meant to be analogous to imputing a value in a causal Bayes net, where you simply "suppose" that some event happened, and don't update on anything causally upstream, but only reason forward about things that are causally downstream. (I seem to recall Scott Garrabrant writing a good post on this, but I can't find it now.)
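As a toy illustration of that kind of imputation (the variable names and probabilities below are entirely made up): forcing a node's value in a sampled causal chain leaves variables upstream of it at their priors, while variables downstream of it respond to the intervention.

```python
import random

random.seed(0)

# Made-up toy chain: Cloudy -> Rain -> WetGrass.
def sample(do_rain=None):
    cloudy = random.random() < 0.5
    if do_rain is None:
        rain = random.random() < (0.8 if cloudy else 0.1)
    else:
        # "Impute" Rain: force its value, ignoring its causes.
        rain = do_rain
    wet = random.random() < (0.9 if rain else 0.05)
    return cloudy, rain, wet

samples = [sample(do_rain=True) for _ in range(20000)]
p_cloudy = sum(c for c, _, _ in samples) / len(samples)
p_wet = sum(w for _, _, w in samples) / len(samples)
# Upstream: Cloudy stays at its prior (~0.5) -- we don't update on
# anything causally upstream of the imputed node.
# Downstream: WetGrass responds to the intervention (~0.9).
```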

Comment by rohinmshah on Goal-thinking vs desire-thinking · 2019-11-10T17:55:58.976Z · score: 3 (2 votes) · LW · GW
Consider this essay [...] which takes the point of view that obviously a rational person would kill themselves

That sounded interestingly different from my usual perspective, so I read it, and it doesn't seem to me to be arguing that at all? At best you could say that it's arguing that if humans were more rational, then suicide rates would go up, which seems much less controversial.

Comment by rohinmshah on The Credit Assignment Problem · 2019-11-09T21:12:32.281Z · score: 5 (3 votes) · LW · GW

Oh, I see. You could also have a version of REINFORCE that doesn't make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can't prove anything interesting about this, but you also can't prove anything interesting about actor-critic methods that don't have episode boundaries, I think. Nonetheless, I'd expect it would somewhat work, in the same way that an actor-critic method would somewhat work. (I'm not sure which I expect to work better; partly it depends on the environment and the details of how you implement the actor-critic method.)

(All of this said with very weak confidence; I don't know much RL theory)
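A minimal sketch of the variant described above, under assumed details (a tabular softmax policy on a made-up two-action environment with no episode boundaries): every reward triggers a policy-gradient step crediting all past actions, with weights that decay the further back the action was.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy environment (entirely made up): action 1 yields reward 1,
# action 0 yields reward 0, with no episode boundaries.
n_actions = 2
theta = np.zeros(n_actions)   # logits of a tabular softmax policy
trace = np.zeros(n_actions)   # decayed sum of grad-log-prob of past actions
decay = 0.9                   # how quickly credit for older actions fades
lr = 0.1

for t in range(2000):
    probs = softmax(theta)
    a = rng.choice(n_actions, p=probs)
    # Gradient of log pi(a) w.r.t. the logits, for a softmax policy.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    # Decay old credit, then add the new action's gradient.
    trace = decay * trace + grad_log_pi
    # Each reward updates all past actions via the decayed trace.
    r = 1.0 if a == 1 else 0.0
    theta += lr * r * trace
```

As the comment notes, nothing interesting is provable here; the misattributed credit from the decayed trace averages out to zero in expectation, so the policy drifts toward the better action, just with extra variance.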

Comment by rohinmshah on The Credit Assignment Problem · 2019-11-08T17:39:59.060Z · score: 5 (3 votes) · LW · GW
Unfortunately, we can't just copy this trick. Artificial evolution requires that we decide how to kill off / reproduce things, in the same way that animal breeding requires breeders to decide what they're optimizing for. This puts us back at square one; IE, needing to get our gradient from somewhere else.

Suppose we have a good reward function (as is typically assumed in deep RL). We can just copy the trick in that setting, right? But the rest of the post makes it sound like you still think there's a problem, in that even with that reward, you don't know how to assign credit to each individual action. This is a problem that evolution also has; evolution seemed to manage it just fine.

(Similarly, even if you think actor-critic methods don't count, surely REINFORCE is one-level learning? It works okay; added bells and whistles like critics are improvements to its sample efficiency.)

Comment by rohinmshah on AI Alignment Open Thread October 2019 · 2019-11-06T23:25:02.197Z · score: 2 (1 votes) · LW · GW
What about AIs as deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to lack of technical solutions and economic competition?

Two responses:

First, this is more of a social coordination problem -- I'm claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to solve them; in that case you need to have enough social coordination to no longer deploy them.

Second, is there a consensus that recommendation algorithms are net negative? Within this community, that's probably the consensus, but I don't think it's a consensus more broadly. If we can't solve the bad discourse problem, but the recommendation algorithms are still net positive overall, then you want to keep them.

(Part of the social coordination problem is building consensus that something is wrong.)

the failure scenario I usually have in mind is that AI safety/alignment is too slow to be developed or costly to use, but more and more capable AIs get deployed anyway due to competitive pressures, and they slowly or quickly push human civilization off the rails, in any number of ways.

For many ways of how they push human civilization off the rails, I would not expect transparency / interpretability to help. One example would be the scenario in which each AI is legitimately trying to help some human(s), but selection / competitive pressures on the humans lead to sacrificing all values except productivity. I'd predict that most people optimistic about transparency / interpretability would agree with at least that example.

Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2019-11-06T23:10:43.625Z · score: 4 (2 votes) · LW · GW

... I'm not sure why I used the word "we" in the sentence you quoted. (Maybe I was thinking about a group of value-aligned agents? Maybe I was imagining that "reasonable reflection process" meant that we were in a post-scarcity world, everyone agreed that we should be doing reflection, everyone was already safe? Maybe I didn't want the definition to sound like I would only care about what I thought and not what everyone else thought? I'm not sure.)

In any case, I think you can change that sentence to "whatever I decide based on some 'reasonable' reflection process is good", and that's closer to what I meant.

I am much more uncertain about multiagent interactions. Like, suppose we give every person access to a somewhat superintelligent AI assistant that is legitimately trying to help them. Are things okay by default? I lean towards yes, but I'm uncertain. I did read through those two articles, and I broadly buy the theses they advance; I still lean towards yes because:

  • Things have broadly become better over time, despite the effects that the articles above highlight. The default prediction is that they continue to get better. (And I very uncertainly think people from the past would agree, given enough time to understand our world?)
  • In general, we learn reasonably well from experience; we try things and they go badly, but then things get better as we learn from that.
  • Humans tend to be quite risk-averse at trying things, and groups of humans seem to be even more risk-averse. As a result, it seems unlikely that we try a thing that ends up having a "direct" existentially bad effect.
  • You could worry about an "indirect" existentially bad effect, along the lines of Moloch, where there isn't any single human's optimization causing bad things to happen, but selection pressure causes problems. Selection pressure has existed for a long time and hasn't caused an existentially-bad outcome yet, so the default is that it won't in the future.
  • Perhaps AI accelerates the rate of progress in a way where we can't adapt fast enough, and this is why selection pressures can now cause an existentially bad effect. But this didn't happen with the Industrial Revolution. (That said, I do find this more plausible than the other scenarios.)

But in fact I usually don't aim to make claims about these sorts of scenarios; as I mentioned above I'm more optimistic about social solutions (that being the way we have solved this in the past).

Comment by rohinmshah on [AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety · 2019-11-06T20:20:35.188Z · score: 4 (2 votes) · LW · GW

Good catch, not sure how that happened. Fixed here, we'll probably send an email update as well?

Comment by rohinmshah on More variations on pseudo-alignment · 2019-11-05T15:44:31.433Z · score: 5 (3 votes) · LW · GW

I think it's more like: the model is optimizing for some misaligned mesa-objective, deception would be a better way to achieve the mesa-objective, but for some reason (see examples here) it isn't using deception yet. Which is a more specific version of the thing you said.

Comment by rohinmshah on AI Alignment Open Thread October 2019 · 2019-11-05T08:05:56.393Z · score: 2 (1 votes) · LW · GW

Hmm, I think I would make the further claim that in this world regular engineering practices are likely to work well, because they usually work well.

(If a single failure meant that we lose, then I wouldn't say this; so perhaps we also need to add in another claim that the first failure does not mean automatic loss. Regular engineering practices get you to high degrees of reliability, not perfect reliability.)

Comment by rohinmshah on More variations on pseudo-alignment · 2019-11-05T07:59:23.007Z · score: 4 (2 votes) · LW · GW

Yes, good point. I'd make the same claim with "doesn't know about deception" replaced by "hasn't figured out that deception is a good strategy (assuming deception is a good strategy)".

Comment by rohinmshah on Book Review: Design Principles of Biological Circuits · 2019-11-05T07:57:35.227Z · score: 3 (2 votes) · LW · GW


Comment by rohinmshah on But exactly how complex and fragile? · 2019-11-05T01:55:21.289Z · score: 4 (2 votes) · LW · GW
That may be the crux. I'm generally of the mindset that "can't guarantee/verify" implies "completely useless for AI safety". Verifying that it's safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn't guarantee it beforehand or double-check afterwards, that would just be called "AI".

It would be nice if you said this in comments in the future. This post seems pretty explicitly about the empirical question to me, and even if you don't think the empirical question counts as AI safety research (a tenable position, though I don't agree with it), the empirical questions are still pretty important for prioritization research, and I would like people to be able to have discussions about that.

(Partly I'm a bit frustrated at having had another long comment conversation that bottomed out in a crux that I already knew about, and I don't know how I could have known this ahead of time, because it really sounded to me like you were attempting to answer the empirical question.)

Although it occurs to me that you might be claiming that empirically, if we fail to verify, then we're near-definitely doomed. If so, I want to know the reasons for that belief, and how they contradict my arguments, rather than whatever it is we're currently debating. (And also, I retract both of the paragraphs above.)

Re: the rest of your comment: I don't in fact want to have AI systems that try to guess human "values" and then optimize that -- as you said we don't even know what "values" are. I more want AI systems that are trying to help us, in the same way that a personal assistant might help you, despite not knowing your "values".

Comment by rohinmshah on More variations on pseudo-alignment · 2019-11-05T01:18:39.518Z · score: 4 (2 votes) · LW · GW
I think that suboptimality deceptive alignment complicates a lot of stories for how we can correct issues in our AIs as they appear.

I don't think it would be useful to actually discuss this, since I expect the cruxes for our disagreement are elsewhere, but since it is a direct disagreement with my position, I'll state (but not argue for) my position here:

  • I expect that something (people, other AIs) will continue to monitor AI systems after deployment, though probably not as much as during training.
  • I don't think a competent-at-human-level system doesn't know about deception, and I don't think a competent-at-below-human-level system can cause extinction-level catastrophe (modulo something about "unintentionally causing us to have the wrong values")
  • Even if the AI system "discovers" deception during deployment, I expect that it will be bad at deception, or will deceive us in small-stakes situations, and we'll notice and fix the problem.

(There's a decent chance that I don't reply to replies to this comment.)

Comment by rohinmshah on Book Review: Design Principles of Biological Circuits · 2019-11-05T01:02:14.753Z · score: 17 (7 votes) · LW · GW

It seems like there are two claims here:

  • Biological systems are not random, in the sense that they have purpose
  • Biological systems are human-understandable with enough effort

The first one seems to be expected even under the "everything is a mess" model -- even though evolution is just randomly trying things, the only things that stick around are the ones that are useful, so you'd expect that most things that appear on first glance to be useless actually do have some purpose.

The second one is the claim I'm most interested in.

Some of your summaries seem to be more about the first claim. For Chapters 7-8:

For our purposes, the main takeaway from these two chapters is that, just because the system looks wasteful/arbitrary, does not mean it is. Once we know what to look for, it becomes clear that the structure of biological systems is not nearly so arbitrary as it looks.

This seems to be entirely the first claim.

The other chapters seem like they do mostly address the second claim, but it's a bit hard to tell. I'm curious if, now knowing about these two distinct claims, you still think the book is strong evidence for the second claim? What about chapters 7-8 in particular?

Comment by rohinmshah on But exactly how complex and fragile? · 2019-11-05T00:39:38.506Z · score: 4 (2 votes) · LW · GW
I agree that ML systems will get very good at "understanding" images in the sense of predicting motion or hidden pixels or whatever.

... So why can't ML systems get very good at predicting what humans value, if they can predict motion / pixels? Or perhaps you think they can predict motion / pixels, but not e.g. caption images, because that relies on higher-level concepts? If so, I predict that ML systems will also be good at that, and maybe that's the crux.

But while different humans seem to have pretty similar concepts of what a tree is, it is not at all clear that ML systems have the same tree-concept as a human.

I'm also predicting that vision-models-trained-with-richer-data will have approximately the same tree-concept as humans. (Not exactly the same, e.g. they won't have a notion of a "Christmas tree", presumably.)

and even if they did, how could we verify that, in a manner robust to both distribution shifts and Goodhart?

I'm not claiming we can verify it. I'm trying to make an empirical prediction about what happens. That's very different from what I can guarantee / verify. I'd argue the OP is also speaking in this frame.

Comment by rohinmshah on Chris Olah’s views on AGI safety · 2019-11-05T00:21:33.141Z · score: 7 (4 votes) · LW · GW
some understanding here may be more dangerous than no understanding, precisely because it's enough to accomplish some things without accomplishing everything that you needed to.

Fwiw, under the worldview I'm outlining, this sounds like a "clever argument" to me, that I would expect on priors to be less likely to be true, regardless of my position on takeoff. (Takeoff does matter, in that I expect that this worldview is not very accurate/good if there's discontinuous takeoff, but imputing the worldview I don't think takeoff matters.)

I often think of this as penalizing nth-order effects in proportion to some quickly-growing function of n. (Warning: I'm using the phrase "nth-order effects" in a non-standard, non-technical way.)

Under the worldview I mentioned, the first-order effect of better understanding of AI systems, is that you are more likely to build AI systems that are useful and do what you want.

The second-order effect is "maybe there's a regime where you can build capable-but-not-safe things; if we're currently below that, it's bad to go up into that regime". This requires a more complicated model of the world (given this worldview) and more assumptions of where we are.

(Also, now that I've written this out, the model also predicts there's no chance of solving alignment, because we'll first reach the capable-but-not-safe things, and die. Probably the best thing to do on this model is to race ahead on understanding as fast as possible, and hope we leapfrog directly to the capable-and-safe regime? Or you work on understanding AI in secret, and only release once you know how to do capable-and-safe, so that no one has the chance to work on capable-but-not-safe? You can see why this argument feels a bit off under the worldview I outlined.)

Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2019-11-05T00:08:57.858Z · score: 2 (1 votes) · LW · GW
"goals that are not our own" is ambiguous to me. Does it include a goal that someone currently thinks they have or behaves as if they have, but isn't really part of their "actual" values? Does it include a goal that someone gets talked into by a superintelligent AI?

"goals that are our own" is supposed to mean our "actual" values, which I don't know how to define, but shouldn't include a goal that you are "incorrectly" persuaded of by a superintelligent AI. The best operationalization I have is the values that I'd settle on after some "reasonable" reflection process. There are multiple "reasonable" reflection processes; the output of any of them is fine. But even this isn't exactly right, because there might be some values that I end up having in the world with AI, that I wouldn't have come across with any reasonable reflection process because I wouldn't have thought about the weird situations that occur once there is superintelligent AI, and I still want to say that those sorts of values could be fine.

Are you including risks that come from AI not being value-neutral, in other words, the AI being better at optimizing for some kinds of values over others, to the extent that the future is dominated by the goals of a small group of humans?

I was not including those risks (if you mean a setting where there are N groups of humans with different values, but AI can only help M < N of them, and so those M values dominate the future instead of all N).

I suspect there may be some illusion of transparency going on where you think terms like "adversarial optimization" and "goals that are not our own" have clear and obvious meanings...

I don't think "goals that are not our own" is philosophically obvious, but I think that it points to a fuzzy concept that cleaves reality at its joints, of which the central examples are quite clear. (The canonical example being the paperclip maximizer.) I agree that once you start really trying to identify the boundaries of the concept, things get very murky (e.g. what if an AI reports true information to you, causing you to adopt value X, and the AI is also aligned with value X? Note that since you can't understand all information, the AI has necessarily selected what information to show you. I'm sure there's a Stuart Armstrong post about this somewhere.)

By "adversarial optimization", I mean that the AI system is "trying to accomplish" some goal X, while humans instead "want" some goal Y, and this causes conflict between the AI system and humans.

(I could make it sound more technical by saying that the AI system is optimizing some utility function, while humans are optimizing some other utility function, which leads to conflict between the two because of convergent instrumental subgoals. I don't think this is more precise than the previous sentence.)

I think even with extreme moral anti-realism, there's still a significant risk that AIs could learn values that are wrong enough (i.e., different enough from our values, or are otherwise misaligned enough) to cause an existential-level bad outcome, but not human extinction. Can you confirm that you really endorse the ~0% figure?

Oh, whoops, I accidentally estimated the answer to "(existential-level) bad outcome happens due to AI by default, without involving adversarial optimization". I agree that you could get existential-level bad outcomes that aren't human extinction due to adversarial optimization. I'm not sure how likely I find that -- it seems like that depends on what the optimal policy for a superintelligent AI is, which, who knows if that involves literally killing all humans. (Obviously, to be consistent with earlier estimates, it must be <= 10%.)

Can I convince you that you should be uncertain enough about this, and that enough other people disagree with you about this (in particular that social coordination may be hard enough that we should try to solve a wider kind of AI risk via technical means), that more careful language to distinguish between different kinds of risk and different kinds of research would be a good idea to facilitate thinking and discussion? (I take your point that you weren't expecting this interview to be make public, so I'm just trying to build a consensus about what should ideally happen in the future.)

Yeah, I do try to do this already. The note I quoted above is one that I asked to be added post-conversation for basically this reason. (It's somewhat hard to do so though, my brain is pretty bad at keeping track of uncertainty that doesn't come from an underlying inside-view model.)

Comment by rohinmshah on But exactly how complex and fragile? · 2019-11-04T23:46:23.303Z · score: 4 (2 votes) · LW · GW
The key point is that we don't even know what the relevant distance metric is. Even in human terms, we don't know what the relevant metric is. We cannot expect to be able to distinguish an ML system which has learned the "correct" metric from one which has not.

This seems true, and also seems true for the images case, yet I (and I think most researchers) predict that image understanding will get very good / superhuman. What distinguishes the images case from the human values case? My guess at your response is that we aren't applying optimization pressure on the learned distance function for images.

In that case, my response would be that yes, if you froze in place the learned distance metric / "human value representation" at any given point, and then ratcheted up the "capabilities" of the agent, that's reasonably likely to go badly (though I'm not sure, and it depends how much the current agent has already been trained). But presumably the agent is going to continue learning over time.

Even in the case where we freeze the values and ratchet up the capabilities: you're presumably not aligned with me, but it doesn't seem like ratcheting up your capabilities obviously leads to doom for me. (It doesn't obviously not lead to doom either though.)

Comment by rohinmshah on But exactly how complex and fragile? · 2019-11-04T23:39:07.508Z · score: 2 (1 votes) · LW · GW

In the images case, I meant that if you had a richer dataset with more images in more conditions, accompanied with touch-based information, perhaps even audio, and the agent were allowed to interact with the world and see through these input mechanisms what the world did in response, then it would learn concepts that allow it to understand the world the way we do -- it wouldn't be fooled by occlusions, or by putting a picture of a baseball on top of an ocean picture, etc. (This also requires a sufficiently large dataset; I don't know how large.)

I'm not saying that such a dataset would lead it to learn what we value. I don't know what that dataset would look like, partly because it's not clear to me what exactly we value.

Comment by rohinmshah on AI Alignment Open Thread October 2019 · 2019-11-04T23:32:50.520Z · score: 3 (2 votes) · LW · GW

In my case in particular, it's definitely more the second case; I promote posts to AF pretty haphazardly. (I'm not sure if I've ever done it for a post I didn't already know about before it was posted.)

Comment by rohinmshah on AI Alignment Open Thread October 2019 · 2019-11-04T23:29:07.820Z · score: 4 (2 votes) · LW · GW

If you expect discontinuous takeoff, or you want a proof that your AGI is safe, then I agree transparency / interpretability is unlikely to give you what you want.

If you instead expect gradual takeoff, then it seems reasonable to expect that regular engineering practices are the sort of thing you want, of which interpretability / transparency tools are probably the most obvious thing you want to try. (Red teaming would be included in this.)

However, I suspect Chris Olah, Evan Hubinger, Daniel Filan, and Matthew Barnett would all not justify interpretability / transparency on these grounds. I don't know about Paul Christiano.

Comment by rohinmshah on Chris Olah’s views on AGI safety · 2019-11-04T23:19:55.821Z · score: 12 (6 votes) · LW · GW
That's an interesting and clever point (although it triggers some sort of "clever argument" safeguard that makes me cautious of it).

I think it shouldn't be in the "clever argument" category, and the only reason it feels like that is because you're using the capabilities-alignment framework.

Consider instead this worldview:

The way you build things that are useful and do what you want is to understand how things work and put them together in a deliberate way. If you put things together randomly, they either won't work, or will have unintended side effects.

(This worldview can apply to far more than AI; e.g. it seems right in basically every STEM field. You might argue that putting things together randomly seems to work surprisingly well in AI, to which I say that it really doesn't, you just don't see all of the effort where you put things together randomly and it simply flat-out fails.)

The argument "it's good for people to understand AI techniques better even if it accelerates AGI" is a very straightforward non-clever consequence of this worldview.

Somewhat more broadly, I recommend being able to inhabit this other worldview. I expect it to be more useful / accurate than the capabilities / alignment worldview.

(Disclaimer: I believed this point before this post -- in fact I had several conversations with people about it back in May, when I was considering a project with potential effects along these lines.)

Comment by rohinmshah on But exactly how complex and fragile? · 2019-11-04T17:33:09.778Z · score: 13 (4 votes) · LW · GW

The natural response to this is "ML seems really good at learning good distance metrics".

And it is definitely the case, at least, that many mathematically simple distance metrics do display value fragility.

Which is why you learn the distance metric. "Mathematically simple" rules for vision, speech recognition, etc. would all be very fragile, but ML seems to solve those tasks just fine.

One obvious response is "but what about adversarial examples"; my position is that image datasets are not rich enough for ML to learn the human-desired concepts; the concepts they do learn are predictive, just not about things we care about.

Another response is "but there are lots of rewards / utilities that are compatible with observed behavior, so you might learn the wrong thing, e.g. you might learn influence-seeking behavior". This is the worry behind inner alignment concerns as well. This seems like a real worry to me, but it's only tangentially related to the complexity / fragility of value.

Comment by rohinmshah on But exactly how complex and fragile? · 2019-11-04T01:27:35.285Z · score: 17 (7 votes) · LW · GW

This Facebook post has the best discussion of this I know of; in particular check out Dario's comment and the replies to it.

Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2019-11-02T17:23:57.544Z · score: 6 (3 votes) · LW · GW
Can you quote that please?

It's here:

[Note: In this interview, Rohin was only considering risks arising because of AI systems that try to optimize for goals that are not our own, not other forms of existential risks from AI.]

Ok, I'm curious how likely you think it is that an (existential-level) bad outcome happens due to AI by default, without involving human extinction.

I mostly want to punt on this question, because I'm confused about what "actual" values are. I could imagine operationalizations where I'd say > 90% chance (e.g. if our "actual" values are the exact thing we would settle on after a specific kind of reflection that we may not know about right now), and others where I'd assign ~0% chance (e.g. the extremes of a moral anti-realist view).

I do lean closer to the stance of "whatever we decide based on some 'reasonable' reflection process is good", which seems to encompass a wide range of futures, and seems likely to me to happen by default.

ETA: Also, what was your motivation for talking about a fairly narrow kind of AI risk, when the interviewer started with a more general notion?

I mean, the actual causal answer is "that's what I immediately thought about", it wasn't a deliberate decision. But here are some rationalizations after the fact, most of which I'd expect are causal in that they informed the underlying heuristics that caused me to immediately think of the narrow kind of AI risk:

  • My model was that the interviewer(s) were talking about the narrow kind of AI risk, so it made sense to talk about that.
  • Initially, there wasn't any plan for this interview to be made public, so I was less careful about making myself broadly understandable, and instead tailoring my words to my "audience" of 3 people.
  • I mostly think about and have expertise on the narrow kind (adversarial optimization against humans).
  • I expect that technical solutions are primarily important only for the narrow kind of AI risk (I'm more optimistic about social coordination for the general kind). So when I'm asked a question positing "without additional intervention by us doing safety research", I tend to think of adversarial optimization, since that's what I expect to be addressed by safety research.
Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2019-11-01T21:21:02.011Z · score: 3 (2 votes) · LW · GW

There's a note early in the transcript that says that basically everything I say in the interview is about adversarial optimization against humans only, which includes the 90% figure.

I wouldn't take the number too seriously. If you asked me some other time of day, or in some other context, or with slightly different words, I might have said 80%, or 95%. I doubt I would have said 70%, or 99%.

Another reason to not take the numbers too seriously: arguably, by the numbers, I have a large disagreement with Will MacAskill, but having discussed it with him I think we basically agree on most things, except how to aggregate the considerations together. I expect that I agree with Will more than I agree with a random AI safety researcher who estimates the same 90% number that I estimated.

even without any additional intervention from current longtermists, advanced AI systems will not cause human extinction by adversarially optimizing against humans

This is the best operationalization of the ones you've listed.

Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2019-11-01T20:57:50.962Z · score: 4 (2 votes) · LW · GW

I am not arguing that we'll end up building tool AI; I do think it will be agent-like. At a high level, I'm arguing that the intelligence and agentiness will increase continuously over time, and as we notice the resulting (non-existential) problems we'll fix them, or start over.

I agree with your point that long-term planning will develop even with a bunch of heuristics.

Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2019-11-01T20:54:53.902Z · score: 3 (2 votes) · LW · GW

I enjoyed this comment, thanks for thinking it through! Some comments:

If our superintelligent AI is just a bunch of well developed heuristics, it is unlikely that those heuristics will be generatively strategic enough to engage in super-long-term planning

This is not my belief. I think that powerful AI systems, even if they are a bunch of well developed heuristics, will be able to do super-long-term planning (in the same way that I'm capable of it, and I'm a bunch of heuristics, or Eliezer is to take your example).

Obviously this depends on how good the heuristics are, but I do think that heuristics will get to the point where they do super-long-term planning, and my belief that we'll be safe by default doesn't depend on assuming that AI won't do long-term planning.

I think Rohin would agree with this belief in heuristic kludges that are effecively agential despite not being a One True Algorithm

Yup, that's correct.

So I agree that we have a good chance of ensuring that this kind of AI is safe--mainly because I don't think the level of heuristics involved invoke an AI take-off slow enough to clearly indicate safety risks before they become x-risks.

Should "I don't think" be "I do think"? Otherwise I'm confused. With that correction, I basically agree.

However, I don't think that machine-learned heuristics are the only way we can get highly dangerous agenty heuristics. We've made a lot of mathematical process on understanding logic, rationality and decision theory and, while machine-learned heuristics may figure out approximately Perfect Reasoning Capabilities just by training, I think it's possible that we can directly hardcode heuristics that do the same thing based on our current understanding of things we associate with Perfect Reasoning Capabilities.

I would be very surprised if this worked in the near term. Like, <1% in 5 years, <5% in 20 years, and really I want to say < 1% that this is the first way we get AGI (no matter when), but I can't actually be that confident.

My impression is that many researchers at MIRI would qualitatively agree with me on this, though probably with less confidence.

Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2019-11-01T19:22:53.327Z · score: 4 (2 votes) · LW · GW

While I didn't make the claim, my general impression from conversations (i.e. not public statements) is that the claim is broadly true for AI safety researchers weighted by engagement with AI safety, in the Bay Area at least, and especially true when comparing to MIRI.

Comment by rohinmshah on AI safety without goal-directed behavior · 2019-10-31T19:17:46.629Z · score: 3 (2 votes) · LW · GW
(I think I'm just agreeing with your comment here?)

Yeah, I think that's basically right.

Comment by rohinmshah on AI safety without goal-directed behavior · 2019-10-30T16:08:47.779Z · score: 3 (2 votes) · LW · GW

To be clear, this post was not arguing that CIRL is not goal-directed -- you'll notice that CIRL is not on my list of potential non-goal-directed models above.

I think CIRL is in this weird in-between place where it is kind of sort of goal-directed. You can think of three different kinds of AI systems:

  • An agent optimizing a known, definite utility function
  • An agent optimizing a utility function that it is uncertain about, that it gets information about from humans
  • A system that isn't maximizing any simple utility function at all

I claim the first is clearly goal-directed, and the last is not goal-directed. CIRL is in the second set, where it's not totally clear: its actions are driven by a goal, but that goal comes from another agent (a human). (This is also the case with imitation learning, and that case is also not clear -- see this thread.)

I did in fact keep thinking to myself "How do we make sure that the AI really has the goal of "The human gets what they want.", as opposed to a proxy to it that will diverge out-of-distribution?"

I think this is a reasonable critique to have. In the context of Stuart's book, this is essentially a quibble with principle 3:

3. The ultimate source of information about human preferences is human behavior.

The goal learned by the AI system depends on how it maps human behavior (or sensory data) into (beliefs about) human preferences. If that mapping is not accurate (quite likely), then it will in fact learn some other goal, which could be catastrophic.

Comment by rohinmshah on Are minimal circuits deceptive? · 2019-10-29T22:16:10.640Z · score: 4 (2 votes) · LW · GW
First, a minimal circuit is not the same as a speed-prior-minimal algorithm. Minimal circuits have to be minimal in width + depth, so a GLUT would definitely lose out.

I don't really understand this -- my understanding is that with a minimal circuit, you want to minimize the number of gates in the circuit, and the circuit must be a DAG (if you're allowed to have loops + clocks, then you can build a regular computer, and for complex tasks the problem should be very similar to finding the shortest program and implementing it on a universal circuit).

But then I can create a Turing Machine that interprets its input as a circuit, and simulates each gate of the circuit in sequence. Then the running time of the TM is proportional to the number of gates in the circuit, so an input with minimal running time should be a circuit with the minimal number of gates. This is not technically a Universal TM, since loop-free circuits are not Turing-complete, but I would expect a speed prior using such a Turing Machine would be relatively similar to a speed prior with a true UTM.
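(To make the reduction concrete, here's a toy interpreter that charges one step per gate, so an input circuit that minimizes the interpreter's running time is exactly a circuit with the fewest gates. The gate encoding and the XOR example are invented purely for illustration.)

```python
# Illustrative: evaluate a loop-free boolean circuit given as a topologically
# sorted gate list. Each gate costs one step, so minimizing running time under
# this "interpreter TM" corresponds to minimizing the number of gates.

def eval_circuit(inputs, gates):
    """inputs: list of bools; gates: list of (op, i, j) referencing wire indices."""
    wires = list(inputs)
    steps = 0
    for op, i, j in gates:
        steps += 1  # constant work per gate, as in the TM simulation
        if op == "AND":
            wires.append(wires[i] and wires[j])
        elif op == "OR":
            wires.append(wires[i] or wires[j])
        elif op == "NAND":
            wires.append(not (wires[i] and wires[j]))
    return wires[-1], steps

# XOR(a, b) built from 4 NAND gates: runtime scales with gate count.
xor_gates = [("NAND", 0, 1), ("NAND", 0, 2), ("NAND", 1, 2), ("NAND", 3, 4)]
out, steps = eval_circuit([True, False], xor_gates)
# out == True (XOR of True, False), steps == 4
```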

For example, consider the task of finding the minimum of an arbitrary convex function. Certainly for the infinite set of all possible convex functions (on the rationals, say), I would be pretty surprised if something like gradient descent weren't the fastest way to do that.

I agree that if you have to work on an infinite set of inputs, a GLUT is not the way to do that. I was thinking of the case where you have to work on a small finite set of inputs (hence why I talk about the optimal trajectory instead of the optimal policy), which is always going to be the case in the real world. But this is too pedantic, we can certainly think of the theoretical case where you have to work on an infinite set of inputs. I was mostly trying to use the GLUT as an intuition pump, not arguing that it was always better.

In the case with infinite inputs, I still have the intuition that meta learning is what you do when you don't know enough about the problem to write down the good heuristics straight away, as opposed to being the fastest way of solving the problem. But I agree that the fastest solution won't be a GLUT; I'm thinking more a combination of really good heuristics that "directly" solve the problem. (Part of this is an intuition that for any reasonably structured set of inputs, value-neutral optimization of a fixed objective is very inefficient.)

Comment by rohinmshah on Impact measurement and value-neutrality verification · 2019-10-29T22:00:55.376Z · score: 6 (3 votes) · LW · GW

Oh, a thing I forgot to mention about the proposed formalization: if your distribution over utility functions includes some functions that are amenable to change via optimization (e.g. number of paperclips) and some that are not amenable to change via optimization (e.g. number of perpetual motion machines), then any optimization algorithm, including ones we'd naively call "value-neutral", would lead to distributions of changes in attainable utility with large standard deviation. It might be possible to fix this through some sort of normalization scheme, though I'm not sure how.
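(A toy numeric version of this worry, with made-up utility functions and numbers: one utility is amenable to optimization and one isn't, so the standard deviation of attainable-utility changes is large even for an intuitively "value-neutral" optimizer.)

```python
import statistics

# Hypothetical changes in attainable utility from running one "value-neutral"
# optimization algorithm, under a uniform distribution over two utility functions.
delta_attainable_utility = {
    "paperclips": 10.0,        # amenable to optimization: big gain
    "perpetual_motion": 0.0,   # not amenable: no gain is possible
}

changes = list(delta_attainable_utility.values())
mean_change = statistics.mean(changes)   # AUP-style impact measure: 5.0
neutrality = statistics.stdev(changes)   # proposed value-neutrality: ~7.07

# The standard deviation is large, so the algorithm looks highly non-neutral,
# even though it helps every *optimizable* goal equally -- hence the possible
# need for some normalization scheme.
```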

Comment by rohinmshah on Impact measurement and value-neutrality verification · 2019-10-29T21:56:20.822Z · score: 4 (2 votes) · LW · GW

I understand the point made in that comment; the part I'm confused about is why the two subpoints in that comment are true:

If "strategy-stealing assumption" is true, we can get most of what we "really" want by doing strategy-stealing. (Example of how this can be false: (Logical) Time is of the essence)
It's not too hard to make "strategy-stealing assumption" true.

Like... why? If we have unaligned AI but not aligned AI, then we have failed to make the strategy-stealing assumption true. If we do succeed in building aligned AI, why are we worried about unaligned AI, since we presumably won't deploy it (and so strategy-stealing is irrelevant)? I could imagine that some people mistakenly think that unaligned AI is actually aligned and so build it, or that some malicious actors build AI aligned with them, and the strategy-stealing assumption means that this is basically fine as long as they don't start out with too many resources, but this doesn't seem like the mainline scenario to worry about: it seems much more relevant whether we can align AI or not.

Comment by rohinmshah on Gradient hacking · 2019-10-29T01:20:31.604Z · score: 4 (2 votes) · LW · GW

Planned summary:

This post calls attention to the problem of **gradient hacking**, where a powerful agent being trained by gradient descent could structure its computation in such a way that it causes its gradients to update it in some particular way. For example, a mesa optimizer could structure its computation to first check whether its objective has been tampered with, and if so to fail catastrophically, so that the gradients tend to point away from tampering with the objective.

Planned opinion:

I'd be interested in work that further sketches out a scenario in which this could occur. I wrote about some particular details in this comment.

Comment by rohinmshah on Gradient hacking · 2019-10-29T01:18:52.372Z · score: 4 (2 votes) · LW · GW

Fwiw, I ran an experiment that was similarly inspired. It was a year and a half ago, so I might get the details a bit wrong. The goal was to train a neural net to predict MNIST digits, except to always misclassify a 3 as an 8 (or something like that), and also make sure the gradients on 3s and 8s were zero. The "hope" was that if you finetuned on real MNIST, the gradients for fixing the problem would be really small, and so the buggy behavior would persist.

The result of the experiment was that it did not work, and the finetuning was still able to fix the bad model, though I didn't try very hard to get it to work.

Comment by rohinmshah on Gradient hacking · 2019-10-29T01:13:45.743Z · score: 4 (2 votes) · LW · GW

I'd be interested in work that further sketches out a scenario in which this could occur. Some particular details that would be interesting:

  • Does the agent need to have the ability to change the weights in the neural net that it is implemented in? If so, how does it get that ability?
  • How does the agent deal with the fact that the outer optimization algorithm will force it to explore? Or does that not matter?
  • How do we deal with the fact that gradients mostly don't account for counterfactual behavior? Things like "Fail hard if the objective is changed" don't work as well if the gradients can't tell that changing the objective leads to failure. That said, maybe gradients can tell that that happens, depending on the setup.
  • Why can't the gradient descent get rid of the computation that decides to perform gradient hacking, or repurpose it for something more useful?
Comment by rohinmshah on Impact measurement and value-neutrality verification · 2019-10-29T00:46:26.207Z · score: 4 (2 votes) · LW · GW

Planned summary:

So far, most <@uses of impact formalizations@> don't help with inner alignment, because we simply add impact to the (outer) loss function. This post suggests that impact formalizations could also be adapted to verify whether an optimization algorithm is _value-neutral_ -- that is, no matter what objective you apply it towards, it provides approximately the same benefit. In particular, <@AUP@> measures the _expectation_ of the distribution of changes in attainable utilities for a given action. You could get a measure of the value-neutrality of an action by instead computing the _standard deviation_ of this distribution, since that measures how different the changes in utility are. (Evan would use policies instead of actions, but conceptually that's a minor difference.) Verifying value-neutrality could be used to ensure that the <@strategy-stealing assumption@> is true.

Planned opinion:

I continue to be confused about the purpose of the strategy-stealing assumption, so I don't have a strong opinion about the importance of value-neutrality verification. I do think that the distribution of changes to attainable utilities is a powerful mathematical object, and it makes sense that there are other properties of interest that involve analyzing it.

Comment by rohinmshah on Are minimal circuits deceptive? · 2019-10-28T23:35:25.191Z · score: 2 (1 votes) · LW · GW

Planned summary:

While it has been argued that the simplest program that solves a complex task is likely to be deceptive, it hasn't yet been argued whether the fastest program that solves a complex task will be deceptive. This post argues that fast programs will often be forced to learn a good policy (just as we need to do today), and the learned policy is likely to be deceptive (presumably due to risks from learned optimization). Thus, there are at least some tasks where the fastest program will also be deceptive.

Planned opinion:

This is an intriguing hypothesis, but I'm not yet convinced: it's not clear why the fastest program would have to learn the best policy, rather than directly hardcoding the best policy. If there are multiple possible tasks, the program could have a nested if structure that figures out which task needs to be done and then executes the best policy for that task. More details in this comment.

Comment by rohinmshah on Are minimal circuits deceptive? · 2019-10-28T23:29:48.730Z · score: 4 (2 votes) · LW · GW
if there exists some set of natural tasks for which the fastest way to solve them is to do some sort of machine learning to find a good policy

This would be pretty strange -- why not just directly hardcode the policy? Wouldn't that be faster? We need to use machine learning because we aren't able to write down a program that "directly" solves e.g. image recognition, but the "direct" program would be faster if we had some way of finding it. The general reason for optimism is that the "fastest" requirement implies that any extraneous computation (e.g. deception) is removed; that same reason implies that any search would be removed and replaced with a correct output of the search.

Another way to think of this: when you don't have a simplicity bias, you have to compete with the GLUT (Giant Lookup Table), which can be very fast. Even if you take into account the time taken to perform the lookup, in a deterministic environment the GLUT only has to encode the optimal trajectory, not the full policy. In a stochastic environment, the GLUT may need to be exponentially larger, so it may be too slow, but even so you can have GLUT-like things built out of higher-level abstractions, which might be enough to avoid deception. Basically you can do a lot of weird stuff when you don't require simplicity; it's not clear that meta learning should be modeled the same way.
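(A toy illustration of the deterministic case, with an invented chain environment: the GLUT only needs entries for the states actually visited on the optimal trajectory, not a full policy over all states.)

```python
# Deterministic chain environment: state i, action moves to i+1 or i-1, goal at 4.
def step(state, action):
    return state + (1 if action == "right" else -1)

# A full policy would cover every state; this GLUT covers only the optimal trajectory.
glut = {0: "right", 1: "right", 2: "right", 3: "right"}

state = 0
trajectory = [state]
while state != 4:
    state = step(state, glut[state])  # O(1) lookup, no search or learning at runtime
    trajectory.append(state)
# trajectory == [0, 1, 2, 3, 4]
```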

Comment by rohinmshah on Partial Agency · 2019-10-28T19:10:26.335Z · score: 2 (1 votes) · LW · GW

Sorry for the very late reply, I've been busy :/

To be clear, I don't think iid explains it in all cases, I also think iid is just a particularly clean example. Hence why I said (emphasis added now):

So my position is "partial agency arises because any embedded learning algorithm will necessarily leave out aspects that the idealized learning algorithm can identify". And as a subclaim, that this often happens because of the effective iid assumption between data points in a learning algorithm.


I'm not sure what you're saying here. I agree that "no one wants that".

My point is that the relevant distinction in that case seems to be "instrumental goal" vs. "terminal goal", rather than "full agency" vs. "partial agency". In other words, I expect that a map that split things up based on instrumental vs. terminal would do a better job of understanding the territory than one that used full vs. partial agency.

Re: evolution example, I agree that particular learning algorithms can be designed such that they incentivize partial agency. I think my intuition is that all of the particular kinds of partial agency we could incentivize would be too much of a handicap on powerful AI systems (or won't work at all, e.g. if the way to get powerful AI systems is via mesa optimization).

I'm only claiming that **if the rules of the game remain intact** we can incentivise partial agency.

Definitely agree with that.