Prediction markets for internet points? 2019-10-27T19:30:00.898Z · score: 40 (18 votes)
AI alignment landscape 2019-10-13T02:10:01.135Z · score: 43 (16 votes)
Taxing investment income is complicated 2019-09-22T01:30:01.242Z · score: 34 (13 votes)
The strategy-stealing assumption 2019-09-16T15:23:25.339Z · score: 62 (14 votes)
Reframing the evolutionary benefit of sex 2019-09-14T17:00:01.184Z · score: 66 (22 votes)
Ought: why it matters and ways to help 2019-07-25T18:00:27.918Z · score: 83 (33 votes)
Aligning a toy model of optimization 2019-06-28T20:23:51.337Z · score: 52 (17 votes)
What failure looks like 2019-03-17T20:18:59.800Z · score: 209 (84 votes)
Security amplification 2019-02-06T17:28:19.995Z · score: 20 (4 votes)
Reliability amplification 2019-01-31T21:12:18.591Z · score: 21 (5 votes)
Techniques for optimizing worst-case performance 2019-01-28T21:29:53.164Z · score: 23 (6 votes)
Thoughts on reward engineering 2019-01-24T20:15:05.251Z · score: 30 (8 votes)
Learning with catastrophes 2019-01-23T03:01:26.397Z · score: 28 (9 votes)
Capability amplification 2019-01-20T07:03:27.879Z · score: 24 (7 votes)
The reward engineering problem 2019-01-16T18:47:24.075Z · score: 23 (4 votes)
Towards formalizing universality 2019-01-13T20:39:21.726Z · score: 29 (6 votes)
Directions and desiderata for AI alignment 2019-01-13T07:47:13.581Z · score: 29 (6 votes)
Ambitious vs. narrow value learning 2019-01-12T06:18:21.747Z · score: 20 (6 votes)
AlphaGo Zero and capability amplification 2019-01-09T00:40:13.391Z · score: 27 (11 votes)
Supervising strong learners by amplifying weak experts 2019-01-06T07:00:58.680Z · score: 28 (7 votes)
Benign model-free RL 2018-12-02T04:10:45.205Z · score: 12 (3 votes)
Corrigibility 2018-11-27T21:50:10.517Z · score: 39 (9 votes)
Humans Consulting HCH 2018-11-25T23:18:55.247Z · score: 19 (3 votes)
Approval-directed bootstrapping 2018-11-25T23:18:47.542Z · score: 19 (4 votes)
Approval-directed agents 2018-11-22T21:15:28.956Z · score: 29 (5 votes)
Prosaic AI alignment 2018-11-20T13:56:39.773Z · score: 36 (9 votes)
An unaligned benchmark 2018-11-17T15:51:03.448Z · score: 27 (6 votes)
Clarifying "AI Alignment" 2018-11-15T14:41:57.599Z · score: 54 (16 votes)
The Steering Problem 2018-11-13T17:14:56.557Z · score: 38 (10 votes)
Preface to the sequence on iterated amplification 2018-11-10T13:24:13.200Z · score: 39 (14 votes)
The easy goal inference problem is still hard 2018-11-03T14:41:55.464Z · score: 38 (9 votes)
Could we send a message to the distant future? 2018-06-09T04:27:00.544Z · score: 40 (14 votes)
When is unaligned AI morally valuable? 2018-05-25T01:57:55.579Z · score: 101 (31 votes)
Open question: are minimal circuits daemon-free? 2018-05-05T22:40:20.509Z · score: 122 (39 votes)
Weird question: could we see distant aliens? 2018-04-20T06:40:18.022Z · score: 85 (25 votes)
Implicit extortion 2018-04-13T16:33:21.503Z · score: 74 (22 votes)
Prize for probable problems 2018-03-08T16:58:11.536Z · score: 135 (37 votes)
Argument, intuition, and recursion 2018-03-05T01:37:36.120Z · score: 99 (29 votes)
Funding for AI alignment research 2018-03-03T21:52:50.715Z · score: 108 (29 votes)
Funding for independent AI alignment research 2018-03-03T21:44:44.000Z · score: 0 (0 votes)
The abruptness of nuclear weapons 2018-02-25T17:40:35.656Z · score: 105 (37 votes)
Arguments about fast takeoff 2018-02-25T04:53:36.083Z · score: 104 (35 votes)
Funding opportunity for AI alignment research 2017-08-27T05:23:46.000Z · score: 1 (1 votes)
Ten small life improvements 2017-08-20T19:09:23.673Z · score: 26 (19 votes)
Crowdsourcing moderation without sacrificing quality 2016-12-02T21:47:57.719Z · score: 15 (11 votes)
Optimizing the news feed 2016-12-01T23:23:55.403Z · score: 9 (10 votes)
The universal prior is malign 2016-11-30T22:31:41.000Z · score: 4 (4 votes)
Recent AI control posts 2016-11-29T18:53:57.656Z · score: 12 (13 votes)
My recent posts 2016-11-29T18:51:09.000Z · score: 5 (5 votes)
If we can't lie to others, we will lie to ourselves 2016-11-26T22:29:54.990Z · score: 25 (18 votes)


Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T04:32:35.150Z · score: 8 (3 votes) · LW · GW

But to be clear, (i) it would then also be learned by imitating a large enough dataset from human players who did something like tree search internally while playing, (ii) I think the tree search makes a quantitative not qualitative change, and it's not that big (mostly improves stability, and *maybe* a 10x speedup, over self-play).

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T04:12:56.112Z · score: 19 (9 votes) · LW · GW
In one comical case, AlphaStar had surrounded the units it was building with its own factories so that they couldn't get out to reach the rest of the map. Rather than lifting the buildings to let the units out, which is possible for Terran, it destroyed one building and then immediately began rebuilding it before it could move the units out!

It seems like AlphaStar played 90 ladder matches as Terran:

  • 30 with the initial policy trained with SL
  • 30 with a policy from the middle of training
  • 30 from the final RL policy.

This sounds like the kind of mistake that the SL policy would definitely make (no reason it should be able to recover), whereas it's not clear whether RL would learn how to recover (I would expect it to, but not too strongly).

If it's easy for anyone to check and they care, it might be worth looking quickly through the replays and seeing whether this particular game was from the SL or RL policies. This is something I've been curious about since seeing the behavior posted on Reddit, and it would have a moderate impact on my understanding of AlphaStar's skill.

It looks like they released 90 replays and played 90 ladder games so it should be possible to check.

The replays are here, hosted on the DM site, sorted into three folders based on the policy, if it's one of the SL matches it's either AlphaStarSupervised_013_TvT.SC2Replay, or one of _017_, _019_, or _022_ (based on being TvT and being on Kairos Junction). The video in question is here. I'd check if I had SC2 installed.

(Of course better still would be to find a discussion of the 30 RL replays, from someone who understands the game. Maybe that's been posted somewhere, I haven't looked and it's hard to know who to trust.)

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T03:55:36.981Z · score: 15 (6 votes) · LW · GW

I meant to ask about the policy network in AlphaZero directly. It plays at the professional level (the Nature paper puts it at a comparable Elo to Fan Hui) with no tree search, using a standard neural network architecture trained by supervised learning. It performs fine on parts of the search tree that never appeared during training. What distinguishes this kind of reasoning from "if I see X, I do Y"?

(ETA clarification, because I think this was probably the misunderstanding: the policy network plays Go with no tree search, tree search is only used to generate training data. That suggests the AlphaStar algorithm would produce similar behavior without using tree search ever, probably using at most 100x the compute of AlphaZero and I'd be willing to bet on <10x.)

From the outside, it looks like human-level play at Starcraft is more complicated (in a sense) than human-level play at Go, and so it's going to take bigger models in order to reach a similar level of performance. I don't see a plausible-looking distinction-in-principle that separates the strategy in Starcraft from strategy in Go.

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-02T15:05:53.593Z · score: 33 (12 votes) · LW · GW
Finally, RL practitioners have known that genuine causal reasoning could never be achieved via known RL architectures- you'd only ever get something that could execute the same policy as an agent that had reasoned that way, via a very expensive process of evolving away from dominated strategies at each step down the tree of move and countermove. It's the biggest known unknown on the way to AGI.

What's the argument here? Do you think that the AGZ policy (which is extremely good at Go or Chess even without any tree search) doesn't do any causal reasoning? That it only ever learns to play parts of the game tree it's seen during training? What does "genuine causal reasoning" even mean?

It looks to me like causal reasoning is just another type of computation, and that you could eventually find that computation by local search. If you need to use RL to guide that search then it's going to take a long time---AlphaStar was very expensive, and still only trained a policy with ~80M parameters.

From my perspective it seems like the big questions are just how large a policy you would need to train using existing methods in order to be competitive with a human (my best guess would be a ~trillion to a ~quadrillion), and whether you can train it by copying rather than needing to use RL.

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-02T14:55:38.405Z · score: 28 (14 votes) · LW · GW

In what sense is this information warfare or even misleading? The second sentence of the blog post says: "AlphaStar was ranked above 99.8% of active players," which seems quite clear. They seem to have done a pretty good job of making that comparison as fair as you could expect. What do they say or even imply which is highly misleading?

Perhaps they say "Grandmaster level," and it's possible that this gives a misleading impression to people who don't know what that term means in Starcraft? Though I think chess grandmaster also means roughly "better than 99.8% of ladder players," and the competitive player pools have similar size. So while it might be misleading in the sense that Chess has a larger player pool a smaller fraction of whom are competitive, it seems fairly straightforward.

Comment by paulfchristiano on AI alignment landscape · 2019-10-22T16:49:59.394Z · score: 2 (1 votes) · LW · GW

(edited post to include)

Comment by paulfchristiano on The strategy-stealing assumption · 2019-10-19T17:53:45.567Z · score: 6 (3 votes) · LW · GW

I wrote this post imagining "strategy-stealing assumption" as something you would assume for the purpose of an argument, for example I might want to justify an AI alignment scheme by arguing "Under a strategy-stealing assumption, this AI would result in an OK outcome." The post was motivated by trying to write up another argument where I wanted to use this assumption, spending a bit of time trying to think through what the assumption was, and deciding it was likely to be of independent interest. (Although that hasn't yet appeared in print.)

I'd be happy to have a better name for the research goal of making it so that this kind of assumption is true. I agree this isn't great. (And then I would probably be able to use that name in the description of this assumption as well.)

Comment by paulfchristiano on Thoughts on "Human-Compatible" · 2019-10-10T16:09:31.958Z · score: 12 (6 votes) · LW · GW

(See also the concept of "decoupled RL" from some DeepMind folks.)

Comment by paulfchristiano on Taxing investment income is complicated · 2019-10-03T05:23:25.549Z · score: 5 (3 votes) · LW · GW

"Social cost" in economics usually refers to the sum of private costs (wikipedia), such that a transfer from one person to another would have no social cost.

"Internalizing an externality" usually means making the private costs better reflect the social costs (so it's not relevant what is done with the tax revenue).
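A minimal sketch of the two definitions above, with made-up numbers. The parties, amounts, and the `social_cost` helper are all assumptions for illustration, not anything from the linked article. It treats a receipt as a negative cost, so a pure transfer sums to zero, and it shows that a tax equal to the external harm raises the factory's private cost to the social cost no matter where the revenue goes.

```python
# All figures here are hypothetical and chosen only to illustrate the definitions.

def social_cost(private_costs):
    """Social cost as the sum of every party's private cost (receipts are negative)."""
    return sum(private_costs.values())

# A pure transfer: A pays 10 and B receives 10.
transfer = {"A": 10.0, "B": -10.0}

# An activity with an externality: the factory bears 5, a neighbor bears 3.
untaxed = {"factory": 5.0, "neighbor": 3.0}

# A tax of 3 on the factory is itself a transfer from the factory to the treasury.
taxed = {"factory": 5.0 + 3.0, "neighbor": 3.0, "treasury": -3.0}

print("transfer:", social_cost(transfer))
print("untaxed vs taxed:", social_cost(untaxed), social_cost(taxed))
print("factory's private cost after tax:", taxed["factory"])
```

The social cost of the activity is the same with or without the tax. What changes is that the factory's own cost now matches it, which is the sense of "internalizing" used here.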

Comment by paulfchristiano on List of resolved confusions about IDA · 2019-10-01T05:05:21.501Z · score: 13 (2 votes) · LW · GW
By way of illustration, AlphaZero's long-term preference is to win the chess game, its short-term preference is whatever its policy network spits out as the best move to make next, and its short-term preference-on-reflection is the move it wants to make next after doing a fuck-ton of MCTS.

Short-term preferences are the value function one or a few moves out. If the algorithm is "reasonable," then its short-term preferences-on-reflection are the true function P(I win the game|I make this move). You could also talk about intermediate degrees of reflection.
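To make the distinction concrete, here is a toy sketch in a tiny Nim-like game rather than chess or Go. Everything in it is an assumption for illustration: the rules (remove 1 to 3 stones, taking the last stone wins), the crude `unreflective_preference` heuristic standing in for a raw policy output, and the exhaustive search standing in for "reflection." It is not how AlphaZero works; it only shows a quick judgment and the true P(I win | I make this move) disagreeing.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # assumed rules: remove 1-3 stones; taking the last stone wins


@lru_cache(maxsize=None)
def win_prob(stones):
    """P(player to move wins) under optimal play; 0.0 or 1.0 in this deterministic game."""
    if stones == 0:
        return 0.0  # the previous player took the last stone and has already won
    return max(1.0 - win_prob(stones - m) for m in MOVES if m <= stones)


def preference_on_reflection(stones):
    """Exact P(I win | I make this move) for each legal move, found by full search."""
    return {m: 1.0 - win_prob(stones - m) for m in MOVES if m <= stones}


def unreflective_preference(stones):
    """Hypothetical stand-in for a raw policy: weight moves by how many stones they grab."""
    legal = [m for m in MOVES if m <= stones]
    total = sum(legal)
    return {m: m / total for m in legal}


quick = unreflective_preference(5)
reflected = preference_on_reflection(5)
print("quick judgment favors taking", max(quick, key=quick.get))
print("on reflection:", reflected)
```

With 5 stones the quick heuristic favors taking 3, while full search shows that only taking 1 wins. Cutting the search off at a fixed depth would give one of the intermediate degrees of reflection.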

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-29T20:52:02.131Z · score: 4 (2 votes) · LW · GW
Now that I understand "corrigible" isn't synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn't seem enough to imply these things

I agree that you still need the AI to be trying to do the right thing (even though we don't e.g. have any clear definition of "the right thing"), and that seems like the main way that you are going to fail.

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-29T17:19:43.271Z · score: 6 (3 votes) · LW · GW
As I understand it, the original motivation for corrigibility_MIRI was to make sure that someone can always physically press the shutdown button, and the AI would shut off. But if a corrigible_Paul AI thinks (correctly or incorrectly) that my preferences-on-reflection (or "true" preferences) is to let the AI keep running, it will act against my (actual physical) attempts to shut down the AI, and therefore it's not corrigible_MIRI.

Note that "corrigible" is not synonymous with "satisfying my short-term preferences-on-reflection" (that's why I said: "our short-term preferences, including (amongst others) our preference for the agent to be corrigible.")

I'm just saying that when we talk about concepts like "remain in control" or "become better informed" or "shut down," those all need to be taken as concepts-on-reflection. We're not satisfying current-Paul's judgment of "did I remain in control?"; we're satisfying the on-reflection notion of "did I remain in control?"

Whether an act-based agent is corrigible depends on our preferences-on-reflection (this is why the corrigibility post says that act-based agents "can be corrigible"). It may be that our preferences-on-reflection are for an agent to not be corrigible. It seems to me that for robustness reasons we may want to enforce corrigibility in all cases, even if it's not what we'd prefer-on-reflection.

That said, even without any special measures, saying "corrigibility is relatively easy to learn" is still an important argument about the behavior of our agents, since it hopefully means that either (i) our agents will behave corrigibly, (ii) our agents will do something better than behaving corrigibly, according to our preferences-on-reflection, or (iii) our agents are making a predictable mistake in optimizing our preferences-on-reflection (which might be ruled out by them simply being smart enough and understanding the kinds of argument we are currently making).

By "corrigible" I think we mean "corrigible by X" with the X implicit. It could be "corrigible by some particular physical human."

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-29T04:11:24.398Z · score: 5 (3 votes) · LW · GW

(In that post I did use narrow in the way we are currently using short-term, contrary to my claim in the grandparent. Sorry for the confusion this caused.)

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-29T02:11:19.856Z · score: 4 (2 votes) · LW · GW

Like I mentioned above, I mostly think of narrow value learning as a substitute for imitation learning or approval-direction, realistically to be used as a distillation step rather than as your whole AI. In particular, an agent trained with narrow value learning is probably not aligned+competitive in a way that might allow you to apply this kind of strategy-stealing argument.

In concrete approval-directed agents I'm talking about a different design, it's not related to narrow value learning.

I don't use narrow and short-term interchangeably. I've only ever used it in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning.

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-29T00:33:33.509Z · score: 4 (2 votes) · LW · GW

One of us just misunderstood (1), I don't think there is any difference.

I mean preferences about what happens over the near future, but the way I rank "what happens in the near future" will likely be based on its consequences (further in the future, and in other possible worlds, and etc.). So I took (1) to be basically equivalent to (2).

"Terminal preferences over the near future" is not a thing I often think about and I didn't realize it was a candidate interpretation (normally when I write about short-term preferences I'm writing about things like control, knowledge, and resource acquisition).

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-28T15:29:07.773Z · score: 12 (4 votes) · LW · GW

By "short" I mean short in sense (1) and (2). "Short" doesn't imply anything about senses (3), (4), (5), or (6) (and "short" and "long" don't seem like good words to describe those axes, though I'll keep using them in this comment for consistency).

By "preferences-on-reflection" I mean long in sense (3) and neither in sense (6). There is a hypothesis that "humans with AI help" is a reasonable way to capture preferences-on-reflection, but they aren't defined to be the same. I don't use understandable and evaluable in this way.

I think (4) and (5) are independent axes. (4) just sounds like "is your AI good at optimizing," not a statement about what it's optimizing. In the discussion with Eliezer I'm arguing against it being linked to any of these other axes. (5) is a distinction about two senses in which an AI can be "optimizing my short-term preferences-on-reflection."

When discussing perfect estimations of preferences-on-reflection, I don't think the short vs. long distinction is that important. "Short" is mostly important when talking about ways in which an AI can fall short of perfectly estimating preferences-on-reflection.

Assuming my interpretation is correct, my confusion is that you say we shouldn't expect a situation where "the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy" (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.

I introduced the term "preferences-on-reflection" in the previous comment to make a particular distinction. It's probably better to say something like "actual preferences" (though this is also likely to be misinterpreted). The important property is that I'd prefer to have an AI that satisfies my actual preferences than to have any other kind of AI. We could also say "better by my lights" or something else.

There's a hypothesis that "what I'd say after some particular idealized process of reflection" is a reasonable way to capture "actual preferences," but I think that's up for debate---e.g. it could fail if me-on-reflection is selfish and has values opposed to current-me, and certainly it could fail for any particular process of reflection and so it might just happen to be the case that there is no process of reflection that satisfies it.

The claim I usually make is that "what I'd say after some particular idealized process of reflection" describes the best mechanism we can hope to find for capturing "actual preferences," because whatever else we might do to capture "actual preferences" can just be absorbed into that process of reflection.

"Actual preferences" is a pretty important concept here, I don't think we could get around the need for it, I'm not sure if there is disagreement about this concept or just about the term being used for it.

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-25T01:22:15.546Z · score: 7 (3 votes) · LW · GW
All three of these corrigible AIs deal with much narrower preferences than "acquire flexible influence that I can use to get what I want". The narrow value learner post for example says:

Imitation learning, approval-direction, and narrow value learning are not intended to exceed the overseer's capabilities. These are three candidates for the distillation step in iterated distillation and amplification.

The AI we actually deploy, which I'm discussing in the OP, is produced by imitating (or learning the values of, or maximizing the approval of) an even smarter AI---whose valuations of resources reflect everything that unaligned AIs know about which resources will be helpful.

Corrigibility is about short-term preferences-on-reflection. I see how this is confusing. Note that the article doesn't make sense at all when interpreted in the other way. For example, the user can't even tell whether they are in control of the situation, so what does it mean to talk about their preference to be in control of the situation if these aren't supposed to be preferences-on-reflection? (Similarly for "preference to be well-informed" and so on.) The desiderata discussed in the original corrigibility post seem basically the same as the user not being able to tell what resources will help them achieve their long-term goals, but still wanting the AI to accumulate those resources.

I also think the act-based agents post is correct if "preferences" means preferences-on-reflection. It's just that the three approaches listed at the top are limited to the capabilities of the overseer. I think that distinguishing between preferences-as-elicited and preferences-on-reflection is the most important thing to disambiguate here. I usually use "preference" to mean preference-on-idealized-reflection (or whatever "actual preference" should mean, acknowledging that we don't have a real ground truth definition), which I think is the more typical usage. I'd be fine with suggestions for disambiguation.

If there's somewhere else I've equivocated in the way you suggest, then I'm happy to correct it. It seems like a thing I might have done in a way that introduces an error. I'd be surprised if it hides an important problem (I think the big problems in my proposal are lurking other places, not here), and I think that in the corrigibility post I have these concepts straight.

One thing you might have in mind is the following kind of comment:

If on average we are unhappy with the level of corrigibility of a benign act-based agent, then by construction it is mistaken about our short-term preferences.

That is, you might be concerned: "the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy." I'm saying that you shouldn't expect this to happen, if the AI is well-calibrated and has enough of an understanding of humans to understand e.g. this discussion we are currently having---if it decides not to be corrigible, we should expect it to be right on average.

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-24T09:25:37.185Z · score: 4 (2 votes) · LW · GW
This seems too glib, if "long-term preferences" are in some sense the "right" preferences, e.g., if under reflective equilibrium we would wish that we currently put a lot more weight on long-term preferences. Even if we only give unaligned AIs a one-time advantage (which I'm not sure about), that could still cause us to lose much of the potential value of the universe.

To be clear, I am worried about people not understanding or caring about the long-term future, and AI giving them new opportunities to mess it up.

I'm particularly concerned about things like people giving their resources to some unaligned AI that seemed like a good idea at the time, rather than simply opting out of competition so that unaligned AIs might represent a larger share of future-influencers. This is another failure of strategy-stealing that probably belongs in the post---even if we understand alignment, there may be plenty of people not trying to solve alignment and instead doing something else, and the values generated by that "something else" will get a natural boost.

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-24T09:10:56.615Z · score: 2 (1 votes) · LW · GW
To sum up, I think there's a fundamental tension between corrigibility (in the sense of respecting the human user's short-term preferences) and long-term success/competitiveness, which underlies many of the specific failure scenarios described in the OP, and worse, makes it unclear how "strategy-stealing" can work at all.

By short-term preference I don't mean "Start a car company, I hear those are profitable," I mean more like "Make me money, and then make sure that I remain in control of that company and its profits," or even better "acquire flexible influence that I can use to get what I want."

(This is probably not the response you were looking for. I'm still mostly intending to give up on communication here over the short term, because it seems too hard. If you are confused by particular things I've said feel free to quote them so that I can either clarify, register a disagreement, or write them off as sloppy or mistaken comments.)

Comment by paulfchristiano on Taxing investment income is complicated · 2019-09-23T15:31:38.461Z · score: 3 (2 votes) · LW · GW
The solution you propose also sounds really complicated when people have to optimize the timing of when they make capital gains with times when the risk-free rate is low.

When you sell assets you deduct the amount you paid for them. The proposal is to multiply that basis by the total amount of risk-free interest that would have accumulated over the intervening period, which can be calculated by looking up a single number in a table. I agree that using the risk-free rate when you sell would be insane.

(From the perspective of tax optimization, I think this is much simpler than the status quo. From the perspective of tax accounting, this mechanism takes the place of the distinction between long-term and short-term capital gains, and is radically simpler than that.)
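A minimal sketch of the basis adjustment described above, with hypothetical numbers. The function name and the growth factor of 1.25 are assumptions for illustration; the factor stands in for the single table lookup, i.e. how much one unit would have grown at the risk-free rate between purchase and sale.

```python
def taxable_gain(purchase_price, sale_price, risk_free_growth):
    """Gain after scaling the basis by accumulated risk-free interest.

    risk_free_growth is the looked-up factor by which one unit would have grown
    at the risk-free rate over the holding period (1.0 means no adjustment).
    """
    adjusted_basis = purchase_price * risk_free_growth
    return sale_price - adjusted_basis


# Hypothetical example: buy at 100, sell at 150, risk-free growth of 1.25 over the period.
unadjusted = taxable_gain(100.0, 150.0, 1.0)
adjusted = taxable_gain(100.0, 150.0, 1.25)
print("gain with plain basis:", unadjusted)
print("gain with risk-free-adjusted basis:", adjusted)
```

Only the return in excess of the risk-free rate shows up as a gain. The rate at the moment of sale never enters the calculation, just the accumulated factor over the holding period.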

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-23T09:29:35.061Z · score: 2 (1 votes) · LW · GW
What if the user fails to realize that a certain kind of resource is valuable? (By "resources" we're talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)

As long as the user and AI appreciate the arguments we are making right now, then we shouldn't expect it to do worse than stealing the unaligned AI's strategy. There is all the usual ambiguity about "what the user wants," but if the user expects that the resources other agents are gathering will be more useful than the resources its AI is gathering, then its AI would clearly do better (in the user's view) by doing what others are doing.

(I think I won't have time to engage much on this in the near future, it seems plausible that I am skipping enough steps or using language in an unfamiliar enough way that this won't make sense to readers in which case so it goes; it's also possible that I'm missing something.)

Comment by paulfchristiano on Taxing investment income is complicated · 2019-09-23T09:22:31.106Z · score: 2 (1 votes) · LW · GW

This is basically the argument in my second bullet. You can make what you will of that argument, I think I still believe it---taxing savings just seems strictly worse than a similarly-progressive tax on income.

(I also don't much like a wealth tax for the same reason.)

Comment by paulfchristiano on Taxing investment income is complicated · 2019-09-23T09:20:10.603Z · score: 4 (3 votes) · LW · GW

These are good options when available. You should start by setting all the Pigouvian taxes at optimal levels then go from there. Having not thought about it very much, taxes on unimproved value of land seem good but can't fund something like a modern government without major distortions, so you'll end up with lots of other stuff in your basket.

Comment by paulfchristiano on Taxing investment income is complicated · 2019-09-23T09:16:12.657Z · score: 3 (2 votes) · LW · GW
Also, my current understanding is that risk compensation is definitely not the large majority of investment returns. The last chapter of Cochrane's Asset Pricing text has a great discussion of the topic. The main conclusion is that explaining returns via risk exposure requires unrealistically high levels of risk aversion - like, one or two orders of magnitude above the risk aversion levels implied by other activities.

What's the competing explanation?

Haven't looked at the historical numbers, but in recent times it seems like (i) with log utility and a naive model of "future=past," optimal leverage is around 2x, (ii) most investors are much more risk averse than log utility (even for casino risk). So it seems like things basically add up here for most of the market. Was the situation an order of magnitude different in the past?
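For reference, a sketch of the calculation behind the "optimal leverage" figure, using the standard continuous-time (Merton/Kelly) approximation: leverage = excess return / (risk aversion × variance), with risk aversion 1 for log utility. The 5% excess return and 16% volatility below are placeholder assumptions, not numbers from this thread or from historical data.

```python
def optimal_leverage(excess_return, volatility, risk_aversion=1.0):
    """Merton/Kelly fraction (mu - r) / (gamma * sigma^2); gamma = 1 is log utility."""
    return excess_return / (risk_aversion * volatility ** 2)


def excess_log_growth(leverage, excess_return, volatility):
    """Expected log growth above the risk-free rate at a given constant leverage."""
    return leverage * excess_return - 0.5 * (leverage * volatility) ** 2


# Placeholder inputs: 5% annual excess return, 16% annual volatility.
excess, vol = 0.05, 0.16
best = optimal_leverage(excess, vol)
print("log-utility leverage:", best)
print("leverage at risk aversion 3:", optimal_leverage(excess, vol, risk_aversion=3.0))
print("growth at 1x vs optimum:",
      excess_log_growth(1.0, excess, vol),
      excess_log_growth(best, excess, vol))
```

Under these inputs the log-utility answer is a little under 2x, and an investor with higher risk aversion scales it down proportionally, so holding roughly 1x or less is consistent with risk aversion a few times higher than log utility.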

Comment by paulfchristiano on Taxing investment income is complicated · 2019-09-22T17:17:04.387Z · score: 5 (3 votes) · LW · GW

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-21T16:47:10.940Z · score: 8 (2 votes) · LW · GW
For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.

Estimating values then optimizing those seems (much) worse than optimizing "what the user wants." One natural strategy for getting what the user wants can be something like "get into a good position to influence the world and then ask the user later."

This is very confusing because elsewhere you say that the kind of AI you're trying to design is just satisfying short-term preferences / instrumental values of the user

I don't have a very strong view about the distinction between corrigibility to the user and corrigibility to some other definition of value (e.g. a hypothetical version of the user who is more secure).

This is very confusing because elsewhere you say that the kind of AI you're trying to design is just satisfying short-term preferences / instrumental values of the user, but here "good for the user" seemingly has to be interpreted as "good in the long run".

By "trying to find a strategy that's good for the user" I mean: trying to pursue the kind of resources that the user thinks are valuable, without costs that the user would consider serious, etc.

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-18T04:00:58.989Z · score: 2 (1 votes) · LW · GW
Is the aligned AI literally applying a planning algorithm to the same long-term goal as the unaligned AI, and then translating that plan into a plan for acquiring flexible influence, or is it just generally trying to come up with a plan to acquire flexible influence?

The latter

It is trying to find a strategy that's instrumentally useful for a variety of long-term goals

It's presumably trying to find a strategy that's good for the user, but in the worst case where it understands nothing about the user it still shouldn't do any worse than "find a strategy that's instrumentally useful for a variety of long-term goals."

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-17T17:07:36.253Z · score: 2 (1 votes) · LW · GW
What I mean is that the unaligned AI isn't trying to "acquire influence", but rather trying to accomplish a specific long-term / terminal goal. The aligned AI doesn't have a long-term / terminal goal, so it can't just "uses whatever procedure the unaligned AI originally used to find that strategy", at least not literally.

Yeah, that's supposed to be the content of the strategy-stealing assumption---that good plans for having a long-term impact can be translated into plans for acquiring flexible influence. I'm interested in looking at ways that can fail. (Alignment is the most salient to me.)

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-17T15:38:15.983Z · score: 2 (1 votes) · LW · GW

Strategy stealing doesn't usually involve actual stealing, just using the hypothetical strategy the second player could have used.

How? The unaligned AI is presumably applying some kind of planning algorithm to its long-term/terminal goal to find its strategy, but in your scenario isn't the aligned/corrigible AI just following the short-term/instrumental goals of its human users? How is it able to use the unaligned AI's strategy-finding procedure?

This is what alignment is supposed to give you---a procedure that works just as well as the unaligned AI strategy (e.g. by updating on all the same logical facts about how to acquire influence that the unaligned AI might discover and then using those). This post is mostly about whether you should expect that to work. (You could also use a different set that is equally useful, because you are similarly matching the meta-level strategy for discovering useful facts about how to uncover information.)

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-17T07:40:26.264Z · score: 2 (1 votes) · LW · GW
Cryptic strategies
The unaligned AI comes up with some kind of long term strategy that the aligned AI can't observe or can't understand, for example because the aligned AI is trying to satisfy humans' short-term preferences and humans can't observe or understand the unaligned AI's long term strategy.

I'm not imagining that the aligned AI literally observes and copies the strategy of the unaligned AI. It just uses whatever procedure the unaligned AI originally used to find that strategy.

Trying to kill everyone as a terminal goal
Under "reckless" you say "Overall I think this isn’t a big deal, because it seems much easier to cause extinction by trying to kill everyone than as an accident." but then you don't list this as an independent concern. Some humans want to kill everyone (e.g. to eliminate suffering) and so they could build AIs that have this goal.

I agree that people who want a barren universe have an advantage, this is similar to recklessness and fragility but maybe worth separating.

Weird priors
From this comment: Here’s a possible way for another AI (A) to exploit your AI (B). Search for a statement S such that B can’t consult its human about S’s prior and P(A will win a future war against B | S) is high. Then adopt a high prior for S, wait for B to do the same, and come to B to negotiate a deal that greatly favors A.

I'm not sure I understand this, but it seems like my earlier response ("I'm not imagining that the aligned AI literally observes and copies the strategy of the unaligned AI") is relevant.

Or, the unaligned AI deliberately does this just so that you can't copy its strategy.

It's not clear to me whether this is possible.

Comment by paulfchristiano on Reframing the evolutionary benefit of sex · 2019-09-16T15:25:54.264Z · score: 2 (1 votes) · LW · GW

Evolution isn't episodic. In some sense the question motivating the OP was whether many of the important phenomena can come from being episodic with an appropriate utility function (something like exp(fitness) instead of fitness).
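One way to read the exp(fitness) remark is that a convex utility rewards variance in a way that plain fitness does not. The sketch below is only an illustration under assumed numbers: the two offspring-fitness distributions and the helper `expected_utility` are hypothetical and not taken from the original post.

```python
import math

def expected_utility(fitness_outcomes, utility):
    """Average utility over equally likely fitness outcomes."""
    return sum(utility(f) for f in fitness_outcomes) / len(fitness_outcomes)

# Hypothetical offspring fitness distributions with the same mean (1.0).
low_variance = [1.0, 1.0]   # e.g. clones of the parent
high_variance = [0.0, 2.0]  # e.g. more varied offspring

def linear(f):
    """Utility equal to fitness itself."""
    return f

linear_low = expected_utility(low_variance, linear)
linear_high = expected_utility(high_variance, linear)
exp_low = expected_utility(low_variance, math.exp)
exp_high = expected_utility(high_variance, math.exp)

# Linear utility is indifferent between the two; exp utility prefers variance.
print(linear_low, linear_high)
print(exp_low, exp_high)
```

With utility equal to fitness the two distributions tie, while with exp(fitness) the higher-variance one scores higher, which is the Jensen's-inequality effect the parenthetical seems to be pointing at.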

Comment by paulfchristiano on Reframing the evolutionary benefit of sex · 2019-09-16T15:19:41.785Z · score: 2 (1 votes) · LW · GW

(Already asked them if they could crosspost to alignment forum. I was just being reactive for this one though, they asked if I was OK cross-posting from sideways-view. I can manually cross-post most recent post.)

Comment by paulfchristiano on Reframing the evolutionary benefit of sex · 2019-09-15T15:40:54.626Z · score: 2 (1 votes) · LW · GW

On this picture, the claim is just that sex is worthwhile if it adds enough variance to make it worthwhile, so you could either reduce costs or increase variance.

(This picture could be wrong though, if the average of two organisms is just more fit on average than the parents and it's not about variance at all.)

(This isn't really affected by Wei Dai's concern above.)

Comment by paulfchristiano on Reframing the evolutionary benefit of sex · 2019-09-15T15:34:15.690Z · score: 2 (1 votes) · LW · GW

I didn't set it up, the LW team did.

Comment by paulfchristiano on Reframing the evolutionary benefit of sex · 2019-09-15T15:16:29.578Z · score: 5 (2 votes) · LW · GW

From an evolutionary perspective, males and females presumably get an equally good deal on the margin (or else the sex ratio would shift). That need not look like a productive investment in order to justify this basic picture (e.g. the story would be the same if males fought for territory and protected mates against other males).

If the low-investment sex doesn't add value of any kind then it would change this picture. E.g. if males just compete for mates and then do nothing beyond mate, then females would get an advantage by cloning themselves. Maybe plants actually fit into this picture most straightforwardly of all.

(This would actually also happen if males invested 50%, if they couldn't track paternity and females could secretly fertilize themselves.)

In the case where you get nothing in return, it seems like you are taking a 50% fitness hit from sex by passing on half as many genes. So if differences in fitness between kids were 5%, it would take about 300 generations for sex to break even. If fertilizing yourself is a complicated adaptation, then maybe that's enough to stick with sex (after it evolves in some more equitable species) but it's pretty different from my claim about breaking even in 6 generations. And in the case of plants or other hermaphrodites presumably there is an easier gradient to more self-fertilization, so that's even more puzzling and maybe this would bring me back to the more usual view where there is something more to be explained (either evolution is surprisingly forward-looking, or we need some story about why the advantage is bigger than it looks).

Comment by paulfchristiano on Conversation with Paul Christiano · 2019-09-12T15:31:58.307Z · score: 5 (4 votes) · LW · GW

Someone from MIRI can chime in. I think that MIRI researchers are much happier to build AI that solves a narrow range of tasks, and isn't necessarily competitive. I think I'm probably the most extreme person on this spectrum.

Comment by paulfchristiano on Conversation with Paul Christiano · 2019-09-12T15:31:09.539Z · score: 4 (3 votes) · LW · GW

Maybe you have a 30% chance of solving the clean theoretical problem. And a 30% chance that you could wing AI alignment with no technical solution. If they were independent, you would have a 50% probability of being able to do one or the other.

But things are worse than this, because both of them are more likely to work if alignment turns out to be easy. So maybe it's more like a 40% probability of being able to do one or the other.
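The arithmetic in the two paragraphs above can be sketched as follows. The independent case follows directly from the stated 30% figures; the correlated case uses a made-up common-cause model (alignment is "easy" with probability 0.3, and each approach works with probability 0.8 if it is easy), chosen only so that each approach keeps its 30% marginal. Those parameters are assumptions for illustration, not numbers from the comment.

```python
def p_either_independent(p_a, p_b):
    """P(A or B) when A and B are independent."""
    return 1 - (1 - p_a) * (1 - p_b)

def p_either_common_cause(p_marginal, p_easy, p_given_easy):
    """P(A or B) when both approaches share a hidden 'alignment is easy' cause.

    Each approach succeeds independently given easy/hard, and the success
    probability in the hard case is set so the marginal stays p_marginal.
    """
    p_given_hard = (p_marginal - p_easy * p_given_easy) / (1 - p_easy)
    either_if_easy = 1 - (1 - p_given_easy) ** 2
    either_if_hard = 1 - (1 - p_given_hard) ** 2
    return p_easy * either_if_easy + (1 - p_easy) * either_if_hard

independent = p_either_independent(0.3, 0.3)
correlated = p_either_common_cause(p_marginal=0.3, p_easy=0.3, p_given_easy=0.8)

print(round(independent, 3))  # 0.51, i.e. roughly 50%
print(round(correlated, 3))   # about 0.40 under these assumed parameters
```

Positive correlation lowers the chance that at least one approach works, because the cases where one succeeds overlap more with the cases where the other succeeds.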

But in reality, you don't need to solve the full theoretical problem or wing the problem without understanding anything more than we do today. You can have a much better theoretical understanding than we currently do, but not good enough to solve the problem. And you can be pretty prepared to wing it; even if that's not good enough to solve the problem without knowing anything, it might be good enough if combined with a reasonable theoretical picture.

(Similarly for coordination.)

Comment by paulfchristiano on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-11T19:08:41.404Z · score: 6 (3 votes) · LW · GW
the Oracle could still hack the grading system to give itself a zero loss

Gradient descent won't optimize for this behavior though, it really seems like you want to study this under inner alignment. (It's hard for me to see how you can meaningfully consider the problems separately.)

Yes, if the oracle gives itself zero loss by hacking the grading system then it will stop being updated, but the same is true if the mesa-optimizer tampers with the outer training process in any other way, or just copies itself to a different substrate, or whatever.

Comment by paulfchristiano on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-11T19:06:37.358Z · score: 5 (2 votes) · LW · GW

Yes, most of the algorithms in use today are known to converge or roughly converge to optimizing per-episode rewards. In most cases it's relatively clear that there is no optimization across episode boundaries (by the outer optimizer).

Comment by paulfchristiano on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-11T16:01:30.130Z · score: 5 (2 votes) · LW · GW

How do you even tell that an algorithm is optimizing something?

In most cases we have some argument that an algorithm is optimizing the episodic reward, and it just comes down to the details of that argument.

If you are concerned with optimization that isn't necessarily intended and wondering how to more effectively look out for it, it seems like you should ask "would a policy that has property P be more likely to be produced under this algorithm?" For P="takes actions that lead to high rewards in future episodes" the answer is clearly yes, since any policy that persists for a long time necessarily has property P (though of course it's unclear if the algorithm works at all). For normal RL algorithms there's not any obvious mechanism by which this would happen. It's not obvious that it doesn't, until you prove that these algorithms converge to optimizing per-episode rewards. I don't see any mechanical way to test that (just like I don't see any mechanical way to test almost any property that we talk about in almost any argument about anything).

Comment by paulfchristiano on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-10T15:28:59.385Z · score: 7 (3 votes) · LW · GW

Episodic learning algorithms will still penalize this behavior if it appears on the training distribution, so it seems reasonable to call this an inner alignment problem.

Comment by paulfchristiano on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-10T15:28:03.970Z · score: 3 (2 votes) · LW · GW
For example, consider the following episodic learning algorithm

When I talk about an episodic learning algorithm, I usually mean one that actually optimizes performance within an episode (like most of the algorithms in common use today, e.g. empirical risk minimization treating episode initial conditions as fixed). The algorithm you described doesn't seem like an "episodic" learning algorithm, given that it optimizes total performance (and essentially ignores episode boundaries).
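A rough sketch of the distinction, assuming a policy-gradient-style learner that weights each action by the return credited to it: in the episodic version the credited return stops at the episode boundary, while in the non-episodic version it runs across boundaries. The function names and the toy reward lists are hypothetical.

```python
def per_episode_returns(episodes):
    """Reward-to-go for each step, truncated at the end of its own episode."""
    result = []
    for rewards in episodes:
        running = 0.0
        returns = []
        for r in reversed(rewards):
            running += r
            returns.append(running)
        returns.reverse()
        result.append(returns)
    return result

def cross_episode_returns(episodes):
    """Reward-to-go for each step, summed over all later episodes as well."""
    flat = [r for rewards in episodes for r in rewards]
    running = 0.0
    togo = []
    for r in reversed(flat):
        running += r
        togo.append(running)
    togo.reverse()
    # Split the flat list back into the original episode shapes.
    result, start = [], 0
    for rewards in episodes:
        result.append(togo[start:start + len(rewards)])
        start += len(rewards)
    return result

episodes = [[0.0, 1.0], [5.0]]
print(per_episode_returns(episodes))    # [[1.0, 1.0], [5.0]]
print(cross_episode_returns(episodes))  # [[6.0, 6.0], [5.0]]
```

Under the first credit assignment, nothing an action does to later episodes affects how strongly it is reinforced; under the second, actions in episode one are credited with the reward from episode two, which is the kind of total-performance optimization that ignores episode boundaries.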

Comment by paulfchristiano on [AN #62] Are adversarial examples caused by real but imperceptible features? · 2019-08-31T21:02:50.596Z · score: 2 (1 votes) · LW · GW

I think AI will probably be good enough to pose a catastrophic risk before it can exactly imitate a human. (But as Wei Dai says elsewhere, if you do amplification then you will definitely get into the regime where you can't imitate.)

Comment by paulfchristiano on [AN #62] Are adversarial examples caused by real but imperceptible features? · 2019-08-30T15:26:12.772Z · score: 3 (2 votes) · LW · GW

Against mimicry is mostly motivated by the case of imitating an amplified agent. (I try to separate the problem of distillation and amplification, and imitation learning is a candidate for distillation.)

You could try to avoid the RL exploiting a security vulnerability in the overseer by:

  • Doing something like quantilizing where you are constrained to be near the original policy (we impose a KL constraint that prevents the policy from drifting too far from an attempted-imitation).
  • Doing something like meeting halfway.
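As a minimal sketch of the quantilizing option in the first bullet, assuming we can sample actions from the attempted-imitation policy and score them with the overseer's evaluation: instead of taking the single highest-scoring action, sample uniformly from the top q fraction of base-policy samples. The names `quantilize`, `base_sampler`, and `score` are hypothetical, and this sample-based version only approximates an exact quantilizer (whose KL divergence from the base policy is at most log(1/q)).

```python
import random

def argmax_action(actions, score):
    """Plain search: return the action with the highest score."""
    return max(actions, key=score)

def quantilize(base_sampler, score, q=0.1, n=1000, rng=None):
    """Sample-based q-quantilizer.

    Draw n actions from the base policy, keep the top q fraction by score,
    and return one of those uniformly at random.
    """
    rng = rng or random.Random(0)
    candidates = [base_sampler(rng) for _ in range(n)]
    candidates.sort(key=score, reverse=True)
    top = candidates[:max(1, int(q * n))]
    return rng.choice(top)

# Toy usage: the base policy picks integers 0..99 uniformly, the score is the value.
action = quantilize(lambda rng: rng.randrange(100), score=lambda a: a, q=0.1)
print(action)
```

Smaller q pushes harder on the score and drifts further from the base policy; as q shrinks toward 1/n this reduces to `argmax_action` over the sampled candidates, which is the regime where exploiting a flaw in the scorer becomes most likely.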

These solutions seem tricky but maybe helpful. But my bigger concern is that you need to fix security vulnerabilities anyway:

  • The algorithm "Search over lots of actions to find the one for which Q(a) is maximized" is a pretty good algorithm, that you need to be able to use at test time in order to be competitive, and which seems to require security.
  • Iterated amplification does optimization anyway (by amplifying the optimization done by the individual humans) and without security you are going to have problems there.

I mostly hope to solve this problem with security amplification (see also).

Comment by paulfchristiano on Soft takeoff can still lead to decisive strategic advantage · 2019-08-25T03:12:00.367Z · score: 6 (4 votes) · LW · GW
Wouldn't one project have more compute than the others, and thus pull ahead so long as funds lasted?

To have "more compute than all the others" seems to require already being a large fraction of all the world's spending (since a large fraction of spending is on computers---or whatever bundle of inputs is about to let this project take over the world---unless you are positing a really bad mispricing). At that point we are talking "coalition of states" rather than "project."

I totally agree that it wouldn't be crazy for a major world power to pull ahead of others technologically and eventually be able to win a war handily, and that will tend to happen over shorter and shorter timescales if economic and technological progress accelerate.

(Or you might think the project is a small fraction of world compute but larger than any other project, but if economies of scale are in fact this critical, then you are again suggesting a really gigantic market failure. That's not beyond the pale, but we should be focusing on why this crazy market failure is happening.)

Comment by paulfchristiano on Soft takeoff can still lead to decisive strategic advantage · 2019-08-25T03:06:05.478Z · score: 7 (4 votes) · LW · GW
This gets us into the toy model & its problems. I don't think I understand your alternative model. I maybe don't get what you mean by trading. Does one party giving money to another party in return for access to their technology or products count? If so, then I think my original model still stands: The leading project will be able to hoard technology/innovation and lengthen its lead over the rest of the world so long as it still has funding to buy the necessary stuff.

The reason I let other people use my IP is because they pay me money, with which I can develop even more IP. If the leading project declines to do this, then it will have less IP than any of its normal competitors. If the leading project's IP allows it to be significantly more productive than everyone else, then they could have just taken over the world through the normal mechanism of selling products. (Modulo leaks/spying.) As far as I can tell, until you are a large fraction of the world, the revenue you get from selling lets you grow faster, and I don't think the toy model really undermines that typical argument (which has to go through leaks/spying, market frictions, etc.).

Comment by paulfchristiano on Soft takeoff can still lead to decisive strategic advantage · 2019-08-25T02:51:20.495Z · score: 5 (3 votes) · LW · GW
A coalition strong enough to prevent the world's leading project from maintaining and lengthening its lead would need to have some way of preventing the leading project from accessing the innovations of the coalition. Otherwise the leading project will free-ride off the research done by the coalition. For this reason I think that a coalition would look very different from the world economy; in order to prevent the leading project from accessing innovations deployed in the world economy you would need to have an enforced universal embargo on them pretty much, and if you have that much political power, why stop there? Why not just annex them or shut them down?

Are you saying that the leading project can easily spy on other projects, but other projects can't spy on it? Is this because the rest of the world is trading with each other, and trading opens up opportunities for spying? Some other reason I missed? I don't think it's usually the case that gains from rabbit-holing, in terms of protection from spying, are large enough to outweigh the costs from not trading. It seems weird to expect AI to change that, since you are arguing that the proportional importance of spying will go down, not up, because it won't be accelerated as much.

If the leading project can't spy on everyone else, then how does it differ from all of the other companies who are developing technology, keeping it private, and charging other people to use it? The leading project can use others' technology when it pays them, just like they use each other's technology when they pay each other. The leading project can choose not to sell its technology, but then it just has less money and so falls further and further behind in terms of compute etc. (and at any rate, it needs to be selling something to the other people in order to even be able to afford to use their technology).

(I may just be missing something about your model.)

Comment by paulfchristiano on Soft takeoff can still lead to decisive strategic advantage · 2019-08-24T15:31:06.762Z · score: 14 (9 votes) · LW · GW

Why would a coalition look very different from the world economy, be controlled by a few people, and be hard to form? My default expectation is that it would look much like the world economy. (With the most obvious changes being a fall in the labor share of income / increasing wage inequality.) A few big underlying disagreements:

  • I don't think I agree that most progress in AI is driven by rare smart individuals talking to each other---I think it's not very accurate as a description of current AI progress, that it will be even less true as AI progress becomes a larger share of the world economy, and that most "AI" progress is driven by compute/energy/data/other software rather than stuff that most looks like insight-driven AI progress.
  • Your toy model seems wrong: most projects make extensive use of other people's private innovations, by trading with them. So the project that hoards the most innovations can still only be competitive if it trades with others (in order to get access to their hoarded innovations).
  • I think my more basic complaint is with the "pressure to make a profit over some timescale" model. I think it's more like: you need inputs from the rest of the economy and so you trade with them. Right now deep learning moonshots don't trade with the rest of the world because they don't make anything of much value, but if they were creating really impactful technology then the projects which traded would be radically faster than the projects which just used their innovations in house. This is true even if all innovations are public, since they need access to physical capital.

I think any of these would be enough to carry my objection. (Though if you rejected my first claim, and thought that rare smart individuals drove AI progress even when AI progress was overwhelmingly economically important, then you could imagine a sufficiently well-coordinated cartel of those rare smart individuals having a DSA.)

Comment by paulfchristiano on Clarifying "AI Alignment" · 2019-08-23T19:25:57.611Z · score: 4 (2 votes) · LW · GW
Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn't want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else?

The oracle is not aligned when asked questions that cause it to do malign optimization.

The human+oracle system is not aligned in situations where the human would pose such questions.

For a coherent system (e.g. a multiagent system which has converged to a Pareto efficient compromise), it makes sense to talk about the one thing that it is trying to do.

For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things. I try to use benign when talking about possibly-incoherent systems, or things that don't even resemble optimizers.

The definition in this post is a bit sloppy here, but I'm usually imagining that we are building roughly-coherent AI systems (and that if they are incoherent, some parts are malign). If you wanted to be a bit more careful with the definition, and wanted to admit vagueness in "what H wants it to do" (such that there can be several different preferences that are "what H wants"), we could say something like:

A is aligned with H if everything it is trying to do is "what H wants."

That's not great either though (and I think the original post is more at an appropriate level of attempted-precision).

Comment by paulfchristiano on Clarifying "AI Alignment" · 2019-08-23T02:28:45.605Z · score: 8 (4 votes) · LW · GW
I'm pretty sure Paul would give a different answer, if we ask him about "intent alignment".

Yes, I'd say that to the extent that "trying to do X" is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.

Even a very theoretically simple system like AIXI doesn't seem to be "trying" to do just one thing, in the sense that it can e.g. exert considerable optimization power at things other than reward, even in cases where the system seems to "know" that its actions won't lead to reward.

You could say that AIXI is "optimizing" the right thing and just messing up when it suffers inner alignment failures, but I'm not convinced that this division is actually doing much useful work. I think it's meaningful to say "defining what we want is useful," but beyond that it doesn't seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.

(For example, I think we can likely get OK definitions of what we value, along the lines of A Formalization of Indirect Normativity, but I've mostly stopped working along these lines because it no longer seems directly useful.)

It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.

I agree.

Of course, it also seems quite likely that AIs of the kind that will probably be built ("by default") also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.