Hedonic asymmetries 2020-01-26T02:10:01.323Z · score: 56 (21 votes)
Moral public goods 2020-01-26T00:10:01.803Z · score: 97 (31 votes)
Of arguments and wagers 2020-01-10T22:20:02.213Z · score: 57 (17 votes)
Prediction markets for internet points? 2019-10-27T19:30:00.898Z · score: 40 (18 votes)
AI alignment landscape 2019-10-13T02:10:01.135Z · score: 43 (16 votes)
Taxing investment income is complicated 2019-09-22T01:30:01.242Z · score: 34 (13 votes)
The strategy-stealing assumption 2019-09-16T15:23:25.339Z · score: 68 (18 votes)
Reframing the evolutionary benefit of sex 2019-09-14T17:00:01.184Z · score: 67 (23 votes)
Ought: why it matters and ways to help 2019-07-25T18:00:27.918Z · score: 84 (34 votes)
Aligning a toy model of optimization 2019-06-28T20:23:51.337Z · score: 52 (17 votes)
What failure looks like 2019-03-17T20:18:59.800Z · score: 211 (86 votes)
Security amplification 2019-02-06T17:28:19.995Z · score: 20 (4 votes)
Reliability amplification 2019-01-31T21:12:18.591Z · score: 21 (5 votes)
Techniques for optimizing worst-case performance 2019-01-28T21:29:53.164Z · score: 23 (6 votes)
Thoughts on reward engineering 2019-01-24T20:15:05.251Z · score: 30 (8 votes)
Learning with catastrophes 2019-01-23T03:01:26.397Z · score: 28 (9 votes)
Capability amplification 2019-01-20T07:03:27.879Z · score: 24 (7 votes)
The reward engineering problem 2019-01-16T18:47:24.075Z · score: 23 (4 votes)
Towards formalizing universality 2019-01-13T20:39:21.726Z · score: 29 (6 votes)
Directions and desiderata for AI alignment 2019-01-13T07:47:13.581Z · score: 29 (6 votes)
Ambitious vs. narrow value learning 2019-01-12T06:18:21.747Z · score: 20 (6 votes)
AlphaGo Zero and capability amplification 2019-01-09T00:40:13.391Z · score: 29 (12 votes)
Supervising strong learners by amplifying weak experts 2019-01-06T07:00:58.680Z · score: 28 (7 votes)
Benign model-free RL 2018-12-02T04:10:45.205Z · score: 12 (3 votes)
Corrigibility 2018-11-27T21:50:10.517Z · score: 39 (9 votes)
Humans Consulting HCH 2018-11-25T23:18:55.247Z · score: 19 (3 votes)
Approval-directed bootstrapping 2018-11-25T23:18:47.542Z · score: 19 (4 votes)
Approval-directed agents 2018-11-22T21:15:28.956Z · score: 29 (5 votes)
Prosaic AI alignment 2018-11-20T13:56:39.773Z · score: 36 (9 votes)
An unaligned benchmark 2018-11-17T15:51:03.448Z · score: 27 (6 votes)
Clarifying "AI Alignment" 2018-11-15T14:41:57.599Z · score: 60 (17 votes)
The Steering Problem 2018-11-13T17:14:56.557Z · score: 38 (10 votes)
Preface to the sequence on iterated amplification 2018-11-10T13:24:13.200Z · score: 39 (14 votes)
The easy goal inference problem is still hard 2018-11-03T14:41:55.464Z · score: 40 (11 votes)
Could we send a message to the distant future? 2018-06-09T04:27:00.544Z · score: 40 (14 votes)
When is unaligned AI morally valuable? 2018-05-25T01:57:55.579Z · score: 101 (31 votes)
Open question: are minimal circuits daemon-free? 2018-05-05T22:40:20.509Z · score: 122 (39 votes)
Weird question: could we see distant aliens? 2018-04-20T06:40:18.022Z · score: 85 (25 votes)
Implicit extortion 2018-04-13T16:33:21.503Z · score: 74 (22 votes)
Prize for probable problems 2018-03-08T16:58:11.536Z · score: 135 (37 votes)
Argument, intuition, and recursion 2018-03-05T01:37:36.120Z · score: 103 (31 votes)
Funding for AI alignment research 2018-03-03T21:52:50.715Z · score: 108 (29 votes)
Funding for independent AI alignment research 2018-03-03T21:44:44.000Z · score: 5 (1 votes)
The abruptness of nuclear weapons 2018-02-25T17:40:35.656Z · score: 105 (37 votes)
Arguments about fast takeoff 2018-02-25T04:53:36.083Z · score: 115 (37 votes)
Funding opportunity for AI alignment research 2017-08-27T05:23:46.000Z · score: 1 (1 votes)
Ten small life improvements 2017-08-20T19:09:23.673Z · score: 26 (19 votes)
Crowdsourcing moderation without sacrificing quality 2016-12-02T21:47:57.719Z · score: 16 (11 votes)
Optimizing the news feed 2016-12-01T23:23:55.403Z · score: 16 (11 votes)
The universal prior is malign 2016-11-30T22:31:41.000Z · score: 4 (4 votes)


Comment by paulfchristiano on Moral public goods · 2020-01-26T17:56:31.274Z · score: 3 (2 votes) · LW · GW
'Redistribution' (i.e. theft) is an exercise in pointlessness.

Using coercive force to fund public goods is also 'theft', but still it can end up with near-unanimous support. So I don't think that this is a good argument in and of itself.

As long as there is scarcity there will be haves and have nots, and wealth will accumulate as a natural function of time and successful strategies. You can reset the game board as often as you like but you can never ensure a permanent and even stalemate. Even assuming you could destroy the entire point of competing, well then you've destroyed everything you get from that too.

This post isn't really about leveling the playing field. (Even in the stupid example with nobles, the nobles still end up 1000x richer than the peasants.)

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:48:04.224Z · score: 4 (2 votes) · LW · GW
I don't understand the new model that you're proposing here. If people want to see a world free from extreme poverty and view that as a public good provision problem, shouldn't they advocate for or work towards international coordination on achieving that? (Given international coordination on other issues, this clearly isn't an impossible goal.) Why are they unilaterally chipping in small amounts towards reducing poverty in a piecemeal fashion?

This seems to be how people relate to local public goods.

I've been modeling this as people having moral uncertainty, modeled as factions within someone's brain, where the altruist faction has control over some (typically small) fraction of their budget.

I think that's a better model than fixed weights for different values, but I don't think it explains everything.

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:44:46.477Z · score: 2 (1 votes) · LW · GW

In this model the nobles don't recommend a further tax if in aggregate they have 50% of the total income, utility is logarithmic, and they care about themselves as much as everyone else put together.

(This is a good argument that this model may just not capture your intuition. But I do think the case for taxes is in fact quite a lot better when the nobles collectively control a large share of the income rather than just being individually wealthy.)

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:41:14.701Z · score: 3 (2 votes) · LW · GW

I agree that this is more like the dilemma for modeling someone as a welfarist than a general consequentialist (if they were a total utilitarian then I think they'd already be committed to option 2). But I think you do have similar problems with any attempt to model them as consequentialists.

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:36:31.238Z · score: 2 (1 votes) · LW · GW

The post is just arguing that redistribution can behave like a public good, it's not arguing for increases in redistribution. (I think the quantitative behavior of the hypothetical is not at all like the real world, there isn't a small group that gets 90% of the income, and the optimal tax rate is very sensitive to the fraction of income the nobles start with.)

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:34:28.177Z · score: 2 (1 votes) · LW · GW

If you want an example with more modest numbers:

  • There are a billion rich people and a billion poor people (and a bunch in the middle we'll ignore)
  • Each rich person cares about themselves 5x as much as all poor people put together
  • The poor people get 100x the welfare from $1 as the rich people

Then it's a bad idea for a rich person to donate, but they would support taxes to fund foreign aid until they reached the level where the poorest billion valued money only 5x more than the richest billion.
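To make the arithmetic explicit, here is a sketch of the example, under assumptions I am filling in myself (the comment doesn't specify a utility model): each rich person puts weight 5 on their own marginal dollar and weight 1 on the billion poor combined, and the "100x the welfare" figure is a marginal-utility ratio.

```python
# Back-of-the-envelope check of the rich/poor example above.
# The weight normalization is my own assumption, chosen to match the
# stated "cares about themselves 5x as much as all poor put together."

N_RICH = 10**9          # a billion rich people
N_POOR = 10**9          # a billion poor people
W_SELF = 5.0            # weight on own marginal dollar
W_ALL_POOR = 1.0        # weight on the billion poor combined
MULT = 100.0            # welfare per dollar, poor relative to rich

w_per_poor = W_ALL_POOR / N_POOR

# Unilateral donation: I lose $1 (weighted W_SELF); one poor person gains
# MULT welfare units (weighted w_per_poor). Almost all the gain is ignored.
donation_net = MULT * w_per_poor - 1.0 * W_SELF
assert donation_net < 0     # donating is a bad deal for me

# A $1-per-rich-person tax: I still lose $1, but now a billion dollars flow
# to the poor, and I count the welfare from all of them.
tax_net = N_RICH * 1.0 * MULT * w_per_poor - 1.0 * W_SELF
assert tax_net > 0          # so I vote for the tax

# Taxes stop being worth it once transfers push the poor's marginal-utility
# multiplier down to W_SELF / W_ALL_POOR = 5, as the comment says.
breakeven = W_SELF / W_ALL_POOR
```

Under this normalization the donation nets out to roughly -5 while the tax nets out to +95, and the breakeven multiplier is exactly 5.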

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:29:28.889Z · score: 5 (3 votes) · LW · GW
The first part of this seemed like mostly politics - an oversimple and flat-out non-real example being used to justify a policy without any nuance or sense.

The example is just to illustrate that it's possible for everyone to prefer taxation but not want to donate unilaterally. Maybe this is an easy enough point that it wasn't worth illustrating with an example. I tried to make the example obviously silly so that it wouldn't be taken as politically live, but I think that failed.

you haven't specified even what "redistribution" means, especially in a dynamic equilibrium where wealth and income are related but distinct.

E.g. policies justified by the fact that they benefit poor people at the expense of rich people, such as an income tax which the state then spends to benefit people equitably.

I absolutely agree that this is not the case for an income tax; it's one argument for an income tax (which is different from arguments about justice or fairness, and seems worth keeping in a separate mental category).

Point 2 is completely missing the fundamental question of what people want - Friedman's point that if people actually were self-aligned that they care about feeding specific poor people rather than getting a new iPhone, they'd do it. Instead, they want abstract poor people to get fed, and only if they can force others to do so (along with themselves, in many cases, but rarely just unilaterally). You don't address this disparity.

It seems consistent for me to prefer that all poor people get food than that all rich people get iPhones, yet to prefer that I get an iPhone than that a particular poor person get food (since I care more about myself than the average rich person). Do you disagree that this would be a consistent set of preferences? Do you agree that it's consistent but just disagree that it's empirically plausible? At any rate, it seems like we should agree that Friedman's argument doesn't work without some additional assumptions.
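One concrete weight assignment showing that these preferences are consistent. The numbers are entirely made up for illustration: welfare units of 100 per person fed and 1 per iPhone, with populations of a billion each.

```python
# A hypothetical weight assignment under which both stated preferences hold.
# All numbers are illustrative, not claims about anyone's actual values.

N = 10**9          # a billion rich, a billion poor
w_self = 10.0      # I care about myself more than about an average rich person
w_rich = 1.0       # weight on each other rich person
w_poor = 0.05      # weight on each poor person

# Preference 1: all poor people fed beats all rich people getting iPhones.
all_poor_fed   = N * w_poor * 100   # 5e9 weighted welfare
all_rich_phone = N * w_rich * 1     # 1e9 weighted welfare
assert all_poor_fed > all_rich_phone

# Preference 2: my iPhone beats feeding one particular poor person.
my_phone     = w_self * 1           # 10
one_poor_fed = w_poor * 100         # 5
assert my_phone > one_poor_fed
```

Any weights with w_self > 100 * w_poor and w_poor > w_rich / 100 work, so Friedman's argument indeed needs some further assumption to go through.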

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:22:08.453Z · score: 4 (2 votes) · LW · GW
Real-life nobles don't produce 10,000x value; they extract value from peasants, by force of arms and law and custom. It makes no sense to redistribute wealth by taxing everyone's income if the nobles get their income by taxing the peasants; just stop the nobles from extracting so much value.

I think that redistribution by taxing still makes sense, e.g. if nobles effectively get their money by owning and controlling the land (by force or whatever) and taking a big cut from the people who work it. But I also agree that there may be easier and better things to do than raising taxes; it seems like a waste of effort for the nobles to collect local taxes and then for the king to collect taxes from the nobles and pay the money back.

But most of their income still comes from capital and owning the tools of production and all that (citation required).

I think this probably isn't right---e.g. capital income is a minority for the top 1% of earners in the US today, and the situation is even starker for global inequality.

A terminology of "nobles" and "peasants" implies to me the idea that almost all of the nobles' (the modern rich) income is extracted from the peasants (everyone else), enabled by the same state that then taxes them. Did you intend or endorse this view? If not, or if you think it's irrelevant to the thought experiment, do you think the framing of "nobles" and "peasants" distracts from the issue? It does for me.

In retrospect I agree it would have been better to use a different example.

(In retrospect people also didn't like the big and unrealistic numbers, so I could have just made them 10 and 100 instead. I generally overestimated the extent to which readers would separate the simple quantitative point, which I wanted to make in the shortest way possible but didn't think about that much, from other features of the scenario.)

Comment by paulfchristiano on Moral public goods · 2020-01-26T05:27:17.576Z · score: 4 (2 votes) · LW · GW

I think the example works fine with numbers like "the welfare effect of $1 is a hundred times larger for this poor person than that rich person" which seem conservative. (e.g. I think it is very likely that many poor people would value a doubling of consumption at least as much as I would, suggesting multipliers >100).

I think the weird thing about the example is the nobles having 90% of the total income.
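The ">100" multiplier claim can be checked under log utility (a functional form I am assuming here; the comment doesn't commit to one), with hypothetical consumption numbers:

```python
import math

# Under u(c) = log(c), everyone values a doubling of consumption equally,
# while the welfare from a marginal dollar scales as 1/c. So if a poor person
# values doubling their consumption at least as much as a rich person does,
# the marginal-dollar multiplier is at least the consumption ratio.

def value_of_doubling(c):
    # log(2c) - log(c) = log 2, independent of c
    return math.log(2 * c) - math.log(c)

def marginal_utility(c):
    # u'(c) = 1/c
    return 1.0 / c

c_rich, c_poor = 100_000.0, 1_000.0   # hypothetical annual consumption levels
assert abs(value_of_doubling(c_rich) - value_of_doubling(c_poor)) < 1e-12
multiplier = marginal_utility(c_poor) / marginal_utility(c_rich)
assert abs(multiplier - c_rich / c_poor) < 1e-9   # = 100x at a 100x consumption gap
```

So a consumption ratio above 100, which is roughly the gap between rich-country and extreme-poverty consumption, gives a multiplier above 100.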

Comment by paulfchristiano on Modest Superintelligences · 2020-01-25T23:44:13.473Z · score: 4 (2 votes) · LW · GW

They are +4.2SD on the genetic component of the property (= 1 in 90,000), but the median person with those genetics is still only +3SD on the overall property (= 1 in 750), right?

(That is, the expected boost from the abnormally extreme genetics should be the same as the expected boost from the abnormally extreme environment, if the two are equally important. So each of them should be half of the total effect, i.e. 3SD on the overall trait.)

Comment by paulfchristiano on Modest Superintelligences · 2020-01-25T06:29:48.227Z · score: 3 (2 votes) · LW · GW

For a normally distributed property, 1/billion is +6 sigma, while +3 sigma is 1/750. If the property is normally distributed, the clones share 50% of the variation, and von Neumann is 1 in a billion, then I think it's right that our median guess for the median clone should be at the 1 in 750 level.

(But of your 100,000 clones, several will be at the one in a trillion level, a hundred will be more extreme than von Neumann, and >20,000 of them will be one in 20,000. I'm generally not sure what you are supposed to infer from the "one in X" metric. [Edited to add: all of those are the fractions in expectation, and they are significant underestimates because they ignore the uncertainty in the genetic component.])
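The normal-distribution arithmetic in these two comments can be verified directly, assuming normality and a 50% genetic share of variance as stated (Python's `math.erfc` gives the standard normal tail):

```python
import math

def tail(x):
    # Standard normal upper-tail probability via the complementary error function.
    return 0.5 * math.erfc(x / math.sqrt(2))

# "1 in a billion" on the overall trait is about +6 sigma:
assert 0.5e9 < 1 / tail(6.0) < 2e9

# The median clone keeps the genetic component but gets a median environment,
# so it sits around +3 sigma on the overall trait: roughly 1 in 750.
assert 600 < 1 / tail(3.0) < 800

# Measured in units of the genetic component alone (sd = sqrt(0.5) of the
# overall sd), that +3 sigma contribution is 3 / sqrt(0.5), about +4.24 sigma,
# i.e. roughly 1 in 90,000, consistent with the "+4.2SD" figure above.
g = 3.0 / math.sqrt(0.5)
assert 80_000 < 1 / tail(g) < 100_000
```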

Comment by paulfchristiano on Clarifying The Malignity of the Universal Prior: The Lexical Update · 2020-01-15T17:35:28.874Z · score: 7 (4 votes) · LW · GW

I still tentatively think the lexical update works, but it's been a while and I might be missing something.

I'll follow your convention that our universe is U-simple, our universal prior is U', and so the aliens' universe is U'-simple (I think---sorry if I got confused and that's not what you mean).

If we sample from a universe that is U'-simple, then:

  • Assume the aliens care about U'-simplicity. They will preferentially sample from U', and so have U'(our world) mass on our world. Within that, they will correctly guess that the machine they are supposed to control is using U' as its prior. That is, they basically pay U'(our world) * P(us|someone using U' to predict).
  • But our universal prior was also U', wasn't it? So we are also paying U'(our world) to pick out our world. I.e. we pay U'(our world) * P(someone making important predictions | our world) * P(someone using U' to predict | someone making important predictions) * P(us|someone using U' to predict).
  • I don't see any program whose behavior depends on U(world) for the "real" simplicity prior U according to which our world is simple (and that concept seems slippery).

Does that seem right?

Comment by paulfchristiano on Of arguments and wagers · 2020-01-12T17:37:30.707Z · score: 4 (2 votes) · LW · GW

Thanks, corrected.

Comment by paulfchristiano on Of arguments and wagers · 2020-01-11T18:17:25.746Z · score: 12 (4 votes) · LW · GW

(Note: this is a post from 2014 that I recently added here. I still think it's a very interesting scheme and I'm excited about people exploring better mechanisms for resolving arguments.)

Comment by paulfchristiano on Of arguments and wagers · 2020-01-11T18:14:13.578Z · score: 2 (1 votes) · LW · GW
I think the resulting odds won't reflect the probability of anything, because they depend a lot on whether Alice or Bob is more risk-tolerant (=rich).

If one of them is willing to tolerate risk equal to the value of Judy's time to hear out the argument, then you are fine. If the total willingness to risk of people who believe "Judy will believe X on reflection" is lower than the value of Judy's time, then I think you are basically inevitably stuck unless Judy is willing to risk her own attention. If she is willing to risk her own attention, then she can just give people a budget of "minutes" to spend making wagers, as discussed in the post, and as long as the budget is large enough relative to the size of the disagreement it seems like you are OK.

Also, it seems to me that your scheme works best for yes/no questions. For anything more complicated, Alice and Bob can cooperate to mislead Judy, which is especially scary in case of AIs. I'm not sure how to fix that problem: it seems to require a way for a non-expert to check the work of a malicious expert, not just adjudicate between two experts.

The scheme works if one of the experts advocates for the truth. If there are two options, and both players want to manipulate Judy into believing "yes," then you are similarly in trouble. I agree that if there are more options than experts then it becomes less likely that "by chance" someone wants to advocate for the right answer. But I think in general you are banking on there being some density of experts who want to argue for the truth because it is the truth.

Comment by paulfchristiano on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-06T17:18:19.002Z · score: 4 (2 votes) · LW · GW

For context, here's the one time in the interview I mention "AI risk" (quoting 2 earlier paragraphs for context):

Paul Christiano: I don’t know, the future is 10% worse than it would otherwise be in expectation by virtue of our failure to align AI. I made up 10%, it’s kind of a random number. I don’t know, it’s less than 50%. It’s more than 10% conditioned on AI soon I think.
Asya Bergal: I think my impression is that that 10% is lower than some large set of people. I don’t know if other people agree with that.
Paul Christiano: Certainly, 10% is lower than lots of people who care about AI risk. I mean it’s worth saying, that I have this slightly narrow conception of what is the alignment problem. I’m not including all AI risk in the 10%. I’m not including in some sense most of the things people normally worry about and just including the like ‘we tried to build an AI that was doing what we want but then it wasn’t even trying to do what we want’. I think it’s lower now or even after that caveat, than pessimistic people. It’s going to be lower than all the MIRI folks, it’s going to be higher than almost everyone in the world at large, especially after specializing in this problem, which is a problem almost no one cares about, which is precisely how a thousand full time people for 20 years can reduce the whole risk by half or something.

(But it's still the case that asked "Can you explain why it's valuable to work on AI risk?" I responded by almost entirely talking about AI alignment, since that's what I work on and the kind of work where I have a strong view about cost-effectiveness.)

Comment by paulfchristiano on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-03T00:20:08.343Z · score: 16 (10 votes) · LW · GW

E.g. if you have a broad distribution over possible worlds, some of which are "fragile" and have 100 things that cut value down by 10%, and some of which are "robust" and don't, then you get 10,000x more value from the robust worlds. So unless you are a priori pretty confident that you are in a fragile world (or they are 10,000x more valuable, or whatever), the robust worlds will tend to dominate.

Similar arguments work if we aggregate across possible paths to achieving value within a fixed, known world---if there are several ways things can go well, some of which are more robust, those will drive almost all of the EV. And similarly for moral uncertainty (if there are several plausible views, the ones that consider this world a lost cause will instead spend their influence on other worlds) and so forth. I think it's a reasonably robust conclusion across many different frameworks: your decision shouldn't end up being dominated by some hugely conjunctive event.
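The fragile-world discount in the argument above is easy to make concrete (the 10,000x in the comment is an order-of-magnitude figure; literally, a hundred independent 10% cuts leave about 2.7e-5 of the value):

```python
# Value retained in a "fragile" world with 100 independent factors,
# each of which cuts value by 10%, relative to a "robust" world.

p_keep = 0.9
fragile_value = p_keep ** 100   # ~2.66e-5
robust_value = 1.0

ratio = robust_value / fragile_value
assert ratio > 10_000           # in fact roughly 37,600x
```

So fragile worlds need on the order of 10,000x the prior probability (or 10,000x the value) before they dominate the expected-value calculation.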

Comment by paulfchristiano on A dilemma for prosaic AI alignment · 2019-12-19T20:37:49.009Z · score: 6 (4 votes) · LW · GW
In the case of something like amplification or debate, I think the bet that you're making is that language modeling alone is sufficient to get you everything you need in a competitive way.

I'm skeptical of language modeling being enough to be competitive, in the sense of maximizing "log prob of some naturally occurring data or human demonstrations." I don't have a strong view about whether you can get away using only language data rather than e.g. taking images as input and producing motor torques as output.

I'm also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What's so bad if we use non-language data?

Comment by paulfchristiano on A dilemma for prosaic AI alignment · 2019-12-19T20:26:00.305Z · score: 7 (5 votes) · LW · GW

We could also ask: "Would AlphaStar remain as good as it is, if fine-tuned to answer questions?"

In either case it's an empirical question. I think the answer is probably yes if you do it carefully.

You could imagine separating this into two questions:

  • Is there a policy that plays Starcraft and answers questions, and is only slightly larger than a policy for playing Starcraft alone? This is a key premise for the whole project. I think it's reasonably likely; the goal is only to answer questions whose answers the model "already knows," so it seems realistic to hope for only a constant amount of extra work to be able to use that knowledge to answer questions. I think most of the uncertainty here is about details of "know" and question-answering and so on.
  • Can you use joint optimization to find that policy with only slightly more training time? I think probably yes.
Comment by paulfchristiano on A dilemma for prosaic AI alignment · 2019-12-18T16:10:52.604Z · score: 9 (6 votes) · LW · GW

I normally imagine using joint training in these cases, rather than pre-training + fine-tuning. e.g., at every point in time we maintain an agent and a question-answerer, where the question-answerer "knows everything the agent knows." They get better together, with each gradient update affecting both of them, rather than first training a good agent and then adding a good question-answerer.

(Independently of concerns about mesa-optimization, I think the fine-tuning approach would have trouble because you couldn't use statistical regularities from the "main" objective to inform your answers to questions, and therefore your question answers will be dumber than the policy and so you couldn't get a good reward function or specification of catastrophically bad behavior.)
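A toy numpy sketch of this joint-training setup (my own construction, not code from any of these projects): a shared trunk feeds both an "agent" head and a "question-answerer" head, and every gradient step updates the trunk through both losses, so the QA head can lean on whatever statistical regularities the main objective teaches the trunk.

```python
import numpy as np

# Toy joint training: one shared linear trunk, two linear heads, squared loss.
# All shapes, learning rates, and targets are arbitrary illustrative choices.

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16
W_trunk = rng.normal(0, 0.1, (d_in, d_hidden))   # shared representation
w_agent = rng.normal(0, 0.1, d_hidden)           # "agent" (task) head
w_qa = rng.normal(0, 0.1, d_hidden)              # question-answering head

def joint_step(x, y_task, y_qa, lr=0.01):
    global W_trunk, w_agent, w_qa
    h = x @ W_trunk                              # shared features
    err_task = h @ w_agent - y_task
    err_qa = h @ w_qa - y_qa
    # Both losses contribute to the trunk's gradient: the QA head shapes the
    # same representation the agent uses, rather than fine-tuning afterwards.
    grad_trunk = np.outer(x, err_task * w_agent + err_qa * w_qa)
    w_agent -= lr * err_task * h
    w_qa -= lr * err_qa * h
    W_trunk -= lr * grad_trunk
    return 0.5 * (err_task**2 + err_qa**2)

x = rng.normal(size=d_in)
losses = [joint_step(x, 1.0, 0.0) for _ in range(200)]
assert losses[-1] < losses[0]                    # both heads improve together
```

In the pre-train-then-fine-tune alternative, `W_trunk` would be frozen (or nearly so) by the time `w_qa` starts training, which is the situation the comment is arguing against.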

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-15T17:12:51.088Z · score: 2 (1 votes) · LW · GW

I don't have a big difference in my model of mid vs. final: they have very similar MMR, the difference between them is pretty small in the scheme of things (e.g. probably smaller than the impact of doubling model size), and my picture isn't refined enough to appreciate those differences. For any particular dumb mistake, I'd be surprised if the line between not making it and making it fell in that particular doubling.

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-15T03:38:46.790Z · score: 2 (1 votes) · LW · GW

AGZ is only trained on the situations that actually arise in games it plays.

I agree with the point that "imitation learning from human games" will only make you play well on kinds of situations that arise in human games, and that self-play can do better by making you play well on a broader set of situations. You could also train on all the situations that arise in a bigger tree search (though AGZ did not) or against somewhat-random moves (which AGZ probably did).

(Though I don't see this as affecting the basic point.)

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-15T03:33:06.819Z · score: 6 (4 votes) · LW · GW

Thanks! That's only marginally less surprising than the final RL policy, and I suspect the final RL policy will make the same kind of mistake. Seems like the OP's example was legit and I overestimated the RL agent.

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T04:32:35.150Z · score: 8 (3 votes) · LW · GW

But to be clear, (i) it would then also be learned by imitating a large enough dataset from human players who did something like tree search internally while playing, (ii) I think the tree search makes a quantitative not qualitative change, and it's not that big (mostly improves stability, and *maybe* a 10x speedup, over self-play).

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T04:12:56.112Z · score: 19 (9 votes) · LW · GW
In one comical case, AlphaStar had surrounded the units it was building with its own factories so that they couldn't get out to reach the rest of the map. Rather than lifting the buildings to let the units out, which is possible for Terran, it destroyed one building and then immediately began rebuilding it before it could move the units out!

It seems like AlphaStar played 90 ladder matches as Terran:

  • 30 with the initial policy trained with SL
  • 30 with a policy from the middle of training
  • 30 with the final RL policy.

This sounds like the kind of mistake that the SL policy would definitely make (no reason it should be able to recover), whereas it's not clear whether RL would learn how to recover (I would expect it to, but not too strongly).

If it's easy for anyone to check and they care, it might be worth looking quickly through the replays and seeing whether this particular game was from the SL or RL policies. This is something I've been curious about since seeing the behavior posted on Reddit, and it would have a moderate impact on my understanding of AlphaStar's skill.

It looks like they released 90 replays and played 90 ladder games so it should be possible to check.

The replays are here, hosted on the DM site, sorted into three folders based on the policy. If it's one of the SL matches, it's either AlphaStarSupervised_013_TvT.SC2Replay, or one of _017_, _019_, or _022_ (based on being TvT and being on Kairos Junction). The video in question is here. I'd check if I had SC2 installed.

(Of course better still would be to find a discussion of the 30 RL replays, from someone who understands the game. Maybe that's been posted somewhere, I haven't looked and it's hard to know who to trust.)

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T03:55:36.981Z · score: 15 (6 votes) · LW · GW

I meant to ask about the policy network in AlphaZero directly. It plays at the professional level (the Nature paper puts it at a comparable Elo to Fan Hui) with no tree search, using a standard neural network architecture trained by supervised learning. It performs fine on parts of the search tree that never appeared during training. What distinguishes this kind of reasoning from "if I see X, I do Y"?

(ETA clarification, because I think this was probably the misunderstanding: the policy network plays Go with no tree search, tree search is only used to generate training data. That suggests the AlphaStar algorithm would produce similar behavior without using tree search ever, probably using at most 100x the compute of AlphaZero and I'd be willing to bet on <10x.)

From the outside, it looks like human-level play at Starcraft is more complicated (in a sense) than human-level play at Go, and so it's going to take bigger models in order to reach a similar level of performance. I don't see a plausible-looking distinction-in-principle that separates the strategy in Starcraft from strategy in Go.

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-02T15:05:53.593Z · score: 33 (12 votes) · LW · GW
Finally, RL practitioners have known that genuine causal reasoning could never be achieved via known RL architectures- you'd only ever get something that could execute the same policy as an agent that had reasoned that way, via a very expensive process of evolving away from dominated strategies at each step down the tree of move and countermove. It's the biggest known unknown on the way to AGI.

What's the argument here? Do you think that the AGZ policy (which is extremely good at Go or Chess even without any tree search) doesn't do any causal reasoning? That it only ever learns to play parts of the game tree it's seen during training? What does "genuine causal reasoning" even mean?

It looks to me like causal reasoning is just another type of computation, and that you could eventually find that computation by local search. If you need to use RL to guide that search then it's going to take a long time---AlphaStar was very expensive, and still only trained a policy with ~80M parameters.

From my perspective it seems like the big questions are just how large a policy you would need to train using existing methods in order to be competitive with a human (my best guess would be somewhere between ~a trillion and ~a quadrillion parameters), and whether you can train it by copying rather than needing to use RL.

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-02T14:55:38.405Z · score: 29 (15 votes) · LW · GW

In what sense is this information warfare or even misleading? The second sentence of the blog post says: "AlphaStar was ranked above 99.8% of active players," which seems quite clear. They seem to have done a pretty good job of making that comparison as fair as you could expect. What do they say or even imply which is highly misleading?

Perhaps they say "Grandmaster level," and it's possible that this gives a misleading impression to people who don't know what that term means in Starcraft? Though I think chess grandmaster also means roughly "better than 99.8% of ladder players," and the competitive player pools have similar size. So while it might be misleading in the sense that Chess has a larger player pool a smaller fraction of whom are competitive, it seems fairly straightforward.

Comment by paulfchristiano on AI alignment landscape · 2019-10-22T16:49:59.394Z · score: 2 (1 votes) · LW · GW

(edited post to include)

Comment by paulfchristiano on The strategy-stealing assumption · 2019-10-19T17:53:45.567Z · score: 6 (3 votes) · LW · GW

I wrote this post imagining "strategy-stealing assumption" as something you would assume for the purpose of an argument, for example I might want to justify an AI alignment scheme by arguing "Under a strategy-stealing assumption, this AI would result in an OK outcome." The post was motivated by trying to write up another argument where I wanted to use this assumption, spending a bit of time trying to think through what the assumption was, and deciding it was likely to be of independent interest. (Although that hasn't yet appeared in print.)

I'd be happy to have a better name for the research goal of making it so that this kind of assumption is true. I agree this isn't great. (And then I would probably be able to use that name in the description of this assumption as well.)

Comment by paulfchristiano on Thoughts on "Human-Compatible" · 2019-10-10T16:09:31.958Z · score: 12 (6 votes) · LW · GW

(See also the concept of "decoupled RL" from some DeepMind folks.)

Comment by paulfchristiano on Taxing investment income is complicated · 2019-10-03T05:23:25.549Z · score: 5 (3 votes) · LW · GW

"Social cost" in economics usually refers to the sum of private costs (wikipedia), such that a transfer from one person to another would have no social cost.

"Internalizing an externality" usually means making the private costs better reflect the social costs (so it's not relevant what is done with the tax revenue).

Comment by paulfchristiano on List of resolved confusions about IDA · 2019-10-01T05:05:21.501Z · score: 13 (2 votes) · LW · GW
By way of illustration, AlphaZero's long-term preference is to win the chess game, its short-term preference is whatever its policy network spits out as the best move to make next, and its short-term preference-on-reflection is the move it wants to make next after doing a fuck-ton of MCTS.

Short-term preferences are the value function one or a few moves out. If the algorithm is "reasonable," then its short-term preference-on-reflection is the true function P(I win the game|I make this move). You could also talk about intermediate degrees of reflection.
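To make the distinction concrete, here is a toy sketch (not AlphaZero, and all numbers are made up): the "short-term preference" is the move a cheap, noisy policy ranks highest, while the "short-term preference-on-reflection" is the move ranked highest after spending compute estimating P(win | move) by simulation.

```python
import random

# Toy game: pick a move 0-9; a hidden table gives P(win | move).
# The "policy" is a noisy heuristic over those probabilities.
random.seed(0)
true_win_prob = {m: random.random() for m in range(10)}
policy_score = {m: p + random.gauss(0, 0.3)  # heuristic = truth + noise
                for m, p in true_win_prob.items()}

def short_term_preference():
    """Best move according to the raw policy (no reflection)."""
    return max(policy_score, key=policy_score.get)

def preference_on_reflection(n_rollouts=5000):
    """Best move after estimating P(win) by rollouts ('reflection')."""
    est = {}
    for m in range(10):
        wins = sum(random.random() < true_win_prob[m] for _ in range(n_rollouts))
        est[m] = wins / n_rollouts
    return max(est, key=est.get)

best = max(true_win_prob, key=true_win_prob.get)
print(short_term_preference(), preference_on_reflection(), best)
```

With enough rollouts the reflected preference converges on the true optimum, even when the raw policy's top choice differs; intermediate rollout budgets correspond to intermediate degrees of reflection.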

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-29T20:52:02.131Z · score: 4 (2 votes) · LW · GW
Now that I understand "corrigible" isn't synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn't seem enough to imply these things

I agree that you still need the AI to be trying to do the right thing (even though we don't e.g. have any clear definition of "the right thing"), and that seems like the main way that you are going to fail.

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-29T17:19:43.271Z · score: 6 (3 votes) · LW · GW
As I understand it, the original motivation for corrigibility_MIRI was to make sure that someone can always physically press the shutdown button, and the AI would shut off. But if a corrigible_Paul AI thinks (correctly or incorrectly) that my preferences-on-reflection (or "true" preferences) is to let the AI keep running, it will act against my (actual physical) attempts to shut down the AI, and therefore it's not corrigible_MIRI.

Note that "corrigible" is not synonymous with "satisfying my short-term preferences-on-reflection" (that's why I said: "our short-term preferences, including (amongst others) our preference for the agent to be corrigible.")

I'm just saying that when we talk about concepts like "remain in control" or "become better informed" or "shut down," those all need to be taken as concepts-on-reflection. We're not satisfying current-Paul's judgment of "did I remain in control?"; we're satisfying the on-reflection notion of "did I remain in control?"

Whether an act-based agent is corrigible depends on our preferences-on-reflection (this is why the corrigibility post says that act-based agents "can be corrigible"). It may be that our preferences-on-reflection are for an agent to not be corrigible. It seems to me that for robustness reasons we may want to enforce corrigibility in all cases, even if it's not what we'd prefer-on-reflection.

That said, even without any special measures, saying "corrigibility is relatively easy to learn" is still an important argument about the behavior of our agents, since it hopefully means that (i) our agents will behave corrigibly, or (ii) our agents will do something better than behaving corrigibly, according to our preferences-on-reflection, or (iii) our agents are making a predictable mistake in optimizing our preferences-on-reflection (which might be ruled out by them simply being smart enough and understanding the kinds of argument we are currently making).

By "corrigible" I think we mean "corrigible by X" with the X implicit. It could be "corrigible by some particular physical human."

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-29T04:11:24.398Z · score: 7 (4 votes) · LW · GW

(In that post I did use narrow in the way we are currently using short-term, contrary to my claim in the grandparent. Sorry for the confusion this caused.)

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-29T02:11:19.856Z · score: 4 (2 votes) · LW · GW

Like I mentioned above, I mostly think of narrow value learning as a substitute for imitation learning or approval-direction, realistically to be used as a distillation step rather than as your whole AI. In particular, an agent trained with narrow value learning probably is not aligned+competitive in a way that would allow you to apply this kind of strategy-stealing argument.

In "Concrete approval-directed agents" I'm talking about a different design; it's not related to narrow value learning.

I don't use narrow and short-term interchangeably. I've only ever used it in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning.

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-29T00:33:33.509Z · score: 4 (2 votes) · LW · GW

One of us just misunderstood (1), I don't think there is any difference.

I mean preferences about what happens over the near future, but the way I rank "what happens in the near future" will likely be based on its consequences (further in the future, and in other possible worlds, and etc.). So I took (1) to be basically equivalent to (2).

"Terminal preferences over the near future" is not a thing I often think about and I didn't realize it was a candidate interpretation (normally when I write about short-term preferences I'm writing about things like control, knowledge, and resource acquisition).

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-28T15:29:07.773Z · score: 12 (4 votes) · LW · GW

By "short" I mean short in sense (1) and (2). "Short" doesn't imply anything about senses (3), (4), (5), or (6) (and "short" and "long" don't seem like good words to describe those axes, though I'll keep using them in this comment for consistency).

By "preferences-on-reflection" I mean long in sense (3) and neither in sense (6). There is a hypothesis that "humans with AI help" is a reasonable way to capture preferences-on-reflection, but they aren't defined to be the same. I don't use understandable and evaluable in this way.

I think (4) and (5) are independent axes. (4) just sounds like "is your AI good at optimizing," not a statement about what it's optimizing. In the discussion with Eliezer I'm arguing against it being linked to any of these other axes. (5) is a distinction about two senses in which an AI can be "optimizing my short-term preferences-on-reflection."

When discussing perfect estimations of preferences-on-reflection, I don't think the short vs. long distinction is that important. "Short" is mostly important when talking about ways in which an AI can fall short of perfectly estimating preferences-on-reflection.

Assuming my interpretation is correct, my confusion is that you say we shouldn't expect a situation where "the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy" (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.

I introduced the term "preferences-on-reflection" in the previous comment to make a particular distinction. It's probably better to say something like "actual preferences" (though this is also likely to be misinterpreted). The important property is that I'd prefer to have an AI that satisfies my actual preferences than to have any other kind of AI. We could also say "better by my lights" or something else.

There's a hypothesis that "what I'd say after some particular idealized process of reflection" is a reasonable way to capture "actual preferences," but I think that's up for debate---e.g. it could fail if me-on-reflection is selfish and has values opposed to current-me, and certainly it could fail for any particular process of reflection and so it might just happen to be the case that there is no process of reflection that satisfies it.

The claim I usually make is that "what I'd say after some particular idealized process of reflection" describes the best mechanism we can hope to find for capturing "actual preferences," because whatever else we might do to capture "actual preferences" can just be absorbed into that process of reflection.

"Actual preferences" is a pretty important concept here, I don't think we could get around the need for it, I'm not sure if there is disagreement about this concept or just about the term being used for it.

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-25T01:22:15.546Z · score: 7 (3 votes) · LW · GW
All three of these corrigible AIs deal with much narrower preferences than "acquire flexible influence that I can use to get what I want". The narrow value learner post for example says:

Imitation learning, approval-direction, and narrow value learning are not intended to exceed the overseer's capabilities. These are three candidates for the distillation step in iterated distillation and amplification.

The AI we actually deploy, which I'm discussing in the OP, is produced by imitating (or learning the values of, or maximizing the approval of) an even smarter AI---whose valuations of resources reflect everything that unaligned AIs know about which resources will be helpful.

Corrigibility is about short-term preferences-on-reflection. I see how this is confusing. Note that the article doesn't make sense at all when interpreted in the other way. For example, the user can't even tell whether they are in control of the situation, so what does it mean to talk about their preference to be in control of the situation if these aren't supposed to be preferences-on-reflection? (Similarly for "preference to be well-informed" and so on.) The desiderata discussed in the original corrigibility post seem basically the same as the user not being able to tell what resources will help them achieve their long-term goals, but still wanting the AI to accumulate those resources.

I also think the act-based agents post is correct if "preferences" means preferences-on-reflection. It's just that the three approaches listed at the top are limited to the capabilities of the overseer. I think that distinguishing between preferences-as-elicited and preferences-on-reflection is the most important thing to disambiguate here. I usually use "preference" to mean preference-on-idealized-reflection (or whatever "actual preference" should mean, acknowledging that we don't have a real ground truth definition), which I think is the more typical usage. I'd be fine with suggestions for disambiguation.

If there's somewhere else I've equivocated in the way you suggest, then I'm happy to correct it. It seems like a thing I might have done in a way that introduces an error. I'd be surprised if it hides an important problem (I think the big problems in my proposal are lurking other places, not here), and in the corrigibility post I think that I have these concepts straight.

One thing you might have in mind is the following kind of comment:

If on average we are unhappy with the level of corrigibility of a benign act-based agent, then by construction it is mistaken about our short-term preferences.

That is, you might be concerned: "the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy." I'm saying that you shouldn't expect this to happen, if the AI is well-calibrated and has enough of an understanding of humans to understand e.g. this discussion we are currently having---if it decides not to be corrigible, we should expect it to be right on average.

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-24T09:25:37.185Z · score: 4 (2 votes) · LW · GW
This seems too glib, if "long-term preferences" are in some sense the "right" preferences, e.g., if under reflective equilibrium we would wish that we currently put a lot more weight on long-term preferences. Even if we only give unaligned AIs a one-time advantage (which I'm not sure about), that could still cause us to lose much of the potential value of the universe.

To be clear, I am worried about people not understanding or caring about the long-term future, and AI giving them new opportunities to mess it up.

I'm particularly concerned about things like people giving their resources to some unaligned AI that seemed like a good idea at the time, rather than simply opting out of competition so that unaligned AIs might represent a larger share of future-influencers. This is another failure of strategy-stealing that probably belongs in the post---even if we understand alignment, there may be plenty of people not trying to solve alignment and instead doing something else, and the values generated by that "something else" will get a natural boost.

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-24T09:10:56.615Z · score: 2 (1 votes) · LW · GW
To sum up, I think there's a fundamental tension between corrigibility (in the sense of respecting the human user's short-term preferences) and long-term success/competitiveness, which underlies many of the specific failure scenarios described in the OP, and worse, makes it unclear how "strategy-stealing" can work at all.

By short-term preference I don't mean "Start a car company, I hear those are profitable," I mean more like "Make me money, and then make sure that I remain in control of that company and its profits," or even better "acquire flexible influence that I can use to get what I want."

(This is probably not the response you were looking for. I'm still mostly intending to give up on communication here over the short term, because it seems too hard. If you are confused by particular things I've said feel free to quote them so that I can either clarify, register a disagreement, or write them off as sloppy or mistaken comments.)

Comment by paulfchristiano on Taxing investment income is complicated · 2019-09-23T15:31:38.461Z · score: 3 (2 votes) · LW · GW
The solution you propose also sounds really complicated when people have to optimize the timing of when they make capital gains with times when the risk-free rate is low.

When you sell assets you deduct the amount you paid for them. The proposal is to multiply that basis by the total amount of risk-free interest that would have accumulated over the intervening period, which can be calculated by looking up a single number in a table. I agree that using the risk-free rate when you sell would be insane.
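A minimal sketch of the mechanism described here, with illustrative numbers (not tax advice): grow the cost basis by the cumulative risk-free interest over the holding period, and tax only the gain in excess of that adjusted basis. The cumulative growth factor is the "single number in a table."

```python
# Tax only the gain in excess of risk-free growth on the original basis.
def taxable_gain(sale_price, basis, riskfree_growth_factor):
    """riskfree_growth_factor: what $1 at the risk-free rate grew to
    over the holding period (looked up in a published table)."""
    adjusted_basis = basis * riskfree_growth_factor
    return max(0.0, sale_price - adjusted_basis)

# Hypothetical: bought for $100, sold for $180, and $1 at the risk-free
# rate would have grown to $1.30 over the intervening period.
print(taxable_gain(180.0, 100.0, 1.30))  # taxes 180 - 130 = 50
```

If the asset merely keeps pace with the risk-free rate (e.g. sold for $120 in the example above), the taxable gain is zero.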

(From the perspective of tax optimization, I think this is much simpler than the status quo. From the perspective of tax accounting, this mechanism takes the place of the distinction between long-term and short-term capital gains, and is radically simpler than that.)

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-23T09:29:35.061Z · score: 2 (1 votes) · LW · GW
What if the user fails to realize that a certain kind of resource is valuable? (By "resources" we're talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)

As long as the user and AI appreciate the arguments we are making right now, then we shouldn't expect it to do worse than stealing the unaligned AI's strategy. There is all the usual ambiguity about "what the user wants," but if the user expects that the resources other agents are gathering will be more useful than the resources its AI is gathering, then its AI would clearly do better (in the user's view) by doing what others are doing.

(I think I won't have time to engage much on this in the near future. It seems plausible that I am skipping enough steps or using language in an unfamiliar enough way that this won't make sense to readers, in which case so it goes; it's also possible that I'm missing something.)

Comment by paulfchristiano on Taxing investment income is complicated · 2019-09-23T09:22:31.106Z · score: 2 (1 votes) · LW · GW

This is basically the argument in my second bullet. You can make what you will of that argument, I think I still believe it---taxing savings just seems strictly worse than a similarly-progressive tax on income.

(I also don't much like a wealth tax for the same reason.)

Comment by paulfchristiano on Taxing investment income is complicated · 2019-09-23T09:20:10.603Z · score: 4 (3 votes) · LW · GW

These are good options when available. You should start by setting all the Pigouvian taxes at optimal levels, then go from there. Having not thought about it very much, taxes on the unimproved value of land seem good but can't fund something like a modern government without major distortions, so you'll end up with lots of other stuff in your basket.

Comment by paulfchristiano on Taxing investment income is complicated · 2019-09-23T09:16:12.657Z · score: 3 (2 votes) · LW · GW
Also, my current understanding is that risk compensation is definitely not the large majority of investment returns. The last chapter of Cochrane's Asset Pricing text has a great discussion of the topic. The main conclusion is that explaining returns via risk exposure requires unrealistically high levels of risk aversion - like, one or two orders of magnitude above the risk aversion levels implied by other activities.

What's the competing explanation?

Haven't looked at the historical numbers, but in recent times it seems like (i) with log utility and a naive model of "future=past," optimal leverage is around 2x, (ii) most investors are much more risk averse than log utility (even for casino risk). So it seems like things basically add up here for most of the market. Was the situation an order of magnitude different in the past?

Comment by paulfchristiano on Taxing investment income is complicated · 2019-09-22T17:17:04.387Z · score: 6 (4 votes) · LW · GW

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-21T16:47:10.940Z · score: 8 (2 votes) · LW · GW
For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.

Estimating values then optimizing those seems (much) worse than optimizing "what the user wants." One natural strategy for getting what the user wants can be something like "get into a good position to influence the world and then ask the user later."

This is very confusing because elsewhere you say that the kind of AI you're trying to design is just satisfying short-term preferences / instrumental values of the user

I don't have a very strong view about the distinction between corrigibility to the user and corrigibility to some other definition of value (e.g. a hypothetical version of the user who is more secure).

This is very confusing because elsewhere you say that the kind of AI you're trying to design is just satisfying short-term preferences / instrumental values of the user, but here "good for the user" seemingly has to be interpreted as "good in the long run".

By "trying to find a strategy that's good for the user" I mean: trying to pursue the kind of resources that the user thinks are valuable, without costs that the user would consider serious, etc.

Comment by paulfchristiano on The strategy-stealing assumption · 2019-09-18T04:00:58.989Z · score: 2 (1 votes) · LW · GW
Is the aligned AI literally applying a planning algorithm to the same long-term goal as the unaligned AI, and then translating that plan into a plan for acquiring flexible influence, or is it just generally trying to come up with a plan to acquire flexible influence?

The latter

It is trying to find a strategy that's instrumentally useful for a variety of long-term goals

It's presumably trying to find a strategy that's good for the user, but in the worst case where it understands nothing about the user it still shouldn't do any worse than "find a strategy that's instrumentally useful for a variety of long-term goals."