Writeup: Progress on AI Safety via Debate 2020-02-05T21:04:05.303Z · score: 77 (21 votes)
Hedonic asymmetries 2020-01-26T02:10:01.323Z · score: 84 (30 votes)
Moral public goods 2020-01-26T00:10:01.803Z · score: 124 (42 votes)
Of arguments and wagers 2020-01-10T22:20:02.213Z · score: 57 (17 votes)
Prediction markets for internet points? 2019-10-27T19:30:00.898Z · score: 40 (18 votes)
AI alignment landscape 2019-10-13T02:10:01.135Z · score: 43 (16 votes)
Taxing investment income is complicated 2019-09-22T01:30:01.242Z · score: 34 (13 votes)
The strategy-stealing assumption 2019-09-16T15:23:25.339Z · score: 68 (18 votes)
Reframing the evolutionary benefit of sex 2019-09-14T17:00:01.184Z · score: 67 (23 votes)
Ought: why it matters and ways to help 2019-07-25T18:00:27.918Z · score: 86 (35 votes)
Aligning a toy model of optimization 2019-06-28T20:23:51.337Z · score: 52 (17 votes)
What failure looks like 2019-03-17T20:18:59.800Z · score: 214 (89 votes)
Security amplification 2019-02-06T17:28:19.995Z · score: 20 (4 votes)
Reliability amplification 2019-01-31T21:12:18.591Z · score: 22 (6 votes)
Techniques for optimizing worst-case performance 2019-01-28T21:29:53.164Z · score: 24 (7 votes)
Thoughts on reward engineering 2019-01-24T20:15:05.251Z · score: 31 (9 votes)
Learning with catastrophes 2019-01-23T03:01:26.397Z · score: 28 (9 votes)
Capability amplification 2019-01-20T07:03:27.879Z · score: 24 (7 votes)
The reward engineering problem 2019-01-16T18:47:24.075Z · score: 24 (5 votes)
Towards formalizing universality 2019-01-13T20:39:21.726Z · score: 29 (6 votes)
Directions and desiderata for AI alignment 2019-01-13T07:47:13.581Z · score: 30 (7 votes)
Ambitious vs. narrow value learning 2019-01-12T06:18:21.747Z · score: 21 (7 votes)
AlphaGo Zero and capability amplification 2019-01-09T00:40:13.391Z · score: 30 (13 votes)
Supervising strong learners by amplifying weak experts 2019-01-06T07:00:58.680Z · score: 28 (7 votes)
Benign model-free RL 2018-12-02T04:10:45.205Z · score: 13 (4 votes)
Corrigibility 2018-11-27T21:50:10.517Z · score: 40 (10 votes)
Humans Consulting HCH 2018-11-25T23:18:55.247Z · score: 20 (4 votes)
Approval-directed bootstrapping 2018-11-25T23:18:47.542Z · score: 19 (4 votes)
Approval-directed agents 2018-11-22T21:15:28.956Z · score: 29 (5 votes)
Prosaic AI alignment 2018-11-20T13:56:39.773Z · score: 38 (11 votes)
An unaligned benchmark 2018-11-17T15:51:03.448Z · score: 28 (7 votes)
Clarifying "AI Alignment" 2018-11-15T14:41:57.599Z · score: 61 (18 votes)
The Steering Problem 2018-11-13T17:14:56.557Z · score: 39 (11 votes)
Preface to the sequence on iterated amplification 2018-11-10T13:24:13.200Z · score: 40 (15 votes)
The easy goal inference problem is still hard 2018-11-03T14:41:55.464Z · score: 42 (13 votes)
Could we send a message to the distant future? 2018-06-09T04:27:00.544Z · score: 40 (14 votes)
When is unaligned AI morally valuable? 2018-05-25T01:57:55.579Z · score: 101 (31 votes)
Open question: are minimal circuits daemon-free? 2018-05-05T22:40:20.509Z · score: 122 (39 votes)
Weird question: could we see distant aliens? 2018-04-20T06:40:18.022Z · score: 85 (25 votes)
Implicit extortion 2018-04-13T16:33:21.503Z · score: 74 (22 votes)
Prize for probable problems 2018-03-08T16:58:11.536Z · score: 135 (37 votes)
Argument, intuition, and recursion 2018-03-05T01:37:36.120Z · score: 103 (31 votes)
Funding for AI alignment research 2018-03-03T21:52:50.715Z · score: 108 (29 votes)
Funding for independent AI alignment research 2018-03-03T21:44:44.000Z · score: 5 (1 votes)
The abruptness of nuclear weapons 2018-02-25T17:40:35.656Z · score: 105 (37 votes)
Arguments about fast takeoff 2018-02-25T04:53:36.083Z · score: 116 (38 votes)
Funding opportunity for AI alignment research 2017-08-27T05:23:46.000Z · score: 1 (1 votes)
Ten small life improvements 2017-08-20T19:09:23.673Z · score: 26 (19 votes)
Crowdsourcing moderation without sacrificing quality 2016-12-02T21:47:57.719Z · score: 16 (11 votes)
Optimizing the news feed 2016-12-01T23:23:55.403Z · score: 16 (11 votes)


Comment by paulfchristiano on Three Kinds of Competitiveness · 2020-04-01T01:19:07.370Z · score: 5 (3 votes)

I think our current best implementation of IDA would neither be competitive nor scalably aligned :)

Comment by paulfchristiano on Three Kinds of Competitiveness · 2020-03-31T16:49:23.928Z · score: 4 (2 votes)

In most cases you can continuously trade off performance and cost; for that reason I usually think of them as a single metric of "competitive with X% overhead." I agree there are cases where they come apart, but I think there are pretty few examples. (Even for nuclear weapons you could ask "how much more expensive is it to run a similarly-destructive bombing campaign with conventional explosives.")

I think this works best if you consider a sequence of increments each worth +10%, rather than, say, accumulating 70 of those increments, because "spend 1000x more" is normally not available, and so we don't have a useful handle on what a technology looks like when scaled up 1000x (and that scaleup would usually involve a bunch of changes that are hard to anticipate).

That is, if we have a sequence of technologies A0, A1, A2, ..., AN, each of which is 10% cheaper than the one before, then we may say that AN is better than A0 by N 10% steps (rather than trying to directly evaluate how many orders of magnitude you'd have to spend on A0 to compete with AN, because the process "spend a thousand times more on A0 in a not-stupid way" is actually kind of hard to imagine).
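The step-counting above can be made concrete. This is a minimal sketch, assuming each step multiplies cost-effectiveness by a fixed factor (1/0.9 for "10% cheaper than the one before", 1.1 for "+10%"); the function names are hypothetical. Either reading puts a 1000x gap at roughly 65 to 73 steps, which is why "70 increments" and "spend 1000x more" are roughly interchangeable.

```python
import math


def steps_to_match(ratio: float, per_step_factor: float) -> float:
    """Number of successive improvements needed to reach an overall `ratio`,
    if each step improves cost-effectiveness by `per_step_factor`."""
    # N steps compound to per_step_factor ** N, so N = log(ratio) / log(factor).
    return math.log(ratio) / math.log(per_step_factor)


def overall_ratio(n_steps: int, per_step_factor: float) -> float:
    """Overall improvement of A_N over A_0 after n_steps compounding steps."""
    return per_step_factor ** n_steps


if __name__ == "__main__":
    cheaper = 1 / 0.9  # each technology is 10% cheaper than the one before
    better = 1.1       # each increment is worth +10%
    print(f"1000x ~ {steps_to_match(1000, cheaper):.1f} steps of '10% cheaper'")
    print(f"1000x ~ {steps_to_match(1000, better):.1f} steps of '+10%'")
    print(f"70 steps of '10% cheaper' ~ {overall_ratio(70, cheaper):.0f}x")
```

Seventy "10% cheaper" steps compound to roughly 1,600x, so the exact count matters much less than the compounding.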

Comment by paulfchristiano on Three Kinds of Competitiveness · 2020-03-31T16:43:32.877Z · score: 5 (3 votes)

IDA is really aiming to be cost-competitive and performance-competitive, say to within an overhead of 10%. That may or may not be possible, but it's the goal.

If the compute required to build and run your reward function is small relative to the compute required to train your model, then it seems like overhead is small. If you can do semi-supervised RL and only require a reward function evaluation on a minority of trajectories (e.g. because most of the work is learning about how to manipulate the environment), then you can be OK as long as the cost of running the reward function isn't too much higher.

Whether that's possible is a big open question. Whether it's date competitive depends on how fast you figure out how to do it.
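The overhead argument here is multiplicative, and a back-of-the-envelope sketch makes that explicit. This is a hypothetical model, not a description of any actual IDA implementation: it assumes overhead equals the fraction of trajectories that get a reward evaluation times the cost of one evaluation relative to generating the trajectory, and it ignores the one-off cost of building the reward function. Both function names are made up for illustration.

```python
def reward_overhead(eval_fraction: float, reward_cost_ratio: float) -> float:
    """Extra compute as a fraction of the policy's own compute (hypothetical model).

    eval_fraction: share of trajectories that receive a reward evaluation.
    reward_cost_ratio: cost of one evaluation relative to generating one
    trajectory.
    """
    return eval_fraction * reward_cost_ratio


def max_affordable_cost_ratio(overhead_budget: float, eval_fraction: float) -> float:
    """Largest per-evaluation cost ratio that stays within the overhead budget."""
    return overhead_budget / eval_fraction


if __name__ == "__main__":
    # Evaluating every trajectory at equal cost doubles the compute bill.
    print(reward_overhead(1.0, 1.0))
    # Evaluating only 5% of trajectories lets each evaluation cost about 2x a
    # rollout while staying within a 10% overhead target.
    print(max_affordable_cost_ratio(0.10, 0.05))
```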

Comment by paulfchristiano on What are the most plausible "AI Safety warning shot" scenarios? · 2020-03-27T16:00:02.205Z · score: 19 (11 votes)

I think "makes 50% of currently-skeptical people change their minds" is a high bar for a warning shot. On that definition e.g. COVID-19 will probably not be a warning shot for existential risk from pandemics. I do think it is plausible that AI warning shots won't be much better than pandemic warning shots. (On your definition it seems likely that there won't ever again be a warning shot for any existential risk.)

For a more normal bar, I expect plenty of AI systems to fail at large scales in ways that seem like "malice," and then to cover up the fact that they've failed. AI employees will embezzle funds, AI assistants will threaten and manipulate their users, AI soldiers will desert. Events like this will make it clear to most people that there is a serious problem, which plenty of people will be working on in order to make AI useful. The base rate will remain low but there will be periodic high-profile blow-ups.

I don't expect the kind of total unity of AI motivations that you are imagining, where all of them want to take over the world (so that the only case where you see something frightening is a failed bid to take over the world). That seems pretty unlikely to me, though it's conceivable (maybe 10-20%?) and may be an important risk scenario. I think it's much more likely that we stamp out all of the other failures gradually and are left with only the patient+treacherous failures; in that case, whether it's a warning shot or not depends entirely on how much people are willing to generalize.

I do think the situation in the AI community will be radically different after observing these kinds of warning shots, even if we don't observe an AI literally taking over a country.

There is a very narrow range of AI capability between "too stupid to do significant damage of the sort that would scare people" and "too smart to fail at takeover if it tried."

Why do you think this is true? Do you think it's true of humans? I think it's plausible if you require "take over a country" but not if you require e.g. "kill plenty of people" or "scare people who hear about it a lot."

(This is all focused on intent alignment warning shots. I expect there will also be other scary consequences of AI that get people's attention, but the argument in your post seemed to be just about intent alignment failures.)

Comment by paulfchristiano on March Coronavirus Open Thread · 2020-03-12T02:36:09.213Z · score: 22 (11 votes)

Disclaimer: I don't know if this is right, I'm reasoning entirely from first principles.

If there is dispersion in R0, then there would likely be some places where the virus survives even if you take draconian measures. If you later relax those draconian measures, it will begin spreading in the larger population again at the same rate as before.

In particular, if the number of cases is currently decreasing overall most places, then soon most of the cases will be in regions or communities where containment was less successful and so the number of cases will stop decreasing.

If it's infeasible to literally stamp it out everywhere (which I've heard), then you basically want to either delay long enough to have a vaccine or have people get sick at the largest rate that the health care system can handle.
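Here is a toy deterministic version of the dispersion argument. It assumes independent communities (no mixing), each with a fixed effective reproduction number, and all numbers are made up for illustration. The overall count falls at first, then bottoms out and climbs once the community where containment failed dominates.

```python
def case_counts(initial_cases, r_values, generations):
    """Cases per community in each generation, assuming each community grows
    independently at its own fixed effective reproduction number."""
    return [
        [c * r ** g for c, r in zip(initial_cases, r_values)]
        for g in range(generations + 1)
    ]


if __name__ == "__main__":
    # Hypothetical numbers: containment works well in two communities
    # (R = 0.6, 0.8) and falls just short in a smaller third one (R = 1.05).
    initial = [1000, 1000, 100]
    r_eff = [0.6, 0.8, 1.05]
    history = case_counts(initial, r_eff, 40)
    for g in (0, 1, 5, 10, 20, 40):
        total = sum(history[g])
        share = history[g][2] / total
        print(f"gen {g:2d}: total {total:8.1f}, share in R>1 community {share:.0%}")
```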

Comment by paulfchristiano on Writeup: Progress on AI Safety via Debate · 2020-02-20T02:37:43.449Z · score: 6 (3 votes)

The intuitive idea is to share activations as well as weights, i.e. to have two heads (or more realistically one head consulted twice) on top of the same model. There is a fair amount of uncertainty about this kind of "detail" but I think for now it's smaller than the fundamental uncertainty about whether anything in this vague direction will work.
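A toy sketch of sharing activations, in pure Python with made-up weights. The trunk runs once per input, and a single head is consulted twice, with a role flag distinguishing the two debaters. The class name, the role flag, and the one-layer ReLU trunk are all assumptions for illustration, not a description of any actual debate setup.

```python
class SharedTrunkDebaters:
    """Two debaters reading the same trunk activations (hypothetical sketch)."""

    def __init__(self, trunk_weights, head_weights):
        self.trunk_weights = trunk_weights  # one row of weights per hidden unit
        self.head_weights = head_weights    # len(hidden) + 1 (for the role flag)
        self.trunk_calls = 0

    def trunk(self, x):
        """One linear layer followed by ReLU; counts how often it runs."""
        self.trunk_calls += 1
        return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
                for row in self.trunk_weights]

    def head(self, hidden, role):
        """The same head for both debaters; only the role flag differs."""
        features = hidden + [1.0 if role == "A" else -1.0]
        return sum(w * f for w, f in zip(self.head_weights, features))

    def debate_step(self, x):
        hidden = self.trunk(x)  # shared activations, computed once
        return {role: self.head(hidden, role) for role in ("A", "B")}


if __name__ == "__main__":
    model = SharedTrunkDebaters([[1.0, -1.0], [0.5, 0.5]], [1.0, 2.0, 0.25])
    print(model.debate_step([2.0, 1.0]), "trunk calls:", model.trunk_calls)
```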

Comment by paulfchristiano on On the falsifiability of hypercomputation, part 2: finite input streams · 2020-02-17T20:27:51.761Z · score: 6 (3 votes)

It's an interesting coincidence that arbitration is the strongest thing we can falsify, and also apparently the strongest thing that can consistently apply to itself (if we allow probabilistic arbitration). Maybe not a coincidence?

Comment by paulfchristiano on On the falsifiability of hypercomputation, part 2: finite input streams · 2020-02-17T20:27:35.185Z · score: 8 (4 votes)

It's not obvious to me that "consistent with PA" is the right standard for falsification though. It seems like simplicity considerations might lead you to adopt a stronger theory, and that this might allow for some weaker probabilistic version of falsification for things beyond arbitration. After all, how did we get induction anyway?

(Do we need induction, or could we think of falsification as being relative to some weaker theory?)

(Maybe this is just advocating for epistemic norms other than falsification though. It seems like the above move would be analogous to saying: the hypothesis that X is a halting oracle is really simple and explains the data, so we'll go with it even though it's not falsifiable.)

Comment by paulfchristiano on Open & Welcome Thread - February 2020 · 2020-02-05T17:35:38.882Z · score: 9 (4 votes)

tl;dr: seems like you need some story for what values a group highly regards / rewards. If those are just the values that serve the group, this doesn't sound very distinct from "groups try to enforce norms which benefit the group, e.g. public goods provision" + "those norms are partially successful, though people additionally misrepresent the extent to which they e.g. contribute to public goods."

Similarly, larger countries do not have higher ODA as the public goods model predicts

Calling this the "public goods model" still seems backwards. "Larger countries have higher ODA" is a prediction of "the point of ODA is to satisfy the donor's consequentialist altruistic preferences."

The "public goods model" is an attempt to model the kind of moral norms / rhetoric / pressures / etc. that seem non-consequentialist. It suggests that such norms function in part to coordinate the provision of public goods, rather than as a direct expression of individual altruistic preferences. (Individual altruistic preferences will sometimes be why something is a public good.)

This system probably evolved to "solve" local problems like local public goods and fairness within the local community, but has been co-opted by larger-scale moral memeplexes.

I agree that there are likely to be failures of this system (viewed teleologically as a mechanism for public goods provision or conflict resolution) and that "moral norms are reliably oriented towards providing public goods" is less good than "moral norms are vaguely oriented towards providing public goods." Overall the situation seems similar to a teleological view of humans.

For example if global anti-poverty suddenly becomes much more cost effective, one doesn't vote or donate to spend more on global poverty, because the budget allocated to that faction hasn't changed.

I agree with this, but it seems orthogonal to the "public goods model"; this is just about how people or groups aggregate across different values. I think it's pretty obvious in the case of imperfectly-coordinated groups (who can't make commitments to have their resource shares change as beliefs about relative efficacy change), and I think it also seems right in the case of imperfectly-internally-coordinated people.

(We have preference alteration because preference falsification is cognitively costly, and we have preference falsification because preference alteration is costly in terms of physical resources.)

Relevant links: if we can't lie to others, we will lie to ourselves, the monkey and the machine.

E.g., people overcompensate for private deviations from moral norms by putting lots of effort into public signaling including punishing norm violators and non-punishers, causing even more preference alteration and falsification by others.

I don't immediately see why this would be "compensation"; it seems like public signaling of virtue would always be a good idea regardless of your private behavior. Indeed, it probably becomes a better idea as your private behavior is more virtuous (in economics you'd only call the behavior "signaling" to the extent that this is true).

As a general point, I think calling this "signaling" is kind of misleading. For example, when I follow the law, in part I'm "signaling" that I'm law-abiding, but to a significant extent I'm also just responding to incentives to follow the law which are imposed because other people want me to follow the law. That kind of thing is not normally called signaling. I think many of the places you are currently saying "virtue signaling" have significant non-signaling components.

Comment by paulfchristiano on Moral public goods · 2020-02-02T08:51:09.114Z · score: 2 (1 votes)
That reminds me that another prediction your model makes is that larger countries should spend more on ODA (which BTW excludes military aid), but this is false

The consideration in this post would help explain why smaller countries spend more than you would expect on a naive view (where ODA just satisfies the impartial preferences of the voting population in a simple consequentialist way). It seems like there is some confusion here, but I still don't feel like it's very important.

I think there was an (additional?) earlier miscommunication or error regarding the "factions within someone's brain":

  • When talking about the weight of altruistic preferences, I (like you) am generally more into models like "X% of my resources are controlled by an altruistic faction" rather than "I have X exchange rate between my welfare and the welfare of others." (For a given individual at a given time we can move between these freely, so it doesn't matter for any of the discussion in the OP.)
  • When I say that "resources controlled by altruistic factions" doesn't explain everything, I mean that you still need to have some additional hypothesis like "donations are like contributions to public goods." I don't think those two hypotheses are substitutes, and you probably need both (or some other alternative to "donations are like contributions to public goods," like some fleshed out version of "nothing is altruistic after all" which seems to be your preference but which I'm withholding judgment on until it's fleshed out.)
  • In the OP, I agree that "and especially their compromises between altruistic and selfish ends" was either wrong or unclear. I really meant the kind of tension that I described in the immediately following bullet point, where people appear to make very different tradeoffs between altruistic and selfish values in different contexts.

Comment by paulfchristiano on High-precision claims may be refuted without being replaced with other high-precision claims · 2020-01-31T02:24:34.911Z · score: 38 (13 votes)

It seems like there is a real phenomenon in computers and proofs (and some other brittle systems), where they are predicated on long sequences of precise relationships and so quickly break down as the relationships become slightly less true. But this situation seems rare in most domains.
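One way to make this brittleness concrete, assuming each precise relationship in a chain holds independently with probability p (an illustrative assumption, not a claim about any particular proof or program): the whole chain holds with probability p^n, so near-certain steps compound into a fragile whole. The function names are hypothetical.

```python
import math


def chain_reliability(p_step: float, n_steps: int) -> float:
    """Probability that every one of n_steps independent links holds."""
    return p_step ** n_steps


def max_chain_length(p_step: float, target: float) -> float:
    """Longest chain whose overall reliability still meets `target`."""
    return math.log(target) / math.log(p_step)


if __name__ == "__main__":
    # A relationship that is "99.9% true" is fine on its own...
    print(f"{chain_reliability(0.999, 1):.3f}")
    # ...but a 1000-step chain of them holds only about a third of the time.
    print(f"{chain_reliability(0.999, 1000):.3f}")
    print(f"{max_chain_length(0.999, 0.5):.0f} steps before reliability drops below 50%")
```

Most domains look less like this: a judgment that rests on many partly redundant cues does not fail just because one cue is slightly off.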

If there's a single exception to conservation of energy, then a high percentage of modern physics theories completely break. The single exception may be sufficient to, for example, create perpetual motion machines. Physics, then, makes a very high-precision claim that energy is conserved, and a refuter of this claim need not supply an alternative physics.

I don't know what "break" means, these theories still give good predictions in everyday cases and it would be a silly reason to throw them out unless weird cases became common enough. You'd end up with something like "well we think these theories work in the places we are using them, and will keep doing so until we get a theory that works better in practice" rather than "this is a candidate for the laws governing nature." But that's just what most people have already done with nearly everything they call a "theory."

Physics is a weird example because it's one of the only domains where we could hope to have a theory in the precise sense you are talking about. But even e.g. the standard model isn't such a theory! Maybe in practice "theories" are restricted to mathematics and computer science? (Not coincidentally, these are domains where the word "theory" isn't traditionally used.)

In particular, theories are also responsible for a negligible fraction of high-precision knowledge. My claim that there's an apple because I'm looking at an apple is fairly high-precision. Most people get there without having anything like an exceptionless "theory" explaining the relationship between the appearance of an apple and the actual presence of an apple. You could try and build up some exceptionless theories that can yield these kinds of judgments, but it will take you quite some time.

I'm personally happy never using the word "theory," not knowing what it means. But my broader concern is that there are a bunch of ways that people (including you) arrive at truth, that in the context of those mechanisms it's very frequently correct to say things like "well it's the best we have" of an explicit model that makes predictions, and that there are relatively few cases of "well it's the best we have" where the kind of reasoning in this post would move you from "incorrectly accept" to "correctly reject." (I don't know if you have an example in mind.)

(ETA: maybe by "theory" you mean something just like "energy is conserved"? But in these cases the alternative is obvious, namely "energy is often conserved," and it doesn't seem like that's a move anyone would question after having exhibited a counterexample. E.g. most people don't question "people often choose the option they prefer" as an improvement over "people always choose the option they prefer." Likewise, I think most people would accept "there isn't an apple on the table" as a reasonable alternative to "there is an apple on the table," though they might reasonably ask for a different explanation for their observations.)

Comment by paulfchristiano on Moral public goods · 2020-01-31T01:51:05.098Z · score: 2 (1 votes)
Looking at the linked data, it appears that foreign aid as %GNI for DAC countries has actually gone down since 1960, and I don't see any correlation with any of the (non-enforced) agreements signed about the 0.7% target. It just looks like countries do ODA for reasons completely unrelated to the target/agreements.

Do you think the story is different for the climate change agreements? I guess the temporal trend is different, but I think the actual causal story from agreements to outcomes is equally unclear (I don't think the agreements have much causal role) and enforcement seems similarly non-existent.

Information goods are a form of public goods, and they are provided (in large part) because governments enforce copyrights.

Copyright enforcement seems more like trade. I will require my citizens to pay you for your information, if you require your citizens to pay me for my information.

You can analogize copyright enforcement to a public good if you want, but the actual dynamics of provision and cost-benefit analyses seem quite different. For example, signing up to a bilateral copyright agreement is a good deal between peer states (if copyright agreements ever are): you've protected your citizens' copyright to the same extent that you've lost the ability to infringe on others'. The same is not true of a public good, where bilateral agreement is almost the same as unilateral action.
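A stylized payoff sketch of that contrast, assuming symmetric peer states and hypothetical parameters. For a public good, a bilateral deal barely improves on acting alone, because each contribution's benefit is spread over every country; a copyright deal delivers its mutual gain pair by pair. The copyright function simply assumes the per-partner gain is positive, whereas the comment itself is agnostic about whether copyright agreements are good deals at all.

```python
def public_good_net(members: int, benefit_per_unit: float) -> float:
    """Net gain per member when each of `members` countries contributes one
    unit (cost 1) and every unit benefits every country by benefit_per_unit."""
    return members * benefit_per_unit - 1.0


def copyright_net(partners: int, gain_per_partner: float) -> float:
    """Net gain when mutual protection with each peer partner is worth
    gain_per_partner, independent of what other countries do (hypothetical)."""
    return partners * gain_per_partner


if __name__ == "__main__":
    n_countries, b = 100, 0.05  # hypothetical numbers
    print("public good, unilateral:", public_good_net(1, b))
    print("public good, bilateral: ", public_good_net(2, b))
    print("public good, universal: ", public_good_net(n_countries, b))
    print("copyright, bilateral:   ", copyright_net(1, 0.1))
```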

At any rate, I actually don't think almost anything from the OP hinges on this disagreement (though it seems like an instructive difference in background views about the international order). We are just debating whether the lack of international agreements on foreign aid implies that people don't much care about the humanitarian impacts of aid, with me claiming that international coordination is generally weak with rare exceptions and so it's not much evidence.

There is plenty of other evidence though. E.g. when considering the US you don't really need to invoke international agreements. The US represents >20% of gross world product, so US unilateral action is nearly as good as international action. US government aid is mostly military aid, which has no real pretension of humanitarian motivation, and I assume US private transfers to developing countries are mostly remittances. So I certainly agree that people in the US don't care to spend very much on aid.

Comment by paulfchristiano on Moral public goods · 2020-01-30T20:44:52.206Z · score: 2 (1 votes)

I don't know much about this, but the linked page describes a 0.7% target for ODA and claims that

the 0.7% target served as a reference for 2005 political commitments to increase ODA from the EU, the G8 Gleneagles Summit and the UN World Summit


DAC members generally accepted the 0.7% target for ODA, at least as a long-term objective, with some notable exceptions: Switzerland – not a member of the United Nations until 2002 – did not adopt the target, and the United States stated that it did not subscribe to specific targets or timetables, although it supported the more general aims of the Resolution.

At face value it seems like 0.7%/year is considerably larger than the investments in any of the other efforts at international coordination you mention (and uptake seems comparable).

(The Montreal Protocol seems like a weird case in that the gains are so large: I've been told that the gains were large enough for the US that unilateral participation was basically justifiable. Copyright agreements don't seem like public goods provision. I don't think countries are meeting their Paris agreement obligations any better than they are meeting their 0.7%/year ODA targets, and enforcement seems just as non-existent.)

Comment by paulfchristiano on Moral public goods · 2020-01-30T20:15:55.003Z · score: 2 (1 votes)
Assuming I now have a correct understanding, I can restate my objection as, if anti-poverty is a public good, why hasn't it followed the trend of other public goods, and shifted from informal private provision to formal government or internationally-coordinated provision?

Most redistribution is provided formally by governments, and it may be the single most common topic of political debate. I'm not even sure this is evidence one way or the other, though: why would you expect people not to signal virtue by advocating for policies? (Isn't that a key part of your story?)

Relatedly, how does "we don't want the government to enforce X so that we can signal our virtue by doing X" even work? Advocating for "make everyone do X" signals the same kind of virtue as doing X, advocating against seems to send the opposite signal, and surely the signaling considerations are just as dominant for advocacy as for the object-level decision? I think I often can't really engage with the virtue signaling account because I don't understand it at the level of precision that would be needed to actually make a prediction about anything.

Domestically, are you asking: "why do people donate so much more to charity than to other public goods"? I don't think any of the competing theories really say much about that until we get way more specific about them and what makes a situation good for signaling virtue vs. what makes public goods easy to coordinate about in various ways vs. etc. (and also get way more into the quantitative data about other apparent public goods which are supported by donations).

(Overall this doesn't seem like a particularly useful line of discussion to me so I'm likely to drop it. Most useful for me would probably be a description of the virtue signaling account that makes sense to me.)

Comment by paulfchristiano on Moral public goods · 2020-01-30T16:54:23.004Z · score: 2 (1 votes)

I don't think Scott is talking about the bay area in that quote, is he?

(ETA: also if his estimate is per year then I think it's similar to the report you quoted, which estimates $700M/year to provide shelter to all of the homeless at a cost of ~$25k/person/year, so that seems like another plausible source of discrepancy.)
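As a quick consistency check on the report's two figures, dividing one by the other gives the number of people it implicitly covers. This is arithmetic on the quoted numbers only, not an independent estimate.

```python
annual_budget = 700_000_000  # $700M/year, from the quoted report
cost_per_person = 25_000     # ~$25k per person per year, from the quoted report

implied_people = annual_budget / cost_per_person
print(f"implied number of people sheltered: {implied_people:,.0f}")
```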

Comment by paulfchristiano on Moral public goods · 2020-01-30T16:41:33.572Z · score: 2 (1 votes)

For the nobles the ratio is only 1000 (= the total number of nobles). In e.g. the modern US the multiples are much higher since the tax base is much larger. That is, there is a gap of > a million between the levels of altruism at which you would prefer higher taxes vs. actually give away some money.
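A minimal consequentialist sketch of where this factor comes from. It assumes a flat $1 levy on each of n payers that is transferred in full, and an altruism weight w (the value you place on $1 reaching the poor, in units of your own $1) that doesn't depend on the amount; both are simplifications. Under those assumptions, the breakeven weight for giving is n times the breakeven weight for supporting the tax, so the gap equals the size of the tax base.

```python
def give_threshold() -> float:
    """Altruism weight at which giving $1 directly breaks even.

    Giving $1 costs you 1 and delivers w of value, so you give iff w > 1.
    """
    return 1.0


def tax_threshold(n_payers: int) -> float:
    """Altruism weight at which you'd vote for a $1 levy on all n_payers.

    You pay 1, but n_payers dollars reach the poor, so you support the levy
    iff w * n_payers > 1, i.e. w > 1 / n_payers.
    """
    return 1.0 / n_payers


if __name__ == "__main__":
    for n in (1_000, 10_000_000):  # the nobles; a hypothetical large tax base
        print(n, give_threshold() / tax_threshold(n))
```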

Comment by paulfchristiano on Moral public goods · 2020-01-29T04:03:25.865Z · score: 5 (2 votes)
What does it not explain, that your model explains?

If you literally just mean to compare to "people have a fixed fraction of their budget they spend on altruistic things":

  • Rhetoric about doing your part, feelings of guilt. In general the structural similarities between discourse and norms around alms on the one hand and public goods on the other.
  • If the bucket served by US donations is "caring about US needy" then I think you have to explain people's apparent relative ambivalence between political advocacy for more redistribution and direct donations.
  • I think that local giving makes more sense as part of a story about collective provision of public goods, though I haven't thought it through much and this may just be responding to rhetoric and so double-counting the first observation.

I haven't thought about it that much, but my basic sense is that you are going to have to invoke a virtue signaling explanation for lots of behaviors, and that's going to start to look more similar to norms for providing public goods. E.g. is your view that normal public goods (like funding a local park) are provided because of virtue signaling? If so, then it's not as clear there is much gap between our views and maybe this is more an argument about some accounts of "virtue signaling."

Comment by paulfchristiano on Moral public goods · 2020-01-29T03:49:40.136Z · score: 2 (1 votes)
I don't think such a comparison would make sense, since different public goods have different room for funding. For example the World Bank has a bigger budget than the WHO, but development/anti-poverty has a lot more room for funding (or less diminishing returns) than preventing global pandemics.

Do you have some example of a public good that you are using to calibrate your expectations about international spending on typical public goods?

I don't think it's enough to say: people do a tiny amount of X but they don't coordinate explicitly. You should also provide some evidence about the overall ability to coordinate.

(That said, I also agree that most of what's going on, for explaining the difference between real aid budgets and what a utilitarian would spend, is that people don't care very much.)

Comment by paulfchristiano on Moral public goods · 2020-01-28T18:12:35.278Z · score: 4 (3 votes)
I feel like you've come up with an example where people are just barely charitable enough that they support redistribution, but not charitable enough that they would ever give a gift themselves. This is a counterexample to Friedman's claim, but it's not obvious that it's real.

For consequentialists, the gap between "charitable enough to give" and "charitable enough to support redistribution" seems to be more than a million-fold; if so, I don't think it warrants that "just barely" modifier.

Comment by paulfchristiano on Hedonic asymmetries · 2020-01-28T18:10:47.249Z · score: 3 (2 votes)

I think this part of the reversed argument is wrong:

The agent will randomly seek behaviours that get rewarded, but as long as these behaviours are reasonably rare (and are not that bad) then that’s not too costly

Even if the behaviors are very rare and their reward is merely "normal," the agent will seek them out and so miss out on actually good states.

Comment by paulfchristiano on Moral public goods · 2020-01-27T16:35:51.029Z · score: 2 (1 votes)
Why do we see so little of that for global poverty?

I'm not convinced this is the case. Do you have some comparisons of international spending on different public goods, or lobbying for such spending?

(I agree that there is more international coordination on arms control, but don't think that this is analogous.)

Comment by paulfchristiano on Hedonic asymmetries · 2020-01-27T16:34:14.453Z · score: 3 (2 votes)
For example, if the world is symmetric in the appropriate sense in terms of what actions get you rewarded or penalized, and you maximize expected utility instead of satisficing in some way, then the argument is wrong. I'm sure there is good literature on how to model evolution as a player, and the modeling of the environment shouldn't be difficult.

I would think it would hold even in that case, why is it clearly wrong?

Comment by paulfchristiano on Moral public goods · 2020-01-27T07:10:04.942Z · score: 5 (3 votes)
Where does most of the income of rich people come from, then?

I think it's mostly wages.

Can you point me to some relevant resource?

Might be misreading, but see table III here (h/t Howie Lempel for the source). Looks like even the top 0.01% is still <50% capital income.

[Edit: though capital shares were higher in the past: in 1929 the capital share gets up to 50% of income for the top 0.1%.]

There are various ways this data isn't exactly what you want, but I still think it's very unlikely that it's more than half capital income.

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:56:31.274Z · score: 5 (3 votes) · LW · GW
'Redistribution' (ie. theft) is an exercise in pointlessness.

Using coercive force to fund public goods is also 'theft', but still it can end up with near-unanimous support. So I don't think that this is a good argument in and of itself.

As long as there is scarcity there will be haves and have nots, and wealth will accumulate as a natural function of time and successful strategies. You can reset the game board as often as you like but you can never ensure a permanent and even stalemate. Even assuming you could destroy the entire point of competing, well then you've destroyed everything you get from that too.

This post isn't really about leveling the playing field. (Even in the stupid example with nobles, the nobles still end up 1000x richer than the peasants.)

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:48:04.224Z · score: 4 (2 votes) · LW · GW
I don't understand the new model that you're proposing here. If people want to see a world free from extreme poverty and view that as a public good provision problem, shouldn't they advocate for or work towards international coordination on achieving that? (Given international coordination on other issues, this clearly isn't an impossible goal.) Why are they unilaterally chipping in small amounts towards reducing poverty in a piecemeal fashion?

This seems to be how people relate to local public goods.

I've been modeling this as people having moral uncertainty which are modeled as factions within someone's brain, where the altruist faction has control over some (typically small) fraction of their budget.

I think that's a better model than fixed weights for different values, but I don't think it explains everything.

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:44:46.477Z · score: 2 (1 votes) · LW · GW

In this model the nobles don't recommend a further tax if in aggregate they have 50% of the total income, utility is logarithmic, and they care about themselves as much as everyone else put together.

(This is a good argument that this model may just not capture your intuition. But I do think the case for taxes is in fact quite a lot better when the nobles collectively control a large share of the income rather than just being individually wealthy.)
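A minimal sketch of the noble's calculation, under assumptions supplied here for illustration rather than taken from the post. Income is normalized to 1 and split equally within each class, everyone has log utility, and each noble gives weight 1 to every other person and gives their own utility the same weight as all other people combined. The function name and the 1,000-noble / 1,000,000-peasant population are hypothetical.

```python
def preferred_transfer(noble_share, n_nobles, n_peasants):
    """Share of total income one noble would vote to move from nobles to peasants.

    Illustrative assumptions: total income is 1 and is split equally within each
    class; everyone has log utility; the noble gives every other person weight 1
    and gives their own utility the same weight as all other people combined.
    """
    self_weight = n_nobles + n_peasants - 1
    # Total weight on log(per-noble consumption): the noble plus the other nobles.
    a = self_weight + (n_nobles - 1)
    # Total weight on log(per-peasant consumption): one unit per peasant.
    b = n_peasants
    s = noble_share
    # Maximize a*log(s - t) + b*log(1 - s + t) over the transfer t.
    # The first-order condition a / (s - t) = b / (1 - s + t) gives:
    t = (b * s - a * (1 - s)) / (a + b)
    # A negative optimum means this noble wants no further tax at all.
    return max(t, 0.0)


for share in (0.9, 0.7, 0.5):
    print(share, round(preferred_transfer(share, 1_000, 1_000_000), 3))
```

When nobles are few relative to peasants, the preferred transfer is roughly noble_share - 0.5, so a noble votes to tax until the nobles' aggregate share falls to about 50% and votes for nothing further at that point.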

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:41:14.701Z · score: 3 (2 votes) · LW · GW

I agree that this is more like the dilemma for modeling someone as a welfarist than a general consequentialist (if they were a total utilitarian then I think they'd already be committed to option 2). But I think you do have similar problems with any attempt to model them as consequentialists.

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:36:31.238Z · score: 2 (1 votes) · LW · GW

The post is just arguing that redistribution can behave like a public good; it's not arguing for increases in redistribution. (I think the quantitative behavior of the hypothetical is not at all like the real world: there isn't a small group that gets 90% of the income, and the optimal tax rate is very sensitive to the fraction of income the nobles start with.)

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:34:28.177Z · score: 2 (1 votes) · LW · GW

If you want an example with more modest numbers:

  • There are a billion rich people and a billion poor people (and a bunch in the middle we'll ignore)
  • Each rich person cares about themselves 5x as much as all poor people put together
  • The poor people get 100x the welfare from $1 as the rich people

Then it's a bad idea for a rich person to donate, but they would support taxes to fund foreign aid until the taxes reached the level where the poorest billion valued money only 5x as much as the richest billion.
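The arithmetic behind the modest-numbers example can be checked with a marginal (first-order) sketch. It assumes each poor person gets weight 1 in the rich person's objective, the rich person's own weight is 5x the poor people's total, a dollar is worth `welfare_ratio` times more to a poor person, and, like the example, it ignores how much the rich person cares about other rich people. The function and parameter names are hypothetical.

```python
def rich_person_net_gain(policy, welfare_ratio, self_multiple=5.0,
                         n_rich=10**9, n_poor=10**9):
    """Marginal net gain, in one rich person's own units, from moving $1 per rich person.

    Illustrative assumptions: each poor person has weight 1, the rich person's own
    weight is self_multiple * n_poor, $1 is worth welfare_ratio times more to a poor
    person than to a rich one, and concern for other rich people is ignored.
    """
    own_weight = self_multiple * n_poor
    cost = own_weight * 1.0  # the rich person gives up $1 under either policy
    if policy == "donate":
        # Only this person's dollar moves, to a single poor person.
        benefit = welfare_ratio * 1.0
    elif policy == "tax":
        # Every rich person pays $1; the n_rich dollars are spread over the poor.
        benefit = welfare_ratio * n_rich
    else:
        raise ValueError(f"unknown policy: {policy!r}")
    return benefit - cost


print(rich_person_net_gain("donate", 100), rich_person_net_gain("tax", 100))
```

With a 100x welfare ratio, unilateral giving loses by a factor of about fifty million, while the tax is worthwhile until the ratio falls to 5.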

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:29:28.889Z · score: 6 (4 votes) · LW · GW
The first part of this turned seemed like mostly politics - oversimple and flat-out non-real example being used to justify a policy without any nuance or sense.

The example is just to illustrate that it's possible for everyone to prefer taxation but not want to donate unilaterally. Maybe this is an easy enough point that it wasn't worth illustrating with an example. I tried to make the example obviously silly so that it wouldn't be taken as politically live, but I think that failed.

you haven't specified even what "redistribution" means, especially in a dynamic equilibrium where wealth and income are related but distinct.

I mean policies justified by benefiting poor people at the expense of rich people, e.g. an income tax whose revenue the state then spends to benefit people equitably.

I absolutely agree that this is not the whole case for an income tax; it's one argument for an income tax (which is different from arguments about justice or fairness and seems worth having in a separate mental category).

Point 2 is completely missing the fundamental question of what people want - Friedman's point that if people actually were self-aligned that they care about feeding specific poor people rather than getting a new iPhone, they'd do it. Instead, they want abstract poor people to get fed, and only if they can force others to do so (along with themselves, in many cases, but rarely just unilaterally). You don't address this disparity.

It seems consistent for me to prefer that all poor people get food rather than that all rich people get iPhones, yet to prefer that I get an iPhone rather than that a particular poor person get food (since I care more about myself than about the average rich person). Do you disagree that this would be a consistent set of preferences? Or do you agree that it's consistent but just disagree that it's empirically plausible? At any rate, it seems like we should agree that Friedman's argument doesn't work without some additional assumptions.

Comment by paulfchristiano on Moral public goods · 2020-01-26T17:22:08.453Z · score: 4 (2 votes) · LW · GW
Real-life nobles don't produce 10,000x value; they extract value from peasants, by force of arms and law and custom. It makes no sense to redistribute wealth by taxing everyone's income if the nobles get their income by taxing the peasants; just stop the nobles from extracting so much value.

I think that redistribution by taxing still makes sense, e.g. if nobles effectively get their money by owning and controlling the land (by force or whatever) and taking a big cut from the people who work it. But I also agree that there may be easier and better things to do than raising taxes; it seems like a waste of effort for nobles to collect local taxes and then for the king to collect taxes from the nobles and pay them back.

But most of their income still comes from capital and owning the tools of production and all that (citation required).

I think this probably isn't right---e.g. capital income is a minority for the top 1% of earners in the US today, and the situation is even starker for global inequality.

A terminology of "nobles" and "peasants" implies to me the idea that most all of the nobles' (the modern rich) income is extracted from the peasants (everyone else), enabled by the same state that then taxes them. Did you intend or endorse this view? If not, or if you think it's irrelevant to the thought experiment, do you think the framing of "nobles" and "peasants" distracts from the issue? It does for me.

In retrospect I agree it would have been better to use a different example.

(In retrospect people also didn't like the big and unrealistic numbers, so I could have just made them 10 and 100 instead. I generally overestimated the extent to which readers would separate the simple quantitative point, which I wanted to make in the shortest way possible but didn't think about that much, from other features of the scenario.)

Comment by paulfchristiano on Moral public goods · 2020-01-26T05:27:17.576Z · score: 4 (2 votes) · LW · GW

I think the example works fine with numbers like "the welfare effect of $1 is a hundred times larger for this poor person than that rich person" which seem conservative. (e.g. I think it is very likely that many poor people would value a doubling of consumption at least as much as I would, suggesting multipliers >100).

I think the weird thing about the example is the nobles having 90% of the total income.
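The doubling argument for multipliers above 100 can be made concrete. Averaged over a doubling, each person's welfare per dollar is (value of a doubling) / (current consumption), because doubling costs exactly one's current consumption. The consumption figures below are hypothetical placeholders, not data.

```python
def welfare_per_dollar_ratio(rich_consumption, poor_consumption,
                             poor_value_of_doubling=1.0, rich_value_of_doubling=1.0):
    """How many times more welfare $1 buys for the poor person than for the rich one.

    Averages over a doubling: doubling costs each person their current
    consumption, so welfare per dollar is (value of doubling) / consumption.
    """
    poor_per_dollar = poor_value_of_doubling / poor_consumption
    rich_per_dollar = rich_value_of_doubling / rich_consumption
    return poor_per_dollar / rich_per_dollar


# Hypothetical consumption levels, chosen only to illustrate a 100:1 gap.
print(welfare_per_dollar_ratio(50_000, 500))
print(welfare_per_dollar_ratio(50_000, 500, poor_value_of_doubling=1.5))
```

If the poor person values a doubling at least as much as the rich person, the multiplier is at least the consumption ratio, so any consumption gap above 100x gives a multiplier above 100.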

Comment by paulfchristiano on Modest Superintelligences · 2020-01-25T23:44:13.473Z · score: 4 (2 votes) · LW · GW

They are +4.2SD on the genetic component of the property (= 1 in 90,000), but the median person with those genetics is still only +3SD on the overall property (= 1 in 750), right?

(That is, the expected boost from the abnormally extreme genetics should be the same as the expected boost from the abnormally extreme environment, if the two are equally important. So each of them should be half of the total effect, i.e. 3SD on the overall trait.)
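A short check of these numbers, assuming the trait is the sum of independent normal genetic and environmental parts that each contribute half the variance (the thread's assumption). The helper names are hypothetical. Regression to the mean puts the expected genetic value at half the observed +6 SD, and because the environmental part is symmetric around zero, the median clone sits at that value.

```python
import math


def one_in(z):
    """Rarity "1 in N" of exceeding z standard deviations of a normal trait."""
    tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) for a standard normal Z
    return 1 / tail


genetic_share = 0.5  # clones share half the variance (assumption from the thread)
observed_z = 6.0     # roughly the 1-in-a-billion level

# With trait = G + E, independent normal parts, and total variance 1,
# E[G | G + E = x] = genetic_share * x, measured in SDs of the overall trait.
median_clone_z = genetic_share * observed_z  # 3.0
# The same genetic value in SDs of the genetic component, whose SD is sqrt(genetic_share).
genetic_z = median_clone_z / math.sqrt(genetic_share)  # about 4.24

for label, z in [("original", observed_z), ("genes alone", genetic_z),
                 ("median clone", median_clone_z)]:
    print(f"{label}: +{z:.2f} SD, about 1 in {one_in(z):,.0f}")
```

This gives roughly 1 in 90,000 for the genetic component and roughly 1 in 750 for the median clone.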

Comment by paulfchristiano on Modest Superintelligences · 2020-01-25T06:29:48.227Z · score: 3 (2 votes) · LW · GW

For a normally distributed property 1/billion is +6 sigma, while +3 sigma is 1/750. If a property is normally distributed, the clones share 50% of the variation, and von Neumann is 1 in a billion, then I think it's right that our median guess for the median clone should be the 1 in 750 level.

(But of your 100,000 clones several of them will be at the one in a trillion level, a hundred will be more extreme than von Neumann, and >20,000 of them will be one in 20,000. I'm generally not sure what you are supposed to infer from the "one in X" metric. [Edited to add: all of those are the fractions in expectation, and they are significant underestimates because they ignore the uncertainty in the genetic component.])

Comment by paulfchristiano on Clarifying The Malignity of the Universal Prior: The Lexical Update · 2020-01-15T17:35:28.874Z · score: 7 (4 votes) · LW · GW

I still tentatively think the lexical update works, but it's been a while and I might be missing something.

I'll follow your convention that our universe is U-simple, our universal prior is U', and so the aliens' universe is U'-simple (I think---sorry if I got confused and that's not what you mean).

If we sample from a universe that is U'-simple, then:

  • Assume the aliens care about U'-simplicity. They will preferentially sample from U', and so have U'(our world) mass on our world. Within that, they will correctly guess that the machine they are supposed to control is using U' as its prior. That is, they basically pay U'(our world) * P(us|someone using U' to predict).
  • But our universal prior was also U', wasn't it? So we are also paying U'(our world) to pick out our world. I.e. we pay U'(our world) * P(someone making important predictions | our world) * P(someone using U' to predict | someone making important predictions) * P(us|someone using U' to predict).
  • I don't see any program whose behavior depends on U(world) for the "real" simplicity prior U according to which our world is simple (and that concept seems slippery).

Does that seem right?

Comment by paulfchristiano on Of arguments and wagers · 2020-01-12T17:37:30.707Z · score: 4 (2 votes) · LW · GW

Thanks, corrected.

Comment by paulfchristiano on Of arguments and wagers · 2020-01-11T18:17:25.746Z · score: 12 (4 votes) · LW · GW

(Note: this is a post from 2014 that I recently added to I still think it's a very interesting scheme and I'm excited about people exploring better mechanisms for resolving arguments.)

Comment by paulfchristiano on Of arguments and wagers · 2020-01-11T18:14:13.578Z · score: 2 (1 votes) · LW · GW
I think the resulting odds won't reflect the probability of anything, because they depend a lot on whether Alice or Bob is more risk-tolerant (=rich).

If one of them is willing to tolerate risk equal to the value of Judy's time to hear out the argument, then you are fine. If the total willingness to risk of people who believe "Judy will believe X on reflection" is lower than the value of Judy's time, then I think you are basically inevitably stuck unless Judy is willing to risk her own attention. If she is willing to risk her own attention, then she can just give people a budget of "minutes" to spend making wagers, as discussed in the post, and as long as the budget is large enough relative to the size of the disagreement it seems like you are OK.

Also, it seems to me that your scheme works best for yes/no questions. For anything more complicated, Alice and Bob can cooperate to mislead Judy, which is especially scary in case of AIs. I'm not sure how to fix that problem: it seems to require a way for a non-expert to check the work of a malicious expert, not just adjudicate between two experts.

The scheme works if one of the experts advocates for the truth. If there are two options, and both players want to manipulate Judy into believing "yes," then you are similarly in trouble. I agree that if there are more options than experts then it becomes less likely that "by chance" someone wants to advocate for the right answer. But I think in general you are banking on there being some density of experts who want to argue for the truth because it is the truth.

Comment by paulfchristiano on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-06T17:18:19.002Z · score: 4 (2 votes) · LW · GW

For context, here's the one time in the interview I mention "AI risk" (quoting 2 earlier paragraphs for context):

Paul Christiano: I don’t know, the future is 10% worse than it would otherwise be in expectation by virtue of our failure to align AI. I made up 10%, it’s kind of a random number. I don’t know, it’s less than 50%. It’s more than 10% conditioned on AI soon I think.
Asya Bergal: I think my impression is that that 10% is lower than some large set of people. I don’t know if other people agree with that.
Paul Christiano: Certainly, 10% is lower than lots of people who care about AI risk. I mean it’s worth saying, that I have this slightly narrow conception of what is the alignment problem. I’m not including all AI risk in the 10%. I’m not including in some sense most of the things people normally worry about and just including the like ‘we tried to build an AI that was doing what we want but then it wasn’t even trying to do what we want’. I think it’s lower now or even after that caveat, than pessimistic people. It’s going to be lower than all the MIRI folks, it’s going to be higher than almost everyone in the world at large, especially after specializing in this problem, which is a problem almost no one cares about, which is precisely how a thousand full time people for 20 years can reduce the whole risk by half or something.

(But it's still the case that when asked "Can you explain why it's valuable to work on AI risk?" I responded almost entirely by talking about AI alignment, since that's what I work on and the kind of work where I have a strong view about cost-effectiveness.)

Comment by paulfchristiano on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-03T00:20:08.343Z · score: 16 (10 votes) · LW · GW

E.g. if you have a broad distribution over possible worlds, some of which are "fragile" and have 100 things that cut value down by 10%, and some of which are "robust" and don't, then you get more than 10,000x as much value from the robust worlds. So unless you are a priori pretty confident that you are in a fragile world (or they are 10,000x more valuable, or whatever), the robust worlds will tend to dominate.
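A quick sketch of the fragile-vs-robust arithmetic, assuming both kinds of world have the same potential value and that the 100 hazards cut value multiplicatively and independently. The function names and the priors tried are hypothetical.

```python
def world_value(n_hazards, cut_per_hazard=0.10):
    """Fraction of potential value left if each hazard independently cuts value by cut_per_hazard."""
    return (1 - cut_per_hazard) ** n_hazards


def robust_share_of_ev(p_fragile, n_hazards=100, cut_per_hazard=0.10):
    """Share of expected value coming from robust worlds (no hazards), given prior p_fragile."""
    fragile_ev = p_fragile * world_value(n_hazards, cut_per_hazard)
    robust_ev = (1 - p_fragile) * world_value(0, cut_per_hazard)
    return robust_ev / (robust_ev + fragile_ev)


ratio = world_value(0) / world_value(100)
print(f"robust / fragile value: {ratio:,.0f}")
for p in (0.5, 0.99, 0.9999):
    print(p, round(robust_share_of_ev(p), 4))
```

The exact ratio is 1 / 0.9^100, about 37,600, so 10,000x is a conservative round figure. Even at 99% prior confidence in fragility, robust worlds supply over 99% of the expected value.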

Similar arguments work if we aggregate across possible paths to achieving value within a fixed, known world---if there are several ways things can go well, some of which are more robust, those will drive almost all of the EV. And similarly for moral uncertainty (if there are several plausible views, the ones that consider this world a lost cause will instead spend their influence on other worlds) and so forth. I think it's a reasonably robust conclusion across many different frameworks: your decision shouldn't end up being dominated by some hugely conjunctive event.

Comment by paulfchristiano on A dilemma for prosaic AI alignment · 2019-12-19T20:37:49.009Z · score: 6 (4 votes) · LW · GW
In the case of something like amplification or debate, I think the bet that you're making is that language modeling alone is sufficient to get you everything you need in a competitive way.

I'm skeptical of language modeling being enough to be competitive, in the sense of maximizing "log prob of some naturally occurring data or human demonstrations." I don't have a strong view about whether you can get away with using only language data rather than e.g. taking images as input and producing motor torques as output.

I'm also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What's so bad if we use non-language data?

Comment by paulfchristiano on A dilemma for prosaic AI alignment · 2019-12-19T20:26:00.305Z · score: 7 (5 votes) · LW · GW

We could also ask: "Would AlphaStar remain as good as it is, if fine-tuned to answer questions?"

In either case it's an empirical question. I think the answer is probably yes if you do it carefully.

You could imagine separating this into two questions:

  • Is there a policy that plays Starcraft and answers questions that is only slightly larger than a policy for playing Starcraft alone? This is a key premise for the whole project. I think it's reasonably likely; the goal is only to answer questions the model "already knows," so it seems realistic to hope for only a constant amount of extra work to be able to use that knowledge to answer questions. I think most of the uncertainty here is about details of "know" and question-answering and so on.
  • Can you use joint optimization to find that policy with only slightly more training time? I think probably yes.

Comment by paulfchristiano on A dilemma for prosaic AI alignment · 2019-12-18T16:10:52.604Z · score: 9 (6 votes) · LW · GW

I normally imagine using joint training in these cases, rather than pre-training + fine-tuning. E.g., at every point in time we maintain an agent and a question-answerer, where the question-answerer "knows everything the agent knows." They get better together, with each gradient update affecting both of them, rather than first training a good agent and then adding a good question-answerer.

(Independently of concerns about mesa-optimization, I think the fine-tuning approach would have trouble because you couldn't use statistical regularities from the "main" objective to inform your answers to questions. Your question answers would therefore be dumber than the policy, so you couldn't get a good reward function or specification of catastrophically bad behavior.)
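A toy analogue of the joint training described above, offered only as an illustration of the structure: a shared trunk stands in for the knowledge the agent and the question-answerer share, and every SGD step uses the summed loss of both heads, so each update moves the shared weight for both. The scalar model, targets, and names are all hypothetical; nothing here reflects how a real agent would be trained.

```python
import random


def joint_train(steps=5000, lr=0.05, seed=0):
    """Toy joint training of an "agent" head and a "question-answerer" head on a shared trunk.

    Hypothetical setup: scalar input x, shared feature h = w * x, agent head p * h
    fit to 2 * x, answerer head r * h fit to -x. Each step takes the gradient of the
    summed squared error, so both heads update the shared weight w together.
    """
    rng = random.Random(seed)
    w, p, r = 1.0, 0.5, 0.5
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        h = w * x
        agent_err = p * h - 2.0 * x   # agent-head residual
        answer_err = r * h + x        # answerer-head residual
        # Gradients of agent_err**2 + answer_err**2, computed before any update.
        grad_p = 2 * agent_err * h
        grad_r = 2 * answer_err * h
        grad_w = 2 * (agent_err * p + answer_err * r) * x
        w -= lr * grad_w
        p -= lr * grad_p
        r -= lr * grad_r
    return w, p, r


w, p, r = joint_train()
print(p * w, r * w)  # effective slopes of the agent and answerer heads
```

The contrast with pre-training + fine-tuning would be to fit w and p first and then train only r with w frozen; in the joint version the trunk is shaped by both objectives from the start.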

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-15T17:12:51.088Z · score: 2 (1 votes) · LW · GW

My model doesn't distinguish much between mid and final: they have very similar MMR, the difference between them is pretty small in the scheme of things (e.g. probably smaller than the impact of doubling model size), and my picture isn't refined enough to appreciate those differences. For any particular dumb mistake I'd be surprised if the line between not making it and making it was in that particular doubling.

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-15T03:38:46.790Z · score: 2 (1 votes) · LW · GW

AGZ is only trained on the situations that actually arise in games it plays.

I agree with the point that "imitation learning from human games" will only make you play well on kinds of situations that arise in human games, and that self-play can do better by making you play well on a broader set of situations. You could also train on all the situations that arise in a bigger tree search (though AGZ did not) or against somewhat-random moves (which AGZ probably did).

(Though I don't see this as affecting the basic point.)

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-15T03:33:06.819Z · score: 6 (4 votes) · LW · GW

Thanks! That's only marginally less surprising than the final RL policy, and I suspect the final RL policy will make the same kind of mistake. Seems like the OP's example was legit and I overestimated the RL agent.

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T04:32:35.150Z · score: 8 (3 votes) · LW · GW

But to be clear: (i) it would then also be learned by imitating a large enough dataset from human players who did something like tree search internally while playing, and (ii) I think the tree search makes a quantitative, not qualitative, change, and it's not that big (mostly improved stability, and *maybe* a 10x speedup over self-play).

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T04:12:56.112Z · score: 19 (9 votes) · LW · GW
In one comical case, AlphaStar had surrounded the units it was building with its own factories so that they couldn't get out to reach the rest of the map. Rather than lifting the buildings to let the units out, which is possible for Terran, it destroyed one building and then immediately began rebuilding it before it could move the units out!

It seems like AlphaStar played 90 ladder matches as Terran:

  • 30 with the initial policy trained with SL
  • 30 with a policy from the middle of training
  • 30 with the final RL policy.

This sounds like the kind of mistake that the SL policy would definitely make (there's no reason it should be able to recover), whereas it's not clear whether RL would learn how to recover (I would expect it to, but not too strongly).

If it's easy for anyone to check and they care, it might be worth looking quickly through the replays and seeing whether this particular game was from the SL or RL policies. This is something I've been curious about since seeing the behavior posted on Reddit, and it would have a moderate impact on my understanding of AlphaStar's skill.

It looks like they released 90 replays and played 90 ladder games, so it should be possible to check.

The replays are here, hosted on the DM site and sorted into three folders based on the policy. If it's one of the SL matches, it's either AlphaStarSupervised_013_TvT.SC2Replay or one of _017_, _019_, or _022_ (based on being TvT and being on Kairos Junction). The video in question is here. I'd check if I had SC2 installed.

(Of course better still would be to find a discussion of the 30 RL replays, from someone who understands the game. Maybe that's been posted somewhere, I haven't looked and it's hard to know who to trust.)

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T03:55:36.981Z · score: 15 (6 votes) · LW · GW

I meant to ask about the policy network in AlphaZero directly. It plays at the professional level (the Nature paper puts it at a comparable Elo to Fan Hui) with no tree search, using a standard neural network architecture trained by supervised learning. It performs fine on parts of the search tree that never appeared during training. What distinguishes this kind of reasoning from "if I see X, I do Y"?

(ETA clarification, because I think this was probably the misunderstanding: the policy network plays Go with no tree search; tree search is only used to generate training data. That suggests the AlphaStar algorithm would produce similar behavior without ever using tree search, probably using at most 100x the compute of AlphaZero, and I'd be willing to bet on <10x.)
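A minimal sketch of the "search only generates training data" structure. The environment, the tabular policy, and all names are hypothetical, and an exhaustive one-step lookahead stands in for MCTS (AGZ trains toward MCTS visit distributions, not one-hot targets). Unlike a neural network, a table cannot generalize to unseen positions; the sketch only shows that acting needs no search once the targets have been distilled.

```python
import math

N_STATES, N_ACTIONS = 10, 3


def simulate(state, action):
    """Hypothetical toy environment model: reward 1 for the action equal to state % 3."""
    return 1.0 if action == state % N_ACTIONS else 0.0


def search(state):
    """Stand-in for tree search: try every action in the model (one-step lookahead)."""
    return max(range(N_ACTIONS), key=lambda a: simulate(state, a))


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Tabular policy logits; a real system would use a neural network instead.
logits = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
lr = 0.5
for _ in range(50):
    for s in range(N_STATES):
        target = search(s)  # search is used only here, to produce training targets
        probs = softmax(logits[s])
        for a in range(N_ACTIONS):
            # Gradient of the cross-entropy loss w.r.t. logit a for a one-hot target.
            logits[s][a] -= lr * (probs[a] - (1.0 if a == target else 0.0))


def play(state):
    """Acting uses the trained policy alone: no calls to search or simulate."""
    row = logits[state]
    return max(range(N_ACTIONS), key=lambda a: row[a])


print([play(s) for s in range(N_STATES)])
```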

From the outside, it looks like human-level play at Starcraft is more complicated (in a sense) than human-level play at Go, and so it's going to take bigger models in order to reach a similar level of performance. I don't see a plausible-looking distinction-in-principle that separates the strategy in Starcraft from strategy in Go.

Comment by paulfchristiano on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-02T15:05:53.593Z · score: 33 (12 votes) · LW · GW
Finally, RL practitioners have known that genuine causal reasoning could never be achieved via known RL architectures- you'd only ever get something that could execute the same policy as an agent that had reasoned that way, via a very expensive process of evolving away from dominated strategies at each step down the tree of move and countermove. It's the biggest known unknown on the way to AGI.

What's the argument here? Do you think that the AGZ policy (which is extremely good at Go or Chess even without any tree search) doesn't do any causal reasoning? That it only ever learns to play parts of the game tree it's seen during training? What does "genuine causal reasoning" even mean?

It looks to me like causal reasoning is just another type of computation, and that you could eventually find that computation by local search. If you need to use RL to guide that search then it's going to take a long time---AlphaStar was very expensive, and still only trained a policy with ~80M parameters.

From my perspective it seems like the big questions are just how large a policy you would need to train using existing methods in order to be competitive with a human (my best guess would be ~1 trillion to ~1 quadrillion parameters), and whether you can train it by copying rather than needing to use RL.
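To put that guess on the same scale as AlphaStar, here is the ratio between the guessed range and the ~80M-parameter policy mentioned above, expressed as a multiplier and as a number of model-size doublings. The function name is hypothetical, and the figures are taken at face value from the comment.

```python
import math

ALPHASTAR_PARAMS = 80e6  # "~80M parameters", the figure quoted above


def scale_up(target_params, base_params=ALPHASTAR_PARAMS):
    """Return (multiplier, number of doublings) needed to grow base_params to target_params."""
    factor = target_params / base_params
    return factor, math.log2(factor)


for label, target in [("~1 trillion", 1e12), ("~1 quadrillion", 1e15)]:
    factor, doublings = scale_up(target)
    print(f"{label}: {factor:,.0f}x, about {doublings:.1f} doublings")
```

On these figures, the guessed range sits roughly 14 to 24 doublings beyond AlphaStar's policy.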