Posts

Ways that China is surpassing the US 2019-11-04T09:45:53.881Z · score: 45 (16 votes)
List of resolved confusions about IDA 2019-09-30T20:03:10.506Z · score: 98 (33 votes)
Don't depend on others to ask for explanations 2019-09-18T19:12:56.145Z · score: 78 (25 votes)
Counterfactual Oracles = online supervised learning with random selection of training episodes 2019-09-10T08:29:08.143Z · score: 45 (12 votes)
AI Safety "Success Stories" 2019-09-07T02:54:15.003Z · score: 102 (30 votes)
Six AI Risk/Strategy Ideas 2019-08-27T00:40:38.672Z · score: 60 (29 votes)
Problems in AI Alignment that philosophers could potentially contribute to 2019-08-17T17:38:31.757Z · score: 80 (31 votes)
Forum participation as a research strategy 2019-07-30T18:09:48.524Z · score: 112 (37 votes)
On the purposes of decision theory research 2019-07-25T07:18:06.552Z · score: 65 (21 votes)
AGI will drastically increase economies of scale 2019-06-07T23:17:38.694Z · score: 41 (15 votes)
How to find a lost phone with dead battery, using Google Location History Takeout 2019-05-30T04:56:28.666Z · score: 52 (23 votes)
Where are people thinking and talking about global coordination for AI safety? 2019-05-22T06:24:02.425Z · score: 94 (32 votes)
"UDT2" and "against UD+ASSA" 2019-05-12T04:18:37.158Z · score: 48 (15 votes)
Disincentives for participating on LW/AF 2019-05-10T19:46:36.010Z · score: 79 (34 votes)
Strategic implications of AIs' ability to coordinate at low cost, for example by merging 2019-04-25T05:08:21.736Z · score: 49 (19 votes)
Please use real names, especially for Alignment Forum? 2019-03-29T02:54:20.812Z · score: 40 (13 votes)
The Main Sources of AI Risk? 2019-03-21T18:28:33.068Z · score: 69 (28 votes)
What's wrong with these analogies for understanding Informed Oversight and IDA? 2019-03-20T09:11:33.613Z · score: 39 (9 votes)
Three ways that "Sufficiently optimized agents appear coherent" can be false 2019-03-05T21:52:35.462Z · score: 69 (18 votes)
Why didn't Agoric Computing become popular? 2019-02-16T06:19:56.121Z · score: 53 (16 votes)
Some disjunctive reasons for urgency on AI risk 2019-02-15T20:43:17.340Z · score: 37 (10 votes)
Some Thoughts on Metaphilosophy 2019-02-10T00:28:29.482Z · score: 57 (16 votes)
The Argument from Philosophical Difficulty 2019-02-10T00:28:07.472Z · score: 47 (13 votes)
Why is so much discussion happening in private Google Docs? 2019-01-12T02:19:19.332Z · score: 86 (25 votes)
Two More Decision Theory Problems for Humans 2019-01-04T09:00:33.436Z · score: 58 (19 votes)
Two Neglected Problems in Human-AI Safety 2018-12-16T22:13:29.196Z · score: 77 (25 votes)
Three AI Safety Related Ideas 2018-12-13T21:32:25.415Z · score: 73 (26 votes)
Counterintuitive Comparative Advantage 2018-11-28T20:33:30.023Z · score: 76 (29 votes)
A general model of safety-oriented AI development 2018-06-11T21:00:02.670Z · score: 70 (23 votes)
Beyond Astronomical Waste 2018-06-07T21:04:44.630Z · score: 93 (41 votes)
Can corrigibility be learned safely? 2018-04-01T23:07:46.625Z · score: 75 (26 votes)
Multiplicity of "enlightenment" states and contemplative practices 2018-03-12T08:15:48.709Z · score: 99 (24 votes)
Online discussion is better than pre-publication peer review 2017-09-05T13:25:15.331Z · score: 18 (15 votes)
Examples of Superintelligence Risk (by Jeff Kaufman) 2017-07-15T16:03:58.336Z · score: 5 (5 votes)
Combining Prediction Technologies to Help Moderate Discussions 2016-12-08T00:19:35.854Z · score: 13 (14 votes)
[link] Baidu cheats in an AI contest in order to gain a 0.24% advantage 2015-06-06T06:39:44.990Z · score: 14 (13 votes)
Is the potential astronomical waste in our universe too small to care about? 2014-10-21T08:44:12.897Z · score: 25 (27 votes)
What is the difference between rationality and intelligence? 2014-08-13T11:19:53.062Z · score: 13 (13 votes)
Six Plausible Meta-Ethical Alternatives 2014-08-06T00:04:14.485Z · score: 47 (48 votes)
Look for the Next Tech Gold Rush? 2014-07-19T10:08:53.127Z · score: 39 (37 votes)
Outside View(s) and MIRI's FAI Endgame 2013-08-28T23:27:23.372Z · score: 16 (19 votes)
Three Approaches to "Friendliness" 2013-07-17T07:46:07.504Z · score: 20 (23 votes)
Normativity and Meta-Philosophy 2013-04-23T20:35:16.319Z · score: 12 (14 votes)
Outline of Possible Sources of Values 2013-01-18T00:14:49.866Z · score: 14 (16 votes)
How to signal curiosity? 2013-01-11T22:47:23.698Z · score: 21 (22 votes)
Morality Isn't Logical 2012-12-26T23:08:09.419Z · score: 19 (35 votes)
Beware Selective Nihilism 2012-12-20T18:53:05.496Z · score: 40 (44 votes)
Ontological Crisis in Humans 2012-12-18T17:32:39.150Z · score: 45 (49 votes)
Reasons for someone to "ignore" you 2012-10-08T19:50:36.426Z · score: 23 (24 votes)
"Hide comments in downvoted threads" is now active 2012-10-05T07:23:56.318Z · score: 18 (30 votes)

Comments

Comment by wei_dai on Arguing about housing · 2019-11-15T01:44:41.430Z · score: 7 (3 votes) · LW · GW

(Replying here even though it says "Comment via: facebook" because I prefer the LW platform and I don't know most of JK's Facebook friends. If that's frowned upon, please let me know.)

From this post, it's not clear who you are arguing against. (I checked your Facebook post and most people there seem to agree with you.) But my guess is that a lot of the reasons people give in public for their opposition to housing are not their actual reasons. Here's a paper that suggests what the real reasons may be for many: https://www.dartmouth.edu/~wfischel/Papers/00-04.PDF.

Abstract: An owner-occupied home is an unusual asset because it cannot be diversified among locations and because it is the only sizable asset that most owners possess. Among the uninsured risks of homeownership is devaluation by nearby changes in land use. Opponents of land-use change are called NIMBYs (“Not In My Back Yard”). This article submits that NIMBYism is a rational response to the uninsured risks of homeownership. It explores the possibilities and drawbacks of providing an insurance market to cover such risks. It concludes that some progress is being made towards developing such markets.

Comment by wei_dai on Robin Hanson on the futurist focus on AI · 2019-11-14T23:47:00.199Z · score: 5 (2 votes) · LW · GW

You previously wrote:

We do have some models of [boundedly] rational principals with perfectly rational agents, and those models don’t display huge added agency rents. If you want to claim that relative intelligence creates large agency problems, you should offer concrete models that show such an effect.

The conclusions of those models seem very counterintuitive to me. I think the most likely explanation is that they make some assumptions that I do not expect to apply to the default scenarios involving humans and AGI. To check this, can you please reference some of the models that you had in mind when you wrote the above? (This might also help people construct concrete models that they would consider more realistic.)

Comment by wei_dai on Insights from the randomness/ignorance model are genuine · 2019-11-14T03:03:41.879Z · score: 3 (1 votes) · LW · GW

I think this might be a situation where people tend to leave the debate and move on to something else when they seem to have found a satisfactory position

Well, not exactly: I came up with UDASSA originally but found it not entirely satisfactory, so I moved on to something that eventually came to be called UDT. I wrote down my reasons in "against UD+ASSA" and under Paul's post.

Perhaps it would be good to make this history more readily available to people looking for solutions to anthropic reasoning, though; let me know if you guys have suggestions on how to do that.

Comment by wei_dai on The Credit Assignment Problem · 2019-11-10T08:30:55.288Z · score: 5 (3 votes) · LW · GW

One part of it is that I want to scrap classical (“static”) decision theory and move to a more learning-theoretic (“dynamic”) view.

Can you explain more what you mean by this, especially "learning-theoretic"? I've looked at learning theory a bit and the typical setup seems to involve a loss or reward that is immediately observable to the learner, whereas in decision theory, utility can be over parts of the universe that you can't see and therefore can't get feedback from, so it seems hard to apply typical learning theory results to decision theory. I wonder if I'm missing the whole point though... What do you think are the core insights or ideas of learning theory that might be applicable to decision theory?

Comment by wei_dai on AI Alignment Open Thread October 2019 · 2019-11-10T07:59:02.101Z · score: 5 (2 votes) · LW · GW

First, this is more of a social coordination problem—I’m claiming that regular engineering practices allow you to notice when something is wrong before it has catastrophic consequences. You may not be able to solve them; in that case you need to have enough social coordination to no longer deploy them.

Ok, I think it makes sense to be more optimistic about transparency/interpretability allowing people to notice when something is wrong. My original complaint was about people seemingly being optimistic about using it to solve alignment, not just to notice when an AI isn't aligned. (I didn't state this clearly in my original comment, but the links I gave did go to posts where people seemed to be optimistic about "solving", not just "noticing".)

As I've argued before, I think a large part of solving social coordination is making sure that strategists and policy makers have correct beliefs about how difficult alignment is, which is why I was making this complaint in the first place.

Comment by wei_dai on Pricing externalities is not necessarily economically efficient · 2019-11-10T07:38:54.305Z · score: 14 (7 votes) · LW · GW

Suppose that, in a particular case, the pollution does $100,000 a year worth of damage and can be eliminated at a cost of only $80,000 a year (from here on, all costs are per year). Further assume that the cost of shifting all of the land down wind to a new use unaffected by the pollution—growing timber instead of renting out summer resorts, say—is only $50,000. If we impose an emission fee of a hundred thousand dollars a year, the steel mill stops polluting and the damage is eliminated—at a cost of $80,000. If we impose no emission fee the mill keeps polluting, the owners of the land stop advertising for tenants and plant trees instead, and the problem is again solved—at a cost of $50,000. In this case the result without Pigouvian taxes is efficient—the problem is eliminated at the lowest possible cost—and the result with Pigouvian taxes is inefficient.

This seems to be the only part of the article that argues for what the title says, "Pricing externalities is not necessarily economically efficient". However, it assumes that the Pigouvian tax is set to $100,000 rather than to the opportunity cost of the pollution (in this case $50,000), which would be more natural for anyone who understands that "cost" ought to mean "opportunity cost" in most situations when thinking about economics. So I think the title is misleading.
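To make the arithmetic explicit, here is a minimal sketch in Python using the numbers from the quoted example (my own illustration, not from the article; the emission fee itself is a transfer, so it does not count as a real resource cost):

```python
# Figures from the quoted example, in dollars per year.
DAMAGE = 100_000      # pollution damage if the downwind land keeps its current use
ABATEMENT = 80_000    # cost for the steel mill to stop polluting
RELOCATION = 50_000   # cost for landowners to switch to a pollution-proof use (timber)

def outcome(emission_fee):
    """Return (description, total social cost per year) for a given emission fee."""
    # The mill keeps polluting iff paying the fee is cheaper than abating.
    mill_pollutes = emission_fee < ABATEMENT
    if not mill_pollutes:
        return "mill abates", ABATEMENT
    # If the mill pollutes, landowners switch land use iff that is cheaper than the damage.
    if RELOCATION < DAMAGE:
        return "mill pollutes, landowners switch use", RELOCATION
    return "mill pollutes, landowners absorb damage", DAMAGE

for fee in (0, 50_000, 100_000):
    description, cost = outcome(fee)
    print(f"fee ${fee:>7,}: {description}, total social cost ${cost:,}")

# fee $      0: mill pollutes, landowners switch use, total social cost $50,000
# fee $ 50,000: mill pollutes, landowners switch use, total social cost $50,000
# fee $100,000: mill abates, total social cost $80,000
```

Setting the fee to the $50,000 opportunity cost rather than the $100,000 gross damage would preserve the efficient outcome, which is the point of the objection above.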

Also, I think Coase’s choice of the term "transaction costs" tends to give people the wrong impression about what's preventing markets from solving externalities, because it seems to suggest that the costs involved in physically doing a transaction are the main problem, and that if markets could be made more frictionless they would automatically solve externalities. In actuality, "transaction costs" include losses from inefficient bargaining (i.e., lost opportunities for mutually beneficial deals) caused by asymmetric information, which (until we invent mind readers or lie detectors) is an inevitable part of how voluntary transactions work in general, so this fundamentally limits how much the problem of externalities can be solved via voluntary transactions.

One might ask why asymmetric information isn't an issue in regular markets, and the answer is that it is (it's responsible for the deadweight loss of a monopoly, for example), but the problem is much less serious in competitive markets, where having private information about (for example) the cost of making something isn't very relevant, because if you set your price above that, someone else will underbid you anyway.

Another question one could ask is: why isn't asymmetric information a problem for policy makers setting a Pigouvian tax? The answer there is that the policy maker only needs to estimate the total cost (or benefit) of an externality, whereas the market / voluntary-transactions approach (to be efficient) requires the true cost/benefit for each affected individual to be revealed (which can't happen, because that's not incentive compatible). See my "government funding of information goods based on measured value" for a concrete example of this difference.

Comment by wei_dai on Ways that China is surpassing the US · 2019-11-08T19:50:08.920Z · score: 5 (2 votes) · LW · GW

"travel blog by someone I don’t know" seems a bit too dismissive, but I take your point that much of the article is based on one person's observations. (Or two if you count my endorsement of it based on my observations, and I perhaps should have mentioned that I stayed in China for 6 weeks on my trip, and was able to see more than the typical tourist stuff.) It does talk about "A 2017 Ipsos survey of almost 20,000 people reported that 87% of respondents from China believe that the country is heading in the right direction. Compare that to 43% of respondents in the U.S. and an average of 40% among all of the countries surveyed." and "hard results speak for themselves in terms of shipping traffic, passenger-miles on high-speed rail, and tons of steel produced and concrete poured." The latter statistics are not directly cited but are easy enough to look up.

I do have a bunch of uncertainty about how much of the "good governance" in China is just appearances and how much is real, and perhaps should have conveyed that more in the OP. Some of it is also based on my recent experiences with local politics and government in the US, which kind of shocked me as to how pathological and dysfunctional they are, and it's possible that I overcorrected based on that.

More data and facts are obviously welcome, but it seems hard to measure governance in any way that doesn't depend on human judgment, so again I don't think it's a good idea to dismiss the article based on that, unless someone actually did do a more objective study and the result contradicts the article's conclusions. I'm also actively trying to seek out contrary perspectives - see this comment I posted on EA Forum.

Comment by wei_dai on [Team Update] Why we spent Q3 optimizing for karma · 2019-11-08T06:39:45.879Z · score: 8 (4 votes) · LW · GW

Due to changes in how karma is computed introduced with LW2, we can’t compare our karma metric backwards in time. Fortunately, the number of votes is a good proxy.

I'm not sure about this. At least for me personally, I feel like voting is more costly on LW2 than on LW1, and I probably vote substantially less as a result. (Not totally sure because I haven't kept statistics on my own voting behavior.) The reasons are:

  1. Having to decide between strong vs weak vote.
  2. Having a high enough karma that my vote strengths (3 for weak and 10 for strong) are pretty identifiable, so I have to think more about social implications. (Maybe I shouldn't, but I do.)
  3. Sometimes I'm uncomfortable voting something up or down by at least 3 points because I'm not sure of my judgement of its quality.

Hmm, on second thought the number of people in my position is probably small enough that this isn't likely to significantly affect your "number of votes" comparison. I'll leave this here anyway as general feedback on the voting system. (To be clear I'm not advocating to change the current system, just offering a data point.)

Another thing I've been wondering about is, there's generally less voting per post/comment on LW2 than on LW1, but the karma on comparable posts seems more similar. Could it be that people have inherited their sense of how much karma different kinds of posts/comments "deserve" from LW1 and tend to stop voting up a post once it reaches that amount, which would result in similar karma but fewer votes?

Comment by wei_dai on AI Alignment Open Thread October 2019 · 2019-11-06T19:49:33.470Z · score: 5 (2 votes) · LW · GW

Hmm, I think I would make the further claim that in this world regular engineering practices are likely to work well, because they usually work well.

What about AIs as deployed in social media, which many people think are pushing discourse in bad directions, but which remain deployed anyway due to lack of technical solutions and economic competition? Aside from "single failure meant that we lose", the failure scenario I usually have in mind is that AI safety/alignment is too slow to be developed or too costly to use, but more and more capable AIs get deployed anyway due to competitive pressures, and they slowly or quickly push human civilization off the rails, in any number of ways. In this case, WRT interpretability, I was complaining that having humans look at reverse-compiled neural networks and give "feedback on process" as part of ML training seems impractically expensive.

Comment by wei_dai on Ways that China is surpassing the US · 2019-11-06T19:10:38.012Z · score: 7 (3 votes) · LW · GW

I found an article, Devin Helton's Democracy versus Autocracy: A False Dichotomy, that isn't directly about modern China but does suggest that I shouldn't have been as surprised by its level of governance relative to the US as I was. I still really want a good gears-level model of how politics and government work in China and how that level of governance is achieved, if anyone has good resources for that.

Comment by wei_dai on Rohin Shah on reasons for AI optimism · 2019-11-06T19:01:40.030Z · score: 3 (1 votes) · LW · GW

I agree that if you just had one leader with absolute power then it probably won't work, and that kind of government probably isn't included in the author's "many forms of government will work quite well". I think what he probably has in mind are governments that look authoritarian from the outside but still have some kind of internal politics/checks-and-balances that can keep the top leader(s) from going off the rails. I wish I had a good gears-level model of how that kind of government/politics works though. I do suspect that "work quite well" might be fragile/temporary and dependent on the top leaders not trying very hard to take absolute power for themselves, but I'm very uncertain about this due to lack of knowledge and expertise.

Comment by wei_dai on Normative reductionism · 2019-11-06T10:25:40.925Z · score: 5 (2 votes) · LW · GW

Here’s a concept that seems useful, but that I don’t remember ever hearing explicitly referred to

I've talked about it before, and called it "independence" (after the Axiom of Independence in decision theory, which I think is analogous or related), and I think the word "utility" (in the sense of utilitarianism) is also referring to this. I'll just quote the whole comment:

The VNM-stuff is about decision theory. The preference aggregation stuff is about moral philosophy. Those should be completely firewalled. There is no value to a superconcept that crosses that boundary.

But surely the intuition that value ought to be aggregated linearly across “possible outcomes” is related to the intuition that value ought to be aggregated linearly across “individuals”? I think it basically comes down to independence: how much something (a lottery over possible outcomes / a set of individuals) is valued should be independent of other things (other parts of the total probabilistic mixture over outcomes / other individuals who exist).

When framed this way, the two problems in decision theory and moral philosophy can be merged together as the question of “where should one draw the boundary between things that are valued independently?” and the general notion of “utility” as “representation of preference that can be evaluated on certain objects independently of others and then aggregated linearly” does seem to have value.
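To make the parallel explicit (my own notation, not part of the quoted comment): expected utility aggregates value linearly over outcomes, and total utilitarianism aggregates it linearly over individuals,

$$U(\text{lottery}) = \sum_i p_i \, u(o_i), \qquad V(\text{population}) = \sum_j v(\text{individual}_j),$$

and in both sums the term contributed by each outcome or individual is independent of the rest of the mixture and of who else exists, which is exactly the "independence" being pointed at.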

My post The Moral Status of Independent Identical Copies also talked about this and gave an example where "independence" or "normative reductionism" seems wrong or incompatible with our intuitions.

ETA: On second thought both "independence" and "utility" are too potentially confusing (i.e., ambiguous or misleading) to use to specifically refer to "The value of a world history is equal to the value of its parts (for some definition of relevant parts)." I think I do like "Aggregative Consequentialism" or "Total Consequentialism" for this.

Comment by wei_dai on Open & Welcome Thread - November 2019 · 2019-11-06T09:19:50.419Z · score: 10 (4 votes) · LW · GW

Upvoted. I'm pretty mystified and am very curious about the answers myself.

Comment by wei_dai on Rohin Shah on reasons for AI optimism · 2019-11-06T09:01:48.123Z · score: 20 (4 votes) · LW · GW

I do lean closer to the stance of “whatever we decide based on some ‘reasonable’ reflection process is good”, which seems to encompass a wide range of futures, and seems likely to me to happen by default.

I think I disagree pretty strongly, and this is likely an important crux. Would you be willing to read a couple of articles that point to what I think is convincing contrary evidence? (As you read the first article, consider what would have happened if the people involved had access to AI-enabled commitment or mind-modification technologies.)

If these articles don't cause you to update, can you explain why? For example do you think it would be fairly easy to design reflection/deliberation processes that would avoid these pathologies? What about future ones we don't yet foresee?

Comment by wei_dai on Will transparency help catch deception? Perhaps not · 2019-11-05T07:44:26.345Z · score: 7 (3 votes) · LW · GW

It seems like you're now putting most of your hope on training a deception checker end-to-end, but that seems highly non-trivial to me. Can you talk about how you hope to do that? For example, what training data would you use, and why do you think the deception checker will be able to correctly generalize from the training examples to the real models that it will be applied to?

Comment by wei_dai on AI Alignment Open Thread October 2019 · 2019-11-05T04:50:27.985Z · score: 6 (3 votes) · LW · GW

If you instead expect gradual takeoff, then it seems reasonable to expect that regular engineering practices are the sort of thing you want, of which interpretability / transparency tools are probably the most obvious thing you want to try.

I support work on interpretability/transparency, in part because I'm uncertain about discontinuous vs gradual takeoff, and in part because I'm not very optimistic about any other AI safety approach either and think we probably just need to try a whole bunch of different approaches that each have low probability of success in the hope that something (or some combination of things) works out in the end. My point was that I find the stories people tell about why they are optimistic (e.g., reverse compiling a neural network into human readable code and then using that to generate human feedback on the model's decision-making process) to be very questionable.

Comment by wei_dai on AI Alignment Open Thread October 2019 · 2019-11-05T03:59:49.210Z · score: 5 (2 votes) · LW · GW

Perhaps more importantly though, we shouldn’t expect by default that in the X% of times where our guess is bad, that the neural network is adversarially optimizing against us.

I'm confused because the post you made one day after this comment seems to argue the opposite of this. Did you change your mind in between, or am I missing something?

Comment by wei_dai on Rohin Shah on reasons for AI optimism · 2019-11-04T20:54:04.337Z · score: 5 (2 votes) · LW · GW

[Note: In this interview, Rohin was only considering risks arising because of AI systems that try to optimize for goals that are not our own, not other forms of existential risks from AI.]

"goals that are not our own" is ambiguous to me. Does it include a goal that someone currently thinks they have or behaves as if they have, but isn't really part of their "actual" values? Does it include a goal that someone gets talked into by a superintelligent AI? Are you including risks that come from AI not being value-neutral, in other words, the AI being better at optimizing for some kinds of values over others, to the extent that the future is dominated by the the goals of a small group of humans?

Also, you've been using "adversarial optimization" a lot in this thread but a search on this site doesn't show you as having defined or used it before, except in https://www.lesswrong.com/posts/9mscdgJ7ao3vbbrjs/an-70-agents-that-help-humans-who-are-still-learning-about but that part wasn't even written by you so I'm not sure if you mean the same thing by it. If you have defined it somewhere, can you please link to it? (I suspect there may be some illusion of transparency going on where you think terms like "adversarial optimization" and "goals that are not our own" have clear and obvious meanings...)

I mostly want to punt on this question, because I’m confused about what “actual” values are. I could imagine operationalizations where I’d say > 90% chance (e.g. if our “actual” values are the exact thing we would settle on after a specific kind of reflection that we may not know about right now), and others where I’d assign ~0% chance (e.g. the extremes of a moral anti-realist view).

I think even with extreme moral anti-realism, there's still a significant risk that AIs could learn values that are wrong enough (i.e., different enough from our values, or otherwise misaligned enough) to cause an existential-level bad outcome, but not human extinction. Can you confirm that you really endorse the ~0% figure?

I expect that technical solutions are primarily important only for the narrow kind of AI risk (I’m more optimistic about social coordination for the general kind). So when I’m asked a question positing “without additional intervention by us doing safety research”, I tend to think of adversarial optimization, since that’s what I expect to be addressed by safety research.

Can I convince you that you should be uncertain enough about this, and that enough other people disagree with you about this (in particular that social coordination may be hard enough that we should try to solve a wider kind of AI risk via technical means), that more careful language to distinguish between different kinds of risk and different kinds of research would be a good idea to facilitate thinking and discussion? (I take your point that you weren't expecting this interview to be made public, so I'm just trying to build a consensus about what should ideally happen in the future.)

Comment by wei_dai on But exactly how complex and fragile? · 2019-11-04T19:35:50.313Z · score: 3 (1 votes) · LW · GW

One obvious response is “but what about adversarial examples”; my position is that image datasets are not rich enough for ML to learn the human-desired concepts; the concepts they do learn are predictive, just not about things we care about.

To clarify, are you saying that if we had a rich enough dataset, the concepts they learn would be things we care about? If so, what is this based on, and how rich of a dataset do you think we would need? If not, can you explain more what you mean?

Comment by wei_dai on Elon Musk is wrong: Robotaxis are stupid. We need standardized rented autonomous tugs to move customized owned unpowered wagons. · 2019-11-04T19:13:18.926Z · score: 7 (2 votes) · LW · GW
  • I'm guessing robotaxis will come first, just because it's easier and requires less coordination between multiple parties, and then the demand for movable lockers will probably be satisfied by, well, movable containers that can go inside the robotaxis and use "AI" to follow people around when outside of the taxis, or maybe initially just containers that people can push around.
  • If "take out your stuff every time you leave and check whether the previous user did every time you enter" turns out to be a big issue (don't people already do this all the time in regular taxis?) the robotaxis could have a narrow AI to scan and detect items that have been left behind, and remind you before you leave.
  • "interference by users" can be handled the same way.
  • Allergy to perfumes seems too rare to matter much, and could be handled by building a HEPA filter into the robotaxis.
  • "A lot of complexity is removed from production, so these are much cheaper to build than robotaxis." But this is actually a "con" for the tugs/wagons model because this saving is more than made up for by the additional wagons that would have to be built.
  • All in all it doesn't seem like your model offers enough additional value on top of the default scenario to overcome the additional coordination and other costs.
Comment by wei_dai on Ways that China is surpassing the US · 2019-11-04T10:12:53.810Z · score: 2 (2 votes) · LW · GW

Sorry about the "America vs. China" framing, but it's more about comparing/contrasting than tribalism, I think both for me and for the author of the article. So maybe you can try to look past that and consider the rest of the content and the questions I posted?

Comment by wei_dai on But exactly how complex and fragile? · 2019-11-04T06:03:32.939Z · score: 15 (4 votes) · LW · GW

I wonder if Paul Christiano ever wrote down his take on this, because he seems to agree with Eliezer that using ML to directly learn and optimize for human values will be disastrous, and I'm guessing that his reasons/arguments would probably be especially relevant to people like Katja Grace, Joshua Achiam, and Dario Amodei.

I myself am somewhat fuzzy/confused/not entirely convinced about the "complex/fragile" argument and even wrote kind of a counter-argument a while ago. I think my current worries about value learning or specification have less to do with the "complex/fragile" argument and more to do with what might be called "ignorance of values" (to give it an equally pithy name), which is that humans just don't know what our real values are (especially when applied to unfamiliar situations that will come up in the future), so how can AI designers specify them, or how can AIs learn them?

People try to get around this by talking about learning meta-preferences, e.g., preferences for how to deliberate about values, but that's not some "values" that we already have and that the AI can just learn; instead it's a big (and I think very hard) philosophical and social science/engineering project to try to figure out what kinds of deliberation would be better than other kinds, or would be good enough to eventually lead to good outcomes. (ETA: See also this comment.)

It’s not obvious to me that imperfectly aligned AI is likely to be worse than the currently misaligned processes, and even that it won’t be a net boon for the side of alignment.

My own worry is less that "imperfectly aligned AI is likely to be worse than the currently misaligned processes" and more that the advent of AGI might be the last good chance for humanity to get alignment right (including addressing the "human safety problem"), and if we don't do a good enough job (even if we improve on the current situation in some sense), we'll be largely stuck with the remaining misalignment because there won't be another opportunity like it. ETA: A good slogan for this might be "AI risk as the risk of missed opportunity".

This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.

I'm not entirely sure I understand this sentence, but this post might be relevant here: https://www.lesswrong.com/posts/Qz6w4GYZpgeDp6ATB/beyond-astronomical-waste.

Comment by wei_dai on AI Alignment Open Thread October 2019 · 2019-11-03T18:54:28.622Z · score: 16 (7 votes) · LW · GW

I feel like there's currently a wave of optimism among some AI safety researchers around transparency/interpretability, and to me it looks like another case of "optimism by default + not thinking things through", analogous to how many people, such as Eliezer, were initially very optimistic about AGI being beneficial when they first thought of the idea. I find myself asking the same skeptical questions to different people who are optimistic about transparency/interpretability and not really getting good answers. Anyone want to try to convince me that I'm wrong about this?

Comment by wei_dai on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-02T08:15:52.529Z · score: 9 (5 votes) · LW · GW

My model of the world doesn't find this kind of thing very surprising, due to previous reports like this and this, and just on theoretical grounds. I do wonder if this causes anyone who is more optimistic about x-risk to update though.

On the other hand, the information warfare seems to be pitched at a level that people like us can ultimately rise above. So for example the misleading APM comparison was quickly detected (and probably wasn't aimed at people like us in the first place) and this analysis of AlphaStar eventually came out (and many of us probably already had similar but unarticulated suspicions). So maybe that's a silver lining, depending on how you expect the world to be "saved"?

Comment by wei_dai on Rohin Shah on reasons for AI optimism · 2019-11-01T21:41:22.331Z · score: 5 (2 votes) · LW · GW

There’s a note early in the transcript that says that basically everything I say in the interview is about adversarial optimization against humans only, which includes the 90% figure.

Can you quote that please? I can't find it even with this clue. "adversarial optimization against humans" is also ambiguous to me and I wonder if the original language was clearer or included a longer explanation. (ETA: E.g., is it meant to exclude someone deliberately using AI to cause human extinction? Or are you contrasting "adversarial optimization against humans" with something else, like AI causing dangerous technologies to be developed faster than ways of safeguarding them?)

even without any additional intervention from current longtermists, advanced AI systems will not cause human extinction by adversarially optimizing against humans

This is the best operationalization of the ones you’ve listed.

Ok, I'm curious how likely you think it is that an (existential-level) bad outcome happens due to AI by default, without involving human extinction. (As an example of what I mean, the default development of AI causes human values to be corrupted or just locked in or frozen, in a way that would cause enormous regret if we found out what our "actual" values are.)

ETA: Also, what was your motivation for talking about a fairly narrow kind of AI risk, when the interviewer started with a more general notion?

Comment by wei_dai on Chris Olah’s views on AGI safety · 2019-11-01T21:14:43.406Z · score: 13 (7 votes) · LW · GW

The model that Chris has here is something like a reverse compilation process that turns a neural network into human-understandable code. Chris notes that the resulting code might be truly gigantic—e.g. the entire Linux kernel—but that it would be faithful to the model and understandable by humans.

Does "faithful" mean "100% identical in terms of I/O", or more like "captures all of the important elements of"? My understanding is that neural networks are continuous whereas human-understandable code like the Linux kernel are discrete, so it seemingly just can't work in the former case, and I'm not sure how it can work in the latter case either.

Do you or Chris think that a test of this might be to take a toy model (say a 100-neuron ANN) that solves some toy problem, and see if it can be reverse compiled? (Or let me know if this has already been done.) If not, what's the earliest meaningful test that can be done?

I'm also concerned that combining ML, reverse compilation, and "giving feedback on process" essentially equals programming by nudging, which just seems like a really inefficient way of programming. ETA: Is there an explanation of why this kind of ML would be better (in any sense of that word) than someone starting with a random piece of code and trying to end up with an AI by modifying it a little bit at a time?

ETA2: I wonder if Chris is assuming some future ML technique that learns a lot faster (i.e., is much more sample efficient) than what we have today, so that humans wouldn't have to give a lot of feedback on process, and "programming by nudging" wouldn't be a good analogy anymore.

Comment by wei_dai on Toon Alfrink's sketchpad · 2019-11-01T19:51:38.277Z · score: 3 (1 votes) · LW · GW

This is what social justice means with “oppression”. You’re not actively pushing anyone down, but you are doing so passively, because you haven’t resolved your “ignorance”.

Do you endorse this way of using "oppression"? I'm going to argue against this, with the caveat that you may just be describing a phenomenon and not endorsing it.

I think to most people (prior to SJ) "oppression" involves intent or "actively pushing" and therefore is especially morally blameworthy. Using the same word to now describe situations where something bad is happening because of ignorance instead of intent is sneaking in connotations. I think if someone endorses doing that, they're fundamentally trying to do a power grab instead of good epistemics (i.e., by surreptitiously making people more outraged than they would really endorse under reflection).

ETA: On second thought this may have come off too strongly, because some of it is just humans having to use discrete words to describe a continuous situation, where there is a continuous spectrum in how much active versus passive pushing is happening in any given situation (e.g., somebody could be doing things that look a lot like "active pushing" due to ignorance or an epistemic error, people could be subconsciously trying to avoid fixing their ignorance, etc.), as well as how certain an observer is about what is happening. I would say there is still some element of power grabbing that is happening though.

Comment by wei_dai on Toon Alfrink's sketchpad · 2019-11-01T19:51:21.300Z · score: 3 (1 votes) · LW · GW

Ok, I think that makes sense.

Comment by wei_dai on orthonormal's Shortform · 2019-11-01T19:27:58.201Z · score: 12 (5 votes) · LW · GW

This is the clearest and most insightful analysis of AlphaStar I've seen and IMO really should be a top-level post.

Comment by wei_dai on Rohin Shah on reasons for AI optimism · 2019-11-01T19:20:00.619Z · score: 6 (3 votes) · LW · GW

I collected some AI risk estimates in this EA forum comment and also made the complaint that it's hard to compare existing statements about AI risk because everyone seems to include different things in the "risk" they're estimating.

Here, the 90% figure originally came in response to the question "what do you think is the credence that by default things go well, without additional intervention by us doing safety research or something like that?" but then it gets reported here in the OP as "chance that AI systems will be safe without additional intervention" and elsewhere as "even without any additional intervention from current longtermists, advanced AI systems will not cause human extinction by adversarially optimizing against humans".

(I haven't read the whole transcript yet, so I'm not sure whether Rohin says or implies that 90% applies to "human extinction by adversarially optimizing against humans" somewhere, but I couldn't find it with some CTRL-F searching. Also, if 90% does apply to all of these definitions of "risk", then it would imply ~0% chance that AI is unsafe or leads to a bad outcome in a way that doesn't involve human extinction, which seems way too strong to me.)

I'm bringing this up again here in the hope that people will be more sensitive about different ways of describing or defining AI risk, and also maybe some organization will read this and decide to do some kind of well-considered survey to collect people's estimates in a way that's easy to compare.

Comment by wei_dai on The Simulation Epiphany Problem · 2019-11-01T03:13:33.676Z · score: 10 (5 votes) · LW · GW

You missed a crucial point of the post, which is that when the AI does a simulation to consider the consequences of some action that the AI normally wouldn't do, observing that action is itself a clue that SimDave is being simulated. Here's the relevant part from the OP:

So Dave has just asked PAL to get him a cup of coffee. Dave is used to seeing PAL take route A to the coffee machine, and is initially puzzled because PAL is driving along route B. But then Dave has an epiphany. Dave knows with very high certainty that no PAL computer has ever made a mistake, so he can conclude with equally high certainty that he is no longer Dave. He is [Dave], a simulated version of Dave created inside PAL while it is computing the utility of taking route B.

Comment by wei_dai on Rohin Shah on reasons for AI optimism · 2019-11-01T02:48:43.431Z · score: 17 (6 votes) · LW · GW

Can you please make the audio recording of this available? (Or let me know if you can send it to me directly.) I find it taxing to read interview transcripts but have gotten used to listening to interviews (on podcasts for example).

Comment by wei_dai on Toon Alfrink's sketchpad · 2019-10-31T21:20:38.682Z · score: 3 (1 votes) · LW · GW

Optimal finance means optimal allocation of money across your life, regardless of when you earn it.

That's part of it, but there's also coordination between people (e.g., investors coming together to finance a capital-intensive business that no single person can afford to fund) and managing risk and incentives (e.g., a sole proprietorship has better incentives but worse risk characteristics compared to a company with many shareholders, a company issuing both stocks and bonds so investors can make their own risk/reward tradeoff, etc.).

I think maybe something like "finance is about how a group of people can cooperate to pursue EU maximization over time, given some initial endowment of assets and liabilities" would capture most of it?

Comment by wei_dai on Is requires ought · 2019-10-31T04:06:34.293Z · score: 5 (2 votes) · LW · GW

The assumption "You don’t know how long the company has been around or where in the chain you are" seems unrealistic/contrived, much like the assumption of "synchronized RL" in your previous argument. Again this seems like it's not going to be very convincing to a skeptic, at least without, for example, a further argument for why the assumption actually makes sense on some deeper level.

Aside from that, here's a counter-argument: among all fields of research, math is probably one of the hardest to corrupt, because publishing theorems requires proofs, which can be checked relatively easily, and if frauds/errors (false theorems) creep into the literature anyway, eventually a contradiction will be derived and the field will know something went wrong and backtrack to find the problem. If fear of acausally corrupting the current state of the field is the main reason for refraining from doing fraud, then math ought to have a higher amount of fraud relative to other fields, but actually the opposite is true (AFAICT).

Comment by wei_dai on Maybe Lying Doesn't Exist · 2019-10-30T06:22:11.408Z · score: 6 (3 votes) · LW · GW

I think the embedded agency people are working on this?—I’m afraid I’m not up-to-date on the details. But when I object to people making appeals to consequences, the thing I’m objecting to is never people trying to do a sophisticated embedded-agency thing; I’m objecting to people trying to get away with choosing to be biased.

In that case, maybe you can clarify (in this or future posts) that you're not against doing sophisticated embedded-agency things? Also, can you give some examples of what you're objecting to, so I can judge for myself whether they're actually doing sophisticated embedded-agency things?

Appeals to consequences tend to diverge (because everyone has her own idiosyncratic favored consequence); “just make the map be accurate” is a natural focal point (because the truth is generally useful to everyone).

This just means that in most cases, appeals to consequences won't move others much, even if they took such consequences into consideration. It doesn't seem to be a reason for people to refuse to consider such appeals at all. If appeals to consequences only tend to diverge, it seems a good idea to actually consider such appeals, so that in the rare cases where people's interests converge, they can be moved by such appeals.

So, I have to say that I still don't understand why you're taking the position that you are. If you have a longer version of the "story" that you can tell, please consider doing that.

Comment by wei_dai on Is requires ought · 2019-10-30T06:12:15.603Z · score: 5 (2 votes) · LW · GW

This is an issue of EDT having problems, I wrote about this problem and a possible solution here.

That post seems to be trying to solve a different problem (it still assumes that the agent knows its own source code, AFAICT). Can you please re-read what I wrote and if that post really is addressing the same problem, explain how?

I am. Consider tragedy of the commons which is simpler. If there are many on-policy RL agents that are playing tragedy of the commons and are synchronized with each other (so they always take the same action, including exploration actions) then they can notice that they expect less utility when they defect than when they cooperate.

I see, but the synchronization seems rather contrived. To the extent that humans are RL agents, our learning algorithms are not synchronized (and defecting in tragedy of the commons happens very often as a result), so why is synchronized RL relevant? I don't see how this is supposed to help convince a skeptic.

Comment by wei_dai on Is requires ought · 2019-10-30T04:47:52.519Z · score: 5 (2 votes) · LW · GW

If your world model represents random variables such as “the action I will take in 1 second” then condition on that random variable being some value.

I don't think that works, especially for the kind of purpose you have in mind. For example, suppose I'm in a situation where I'm pretty sure the normative/correct action is A, but due to things like cosmic rays I have some chance of choosing B. Then if I condition on "the action I will take in 1 second is B", I will mostly be conditioning on choosing B due to things like cosmic rays, which would be very different from conditioning on "source code X outputs action B".
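To spell out the worry in symbols (my own notation, not from the original discussion), the concern is that

$$\mathbb{E}[U \mid A = B] \;\neq\; \mathbb{E}[U \mid \text{``source code } X \text{ outputs } B\text{''}],$$

because conditional on the observation $A = B$, most of the probability mass comes from worlds where B was chosen by accident (cosmic rays and the like) rather than from worlds where B is the deliberate output of my decision procedure.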

It wouldn’t be hard to code up a reinforcement learning agent based on EDT (that’s essentially what on-policy learning is), which isn’t EDT proper due to not having a world model, but which strongly suggests that EDT is coherent.

Can you explain what the connection between on-policy learning and EDT is? (And you're not suggesting that an on-policy learning algorithm would directly produce an agent that would refrain from mathematical fraud for the kind of reason you give, or something analogous to that, right?)

The relevant question is how “mathematical truth” ends up seeming like a terminal value to so many; it’s unlikely to be baked in, it’s likely to be some Schelling point reached through a combination of priors and cultural learning.

It seems like truth and beauty are directly baked in and maybe there's some learning involved for picking out or settling on what kinds of truth and beauty to value as a culture. But I'm not seeing how this supports your position.

Comment by wei_dai on Is requires ought · 2019-10-30T02:40:51.472Z · score: 10 (2 votes) · LW · GW

Additionally, if others are behaving similarly to me, then my own decision to corrupt the process is evidence that others also decide to corrupt the process. Some of these others are in the past; my own decision to corrupt the process is evidence that my own mathematical knowledge is false, as it is evidence that those before me have decided similarly. So, my own mathematical “is” claims rely on myself following mathematical “ought” claims.

(More precisely, both evidential decision theory and functional decision theory have a notion by which present decisions can have past consequences, including past consequences affecting the accuracy of presently-available information)

Not sure how much you're relying on this for your overall point, but I'm skeptical of this kind of application of decision theory.

  1. I don't know how to formalize the decision theory math for humans. According to EDT, "the best action is the one which, conditional on one's having chosen it, gives one the best expectations for the outcome", but what does it actually mean to condition on "one's having chosen it"? UDT assumes that the agent knows their source code and can condition on "source code X outputs action/policy Y", but this is obviously not possible for humans, and I don't know what the analogous thing is for humans.
  2. My guess is that mathematicians typically refrain from conducting mathematical fraud due to a combination of fearing the consequences of being caught, and having mathematical truth (for themselves and others) as something like a terminal value, and not due to this kind of decision theoretic reasoning. If almost no one used this kind of decision theoretic reasoning to make this kind of decision in the past, my current thought process has few other instances to "logically correlate" with (at least as far as the past and the present are concerned).
Comment by wei_dai on Deliberation as a method to find the "actual preferences" of humans · 2019-10-30T01:54:42.408Z · score: 8 (2 votes) · LW · GW

I think my main confusion is that Paul talks about many different ways deliberation could work (e.g. RL-based IDA and human-in-the-counterfactual-loop seem pretty different), and it’s not clear what approach he thinks is most plausible.

I have similar questions, and I'm not sure how much of it is that Paul is uncertain himself, and how much is Paul not having communicated his thinking yet. Also one thing to keep in mind is that different forms of deliberation could be used at different levels of the system, so for example one method can be used to model/emulate/extrapolate the overseer's deliberation and another one for the end-user.

On a more general note, I'm really worried that we don't have much understanding of how or why human deliberation can lead to good outcomes in the long run. It seems clear that an individual human deliberating in isolation is highly likely to get stuck or go off the rails, and groups of humans often do so as well. To the extent that we as a global civilization seemingly are able to make progress in the very long run, it seems at best a fragile process, which we don't know how to reliably preserve, or reproduce in an artificial setting.

Comment by wei_dai on Social Class · 2019-10-20T15:42:51.391Z · score: 5 (2 votes) · LW · GW

I'm curious about the information diet of the upper class. Do they have blogs or forums, or periodicals and books written for each other?

Comment by wei_dai on What's your big idea? · 2019-10-20T15:38:34.647Z · score: 11 (6 votes) · LW · GW

There was some previous discussion of this idea in Modest Superintelligences and its comments. I'm guessing nobody is doing it due to a combination of weirdness, political correctness, and short-term thinking. This would require a government effort and no government can spend this much resources on a project that won't have any visible benefits for at least a decade or two, and is also weird and politically incorrect.

Comment by wei_dai on Is value amendment a convergent instrumental goal? · 2019-10-20T05:28:55.904Z · score: 6 (3 votes) · LW · GW

Is the tendency for an AI to amend its values also convergent?

I think there's a chance that it is (although I'd probably call it a convergent "behavior" rather than "instrumental goal"). The scenario I imagine is one where it's not feasible to build highly intelligent AIs that maximize some utility function or some fixed set of terminal goals, and instead all practical AIs (beyond a certain level of intelligence and generality) are kind of confused about their goals, like humans are, and have to figure them out using something like philosophical reasoning.

Comment by wei_dai on The strategy-stealing assumption · 2019-10-20T05:24:37.099Z · score: 3 (1 votes) · LW · GW

I wrote this post imagining “strategy-stealing assumption” as something you would assume for the purpose of an argument, for example I might want to justify an AI alignment scheme by arguing “Under a strategy-stealing assumption, this AI would result in an OK outcome.”

When you say "strategy-stealing assumption" in this sentence, do you mean the relatively narrow assumption that you gave in this post, specifically about "flexible influence":

This argument rests on what I'll call the strategy-stealing assumption: for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group of humans can use in order to capture a similar amount of flexible influence over the future.

or a stronger assumption that also includes that the universe and our values are such that "capture a similar amount of flexible influence over the future" would lead to an OK outcome? I'm guessing the latter? I feel like people, including me sometimes and you in this instance, are equivocating back and forth between these two meanings when using "strategy-stealing assumption". Maybe we should have two different terms for these two concepts too?

Comment by wei_dai on Maybe Lying Doesn't Exist · 2019-10-19T17:56:02.227Z · score: 4 (2 votes) · LW · GW

In theory this can be beneficial, but in practice the ability to reason about what’s going on deteriorates.

I think (speaking from my experience) specifications are often compromises in the first place between elegance / ease of reasoning and other considerations like performance. So I don't think it's taboo to "patch a procedure in a way that violates its specification in order to improve overall performance of the program or to fix an externally observable bug." (Of course you'd have to also patch the specification to reflect the change and make sure it doesn't break the rest of the program, but that's just part of the cost that you have to take into account when making this decision.)

Assuming you still disagree, can you explain why in these cases, we can't trust people to use learning and decision theory (i.e., human approximations to EU maximization or cost-benefit analysis) to make decisions, and we instead have to make them follow a rule (i.e., "don't ever do this")? What is so special about these cases? (Aren't there tradeoffs between ease of reasoning and other considerations everywhere?) Or is this part of a bigger philosophical disagreement between rule consequentialism and act consequentialism, or something like that?

Comment by wei_dai on Maybe Lying Doesn't Exist · 2019-10-19T03:47:00.076Z · score: 4 (2 votes) · LW · GW

I’m claiming that whether we call something a “danger” should not take into account considerations like, “We shouldn’t consider this a ‘danger’, because if we did, then people would feel afraid, and their fear is suffering to be minimized according to the global utilitarian calculus.”

Is the reason that you think we should not take this kind of consideration into account that, if we did decide not to consider the object under discussion a "danger", that would have worse consequences in the long run? If so, why not argue for taking both of these considerations into account and argue that the second consideration is stronger? Kind of a "fight speech with more speech instead of censorship" approach? (That would allow for the possibility that we override considerations for people's feelings in most cases, but avoid calling something a "danger" in extreme cases where the emotional or other harm of doing so is exceptionally great.)

It seems like the only reason you'd be against this is if you think that most people are too irrational to correctly weigh these kinds of considerations against each other on a case-by-case basis, and there's no way to train them to be more rational about this. Is that true, and if so, why do you think that?

That kind of utilitarianism might (or might not) be a good reason to not tell people about the danger, but it’s not a good reason to change the definition of “danger” itself.

I'm questioning whether there is any definition of "danger" itself (in the sense of things that are considered dangerous, not the abstract concept of danger), apart from the collection of things we decide to call "danger".

Comment by wei_dai on Maybe Lying Doesn't Exist · 2019-10-18T20:40:31.829Z · score: 4 (2 votes) · LW · GW

This is a bad example, because whether something is a crime is, in fact, fully determined by whether “we” (in the sense of “we, as a society, expressing our will through legislation, etc.”) decide to label it a ‘crime’. There is no “fact of the matter” about whether something “is a crime”, beyond that.

Maybe a better example is "danger"? Everything is somewhat dangerous; there are no "concentrations of unusually high probability density in Thingspace" that we can draw boundaries around, where one concentration is more dangerous than another with a clear gap in between, so whether we decide to call something a "danger" seemingly must depend entirely or mostly on the consequences of doing so. Yet there is clearly a fact of the matter about how dangerous something really is.

Comment by wei_dai on Impact measurement and value-neutrality verification · 2019-10-18T18:47:08.362Z · score: 15 (4 votes) · LW · GW

Some clarifications I got from Evan (evhub) on MIRIxDiscord:

  1. AI not being value-neutral is one way that the strategy-stealing assumption might be false, and therefore one thing we can work on if we want to make the strategy-stealing assumption true (or as true as possible).
  2. It's not clear whether "AI not being value-neutral" falls into one of Paul's 11 failure scenarios for strategy-stealing. The closest seems to be failure #1, "AI alignment", but one could also argue that an AI can be aligned but still not value-neutral.
  3. The "neutrality" measure formally defined in this post is meant to be a starting point for people to work on, and not necessarily close to the final solution.
  4. "Strategy-stealing" was originally defined in terms of maintaining a baseline resource distribution, but it's not clear if that's the right concept, and in this post Evan "moved somewhat towards maintaining a value distribution."

I think Evan has incorporated, or will incorporate, some of these clarifications into the post itself, but this may still be helpful for people who read the original post.

Comment by wei_dai on Random Thoughts on Predict-O-Matic · 2019-10-18T17:18:57.456Z · score: 8 (4 votes) · LW · GW

In a recent post, Wei Dai mentions a similar distinction (italics added by me):

Supervised training—This is safer than reinforcement learning because we don’t have to worry about reward hacking (i.e., reward gaming and reward tampering), and it eliminates the problem of self-confirming predictions (which can be seen as a form of reward hacking). In other words, if the only thing that ever sees the Oracle’s output during a training episode is an automated system that computes the Oracle’s reward/loss, and that system is secure because it’s just computing a simple distance metric (comparing the Oracle’s output to the training label), then reward hacking and self-confirming predictions can’t happen.

I think I've updated a bit from when I wrote this (due to this discussion). (ETA: I've now added a link from that paragraph to this comment.) Now I would say that the safety-relevant differences between SL and RL are:

  1. The loss computation for SL is typically simpler than the reward computation for RL, and therefore more secure / harder to hack, but maybe we shouldn't depend on that for safety.
  2. SL doesn't explore, so it can't "stumble onto" a way to hack the reward/loss computation like RL can. But it can still learn to hack the loss computation or the training label if the model becomes a mesa optimizer that cares about minimizing "loss" (e.g., the output of the physical loss computation) as either a terminal or instrumental goal. In other words, if reward/loss hacking happens with SL, the optimization power for it seemingly has to come from a mesa optimizer, whereas for RL it could come from either the base or mesa optimizer. (See the toy sketch after this list.)
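
Here's a toy sketch of the structural difference (all names and details are invented for illustration, and it isn't meant to model mesa optimization itself): in the supervised step only a fixed distance-metric loss ever sees the model's output, while in the RL step the agent's possibly exploratory action feeds back into the environment that computes its reward.

```python
import random


def supervised_step(model, x, label):
    """Supervised learning: the model's output is seen only by a simple,
    pre-specified distance metric that compares it to a fixed training label.
    Nothing else in the episode depends on the output, so the training signal
    can't be influenced by it (absent a mesa optimizer inside the model)."""
    prediction = model(x)
    loss = abs(prediction - label)  # simple distance metric, hard to hack
    return loss


def step_environment(state, action):
    """Toy environment: the action changes the state, and the reward is then
    computed from the resulting state, i.e., from something the agent's own
    behavior helped produce."""
    next_state = state + action
    reward = 1.0 if next_state % 2 == 0 else 0.0
    return next_state, reward


def rl_step(policy, state):
    """Reinforcement learning: random exploration means the agent can stumble
    onto states where the reward computation turns out to be exploitable."""
    if random.random() < 0.1:      # epsilon-greedy style exploration
        action = random.choice([0, 1])
    else:
        action = policy(state)
    return step_environment(state, action)


if __name__ == "__main__":
    model = lambda x: 0.5 * x      # stand-in "model"
    print(supervised_step(model, x=2.0, label=1.0))  # prints 0.0
    policy = lambda s: s % 2       # stand-in "policy"
    print(rl_step(policy, state=3))  # prints a (next_state, reward) pair
```
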
Comment by wei_dai on The strategy-stealing assumption · 2019-10-17T18:46:21.781Z · score: 14 (3 votes) · LW · GW

This argument rests on what I’ll call the strategy-stealing assumption: for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group of humans can use in order to capture a similar amount of flexible influence over the future.

The word "assumption" in "strategy-stealing assumption" keeps making me think that you're assuming this as a proposition and deriving consequences from it, but the actual assumption you're making is more like "it's a good idea to pick strategy-stealing as an instrumental goal to work towards, i.e., to work on things that would make the 'strategy-stealing assumption' true." This depends on at least 2 things:

  1. If the "strategy-stealing assumption" is true, we can get most of what we "really" want by doing strategy-stealing. (Example of how this can be false: (Logical) Time is of the essence)
  2. It's not too hard to make the "strategy-stealing assumption" true.

(If either 1 or 2 is false, then it would make more sense to work in another direction, like trying to get a big enough advantage to take over the world and prevent any unaligned AIs from arising, or trying to coordinate world governments to do that.)

Is this understanding correct? Also, because there is no name for "it's a good idea to try to make the 'strategy-stealing assumption' true", I think I and others have occasionally been using "strategy-stealing assumption" to refer to that as well, which I'm not sure whether you'd endorse. Since there are other issues with the name (like "stealing" making some people think "literally stealing"), I wonder if you'd be open to reconsidering the terminology.

ETA: Re-reading the sentence I quoted makes me realize that you named it "assumption" because it's an assumption needed for Jessica's argument, so it does make sense in that context. In the long run though, it might make more sense to call it something like a "goal" or "framework" since again in the larger scheme of things you're not so much assuming it and trying to figure out what to do given that it's true, as trying to make it true or using it as a framework for finding problems to work on.

Comment by wei_dai on Maybe Lying Doesn't Exist · 2019-10-16T05:40:49.646Z · score: 28 (9 votes) · LW · GW

When conscious intent is selectively discouraged more than unconscious intent, the result is rule by unconscious intent. Those who can conveniently forget, who can maintain narcissistic fantasies, who can avoid introspection, who can be ruled by emotions with hidden causes, will be the only ones able to deceive (or otherwise to violate norms) blamelessly.

Conscious intent being selectively discouraged more than unconscious intent does not logically imply that unconscious intent to deceive will be blameless or "free from or not deserving blame", only that it will be blamed less.

(I think you may have an unconscious motivation to commit this logical error in order to further your side of the argument. Normally I wouldn't say this out loud, or in public, but you seem to be proposing a norm where people do state such beliefs freely. Is that right? And do you think this instance also falls under "lying"?)

I think conscious intent being selectively discouraged more than unconscious intent can make sense for several reasons:

  1. Someone deceiving with conscious intent can apply more compute / intelligence and other resources for optimizing and maintaining the lie, which means the deception can be much bigger and more consequential, thereby causing greater damage to others.
  2. Deceiving with conscious intent implies that the person endorses lying in that situation, which means you probably need to do something substantially different to dissuade that person from lying in a similar situation in the future, compared to someone deceiving with unconscious intent. In the latter case, it might suffice to diplomatically (e.g., privately) bring up the issue to that person's conscious awareness, so they can consciously override their unconscious motivations.
  3. Conscious lies tend to be harder to detect (due to more optimizing power applied towards creating the appearance of truth). Economics research into optimal punishment suggests that (all else equal) crimes that are harder to detect should be punished more. (See the sketch of the deterrence logic after this list.)
  4. Unconscious deception is hard to distinguish from innocent mistakes. If you try to punish what you think are cases of unconscious deception, you'll end up making a lot of people feel like they were punished unfairly, either because they're truly innocent, or because they're not consciously aware of any deceptive intent and therefore think they're innocent. You inevitably make a lot of enemies, of yourself personally or of the norm you're proposing.
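
As a rough sketch of the standard deterrence logic behind point 3 (my gloss of the usual Becker-style model, not something stated in the parent comment): if a class of deception is detected with probability $p$ and punished with severity $F$ when detected, then keeping the expected punishment at a fixed deterrence level $D$ requires

$$p \cdot F = D \quad\Longrightarrow\quad F = \frac{D}{p},$$

so as $p$ falls (harder to detect), $F$ has to rise to provide the same disincentive.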

(There are some issues in the way I stated points 1-4 above that I can see but don't feel like spending more time to fix. I would rather spend my time on other topics, but nobody else is bringing up these points, so I feel like I have to, given how much the parent comment has been upvoted.)