Don't depend on others to ask for explanations 2019-09-18T19:12:56.145Z · score: 66 (21 votes)
Counterfactual Oracles = online supervised learning with random selection of training episodes 2019-09-10T08:29:08.143Z · score: 37 (11 votes)
AI Safety "Success Stories" 2019-09-07T02:54:15.003Z · score: 87 (25 votes)
Six AI Risk/Strategy Ideas 2019-08-27T00:40:38.672Z · score: 59 (28 votes)
Problems in AI Alignment that philosophers could potentially contribute to 2019-08-17T17:38:31.757Z · score: 76 (27 votes)
Forum participation as a research strategy 2019-07-30T18:09:48.524Z · score: 111 (36 votes)
On the purposes of decision theory research 2019-07-25T07:18:06.552Z · score: 65 (21 votes)
AGI will drastically increase economies of scale 2019-06-07T23:17:38.694Z · score: 41 (15 votes)
How to find a lost phone with dead battery, using Google Location History Takeout 2019-05-30T04:56:28.666Z · score: 52 (23 votes)
Where are people thinking and talking about global coordination for AI safety? 2019-05-22T06:24:02.425Z · score: 94 (32 votes)
"UDT2" and "against UD+ASSA" 2019-05-12T04:18:37.158Z · score: 43 (14 votes)
Disincentives for participating on LW/AF 2019-05-10T19:46:36.010Z · score: 77 (33 votes)
Strategic implications of AIs' ability to coordinate at low cost, for example by merging 2019-04-25T05:08:21.736Z · score: 49 (19 votes)
Please use real names, especially for Alignment Forum? 2019-03-29T02:54:20.812Z · score: 40 (13 votes)
The Main Sources of AI Risk? 2019-03-21T18:28:33.068Z · score: 64 (27 votes)
What's wrong with these analogies for understanding Informed Oversight and IDA? 2019-03-20T09:11:33.613Z · score: 37 (8 votes)
Three ways that "Sufficiently optimized agents appear coherent" can be false 2019-03-05T21:52:35.462Z · score: 68 (17 votes)
Why didn't Agoric Computing become popular? 2019-02-16T06:19:56.121Z · score: 53 (16 votes)
Some disjunctive reasons for urgency on AI risk 2019-02-15T20:43:17.340Z · score: 37 (10 votes)
Some Thoughts on Metaphilosophy 2019-02-10T00:28:29.482Z · score: 57 (16 votes)
The Argument from Philosophical Difficulty 2019-02-10T00:28:07.472Z · score: 47 (13 votes)
Why is so much discussion happening in private Google Docs? 2019-01-12T02:19:19.332Z · score: 86 (25 votes)
Two More Decision Theory Problems for Humans 2019-01-04T09:00:33.436Z · score: 58 (19 votes)
Two Neglected Problems in Human-AI Safety 2018-12-16T22:13:29.196Z · score: 77 (25 votes)
Three AI Safety Related Ideas 2018-12-13T21:32:25.415Z · score: 73 (26 votes)
Counterintuitive Comparative Advantage 2018-11-28T20:33:30.023Z · score: 76 (29 votes)
A general model of safety-oriented AI development 2018-06-11T21:00:02.670Z · score: 70 (23 votes)
Beyond Astronomical Waste 2018-06-07T21:04:44.630Z · score: 93 (41 votes)
Can corrigibility be learned safely? 2018-04-01T23:07:46.625Z · score: 73 (25 votes)
Multiplicity of "enlightenment" states and contemplative practices 2018-03-12T08:15:48.709Z · score: 93 (23 votes)
Online discussion is better than pre-publication peer review 2017-09-05T13:25:15.331Z · score: 18 (15 votes)
Examples of Superintelligence Risk (by Jeff Kaufman) 2017-07-15T16:03:58.336Z · score: 5 (5 votes)
Combining Prediction Technologies to Help Moderate Discussions 2016-12-08T00:19:35.854Z · score: 13 (14 votes)
[link] Baidu cheats in an AI contest in order to gain a 0.24% advantage 2015-06-06T06:39:44.990Z · score: 14 (13 votes)
Is the potential astronomical waste in our universe too small to care about? 2014-10-21T08:44:12.897Z · score: 25 (27 votes)
What is the difference between rationality and intelligence? 2014-08-13T11:19:53.062Z · score: 13 (13 votes)
Six Plausible Meta-Ethical Alternatives 2014-08-06T00:04:14.485Z · score: 44 (45 votes)
Look for the Next Tech Gold Rush? 2014-07-19T10:08:53.127Z · score: 39 (37 votes)
Outside View(s) and MIRI's FAI Endgame 2013-08-28T23:27:23.372Z · score: 16 (19 votes)
Three Approaches to "Friendliness" 2013-07-17T07:46:07.504Z · score: 20 (23 votes)
Normativity and Meta-Philosophy 2013-04-23T20:35:16.319Z · score: 12 (14 votes)
Outline of Possible Sources of Values 2013-01-18T00:14:49.866Z · score: 14 (16 votes)
How to signal curiosity? 2013-01-11T22:47:23.698Z · score: 21 (22 votes)
Morality Isn't Logical 2012-12-26T23:08:09.419Z · score: 19 (35 votes)
Beware Selective Nihilism 2012-12-20T18:53:05.496Z · score: 40 (44 votes)
Ontological Crisis in Humans 2012-12-18T17:32:39.150Z · score: 45 (49 votes)
Reasons for someone to "ignore" you 2012-10-08T19:50:36.426Z · score: 23 (24 votes)
"Hide comments in downvoted threads" is now active 2012-10-05T07:23:56.318Z · score: 18 (30 votes)
Under-acknowledged Value Differences 2012-09-12T22:02:19.263Z · score: 47 (50 votes)
Kelly Criteria and Two Envelopes 2012-08-16T21:57:41.809Z · score: 11 (8 votes)


Comment by wei_dai on Conditions for Mesa-Optimization · 2019-09-19T21:41:08.717Z · score: 5 (2 votes) · LW · GW

Humans do lots of things that look like “changing their objective” [...]

That's true but unless the AI is doing something like human imitation or metaphilosophy (in other words, we have some reason to think that the AI will converge to the "right" values), it seems dangerous to let it "change its objective" on its own. Unless, I guess, it's doing something like mild optimization or following norms, so that it can't do much damage even if it switches to a wrong objective, and we can just shut it down and start over. But if it's as messy as humans are, how would we know that it's strictly following norms or doing mild optimization, and won't "change its mind" about that too at some point (kind of like a human who isn't very strategic suddenly has an insight or reads something on the Internet and decides to become strategic)?

I think overall I'm still confused about your perspective here. Do you think this kind of "messy" AI is something we should try to harness and turn into a safety success story (if so how), or do you think it's a danger that we should try to avoid (which may for example have to involve global coordination because it might be more efficient than safer AIs that do have clean separation)?

Oh, going back to an earlier comment, I guess you're suggesting some of each: try to harness at lower capability levels, and coordinate to avoid at higher capability levels.

Comment by wei_dai on Just Imitate Humans? · 2019-09-19T17:58:55.117Z · score: 5 (2 votes) · LW · GW

Regarding the other two points, my intuition was that a few dozen people could work out the details satisfactorily in a year. If you don’t share this intuition, I’ll adjust downward on that.

I'm pretty skeptical of this, but then I'm pretty skeptical of all current safety/alignment approaches and this doesn't seem especially bad by comparison, so I think it might be worth including in a portfolio approach. But I'd like to better understand why you think it's promising. Do you have more specific ideas of how ~HSIFAUH can be used to achieve a Singleton and to keep it safe, or just a general feeling that it should be possible?

Comment by wei_dai on Feature Wish List for LessWrong · 2019-09-19T17:12:11.702Z · score: 3 (1 votes) · LW · GW

Voting from recent comments was enabled for a while, but Said objected to it because he thought people might vote on comments in a knee-jerk way without understanding the context.

In that case it would help if the Permalink button (which I have to first press in order to vote) goes to a page that shows the context of the comment. Currently it just shows the single comment, so you're not really making people see the context before they vote anyway. And I think that would be helpful even aside from this concern, because often I want to check out the context of a comment, and right now I have to press the up-arrow/parent button a bunch of times to do so.

Comment by wei_dai on Conditions for Mesa-Optimization · 2019-09-19T16:42:27.981Z · score: 10 (3 votes) · LW · GW

It’s plausible to me that for tasks that we actually train on, we end up creating systems that are like mesa optimizers in the sense that they have broad capabilities that they can use on relatively new domains that they haven’t had much experience on before, but nonetheless because they aren’t made up of two clean parts (mesa objective + capabilities) there isn’t a single obvious mesa objective that the AI system is optimizing for off distribution.

Coming back to this, can you give an example of the kind of thing you're thinking of (in humans, animals, current ML systems)? Or some other reason you think this could be the case in the future?

Also, do you think this will be significantly more efficient than "two clean parts (mesa objective + capabilities)"? (If not, it seems like we can use inner alignment techniques, e.g., transparency and verification, to force the model to be "two clean parts" if that's better for safety.)

Comment by wei_dai on Conditions for Mesa-Optimization · 2019-09-18T18:29:08.038Z · score: 6 (3 votes) · LW · GW

Humans and systems produced by meta learning both do reasonably well at learning, and don’t do “search” (depending on how loose you are with your definition of “search”).

Part of what inspired me to write my comment was watching my kid play logic puzzles. When she starts a new game, she has to do a lot of random trial-and-error with backtracking, much like MCTS. (She does the trial-and-error on the physical game board, but when I play I often just do it in my head.) Then her intuition builds up and she can start to recognize solutions earlier and earlier in the search tree, sometimes even immediately upon starting a new puzzle level. Then the game gets harder (the puzzle levels slowly increase in difficulty) or moves to a new regime where her intuitions don't work, and she has to do more trial-and-error again, and so on. This sure seems like "search" to me.
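To make the "trial-and-error with backtracking" pattern concrete, here is a minimal sketch on a toy puzzle. N-queens is used purely as an illustration (it is not the game described above), and the `hint` argument is a hypothetical stand-in for built-up intuition: it only tells the solver which move to try first in each row.

```python
def solve_queens(n, hint=None):
    """Place n queens by trial-and-error with backtracking.

    Returns (solution, tries): solution[r] is the column of the queen in
    row r (or None if no solution exists), and tries counts how many
    placements were attempted along the way.
    hint, if given, is a list naming the column to try first in each row.
    """
    cols = []   # cols[r] = column of the queen already placed in row r
    tries = 0

    def safe(col):
        row = len(cols)
        # No shared column and no shared diagonal with earlier queens.
        return all(c != col and abs(c - col) != row - r
                   for r, c in enumerate(cols))

    def search():
        nonlocal tries
        row = len(cols)
        if row == n:
            return True
        first = [hint[row]] if hint is not None else []
        order = first + [c for c in range(n) if c not in first]
        for col in order:
            tries += 1
            if safe(col):
                cols.append(col)
                if search():
                    return True
                cols.pop()  # dead end: undo the move and try the next one
        return False

    found = search()
    return (list(cols) if found else None), tries


solution, blind_tries = solve_queens(6)
_, guided_tries = solve_queens(6, hint=solution)
print(blind_tries, guided_tries)
```

With a perfect hint the solver makes exactly one attempt per row, while the blind run needs more, which is roughly the "recognizing solutions earlier in the search tree" effect. The search procedure itself is unchanged in both cases.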

Fwiw, on the original point, even standard machine learning algorithms (not the resulting models) don’t seem like “search” to me, though they also aren’t just a bag of heuristics and they do have a clearly delineated objective, so they fit well enough in the mesa optimization story.

This really confuses me. Maybe with some forms of supervised learning you can either calculate the solution directly, or just follow a gradient (which may be arguable whether that's search or not), but with RL, surely the "explore" steps have to count as "search"? Do you have a different kind of thing in mind when you think of "search"?
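As a hedged illustration of what I mean by "explore" steps, here is a minimal epsilon-greedy bandit sketch. The function name, the two-armed setup, and the deterministic rewards are all assumptions made for the example, not anything from the paper or from a particular RL library.

```python
import random


def run_bandit(rewards, epsilon, steps, seed=0):
    """Epsilon-greedy on a bandit with fixed (deterministic) arm rewards.

    Returns how many times each arm was pulled.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)
    counts = [0] * len(rewards)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try an arm at random, regardless of current estimates.
            arm = rng.randrange(len(rewards))
        else:
            # Exploit: take the arm that currently looks best (ties -> first).
            arm = max(range(len(rewards)), key=lambda a: estimates[a])
        counts[arm] += 1
        # Incremental running average of the observed reward for this arm.
        estimates[arm] += (rewards[arm] - estimates[arm]) / counts[arm]
    return counts


greedy_counts = run_bandit([0.1, 0.9], epsilon=0.0, steps=1000)
exploring_counts = run_bandit([0.1, 0.9], epsilon=0.1, steps=1000)
print(greedy_counts, exploring_counts)
```

With `epsilon=0.0` the agent never tries the second arm at all, so it never finds out that it is better. The random trials are what let it discover the better option, and that part looks like (a very simple form of) search to me.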

Comment by wei_dai on Conditions for Mesa-Optimization · 2019-09-17T23:33:38.476Z · score: 10 (5 votes) · LW · GW

The Risks from Learned Optimization paper and this sequence don't seem to talk about the possibility of mesa-optimizers developing from supervised learning and the resulting inner alignment problem. The part that gets closest is

First, though we largely focus on reinforcement learning in this sequence, RL is not necessarily the only type of machine learning where mesa-optimizers could appear. For example, it seems plausible that mesa-optimizers could appear in generative adversarial networks.

I wonder if this was intentional, and if not maybe it would be worth making a note somewhere in the paper/posts that an oracle/predictor that is trained on sufficiently diverse data using SL could also become a mesa-optimizer (especially since this seems counterintuitive and might be overlooked by AI researchers/builders). See related discussion here.

Comment by wei_dai on Conditions for Mesa-Optimization · 2019-09-17T23:06:20.388Z · score: 4 (2 votes) · LW · GW

I meant that claim to apply to "realistic" tasks (which I don't yet know how to define).

Machine learning seems hard to do without search, if that counts as a "realistic" task. :)

I wonder if you can say something about what your motivation is to talk about this, i.e., are there larger implications if "just heuristics" is enough for arbitrary levels of performance on "realistic" tasks?

Comment by wei_dai on Realism and Rationality · 2019-09-17T17:43:51.502Z · score: 6 (3 votes) · LW · GW

However, early/foundational community writing seems to reject the idea that there’s any meaningful conceptually distinct sense in which we can talk about an action being “reasonable.”

I think there's a distinction (although I'm not sure if I've talked explicitly about it before). Basically there's quite possibly more to what the "right" or "reasonable" action is than "what action that someone who tends to 'win' a lot over the course of their life would take?" because the latter isn't well defined. In a multiverse the same strategy/policy would lead to 100% winning in some worlds/branches and 100% losing in other worlds/branches, so you'd need some kind of "measure" to say who wins overall. But what the right measure is seems to be (or could be) a normative fact that can't be determined by just looking at or thinking about "who tends to 'win' a lot".

ETA: Another way that "tends to win" isn't well defined is that if you look at the person who literally wins the most, they might just be very lucky instead of actually doing the "reasonable" thing. So I think "tends to win" is more of an intuition pump for what the right conception of "reasonable" is than actually identical to it.
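A toy calculation may help with the point about needing a measure. Everything here (two branches, two policies, the specific weights) is made up for illustration. The only point is that the ranking of the policies flips depending on which measure you plug in.

```python
from fractions import Fraction


def overall_win_rate(outcomes, measure):
    """Measure-weighted win rate; outcomes[b] is 1 if the policy wins in branch b."""
    return sum(measure[b] * outcomes[b] for b in outcomes)


# Each policy wins outright in one branch and loses outright in the other.
policy_a = {"branch_1": 1, "branch_2": 0}
policy_b = {"branch_1": 0, "branch_2": 1}

# Two candidate measures over the same branches (hypothetical weights).
measure_x = {"branch_1": Fraction(9, 10), "branch_2": Fraction(1, 10)}
measure_y = {"branch_1": Fraction(1, 10), "branch_2": Fraction(9, 10)}

for name, measure in [("x", measure_x), ("y", measure_y)]:
    print(name,
          overall_win_rate(policy_a, measure),
          overall_win_rate(policy_b, measure))
```

Under `measure_x` policy A "tends to win", and under `measure_y` policy B does, even though the branch-by-branch outcomes are identical in both cases. So the outcomes alone don't settle which policy is the winning one.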

Comment by wei_dai on The strategy-stealing assumption · 2019-09-17T17:20:40.051Z · score: 3 (1 votes) · LW · GW

I'm still not sure I understand. Is the aligned AI literally applying a planning algorithm to the same long-term goal as the unaligned AI, and then translating that plan into a plan for acquiring flexible influence, or is it just generally trying to come up with a plan to acquire flexible influence? If the latter, what kind of thing do you imagine it actually doing? For example is it trying to "find a strategy that’s instrumentally useful for a variety of long-term goals" as I guessed earlier? (It's hard for me to help "look for ways that can fail" when this picture isn't very clear.)

Comment by wei_dai on Realism and Rationality · 2019-09-17T16:31:59.635Z · score: 3 (1 votes) · LW · GW

The "morally normative" and "epistemically normative" examples in our conversation over on EAF are the kinds of things I'm referring to. ETA: Another example of a normative fact is if there is a right prior for a Bayesian.

Comment by wei_dai on The strategy-stealing assumption · 2019-09-17T15:53:34.003Z · score: 3 (1 votes) · LW · GW

Strategy stealing doesn’t usually involve actual stealing, just using the hypothetical strategy the second player could have used.

Oh, didn't realize that it's an established technical term in game theory.

by updating on all the same logical facts about how to acquire influence that the unaligned AI might discover and then using those

What I mean is that the unaligned AI isn't trying to "acquire influence", but rather trying to accomplish a specific long-term / terminal goal. The aligned AI doesn't have a long-term / terminal goal, so it can't just use "whatever procedure the unaligned AI originally used to find that strategy", at least not literally.

Comment by wei_dai on The strategy-stealing assumption · 2019-09-17T08:26:50.473Z · score: 3 (1 votes) · LW · GW

I’m not imagining that the aligned AI literally observes and copies the strategy of the unaligned AI. It just uses whatever procedure the unaligned AI originally used to find that strategy.

How? The unaligned AI is presumably applying some kind of planning algorithm to its long-term/terminal goal to find its strategy, but in your scenario isn't the aligned/corrigible AI just following the short-term/instrumental goals of its human users? How is it able to use the unaligned AI's strategy-finding procedure?

To make a guess, are you thinking that the user tells the AI "Find a strategy that's instrumentally useful for a variety of long-term goals, and follow that until further notice?" If so, it's not literally the same procedure that the unaligned AI uses but you're hoping it's close enough?

As a matter of terminology, if you're not thinking of literally observing and copying strategy, why not call it "strategy matching" instead of "strategy stealing" (which has a strong connotation of literal copying)?

Comment by wei_dai on Feature Wish List for LessWrong · 2019-09-17T04:08:03.435Z · score: 3 (1 votes) · LW · GW

Thanks! While I have you here, any reason why voting is disabled in the inbox and recent comments on GW?

Also, I submitted a feature request on the GW issue tracker to display AF karma (the numbers next to the Omega signs on LW). This is a problem especially on mobile (Android Firefox) because the "LW" button that takes me to the LW version of a post/comment doesn't show there so I can't even click on that to see the AF karma.

Comment by wei_dai on Reframing the evolutionary benefit of sex · 2019-09-16T18:40:02.050Z · score: 3 (1 votes) · LW · GW

In some sense the question motivating the OP was whether many of the important phenomena can come from being episodic with an appropriate utility function (something like exp(fitness) instead of fitness

I don't understand this. Want to elaborate?

Comment by wei_dai on The strategy-stealing assumption · 2019-09-16T18:36:13.178Z · score: 5 (2 votes) · LW · GW

I expect this to become more true over time — I expect groups of agents with diverse preferences to eventually approach efficient outcomes, since otherwise there are changes that every agent would prefer (though this is not obvious, especially in light of bargaining failures).

This seems the same as saying that coordination is easy (at least in the long run), but coordination could be hard, especially cosmic coordination. Also, Robin Hanson says governance is hard, which would be a response to your response to 9. (I think I personally have uncertainty that covers both ends of this spectrum.)

Comment by wei_dai on The strategy-stealing assumption · 2019-09-16T18:20:49.510Z · score: 9 (5 votes) · LW · GW

Thanks, stating (part of) your success story this way makes it easier for me to understand and to come up with additional "ways it could fail".

Cryptic strategies

The unaligned AI comes up with some kind of long term strategy that the aligned AI can't observe or can't understand, for example because the aligned AI is trying to satisfy humans' short-term preferences and humans can't observe or understand the unaligned AI's long term strategy.

Different resources for different goals

The unaligned AI uses up useful resources for human goals to get resources that are useful for itself. Aligned AI copies this and it's too late when humans figure out what their goals actually are. (Actually this doesn't apply because you said "This is intended as an interim solution, i.e. you would expect to transition to using a “correct” prior before accessing most of the universe’s resources (say within 1000 years). The point of this approach is to avoid losing influence during the interim period." I'll leave this here anyway to save other people time in case they think of it.)

Trying to kill everyone as a terminal goal

Under "reckless" you say "Overall I think this isn’t a big deal, because it seems much easier to cause extinction by trying to kill everyone than as an accident." but then you don't list this as an independent concern. Some humans want to kill everyone (e.g. to eliminate suffering) and so they could build AIs that have this goal.

Time-inconsistent values and other human irrationalities

This may give unaligned AI systems a one-time advantage for influencing the long-term future (if they care more about it) but doesn’t change the basic dynamics of strategy-stealing.

This may be false if humans don't have time-consistent values. See this and this for examples of such values. (Will have to think about how big of a deal this is, but thought I'd just flag it for now.)

Weird priors

From this comment: Here’s a possible way for another AI (A) to exploit your AI (B). Search for a statement S such that B can’t consult its human about S’s prior and P(A will win a future war against B | S) is high. Then adopt a high prior for S, wait for B to do the same, and come to B to negotiate a deal that greatly favors A.

Additional example of 11

This seems like an important example of 11 to state explicitly: The optimal strategy for unaligned AI to gain resources is to use lots of suffering subroutines or commit a lot of "mindcrime". Or, the unaligned AI deliberately does this just so that you can't copy its strategy.

Comment by wei_dai on Reframing the evolutionary benefit of sex · 2019-09-16T10:10:56.226Z · score: 3 (1 votes) · LW · GW

Please ask them to auto-crosspost too, unless you're not doing that intentionally? (I have some comments on your latest post there and would rather discuss here than on Medium.)

Comment by wei_dai on Feature Wish List for LessWrong · 2019-09-16T07:56:12.806Z · score: 3 (1 votes) · LW · GW

Would email notifications with a daily digest (and settings to be informed of certain events individually and immediately) be sufficient for a lot of these, or is there something specific about the push-notifications from the app that would help with this?

Seems like app notifications could be much more user-friendly. I wouldn't need to open up the email, then press again to open up the browser, then close the browser tab, delete the email. Also, the app notification can directly show me the most important information so I can decide whether or not to look further, whereas I'll probably have to open the email to see that. Also with emails you'd have to deal with or worry about spam filters, rate limiters, delivery problems, etc.

For notifications, I actually think just making LW a full progressive web app and using the notifications API of mobile browsers seems like the best choice to me, but not confident.

I'm not familiar with PWA, but if it works it may be a good alternative to native apps.

Comment by wei_dai on Just Imitate Humans? · 2019-09-16T07:38:27.891Z · score: 3 (1 votes) · LW · GW

Obviously, there’s other stuff to do to establish a stable unipolar world order

I was asking about this part. I'm not convinced HSIFAUH allows you to do this in a safe way (e.g., without triggering a war that you can't necessarily win).

Given your lead time from having more computing power than the reckless team, one has to analyze how many doubling periods you have time for.

Another complication here is that the people trying to build ~AIXI can probably build an economically useful ~AIXI using less compute than you need for ~HSIFAUH (for jobs that don't need to model humans), and start doing their own doublings.

But in terms of extinction threat to real-world humans, this starts to look more like the problem of maintaining a power structure over a vast number of humans and less like typical AI alignment difficulties; historically, the former seems to be a solvable problem.

I don't think we've seen a solution that's very robust though. Plus, having to maintain such a power structure starts to become a human safety problem for the real humans (i.e., potentially causes their values to become corrupted).

Comment by wei_dai on Reframing the evolutionary benefit of sex · 2019-09-16T06:46:49.640Z · score: 5 (2 votes) · LW · GW

either evolution is surprisingly forward-looking

Interestingly, this relates to our discussion about episodic vs non-episodic learning algorithms. In this case it seems clear that evolution is not episodic and, assuming very large population sizes, ought to maximize long-run inclusive fitness. So the puzzle here is that if it takes 300 generations for sex to break even, then if a mutation caused some member of a sexual species to start reproducing asexually, the sexual population would crash to 0 before it could recover.

My idea for solving this (which I just thought of now so take it with a grain of salt) is, because a sexual species can maintain a genome against a much higher mutation rate than an asexual species can (see past discussion), an asexual species needs to have a much lower mutation rate (i.e., much more machinery to prevent/repair mutations) to survive. When an asexual population arises from a sexual species, it doesn't have the extra protective machinery and therefore quickly succumbs to accumulation of harmful mutations, perhaps before the sexual population can go extinct.

Or if it does drive the sexual population extinct first before itself going extinct, if most species are sexual then a phenotypically nearby species can just come occupy the now vacated niche.

Comment by wei_dai on Realism and Rationality · 2019-09-16T05:15:40.272Z · score: 33 (12 votes) · LW · GW

When discussing normative questions, many members of the rationalist community identify as anti-realists.

I wish when people did this kind of thing (i.e., respond to other people's ideas, arguments, or positions) they would give some links or quotes, so I can judge whether whatever they're responding to is being correctly understood and represented. In this case, I feel like there aren't actually that many people who identify as normative anti-realists (i.e., deny that any kind of normative facts exist). More often I see people who are realist about rationality, but anti-realist, subjectivist, or relativist about morality. (See my Six Plausible Meta-Ethical Alternatives for a quick intro to these distinctions.)

Your footnote 1 suggests that maybe you think these distinctions don't really exist (or something like that) and therefore we should just consider realism vs anti-realism, where realism means that all types of normative facts exist and anti-realism means that all types of normative facts don't exist. If so, I think this needs to be explicitly spelled out and defended before you start assuming it.

Comment by wei_dai on Feature Wish List for LessWrong · 2019-09-15T02:34:28.136Z · score: 3 (1 votes) · LW · GW

though not that many users use the LessWrong PM system, so I am kind of hesitant to optimize that part of the site super much.

I'm actually thinking more about comments than PMs. There are comments (either addressed to me or just general comments) that I want to make sure to eventually read/digest/respond to, and right now there's no good way to keep track of that.

Is there anything else in this thread that you think we should prioritize?

Not sure what you mean by "this thread", but I still want to see the backlinks feature, and this.

Comment by wei_dai on If you had to pick one thing you've read that changed the course of your life, what would it be? · 2019-09-15T01:44:06.703Z · score: 12 (2 votes) · LW · GW

This post contains my answer.

Comment by wei_dai on Jimrandomh's Shortform · 2019-09-15T01:12:22.846Z · score: 3 (1 votes) · LW · GW

Can you give some specific examples of me having security mindset, and why they count as having security mindset? I'm actually not entirely sure what it is or that I have it, and would be hard pressed to come up with such examples myself. (I'm pretty sure I have what Eliezer calls "ordinary paranoia" at least, but am confused/skeptical about "deep security".)

Comment by wei_dai on Jimrandomh's Shortform · 2019-09-15T01:09:19.661Z · score: 5 (2 votes) · LW · GW

Combining hash functions is actually trickier than it looks, and some people are doing research in this area and deploying solutions. See and It does seem that if cryptography people had more of a security mindset (that are not being defeated) then there would be more research and deployment of this already.
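For concreteness, here is a minimal sketch of the simplest combiner people usually start from: concatenating the outputs of two different hash functions. The function name and the choice of SHA-256 plus SHA3-256 are my own assumptions for illustration, not what any particular deployed system uses.

```python
import hashlib


def concat_combiner(data: bytes) -> bytes:
    """Concatenate the digests of two independent hash functions.

    A collision in the combined output requires the same pair of inputs to
    collide under BOTH underlying functions at once.
    """
    return hashlib.sha256(data).digest() + hashlib.sha3_256(data).digest()


digest = concat_combiner(b"hello world")
print(len(digest), digest.hex())
```

This sketch only speaks to collisions, and it doubles the output length. Other properties one might want from a hash function do not necessarily combine this simply, which is part of why this is trickier than it looks.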

Comment by wei_dai on Feature Wish List for LessWrong · 2019-09-15T00:16:41.763Z · score: 13 (2 votes) · LW · GW

Maybe this could be done by something else besides the core LW team, but I'd like to have an Android app for LW and EA Forum, that would give me periodic notifications for my inbox, posts/authors/threads I subscribe to, karma changes, maybe new high-karma posts and comments, so I don't have to constantly refresh many different pages on LW/GW/EAF to keep up with what's going on. Having to do that is really a pain when you're trying to use forum participation as a research strategy.

I've been thinking about writing this app myself, but thought I'd ask first to see if anyone else wants to do it.

Comment by wei_dai on A Critique of Functional Decision Theory · 2019-09-15T00:05:03.490Z · score: 4 (2 votes) · LW · GW

I don't see anything wrong with what you're saying, but if you did that you'd end up not being an indexically selfish person anymore. You'd be selfish in a different, perhaps alien or counterintuitive way. So you might be reluctant to make that kind of commitment until you've thought about it for a much longer time, and UDT isn't compatible with your values in the meantime. Also, without futuristic self-modification technologies, you are probably not able to make such a commitment truly binding even if you wanted to and you tried.

Comment by wei_dai on Feature Wish List for LessWrong · 2019-09-14T23:53:31.666Z · score: 3 (1 votes) · LW · GW

I suggest a "I already replied" indicator for messages in my inbox. Also, flags I can set on certain comments/messages to indicate that I should reply to them with high priority. (These could basically work the same way as in a typical email client.) This could be integrated with the bookmark feature that I suggested earlier.

Also, is it just me or does development happen rather slowly on LW? I've done some web development myself and it seems like on a codebase that I'm familiar with, it would take a few weeks at most to implement some of the feature suggestions that have been sitting here for many months.

Comment by wei_dai on Reframing the evolutionary benefit of sex · 2019-09-14T18:32:33.680Z · score: 16 (6 votes) · LW · GW

From the perspective of an organism trying to propagate its genes, sex is like a trade: I’ll put half of your DNA in my offspring if you put half of my DNA in yours. I still pass one copy of my genes onto the next generation per unit of investment in children, so it’s a fair deal. And it doesn’t impact the average fitness of my kids very much, since on average my partner’s genes will be about as good as mine.

Wait, you seem to be assuming that both parents invest equally in the offspring, but in most (vast majority?) of species, one sex invests more than the other. In some species the male makes virtually no investment at all, and what you say here clearly does not apply to those species.

Comment by wei_dai on A Critique of Functional Decision Theory · 2019-09-14T17:37:52.795Z · score: 7 (2 votes) · LW · GW

That's true, but they could say, "Well, given that no binding commitment was in fact made, and given my indexically selfish values, it's rational for me to choose Right." And I'm not sure how to reply to that, unless we can show that such indexically selfish values are wrong somehow.

Comment by wei_dai on Utility uncertainty vs. expected information gain · 2019-09-14T10:56:39.074Z · score: 4 (2 votes) · LW · GW

So if anyone has a safety idea to which utility uncertainty feels central

These two posts looked at some possibilities for using utility uncertainty but they didn't seem that promising and I don't know if anyone is still going in these directions:

Comment by wei_dai on Just Imitate Humans? · 2019-09-14T09:23:11.901Z · score: 3 (1 votes) · LW · GW

Sorry, I think you misunderstood my question about combining human imitations with more general oracles/predictors. What I meant is that you could use general oracles/predictors to build models of the world, which the human imitators could then query or use to test out potential actions. This perhaps lets you overcome the problem of human imitators having worse world models than ~AIXI and narrows the capability gap between them.

Comment by wei_dai on Just Imitate Humans? · 2019-09-14T09:12:27.370Z · score: 3 (1 votes) · LW · GW

I’m pretty sure N = 7 billion is enough.

Why? What are those 7 billion HSIFAUH doing?

In another comment you said "If I’m understanding correctly, the concern is that the imitator learns how humans plan before learning what humans want, so then it plans like a human toward the achievement of some inhuman goal. I don’t think this causes an existential catastrophe." But if there are 7 billion HSIFAUH which are collectively capable of taking over the world, how is that not a potential existential catastrophe if they have inhuman values?

Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N.

How? And why would it grow fast enough to get to a large enough N before someone deploys ~AIXI?

It should be possible to weaken the online version and get some of this speedup.

What do you have in mind here?

I don’t know how to do this. But it’s the same stuff the reckless team is doing to make standard RL powerful.

You do have to solve some safety problems that the reckless team doesn't though, don't you? What do you think the main safety problems are?

Comment by wei_dai on Eli's shortform feed · 2019-09-14T08:23:41.752Z · score: 5 (2 votes) · LW · GW

Furthermore, I think that should be a universal policy on LessWrong, though maybe this is just an idiosyncratic neurosis of mine.

If it's not just you, it's at least pretty rare. I've seen the mods "helpfully" edit posts several times (without asking first) and this is the first time I've seen anyone complain about it.

Comment by wei_dai on A Critique of Functional Decision Theory · 2019-09-14T08:00:41.911Z · score: 33 (11 votes) · LW · GW

It seems important to acknowledge that there's a version of the Bomb argument that actually works, at least if we want to apply UDT to humans as opposed to AIs, and this may be part of what's driving Will's intuitions. (I'll use "UDT" here because that's what I'm more familiar with, but presumably everything transfers to FDT.)

First there's an ambiguity in Bomb as written, namely what does my simulation see? Does it see a bomb in Left, or no bomb? Suppose the setup is that the simulation sees no bomb in Left. In that case since obviously I should take Left when there's no bomb in it (and that's what my simulation would do), if I am seeing a bomb in Left it must mean I'm in the 1 in a trillion trillion situation where the predictor made a mistake, therefore I should (intuitively) take Right. UDT also says I should take Right so there's no problem here.

Now suppose the simulation is set up to see a bomb in Left. In that case, when I see a bomb in Left, I don't know if I'm a simulation or a real person. If I was selfish in an indexical way, I would think something like "If I'm a simulation then it doesn't matter what I choose. The simulation will end as soon as I make a choice so my choice is inconsequential. But if I'm a real person, choosing Left will cause me to be burned. So I should choose Right." The thing is, UDT is incompatible with this kind of selfish values, because UDT takes a utility function that is defined over possible histories of the world and not possible centered histories of the world (i.e., histories with an additional pointer that says this is "me"). UDT essentially forces an agent to be altruistic to its copies, and therefore is unable to give the intuitively correct answer in this case.
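To make the contrast in the paragraph above concrete, here is a minimal sketch with toy numbers. Only the error rate comes from the problem statement; the cost of Right, the disutility of burning, the probability of being the real person, and the function names are all illustrative assumptions. It also assumes the agent takes Left whenever it sees no bomb.

```python
# Toy numbers; everything except the error rate is an illustrative assumption.
EPSILON = 1e-24      # predictor error rate ("1 in a trillion trillion")
COST_RIGHT = -100.0  # assumed cost of taking Right
COST_BURN = -1e9     # assumed disutility of being burned


def uncentered_eu(action_on_bomb):
    """Policy-level expected utility, scored over whole world histories."""
    if action_on_bomb == "Left":
        # The predictor almost always foresees Left, so Left holds no bomb;
        # the agent burns only in the rare case where the predictor errs.
        return EPSILON * COST_BURN
    # The predictor almost always foresees Right and puts a bomb in Left;
    # the agent then pays for Right. On a predictor error there is no bomb
    # and the agent takes Left for free.
    return (1 - EPSILON) * COST_RIGHT


def centered_eu(action_on_bomb, p_real):
    """Indexically selfish utility, conditional on already seeing the bomb."""
    # If "I" am the simulation, the choice is treated as inconsequential (0).
    cost = COST_BURN if action_on_bomb == "Left" else COST_RIGHT
    return p_real * cost


actions = ["Left", "Right"]
udt_choice = max(actions, key=uncentered_eu)
selfish_choice = max(actions, key=lambda a: centered_eu(a, p_real=0.5))
print(udt_choice, selfish_choice)
```

Under these assumptions the uncentered (UDT-style) evaluation favors Left, while the indexically selfish evaluation favors Right for any nonzero probability of being the real person.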

If we're doing decision theory for humans, then the incompatibility with this kind of selfish values would be a problem because humans plausibly do have this kind of selfish values as part of our complex values and whatever decision theory we use perhaps should be able to handle it. However if we're building an AI, it doesn't seem to make sense to let it have selfish values (i.e., have a utility function over centered histories as opposed to uncentered histories), so UDT seems fine (at least as far as this issue is concerned) for thinking about how AIs should ideally make decisions.

Comment by wei_dai on AI Safety "Success Stories" · 2019-09-13T17:45:50.719Z · score: 4 (2 votes) · LW · GW

Even though alignment may take more years/decades to solve in this scenario, it’s a much safer environment to do so.

It seems safer, but I'm not sure about "much safer". You now have an extremely powerful AI that takes human commands. Lots of people and governments would want to get their hands on it, and geopolitics is highly destabilized due to your unilateral actions. What are your next steps to ensure continued safety?

Although, these hypotheticals are unlikely (their purpose is pedagogical). It’s likely due to my ignorance, but I am unaware of any pivotal acts attached to anyone’s research agenda.

I think the examples in that Arbital post are actually intended to be realistic examples (i.e., something that MIRI or at least Eliezer would consider doing if they managed to build a safe and powerful task AGI). If you have reason to think otherwise, please explain.

Comment by wei_dai on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-13T08:15:22.360Z · score: 6 (3 votes) · LW · GW

What we want the training process to produce is a mesa-optimizer that tries to minimize the actual distance between its output and the training label (i.e., the actual loss).

Hmm, actually this still doesn't fully address TurnTrout’s (Alex Turner’s) concern, because this mesa-optimizer could try to minimize the actual distance between its output and the training label by changing the training label (what that means depends on how the training label is defined within its utility function). To do that it would have to break out of the box that it's in, which may not be possible, but this is still a system that is “looking for ways to hurt you.” It seems that what we really want is a mesa-optimizer that tries to minimize the actual loss while pretending that it has no causal influence on the training label (even if it actually does because there's a way to break out of its box).

This seems like a harder inner alignment problem than I thought, because we have to make the training process converge upon a rather unnatural kind of agent. Is this still a feasible inner alignment problem to solve, and if not is there another way to get around this problem?

Comment by wei_dai on [AN #63] How architecture search, meta learning, and environment design could lead to general intelligence · 2019-09-13T03:37:29.158Z · score: 4 (2 votes) · LW · GW

As the paper acknowledges, this introduces several risks, and so it calls for deep engagement with AI safety researchers (but sadly it does not propose ideas on how to mitigate the risks).

As far as I can tell, IA-GA doesn't fit into any of the current AI safety success stories, and it seems hard to imagine what kind of success story it might fit into. I'm curious if anyone is more optimistic about this.

Comment by wei_dai on AI Safety "Success Stories" · 2019-09-13T03:33:38.773Z · score: 6 (3 votes) · LW · GW

Thanks for the references. I think I should also credit you with being the first to use "success story" the way I'm using it here, in connection with AI safety, which gave me the idea to write this post.

It’s not the same as your Interim Quality-of-Life Improver, but it’s got similar aspects.

The main difference seems to be that you don't explicitly mention strong global coordination to stop unaligned AI from arising. Is that something you also had in mind? (I seem to recall someone talking about that in connection with this kind of scenario.)

It’s also related to the concept of a “Great Deliberation” where we stabilize the world and then figure out what we want to do. (I don’t have a reference for that though.)

There's also Will MacAskill and Toby Ord's "the Long Reflection" (which may be the same thing that you're thinking of), which as far as I know isn't written up in detail anywhere yet. However I'm told that both of their upcoming books will have some discussions of it.

Comment by wei_dai on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-13T02:48:36.644Z · score: 3 (1 votes) · LW · GW

I vaguely agree with this concern but would like a clearer understanding of it. Can you think of a specific example of how this problem can happen?

Comment by wei_dai on Concrete experiments in inner alignment · 2019-09-12T17:25:44.715Z · score: 3 (1 votes) · LW · GW

When I use the term “RL agent,” I always mean an agent trained via RL.

I think the problem with this usage is that "RL agent" originally meant something like "an agent designed to solve an RL problem" where "RL problem" is something like "a class of problems with the central example being MDPs". I think it's just not a well-defined term at this point, and if you Google it, you get plenty of results that say things like "the goal of our RL agent is to maximize the expected cumulative reward", or "AIXI is a reinforcement learning agent". I guess this is fine for AI capabilities work but really confusing for AI safety work.

So, consider switching to "RL-trained agent" for greater clarity (unless someone has a better suggestion)? ETA: Maybe "reinforcement trained agent"?

Comment by wei_dai on Concrete experiments in inner alignment · 2019-09-12T11:04:51.288Z · score: 7 (4 votes) · LW · GW

Train an RL agent with access to its previous step reward as part of its observation.

This is making me notice a terminological ambiguity where sometimes "RL agent" refers to a model/policy trained by a reinforcement learning algorithm (such as REINFORCE) like you're doing here, and sometimes it refers to an agent that maximizes expected reward (given as an input), such as AIXI, like in Daniel Dewey's Learning What to Value, and an "RL agent" in the first sense is not necessarily an "RL agent" in the second sense.

To disambiguate, it seems a good idea to call the former kind of agent something like "RL-trained agent" and the second kind of agent "reward-maximizing agent" or "reward-maximizer" for short. Then we can say things like, "If an RL-trained agent is not given direct access to its step rewards during training, it seems less likely to become a reward-maximizer." Any thoughts on this suggestion? (I'll probably make a post about this later, but thought I'd run it by you and any others who see this comment for a sanity check first.)
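A rough sketch of the two senses, using a two-armed bandit. This is only an illustration under stated assumptions: the function names are hypothetical, the "training" is an exact expected policy-gradient (REINFORCE-style) update rather than a sampled one, and the reward-maximizer is simply handed the current rewards as input.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(rewards, steps=200, lr=0.5):
    """Expected policy-gradient update on action preferences (toy training)."""
    prefs = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(prefs)
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # d E[reward] / d pref_a = probs[a] * (rewards[a] - baseline)
        prefs = [h + lr * p * (r - baseline)
                 for h, p, r in zip(prefs, probs, rewards)]
    return prefs


def rl_trained_act(prefs):
    """RL-trained agent: a fixed policy; reward is not an input at run time."""
    return max(range(len(prefs)), key=lambda a: prefs[a])


def reward_maximizer_act(current_rewards):
    """Reward-maximizer: receives reward information and maximizes it."""
    return max(range(len(current_rewards)), key=lambda a: current_rewards[a])


trained_prefs = train_policy(rewards=[1.0, 0.0])   # arm 0 pays off in training
new_rewards = [0.0, 1.0]                           # payoffs flip afterwards

print(rl_trained_act(trained_prefs), reward_maximizer_act(new_rewards))
```

In this toy setup the RL-trained policy keeps choosing the arm that was rewarded during training, while the reward-maximizer follows whatever reward signal it is currently given, which is one way the two senses can come apart.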

Comment by wei_dai on Utility ≠ Reward · 2019-09-12T09:46:05.734Z · score: 6 (3 votes) · LW · GW

A pithy way to put this point is to say that utility ≠ reward, if we want to call the objective a system is optimising its “utility”. (This is by way of metaphor – I don’t suggest that we must model RL agents as expected utility maximizers.)

I just realized that this seems like an unfortunate choice of terminology, because I've been thinking about "utility ≠ reward" in terms of utility-maximizing agent vs reward-maximizing agent, as described in this comment, whereas you're talking about the inner objective function ("utility") vs the outer objective function ("reward"). Given that your usage is merely metaphorical whereas mine seems more literal, it seems a good idea to change your terminology. Otherwise people will be confused by "utility ≠ reward" in the future and not know which distinction it's referring to, or worse infer a different meaning than intended. Does that make sense?

Comment by wei_dai on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-12T09:19:15.019Z · score: 14 (3 votes) · LW · GW

Here's my understanding and elaboration of your first paragraph, to make sure I understand it correctly and to explain it to others who might not:

What we want the training process to produce is a mesa-optimizer that tries to minimize the actual distance between its output and the training label (i.e., the actual loss). We don't want a mesa-optimizer that tries to minimize the output of the physical grading system (i.e., the computed loss). The latter kind of model will hack the grading system if given a chance, while the former won't. However these two models would behave identically in any training episode where reward hacking doesn't occur, so we can't distinguish between them without using some kind of inner alignment technique (which might for example look at how the models work on the inside rather than just how they behave).
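As a toy illustration of the paragraph above (the `Episode` fields and both model functions are hypothetical names, and "hacking" is reduced to a single flag), two models can agree on every episode where the grading system cannot be tampered with and diverge only when it can:

```python
from dataclasses import dataclass


@dataclass
class Episode:
    best_guess: float     # the model's honest estimate of the training label
    hack_available: bool  # whether the grading system can be tampered with


def actual_loss_minimizer(ep):
    """Tries to minimize the actual distance to the label; never hacks."""
    return ("predict", ep.best_guess)


def computed_loss_minimizer(ep):
    """Tries to minimize the grader's output; hacks it when it can."""
    if ep.hack_available:
        return ("hack_grader", None)
    return ("predict", ep.best_guess)


training = [Episode(0.3, False), Episode(1.7, False)]
later_episode = Episode(0.9, True)

same_in_training = all(
    actual_loss_minimizer(e) == computed_loss_minimizer(e) for e in training
)
differs_when_hackable = (
    actual_loss_minimizer(later_episode) != computed_loss_minimizer(later_episode)
)
print(same_in_training, differs_when_hackable)
```

Because the behaviors coincide on the training episodes, a purely behavioral check on that data cannot tell the two models apart in this sketch.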

If we can solve this inner alignment problem then it fully addresses TurnTrout's (Alex Turner's) concern, because the system would no longer be "looking for ways to hurt you."

Comment by wei_dai on Hackable Rewards as a Safety Valve? · 2019-09-12T07:12:44.616Z · score: 5 (2 votes) · LW · GW

I was mostly noting that I hadn’t thought of this, hadn’t seen it mentioned

There was some related discussion back in 2012 but of course you can be excused for not knowing about that. :) (The part about "AIXI would fail due to incorrect decision theory" is in part talking about a reward-maximizing agent doing reward hacking.)

Comment by wei_dai on Is competition good? · 2019-09-12T06:20:07.794Z · score: 10 (4 votes) · LW · GW

I tried to introspect more on why I'm often reluctant to ask for explanations, and came up with these reasons. (But note some of these might just be rationalizations and not my real reasons.)

  1. I already spent quite some time trying to puzzle out the explanation, and asking is like admitting defeat.
  2. If there is a simple explanation that I reasonably could have figured out without asking, I look bad by asking.
  3. It's forcing me to publicly signal interest, and maybe I don't want to do that.
  4. Related to 3, it's forcing me to raise the status of the person I'm asking, by showing that I'm interested in what they're saying. (Relatedly, I worry this might cause people to withhold explanations more often than they should.)
  5. If my request is ignored or denied, I would feel bad, perhaps in part because it seems to lower my status.
  6. I feel annoyed that the commenter didn't value my time enough to preemptively include an explanation, and therefore don't want to interact further with them.
  7. My comment requesting an explanation is going to be read by lots of people for whom it has no value, and I don't want to impose that cost on them, or make them subconsciously annoyed at me, etc.
  8. ETA: By the time the answer comes, the topic may have left my short term memory, plus I may not be that interested anymore.

Comment by wei_dai on Is competition good? · 2019-09-12T02:55:25.040Z · score: 5 (2 votes) · LW · GW

A background belief that has me less-than-fully-enthusiastically agreeing with you is that a stronger norm of “always include explanations and caveats like this” has a decent chance of causing people to not bother writing things at all (esp. if they’re on a busy day).

What about either:

  1. Give at least a short explanation unless you're really busy, or
  2. Use your best judgment of how much explanation to include, but keep in mind that if you include none at all, you might cause a bunch of people to waste time and feel frustrated trying to figure out what you mean or why you think what you think.

they couldn’t ask

Not so much that I couldn't ask (i.e., there's a rule or norm against asking in that situation) but rather that I didn't want to ask (i.e., the possibility and uncertainty of being ignored or denied makes cost-benefit seem to favor not asking). (I only did speak up because I thought it was an opportunity to affect more than this one instance of the situation.) What about an additional norm of, "if no explanation is included, at least say a few words about whether or not you'd be open to providing an explanation upon request"?

Comment by wei_dai on hereisonehand's Shortform · 2019-09-12T02:32:06.263Z · score: 7 (4 votes) · LW · GW

Is this in fact a part of the AI alignment problem, and if so is anyone trying to solve this facet of the problem and where might I go to read more about that?

Yes, it's part of some approaches to the AI alignment problem. It used to be considered more central to AI alignment until people started thinking it might be too hard, and started working on other ways of trying to solve AI alignment that perhaps don't require "finding an effective way to tell an AI what wellbeing is". See AI Safety "Success Stories" where "Sovereign Singleton" requires solving this and the others don't (at least not right away). See also Friendly AI and Coherent Extrapolated Volition.

Comment by wei_dai on G Gordon Worley III's Shortform · 2019-09-12T02:03:07.312Z · score: 10 (4 votes) · LW · GW

Things that are morally abhorrent are not necessarily moral errors. For example I can find wildlife suffering morally abhorrent but there are obviously no moral errors or any kind of errors being committed there. Given that the dictionary defines abhorrent as "inspiring disgust and loathing; repugnant" I think "I find X morally abhorrent" just means "my moral system considers X to be very wrong or to have very low value."

Comment by wei_dai on Is competition good? · 2019-09-11T22:32:39.541Z · score: 5 (2 votes) · LW · GW

Not sure if that changes the rest of your comment.

Yeah, the new quotes at least make it clearer what they're disagreeing with.

exist prominent LW

This was unclear because you just said "people".

who would not agree with the “we can be reasonably confident that the second restaurant ends up canceling their AMF donations decreases value.”

Because of the way you quoted, I had no idea this was the disagreement. Hypotheses I generated included that they disagreed with the monopoly dynamics being described, or the right way to frame monopoly economics.

I understand it being frustrating to not get to understand or discuss the reasons why, but it seems important for it to be a socially acceptable move to say “hey, your blanket statement does not apply to me” without having to take time to explain why.

What about situations like this one, where the commenter just makes a mistake? (One could imagine an even more consequential mistake like saying or implying, for example through misquoting, the opposite of what one intended.) How does that get fixed if there's a norm that people can say something without explaining why (which would discourage others from asking for explanations)? (I'm not necessarily proposing a solution here, just flagging this as an issue.)

In this case my own answer of “am I up for being asked” is “you can certainly ask, and I may or may not get around to responding.”

I think this, if explicitly stated, is better than nothing.

Although I can say briefly that possible reasons here include ‘you might not think AMF is net positive, and you might think the general practice of donating to things like AMF is not a good strategy.’

Even a brief explanation like this would be super helpful.