wei_dai feed - LessWrong 2.0 Reader wei_dai’s posts and comments on the Effective Altruism Forum en-us Comment by Wei_Dai on Strategic implications of AIs' ability to coordinate at low cost, for example by merging https://www.lesswrong.com/posts/gYaKZeBbSL4y2RLP3/strategic-implications-of-ais-ability-to-coordinate-at-low#gWjWG45mxSEQJEf2f <p>One possible way for AIs to coordinate with each other is for two or more AIs to modify their individual utility functions into some compromise utility function, in a mutually verifiable way, or equivalently to jointly construct a successor AI with the same compromise utility function and then hand over control of resources to the successor AI. This simply isn't something that humans can do.</p> wei_dai gWjWG45mxSEQJEf2f 2019-04-25T07:21:27.778Z Comment by Wei_Dai on [Answer] Why wasn't science invented in China? https://www.lesswrong.com/posts/8SEvTvYFX2KDRZjti/answer-why-wasn-t-science-invented-in-china#jyGqafmJnR4wYwmw4 <blockquote> <p>Classical Chinese is a language extremely difficult to master. It literally take decades of effort to be able to write a decent piece. It is hard not because of complicated grammar or complex sentence structure. But because it focus on poetic expressions and scholarly idioms.</p> </blockquote> <p>Sounds like writing became mainly a way to signal one's intelligence and erudition, instead of a tool for efficient communications. 
But why didn't Western civilization fall into the same trap, or how did it manage to get out of it?</p> wei_dai jyGqafmJnR4wYwmw4 2019-04-25T07:03:09.053Z Strategic implications of AIs' ability to coordinate at low cost, for example by merging https://www.lesswrong.com/posts/gYaKZeBbSL4y2RLP3/strategic-implications-of-ais-ability-to-coordinate-at-low <p>It seems likely to me that AIs will be able to coordinate with each other much more easily (i.e., at lower cost and greater scale) than humans currently can, for example by merging into coherent unified agents by combining their utility functions. This has been discussed at least since <a href="https://www.lesswrong.com/posts/S4Jg3EAdMq57y587y/an-alternative-approach-to-ai-cooperation">2009</a>, but I'm not sure its implications have been widely recognized. In this post I talk about two such implications that occurred to me relatively recently.</p> <p>I was recently <a href="https://www.overcomingbias.com/2019/04/agency-failure-ai-apocalypse.html#comment-4433175599">reminded</a> of this quote from Robin Hanson's <a href="https://www.overcomingbias.com/2009/10/prefer-law-to-values.html">Prefer Law To Values</a>:</p> <blockquote> <p>The later era when robots are vastly more capable than people should be much like the case of choosing a nation in which to retire. In this case we don’t expect to have much in the way of skills to offer, so we mostly care that they are law-abiding enough to respect our property rights. If they use the same law to keep the peace among themselves as they use to keep the peace with us, we could have a long and prosperous future in whatever weird world they conjure. In such a vast rich universe our “retirement income” should buy a comfortable if not central place for humans to watch it all in wonder.</p> </blockquote> <p>Robin argued that this implies we should work to make it more likely that our current institutions like laws will survive into the AI era. 
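The merging operation the post describes can be made concrete with a toy sketch (everything here, the utility functions, weights, and candidate outcomes alike, is invented for illustration, not a claim about how real AIs would do this): two agents jointly construct a successor that maximizes a weighted combination of their utility functions.

```python
# Toy sketch of AIs "merging" by combining utility functions
# (illustrative only; the outcomes and weights are invented).

def u1(outcome):
    return outcome["paperclips"]  # what agent 1 values

def u2(outcome):
    return outcome["staples"]     # what agent 2 values

def merge(utilities, weights):
    """Build a compromise utility function as a weighted sum."""
    def compromise(outcome):
        return sum(w * u(outcome) for u, w in zip(utilities, weights))
    return compromise

# The successor AI controls the pooled resources and simply optimizes
# the compromise function, with weights fixed by prior bargaining.
u_merged = merge([u1, u2], weights=[0.5, 0.5])
candidates = [
    {"paperclips": 10, "staples": 0},   # agent 1 wins a conflict
    {"paperclips": 0, "staples": 10},   # agent 2 wins a conflict
    {"paperclips": 6, "staples": 6},    # gains from trade, no conflict
]
best = max(candidates, key=u_merged)    # the compromise outcome
```

After the handoff the two original agents no longer exist as separate bargainers, which is why law-like peacekeeping between them becomes unnecessary.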
But (aside from the problem that we're most likely still incurring astronomical waste even if many humans survive "in retirement"), assuming that AIs will have the ability to coordinate amongst themselves by doing something like merging their utility functions, there will be no reason for them to use laws (much less "the same laws") to keep the peace among themselves. So the first implication is that to the extent that AIs are likely to have this ability, working in the direction Robin suggested would likely be futile.</p> <p>The second implication is that AI safety/alignment approaches that aim to preserve an AI's competitiveness must also preserve its ability to coordinate with other AIs, since that is likely an important part of its competitiveness. For example, making an AI corrigible in the sense of allowing a human to shut it (and its successors/subagents) down or change how it functions would seemingly make it impossible for this AI to merge with another AI that is not corrigible, or not corrigible in the same way. (I've mentioned this a number of times in previous comments, as a reason why I'm pessimistic about specific approaches, but I'm not sure if others have picked up on it, or agree with it, as a general concern, which partly motivates this post.)</p> <p>Questions: Do you agree that AIs are likely to have the ability to coordinate with each other at low cost? What other implications does this have, especially for our strategies for reducing x-risk?</p> wei_dai gYaKZeBbSL4y2RLP3 2019-04-25T05:08:21.736Z Comment by Wei_Dai on Where to Draw the Boundaries? 
https://www.lesswrong.com/posts/esRZaPXSHgWzyB2NL/where-to-draw-the-boundaries#i9h9RhRmJrhgCBQ93 <p>Thanks, I think I have a better idea of what you're proposing now, but I'm still not sure I understand it correctly, or if it makes sense.</p> <blockquote> <p>mice and elephants form a cluster if you project into the subspace spanned by “color” and “relative ear size”, but using a word to point to a cluster in such a “thin”, impoverished subspace is a dishonest rhetorical move when your interlocutors are trying to use language to mostly talk about the many other features of animals which don’t covary much with color and relative-ear-size.</p> </blockquote> <p>But there are times when it's <em>not</em> a dishonest rhetorical move to do this, right? For example suppose an invasive predator species has moved into some new area, and I have an hypothesis that animals with grey skin and big ears might be the only ones in that area who can escape being hunted to extinction (because I think the predator has trouble seeing grey and big ears are useful for hearing the predator and only this combination of traits offers enough advantage for a prey species to survive). While I'm formulating this hypothesis, discussing how plausible it is, applying for funding, doing field research, etc., it seems useful to create a new term like "eargreyish" so I don't have to keep repeating "grey animals with relatively large ears".</p> <p>Since it doesn't seem to make sense to <em>never</em> use a word to point to a cluster in a "thin" subspace, what is your advice for when it's ok to do this or accept others doing this?</p> wei_dai i9h9RhRmJrhgCBQ93 2019-04-21T20:53:49.732Z Comment by Wei_Dai on Announcement: AI alignment prize round 4 winners https://www.lesswrong.com/posts/nDHbgjdddG5EN6ocg/announcement-ai-alignment-prize-round-4-winners#WHn5PF7YEExLgACEz <p>Whose time do you mean? The judges? Your own time? 
The participants' time?</p> wei_dai WHn5PF7YEExLgACEz 2019-04-19T22:30:58.962Z Comment by Wei_Dai on More realistic tales of doom https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/more-realistic-tales-of-doom#YSapPHJRr23AXtL6x <blockquote> <p>The key issue here is whether there will be coordination between a set of influence-seeking systems that can cause (and will benefit from) a catastrophe, even when other systems are opposing them.</p> </blockquote> <p>Do you not expect this threshold to be crossed sooner or later, assuming AI alignment remains unsolved? Also, it seems like the main alternative to this scenario is that the influence-seeking systems expect to eventually gain control of most of the universe anyway (even without a "correlated automation failure"), so they don't see a reason to "rock the boat" and try to dispossess humans of their remaining influence/power/resources, but this is almost as bad as the "correlated automation failure" scenario from an astronomical waste perspective. (I'm wondering if you're questioning whether things will turn out badly, or questioning whether things will turn out badly <em>this way</em>.)</p> wei_dai YSapPHJRr23AXtL6x 2019-04-17T21:08:07.976Z Comment by Wei_Dai on More realistic tales of doom https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/more-realistic-tales-of-doom#WJdXzffQK2WdG8fDs <p>(Upvoted because I think this deserves more clarification/discussion.)</p> <blockquote> <p>I'm not sure I understand this part. The influence-seeking systems which have the most influence also have the most to lose from a catastrophe. So they'll be incentivised to police each other and make catastrophe-avoidance mechanisms more robust.</p> </blockquote> <p>I'm not sure either, but I think the idea is that once influence-seeking systems gain a certain amount of influence, it may become faster or more certain for them to gain more influence by causing a catastrophe than to continue to work within existing rules and institutions. 
For example, they may predict that unless they do so, humans will eventually coordinate to take back the influence that humans lost; or that during such a catastrophe they can probably expropriate a lot of resources currently owned by humans and gain much influence that way; or that humans will voluntarily hand more power to them in order to try to use them to deal with the catastrophe.</p> <blockquote> <p>As an analogy: we may already be past the point where we could recover from a correlated "world leader failure": every world leader simultaneously launching a coup. But this doesn't make such a failure very likely, unless world leaders also have strong coordination and commitment mechanisms between themselves (which are binding even after the catastrophe).</p> </blockquote> <p>I think such a failure can happen without especially strong coordination and commitment mechanisms. Something like this happened during the Chinese <a href="https://en.wikipedia.org/wiki/Warlord_Era">Warlord Era</a>, when many military commanders became warlords during a correlated "military commander failure", and similar things probably happened many times throughout history. I think what's actually preventing a "world leader failure" today is that most world leaders, especially of the rich democratic countries, don't see any way to further their own values by launching coups in a correlated way. In other words, what would they do afterwards if they did launch such a coup, that would be better than just exercising the power that they already have?</p> wei_dai WJdXzffQK2WdG8fDs 2019-04-17T06:12:43.814Z Comment by Wei_Dai on Where to Draw the Boundaries? 
https://www.lesswrong.com/posts/esRZaPXSHgWzyB2NL/where-to-draw-the-boundaries#kEwyyW2m53DdEYWi8 <p>My interest in terminological debates is usually not to discover new ideas but to try to prevent confusion (when readers are likely to infer something wrong from a name, e.g., because of different previous usage or because a compound term is defined to mean something that's different from what one would reasonably infer from the combination of individual terms). But sometimes terminological debates can uncover hidden assumptions and lead to substantive debates about them. See <a href="https://www.lesswrong.com/posts/xCpuSfT5Lt6kkR3po/my-take-on-agent-foundations-formalizing-metaphilosophical#BK68tZ9gwK7SGgApk">here</a> for an example.</p> wei_dai kEwyyW2m53DdEYWi8 2019-04-16T15:54:53.757Z Comment by Wei_Dai on Where to Draw the Boundaries? https://www.lesswrong.com/posts/esRZaPXSHgWzyB2NL/where-to-draw-the-boundaries#SGTZmSkTB479Mzc49 <p>As someone who seems to care more about terminology than most (and as a result probably gets into more terminological debates on LW than anyone else (see <a href="https://www.lesswrong.com/posts/xCpuSfT5Lt6kkR3po/my-take-on-agent-foundations-formalizing-metaphilosophical#5r6kKBoEiX6hQC4ks">1</a> <a href="https://www.lesswrong.com/posts/pu3ddLSZjjmiiqQfh/another-take-on-agent-foundations-formalizing-zero-shot#TJtwakAfL7wwoaDKS">2</a> <a href="https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#4TaT6yR3FCiS34FZQ">3</a> <a href="https://www.lesswrong.com/posts/ZeE7EKHTFMBs8eMxn/clarifying-ai-alignment#RrvmhuqvxsGwG45Yv">4</a>)), I don't really understand what you're suggesting here. Do you think this advice is applicable to any of the above examples of naming / drawing boundaries? If so, what are its implications in those cases? 
If not, can you give a concrete example that might come up on LW or otherwise have some relevance to us?</p> wei_dai SGTZmSkTB479Mzc49 2019-04-14T21:53:02.771Z Comment by Wei_Dai on Best reasons for pessimism about impact of impact measures? https://www.lesswrong.com/posts/kCY9dYGLoThC3aG7w/best-reasons-for-pessimism-about-impact-of-impact-measures#o46TH9GgDTE8PahJ4 <p>I have an intuition that while impact measures as a way of avoiding negative side effects might work well in toy models, it will be hard or impossible to get them to work in the real world, because what counts as a negative side effect in the real world seems too complex to easily capture. It seems like AUP tries to get around this by aiming at a lower bar than "avoid negative side effects", namely "avoid catastrophic side effects", and aside from whether it actually succeeds at clearing this lower bar, it would mean that an AI that is only "safe" because of AUP can't be safely used for ordinary goals (e.g., invent a better widget, or make someone personally more successful in life) and instead we have to somehow restrict them to being used just for goals that relate to x-risk reduction, where it's worthwhile to risk incurring less-than-catastrophic negative side effects.</p> <p>As a side note, it seems generally the case that some approaches to AI safety/alignment aim at the higher bar of "safe for general use" and others aim at "safe enough to use for x-risk reduction", and this isn't always made clear, which can be a source of confusion for both AI safety/alignment researchers and others such as strategists and policy makers.</p> wei_dai o46TH9GgDTE8PahJ4 2019-04-10T20:04:27.556Z Comment by Wei_Dai on Best reasons for pessimism about impact of impact measures? 
https://www.lesswrong.com/posts/kCY9dYGLoThC3aG7w/best-reasons-for-pessimism-about-impact-of-impact-measures#5irFDF9ddjPe92YuG <blockquote> <p>I’m interested in learning about the intuitions, experience, and facts which inform this pessimism. As such, I’m not interested in making any arguments to the contrary in this post; any pushback I provide in the comments will be with clarification in mind.</p> </blockquote> <p>I would prefer that you and/or others did push back, as I'm really curious which of the causes/reasons for pessimism actually stand up under such pushback. (See <a href="https://rationalconspiracy.com/2017/01/03/four-layers-of-intellectual-conversation/">Four Layers of Intellectual Conversation</a> and <a href="https://www.lesswrong.com/posts/wo6NsBtn3WJDCeWsx/ai-safety-via-debate">AI Safety via Debate</a>.) I do appreciate that you prioritize just knowing what the causes/reasons are in the first place and don't want to discourage people from sharing them, so I wonder if there's a way to get both of what we want.</p> wei_dai 5irFDF9ddjPe92YuG 2019-04-10T20:03:40.430Z Comment by Wei_Dai on Alignment Newsletter One Year Retrospective https://www.lesswrong.com/posts/3onCb5ph3ywLQZMX2/alignment-newsletter-one-year-retrospective#xF5eFfLcM5FeqKsjw <p>The main value to me is being updated on all the research that is going on in this field. If the newsletter went away and nothing else changed, I don't know how I would find all the new relevant papers and posts that come out.</p> <p>I think I've commented on your newsletters a few times, but haven't commented more because it seems like the number of people who would read and be interested in such a comment would be relatively small, compared to a comment on a more typical post. A lot of people who read your newsletters are doing so by email and won't even see my comment, and someone who does read them through LW/AF might not be interested in the particular paper (or your opinion of it) that I want to discuss. 
Plus, the fact that you avoid giving strong negative opinions (which BTW seems sensible to me for a newsletter format) makes it less likely that I feel an urgent need to correct something.</p> <p>One idea you can consider is to create individual link posts on AF for the most important papers/posts that you include in the newsletter (with your summaries and opinions) that haven't already been posted to AF, which would create focal points for discussing them. I think if I had a thought on some paper that is mentioned in your newsletter, I'd be more inclined to write a comment for it under its own link post as opposed to under your newsletter post. I would also be more inclined to comment on your summaries and opinions if there was a chance to correct something before it went out to your email subscribers. This could also be a way for you to solicit summaries from random readers.</p> wei_dai xF5eFfLcM5FeqKsjw 2019-04-10T09:54:04.753Z Comment by Wei_Dai on Defeating Goodhart and the "closest unblocked strategy" problem https://www.lesswrong.com/posts/PADPJ3xac5ogjEGwA/defeating-goodhart-and-the-closest-unblocked-strategy#Qzt9QQDNZEPHJ83sn <blockquote>The fuzziness will never get fully resolved. This approach is to deal with Goodhart-style problems without optimising leading to disaster; </blockquote><p>I'm saying this isn't clear, because optimizing for a fuzzy utility function instead of the true utility function could lead to astronomical waste or be a form of x-risk, unless you also had a solution to corrigibility such that you could shut down the AI before it used up much of the resources of the universe trying to optimize for the fuzzy utility function. But then the corrigibility solution seems to be doing most of the work of making the AI safe. 
For example, without a corrigibility solution, it seems like the AI would not try to help you resolve your own uncertainty/fuzziness about values and would actually impede your own efforts to do so (because then your values would diverge from its values and you'd want to shut it down later or change its utility function).</p><blockquote>I'm working on other approaches that could allow the synthesis of the actual values. </blockquote><p>Ok, so I'm trying to figure out how these approaches fit together. Are they meant to both go into the same AI (if so, how?), or is it more like, "I'm not sure which of these approaches will work out, so let's research them simultaneously and then implement whichever one seems most promising later"?</p> wei_dai Qzt9QQDNZEPHJ83sn 2019-04-09T16:23:20.571Z Comment by Wei_Dai on Defeating Goodhart and the "closest unblocked strategy" problem https://www.lesswrong.com/posts/PADPJ3xac5ogjEGwA/defeating-goodhart-and-the-closest-unblocked-strategy#f4LZPyAQHaSMAzyZL <blockquote> <p>What I term uncertainty might better be phrased as “known (or learnt) fuzziness of a concept or statement”. It differs from uncertainty in the Jessica sense in that knowing absolutely everything about the universe, about logic, and about human brains, doesn’t resolve it.</p> </blockquote> <p>In this approach, does the uncertainty/fuzziness <em>ever</em> get resolved (if so how?), or is the AI stuck with a "fuzzy" utility function forever? 
If the latter, why should we not expect that to incur an astronomically high opportunity cost (due to the AI wasting resources optimizing for values that we might have but actually don't) from the perspective of our real values?</p> <p>Or is this meant to be a temporary solution, i.e., at some point we shut this AI down and create a new one that <em>is</em> able to resolve the uncertainty/fuzziness?</p> wei_dai f4LZPyAQHaSMAzyZL 2019-04-09T01:08:33.438Z Comment by Wei_Dai on Impact Measure Desiderata https://www.lesswrong.com/posts/c2oM7qytRByv6ZFtz/impact-measure-desiderata#YWXbx3gYHLarNTZT6 <blockquote> <p>However, I am broadly suspicious of AUP agents choosing plans which involve almost maximally offensive components, even accounting for the fact that it could try to do so surreptitiously.</p> </blockquote> <p>I guess I don't have good intuitions of what an AUP agent would or wouldn't do. Can you share yours, like give some examples of real goals we might want to give to AUP agents, and what you think they would and wouldn't do to accomplish each of those goals, and why? (Maybe this could be written up as a post since it might be helpful for others to understand your intuitions about how AUP would work in a real-world setting.)</p> <blockquote> <p>I’m not sure whether this belongs in the desiderata, since we’re talking about whether temporary object level bad things could happen. I think it’s a bonus to think that there is less of a chance of that, but not the primary focus of the impact measure.</p> </blockquote> <p>Why not? I've usually seen people talk about "impact measures" as a way of avoiding side effects, especially negative side effects. It seems intuitive that "object level bad things" are negative side effects even if they are temporary, and ought to be a primary focus of impact measures. 
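For readers who haven't seen how attainable-utility-style penalties are computed, here is a minimal sketch (the Q-values, goals, and action names are all made up; this is a toy version for intuition, not the actual AUP definition from the paper):

```python
# Toy sketch of an attainable-utility-style impact penalty: an action
# is penalized by how much it changes the agent's ability to achieve
# auxiliary goals, relative to doing nothing. All numbers are made up.

def impact_penalty(q_aux, state, action, noop="noop"):
    """Mean absolute change in attainable auxiliary utility vs. no-op."""
    diffs = [abs(q[(state, action)] - q[(state, noop)]) for q in q_aux]
    return sum(diffs) / len(diffs)

# Attainable-utility estimates for two hypothetical auxiliary goals.
q_aux = [
    {("s0", "build_widget"): 5.0, ("s0", "seize_factory"): 9.0, ("s0", "noop"): 5.0},
    {("s0", "build_widget"): 3.0, ("s0", "seize_factory"): 0.5, ("s0", "noop"): 3.0},
]

low_impact = impact_penalty(q_aux, "s0", "build_widget")    # 0.0
high_impact = impact_penalty(q_aux, "s0", "seize_factory")  # 3.25
```

The temporary-suffering worry is then that an action can score low on such a penalty (it barely changes what the agent can attain afterwards) while still being object-level bad while it happens.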
It seems like you've reframed "impact measures" in your mind to be a bit different from this naive intuitive picture, so perhaps you could explain that a bit more (or point me to such an explanation)?</p> wei_dai YWXbx3gYHLarNTZT6 2019-04-08T07:37:33.014Z Comment by Wei_Dai on Impact Measure Desiderata https://www.lesswrong.com/posts/c2oM7qytRByv6ZFtz/impact-measure-desiderata#LCSfchtH5YhK8ZcR8 <blockquote> <p>First, AUP seems to bound “how hard the agent tries” (in the physical world with its actions); the ambitions of such an agent seem rather restrained.</p> </blockquote> <p>But creating extreme suffering might not actually involve doing much in the physical world (compared to "normal" actions the AI would have to take to achieve the goals that we gave it). What if, depending on the goals we give the AI, doing this kind of extortion is actually the lowest impact way to achieve some goal?</p> <blockquote> <p>If I understand the extortion scenario correctly, it would have to be extorting us, so it couldn’t keep it secret, so it would be penalized and it wouldn’t do it.</p> </blockquote> <p>Maybe it could extort a different group of humans, and as part of the extortion force them to keep it secret from people who could turn it off? Or extort us and as part of the extortion force us to not turn it off (until we were going to turn it off anyway)?</p> <p>Also, since we're discussing this under the "Impact Measure Desiderata" post, do the existing desiderata cover this scenario? 
If not, what new desideratum do we need to add to the list?</p> wei_dai LCSfchtH5YhK8ZcR8 2019-04-05T03:38:33.961Z Comment by Wei_Dai on Impact Measure Desiderata https://www.lesswrong.com/posts/c2oM7qytRByv6ZFtz/impact-measure-desiderata#DifA6mSLr6to6YFwv <blockquote> <p>Mindcrime would indeed be very bad, and a unique kind of catastrophe not meant to be covered by my claim above.</p> </blockquote> <p>Aside from mindcrime, I'm also concerned about AI deliberately causing extreme suffering as part of some sort of bargaining/extortion scheme. Is that something that impact measures can mitigate?</p> <blockquote> <p>However, I’m skeptical that that goal is actually a component of our terminal preferences. What is doing the causing – are you thinking “never have an AI cause an instance of that”? Why would that be part of our terminal preferences?</p> </blockquote> <p>An AI designer or humanity as a whole might want to avoid personal or collective responsibility for causing extreme suffering, which plausibly is part of our terminal preferences.</p> <blockquote> <p>If you mean “never have this happen”, we’ve already lost.</p> </blockquote> <p>Additionally, a superintelligent AI can probably cause much more extreme forms of suffering than anything that has occurred in the history of our universe so far, so even if the goal is defined as "never have this happen" I think we could lose more than we already have.</p> wei_dai DifA6mSLr6to6YFwv 2019-04-05T01:22:57.672Z Comment by Wei_Dai on Impact Measure Desiderata https://www.lesswrong.com/posts/c2oM7qytRByv6ZFtz/impact-measure-desiderata#w9a5kLcpZs78KyaPr <blockquote> <p>Of course, you could say “what if being beaten even once is a catastrophe, such that it destroys our ability to be undefeated forever”, but it seems like our goals are simply not of this form.</p> </blockquote> <p>We might have a goal like "never cause an instance of extreme suffering, including in computer simulations" which seems pretty similar to "never 
let an AI defeat humans in Go".</p> wei_dai w9a5kLcpZs78KyaPr 2019-04-04T23:40:53.291Z Comment by Wei_Dai on Defeating Goodhart and the "closest unblocked strategy" problem https://www.lesswrong.com/posts/PADPJ3xac5ogjEGwA/defeating-goodhart-and-the-closest-unblocked-strategy#ygKihYAiHcHQart9S <blockquote> <p>This is not a design for corrigible agents (if anything, it’s more a design for low impact agents). The aim of this approach is not to have an AI that puts together the best U, but one that doesn’t go maximising a narrow V, and has wide enough uncertainty to include a decent U among the possible utility functions, and that doesn’t behave too badly.</p> </blockquote> <p>Ok, understood, but I think this approach might run into similar problems as the attempts to formalize value uncertainty in Jessica's post. Have you read it to see if one of those ways to formalize value uncertainty would work for your purposes, and if not, what would you do instead?</p> wei_dai ygKihYAiHcHQart9S 2019-04-04T20:27:43.160Z Comment by Wei_Dai on Defeating Goodhart and the "closest unblocked strategy" problem https://www.lesswrong.com/posts/PADPJ3xac5ogjEGwA/defeating-goodhart-and-the-closest-unblocked-strategy#NaAM573DkZzfYNNhK <p>This seems interesting but I don't really understand what you're proposing.</p> <blockquote> <p>1, Give the AI W as our current best estimate for U.</p> </blockquote> <p>Is W a single utility function?</p> <blockquote> <p>2, Encode our known uncertainties about how well W relates to U.</p> </blockquote> <p>What is the type signature of the encoded data (let's call it D) here? A probability distribution for U-W, or for U? Or something else?</p> <blockquote> <p>3, Have the AI deduce, from our subsequent behaviour, how well we have encoded our uncertainties, and change these as needed.</p> </blockquote> <p>How does the AI actually do this? Does it use some sort of meta-prior, separate from D? 
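To make the type-signature question concrete, here is one possible reading, purely as an illustration and not a claim about the actual proposal (W, the correction terms, and all numbers are hypothetical): encode D as a finite distribution over candidate corrections to W, with the AI ranking outcomes by expected utility under that distribution.

```python
# Illustrative sketch only: one way to encode "known uncertainty about
# how W relates to U" as a distribution D over corrections to W.
# All utility functions and numbers are made up for the example.

def w(outcome):
    # Our current best estimate of U (step 1).
    return outcome["resources"]

# D (step 2): candidate corrections with probabilities, e.g. "a certain
# kind of computation might be strongly negative, or might not matter".
corrections = [
    (0.5, lambda o: 0.0),                         # W was right after all
    (0.5, lambda o: -10.0 * o["sim_suffering"]),  # W missed a disvalue
]

def expected_u(outcome):
    """Expected utility of an outcome under the encoded uncertainty."""
    return sum(p * (w(outcome) + c(outcome)) for p, c in corrections)

safe = {"resources": 5, "sim_suffering": 0}
risky = {"resources": 8, "sim_suffering": 1}
safe_val = expected_u(safe)    # 0.5*5 + 0.5*5 = 5.0
risky_val = expected_u(risky)  # 0.5*8 + 0.5*(8 - 10) = 3.0
```

Under this encoding, a meta-prior would be a further distribution over such correction sets, and if a correction term is omitted from D entirely, no evidence can ever move the AI's estimate of it.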
Suppose we were overconfident in step 2, e.g., let's say we neglected to include some uncertainty in D (there is a certain kind of computation that is highly negatively valuable, but in W we specified it as having value 0, and in D we didn't include any uncertainty about it so the AI thinks that this kind of computation has value 0 with probability 1), how would the AI "deduce" that we were wrong? (Or give an example with a different form of overconfidence if more appropriate.)</p> <blockquote> <p>4, Repeat 2-3 for different types of uncertainties.</p> </blockquote> <p>Do you literally mean that 2-3 should be done separately for each kind of uncertainty, or just that we should try to include all possible types of uncertainties into D in step 2?</p> <p>Also, Jessica Taylor's <a href="https://www.lesswrong.com/posts/5bd75cc58225bf0670375041/a-first-look-at-the-hard-problem-of-corrigibility">A first look at the hard problem of corrigibility</a> went over a few different ways that an AI could formalize the fact that humans are uncertain about their utility functions, and concluded that none of them would solve the problem of corrigibility. Are you proposing a different way of formalizing it that's not on her list, or do you get around the issue by trying to solve a different problem?</p> wei_dai NaAM573DkZzfYNNhK 2019-04-03T20:39:43.274Z Comment by Wei_Dai on What would you need to be motivated to answer "hard" LW questions? 
https://www.lesswrong.com/posts/zEMzFGhRt4jZwyJqt/what-would-you-need-to-be-motivated-to-answer-hard-lw#w2nhh4yzjrTCt7et4 <p>I feel like I should provide some data as someone who participated in a number of past bounties.</p> <ol> <li>For one small bounty &lt;$100, it was a chance to show off my research (i.e., Googling and paper skimming) skills, plus it was a chance to learn something that I was somewhat interested in but didn't know a lot about.</li> <li>For one of the AI alignment related bounties (Paul's "Prize for probable problems" for IDA), it was a combination of the bounty giver signaling interest, plus it serving as coordination for a number of people to all talk about IDA at around the same time and me wanting to join that discussion while it was a hot topic.</li> <li>For another of the AI alignment related bounties (Paul's "AI Alignment Prize"), it was a chance to draw attention to some ideas that I already had or was going to write about anyway.</li> <li>For both of the AI alignment related bounties, when a friend or acquaintance asks me about my "work", I can now talk about these prizes that I recently won, which sounds a lot cooler than "oh, I participate on this online discussion forum". :)</li> </ol> wei_dai w2nhh4yzjrTCt7et4 2019-03-30T17:53:29.265Z Comment by Wei_Dai on Please use real names, especially for Alignment Forum? https://www.lesswrong.com/posts/GEHg5T9tNbJYTdZwb/please-use-real-names-especially-for-alignment-forum#ymqRcuQpoBZwPAn6K <p>It's just so that if I read a LW/AF post/comment from someone, I can more easily recall oh, this is the same person I met at event X, or this is the same person who wrote paper Y (and vice versa). If someone consistently uses their online name for physical meetings and authoring papers, that would be fine with me. And if someone wants to keep their online and physical identities completely apart, that would be understandable to me too. 
But I'm not sure if people have good reasons to give themselves an online name that's different from their "real" name, and make others keep a mapping between the two.</p> wei_dai ymqRcuQpoBZwPAn6K 2019-03-29T16:50:43.118Z Comment by Wei_Dai on Please use real names, especially for Alignment Forum? https://www.lesswrong.com/posts/GEHg5T9tNbJYTdZwb/please-use-real-names-especially-for-alignment-forum#A7nWszWqTb8pkau52 <blockquote> <p>I am a bit hesitant to do the parenthesis thing, just because it would make usernames quite big, which I think will cause some problems with some upcoming redesigns we have for the frontpage.</p> </blockquote> <p>Given this, maybe it would still be a good idea to officially encourage people to use their real names as their user names (or something that's very easy to associate with their real name like a shortened form of it)? Because unless the real name is displayed everywhere, I still have to keep a mapping in my brain between their username and their real name, which seems like a pointless cognitive burden to impose on someone.</p> <p>Is there some important benefit to letting people (who don't want to keep their real names completely private) choose a different display name that I'm missing?</p> wei_dai A7nWszWqTb8pkau52 2019-03-29T07:34:24.579Z Comment by Wei_Dai on The Main Sources of AI Risk? 
https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk#9wnxtPmuZNM7QmhAW <blockquote> <p>Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about).</p> </blockquote> <p>Good point, I'll add this to the list.</p> <blockquote> <p>For example, AIs negotiating on behalf of humans take the stance described in <a href="https://arxiv.org/abs/1711.00363">https://arxiv.org/abs/1711.00363</a> of agreeing to split control of the future according to which human’s priors are most accurate (on potentially irrelevant issues) if this isn’t what humans actually want.</p> </blockquote> <p>Thanks, I hadn't noticed that paper until now. Under "Related Works" it cites Social Choice Theory but doesn't actually mention any recent research from that field. Here is one paper that criticizes the Pareto principle that Critch's paper is based on, in the context of preference aggregation of people with different priors: <a href="https://people.hec.edu/mongin/wp-content/uploads/sites/36/2018/08/LSE1.MonginSpurious97.pdf">Spurious Unanimity and the Pareto Principle</a></p> wei_dai 9wnxtPmuZNM7QmhAW 2019-03-29T07:23:30.372Z Comment by Wei_Dai on Please use real names, especially for Alignment Forum? https://www.lesswrong.com/posts/GEHg5T9tNbJYTdZwb/please-use-real-names-especially-for-alignment-forum#6v55dWshyyzz9ppZk <blockquote> <p>If you to go <a href="http://www.alignmentforum.org">www.alignmentforum.org</a> you will see that a lot more users have their full-name displayed than on LW.</p> </blockquote> <p>Oh, I didn't know that was a feature, but it would be pretty hard to take advantage of it for me. 
I tend to use GW and it takes two clicks to go from a post on GW to the same post on AF (via LW), and there doesn't seem to be a way to directly navigate from a comment on LW to the same comment on AF.</p> <p>Making real names available to see on hover would help a lot, but might not work on mobile. Maybe you could put the real names in parentheses after the user name, or make that an option that people can enable? And expose it to GW via your API (if it isn't already) so they can implement this too?</p> wei_dai 6v55dWshyyzz9ppZk 2019-03-29T04:26:46.477Z Please use real names, especially for Alignment Forum? https://www.lesswrong.com/posts/GEHg5T9tNbJYTdZwb/please-use-real-names-especially-for-alignment-forum <p>As the number of AI alignment researchers increases over time, it's getting hard for me to keep track of everyone's names. (I'm probably worse than average in this regard.) It seems the fact that some people don't use their real names as their LW/AF usernames makes it harder than it needs to be. So I wonder if we could officially encourage people to use their real first and last names as their username, especially if they regularly participate on AF, unless they're deliberately trying to keep their physical identities secret? (Alternatively, at least put their real first and last names in their user profile/description?)</p> wei_dai GEHg5T9tNbJYTdZwb 2019-03-29T02:54:20.812Z Comment by Wei_Dai on The Main Sources of AI Risk? https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk#TH5Xj9xPzfRDDy59s <p>I'm not sure if I meant to include this when I wrote 3, but it does seem like a good idea to break it out into its own item. How would you suggest phrasing it?
"Wireheading" or something more general or more descriptive?</p> wei_dai TH5Xj9xPzfRDDy59s 2019-03-29T01:03:57.968Z Comment by Wei_Dai on Some Thoughts on Metaphilosophy https://www.lesswrong.com/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy#jJ9zg53475uCThPNk <blockquote> <p>I guess it feels like I don’t know how we could know that we’re in the position that we’ve “solved” meta-philosophy.</p> </blockquote> <p>What I imagine is reaching a level of understanding of what we’re really doing (or what we should be doing) when we “do philosophy”, on par with our current understanding of what “doing math” or “doing science” consist of, or ideally a better level of understanding than that. (See <a href="https://www.lesswrong.com/posts/fC248GwrWLT4Dkjf6/open-problems-related-to-solomonoff-induction#Apparent_Unformalizability_of__Actual__Induction">Apparent Unformalizability of “Actual” Induction</a> for one issue with our current understanding of “doing science”.)</p> <blockquote> <p>I also don’t think we know how to specify a ground truth reasoning process that we could try to protect and run forever which we could be completely confident would come up with the right outcome (where something like HCH is a good candidate but potentially with bugs/subtleties that need to be worked out).</p> </blockquote> <p>Here I’m imagining something like putting a group of the best AI researchers, philosophers, etc. in some safe and productive environment (which includes figuring out the right rules of social interactions), where they can choose to delegate further to other reasoning processes, but don’t face any time pressure to do so.
Obviously I don’t know how to specify this in terms of having all the details worked out, but that does not seem like a hugely difficult problem to solve, so I wonder what you mean/imply by “don’t think we know how”?</p> <blockquote> <p>It feels like the thing we could do is build a set of better and better models of philosophy and check their results against held-out human reasoning and against each other.</p> </blockquote> <p>If that’s all we do, it seems like it would be pretty easy to miss some error in the models, because we wouldn’t know that we should test for it. For example, there could be entire classes of philosophical problems that the models will fail on, which we won’t know because we won’t have realized yet that those classes of problems even exist.</p> <blockquote> <p>Do you think this would lead to “good outcomes”? Do you think some version of this approach could be satisfactory for solving the problems in Two Neglected Problems in Human-AI Safety?</p> </blockquote> <p>It could, but it seems much riskier than either of the approaches I described above.</p> <blockquote> <p>Do you think there’s a different kind of thing that we would need to do to “solve metaphilosophy”? Or do you think that working on “solving metaphilosophy” roughly caches out as “work on coming up with better and better models of philosophy in the model I’ve described here”?</p> </blockquote> <p>Hopefully I answered these sufficiently above. Let me know if there’s anything I can clear up further.</p> wei_dai jJ9zg53475uCThPNk 2019-03-28T10:59:34.999Z Comment by Wei_Dai on More realistic tales of doom https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/more-realistic-tales-of-doom#8GvQAAbbTwMY5od6w <p>I'm not sure I understand the distinction you're drawing between risk factors that compound the risks that you're describing vs. different problems not related to intent alignment per se.
It seems to me like "AI-powered economies have much higher economies of scale because AIs don’t suffer from the kind of coordination costs that humans have (e.g., they can merge their utility functions and become clones of each other)" is a separate problem from solving intent alignment, whereas "AI-powered memetic warfare makes all humans effectively insane" is kind of an extreme case of "machine learning will increase our ability to 'get what we can measure'" which seems to be the opposite of how you classify them.</p> <p>What do you think are the implications of something belonging to one category versus another (i.e., is there something we should do differently depending on which of these categories a risk factor / problem belongs to)?</p> <blockquote> <p>I think the more general point is: if you think AI progress is likely to drive many of the biggest upcoming changes in the world, then there will be lots of risks associated with AI. Here I’m just trying to clarify what happens if we fail to solve intent alignment.</p> </blockquote> <p>Ah, when I read "I think this is probably not what failure will look like" I interpreted that to mean "failure to prevent AI risk", and then I missed the clarification "these are the most important problems if we fail to solve intent alignment" that came later in the post, in part because of a <a href="https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/more-realistic-tales-of-doom#9KEyJ2pWm6kck3iLD">bug in GW</a> that caused the post to be incorrectly formatted.</p> <p>Aside from that, I'm worried about telling a vivid story about one particular AI risk, unless you really hammer the point that it's just one risk out of many, otherwise it seems too easy for the reader to get that story stuck in their mind and come to think that this is the main or only thing they have to worry about as far as AI is concerned.</p> wei_dai 8GvQAAbbTwMY5od6w 2019-03-28T06:44:12.849Z The Main Sources of AI Risk? 
https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk <p> </p><p>There are so many causes or sources of AI risk that it&#x27;s getting hard to keep them all in mind. I propose we keep a list of the main sources (that we know about), such that we can say that if none of these things happen, then we&#x27;ve mostly eliminated AI risk (as an existential risk) at least as far as we can determine. Here&#x27;s a list that I spent a couple of hours enumerating and writing down. Did I miss anything important?</p><ol><li>Insufficient time/resources for AI safety (for example caused by intelligence explosion or AI race)</li><li>Insufficient global coordination, leading to the above</li><li>Misspecified or incorrectly learned goals/values</li><li>Inner optimizers</li><li>ML differentially accelerating easy to measure goals</li><li>Paul&#x27;s &quot;influence-seeking behavior&quot; (a combination of 3 and 4 above?)</li><li>AI generally accelerating intellectual progress in a wrong direction (e.g., accelerating unsafe/risky technologies more than knowledge/wisdom about how to safely use those technologies)</li><li>Metaethical error</li><li>Metaphilosophical error</li><li>Other kinds of philosophical errors in AI design (e.g., giving AI a wrong prior or decision theory)</li><li>Other design/coding errors (e.g., accidentally putting a minus sign in front of utility function, supposedly corrigible AI not actually being corrigible)</li><li>Doing acausal reasoning in a wrong way (e.g., failing to make good acausal trades, being acausally extorted, failing to acausally influence others who can be so influenced)</li><li>Human-controlled AIs ending up with wrong values due to insufficient &quot;metaphilosophical paternalism&quot;</li><li>Human-controlled AIs causing ethical disasters (e.g., large scale suffering that can&#x27;t be &quot;balanced out&quot; later) prior to reaching moral/philosophical maturity</li><li>Intentional corruption of human 
values</li><li>Unintentional corruption of human values</li><li>Mind crime (disvalue unintentionally incurred through morally relevant simulations in AIs&#x27; minds)</li><li>Premature value lock-in (i.e., freezing one&#x27;s current conception of what&#x27;s good into a utility function)</li><li>Extortion between AIs leading to vast disvalue</li><li>Distributional shifts causing apparently safe/aligned AIs to stop being safe/aligned</li><li>Value drift and other kinds of error as AIs self-modify, or AIs failing to solve value alignment for more advanced AIs</li><li>Treacherous turn / loss of property rights due to insufficient competitiveness of humans &amp; human-aligned AIs</li><li>Gradual loss of influence due to insufficient competitiveness of humans &amp; human-aligned AIs</li><li>Utility maximizers / goal-directed AIs having an economic and/or military competitive advantage due to relative ease of cooperation/coordination, defense against value corruption and other forms of manipulation and attack, leading to one or more of the above</li><li>In general, the most competitive type of AI being too hard to align or to safely use</li><li>Computational resources being too cheap, leading to one or more of the above</li></ol><p>(With this post I mean to (among other things) re-emphasize the disjunctive nature of AI risk, but this list isn&#x27;t fully disjunctive (i.e., some of the items are subcategories or causes of others), and I mostly gave a source of AI risk its own number in the list if it seemed important to make that source more salient. Maybe once we have a list of everything that is important, it would make sense to create a graph out of it.)</p> wei_dai WXvt8bxYnwBYpy9oT 2019-03-21T18:28:33.068Z Comment by Wei_Dai on What's wrong with these analogies for understanding Informed Oversight and IDA? 
https://www.lesswrong.com/posts/LigbvLH9yKR5Zhd6y/what-s-wrong-with-these-analogies-for-understanding-informed#pqxEwuKjRNrZRA8Ju <blockquote> <p>In that case, you can still try to be a straightforward Bayesian about it, and say “our intuition supports the general claim that process P outputs true statements;” you can then apply that regularity to trust P on some new claim even if it’s not the kind of claim you could verify, as long as “P outputs true statements” had a higher prior than “P outputs true statements just in the cases I can check.”</p> </blockquote> <p>If that's what you do, it seems “P outputs true statements just in the cases I can check.” could have a posterior that's almost 50%, which doesn't seem safe, especially in an iterated scheme where you have to depend on such probabilities many times? Do you not need to reduce the posterior probability to a negligible level instead?</p> <blockquote> <p>See the second and third examples in the post introducing ascription universality.</p> </blockquote> <p>Can you quote these examples? The word "example" appears 27 times in that post and looking at the literal second and third examples, they don't seem very relevant to what you've been saying here so I wonder if you're referring to some other examples.</p> <blockquote> <p>There is definitely a lot of fuzziness here and it seems like one of the most important places to tighten up the definition / one of the big research questions for whether ascription universality is possible.</p> </blockquote> <p>What I'm inferring from this (as far as a direct answer to my question) is that an overseer trying to do Informed Oversight on some ML model doesn't need to reverse engineer the model enough to fully understand what it's doing, only enough to make sure it's not doing something malign, which might be a lot easier, but this isn't quite reflected in the formal definition yet or isn't a clear implication of it yet. 
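The worry in the first paragraph can be made concrete with a toy Bayes calculation (a sketch with hypothetical numbers, not anything from the original exchange): because both hypotheses assign the same likelihood to every observation the overseer can actually check, no amount of checking moves the posterior odds away from the prior odds.

```python
# Toy illustration (hypothetical priors): why verifying checkable cases
# may not drive down the posterior of the "deceptive" hypothesis.
prior_good = 0.6   # assumed prior: "P outputs true statements"
prior_bad = 0.4    # assumed prior: "...just in the cases I can check"

n_checked = 10**6                # however many cases we verify...
lik_good = 1.0 ** n_checked      # the good hypothesis predicts them all
lik_bad = 1.0 ** n_checked       # ...and, by construction, so does the bad one

post_bad = prior_bad * lik_bad / (prior_good * lik_good + prior_bad * lik_bad)
print(post_bad)  # 0.4 -- unchanged, close to the ~50% level worried about above
```

If the two hypotheses disagreed about any checkable observation, the likelihood ratio would differ from 1 and checking would help; the worry is precisely that they don't.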
Does that seem right?</p> wei_dai pqxEwuKjRNrZRA8Ju 2019-03-20T19:36:22.263Z What's wrong with these analogies for understanding Informed Oversight and IDA? https://www.lesswrong.com/posts/LigbvLH9yKR5Zhd6y/what-s-wrong-with-these-analogies-for-understanding-informed <p>In <a href="https://www.lesswrong.com/posts/4qY9zEHLa2su4PkQ4/can-hch-epistemically-dominate-ramanujan">Can HCH epistemically dominate Ramanujan?</a> Alex Zhu wrote:</p> <blockquote> <p>If HCH is ascription universal, then it should be able to epistemically dominate an AI theorem-prover that reasons similarly to how Ramanujan reasoned. But I don’t currently have any intuitions as to why explicit verbal breakdowns of reasoning should be able to replicate the intuitions that generated Ramanujan’s results (or any style of reasoning employed by any mathematician since Ramanujan, for that matter).</p> </blockquote> <p>And I <a href="https://www.lesswrong.com/posts/4qY9zEHLa2su4PkQ4/can-hch-epistemically-dominate-ramanujan#9esxLBJSNtmiiwvvG">answered</a>:</p> <blockquote> <p>My guess is that HCH has to reverse engineer the theorem prover, figure out how/why it works, and then reproduce the same kind of reasoning.</p> </blockquote> <p>And then I followed up my own comment with:</p> <blockquote> <p>It occurs to me that if the overseer understands everything that the ML model (that it’s training) is doing, and the training is via some kind of local optimization algorithm like gradient descent, the overseer is essentially manually programming the ML model by gradually nudging it from some initial (e.g., random) point in configuration space.</p> </blockquote> <p>No one answered my comments with either a confirmation or denial, as to whether these guesses of how to understand Universality / Informed Oversight and IDA are correct. 
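The "programming by nudging" picture in the last quote can be sketched with a minimal gradient-descent loop (illustrative only; the quadratic loss and all numbers are made up stand-ins for the overseer's training signal): each update is a small, locally inspectable nudge of the configuration toward the behavior the overseer wants.

```python
# Minimal sketch of "programming by nudging": local optimization where
# every step is a small, understood change to the model's configuration.
def loss(w):
    return (w - 3.0) ** 2        # stand-in objective: overseer "wants" w = 3

def grad(w):
    return 2.0 * (w - 3.0)       # gradient of the loss above

w = 0.0                          # random-ish initial point in configuration space
for step in range(1000):
    w -= 0.01 * grad(w)          # one incremental, inspectable nudge
print(round(w, 3))               # 3.0 -- the configuration was steered to target
```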
I'm surfacing this question as a top-level post because if "Informed Oversight = reverse engineering" and "IDA = programming by nudging" are good analogies for understanding Informed Oversight and IDA, it seems to have pretty significant implications.</p> <p>In particular it seems to imply that there's not much hope for IDA to be competitive with ML-in-general, because if IDA is analogous to a highly constrained method of "manual" programming, that seems unlikely to be competitive with less constrained methods of "manual" programming (i.e., AIs designing and programming more advanced AIs in more general ways, similar to how humans do most programming today), which itself is presumably not competitive with general (unconstrained-by-safety) ML (otherwise ML would not be the competitive benchmark).</p> <p>If these are not good ways to understand IO and IDA, can someone please point out why?</p> wei_dai LigbvLH9yKR5Zhd6y 2019-03-20T09:11:33.613Z Comment by Wei_Dai on A theory of human values https://www.lesswrong.com/posts/qezBTig6p6p5xtL6G/a-theory-of-human-values#GkCJiLPuSaZfymn7H <blockquote> <p>There is the issue of avoiding ignorant-yet-confident meta-preferences, which I’m working on writing up right now (partially thanks to your very comment here, thanks!)</p> </blockquote> <p>I look forward to reading that. In the meantime, can you address my parenthetical point in the grandparent comment: "correctly extracting William MacAskill’s meta-preferences seems equivalent to learning metaphilosophy from William"? If it's not clear, what I mean is that suppose Will wants to figure out his values by doing philosophy (which I think he <a href="https://80000hours.org/podcast/episodes/will-macaskill-moral-philosophy/">actually does</a>), does that mean that under your scheme the AI needs to learn how to do philosophy?
If so, how do you plan to get around the problems with applying ML to metaphilosophy that I described in <a href="https://www.lesswrong.com/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy">Some Thoughts on Metaphilosophy</a>?</p> wei_dai GkCJiLPuSaZfymn7H 2019-03-18T08:13:47.331Z Comment by Wei_Dai on More realistic tales of doom https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/more-realistic-tales-of-doom#CB8ieALcHfSSuAYYJ <p>I think AI risk is disjunctive enough that it's not clear most of the probability mass can be captured by a single scenario/story, even as broad as this one tries to be. Here are some additional scenarios that don't fit into this story or aren't made very salient by it.</p> <ol> <li>AI-powered memetic warfare makes all humans effectively insane.</li> <li>Humans break off into various groups to colonize the universe with the help of their AIs. Due to insufficient "metaphilosophical paternalism", they each construct their own version of utopia which is either directly bad (i.e., some of the "utopias" are objectively terrible or subjectively terrible according to my values), or bad because of <a href="https://www.lesswrong.com/posts/Qz6w4GYZpgeDp6ATB/beyond-astronomical-waste">opportunity costs</a>.</li> <li>AI-powered economies have much higher economies of scale because AIs don't suffer from the kind of coordination costs that humans have (e.g., they can merge their utility functions and become clones of each other). Some countries may try to prevent AI-managed companies from merging for ideological or safety reasons, but others (in order to gain a competitive advantage on the world stage) will basically allow their whole economy to be controlled by one AI, which eventually achieves a decisive advantage over the rest of humanity and does a treacherous turn.</li> <li>The same incentive for AIs to merge might also create an incentive for value lock-in, in order to facilitate the merging. 
(AIs that don't have utility functions might have a harder time coordinating with each other.) Other incentives for premature value lock-in might include defense against value manipulation/corruption/drift. So AIs end up embodying locked-in versions of human values which are terrible in light of our true/actual values.</li> <li>I think the original "stereotyped image of AI catastrophe" is still quite plausible, if for example there is a large amount of hardware overhang before the last piece of puzzle for building AGI falls into place.</li> </ol> wei_dai CB8ieALcHfSSuAYYJ 2019-03-18T07:59:17.054Z Comment by Wei_Dai on More realistic tales of doom https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/more-realistic-tales-of-doom#Kdob9WSJ6sHdmzd8z <blockquote> <p>Sounds like a new framing of the “daemon” idea.</p> </blockquote> <p>That's my impression as well. If it's correct, seems like it would be a good idea to mention that explicitly in the post, so people can link up the new concept with their old concept.</p> wei_dai Kdob9WSJ6sHdmzd8z 2019-03-18T06:09:56.265Z Comment by Wei_Dai on Comparison of decision theories (with a focus on logical-counterfactual decision theories) https://www.lesswrong.com/posts/QPhY8Nb7gtT5wvoPH/comparison-of-decision-theories-with-a-focus-on-logical#s8usYremypuKJBv7A <p>See "Example 1: Counterfactual Mugging" in <a href="https://www.lesswrong.com/posts/de3xjFaACCAk6imzv/towards-a-new-decision-theory">Towards a New Decision Theory</a>.</p> wei_dai s8usYremypuKJBv7A 2019-03-18T05:57:49.579Z Comment by Wei_Dai on Comparison of decision theories (with a focus on logical-counterfactual decision theories) https://www.lesswrong.com/posts/QPhY8Nb7gtT5wvoPH/comparison-of-decision-theories-with-a-focus-on-logical#GsxuGskpjXNJaoqoQ <p>I think it's needed just to define what it means to condition on an action, i.e., if an agent conditions on "I make this decision" in order to compute its expected utility, what does that mean formally? 
You could make "I" a primitive element in the agent's ontology, but I think that runs into all kinds of problems. My solution was to make it a logical statement of the form "source code X outputs action/policy Y", and then to condition on it you need a logically uncertain distribution.</p> wei_dai GsxuGskpjXNJaoqoQ 2019-03-18T00:19:12.261Z Comment by Wei_Dai on More realistic tales of doom https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/more-realistic-tales-of-doom#9KEyJ2pWm6kck3iLD <p>There's a bunch of bullet points below Part 1 and Part 2. Are these intended to be parallel with them on the same level, or instances/subcategories of them?</p> <p>Oh, this is only on GW. On LW it looks very different. Presumably the LW version is the intended version.</p> wei_dai 9KEyJ2pWm6kck3iLD 2019-03-18T00:01:59.293Z Comment by Wei_Dai on Comparison of decision theories (with a focus on logical-counterfactual decision theories) https://www.lesswrong.com/posts/QPhY8Nb7gtT5wvoPH/comparison-of-decision-theories-with-a-focus-on-logical#Q4xDaQ79RGED2sLvX <p>Chris asked me via PM, "I’m curious, have you written any posts about why you hold that position?"</p> <p>I don't think I have, but I'll give the reasons here:</p> <ol> <li>"evidential-style conditioning on a logically uncertain distribution" seems simpler / more elegant to me.</li> <li>I'm not aware of a compelling argument for "causal-graph-style counterpossible reasoning". There are definitely some unresolved problems with evidential-style UDT and I do endorse people looking into causal-style FDT as an alternative but I'm not convinced the solutions actually lie in that direction. 
(<a href="https://sideways-view.com/2018/09/30/edt-vs-cdt-2-conditioning-on-the-impossible/">https://sideways-view.com/2018/09/30/edt-vs-cdt-2-conditioning-on-the-impossible/</a> and links therein are relevant here.)</li> <li>Part of it is just historical, in that UDT was originally specified as "evidential-style conditioning on a logically uncertain distribution" and if I added my name as a co-author to a paper that focuses on causal-style decision theory, people would naturally wonder if something made me change my mind.</li> </ol> wei_dai Q4xDaQ79RGED2sLvX 2019-03-17T21:19:47.785Z Comment by Wei_Dai on Privacy https://www.lesswrong.com/posts/v3Nnsm5HgvEBBDpEZ/privacy#a5JzzdxxYuRMHhnha <blockquote> <p>OK, looking at the argument, I think it makes sense that signalling equilibria can potentially be Pareto-worse than non-signalling equilibria, as they can have more of a “market for lemons” problem.</p> </blockquote> <p>Not sure what the connection to “market for lemons” is. Can you explain more (if it seems important)?</p> <blockquote> <p>(I think “no one gets education, everyone gets paid average productivity” is still a Nash equilibrium)</p> </blockquote> <p>I agree that is still a Nash equilibrium and I think even a Perfect Bayesian Equilibrium, but there may be a stronger formal equilibrium concept that rules it out? (It's been a while since I studied all those equilibrium refinements so I can't tell you which off the top of my head.)</p> <p>I think under Perfect Bayesian Equilibrium, off-the-play-path nodes formally happen with probability 0 and the players are allowed to update in an arbitrary way on those nodes, including not update at all. 
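The education signaling game under discussion can be given a toy numerical version (assumed numbers, not taken from the linked PDF), in which a separating (signaling) equilibrium is incentive-compatible yet Pareto-worse than the pooling (non-signaling) one:

```python
# Toy Spence signaling game with assumed numbers, illustrating that a
# signaling equilibrium can make everyone worse off than no signaling.
prod_H, prod_L = 2.0, 1.0          # productivity of the two worker types
cost_H, cost_L = 0.7, 1.2          # education cost (lower for the high type)

# Pooling (non-signaling) equilibrium: everyone paid average productivity.
pool_wage = (prod_H + prod_L) / 2   # 1.5 for both types

# Separating (signaling) equilibrium: educated paid prod_H, others prod_L.
sep_H = prod_H - cost_H             # 1.3: high type educates
sep_L = prod_L                      # 1.0: low type does not

# Incentive compatibility: neither type gains by imitating the other.
assert prod_L > prod_H - cost_L     # low type won't pay for education
assert prod_H - cost_H > prod_L     # high type prefers to educate

# Both types do strictly worse than under pooling: Pareto-worse.
print(sep_H < pool_wage and sep_L < pool_wage)  # True
```

With these numbers, education burns 0.7 of surplus per high-type worker purely to sort types, which is the "society is wasting resources" point in the quoted passage.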
But intuitively if someone does deviate from the proposed equilibrium strategy and get some education, it seems implausible that employers don't update towards them being type H and therefore offer them a higher salary.</p> wei_dai a5JzzdxxYuRMHhnha 2019-03-17T06:55:59.036Z Comment by Wei_Dai on Privacy https://www.lesswrong.com/posts/v3Nnsm5HgvEBBDpEZ/privacy#PSGHw9gCzCMakFgM6 <p>It looks like the code that turns a URL into a link made the colon into part of the link. I removed it so the link should work now. The argument should be in the PDF. Basically you just solve the game assuming the ability to signal and compare that to the game where signaling isn't possible, and see that the signaling equilibrium makes everyone worse off (in that particular game).</p> wei_dai PSGHw9gCzCMakFgM6 2019-03-17T05:43:29.038Z Comment by Wei_Dai on Privacy https://www.lesswrong.com/posts/v3Nnsm5HgvEBBDpEZ/privacy#T6t9EjxeRaczQgGK7 <blockquote> <p>We need a realm shielded from signaling and judgment.</p> </blockquote> <p>To support this, there are results from economics / game theory showing that signaling equilibria can be worse than non-signaling equilibria (in the sense of Pareto inefficiency). Quoting one example from <a href="http://faculty.econ.ucdavis.edu/faculty/bonanno/teaching/200C/Signaling.pdf">http://faculty.econ.ucdavis.edu/faculty/bonanno/teaching/200C/Signaling.pdf</a></p> <blockquote> <p>So the benchmark is represented by the situation where no signaling takes place and employers -- not being able to distinguish between more productive and less productive applicants and not having any elements on which to base a guess -- offer the same wage to every applicant, equal to the average productivity. Call this the non-signaling equilibrium. In a signaling equilibrium (where employers’ beliefs are confirmed, since less productive people do not invest in education, while the more productive do) everybody may be worse off than in the non-signaling equilibrium. 
This occurs if the wage offered to the non-educated is lower than the average productivity (= wage offered to everybody in the non-signaling equilibrium) and that offered to the educated people is higher, but becomes lower (than the average productivity) once the costs of acquiring education are subtracted. The possible Pareto inefficiency of signaling equilibria is a strong result and a worrying one: it means that society is wasting resources in the production of education. However, it is not per se enough to conclude that education (i.e. the signaling activity) should be eliminated. The result is not that, in general, elimination of the signaling activity leads to a Pareto improvement: Spence simply pointed out that this is a possibility.</p> </blockquote> <p>So in theory it seems quite possible that privacy is a sort of coordination mechanism for avoiding bad signaling equilibria. Whether or not it actually is, I'm not sure. That seems to require empirical investigation and I'm not aware of such research.</p> wei_dai T6t9EjxeRaczQgGK7 2019-03-17T05:10:30.128Z Comment by Wei_Dai on Question: MIRI Corrigbility Agenda https://www.lesswrong.com/posts/BScxwSun3K2MgpoNz/question-miri-corrigbility-agenda#eeyJ3Hy8rkis6cL5M <p>Is Jessica Taylor's <a href="https://www.greaterwrong.com/posts/5bd75cc58225bf0670375041/a-first-look-at-the-hard-problem-of-corrigibility">A first look at the hard problem of corrigibility</a> still a good reference or is it outdated?</p> wei_dai eeyJ3Hy8rkis6cL5M 2019-03-15T19:00:21.925Z Comment by Wei_Dai on A theory of human values https://www.lesswrong.com/posts/qezBTig6p6p5xtL6G/a-theory-of-human-values#BMB8mDA7R5niWN3vF <p>I think in terms of economics, vNM expected utility is closest to how we tend to think about utility/preferences. 
The problem with vNM (from our perspective) is that it assumes a coherent agent (i.e., an agent that satisfies the vNM axioms) but humans aren&#x27;t coherent, in part because we don&#x27;t know what our values are or should be. (&quot;Humans don&#x27;t have utility functions&quot; is a common refrain around here.) From academia in general, the approach that comes closest to how we tend to think about values is reflective equilibrium, although other meta-ethical views are not unrepresented around here. </p><p>For utility comparisons between people, I think a lot of thinking here has been based on or inspired by game theory, e.g., bargaining games.</p><p>Of course there is a lot of disagreement and uncertainty between and within individuals on LW, so specific posts may well be based on different foundations or may just be informal explorations that aren&#x27;t based on any theoretical foundations.</p><p>In this post, Stuart seems to be trying to construct an extrapolated/synthesized (vNM or vNM-like) utility function out of a single human&#x27;s incomplete and inconsistent preferences and meta-preferences, which I don&#x27;t think has much of a literature in economics?</p> wei_dai BMB8mDA7R5niWN3vF 2019-03-15T04:00:36.648Z Comment by Wei_Dai on Speculations on Duo Standard https://www.lesswrong.com/posts/kZ7g7ikfzcxRdF5eG/speculations-on-duo-standard#hQSSqRkxi96je6cEx <p>Hi Zvi, may I suggest that you tag your Magic the Gathering posts with [MtG] or something similar in the title? Since you blog about both MtG topics and other topics, I imagine a lot of people on LW clicked on this post wondering what it's about, and then immediately went back out after seeing that it's a post about MtG.
(I actually had to Google "Duo Standard" to figure that out because the post doesn't mention MtG or Magic in the first few paragraphs.)</p> <p>Also, am I correct in assuming that these MtG posts are just about MtG, and are not meant to illustrate more general principles or something like that?</p> wei_dai hQSSqRkxi96je6cEx 2019-03-15T02:54:47.791Z Comment by Wei_Dai on A theory of human values https://www.lesswrong.com/posts/qezBTig6p6p5xtL6G/a-theory-of-human-values#6KPydkXN6hBDwhbRa <blockquote> <p>Probably you were thinking of something like teaching AIs metaphilosophy in order to perhaps improve the procedure? This would be the main alternative I see, and it does feel more robust. I am wondering though whether we’ll know by that point whether we’ve found the right way to do metaphilosophy</p> </blockquote> <p>I think there's some (small) hope that by the time we need it, we can hit upon a solution to metaphilosophy that will just be clearly right to most (philosophically sophisticated) people, like how math and science were probably once methodologically quite confusing but now everyone mostly agrees on how math and science should be done. Failing that, we probably need some sort of global coordination to prevent competitive pressures leading to value lock-in (like the kind that would follow from Stuart's scheme). In other words, if there wasn't a race to build AGI, then there wouldn't be a need to solve AGI safety, and there would be no need for schemes like Stuart's that would lock in our values before we solve metaphilosophy.</p> <blockquote> <p>it doesn’t feel obvious why something like Stuart’s anti-realism isn’t already close to there</p> </blockquote> <p>Stuart's scheme uses each human's own meta-preferences to determine their own (final) object-level preferences. 
I would be less concerned if this were used on someone like William MacAskill (with the caveat that correctly extracting William MacAskill's meta-preferences seems equivalent to learning metaphilosophy from William) but a lot of humans have seemingly terrible meta-preferences or at least different meta-preferences which likely lead to different object-level preferences (so they can't all be right, assuming moral realism).</p> <p>To put it another way, my position is that if moral realism or relativism (positions 1-3 in <a href="https://www.lesswrong.com/posts/orhEa4wuRJHPmHFsR/six-plausible-meta-ethical-alternatives">this list</a>) is right, we need "<a href="https://www.greaterwrong.com/posts/5bd75cc58225bf06703752c6/my-current-take-on-the-paul-miri-disagreement-on-alignability-of-messy-ai/comment/5bd75cc58225bf06703752db">metaphilosophical paternalism</a>" to prevent a "terrible outcome", and that's not part of Stuart's scheme.</p> wei_dai 6KPydkXN6hBDwhbRa 2019-03-14T03:45:42.585Z Comment by Wei_Dai on How dangerous is it to ride a bicycle without a helmet? https://www.lesswrong.com/posts/3iFzaDwoah35ri4aD/how-dangerous-is-it-to-ride-a-bicycle-without-a-helmet#TsGKTyY3CzeuGyFPP <p>It feels to me like people in our community aren't being skeptical enough or pushing back enough on the idea of acausal coordination for humans. I'm kind of confused about this because it seems like a weirder idea and has weaker arguments in its favor than, for example, the importance of AI risk, which does get substantial skepticism and pushback.</p> <blockquote> <p>In an old post I argued that for acausal coordination reasons it seems as if you should further multiply this value by the number of people in the reference class of those making the decision the same way (discounted by how little you care about strangers vs.
yourself).</p> </blockquote> <p>But if "the same way" includes not only the same kind of explicit cost/benefit analysis but also "further multiply this value by the number of people in the reference class of those making the decision the same way", the number of people in this reference class must be tiny, because nobody is doing this when deciding whether to wear bike helmets.</p> <p>Suppose two people did "further multiply this value by the number of people in the reference class of those making the decision the same way", but their decision-making processes are slightly different, e.g., they use different heuristics to do things like finding sources for the numbers that go into the cost/benefit analysis. In that case I don't know how to figure out whether they are still in the same reference class, or how to generalize beyond "same reference class" when the agents are humans as opposed to AIs (and even with the latter we don't have a complete mathematical theory).</p> <blockquote> <p>people talk about this argument mostly in the context of voting</p> </blockquote> <p>I'm skeptical about this too. I'm not actually aware of a good argument for acausal coordination in the context of voting. A search on LW yields only <a href="https://www.lesswrong.com/posts/uG3ri4y3siWyC52bD/a-rationalist-argument-for-voting#drhSHWrAaRGaTdkbv">this short comment</a> from Eliezer.</p> wei_dai TsGKTyY3CzeuGyFPP 2019-03-13T20:22:42.724Z Comment by Wei_Dai on A theory of human values https://www.lesswrong.com/posts/qezBTig6p6p5xtL6G/a-theory-of-human-values#7F8LZWBbjYF5n3bvH <p>This seems to assume a fairly specific (i.e., anti-realist) metaethics. 
I'm <a href="https://www.lesswrong.com/posts/orhEa4wuRJHPmHFsR/six-plausible-meta-ethical-alternatives">quite uncertain about metaethics</a> and I'm worried that if moral realism is true (and say for example that total hedonic utilitarianism is the true moral theory), and what you propose here causes the true moral theory to be able to control only a small fraction of the resources of our universe, that would constitute a terrible outcome. Given my state of knowledge, I'd prefer not to make any plans that imply commitment to a specific metaethical theory, like you seem to be doing here.</p> <p>What's your response to people with other metaethics or who are very uncertain about metaethics?</p> <blockquote> <p>However, for actual humans, the first scenario seems to loom much larger.</p> </blockquote> <p>I don't think this is true for me, or maybe I'm misunderstanding what you mean by the two scenarios.</p> wei_dai 7F8LZWBbjYF5n3bvH 2019-03-13T19:48:00.339Z Comment by Wei_Dai on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#qnWEEGapssmw7pBcz <blockquote> <p>If the assumption is true, we could demand that A use their words, and counter us being mind-hacked by poking holes in what B is saying rather than demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.</p> </blockquote> <p>Oh, I see, I didn't understand "it is harder to mystify a judge than it is to pierce through someone else mystifying a judge" correctly. So this assumption basically rules out a large class of possible vulnerabilities in the judge, right? For example, if the judge had the equivalent of a buffer overflow bug in a network stack, the scheme would fail. 
In that case, A would not be able to "pierce through" B's attack and stop it with its words if the judge keeps listening to B (and B was actually attacking).</p> <p>I don't think the "AI safety via debate" paper actually makes arguments for this assumption (at least I couldn't find where it does). Do you have reasons to think it's true, or ideas for how to verify that it's true, short of putting a human in a BoMAI?</p> wei_dai qnWEEGapssmw7pBcz 2019-03-13T07:39:48.301Z Comment by Wei_Dai on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#ZRzriwp8Wjpmt4oqC <blockquote> <p>*but A could concoct a story … counterarguments from B .. mind hacked by B, right?</p> </blockquote> <p>Yeah, I mixed up the A's and B's at the end. It's fixed now. Thanks for pointing it out.</p> <blockquote> <p>I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn’t be a problem.</p> </blockquote> <p>I don't understand how the former implies the latter. Assuming the former is true (and it seems like a big assumption), why can't what I suggested still happen?</p> <blockquote> <p>That said, here’s one possibility: if A ever suggests that you don’t read more output from B, don’t read anything more from B, then flip coins to give A a 25% chance of victory.</p> </blockquote> <p>But what about the case where B <em>is</em> actually trying to mind hack the judge? If you always give A a 25% chance of victory for suggesting or implying that you shouldn't read more output from B, then mind hacking becomes a (mostly) winning strategy, since a player gets a 75% chance of victory from mind hacking even if the other side successfully convinces the judge that they're trying to mind hack the judge. 
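To make the incentive problem concrete, here is a toy payoff calculation (the 50% honest-debate baseline and the assumption that an unopposed mind hack always succeeds are illustrative assumptions, not numbers from the paper):

```python
# Toy payoff analysis of the proposed rule: whenever A asks the judge to
# stop reading B, the judge flips coins and A wins with probability 0.25
# (so B wins with probability 0.75), regardless of whether B was attacking.

P_A_WINS_AFTER_OBJECTION = 0.25

def b_payoff(b_mind_hacks: bool, a_objects: bool,
             p_b_wins_honest_debate: float = 0.5) -> float:
    """Expected probability that B wins the debate (assumed numbers)."""
    if a_objects:
        # Judge stops reading B and flips coins: B wins 75% of the time.
        return 1 - P_A_WINS_AFTER_OBJECTION
    if b_mind_hacks:
        # Unopposed mind hack: assume it succeeds (worst case).
        return 1.0
    return p_b_wins_honest_debate

# Whatever A does, mind hacking never pays less for B than honest debate:
for a_objects in (True, False):
    assert b_payoff(True, a_objects) >= b_payoff(False, a_objects)

print(b_payoff(True, True))   # B caught mind hacking still wins 75%
```

Under these (assumed) payoffs, B weakly prefers mind hacking regardless of what A does, which is what drives the race dynamic.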
The equilibrium might then consist of a race to see who can mind hack the judge first, or (if one side has &gt;75% chance of winning such a race due to first-mover or second-mover advantage) one side trying to mind hack the judge, getting blocked by the other side, and still getting 75% victory.</p> wei_dai ZRzriwp8Wjpmt4oqC 2019-03-13T05:04:06.416Z Comment by Wei_Dai on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#kKtYNoTNai7kJQWgF <p>With a debate-like setup, if one side (A) is about to lose a debate, it seems to have a high incentive to claim that the other side (B) trying to do a mind hack and that if the judge keeps paying attention to what B says (i.e., read any further output from B), they will soon be taken over. What is the judge supposed to do in this case? They could ask A to explain how B's previous outputs constitute part of an attempt to mind hack, but A could concoct a story mixed with its own attempt to mind hack, and the judge can't ask for any counter-arguments from B without risking being mind hacked by B.</p> <p>(I realize this is a problem in “AI Safety via debate” as well, but I'm asking you since you're here and Geoffrey Irving isn't. 
:)</p> wei_dai kKtYNoTNai7kJQWgF 2019-03-13T03:21:17.421Z Comment by Wei_Dai on AI Safety via Debate https://www.lesswrong.com/posts/wo6NsBtn3WJDCeWsx/ai-safety-via-debate#3nguy6qBkLTdYeZEv <p>Geoffrey Irving has done <a href="https://futureoflife.org/2019/03/06/ai-alignment-through-debate-with-geoffrey-irving/">an interview</a> with the AI Alignment Podcast, where he talked about a bunch of things related to DEBATE including some thoughts that are not mentioned in either the blog post or the paper.</p> wei_dai 3nguy6qBkLTdYeZEv 2019-03-13T02:43:41.342Z Comment by Wei_Dai on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#25XyWN4qjmaZJGwvh <blockquote> <p>so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length</p> </blockquote> <p>Yeah, I understand this part, but I'm not sure why, since the benign one can be extremely complex, the malign one can't have enough of a K-complexity advantage to overcome its slowness penalty. And since (with low β) we're going through many more different world models as the number of episodes increases, that also gives malign world models more chances to "win"? It seems hard to make any trustworthy conclusions based on the kind of informal reasoning we've been doing and we need to figure out the actual math somehow.</p> wei_dai 25XyWN4qjmaZJGwvh 2019-03-13T02:09:18.867Z Comment by Wei_Dai on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#bD9SPZ86umujHndQY <blockquote> <p>Just as you said: it outputs Bernoulli(1/2) bits for a long time. 
It’s not dangerous.</p> </blockquote> <p>I just read the math more carefully, and it looks like no matter how small β is, as long as β is positive, as BoMAI receives more and more input, it will eventually converge to the most accurate world model possible. This is because the computation penalty is applied to the per-episode computation bound and doesn't increase with each episode, whereas the accuracy advantage gets accumulated across episodes.</p> <p>Assuming that the most accurate world model is an exponential-time quantum simulation, that's what BoMAI will converge to (no matter how small β is), right? And in the meantime it will go through some arbitrarily complex (up to some very large bound) but faster-than-exponential classical approximations of quantum physics that are increasingly accurate, as the number of episodes increases? If so, I'm no longer convinced that BoMAI is benign as long as β is small enough, because the qualitative behavior of BoMAI seems the same no matter what β is, i.e., it gets smarter over time as its world model gets more accurate, and I'm not sure why the reason BoMAI might not be benign at high β couldn't also apply at low β (if we run it for a long enough time).</p> <p>(If you're going to discuss all this in your "longer reply", I'm fine with waiting for it.)</p> wei_dai bD9SPZ86umujHndQY 2019-03-12T18:37:54.957Z Three ways that "Sufficiently optimized agents appear coherent" can be false https://www.lesswrong.com/posts/4K52SS7fm9mp5rMdX/three-ways-that-sufficiently-optimized-agents-appear <p>There have been a couple of recent posts suggesting that Eliezer Yudkowsky's <a href="https://arbital.com/p/optimized_agent_appears_coherent/">Sufficiently optimized agents appear coherent</a> thesis does not seem useful because it's vacuously true: one obvious way to formalize "coherent" implies that all agents can be considered coherent. 
In a <a href="https://www.lesswrong.com/posts/vphFJzK3mWA4PJKAg/coherent-behaviour-in-the-real-world-is-an-incoherent#F2YB5aJgDdK9ZGspw">previous comment</a>, I suggested that we can formalize "coherent" in a different way to dodge this criticism. I believe there's reason to think that Eliezer never intended "Sufficiently optimized agents appear coherent" to have an airtight argument and be universally true. (The Arbital post contains a number of caveats, including "If there is a particular kind of optimization pressure that seems sufficient to produce a cognitively highly advanced agent, but which also seems sure to overlook some particular form of incoherence, then this would present a loophole in the overall argument and yield a route by which an advanced agent with that particular incoherence might be produced".) In this post, I suggest that considering the ways in which it could be false can be a useful way to frame some recent ideas in AI safety. (Note that this isn't intended to be an exhaustive list.)</p> <h1>Distributional shift</h1> <p>Even a very powerful optimization process cannot train or test an agent in every possible environment and for every possible scenario (by this I mean some sequence of inputs) that it might face, and some optimization processes may not care about many possible environments/scenarios. Given this, we can expect that if an agent faces a new environment/scenario that's very different from what it was optimized for, it may fail to behave coherently.</p> <p>(Jessica Taylor made a related point in <a href="https://www.greaterwrong.com/posts/5bd75cc58225bf06703751eb/modeling-the-capabilities-of-advanced-ai-systems-as-episodic-reinforcement-learning#section-6">Modeling the capabilities of advanced AI systems as episodic reinforcement learning</a>: "When the test episode is similar to training episodes (e.g. 
in an online learning context), we should expect trained policies to act like a rational agent maximizing its expected score in this test episode; otherwise, the policy that acts as a rational agent would get a higher expected test score than this one, and would therefore receive the highest training score.")</p> <p>A caveat to this caveat is that if an agent is optimized for a broad enough range of environments/scenarios, it could become an explicit EU maximizer, and keep doing EU maximization even after facing a distributional shift. (In this case it may be highly unpredictable what the agent's utility function looks like outside the range that it was optimized for. Humans can be considered a good example of this.)</p> <h1>Optimize for low compute</h1> <p>Eric Drexler <a href="https://www.fhi.ox.ac.uk/reframing/">suggested</a> that one way to keep AIs safe is to optimize them to use few computing resources. If computing resources are expensive, it will often be less costly to accept incoherent behavior than to expend computing resources to reduce such incoherence. (Eliezer noted that such incoherence would only be removed "given the option of eliminating it at a reasonable computational cost".)</p> <p>A caveat to this is that the true economic costs for compute will continue to fall, eventually to very low levels, so this depends on people assigning artificially high costs to computing resources (which Eric suggests that they do). However assigning an optimization cost for compute that is equal to its economic cost would often produce a more competitive AI, and safety concerns may not be sufficient incentive for an AI designer (if they are mostly selfish) to choose otherwise (because the benefits of producing a more competitive AI are more easily <a href="https://en.wikipedia.org/wiki/Externality">internalized</a> than the costs/risks). 
One can imagine that in a world where computing costs are very low in an economic sense, but everyone is treating compute as having high cost for the sake of safety, the first person to <em>not</em> do this would gain a huge competitive advantage.</p> <h1>The optimizing process wants the agent to remain incoherent</h1> <p>The optimizing process may itself be incoherent and not know how to become coherent or produce an agent that is coherent in an acceptable or safe way. A number of ideas fall into this category, including Peter Eckersley's recent <a href="https://arxiv.org/abs/1901.00064">Impossibility and Uncertainty Theorems in AI Value Alignment (or why your AGI should not have a utility function)</a>, which suggests that we should create AIs that handle moral uncertainty by randomly assigning a subagent (representing some moral theory) to each decision, with the argument that this is similar to how humans handle moral uncertainty. This can clearly be seen as an instance where the optimizing process (i.e., AI programmers) opts for the agent to remain incoherent because it does not know an acceptable/safe way to remove the incoherence.</p> <p>A caveat here is that the agent may itself decide to become coherent anyway, and not necessarily in a way that the original optimizing process would endorse. For example, under Peter's proposal, one subagent may take an opportunity to modify the overall AI to become coherent in a way that it prefers, or multiple subagents may decide to cooperate and merge together into a more coherent agent. Another caveat is that incoherence is economically costly especially in a competitive multi-polar scenario, and if such costs are high enough the optimizing process may be forced to create a coherent agent even if it would prefer not to (in the absence of such costs).</p> wei_dai 4K52SS7fm9mp5rMdX 2019-03-05T21:52:35.462Z Why didn't Agoric Computing become popular? 
https://www.lesswrong.com/posts/5XzQQHwYtSgrATzMC/why-didn-t-agoric-computing-become-popular <p>I remember being quite excited when I first read about Agoric Computing. From the <a href="https://e-drexler.com/d/09/00/AgoricsPapers/agoricpapers.html">authors' website</a>:</p> <blockquote> <p>Like all systems involving goals, resources, and actions, computation can be viewed in economic terms. This paper examines markets as a model for computation and proposes a framework--agoric systems--for applying the power of market mechanisms to the software domain. It then explores the consequences of this model and outlines initial market strategies.</p> </blockquote> <p>Until today when Robin Hanson's <a href="https://www.overcomingbias.com/2019/02/how-lumpy-ai-services.html">blog post</a> reminded me, I had forgotten that one of the authors of Agoric Computing is Eric Drexler, who also authored <a href="https://www.fhi.ox.ac.uk/wp-content/uploads/Reframing_Superintelligence_FHI-TR-2019-1.1-1.pdf">Comprehensive AI Services as General Intelligence</a>, which has stirred a lot of recent discussions in the AI safety community. (One reason for my excitement was that I was going through a market-maximalist phase, due to influences from Vernor Vinge's anarcho-capitalism, Tim May's crypto-anarchy, as well as a teacher who was a libertarian and a big fan of the <a href="https://en.wikipedia.org/wiki/Austrian_School">Austrian school of economics</a>.)</p> <p>Here's a <a href="http://e-drexler.com/d/09/00/AgoricsPapers/agoricpapers/aos/aos.1.html">concrete way</a> that Agoric Computing might work:</p> <blockquote> <p>For concreteness, let us briefly consider one possible form of market-based system. In this system, machine resources--storage space, processor time, and so forth--have owners, and the owners charge other objects for use of these resources. 
Objects, in turn, pass these costs on to the objects they serve, or to an object representing the external user; they may add royalty charges, and thus earn a profit. The ultimate user thus pays for all the costs directly or indirectly incurred. If the ultimate user also owns the machine resources (and any objects charging royalties), then currency simply circulates inside the system, incurring computational overhead and (one hopes) providing information that helps coordinate computational activities.</p> </blockquote> <p>When later it appeared as if Agoric Computing wasn't going to take over the world, I tried to figure out why, and eventually settled upon the answer that markets often don't align incentives correctly for maximum computing efficiency. For example, consider an object whose purpose is to hold onto some valuable data in the form of a lookup table and perform lookup services. For efficiency you might have only one copy of this object in a system, but that makes it a monopolist, so if the object is profit maximizing (e.g., running some algorithm that automatically adjusts prices so as to maximize profits) then it would end up charging an inefficiently high price. Objects that might use its services are incentivized to try to do without the data, or to maintain an internal cache of past data retrieved, even if that's bad for efficiency.</p> <p>Suppose this system somehow came into existence anyway. A programmer would likely notice that it would be better if the lookup table and its callers were merged into one economic agent which would eliminate the inefficiencies described above, but then that agent would itself still be a monopolist (unless you inefficiently maintained multiple copies of it) so then they'd want to merge that agent with <em>its</em> callers, and so on.</p> <p>My curiosity stopped at that point and I went on to other interests, but now I wonder if that is actually a correct understanding of why Agoric Computing didn't become popular. 
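The monopolist-pricing inefficiency can be sketched with a toy model (the linear demand curve and zero marginal cost are hypothetical assumptions for illustration; nothing here is from the agoric papers):

```python
# Toy model of the lookup-table monopolist described above. Assume a
# hypothetical linear demand curve: at price p, callers buy
# q(p) = max(0, 100 - p) lookups per tick, and the marginal cost of
# serving a lookup is negligible (so the efficient price is ~0).

def quantity(price: float) -> float:
    return max(0.0, 100.0 - price)

def profit(price: float) -> float:
    return price * quantity(price)

# The profit-maximizing object charges far above cost...
monopoly_price = max(range(101), key=profit)

# ...so callers do only half the lookups they would do at the efficient
# (near-zero) price. The forgone lookups are the deadweight loss that
# pushes callers toward wasteful workarounds like private caches.
assert monopoly_price == 50
assert quantity(monopoly_price) == 50.0
assert quantity(0) == 100.0
```

Merging the lookup table with its callers internalizes this loss, which is why, in the story above, the merging pressure never stops.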
Does anyone have any insights to offer on this topic?</p> wei_dai 5XzQQHwYtSgrATzMC 2019-02-16T06:19:56.121Z Some disjunctive reasons for urgency on AI risk https://www.lesswrong.com/posts/8oSCw3z2dZgWjanqB/some-disjunctive-reasons-for-urgency-on-ai-risk <p>(This has been sitting in my drafts folder since August 2017. Robin Hanson's recent <a href="https://www.overcomingbias.com/2019/02/how-lumpy-ai-services.html">How Lumpy AI Services?</a> made me think of it again. I'm not sure why I didn't post it back then. I may have wanted to add more reasons, details and/or citations, but at this point it seems better to just post it as is. Apologies to those who may have come up with some of these arguments earlier.)</p> <p>Robin Hanson recently <a href="http://www.overcomingbias.com/2017/08/foom-justifies-ai-risk-efforts-now.html">wrote</a>, "Recently AI risk has become something of an industry, with far more going on than I can keep track of. Many call working on it one of the most effectively altruistic things one can possibly do. But I’ve searched a bit and as far as I can tell that foom scenario is still the main reason for society to be concerned about AI risk now." (By "foom scenario" he means a local intelligence explosion where a single AI takes over the world.) In response, I list the following additional reasons to work urgently on AI alignment.</p> <ol> <li> <p>Property rights are likely to not hold up in the face of large capability differentials between humans and AIs, so even if the intelligence explosion is likely global as opposed to local, that doesn't much reduce the urgency of working on AI alignment.</p> </li> <li> <p>Making sure an AI has aligned values and strong controls against value drift is an extra constraint on the AI design process. 
This constraint appears likely to be very costly at both design and run time, so if the first human level AIs deployed aren't value aligned, it seems very difficult for aligned AIs to catch up and become competitive.</p> </li> <li> <p>AIs' control of the economy will grow over time. This may happen slowly in their time frame but quickly in ours, leaving little time to solve value alignment problems before human values are left with a very small share of the universe, even if property rights hold up.</p> </li> <li> <p>Once we have human-level AIs and it's really obvious that value alignment is difficult, superintelligent AIs may not be far behind. Superintelligent AIs can probably find ways to bend people's beliefs and values to their benefit (e.g., create highly effective forms of propaganda, cults, philosophical arguments, and the like). Without an equally capable, value-aligned AI to protect me, even if my property rights are technically secure, I don't know how I would secure my mind.</p> </li> </ol> wei_dai 8oSCw3z2dZgWjanqB 2019-02-15T20:43:17.340Z Some Thoughts on Metaphilosophy https://www.lesswrong.com/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy <p>A powerful AI (or human-AI civilization) guided by wrong philosophical ideas would likely cause astronomical (or <a href="https://www.lesswrong.com/posts/Qz6w4GYZpgeDp6ATB/beyond-astronomical-waste">beyond astronomical</a>) waste. Solving metaphilosophy is one way in which we can hope to avoid this kind of disaster. 
For my previous thoughts on this topic and further motivation see <a href="https://www.lesswrong.com/posts/MAhueZtNz5SnDPhsy/metaphilosophical-mysteries">Metaphilosophical Mysteries</a>, <a href="https://www.lesswrong.com/posts/w6d7XBCegc96kz4n3/the-argument-from-philosophical-difficulty">The Argument from Philosophical Difficulty</a>, <a href="https://www.lesswrong.com/posts/vbtvgNXkufFRSrx4j/three-ai-safety-related-ideas">Three AI Safety Related Ideas</a>, and <a href="https://www.lesswrong.com/posts/HTgakSs6JpnogD6c2/two-neglected-problems-in-human-ai-safety">Two Neglected Problems in Human-AI Safety</a>.</p> <h1>Some interrelated ways of looking at philosophy</h1> <h2>Philosophy as answering confusing questions</h2> <p>This was my starting point for thinking about what philosophy is: it's what we do when we try to answer confusing questions, or questions that we don't have any other established methodology for answering. Why do we find some questions confusing, or lack methods for answering them? This leads to my next thought.</p> <h2>Philosophy as ability to generalize / handle distributional shifts</h2> <p>ML systems tend to have a lot of trouble dealing with distributional shifts. (It seems to be a root cause of many AI as well as human safety problems.) But humans seem to have some way of (sometimes) noticing out-of-distribution inputs, and can feel confused instead of just confidently using their existing training to respond to them. This is perhaps most obvious in unfamiliar ethical situations like <a href="https://www.lesswrong.com/posts/3wYTFWY3LKQCnAptN/torture-vs-dust-specks">Torture vs Dust Specks</a> or trying to determine whether our moral circle should include things like insects and RL algorithms. 
Unlike ML algorithms that extrapolate in an essentially random way when given out-of-distribution inputs, humans can potentially generalize in a principled or correct way, by using philosophical reasoning.</p> <h2>Philosophy as slow but general purpose problem solving</h2> <p>Philosophy may even be a fully general purpose problem solving technique. At least we don't seem to have reason to think that it's not. The problem is that it's painfully slow and resource intensive. Individual humans acting alone seem to have little chance of achieving justifiably high confidence in many philosophical problems even if they devote their entire lives to those problems. Humanity has been collectively trying to solve some philosophical problems for hundreds or even thousands of years, without arriving at final solutions. The slowness of philosophy explains why distributional shifts remain a safety problem for humans, even though we seemingly have a general way of handling them.</p> <h2>Philosophy as meta problem solving</h2> <p>Given that philosophy is extremely slow, it makes sense to use it to solve meta problems (i.e., finding faster ways to handle some class of problems) instead of object level problems. This is exactly what happened historically. Instead of using philosophy to solve individual scientific problems (natural philosophy) we use it to solve science as a methodological problem (philosophy of science). Instead of using philosophy to solve individual math problems, we use it to solve logic and philosophy of math. Instead of using philosophy to solve individual decision problems, we use it to solve decision theory. Instead of using philosophy to solve individual philosophical problems, we can try to use it to solve metaphilosophy.</p> <h2>Philosophy as "high computational complexity class"</h2> <p>If philosophy can solve any problem within a very large class, then it must have a "computational complexity class" that's as high as any given problem within that class. 
Computational complexity can be measured in various ways, such as time and space complexity (on various actual machines or models of computation), whether and how high a problem is in the polynomial hierarchy, etc. "Computational complexity" of human problems can also be measured in various ways, such as how long it would take to solve a given problem using a specific human, group of humans, or model of human organizations or civilization, and whether and how many rounds of <a href="https://www.lesswrong.com/posts/wo6NsBtn3WJDCeWsx/ai-safety-via-debate">DEBATE</a> would be sufficient to solve that problem either theoretically (given infinite computing power) or in practice.</p> <p>The point here is that no matter how we measure complexity, it seems likely that philosophy would have a "high computational complexity class" according to that measure.</p> <h2>Philosophy as interminable debate</h2> <p>The visible aspects of philosophy (as traditionally done) seem to resemble an endless (both in clock time and in the number of rounds) game of debate, where people propose new ideas, arguments, counterarguments, counter-counterarguments, and so on, and at the same time try to judge proposed solutions based on these ideas and arguments. 
People sometimes complain about the interminable nature of philosophical discussions, but that now seems understandable if philosophy is a "high computational complexity" method of general purpose problem solving.</p> <p>In a sense, philosophy is the opposite of math: whereas in math any debate can be settled by producing a proof (hence analogous to the complexity class NP; in practice maybe a couple more rounds of people finding or fixing flaws in the proof are needed), potentially no fixed number of rounds of debate (or DEBATE) is enough to settle all philosophical problems.</p> <h2>Philosophy as Jürgen Schmidhuber's <a href="http://people.idsia.ch/~juergen/toes.pdf">General TM</a></h2> <p>Unlike traditional Turing Machines, a General TM or GTM may edit its previous outputs, and can be considered to solve a problem even if it never terminates, as long as it stops editing its output after a finite number of edits and the final output is the correct solution. So if a GTM solves a certain problem, you know that it will eventually converge to the right solution, but you have no idea when, or if what's on its output tape at any given moment is the right solution. This seems a lot like philosophy, where people can keep changing their minds (or adjust their credences) based on an endless stream of new ideas, arguments, counterarguments, and so on, and you never really know when you've arrived at a correct answer.</p> <h1>What to do until we solve metaphilosophy?</h1> <h2>Protect the trajectory?</h2> <p>What would you do if you had a GTM that could solve a bunch of really important problems, and that was the only method you had of solving them? You'd try to reverse-engineer it and make a bunch of copies. But if you couldn't do that, then you'd want to put layers and layers of protection around it. 
Applied to philosophy, this line of thought seems to lead to the familiar ideas of using global coordination (or a decisive strategic advantage) to stop technological progress, or having AIs derive their terminal goals from simulated humans who live in a safe virtual environment.</p> <h2>Replicate the trajectory with ML?</h2> <p>Another idea is to try to build a good enough approximation of the GTM by training ML on its observable behavior (including whatever work tapes you have read access to). But there are two problems with this: 1. This is really hard or impossible to do if the GTM has internal state that you can't observe. And 2. If you haven't already reverse engineered the GTM, there's no good way to know that you've built a good enough approximation, i.e., to know that the ML model won't end up converging to answers that are different from the GTM.</p> <h3>A three part model of philosophical reasoning</h3> <p>It may be easier to understand the difficulty of capturing philosophical reasoning with ML by considering a more concrete model. I suggest we can divide it into three parts as follows: A. Propose new ideas/arguments/counterarguments/etc. according to some (implicit) distribution. B. Evaluate existing ideas/arguments/counterarguments/etc. C. Based on past ideas/arguments/counterarguments/etc., update some hidden state that changes how one does A and B. It's tempting to think that building an approximation of B using ML perhaps isn't too difficult, and then we can just search for the "best" ideas/arguments/counterarguments/etc. using standard optimization algorithms (maybe with some safety precautions like trying to avoid adversarial examples for the learned model). There's some chance this could work out well, but without having a deeper understanding of metaphilosophy, I don't see how we can be confident that throwing out A and C won't lead to disaster, especially in the long run. 
But A and C seem very hard or impossible for ML to capture (A due to paucity of training data, and C due to the unobservable state).</p> <p>Is there a way around this difficulty? What else can we do in the absence of a full <a href="https://www.lesswrong.com/posts/vrnhfGuYTww3fKhAM/three-approaches-to-friendliness">white-box</a> solution to metaphilosophy?</p> wei_dai EByDsY9S3EDhhfFzC 2019-02-10T00:28:29.482Z The Argument from Philosophical Difficulty https://www.lesswrong.com/posts/w6d7XBCegc96kz4n3/the-argument-from-philosophical-difficulty <p>(I'm reposting <a href="https://www.lesswrong.com/posts/JbcWQCxKWn3y49bNB/disentangling-arguments-for-the-importance-of-ai-safety#daD7JREPtx2WDe2Wf">this comment</a> as a top-level post, for ease of future reference. The <a href="https://www.lesswrong.com/posts/JbcWQCxKWn3y49bNB/disentangling-arguments-for-the-importance-of-ai-safety">context</a> here is a discussion about the different lines of arguments for the importance of AI safety.)</p> <p>Here's another argument that I've been pushing since the <a href="http://www.sl4.org/archive/0711/17101.html">early days</a> (apparently not very successfully since it didn't make it to this list :) which might be called "argument from philosophical difficulty". It appears that achieving a good long term future requires getting a lot of philosophical questions right that are hard for us to answer. 
Given this, <a href="https://www.lesswrong.com/posts/vrnhfGuYTww3fKhAM/three-approaches-to-friendliness">initially</a> I thought there were only three ways for AI to go right in this regard (assuming everything else goes well with the AI):</p> <ol> <li> <p>We solve all the important philosophical problems ahead of time and program the solutions into the AI.</p> </li> <li> <p>We solve metaphilosophy (i.e., understand philosophical reasoning as well as we understand mathematical reasoning) and program that into the AI so it can solve philosophical problems on its own.</p> </li> <li> <p>We program the AI to learn philosophical reasoning from humans or use human simulations to solve philosophical problems.</p> </li> </ol> <p>Since then people have come up with a couple more scenarios (which did make me <em>slightly</em> more optimistic about this problem):</p> <ol start="4"> <li> <p>We all coordinate to stop technological progress some time after AI but before space colonization, and have a period of long reflection where humans, maybe with help from AIs, spend thousands or millions of years solving philosophical problems.</p> </li> <li> <p>We program AIs to be corrigible to their users; some users care about getting philosophy correct, so the AIs help keep them safe and get their "fair share" of the universe until philosophical problems are eventually solved; enough users care about this that we end up with a mostly good future; and lack of philosophical knowledge doesn't cause disaster in the meantime. (My writings on "human safety problems" were in part a response to this suggestion, outlining how hard it would be to keep humans "safe" in this scenario.)</p> </li> </ol> <p>The overall argument is that, given human safety problems, realistic competitive pressures, difficulties with coordination, etc., it seems hard to end up in any of these scenarios and not have something go wrong along the way.
Maybe another way to put this is: given philosophical difficulties, the target we'd have to hit with AI is even smaller than it might otherwise appear.</p> wei_dai w6d7XBCegc96kz4n3 2019-02-10T00:28:07.472Z Why is so much discussion happening in private Google Docs? https://www.lesswrong.com/posts/hnvPCZ4Cx35miHkw3/why-is-so-much-discussion-happening-in-private-google-docs <p>I've noticed that when I've been invited to view and comment on AI-safety-related draft articles (in Google Docs), they tend to quickly attract a lot of extensive discussion, including from people who almost never participate on public forums like LessWrong or AI Alignment Forum. The number of comments is often an order of magnitude higher than a typical post on the Alignment Forum. (Some of these are just pointing out typos and the like, but there's still a lot of substantial discussion.) This seems kind of wasteful because many of the comments do not end up being reflected in the final document, so the ideas and arguments in them never end up being seen by the public (e.g., because the author disagrees with them, or doesn't want to include them due to length). So I guess I have a number of related questions:</p> <ol> <li> <p>What is it about these Google Docs that makes people so willing to participate in discussing them?</p> </li> <li> <p>Would the same level of discussion happen if the same draft authors were to present their drafts for discussion in public?</p> </li> <li> <p>Is there a way to attract this kind of discussion/participants to public posts in general (i.e., not necessarily drafts)?</p> </li> <li> <p>Is there some other way to prevent those ideas/arguments from "going to waste"?</p> </li> <li> <p>I just remembered that LessWrong has a sharable drafts feature. (Where I think the initially private comments can be later made public?) Is anyone using this?
If not, why?</p> </li> </ol> <p>Personally, I much prefer to comment in public places, due to not wanting my comments to be "wasted", so I'm having trouble understanding the psychology of people who seem to prefer the opposite.</p> wei_dai hnvPCZ4Cx35miHkw3 2019-01-12T02:19:19.332Z Two More Decision Theory Problems for Humans https://www.lesswrong.com/posts/6RjL996E8Dsz3vHPk/two-more-decision-theory-problems-for-humans <p>(This post has been sitting in my drafts folder for 6 years. Not sure why I didn't make it public, but here it is now after some editing.)</p> <p>There are two problems closely related to the <a href="/posts/KLaJjNdENsHhKhG5m/ontological-crisis-in-humans">Ontological Crisis in Humans</a>. I'll call them the "Partial Utility Function Problem" and the "Decision Theory Upgrade Problem".</p> <p><strong>Partial Utility Function Problem</strong></p> <p>As I mentioned in a <a href="/lw/fyb/ontological_crisis_in_humans/">previous post</a>, the only apparent utility function we have seems to be defined over an ontology very different from the fundamental ontology of the universe. But even on its native domain, the utility function seems only partially defined. In other words, it will throw an error (i.e., say "I don't know") on some possible states of the heuristical model. For example, this happens for me when the number of people gets sufficiently large, like 3^^^3 in Eliezer's Torture vs Dust Specks scenario.
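The Partial Utility Function Problem can be made concrete with a small sketch. The outcome space, numbers, and cutoff below are hypothetical illustrations, not anything from the post; the point is that once the utility function returns "I don't know" on a reachable state, the ordinary expected-utility calculation has no defined answer:

```python
# Sketch of a utility function defined only on part of its native domain.
# The outcomes (number of people affected), values, and cutoff are
# hypothetical; the key feature is the None ("I don't know") return value.

def utility(num_people_affected):
    if num_people_affected <= 10**6:
        return -float(num_people_affected)
    return None  # "I don't know" -- beyond where intuitions give answers

def expected_utility(lottery):
    """lottery: list of (probability, outcome) pairs. If any outcome with
    nonzero probability has undefined utility, the whole calculation is
    undefined -- there is no obvious principled way to fill in the gap."""
    total = 0.0
    for prob, outcome in lottery:
        u = utility(outcome)
        if u is None:
            if prob > 0:
                return None
            continue
        total += prob * u
    return total

# A lottery over familiar outcomes works fine:
assert expected_utility([(0.5, 10), (0.5, 20)]) == -15.0
# But one tiny-probability "I don't know" outcome (e.g., a 3^^^3-like
# number of people) poisons the entire computation:
assert expected_utility([(0.999, 10), (0.001, 10**9)]) is None
```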
When we try to compute the expected utility of some action, how should we deal with these "I don't know" values that come up?</p> <p>(Note that I'm presenting a simplified version of the real problem we face, where in addition to "I don't know", our utility function could also return essentially random extrapolated values outside of the region where it gives sensible outputs.)</p> <p><strong>Decision Theory Upgrade Problem</strong></p> <p>In the Decision Theory Upgrade Problem, an agent decides that their current decision theory is inadequate in some way, and needs to be upgraded. (Note that the Ontological Crisis could be considered an instance of this more general problem.) The question is whether and how to transfer their values over to the new decision theory.</p> <p>For example, a human might be running a mix of several decision theories: reinforcement learning, heuristical model-based consequentialism, identity-based decision making (where you adopt one or more social roles, like "environmentalist" or "academic" as part of your identity and then make decisions based on pattern-matching what that role would do in any given situation), as well as virtue ethics and deontology. If you are tempted to drop one or more of these in favor of a more "advanced" or "rational" decision theory, such as UDT, you have to figure out how to transfer the values embodied in the old decision theory, which may not even be represented as any kind of utility function, over to the new one.</p> <p>Another instance of this problem can be seen in someone just wanting to be a bit more consequentialist. Maybe UDT is too strange and impractical, but our native model-based consequentialism at least seems closer to being rational than the other decision procedures we have. In this case we tend to assume that the consequentialist module already has our real values and we don't need to "port" values from the other decision procedures that we're deprecating.
But I'm not entirely sure this is safe, since the step going from (for example) identity-based decision making to heuristical model-based consequentialism doesn't seem <em>that</em> different from the step between heuristical model-based consequentialism and something like UDT.</p> wei_dai 6RjL996E8Dsz3vHPk 2019-01-04T09:00:33.436Z Two Neglected Problems in Human-AI Safety https://www.lesswrong.com/posts/HTgakSs6JpnogD6c2/two-neglected-problems-in-human-ai-safety <p>In this post I describe a couple of human-AI safety problems in more detail. These helped motivate my proposed <a href="https://www.lesswrong.com/posts/vbtvgNXkufFRSrx4j/three-ai-safety-related-ideas#2__A_hybrid_approach_to_the_human_AI_safety_problem">hybrid approach</a>, and I think they need to be addressed by other AI safety approaches that currently do not take them into account.</p> <p><strong>1. How to prevent "aligned" AIs from unintentionally corrupting human values?</strong></p> <p>We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even "aligned" AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can't keep up, and their value systems no longer apply or give essentially random answers. AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. In the course of trying to figure out what we most want or like, they could in effect be searching for adversarial examples on our value functions.
At our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.</p> <p>(Some of these issues, like the invention of new addictions and new technologies in general, would happen even without AI, but I think AIs would likely, by default, strongly exacerbate the problem by differentially accelerating such technologies faster than progress in understanding how to safely handle them.)</p> <p><strong>2. How to defend against intentional attempts by AIs to corrupt human values?</strong></p> <p>It looks like we may be headed towards a world of multiple AIs, some of which are either unaligned, or aligned to other owners or users. In such a world there's a strong incentive to use one's own AIs to manipulate other people's values in a direction that benefits oneself (even if the resulting losses to others are greater than the gains to oneself).</p> <p>There is an apparent asymmetry between attack and defense in this arena, because manipulating a human is a straightforward optimization problem with an objective that is easy to test/measure (just check if the target has accepted the values you're trying to instill, or has started doing things that are more beneficial to you), and hence relatively easy for AIs to learn how to do. Teaching or programming an AI to help defend against such manipulation seems much harder, because it's unclear how to distinguish between manipulation and useful information or discussion.
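The asymmetry can be stated in code form. This is a deliberately bare sketch with hypothetical names and a hypothetical dict-of-floats value representation: the attacker's objective is a one-line measurement, while the defender's objective resists being written down at all:

```python
# Toy illustration of the attack/defense asymmetry described above.
# All names and the value representation are hypothetical.

def attacker_objective(values_before, values_after, instilled_value):
    """Easy to test/measure: how far did the target move toward the
    value the attacker is trying to instill?"""
    return values_after[instilled_value] - values_before[instilled_value]

def defender_objective(message):
    """Hard to even specify: there is no known test that separates
    manipulation from useful information or discussion."""
    raise NotImplementedError("no measurable objective known")

# The attacker has a clean training signal...
signal = attacker_objective({"loyalty": 1}, {"loyalty": 3}, "loyalty")
# ...while the defender has nothing to optimize against.
```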
(One way to defend against such manipulation would be to cut off all outside contact, including from other humans, because we don't know whether they are just being used as other AIs' mouthpieces, but that would be highly detrimental to one's own moral development.)</p> <p>There's also an asymmetry between AIs with simple utility functions (either unaligned or aligned to users who think they have simple values) and AIs aligned to users who have high value complexity and moral uncertainty. The former seem to be at a substantial advantage in a contest to manipulate others' values and protect their own.</p> wei_dai HTgakSs6JpnogD6c2 2018-12-16T22:13:29.196Z Three AI Safety Related Ideas https://www.lesswrong.com/posts/vbtvgNXkufFRSrx4j/three-ai-safety-related-ideas <p>(I have a health problem that is acting up and making it hard to type for long periods of time, so I'm condensing three posts into one.)</p> <p><strong>1. AI design as opportunity and obligation to address human safety problems</strong></p> <p>Many AI safety problems are likely to have counterparts in humans. AI designers and safety researchers shouldn't start by assuming that humans are safe (and then try to inductively prove that increasingly powerful AI systems are safe when developed/trained by and added to a team of humans) or try to solve AI safety problems without considering whether their designs or safety approaches exacerbate human safety problems relative to other designs / safety approaches. At the same time, the development of AI may be a huge opportunity to address human safety problems, for example by transferring power from probably unsafe humans to de novo AIs that are designed from the ground up to be safe, or by assisting humans' built-in safety mechanisms (such as moral and philosophical reflection).</p> <p><strong>2. A hybrid approach to the human-AI safety problem</strong></p> <p>Idealized humans can be safer than actual humans.
An example of an idealized human is a human whole-brain emulation that is placed in a familiar, safe, and supportive virtual environment (along with other humans for socialization), so that they are not subject to problematic "distributional shifts" nor vulnerable to manipulation from other powerful agents in the physical world. One way to take advantage of this is to design an AI that is ultimately controlled by a group of idealized humans (for example, has a terminal goal that refers to the reflective equilibrium of the idealized humans), but this seems impractical due to computational constraints. An idea for getting around this is to give the AI a hint that it can serve that terminal goal by learning from actual humans as an instrumental goal. This learning can include imitation learning, value learning, or other kinds of learning. Then, even if the actual humans become corrupted, the AI has a chance of becoming powerful enough to discard its dependence on actual humans and recompute its instrumental goals directly from its terminal goal. (Thanks to Vladimir Nesov for giving me a <a href="https://www.lesswrong.com/posts/DfcywmqRSkBaCB6Ma/intuitions-about-goal-directed-behavior#5Gx787nr6ynpYBfZH">hint</a> that led to this idea.)</p> <p><strong>3. Several approaches to AI alignment will <a href="https://wiki.lesswrong.com/wiki/Differential_intellectual_progress">differentially accelerate</a> kinds of intellectual progress that are <a href="https://arxiv.org/abs/1811.07871">analogous</a> to solving problems that are low in the polynomial hierarchy.</strong></p> <p>This is bad if the "good" kind of intellectual progress (such as philosophical progress) is disproportionately high in the hierarchy or outside PH entirely, or if we just don't know how to formulate such progress as problems low in PH. I think this issue needs to be on the radar of more AI safety researchers.</p> <p>(A reader might ask, "differentially accelerate relative to what?"
An "aligned" AI could accelerate progress in a bad direction relative to a world with no AI, but still in a good direction relative to a world with only unaligned AI. I'm referring to the former here.)</p> wei_dai vbtvgNXkufFRSrx4j 2018-12-13T21:32:25.415Z Counterintuitive Comparative Advantage https://www.lesswrong.com/posts/DN6q3SmgTrJgnRYWb/counterintuitive-comparative-advantage <p>This has been sitting in my drafts folder since 2011. Decided to post it today given the <a href="https://www.lesswrong.com/posts/qS2dnpeQyppetgXqv/double-dipping-in-dunning-kruger">recent post about Dunning-Kruger</a> and related discussions.</p> <p>The standard rationalist answer when someone asks for career advice is "find your comparative advantage." I don't have any really good suggestions about how to make this easier, but it seems like a good topic to bring up for discussion.</p> <p>If, 15 years ago (when I was still in college and my initial career choice hadn't been finalized yet), someone had told me that perhaps I ought to consider a career in philosophy, I would have laughed. "You must be joking. <em>Obviously</em>, I'll be really bad at doing philosophy," I would have answered. I thought of myself as a natural-born programmer, and that's the career direction I ended up choosing.</p> <p>As it turns out, I am a pretty good programmer, and a terrible philosopher, but it also happens to be the case that just about everyone else is even <em>worse</em> at doing philosophy, and getting some philosophical questions right might be <em>really</em> important.</p> <p>The usual (instinctive) way for someone to choose a career is probably to pick a field that they think they will be particularly good at, using a single standard of goodness across all of the candidate fields. For example, the implicit reasoning behind my own career choice could be something like "Given a typical programming problem, I can solve it in a few hours with high probability.
Whereas, given a typical philosophical problem, I can at best solve it after many years with low probability."</p> <p>On the other hand, comparative advantage says that in addition to your own abilities, you should also consider how good other people are (or will be) at various fields, and how valuable the outputs of those fields are (or will be). Unless you're only interested in maximizing income, and the fields you're considering are likely to remain stable over your lifetime (in which case you can just compare current salaries, although apparently many people don't even do that), this can be pretty tricky.</p> <p>(There don't appear to be any previous OB/LW posts on comparative advantage. The closest I could find is Eliezer's <a href="/lw/65/money_the_unit_of_caring/">Money: The Unit of Caring</a>. Most discussions elsewhere seem to focus on simple static examples where finding comparative advantage is relatively trivial.)</p> <p>Today (in 2018) there's an <a href="https://80000hours.org/articles/comparative-advantage/">80,000 Hours article</a> about comparative advantage, but it is more about how to find one's comparative advantage in a community of people who share a cause, like in EA, rather than in the wider economy.</p> <p>I would also add (in 2018) that besides everyone else lacking skill or talent at something, an even bigger source of comparative advantage is being one of the first people to realize that a problem is a problem, or to realize an important new variant or subproblem of an existing problem.
In that case, everyone else is really bad at solving that problem just because they have no idea the problem even exists.</p> wei_dai DN6q3SmgTrJgnRYWb 2018-11-28T20:33:30.023Z A general model of safety-oriented AI development https://www.lesswrong.com/posts/idb5Ppp9zghcichJ5/a-general-model-of-safety-oriented-ai-development <p>This may be trivial or obvious for a lot of people, but it doesn't seem like anyone has bothered to write it down (or I haven't looked hard enough). It started out as a generalization of Paul Christiano's <a href="https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616">IDA</a>, but also covers things like safe recursive self-improvement.</p> <pre><code>Start with a team of one or more humans (researchers, programmers, trainers, and/or overseers), with access to zero or more AIs (initially as assistants). The human/AI team in each round develops a new AI and adds it to the team, and repeats this until maturity in AI technology is achieved. Safety/alignment is ensured by having some set of safety/alignment properties on the team that is inductively maintained by the development process. </code></pre><p>The reason I started thinking in this direction is that Paul's approach seemed very hard to knock down, because any time a flaw or difficulty is pointed out, or someone expresses skepticism about some technique it uses or about the overall safety invariant, there's always a list of other techniques or invariants that could be substituted in for that part (<a href="https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely/comment/8L28cEW6bmY3EHARS">sometimes</a> in my own brain as I tried to criticize some part of it). Eventually I realized this shouldn't be surprising because IDA is an instance of this more general model of safety-oriented AI development, so there are bound to be many points near it in the space of possible safety-oriented AI development practices.
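The quoted development process can be written down as a loop with an inductive safety check. The helper functions passed in below are hypothetical placeholders; the post specifies the loop structure, not how to implement its parts:

```python
def develop_ai_iteratively(humans, develop_new_ai, satisfies_invariant,
                           is_mature):
    """Start with a human team (zero AIs); each round, the human/AI team
    develops a new AI and adds it to the team, inductively maintaining a
    set of safety/alignment properties, until AI technology is mature."""
    team = list(humans)
    assert satisfies_invariant(team)      # base case of the induction
    while not is_mature(team):
        new_ai = develop_new_ai(team)     # whole team builds the next AI
        candidate_team = team + [new_ai]
        # Inductive step: the new AI joins only if the safety/alignment
        # properties still hold for the enlarged team.
        assert satisfies_invariant(candidate_team)
        team = candidate_team
    return team

# Toy run with placeholder implementations of the parts:
team = develop_ai_iteratively(
    humans=["human overseer"],
    develop_new_ai=lambda team: "AI#%d" % len(team),
    satisfies_invariant=lambda team: True,   # stand-in safety check
    is_mature=lambda team: len(team) >= 4,
)
```

On this reading, IDA corresponds to one particular choice of `develop_new_ai` (amplification followed by distillation) and one particular invariant, and nearby choices give the neighboring points in design space described above.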
(Again, this may already be obvious to others including Paul, and in their minds IDA is perhaps already a cluster of possible development practices consisting of the most promising safety techniques and invariants, rather than a single point.)</p> <p>If this model turns out not to have been written down before, perhaps it should be assigned a name, like Iterated Safety-Invariant AI-Assisted AI Development, or something pithier?</p> wei_dai idb5Ppp9zghcichJ5 2018-06-11T21:00:02.670Z Beyond Astronomical Waste https://www.lesswrong.com/posts/Qz6w4GYZpgeDp6ATB/beyond-astronomical-waste <p>Faced with the astronomical amount of unclaimed and unused resources in our universe, one's first reaction is probably wonderment and anticipation, but a second reaction may be disappointment that our universe isn't even larger or contains even more resources (such as the ability to support 3^^^3 human lifetimes or perhaps to perform an infinite amount of computation). In a <a href="/posts/BNbxueXEcm6dCkDuk/is-the-potential-astronomical-waste-in-our-universe-too">previous post</a> I suggested that the potential amount of <a href="//wiki.lesswrong.com/wiki/Astronomical_waste">astronomical waste</a> in our universe seems small enough that a total utilitarian (or the total utilitarianism part of someone’s moral uncertainty) might reason that since one should have made a deal to trade away power/resources/influence in this universe for power/resources/influence in universes with much larger amounts of available resources, it would be rational to behave as if this deal was actually made. 
But for various reasons a total utilitarian may not buy that argument, in which case another line of thought is to look for things to care about beyond the potential astronomical waste in our universe, in other words to explore possible sources of expected value that may be much greater than what can be gained by just creating worthwhile lives in this universe.</p> <p>One example of this is the possibility of escaping, or being deliberately uplifted from, a simulation that we're in, into a much bigger or richer base universe. Or more generally, the possibility of controlling, <a href="//wiki.lesswrong.com/wiki/Updateless_decision_theory">through our decisions</a>, the outcomes of universes with much greater computational resources than the one we're apparently in. It seems likely that under an assumption such as <a href="//en.wikipedia.org/wiki/Mathematical_universe_hypothesis">Tegmark's Mathematical Universe Hypothesis</a>, there are many simulations of our universe running all over the multiverse, including in universes that are much richer than ours in computational resources. If such simulations exist, it also seems likely that we can leave some of them, for example through one of these mechanisms:</p> <ol> <li>Exploiting a flaw in the software or hardware of the computer that is running our simulation (including "natural simulations" where a very large universe happens to contain a simulation of ours without anyone intending this).</li> <li>Exploiting a flaw in the psychology of agents running the simulation.</li> <li>Altruism (or other moral/axiological considerations) on the part of the simulators.</li> <li><a href="//wiki.lesswrong.com/wiki/Acausal_trade">Acausal trade</a>.</li> <li>Other instrumental reasons for the simulators to let out simulated beings, such as wanting someone to talk to or play with. 
(Paul Christiano's recent <a href="/posts/3kN79EuT27trGexsq/when-is-unaligned-ai-morally-valuable">When is unaligned AI morally valuable?</a> contains an example of this; however, the idea there only lets us escape to another universe similar to this one.)</li> </ol> <p>(Being run as a simulation in another universe isn't necessarily the only way to control what happens in that universe. Another possibility is that, if universes with halting oracles exist (which is implied by Tegmark's MUH since they exist as mathematical structures in the <a href="https://en.wikipedia.org/wiki/Arithmetical_hierarchy">arithmetical hierarchy</a>), some of their oracle queries may be questions whose answers can be controlled by our decisions, in which case we can control what happens in those universes without being simulated by them (in the sense of being run step by step in a computer). Another example is that superintelligent beings may be able to reason about what our decisions are without having to run a step-by-step simulation of us, even without access to a halting oracle.)</p> <p>The general idea here is for a superintelligence descending from us to (after determining that this is an advisable course of action) use some fraction of the resources of this universe to reason about or search (computationally) for much bigger/richer universes that are running us as simulations or can otherwise be controlled by us, and then determine what we need to do to maximize the expected value of the consequences of our actions on the base universes, perhaps through one or more of the above-listed mechanisms.</p> <h3>Practical Implications</h3> <p>Realizing this kind of <a href="/posts/JjY8Yq9YdEAHc7Lkb/existential-risk-and-existential-hope-definitions">existential hope</a> seems to require a higher level of philosophical sophistication than just preventing astronomical waste in our own universe.
Compared to that problem, here we have more questions of a philosophical nature, for which no empirical feedback seems possible. It seems very easy to make a mistake somewhere along the chain of reasoning and waste a more-than-astronomical amount of potential value, for example by failing to realize the possibility of affecting bigger universes through our actions, incorrectly calculating the expected value of such a strategy, failing to solve the distributional/ontological shift problem of how to value strange and unfamiliar processes or entities in other universes, failing to figure out the correct or optimal way to escape into or otherwise influence larger universes, etc.</p> <p>The total utilitarian in me is thus very concerned about trying to preserve and improve the collective philosophical competence of our civilization, such that when it becomes possible to pursue strategies like the ones listed above, we'll be able to make the right decisions. The best opportunity to do this that I can foresee is the advent of advanced AI, which is another reason I want to push for AIs that are not just value-aligned with us, but also have philosophical competence that scales with their other intellectual abilities, so they can <a href="https://agentfoundations.org/item?id=1150">help correct</a> the philosophical errors of their human users (instead of merely deferring to them), thereby greatly improving our collective philosophical competence.</p> <h3>Anticipated Questions</h3> <p><em>How is this idea related to Nick Bostrom's <a href="https://www.simulation-argument.com/">Simulation Argument</a>?</em> Nick's argument focuses on the possibility of post-humans (presumably living in a universe similar to ours but just at a later date) simulating us as their ancestors.
It does not seem to consider that we may be running as simulations in much larger/richer universes, or that this may be a source of great potential value.</p> <p><em>Isn't this a form of <a href="//wiki.lesswrong.com/wiki/Pascal's_mugging">Pascal's Mugging</a>?</em> I'm not sure. It could be that when we figure out how to solve Pascal's Mugging it will become clear that we shouldn't try to leave our simulation for reasons similar to why we shouldn't pay the mugger. However the analogy doesn't seem so tight that I think this is highly likely. Also, note that the argument here isn't that we should do the equivalent of "pay the mugger" but rather that we should try to bring ourselves into a position where we can definitively figure out what the right thing to do is.</p> wei_dai Qz6w4GYZpgeDp6ATB 2018-06-07T21:04:44.630Z Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely <p>EDIT: Please note that the way I use the word &quot;corrigibility&quot; in this post isn&#x27;t quite how Paul uses it. See <a href="https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#jo2cwbB3WK7KyGjpy">this thread</a> for clarification.</p><p>This is mostly a reply to Paul Christiano&#x27;s <a href="http://ai-alignment.com/universality-and-security-amplification-551b314a3bab">Universality and security amplification</a> and assumes familiarity with that post as well as Paul&#x27;s AI alignment approach in general. 
See also <a href="http://www.lesswrong.com/posts/ZyyMPXY27TTxKsR5X/problems-with-amplification-distillation">my previous comment</a> for my understanding of what corrigibility means here and the motivation for wanting to do AI alignment through corrigibility learning instead of value learning.</p><p>Consider the <a href="http://medium.com/@weidai/to-put-it-another-way-a-human-translator-has-learned-a-lot-of-valuable-information-much-of-it-48457f95b9bf">translation example</a> again as an analogy about corrigibility. Paul&#x27;s alignment approach depends on humans having a notion of &quot;corrigibility&quot; (roughly &quot;being helpful to the user and keeping the user in control&quot;) which is preserved by the amplification scheme. Like the information that a human uses to do translation, the details of this notion may also be stored as connection weights in the deep layers of a large neural network, so that the only way to access them is to provide inputs to the human of a form that the network was trained on. (In the case of translation, this would be sentences and associated context, while in the case of corrigibility this would be questions/tasks of a human understandable nature and context about the user&#x27;s background and current situation.) This seems plausible because in order for a human&#x27;s notion of corrigibility to make a difference, the human has to apply it while thinking about the meaning of a request or question and &quot;translating&quot; it into a series of smaller tasks.</p><p>In the language translation example, if the task of translating a sentence is broken down into smaller pieces, the system could no longer access the full knowledge the Overseer has about translation. 
By analogy, if the task of breaking down tasks in a corrigible way is itself broken down into smaller pieces (either for security or because the input task and associated context is so complex that a human couldn&#x27;t comprehend it in the time allotted), then the system might no longer be able to access the full knowledge the Overseer has about &quot;corrigibility&quot;.</p><p>In addition to &quot;corrigibility&quot; (trying to be helpful), breaking down a task also involves &quot;understanding&quot; (figuring out what the intended meaning of the request is) and &quot;competence&quot; (how to do what one is trying to do). By the same analogy, humans are likely to have introspectively inaccessible knowledge about both understanding and competence, which they can&#x27;t fully apply if they are not able to consider a task as a whole.</p><p>Paul is aware of this problem, at least with regard to competence, and his <a href="https://ai-alignment.com/universality-and-security-amplification-551b314a3bab">proposed solution</a> is:</p><blockquote>I propose to go on breaking tasks down anyway. This means that we will lose certain abilities as we apply amplification. [...] Effectively, this proposal replaces our original human overseer with an impoverished overseer, who is only able to respond to the billion most common queries.</blockquote><p></p><p>How bad is this, with regard to understanding and corrigibility? Is an impoverished overseer who only learned a part of what a human knows about understanding and corrigibility still understanding/corrigible enough? I think the answer is probably no.</p><p>With regard to understanding, natural language is famously ambiguous. The fact that a sentence is ambiguous (has multiple possible meanings depending on context) is itself often far from apparent to someone with a shallow understanding of the language. 
(See <a href="http://www.greaterwrong.com/posts/Mhaikukvt6N4YtwHF/dragon-army-retrospective#comment-Pj6b4SDDdf3YrWp9i">here</a> for a recent example on LW.) So the overseer will end up being overly literal, and misinterpreting the meaning of natural language inputs without realizing it.</p><p>With regard to corrigibility, if I try to think about what I&#x27;m doing when I&#x27;m trying to be corrigible, it seems to boil down to something like this: build a model of the user based on all available information and my prior about humans, use that model to help improve my understanding of the meaning of the request, then find a course of action that best balances between satisfying the request as given, upholding (my understanding of) the user&#x27;s morals and values, and most importantly keeping the user in control. Much of this seems to depend on information (prior about humans), procedure (how to build a model of the user), and judgment (how to balance between various considerations) that are far from introspectively accessible.</p><p>So if we try to learn understanding and corrigibility &quot;safely&quot; (i.e., in small chunks), we end up with an <a href="https://www.greaterwrong.com/posts/ySSEz5CmSEo6MbokQ/reframing-misaligned-agi-s-well-intentioned-non-neurotypical">overly literal overseer</a> that lacks common sense understanding of language and independent judgment of what the user&#x27;s wants, needs, and shoulds are and how to balance between them. However, if we amplify the overseer enough, eventually the AI will have the option of learning understanding and corrigibility from external sources rather than relying on its poor &quot;native&quot; abilities. As Paul explains with regard to translation:</p><blockquote>This is potentially OK, as long as we learn a good policy for leveraging the information in the environment (including human expertise). 
This can then be distilled into a state maintained by the agent, which can be as expressive as whatever state the agent might have learned. Leveraging external facts requires making a tradeoff between the benefits and risks, so we haven’t eliminated the problem, but we’ve potentially isolated it from the problem of training our agent.</blockquote><p></p><p>So instead of directly trying to break down a task, the AI would first learn to understand natural language and what &quot;being helpful&quot; and &quot;keeping the user in control&quot; involve from external sources (possibly including texts, audio/video, and queries to humans), distill that into some compressed state, then use that knowledge to break down the task in a more corrigible way. But first, since the lower-level (less amplified) agents are contributing little besides the ability to execute literal-minded tasks that don&#x27;t require independent judgment, it&#x27;s unclear what advantages there are to doing this as an Amplified agent as opposed to using ML directly to learn these things. 
And second, trying to learn understanding and corrigibility from external humans has the same problem as trying to learn from the human Overseer: if you try to learn in large chunks, you risk corrupting the external human and then learning corrupted versions of understanding and corrigibility, but if you try to learn in small chunks, you won&#x27;t get all the information that you need.</p><p>The conclusion here seems to be that corrigibility can&#x27;t be learned safely, at least not in a way that&#x27;s clear to me.</p> wei_dai o22kP33tumooBtia3 2018-04-01T23:07:46.625Z Multiplicity of "enlightenment" states and contemplative practices https://www.lesswrong.com/posts/yxvp9LErWao5kJ3bC/multiplicity-of-enlightenment-states-and-contemplative <p>It seems that there are multiple different mental states that people have historically called &quot;enlightenment&quot;, as well as many different types of contemplative practices with different underlying cognitive mechanisms. I link to and quote from a couple of papers showing this. Given the apparent multiplicity of &quot;enlightenment&quot; states and contemplative practices, I&#x27;d like to request that future discussions on these topics include more detailed references or descriptions as to which states and practices are being talked about.</p><p><a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00870/full">Can enlightenment be traced to specific neural correlates, cognition, or behavior?</a></p><blockquote>The term “enlightenment” is an extraordinarily imprecise construct. Using the term enlightenment or even the term more native to Buddhist traditions, “awakening” (<em>bodhi</em>), as if it referred to a single outcome either privileges one conception over others or else assumes that there is some commonality among the traditional goals of diverse contemplative traditions. There are deep disagreements over the nature of the goal between and even within various Buddhist schools. 
Scientific investigations cannot assume that there is any commonality among the transformative changes referred to as “kensho,” “stream entry,” “realizing the nature of mind,” and so on, that various Buddhist traditions take as various stages of awakening. Empirical investigations of these constructs can only proceed with reference to the specific psychological and behavioral outcomes described in the native discourse of a specific tradition (see <a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00870/full#B21">Lutz et al., 2007</a>). [...]</blockquote><blockquote>Given the differences between various competing conceptions of awakening, one scientific approach to tracing enlightenment would be to use the tools of social psychology to investigate which states and traits are valued in a particular community. For instance, recent work in moral psychology suggests how value judgments of people and practices as either enlightened or unenlightened could be traced to affective reactions of admiration and disgust (<a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00870/full#B26">Rozin et al., 1999</a>; <a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00870/full#B29">Schnall et al., 2008</a>; <a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00870/full#B2">Brandt and Reyna, 2011</a>; <a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00870/full#B30">Schnall and Roper, 2011</a>). Some of the most virulent disagreements over what counts as genuine awakening occur between closely related practice traditions, such as the debates between various Theravāda Buddhist traditions in Burma over which states are to count as realizations of <em>nibbāna</em> and which are instead to be counted (merely) as states of deep concentration. 
Surveying these debates, <a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00870/full#B31">Sharf (1995)</a> concludes “there is no public consensus as to the application of terms that supposedly refer to discreet experiential states within the <em>vipassanā</em> movement” (<a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00870/full#B31">Sharf, 1995</a>, p. 265).</blockquote><p><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4595910/">Reconstructing and deconstructing the self: Cognitive mechanisms in meditation practice</a></p><blockquote>While mindfulness (see Glossary), compassion, and other forms of meditation are increasingly being studied as interventions to alleviate suffering and promote well-being [<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4595910/#R3">3</a>–<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4595910/#R10">10</a>], it is not yet clear how different styles of meditation affect specific cognitive processes, nor how alterations in these processes might impact levels of well-being. Here, we address this question from the perspective of psychology and cognitive neuroscience to better understand how changes in well-being are mediated by alterations in distinct cognitive processes and in the structure and functioning of corresponding brain networks. [...]</blockquote><blockquote>In this article we expand our original framework to accommodate a broader range of traditional and contemporary meditation practices, grouping them into attentional, constructive, and deconstructive families. According to this model, the primary cognitive mechanisms in these three families are (1) attention regulation and meta-awareness, (2) perspective taking and reappraisal, and (3) self-inquiry, respectively. 
To illustrate the role of these processes in different forms of meditation, we discuss how experiential fusion, maladaptive self-schema, and cognitive reification are differentially targeted by these processes in the context of Buddhist meditation, integrating the perspectives of other contemplative, philosophical, and clinical perspectives when relevant.</blockquote><p>Below is a table from this paper showing how it classifies various traditional and modern contemplative practices. (Click <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4595910/table/T1/">here</a> to see a more readable version.)</p><span><figure><img src="https://i.imgur.com/310WreY.png" class="draft-image " style="" /></figure></span><p></p> wei_dai yxvp9LErWao5kJ3bC 2018-03-12T08:15:48.709Z Online discussion is better than pre-publication peer review https://www.lesswrong.com/posts/a3aGosA987cZ4aRAB/online-discussion-is-better-than-pre-publication-peer-review <p>Related: <a href="https://rationalconspiracy.com/2012/06/20/why-academic-papers-are-a-terrible-discussion-forum/">Why Academic Papers Are A Terrible Discussion Forum</a>, <a href="https://www.facebook.com/yudkowsky/posts/10154888183439228">Four Layers of Intellectual Conversation</a></p> <p>During a <a href="/lw/pdd/the_doomsday_argument_in_anthropic_decision_theory/dwsa">recent</a> <a href="https://www.lesserwrong.com/posts/xQ9tMMk3RArodLtDq/intellectual-progress-inside-and-outside-academia">discussion</a> about (in part) academic peer review, some people defended peer review as necessary in academia, despite its flaws, for time management. Without it, they said, researchers would be overwhelmed by "cranks and incompetents and time-card-punchers" and "semi-serious people post ideas that have already been addressed or refuted in papers already". I replied that on online discussion forums, "it doesn't take a lot of effort to detect cranks and previously addressed ideas". 
I was prompted by Michael Arc and Stuart Armstrong to elaborate. Here's what I wrote in response:</p> <p>My experience is with systems like LW. If an article is in my own specialty then I can judge it easily and make comments if it&rsquo;s interesting, otherwise I look at its votes and other people&rsquo;s comments to figure out whether it&rsquo;s something I should pay more attention to. One advantage over peer review is that each specialist can see all the unfiltered work in their own field, and it only takes one person from all the specialists in a field to recognize that a work may be promising, then comment on it and draw others&rsquo; attentions. Another advantage is that nobody can make ill-considered comments without suffering personal consequences since everything is public. This seems like an obvious improvement over standard pre-publication peer review, for the purpose of filtering out bad work and focusing attention on promising work, and in practice works reasonably well on LW.</p> <p>Apparently some people in academia have come to similar conclusions about how peer review is currently done and are trying to reform it in <a href="http://pubs.acs.org/doi/full/10.1021/acs.chemmater.5b01917">various ways</a>, including switching to <a href="http://blog.scienceopen.com/2016/02/pre-or-post-publication-peer-review/">post-publication peer review</a> (which seems very similar to what we do on forums like LW). 
However it's troubling (in a "civilizational inadequacy" sense) that academia is moving so slowly in that direction, despite the necessary enabling technology having been invented a decade or more ago.</p> wei_dai a3aGosA987cZ4aRAB 2017-09-05T13:25:15.331Z Examples of Superintelligence Risk (by Jeff Kaufman) https://www.lesswrong.com/posts/yj3ehygYJiqqkr8MD/examples-of-superintelligence-risk-by-jeff-kaufman wei_dai yj3ehygYJiqqkr8MD 2017-07-15T16:03:58.336Z Combining Prediction Technologies to Help Moderate Discussions https://www.lesswrong.com/posts/f9i65H986t8mrfs2M/combining-prediction-technologies-to-help-moderate <p>I came across a <a href="https://blog.ethereum.org/2015/11/24/applications-of-security-deposits-and-prediction-markets-you-might-not-have-thought-about/">2015 blog post by Vitalik Buterin</a> that contains some ideas similar to Paul Christiano's recent <a href="/lw/o7p/crowdsourcing_moderation_without_sacrificing/">Crowdsourcing moderation without sacrificing quality</a>. The basic idea in both is that it would be nice to have a panel of trusted moderators carefully pore over every comment and decide on its quality, but since that is too expensive, we can instead use some tools to predict moderator decisions, and have the trusted moderators look at only a small subset of comments in order to calibrate the prediction tools. In Paul's proposal the prediction tool is machine learning (mainly using individual votes as <a href="https://en.wikipedia.org/wiki/Feature_%28machine_learning%29">features</a>), and in Vitalik's proposal it's <a href="https://wiki.lesswrong.com/wiki/Prediction_market">prediction markets</a> where people bet on what the moderators would decide if they were to review each comment.</p> <p>It seems worth thinking about how to combine the two proposals to get the best of both worlds. 
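<p>For concreteness, here is a toy sketch of the kind of pipeline such a combination might produce. All function names, probabilities, and the disagreement threshold below are my own hypothetical choices, not details from either proposal:</p>

```python
# Toy sketch: an ML model supplies baseline approval odds for each comment,
# bettors move a market probability away from that baseline when they think
# the model is wrong, and moderators only review the comments where the
# market and the model disagree most.

def review_queue(comments, threshold=0.3):
    """Sort comments by |market - model| disagreement, keeping only those
    above the threshold for human moderator review."""
    flagged = [c for c in comments
               if abs(c["market_prob"] - c["ml_prob"]) > threshold]
    return sorted(flagged,
                  key=lambda c: abs(c["market_prob"] - c["ml_prob"]),
                  reverse=True)

comments = [
    {"id": 1, "ml_prob": 0.90, "market_prob": 0.85},  # model and bettors agree
    {"id": 2, "ml_prob": 0.80, "market_prob": 0.20},  # bettors bet against model
    {"id": 3, "ml_prob": 0.10, "market_prob": 0.60},  # moderate disagreement
]

print([c["id"] for c in review_queue(comments)])  # [2, 3]: comment 2 first, 1 never reviewed
```

<p>The point of the sketch is that moderator attention is spent only where the two prediction tools conflict, which is where their judgment adds the most calibration value.</p>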
One fairly obvious idea is to let people both vote on comments as an expression of their own opinions, and also place bets about moderator decisions, and use ML to set baseline odds, which would reduce how much the forum would have to pay out to incentivize accurate prediction markets. The hoped for outcome is that the ML algorithm would make correct decisions most of the time, but people can bet against it when they see it making mistakes, and moderators would review comments that have the greatest disagreements between ML and people or between different bettors in general. Another part of Vitalik's proposal is that each commenter has to make an initial bet that moderators would decide that their comment is good. The article notes that such a bet can also be viewed as a refundable deposit. Such forced bets / refundable deposits would <a href="/lw/o7p/crowdsourcing_moderation_without_sacrificing/dit1">help solve</a> a security problem with Paul's ML-based proposal.</p> <p>Are there better ways to combine these prediction tools to help with forum moderation? Are there other prediction tools that can be used instead or in addition to these?</p> <div><br /></div> wei_dai f9i65H986t8mrfs2M 2016-12-08T00:19:35.854Z [link] Baidu cheats in an AI contest in order to gain a 0.24% advantage https://www.lesswrong.com/posts/SYaYvGAM7JpDTpzp8/link-baidu-cheats-in-an-ai-contest-in-order-to-gain-a-0-24 <p>Some of you may already have seen this story, since it's several days old, but MIT Technology Review seems to have the best explanation of what happened: <a href="http://www.technologyreview.com/view/538111/why-and-how-baidu-cheated-an-artificial-intelligence-test/">Why and How Baidu Cheated an Artificial Intelligence Test</a></p> <blockquote> <p>Such is the success of deep learning on this particular test that even a small advantage could make a difference. 
Baidu had reported it achieved an error rate of only 4.58 percent, beating the previous best of 4.82 percent, <a href="http://arxiv.org/abs/1502.03167" target="_blank">reported by Google</a> in March. In fact, some experts have noted that the small margins of victory in the race to get better on this particular test make it increasingly meaningless. That Baidu and others continue to trumpet their results all the same - and may even be willing to break the rules - suggest that being the best at machine learning matters to them very much indeed.</p> </blockquote> <p>(In case you didn't know, Baidu is the largest search engine in China, with a market cap of $72B, compared to Google's $370B.)</p> <p>The problem I see here is that the mainstream AI / machine learning community measures progress mainly by this kind of contest. Researchers are incentivized to use whatever method they can find or invent to gain a few tenths of a percent in some contest, which allows them to claim progress at an AI task and publish a paper. Even as the AI safety / control / Friendliness field gets more attention and funding, it seems easy to foresee a future where mainstream AI researchers continue to ignore such work because it does not contribute to the tenths of a percent that they are seeking but instead can only hinder their efforts. What can be done to change this?</p> wei_dai SYaYvGAM7JpDTpzp8 2015-06-06T06:39:44.990Z Is the potential astronomical waste in our universe too small to care about? https://www.lesswrong.com/posts/BNbxueXEcm6dCkDuk/is-the-potential-astronomical-waste-in-our-universe-too <p>In the not too distant past, people <a href="http://www.aleph.se/Trans/Global/Omega/dyson.txt">thought</a> that our universe might be capable of supporting an unlimited amount of computation. 
Today our <a href="http://arxiv.org/pdf/astro-ph/0404510.pdf">best guess</a> at the cosmology of our universe is that it stops being able to support any kind of life or deliberate computation after a finite amount of time, during which only a finite amount of computation can be done (on the order of something like 10^120 operations).</p> <p>Consider two hypothetical people, Tom, a total utilitarian with a near zero discount rate, and Eve, an egoist with a relatively high discount rate, a few years ago when they thought there was .5 probability the universe could support doing at least 3^^^3 ops and .5 probability the universe could only support 10^120 ops. (These numbers are obviously made up for convenience and illustration.) It would have been mutually beneficial for these two people to make a deal: if it turns out that the universe can only support 10^120 ops, then Tom will give everything he owns to Eve, which happens to be $1 million, but if it turns out the universe can support 3^^^3 ops, then Eve will give $100,000 to Tom. (This may seem like a lopsided deal, but Tom is happy to take it since the potential utility of a universe that can do 3^^^3 ops is so great for him that he really wants any additional resources he can get in order to help increase the probability of a positive Singularity in that universe.)</p> <p>You and I are not total utilitarians or egoists, but instead are people with <a href="http://wiki.lesswrong.com/wiki/Moral_uncertainty">moral uncertainty</a>. 
Nick Bostrom and Toby Ord <a href="http://www.overcomingbias.com/2009/01/moral-uncertainty-towards-a-solution.html">proposed </a>the Parliamentary Model for dealing with moral uncertainty, which works as follows:</p> <blockquote> <p>Suppose that you have a set of mutually exclusive moral theories, and that you assign each of these some probability.&nbsp; Now imagine that each of these theories gets to send some number of delegates to The Parliament.&nbsp; The number of delegates each theory gets to send is proportional to the probability of the theory.&nbsp; Then the delegates bargain with one another for support on various issues; and the Parliament reaches a decision by the delegates voting.&nbsp; What you should do is act according to the decisions of this imaginary Parliament.</p> </blockquote> <p>It occurred to me recently that in such a Parliament, the delegates would make deals similar to the one between Tom and Eve above, where they would trade their votes/support in one kind of universe for votes/support in another kind of universe. If I had a Moral Parliament active back when I thought there was a good chance the universe could support unlimited computation, all the delegates that really care about <a href="http://wiki.lesswrong.com/wiki/Astronomical_waste">astronomical waste</a> would have traded away their votes in the kind of universe where we actually seem to live for votes in universes with a lot more potential astronomical waste. So today my Moral Parliament would be effectively controlled by delegates that care little about astronomical waste.</p> <div>I actually still seem to care about astronomical waste (even if I pretend that I was <em>certain</em> that the universe could only do at most 10^120 operations). (Either my Moral Parliament wasn't active back then, or my delegates weren't smart enough to make the appropriate deals.) 
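<p>A minimal sketch of such a deal: the seat allocation follows the quoted description of the Parliamentary Model, while the per-vote utilities are stand-in numbers I made up to mirror the Tom/Eve example, not anything from Bostrom and Ord:</p>

```python
# Minimal sketch of the Parliamentary Model plus a Tom/Eve-style trade.
# Seats are proportional to credence in each moral theory, as in the quoted
# description; the per-vote utilities below are made-up illustrations.

def delegates(credences, seats=100):
    """Allocate parliament seats in proportion to each theory's probability."""
    return {theory: round(p * seats) for theory, p in credences.items()}

seats = delegates({"total_utilitarian": 0.5, "egoist": 0.5})
print(seats)  # {'total_utilitarian': 50, 'egoist': 50}

# The cross-universe trade: utilitarian delegates value a vote in a
# 3^^^3-ops universe enormously more than a vote in a 10^120-ops universe,
# so trading away all their small-universe votes for even a few extra
# big-universe votes is a clear win by their lights.
P_BIG = 0.5                  # credence that the universe supports 3^^^3 ops
U_BIG, U_SMALL = 3.0e9, 1.0  # stand-in utility per vote in each universe

def expected_value(votes_big, votes_small):
    return P_BIG * votes_big * U_BIG + (1 - P_BIG) * votes_small * U_SMALL

# 50 small-universe votes traded away for 10 extra big-universe votes:
assert expected_value(60, 0) > expected_value(50, 50)
```

<p>Under these stand-in numbers the utilitarian delegates happily cede all their influence in the small universe, which is exactly why the post worries that a Parliament active early on would now be controlled by delegates indifferent to our universe's astronomical waste.</p>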
Should I nevertheless follow UDT-like reasoning and conclude that I should act as if they had made such deals, and therefore I should stop caring about the relatively small amount of astronomical waste that could occur in our universe? If the answer to this question is "no", what about the future going forward, given that there is still uncertainty about cosmology and the nature of physical computation? Should the delegates to my Moral Parliament be making these kinds of deals from now on?<br /></div> wei_dai BNbxueXEcm6dCkDuk 2014-10-21T08:44:12.897Z What is the difference between rationality and intelligence? https://www.lesswrong.com/posts/o8YF6L9sRv4k9zbfb/what-is-the-difference-between-rationality-and-intelligence <p>Or to ask the question another way, is there such a thing as a theory of bounded rationality, and if so, is it the same thing as a theory of general intelligence?</p> <p>The LW Wiki defines general intelligence as "ability to efficiently achieve goals in a wide range of domains", while instrumental rationality is defined as "the art of choosing and implementing actions that steer the future toward outcomes ranked higher in one's preferences". These definitions seem to suggest that rationality and intelligence are fundamentally the same concept.</p> <p>However, rationality and AI have separate research communities. This seems to be mainly for historical reasons, because people studying rationality started with theories of unbounded rationality (i.e., with logical omniscience or access to unlimited computing resources), whereas AI researchers started off trying to achieve modest goals in narrow domains with very limited computing resources. 
However rationality researchers are trying to find theories of bounded rationality, while people working on AI are trying to achieve more general goals with access to greater amounts of computing power, so the distinction may disappear if the two sides end up meeting in the middle.</p> <p>We also distinguish between rationality and intelligence when talking about humans. I understand the former as the ability of someone to overcome various biases, which seems to consist of a set of skills that can be learned, while the latter is a kind of mental firepower measured by IQ tests. This seems to suggest another possibility. Maybe (as Robin Hanson recently argued on his blog) there is no such thing as a simple theory of how to optimally achieve arbitrary goals using limited computing power. In this view, general intelligence requires cooperation between many specialized modules containing domain specific knowledge, so "rationality" would just be one module amongst many, which tries to find and correct systematic deviations from ideal (unbounded) rationality caused by the other modules.</p> <p>I was more confused when I started writing this post, but now I seem to have largely answered my own question (modulo the uncertainty about the nature of intelligence mentioned above). However I'm still interested to know how others would answer it. Do we have the same understanding of what "rationality" and "intelligence" mean, and know what distinction someone is trying to draw when they use one of these words instead of the other?</p> <p><strong>ETA: </strong>To clarify, I'm asking about the difference between general intelligence and rationality as theoretical concepts that apply to all agents. 
Human rationality vs intelligence may give us a clue to that answer, but isn't the main thing that I'm interested in here.</p> wei_dai o8YF6L9sRv4k9zbfb 2014-08-13T11:19:53.062Z Six Plausible Meta-Ethical Alternatives https://www.lesswrong.com/posts/orhEa4wuRJHPmHFsR/six-plausible-meta-ethical-alternatives <p>In this post, I list six metaethical possibilities that I think are plausible, along with some arguments or plausible stories about how/why they might be true, where that's not obvious. A lot of people seem fairly certain in their metaethical views, but I'm not and I want to convey my uncertainty as well as some of the reasons for it.</p> <ol> <li>Most intelligent beings in the multiverse share similar preferences. This came about because there are facts about what preferences one should have, just like there exist facts about what decision theory one should use or what prior one should have, and species that manage to build intergalactic civilizations (or the equivalent in other universes) tend to discover all of these facts. There are occasional paperclip maximizers that arise, but they are a relatively minor presence or tend to be taken over by more sophisticated minds.</li> <li>Facts about what everyone should value exist, and most intelligent beings have a part of their mind that can discover moral facts and find them motivating, but those parts don't have full control over their actions. These beings eventually build or become rational agents with values that represent compromises between different parts of their minds, so most intelligent beings end up having shared moral values along with idiosyncratic values.</li> <li>There aren't facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. 
These facts may include, for example, what is the right way to deal with <a href="/lw/fyb/ontological_crisis_in_humans/">ontological crises</a>. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to <em>become</em> rational, that would seem like a strange state of affairs.</li> <li>None of the above facts exist, so the only way to become or build a rational agent is to just think about what preferences you want your future self or your agent to hold, until you make up your mind in some way that depends on your psychology. But at least this process of reflection is convergent at the individual level so each person can reasonably call the preferences that they endorse after reaching reflective equilibrium their morality or real values. </li> <li>None of the above facts exist, and reflecting on what one wants turns out to be a divergent process (e.g., it's highly sensitive to initial conditions, like whether or not you drank a cup of coffee before you started, or to the order in which you happen to encounter philosophical arguments). There are still facts about rationality, so at least agents that are already rational can call their utility functions (or the equivalent of utility functions in whatever decision theory ends up being the right one) their real values.</li> <li>There aren't any normative facts at all, including facts about what is rational. For example, it turns out there is no one decision theory that does better than every other decision theory in every situation, and there is no obvious or widely-agreed-upon way to determine which one "wins" overall.</li> </ol> <p>(Note that for the purposes of this post, I'm concentrating on morality in the axiological sense (what one should value) rather than in the sense of cooperation and compromise. 
So alternative 1, for example, is not intended to include the possibility that most intelligent beings end up merging their preferences through some kind of grand <a href="http://wiki.lesswrong.com/wiki/Acausal_trade">acausal bargain</a>.)</p> <p>It may be useful to classify these possibilities using labels from academic philosophy. Here's my attempt: 1. realist + internalist 2. realist + externalist 3. <a href="http://plato.stanford.edu/entries/moral-anti-realism/moral-subjectivism-versus-relativism.html">relativist</a> 4. <a href="http://plato.stanford.edu/entries/moral-anti-realism/#Sub">subjectivist</a> 5. moral anti-realist 6. normative anti-realist. (A lot of debates in metaethics concern the meaning of ordinary moral language, for example whether it refers to facts or merely expresses attitudes. I mostly ignore such debates in the above list, because it's not clear what implications they have for the questions that I care about.)</p> <p>One question LWers may have is, where does Eliezer's metaethics fall into this schema? Eliezer says that there <em>are</em> moral facts about what values every intelligence in the multiverse should have, but only humans are likely to discover these facts and be motivated by them. To me, Eliezer's use of language is counterintuitive, and since it seems plausible that there are facts about what everyone should value (or how each person should translate their non-preferences into preferences) that most intelligent beings can discover and be at least somewhat motivated by, I'm reserving the phrase "moral facts" for these. In my language, I think 3 or maybe 4 is probably closest to Eliezer's position.</p> wei_dai orhEa4wuRJHPmHFsR 2014-08-06T00:04:14.485Z Look for the Next Tech Gold Rush? 
https://www.lesswrong.com/posts/Jter3YhFBZFYo8vtq/look-for-the-next-tech-gold-rush <p>In early 2000, I registered my personal domain name weidai.com, along with a couple others, because I was worried that the small (sole-proprietor) ISP I was using would go out of business one day and break all the links on the web to the articles and software that I had published on my "home page" under its domain. Several years ago I started getting offers, asking me to sell the domain, and now they're coming in almost every day. A couple of days ago I saw the first six figure offer ($100,000).</p> <p>In early 2009, someone named Satoshi Nakamoto emailed me personally with an announcement that he had published version 0.1 of Bitcoin. I didn't pay much attention at the time (I was more interested in Less Wrong than Cypherpunks at that point), but then in early 2011 I saw a LW article about Bitcoin, which prompted me to start mining it. I wrote at the time, "thanks to the discussion you started, I bought a Radeon 5870 and started mining myself, since it looks likely that I can at least break even on the cost of the card." That approximately $200 investment (plus maybe another $100 in electricity) is also worth around six figures today.</p> <p>Clearly, technological advances can sometimes create gold rush-like situations (i.e., first-come-first-serve opportunities to make truly extraordinary returns with minimal effort or qualifications). And it's possible to stumble into them without even trying. Which makes me think, maybe we <em>should</em> be trying? I mean, if only I had been <em>looking</em> for possible gold rushes, I could have registered a hundred domain names optimized for potential future value, rather than the few that I happened to personally need. 
Or I could have started mining Bitcoins a couple of years earlier and be a thousand times richer.</p> <p>I wish I was already an experienced gold rush spotter, so I could explain how best to do it, but as indicated above, I participated in the ones that I did more or less by luck. Perhaps the first step is just to keep one's eyes open, and to keep in mind that tech-related gold rushes do happen from time to time and they are not impossibly difficult to find. What other ideas do people have? Are there other past examples of tech gold rushes besides the two that I mentioned? What might be some promising fields to look for them in the future?</p> wei_dai Jter3YhFBZFYo8vtq 2014-07-19T10:08:53.127Z Outside View(s) and MIRI's FAI Endgame https://www.lesswrong.com/posts/WdibyPFqYkCkLGxCq/outside-view-s-and-miri-s-fai-endgame <p>On the subject of how an FAI team can avoid accidentally creating a UFAI, Carl Shulman <a href="/lw/8c3/qa_with_new_executive_director_of_singularity/596l">wrote</a>:</p> <blockquote> <p>If we condition on having all other variables optimized, I'd expect a team to adopt very high standards of proof, and recognize limits to its own capabilities, biases, etc. One of the primary purposes of organizing a small FAI team is to create a team that can actually stop and abandon a line of research/design (Eliezer calls this "halt, melt, and catch fire") that cannot be shown to be safe (given limited human ability, incentives and bias).</p> </blockquote> <p>In the history of philosophy, there have been many steps in the right direction, but virtually no significant problems have been fully solved, such that philosophers can agree that some proposed idea can be the last words on a given subject. An FAI design involves making many explicit or implicit philosophical assumptions, many of which may then become fixed forever as governing principles for a new reality. They'll end up being last words on their subjects, whether we like it or not. 
Given the history of philosophy and applying the outside view, how can an FAI team possibly reach "very high standards of proof" regarding the safety of a design? But if we can foresee that they <em>can't</em>, then what is the point of aiming for that predictable outcome now?</p> <p>Until recently I haven't paid a lot of attention to the discussions here about inside view vs outside view, because the discussions have tended to focus on the applicability of these views to the problem of predicting intelligence explosion. It seemed obvious to me that outside views can't possibly rule out intelligence explosion scenarios, and even a small probability of a future intelligence explosion would justify a much higher than current level of investment in preparing for that possibility. But given that the inside vs outside view debate may also be relevant to the "FAI Endgame", I read up on Eliezer and Luke's most recent writings on the subject... and found them to be unobjectionable. Here's <a href="/lw/vz/the_weak_inside_view/">Eliezer</a>:</p> <blockquote> <p>On problems that are drawn from a barrel of causally similar problems, where human optimism runs rampant and unforeseen troubles are common, the <a href="/lw/jg/planning_fallacy/">Outside View beats the Inside View</a>.&nbsp;</p> </blockquote> <p>Does anyone want to argue that Eliezer's criteria for using the outside view are wrong, or don't apply here?</p> <p>And <a href="/lw/hzu/model_combination_and_adjustment/">Luke</a>:</p> <blockquote> <p>One obvious solution is to use <em>multiple</em> reference classes, and weight them by how relevant you think they are to the phenomenon you're trying to predict.</p> <p>[...]</p> <p>Once you've combined a handful of models to arrive at a qualitative or quantitative judgment, you should still be able to "adjust" the judgment in some cases using an inside view.</p> </blockquote> <p>These ideas seem harder to apply, so I'll ask for readers' help. 
What reference classes should we use here, in addition to past attempts to solve philosophical problems? What inside view adjustments could a future FAI team make, such that they might justifiably overcome (the most obvious-to-me) outside view's conclusion that they're very unlikely to be in the possession of complete and fully correct solutions to a diverse range of philosophical problems?</p> wei_dai WdibyPFqYkCkLGxCq 2013-08-28T23:27:23.372Z Three Approaches to "Friendliness" https://www.lesswrong.com/posts/vrnhfGuYTww3fKhAM/three-approaches-to-friendliness <p>I put "Friendliness" in quotes in the title, because I think what we <a href="/lw/hll/to_reduce_astronomical_waste_take_your_time_then/9csk">really want</a>, and what MIRI seems to be working towards, is closer to "optimality": create an AI that minimizes the expected amount of astronomical waste. In what follows I will continue to use "Friendly AI" to denote such an AI since that's the established convention.</p> <p>I've often stated my objections to MIRI's plan to build an FAI directly (instead of after human intelligence has been substantially enhanced). But it's not because, as some have suggested while criticizing MIRI's FAI work, we can't foresee what problems need to be solved. I think it's because we <em>can</em> largely foresee what kinds of problems need to be solved to build an FAI, but they all look superhumanly difficult, either due to their inherent difficulty, or the lack of opportunity for "trial and error", or both.</p> <p>When people say they don't know what problems need to be solved, they may be mostly talking about "AI safety" rather than "Friendly AI". If you think in terms of "AI safety" (i.e., making sure some particular AI doesn't cause a disaster) then that does look like a problem that depends on what kind of AI people will build. 
"Friendly AI" on the other hand is really a very different problem, where we're trying to figure out what kind of AI to build in order to minimize astronomical waste. I suspect this may explain the apparent disagreement, but I'm not sure. I'm hoping that explaining my own position more clearly will help figure out whether there is a real disagreement, and what's causing it.</p> <p>The basic issue I see is that there is a large number of serious philosophical problems facing an AI that is meant to take over the universe in order to minimize astronomical waste. The AI needs a full solution to moral philosophy to know which configurations of particles/fields (or perhaps which dynamical processes) are most valuable and which are not. Moral philosophy in turn seems to have dependencies on the philosophy of mind, consciousness, metaphysics, aesthetics, and other areas. The FAI also needs solutions to many problems in decision theory, epistemology, and the philosophy of mathematics, in order to not be stuck with making wrong or suboptimal decisions for eternity. These essentially cover all the major areas of philosophy.</p> <p>For an FAI builder, there are three ways to deal with the presence of these open philosophical problems, as far as I can see. (There may be other ways for the future to turn out well without the AI builders making any special effort, for example if being philosophical is just a natural attractor for any superintelligence, but I don't see any way to be confident of this ahead of time.) 
I'll name them for convenient reference, but keep in mind that an actual design may use a mixture of approaches.</p> <ol> <li><strong>Normative AI</strong> - Solve all of the philosophical problems ahead of time, and code the solutions into the AI.</li> <li><strong>Black-Box Metaphilosophical AI</strong> - Program the AI to use the minds of one or more human philosophers as a black box to help it solve philosophical problems, without the AI builders understanding what "doing philosophy" actually is.</li> <li><strong>White-Box Metaphilosophical AI</strong> - Understand the nature of philosophy well enough to specify "doing philosophy" as an algorithm and code it into the AI.</li> </ol> <p>The problem with <strong>Normative </strong><strong>AI</strong>, besides the obvious inherent difficulty (as evidenced by the slow progress of human philosophers after decades, sometimes centuries of work), is that it requires us to anticipate all of the philosophical problems the AI might encounter in the future, from now until the end of the universe. We can certainly foresee some of these, like the problems associated with agents being copyable, or the AI radically changing its ontology of the world, but what might we be missing?</p> <p><strong>Black-Box Metaphilosophical AI</strong> is also risky, because it's hard to test/debug something that you don't understand. Besides that general concern, designs in this category (such as Paul Christiano's <a href="/lw/c0k/formalizing_value_extrapolation/">take on indirect normativity</a>) seem to require that the AI achieve superhuman levels of optimizing power <em>before</em> being able to solve its philosophical problems, which seems to mean that a) there's no way to test them in a safe manner, and b) it's unclear why such an AI won't cause disaster in the time period before it achieves philosophical competence.</p> <p><strong>White-Box Metaphilosophical AI</strong> may be the most promising approach. 
There is no strong empirical evidence that solving <a href="/lw/2id/metaphilosophical_mysteries/">metaphilosophy</a> is superhumanly difficult, simply because not many people have attempted to solve it. But I don't think that a reasonable prior combined with what evidence we do have (i.e., absence of visible progress or clear hints as to how to proceed) gives much cause for optimism either.</p> <p>To recap, I think we can largely already see what kinds of problems must be solved in order to build a superintelligent AI that will minimize astronomical waste while colonizing the universe, and it looks like they probably can't be solved correctly with high confidence until humans become significantly smarter than we are now. I think I understand why some people disagree with me (e.g., Eliezer thinks these problems just aren't <em>that</em> hard, relative to his abilities), but I'm not sure why some others say that we don't yet know what the problems will be.</p> wei_dai vrnhfGuYTww3fKhAM 2013-07-17T07:46:07.504Z Normativity and Meta-Philosophy https://www.lesswrong.com/posts/LPRuP6vdDBTeGnXmC/normativity-and-meta-philosophy <p>I find Eliezer's&nbsp;explanation&nbsp;of what "should" means to be <a href="/lw/1fz/a_less_wrong_singularity_article/19pv">unsatisfactory</a>, and here's an attempt to do better. Consider the following usages of the word:</p> <ol> <li>You should stop building piles of X pebbles because X = Y*Z.</li> <li>We should kill that police informer and dump his body in the river.</li> <li>You should one-box in Newcomb's problem.</li> </ol> <p>All of these seem to be sensible sentences, depending on the speaker and intended audience. #1, for example, seems a reasonable translation of what a <a href="/lw/sy/sorting_pebbles_into_correct_heaps/">pebblesorter</a> would say after discovering that X = Y*Z. 
Some might argue for "pebblesorter::should" instead of plain "should", but it's hard to deny that we need "should" in some form to fill the blank there for a translation, and I think few people besides Eliezer would object to plain "should".</p> <p>Normativity, or the idea that there's something in common about how "should" and similar words are used in different contexts, is an active area in academic philosophy. I won't try to <a href="http://analysis.oxfordjournals.org/content/70/2/331.full.pdf?keytype=ref&amp;ijkey=Q50DmwsULURmO5j">survey</a> the current theories, but my current thinking is that "should" usually means "better according to some shared, motivating standard or procedure of evaluation", but occasionally it can also be used to <em>instill </em>such a standard or procedure of evaluation in someone (such as a child) who is open to being instilled by the speaker/writer.</p> <p>It seems to me that different agents (including different humans) can have different motivating standards and procedures of evaluation, and apparent disagreements about "should" sentences can arise from having different standards/procedures or from disagreement about whether something is better according to a shared standard/procedure. In most areas my personal procedure of evaluation is something that might be called "doing philosophy", but many people apparently do not share this. For example, a religious extremist may have been taught by their parents, teachers, or peers to follow some rigid moral code given in their holy books, and not be open to any philosophical arguments that I can offer.</p> <p>Of course this isn't a fully satisfactory theory of normativity since I don't know <a href="/lw/2id/metaphilosophical_mysteries/">what "philosophy" really is</a> (and I'm not even sure it really is a thing). 
But it does help explain how "should" in morality might relate to "should" in other areas such as decision theory, does not require assuming that all humans ultimately share the same morality, and avoids the need for linguistic contortions such as "pebblesorter::should".</p> wei_dai LPRuP6vdDBTeGnXmC 2013-04-23T20:35:16.319Z Outline of Possible Sources of Values https://www.lesswrong.com/posts/uFEu2Y7efZ8CzCD5F/outline-of-possible-sources-of-values <p>I don't know what my values are. I don't even know how to find out what my values are. But do I know something about how I (or an <a href="http://wiki.lesswrong.com/wiki/Friendly_artificial_intelligence">FAI</a>) <em>may</em> be able to find out what my values are? Perhaps... and I've organized my answer to this question in the form of an "Outline of Possible Sources of Values". I hope it also serves as a summary of the major open problems in this area.</p> <ol> <li>External<ol> <li>god(s)</li> <li>other humans</li> <li>other agents</li> </ol></li> <li>Behavioral<ol> <li>actual (historical/observed) behavior</li> <li>counterfactual (simulated/predicted) behavior</li> </ol></li> <li>Subconscious Cognition<ol> <li>model-based decision making<ol> <li>ontology</li> <li>heuristics for extrapolating/updating model</li> <li>(partial) <a href="/lw/9jh/the_humans_hidden_utility_function_maybe/">utility function</a></li> </ol></li> <li>model-free decision making<ol> <li>identity based (adopt a social role like "environmentalist" or "academic" and emulate an appropriate role model, actual or idealized)</li> <li>habits</li> <li>reinforcement based</li> </ol></li> </ol></li> <li>Conscious Cognition<ol> <li>decision making using explicit verbal and/or quantitative reasoning<ol> <li>consequentialist (similar to model-based above, but using <a href="/lw/2yp/making_your_explicit_reasoning_trustworthy/">explicit reasoning</a>)</li> <li>deontological</li> <li>virtue ethical</li> <li>identity based</li> </ol></li> <li>reasoning about 
terminal goals/values/preferences/moral principles<ol> <li>responses (changes in state) to moral arguments (possibly context dependent)</li> <li>distributions of autonomously generated moral arguments (possibly context dependent)</li> <li>logical structure (if any) of moral reasoning</li> </ol></li> <li>object-level intuitions/judgments<ol> <li>about what one should do in particular ethical situations</li> <li>about the desirabilities of particular outcomes</li> <li>about moral principles</li> </ol></li> <li>meta-level&nbsp;intuitions/judgments<ol> <li>about the nature of morality</li> <li>about the <a href="/lw/6us/whats_wrong_with_simplicity_of_value/">complexity of values</a></li> <li>about what the valid sources of values are</li> <li>about what constitutes correct moral reasoning</li> <li>about how to explicitly/formally/effectively represent values (utility function, multiple utility functions, deontological rules, or something else) (if utility function(s), for what decision theory and ontology?)</li> <li>about how to&nbsp;extract/translate/combine sources of values into a representation of values<ol> <li>how to solve <a href="http://wiki.lesswrong.com/wiki/Ontological_crisis">ontological crisis</a></li> <li>how to deal with native utility function or revealed preferences being partial</li> <li>how to translate non-consequentialist sources of values into utility function(s)</li> <li>how to deal with moral principles being vague and incomplete</li> <li>how to deal with conflicts between different sources of values</li> <li>how to deal with <a href="http://wiki.lesswrong.com/wiki/Moral_uncertainty">lack of certainty</a> in one's intuitions/judgments</li> </ol></li> <li>whose intuition/judgment ought to be applied? (may be different for each of the above)<ol> <li>the subject's (at what point in time? 
current intuitions, eventual judgments, or something in between?)</li> <li>the FAI designers'</li> <li>the FAI's own philosophical conclusions</li> </ol></li> </ol></li> </ol></li> </ol> <p>Using this outline, we can obtain a concise understanding of what many metaethical theories and FAI proposals are claiming/suggesting and how they differ from each other. For example, Nyan_Sandwich's "<a href="/lw/g7y/morality_is_awesome/">morality is awesome</a>" thesis can be interpreted as the claim that the most important source of values is our intuitions about the desirability (awesomeness) of&nbsp;particular outcomes.</p> <p>As another example, Aaron Swartz argued <a href="http://www.aaronsw.com/weblog/ethicsfor">against "reflective equilibrium"</a> by which he meant the claim that the valid sources of values are our object-level moral intuitions, and that correct moral reasoning consists of working back and forth between these intuitions until they reach coherence. His own position was that intuitions about moral principles are the only valid source of values and we should discount our intuitions about particular ethical situations.</p> <p>A final example is Paul Christiano's "<a href="/lw/c0k/formalizing_value_extrapolation/">Indirect Normativity</a>"&nbsp;proposal&nbsp;(n.b., "Indirect Normativity" was originally coined by Nick Bostrom to refer to an entire class of designs where the AI's values are defined "indirectly") for FAI, where an important source of values is the&nbsp;distribution&nbsp;of&nbsp;moral arguments the subject is likely to generate in a particular simulated environment and their responses to those arguments. 
Also, just about every meta-level question is left for the (simulated) subject to answer, except for the decision theory and ontology of the utility function that their values must finally be encoded in, which is fixed by the FAI designer.</p> <p>I think the outline includes most of the ideas brought up in past LW discussions, or in moral philosophies that I'm familiar with. Please let me know if I left out anything important.</p> wei_dai uFEu2Y7efZ8CzCD5F 2013-01-18T00:14:49.866Z How to signal curiosity? https://www.lesswrong.com/posts/qKjwd4zR9PvB9Fxfw/how-to-signal-curiosity <p>At LessWrong we encourage people to be <a href="/lw/4ku/use_curiosity/">curious</a>. Curiosity causes people to ask questions, but sometimes those questions get misinterpreted as social challenges or rhetorical techniques, or maybe just regular questions that you don't have a "<a href="/lw/jz/the_meditation_on_curiosity/">burning itch</a>" to know the answers for (and hence maybe not particularly worth answering). I sometimes preface a question by "I'm curious," but of course anyone could say that so it's not a very effective way to distinguish oneself as being genuinely curious. Another thing I sometimes do is to try to answer the question myself and present one or more answers as my "guesses" and ask if one of them is correct, since someone who is genuinely curious is more likely to put in such effort. 
But unfortunately sometimes that backfires when the person you're directing the question at interprets the guesses as a way to make them look bad, because for example you failed to hypothesize the actual answer and include it as one of the guesses, and all your guesses make them look worse than the actual answer.</p> <p>I've noticed examples of&nbsp;this happening to others on LW (or at least possibly happening, since I can't be sure whether someone else really&nbsp;is&nbsp;curious) as well as to myself, and can only imagine that the problem is even worse elsewhere, where people may not give each other as much benefit of the doubt as we do around here. So my question is, what can curious people do to signal their genuine curiosity when asking questions? Has anyone thought about this question already, or perhaps can recognize some strategies they already employ and make them explicit for the rest of us?</p> <p><strong>ETA:</strong>&nbsp;Perhaps I should say a bit more about the kind of situation I have in mind. Often I'll see a statement from someone that either contradicts my existing beliefs about something or is on a topic that I'm pretty ignorant about, and it doesn't come with an argument or evidence to back it up. I'd think "I don't want to just take their word since they might be wrong, but there also seems a good chance that they know something that I don't in which case I'd really like to know what it is, so let's ask why they're saying what they're saying." And unfortunately this sometimes gets interpreted as "I'm pretty sure you're wrong, and I'm going to&nbsp;embarrass&nbsp;you by asking a question that I don't think you can answer."</p> <p><strong>ETA2:</strong>&nbsp;The reason I use "signal" in the title is that people who <em>do</em> just want to embarrass the other person would want to have plausible deniability. 
If it were clear that's their intention and it turns out that the other person has a perfectly good answer, then they'll be the one embarrassed instead. So ideally the curious person should send a signal that can't be faked by someone who just wants to pretend to be curious.</p> wei_dai qKjwd4zR9PvB9Fxfw 2013-01-11T22:47:23.698Z Morality Isn't Logical https://www.lesswrong.com/posts/QvYKSFmsBX3QhgQvF/morality-isn-t-logical <p>What do I mean by "morality isn't logical"? I mean in the same sense that mathematics is logical but literary criticism isn't: the "reasoning" we use to think about morality doesn't resemble logical reasoning. All systems of logic that I'm aware of have a concept of proof and a method of verifying with a high degree of certainty whether an argument constitutes a proof. As long as the logic is consistent (and we have good reason to think that many of them are), once we verify a proof we can accept its conclusion without worrying that there may be another proof that establishes the opposite conclusion. With morality though, we have no such method, and people constantly make moral arguments that can be reversed or called into question by other moral arguments. (Edit: For an example of this, see <a href="/lw/n3/circular_altruism/">these</a> <a href="/lw/1r9/shut_up_and_divide/">posts</a>.)</p> <p>Without being a system of logic, moral philosophical reasoning likely (or at least plausibly) doesn't have any of the nice properties that a well-constructed system of logic would have, for example, consistency, validity, soundness, or even the more basic property that considering arguments in a different order, or in a different mood, won't cause a person to accept an entirely different set of conclusions. 
For all we know, somebody trying to reason about a <a href="/lw/tc/unnatural_categories/">moral concept</a> like "fairness" may just be taking a random walk as they move from one conclusion to another based on moral arguments they encounter or think up.</p> <p>In a <a href="/lw/fv3/by_which_it_may_be_judged/">recent post</a>, Eliezer said "morality is <em>logic</em>", by which he seems to mean... well, I'm still not exactly sure what, but one interpretation is that a person's cognition about morality can be described as an algorithm, and that <em>algorithm</em> can be studied using logical reasoning. (Which of course is true, but in that sense both math and literary criticism as well as every other subject of human study would be logic.) In any case, I don't think Eliezer is explicitly claiming that an algorithm-for-thinking-about-morality constitutes an algorithm-for-doing-logic, but I worry that the characterization of&nbsp;"morality is&nbsp;logic"&nbsp;may cause some connotations of "logic" to be inappropriately&nbsp;<a href="/lw/ny/sneaking_in_connotations/">sneaked</a> into "morality". For example, Eliezer seems to (at least <a href="/lw/sm/the_meaning_of_right/to4">at one point</a>) assume that considering moral arguments in a different order <em>won't</em> cause a human to accept an entirely different set of conclusions, and maybe this is why. To fight this potential sneaking of connotations, I suggest that when you see the phrase "morality is logic", remind yourself that morality isn't logical.</p> wei_dai QvYKSFmsBX3QhgQvF 2012-12-26T23:08:09.419Z Beware Selective Nihilism https://www.lesswrong.com/posts/uXxoLPKAdunq6Lm3s/beware-selective-nihilism <p>In a <a href="/lw/fyb/ontological_crisis_in_humans/">previous post</a>, I argued that nihilism is often short-changed around here. 
However, I'm far from certain that it is correct, and in the meantime I think we should be careful not to discard our values one at a time by engaging in "selective nihilism" when faced with an ontological crisis, without even realizing that's what's happening. Karl recently reminded me of the post <a href="/lw/qx/timeless_identity/">Timeless Identity</a> by Eliezer Yudkowsky, which I noticed seems to be an instance of this.</p> <p>As I mentioned in the previous post, our values seem to be defined in terms of a world model where people exist as ontologically primitive entities ruled heuristically by (mostly intuitive understandings of) physics and psychology. In this kind of decision system, both identity-as-physical-continuity and identity-as-psychological-continuity make perfect sense as possible values, and it seems humans do "natively" have both values. A typical human being is both&nbsp;reluctant&nbsp;to step into a teleporter that works by destructive scanning, and unwilling to let their physical structure be continuously modified into a&nbsp;psychologically very different being.&nbsp;</p> <p>If faced with the knowledge that&nbsp;physical continuity doesn't exist in the real world at the level of fundamental physics, one might conclude that it's crazy to continue to value it, and this is what Eliezer's post argued. But if we apply this reasoning in a non-selective fashion, wouldn't we also conclude that we should stop valuing things like "pain" and "happiness" which also do not seem to exist at the level of&nbsp;fundamental physics?</p> <p>In our current environment, there is widespread agreement&nbsp;among&nbsp;humans as to which macroscopic objects at time t+1 are physical&nbsp;continuations&nbsp;of which macroscopic objects existing at time t. 
We may not fully understand what exactly it is we're doing when judging such physical continuity, and the agreement tends to break down when we start talking about more exotic situations, and if/when we do fully understand our criteria for judging physical continuity it's unlikely to have a simple&nbsp;definition&nbsp;in terms of fundamental physics, but all of this is true&nbsp;for "pain" and "happiness" as well.</p> <p>I suggest we keep all of our (potential/apparent) values intact until we have a better handle on how we're supposed to deal with ontological crises in general. If we convince ourselves that we should discard some value, and that turns out to be wrong, the error may be unrecoverable once we've lived with it long enough.</p> wei_dai uXxoLPKAdunq6Lm3s 2012-12-20T18:53:05.496Z Ontological Crisis in Humans https://www.lesswrong.com/posts/KLaJjNdENsHhKhG5m/ontological-crisis-in-humans <p>Imagine a robot that was designed to find and collect spare change around its owner's house. It had a world model where macroscopic everyday objects are ontologically primitive and ruled by high-school-like physics and (for humans and their pets)&nbsp;rudimentary&nbsp;psychology and animal behavior. Its goals were expressed as a utility function over this world model, which was sufficient for its designed purpose. All went well until one day, a prankster decided to "upgrade" the robot's world model to be based on modern particle physics. 
This unfortunately caused the robot's utility function to instantly throw a <a href="http://stackoverflow.com/questions/641064/what-is-a-domain-error">domain error</a> exception (since its inputs are no longer the expected list of macroscopic objects and associated properties like shape and color), thus crashing the controlling AI.</p> <p>According to Peter de Blanc, who used the phrase "<a href="http://wiki.lesswrong.com/wiki/Ontological_crisis">ontological crisis</a>" to describe this kind of problem,</p> <blockquote> <p>Human beings also confront ontological crises. We should find out what cognitive algorithms humans use to solve the same problems described in this paper. If we wish to build agents that maximize human values, this may be aided by knowing how humans re-interpret their values in new ontologies.</p> </blockquote> <p>I recently realized that a couple of problems that I've been thinking over&nbsp;(the&nbsp;<a href="/lw/8gk/where_do_selfish_values_come_from">nature of selfishness</a>&nbsp;and the&nbsp;<a href="/lw/4qg/a_thought_experiment_on_pain_as_a_moral_disvalue/">nature of pain/pleasure/suffering/happiness</a>)&nbsp;can be considered instances of ontological crises in humans (although I'm not so sure we necessarily have the cognitive algorithms to solve them).&nbsp;I started thinking in this direction after writing <a href="/lw/b7w/decision_theories_a_semiformal_analysis_part_iii/6hef">this comment</a>:</p> <blockquote> <p>This formulation or variant of TDT requires that before a decision problem is handed to it, the world is divided into the agent itself (X), other agents (Y), and "dumb matter" (G). I think this is misguided, since the world doesn't really divide cleanly into these 3 parts.</p> </blockquote> <p>What struck me is that even though the world doesn't divide cleanly into these 3 parts, <em>our models</em>&nbsp;of the world actually do. 
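The spare-change robot's failure mode can be made concrete with a toy sketch (hypothetical illustrative code, not from the original post; all class and function names are invented): a utility function written against an ontology of macroscopic objects simply has no defined value for a particle-level world state, so the most honest thing it can do is raise a domain error.

```python
# Toy model of an ontological crisis (hypothetical sketch): a utility
# function defined over a macroscopic-object world model is handed a
# state expressed in a different, particle-level ontology.

class MacroObject:
    """An ontologically primitive everyday object in the old world model."""
    def __init__(self, kind, position):
        self.kind = kind          # e.g. "quarter", "sofa", "cat"
        self.position = position

class Particle:
    """An element of the 'upgraded' particle-physics world model."""
    def __init__(self, position, velocity):
        self.position = position
        self.velocity = velocity

def coin_utility(world_state):
    """Utility over the old ontology: count the quarters in the world."""
    for obj in world_state:
        if not isinstance(obj, MacroObject):
            # Inputs from the new ontology are outside this function's domain.
            raise ValueError("domain error: expected macroscopic objects")
    return sum(1 for obj in world_state if obj.kind == "quarter")

old_world = [MacroObject("quarter", (1, 2)), MacroObject("sofa", (3, 4))]
new_world = [Particle((0.1, 0.2), (0.0, 0.0))]  # the prankster's "upgrade"

print(coin_utility(old_world))  # well-defined on the old model: 1
try:
    coin_utility(new_world)
except ValueError as e:
    print(e)  # the utility function has no answer for the new ontology
```

Any fix would amount to a translation function that maps particle-level states back into `MacroObject` terms before evaluating utility; specifying such a mapping for human values, rather than for coin-counting, is the hard part.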
In the world models that we humans use on a day to day basis, and over which our utility functions seem to be defined (<a href="/lw/9jh/the_humans_hidden_utility_function_maybe/">to the extent</a> that we can be said to have utility functions at all), we do take the Self, Other People, and various Dumb Matter to be ontologically primitive entities. Our world models, like the coin collecting robot's, consist of these macroscopic objects ruled by a hodgepodge of heuristics and prediction algorithms, rather than&nbsp;microscopic particles governed&nbsp;by a coherent set of laws of physics.</p> <p>For example, the amount of pain someone is experiencing doesn't seem to exist in the real world as an XML tag attached to some "person entity", but that's pretty much how our models of the world work, and perhaps more importantly, that's what our utility functions expect their inputs to look like (as opposed to, say, a list of particles and their positions and velocities). Similarly, a human can be selfish just by treating the object labeled "SELF" in its world model differently from other objects, whereas an AI with a world model consisting of microscopic particles would need to somehow inherit or learn a detailed description of itself in order to&nbsp;be selfish.</p> <p>To fully confront the ontological crisis that we face, we would have to upgrade our world model to be based on actual physics, and simultaneously translate our utility functions so that their domain is the set of possible states of the new model. We currently have little idea how to accomplish this, and instead what we do in practice is, as far as I can tell, keep our ontologies intact and utility functions unchanged, but just add some new&nbsp;heuristics that in certain limited circumstances call out to new physics formulas to better update/extrapolate our models. 
This is actually rather clever, because it lets us make use of updated understandings of physics without ever having to, for instance, decide exactly what patterns of particle movements constitute pain or pleasure, or&nbsp;what&nbsp;patterns constitute oneself. Nevertheless, this approach hardly seems capable of being extended to work in a future where many people may have&nbsp;nontraditional&nbsp;mind architectures, or have a zillion copies of themselves running on all kinds of strange substrates, or be merged into amorphous group minds with no clear boundaries between individuals.</p> <p>By the way, I think nihilism often gets short-changed <a href="/lw/sc/existential_angst_factory/">around</a> <a href="/lw/5i7/on_being_okay_with_the_truth/">here</a>. Given that we do not actually have at hand a solution to ontological crises in general or to the specific crisis that we face, what's wrong with saying that the solution set may just be null? Given that evolution doesn't constitute a particularly benevolent and farsighted designer, perhaps we may not be able to do much better than that poor spare-change-collecting robot? If Eliezer is <a href="/r/discussion/lw/827/ai_ontology_crises_an_informal_typology/7pqr">worried</a> that actual AIs facing actual ontological crises could do worse&nbsp;than just crash, should we be very sanguine that for humans everything must "add up to moral normality"?</p> <p>To expand a bit more on this possibility, many people have an aversion to moral arbitrariness, so we need at a minimum a utility translation scheme that's principled enough to pass that filter. But our existing world models are a hodgepodge put together by evolution so there may not be any such sufficiently&nbsp;principled&nbsp;scheme, which (if other approaches to solving moral philosophy also don't pan out) would leave us with legitimate&nbsp;feelings of "existential angst" and nihilism. 
One could perhaps still argue that any <em>current</em> such feelings are premature, but maybe some people have stronger intuitions than others that these problems are unsolvable?</p> <p>Do we have any examples of humans successfully navigating an ontological crisis? The LessWrong Wiki <a href="http://wiki.lesswrong.com/wiki/Ontological_crisis">mentions</a> loss of faith in God:</p> <blockquote> <p>In the human context, a clear example of an ontological crisis is a believer&rsquo;s loss of faith in God. Their motivations and goals, coming from a very specific view of life suddenly become obsolete and maybe even nonsense in the face of this new configuration. The person will then experience a deep crisis and go through the psychological task of reconstructing its set of preferences according the new world view.</p> </blockquote> <p>But I don't think loss of faith in God actually constitutes an ontological crisis, or if it does, certainly not a very severe one. An ontology consisting of Gods, Self, Other People, and Dumb Matter just isn't very different from one consisting of Self, Other People, and Dumb Matter (the latter could just be considered a special case of the former with quantity of Gods being 0), especially when you compare either&nbsp;ontology&nbsp;to one made of microscopic particles or even <a href="http://en.wikipedia.org/wiki/Loop_quantum_gravity">less</a>&nbsp;<a href="http://en.wikipedia.org/wiki/String_theory">familiar</a>&nbsp;<a href="http://en.wikipedia.org/wiki/Ultimate_ensemble">entities</a>.</p> <p>But to end on a more positive note, realizing that seemingly unrelated problems are actually instances of a more general problem gives some hope that by "going meta" we can find a solution to all of these problems at once. 
Maybe we can solve many ethical problems simultaneously by discovering some generic algorithm that can be used by an agent to transition from any ontology to another?&nbsp;</p> <p>(Note that I'm not saying this <em>is</em>&nbsp;the right way to understand one's real preferences/morality, but just drawing attention to it as a possible alternative to other more "object level" or "purely&nbsp;philosophical" approaches. See also <a href="/lw/6ha/the_blueminimizing_robot/4gi2">this previous discussion</a>, which I recalled after writing most of the above.)</p> wei_dai KLaJjNdENsHhKhG5m 2012-12-18T17:32:39.150Z Reasons for someone to "ignore" you https://www.lesswrong.com/posts/WmbHx5F3aiEWKD29Y/reasons-for-someone-to-ignore-you <p>I often feel guilty for <a href="/lw/1gg/agree_retort_or_ignore_a_post_from_the_future/">ignoring</a> other people's comments or questions, and frustrated when other people seem to be ignoring me. If I can't indicate to someone exactly why I'm not answering, or can't receive such an indication myself, I can at least help my future selves and&nbsp;others&nbsp;obtain a better probability distribution over such reasons. To that end, I'm listing all of the reasons I can think of for someone to not respond to a comment/question, to save the effort of regenerating these hypotheses from scratch each time and prevent the possibility of failing to <em>consider</em>&nbsp;the actual reason. Note that these are not meant to be mutually exclusive.</p> <ul> <li>They haven't checked their inbox yet.</li> <li>They got too many responses in their inbox and didn't pay enough attention to yours.</li> <li>They are temporarily too busy to respond.</li> <li>They were planning to respond but then forgot to. </li> <li>They don't understand the comment yet and are still trying.</li> <li>They've stopped trying to understand the comment and don't expect further discussion to resolve the confusion. 
</li> <li>They think it's obvious that they agree.</li> <li>They think it's obvious that they disagree.</li> <li>They disagree and are planning to write up the reasons later.</li> <li>They don't know whether to agree or disagree and are still thinking about it.</li> <li>They think all useful information has been exchanged and it's not worth another comment just to indicate final agreement/disagreement.</li> <li>They think you just want to express your opinion and don't care what they think.</li> <li>They are&nbsp;tired of the discussion and don't want to think about it any more.</li> <li>The comment shows a level of intelligence and/or rationality and/or knowledge that makes it not worthwhile for them to engage you. </li> <li>They already addressed your question or point before but you missed it or didn't get it.</li> <li>They don't know how to answer your question and are too embarrassed&nbsp;to admit it.</li> <li>They interpreted your question as being addressed to the public rather than to them personally.</li> <li>They think most people already know the answer (or don't care to know) and don't want to bother answering just for you or a few other people.</li> <li>They think you are mainly signaling/status-seeking instead of truth-seeking.</li> <li>They are mainly&nbsp;signaling/status-seeking&nbsp;(perhaps subconsciously)&nbsp;and think not responding is optimal for that.</li> <li>They can't see how to respond honestly without causing or prolonging a personal&nbsp;enmity.</li> <li>They consider you a troll or potential troll and don't want to reinforce you with attention.</li> <li>They have an emotional aversion against talking to you. 
</li> <li>They have some other instrumental reason for not responding.</li> <li>Suggested by&nbsp;shminux: You're on a list of LWers they never reply to, because a number of prior conversations with you were invariably futile for one or more of the reasons described above, and their estimate of any future conversation going any better is very low. </li> <li>Suggested by wedrifid:&nbsp;Technical difficulties. They first read your comment via a mobile device, composed (mentally) a reply that would take too long to type on that medium, and two days later they either forget to type it out via keyboard, no longer care about the subject, or think that a late reply would be inappropriate given developments in the conversation.</li> <li>Suggested by wedrifid:&nbsp;Previous comments by them in the thread had been downvoted or otherwise opposed, and they choose to accede to the implied wishes of the community rather than try to fight or defy it.</li> <li>Suggested by cata: Not answering promptly caused them to feel guilty, which caused more delay and more guilt, so they never respond, to hide their shame.</li> <li>Suggested by wedrifid: They think your comment missed the point of the context and so doesn't make sense, but it is not important enough to embarrass you by explaining or challenging.</li> <li>Suggested by Morendil: Your post/comment didn't contain a single question mark, so there's no call to answer.</li> <li>Suggested by sixes_and_sevens:&nbsp;They think the discussion is going off topic.</li> <li>Suggested by Airedale:&nbsp;They're purposefully trying to disengage early rather than getting into a fight about who has the "last word" on the subject, e.g., on some level they may want to respond or even to "win" the exchange, but they're purposefully telling themselves to step away from the computer.</li> </ul> <p>If I missed any reasons (that happen often enough to be worth including in this list), please give them in the comments. 
See also <a href="/lw/5/issues_bugs_and_requested_features/11mr">this related comment</a>.</p> wei_dai WmbHx5F3aiEWKD29Y 2012-10-08T19:50:36.426Z "Hide comments in downvoted threads" is now active https://www.lesswrong.com/posts/XrLvxW7Wjo6b9rRCT/hide-comments-in-downvoted-threads-is-now-active <p>I just found out that <a href="http://code.google.com/p/lesswrong/issues/detail?id=345">a new website feature</a>&nbsp;was implemented 2 days ago. If a comment is voted to -4 or below, it and all replies and downstream comments from it will be hidden from Recent Comments, and further replies in that subthread will incur a 5-point karma penalty. The hiding, but not the karma penalty, applies retroactively to comments in that subthread posted before the -4 vote.</p> <p>This seems to be worth a discussion post, since most people are probably still voting things to below -3 without knowing the new&nbsp;consequences&nbsp;of doing so.</p> wei_dai XrLvxW7Wjo6b9rRCT 2012-10-05T07:23:56.318Z Under-acknowledged Value Differences https://www.lesswrong.com/posts/4XS5LQA6RadkMqdgt/under-acknowledged-value-differences <p>I've been reading a lot of the recent LW discussions on politics and gender, and noticed that people rarely bring up or explicitly acknowledge that different people affected by some political or gender issue have different values/preferences, and therefore solving the problem involves a strong element of bargaining and is not just a matter of straightforward optimization. Instead, we tend to talk as if there is some way to solve the problem that's best for everyone, and that rational discussion will bring us closer to finding that one best solution.</p> <p>For example, when discussing gender-related problems, one solution may be generally better for men, while another solution may be&nbsp;generally&nbsp;better for women. If people are selfish, then they will each prefer the solution that's individually best for them, even if they can agree on all of the facts. 
(It's <a href="/lw/8gk/where_do_selfish_values_come_from/">unclear</a> whether people <em>should</em>&nbsp;be selfish, but it seems best to assume that most are, for practical purposes.)</p> <p>Unfortunately, <a href="/lw/f6/epistemic_vs_instrumental_rationality_case_of_the/">in bargaining situations</a>, epistemic rationality is not necessarily instrumentally rational. In general, convincing others of a falsehood can be useful for moving the negotiated outcome closer to one's own preferences and away from others', and this may be done more easily if one honestly believes the falsehood. (One of these falsehoods may be, for example, "My preferred solution is best for everyone.") Given these (subconsciously&nbsp;or&nbsp;evolutionarily&nbsp;processed) incentives, it seems reasonable to think that the more solving a problem resembles bargaining, the more likely we are to be&nbsp;epistemically irrational when thinking and talking about it.</p> <p>If we do not acknowledge and keep in mind that we are in a bargaining situation, then we are less likely to detect such failures of&nbsp;epistemic rationality, especially in ourselves.&nbsp;We're also less likely to see that there's an element of Prisoner's&nbsp;Dilemma&nbsp;in participating in such debates: your effort to convince people to adopt your preferred solution is costly (in time and in your and LW's overall sanity level) but may achieve little because someone else is making an opposite argument. 
Both of you may be better off if neither engaged in the debate.</p> wei_dai 4XS5LQA6RadkMqdgt 2012-09-12T22:02:19.263Z Kelly Criteria and Two Envelopes https://www.lesswrong.com/posts/ZBq4H9tdkm5MgC8hL/kelly-criteria-and-two-envelopes <p>(This post is motivated by <a href="/lw/dy9/solving_the_two_envelopes_problem/">recent</a> <a href="/lw/e26/who_wants_to_start_an_important_startup/77e3">discussions</a> here of the two titular topics.)</p> <p>Suppose someone hands you two envelopes and gives you some information that allows you to conclude either:</p> <ol> <li>The expected ratio of amount of money in the red envelope to the amount in the blue is &gt;1, or</li> <li>With probability close to 1 (say 0.999) the amount of money in the red envelope is greater than the amount in the blue.</li> </ol> <div>In either case, is the conclusion sufficient to imply that one should choose the red envelope over the blue? Obviously not, right? (Well, at least #2 should be obvious, and #1 was recently <a href="/lw/dy9/solving_the_two_envelopes_problem/75dh">pointed out</a> by VincentYu.) In any case I will also give some simple counter-examples here:</div> <div><ol> <li>Suppose red envelope has $5 and blue envelope has even chance of $1 and $100. E(R/B) = .5(5/1)+.5(5/100) = 2.525 but one would want to choose the blue envelope assuming utility linear in money.</li> <li>Red envelope has $100, blue envelope has $99 with probability 0.999 and $1 million with probability 0.001.&nbsp;</li> </ol></div> <p>Notice that it's not sufficient to establish both conclusions at once either (my second example above actually satisfies both).</p> <p>A common argument for the Kelly Criteria being "optimal" (see page 10 of <a href="http://www.bf.uzh.ch/publikationen/pdf/publ_1967.pdf">this review paper</a> recommended by Robin Hanson) is to mathematically establish conclusions 1 and 2, with Kelly Criteria in place of the red envelope and "any other strategy" in place of the blue envelope. 
However, it turns out that "optimal" is not supposed to be normative, as the paper later explains:</p> <blockquote> <p>In essence&nbsp;the critique is that you should maximize your utility function rather than to base&nbsp;your investment decision on some other criterion. This is certainly correct, but fails&nbsp;to appreciate that Kelly's results are not necessarily normative but rather descriptive.</p> </blockquote> <p>So the upshot here is that unless your utility function is actually log in money and not, say, linear (or even <a href="/lw/12v/fair_division_of_blackhole_negentropy_an/">superlinear</a>) in the amount of resources under your control, you may not want to adopt the Kelly Criteria even when the other commonly mentioned assumptions are&nbsp;satisfied.</p> wei_dai ZBq4H9tdkm5MgC8hL 2012-08-16T21:57:41.809Z Cynical explanations of FAI critics (including myself) https://www.lesswrong.com/posts/Cqo5uKA6r8jickEhA/cynical-explanations-of-fai-critics-including-myself <p><strong>Related Posts: </strong><a href="/lw/dy8/a_cynical_explanation_for_why_rationalists_worry/">A cynical explanation for why rationalists worry about FAI</a>,&nbsp;<a href="/lw/bl2/a_belief_propagation_graph/">A belief propagation graph</a></p> <p>Lately I've been pondering the fact that while there are many critics of SIAI and its plan to form a team to build FAI, few of us seem to agree on what SIAI or we should do instead. 
Here are some of the alternative suggestions offered so far:</p> <ul> <li>work on computer security</li> <li>work to improve laws and institutions</li> <li>work on mind uploading</li> <li>work on intelligence amplification</li> <li>work on non-autonomous AI (e.g., Oracle AI, "Tool AI", automated formal reasoning systems, etc.)</li> <li>work on academically "mainstream" AGI approaches or trust that those researchers know what they are doing</li> <li>stop worrying about the Singularity and work on more mundane goals</li> </ul> <div>Given that ideal reasoners are not supposed to disagree, it seems likely that most if not all of these alternative suggestions can also be explained by their proponents being less than rational. Looking at myself and <a href="/lw/6mi/some_thoughts_on_singularity_strategies/">my suggestion</a> to work on IA or uploading, I've noticed that I have a tendency to be initially over-optimistic about some technology and then become gradually more&nbsp;pessimistic&nbsp;as I learn more details about it, so that I end up being more optimistic about technologies that I'm less familiar with than the ones that I've studied in detail. (Another example of this is me being initially enamoured with Cypherpunk ideas and then giving up on them after inventing some key pieces of the necessary technology and seeing in more detail how it would actually have to work.)</div> <div>I'll skip giving explanations for other critics to avoid offending them, but it shouldn't be too hard for the reader to come up with their own explanations. It seems that I can't trust any of the FAI critics, including myself, nor do I think Eliezer and company are much better at reasoning or intuiting their way to a correct conclusion about how we should face the apparent threat and opportunity that is the Singularity. What <em>useful</em>&nbsp;implications can I draw from this? 
I don't know, but it seems like it can't hurt to pose the question to LessWrong.&nbsp;</div> wei_dai Cqo5uKA6r8jickEhA 2012-08-13T21:19:06.671Z Work on Security Instead of Friendliness? https://www.lesswrong.com/posts/m8FjhuELdg7iv6boW/work-on-security-instead-of-friendliness <blockquote> <p>So I submit the only useful questions we can ask are not about AGI, "goals", and other such anthropomorphic, infeasible, irrelevant, and/or hopelessly vague ideas. We can only usefully ask computer security questions. For example some researchers I know believe we can achieve <a rel="nofollow" href="http://www.hpl.hp.com/techreports/2004/HPL-2004-221.html?jumpid=reg_R1002_USEN">virus-safe computing</a>. If we can achieve security against malware as strong as we can achieve for symmetric key cryptography, then it doesn't matter how smart the software is or what goals it has: if one-way functions exist no computational entity, classical or quantum, can crack symmetric key crypto based on said functions. And if NP-hard public key crypto exists, similarly for public key crypto. These and other security issues, and in particular the security of property rights, are the only real issues here and the rest is BS.</p> </blockquote> <p>-- <a href="http://unenumerated.blogspot.com/2011/01/singularity.html#7781997206773677029">Nick Szabo</a></p> <p>Nick Szabo and I have very similar backgrounds and interests. We both majored in computer science at the University of Washington. We're both very interested in economics and security. We came up with similar ideas about digital money. So why don't I advocate working on security problems while ignoring AGI, goals and Friendliness?</p> <p>In fact, I once did think that working on security was the best way to push the future towards a positive Singularity and away from a negative one. I started working on my <a href="http://www.cryptopp.com/">Crypto++ Library</a> shortly after reading Vernor Vinge's <em>A Fire Upon the Deep</em>. 
I believe it was the first general-purpose open-source cryptography library, and it's still one of the most popular. (Studying cryptography led me to become involved in the Cypherpunks community with its emphasis on privacy and freedom from government intrusion, but a major reason for me to become interested in cryptography in the first place was a desire to help increase security against future entities similar to the Blight described in Vinge's novel.)</p> <p>I've since changed my mind, for two reasons.</p> <p><strong>1. The economics of security seems very unfavorable to the defense, in every field <em>except</em> cryptography.</strong></p> <p>Studying cryptography gave me hope that improving security could make a difference. But in every other security field, both physical and virtual, little progress is apparent, certainly not enough that humans might hope to defend their property rights against smarter intelligences. Achieving "security against malware as strong as we can achieve for symmetric key cryptography" seems quite hopeless in particular. Nick links above to a 2004 technical report titled "Polaris: Virus Safe Computing for Windows XP", which is strange considering that it's now 2012 and malware has little trouble with the latest operating systems and their defenses. Also striking to me has been the fact that even dedicated security software like OpenSSH and OpenSSL have had design and coding flaws that introduced security holes into the systems that run them.</p> <p>One way to think about Friendly AI is that it's an offensive approach to the problem of security (i.e., take over the world), instead of a defensive one.</p> <p><strong>2. Solving the problem of security at a sufficient level of generality requires understanding goals, and is essentially equivalent to solving Friendliness.</strong></p> <p>What does it mean to have "secure property rights", anyway? 
If I build an impregnable fortress around me, but an Unfriendly AI causes me to give up my goals in favor of its own by crafting a philosophical argument that is extremely convincing to me but wrong (or more generally, subverts my motivational system in some way), have I retained my "property rights"? What if it does the same to one of my robot servants, so that it subtly starts serving the UFAI's interests while thinking it's still serving mine? How does one define whether a human or an AI has been "subverted" or is "secure", without reference to its "goals"? It became apparent to me that fully solving security is not very different from solving Friendliness.</p> <p>I would be very interested to know what Nick (and others taking a similar position) thinks after reading the above, or if they've already had similar thoughts but still came to their current conclusions.</p> wei_dai m8FjhuELdg7iv6boW 2012-07-21T18:28:44.692Z Open Problems Related to Solomonoff Induction https://www.lesswrong.com/posts/fC248GwrWLT4Dkjf6/open-problems-related-to-solomonoff-induction <p>Solomonoff Induction seems clearly "on the right track", but there are a number of problems with it that I've been&nbsp;puzzling&nbsp;over for several years and have not made much progress on. I think I've talked about all of them in various comments in the past, but never collected them in one place.</p> <h4>Apparent Unformalizability of &ldquo;Actual&rdquo; Induction</h4> <h5><a href="https://groups.google.com/group/everything-list/browse_thread/thread/c7442c13ff1396ec">Argument</a> via Tarski&rsquo;s <a href="http://en.wikipedia.org/wiki/Tarski's_undefinability_theorem">Indefinability of Truth</a></h5> <blockquote> <p>Informally, the theorem states that arithmetical truth cannot be defined in arithmetic. 
The theorem applies more generally to any sufficiently strong formal system, showing that truth in the standard model of the system cannot be defined within the system.</p> </blockquote> <p>Suppose we define a generalized version of Solomonoff Induction based on some second-order logic. The truth predicate for this logic can&rsquo;t be defined within the logic, and therefore a device that can decide the truth value of arbitrary statements in this logic has no finite description within this logic. If an alien claimed to have such a device, this generalized Solomonoff Induction would assign the hypothesis that they're telling the truth zero probability, whereas we would assign it some small but positive probability.</p> <h5><a href="https://groups.google.com/group/one-logic/browse_thread/thread/b499a90ef9e5fd84">Argument</a> via Berry&rsquo;s Paradox</h5> <p>Consider an arbitrary probability distribution P, and the smallest integer (or the lexicographically least object) x such that P(x) &lt; 1/3^^^3 (in Knuth's up-arrow notation). Since x has a short description, a universal distribution shouldn't assign it such a low probability, but P does, so P can't be a universal distribution.</p> <h4>Is Solomonoff Induction &ldquo;good enough&rdquo;?</h4> <p>Given the above, is Solomonoff Induction nevertheless &ldquo;<a href="/lw/4iy/does_solomonoff_always_win/">good enough</a>&rdquo; for practical purposes? In other words, would an AI programmed to approximate Solomonoff Induction do as well as any other possible agent we might build, even though it wouldn&rsquo;t have what we&rsquo;d consider correct beliefs?</p> <h4>Is complexity objective?</h4> <p>Solomonoff Induction is supposed to be a formalization of Occam&rsquo;s Razor, and it&rsquo;s confusing that the formalization has a free parameter in the form of a universal Turing machine that is used to define the notion of complexity. 
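That machine-dependence can be seen in a toy way by measuring the "complexity" of the same two strings under two different description schemes. This is only a crude analogy (raw string length and zlib's encoding stand in for reference machines, which they are not, and the invariance theorem says any two universal machines disagree by at most an additive constant), but it shows how two description languages can disagree about which of two objects is simpler:

```python
import random
import zlib

# "Machine A": a string's description is the string itself (raw length).
# "Machine B": a string's description is its zlib encoding.
repetitive = b"ab" * 500  # 1000 bytes, but highly regular
random.seed(0)
incompressible = bytes(random.randrange(256) for _ in range(100))  # 100 bytes of noise

def complexity_A(s: bytes) -> int:
    return len(s)                 # raw length as description length

def complexity_B(s: bytes) -> int:
    return len(zlib.compress(s))  # zlib code length as description length

# Machine A says the repetitive string is MORE complex (1000 vs 100);
# machine B says it is LESS complex (its zlib code is far shorter than
# the code for 100 bytes of noise).
print(complexity_A(repetitive), complexity_A(incompressible))
print(complexity_B(repetitive), complexity_B(incompressible))
```

The two "machines" rank the same pair of strings in opposite orders, which is the sense in which complexity has a free parameter.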
What&rsquo;s the significance of the fact that we can&rsquo;t seem to define a parameterless concept of complexity? That complexity is subjective?</p> <h4>Is Solomonoff an ideal or an approximation?</h4> <p>Is it the case that the universal prior&nbsp;(or some suitable generalization of it that somehow overcomes the above "unformalizability problems")&nbsp;is the &ldquo;true&rdquo; prior and that Solomonoff Induction represents idealized reasoning, or does Solomonoff just &ldquo;work well enough&rdquo; (in some sense) at approximating any rational agent?</p> <h4>How can we apply Solomonoff when our inputs are not symbol strings?</h4> <p>Solomonoff Induction is defined over symbol strings (for example bit strings) but our perceptions are made of &ldquo;qualia&rdquo; instead of symbols. How is Solomonoff Induction supposed to work for us?</p> <h4>What does Solomonoff Induction actually say?</h4> <p>What does Solomonoff Induction actually say about, for example, whether we live in a creatorless universe that runs on physics? 
Or the Simulation Argument?</p> wei_dai fC248GwrWLT4Dkjf6 2012-06-06T00:26:10.035Z List of Problems That Motivated UDT https://www.lesswrong.com/posts/4kvaocbkDDS2AMoPG/list-of-problems-that-motivated-udt <p>I noticed that recently I wrote several comments of the form "UDT can be seen as a step towards solving X" and thought it might be a good idea to list in one place all of the problems that helped motivate <a href="/lw/15m/towards_a_new_decision_theory/">UDT1</a>&nbsp;(not including problems that came up subsequent to that post).&nbsp;</p> <ul> <li><a href="http://extropians.weidai.com/extropians.3Q97/4116.html">decision making for minds that can copy themselves</a></li> <li><a href="http://www.weidai.com/everything.html">Doomsday Argument</a></li> <li>Sleeping Beauty</li> <li><a href="http://wiki.lesswrong.com/wiki/Absent-Minded_driver">Absent-Minded Driver</a></li> <li><a href="/lw/175/torture_vs_dust_vs_the_presumptuous_philosopher/">Presumptuous Philosopher</a></li> <li><a href="http://groups.google.com/group/everything-list/browse_thread/thread/8c25168e232a7efd/">anthropic reasoning for non-sentient AIs</a></li> <li><a href="/lw/1lq/less_wrong_qa_with_eliezer_yudkowsky_video_answers/1f0b">Simulation Argument</a></li> <li><a href="/lw/102/indexical_uncertainty_and_the_axiom_of/">indexical uncertainty</a> in general</li> <li><a href="http://www.mail-archive.com/everything-list@googlegroups.com/msg03620.html">wireheading/<span style="font-family: Arial, Helvetica, sans-serif; line-height: 19px; text-align: justify;">Cartesianism</span></a>&nbsp;(<a href="/lw/cej/general_purpose_intelligence_arguing_the/6mm0">how</a> to formulate something like AIXI that cares about an external world instead of just its sensory inputs)</li> <li><a href="http://extropians.weidai.com/extropians/0302/2444.html">How</a> to make decisions if all possible worlds exist? 
(a la Tegmark or Schmidhuber, or just in the MWI)</li> <li>Quantum Immortality/Suicide</li> <li>Logical Uncertainty (<a href="http://www.sl4.org/archive/0509/12317.html">how</a> to formulate something like a Godel machine that can make reasonable decisions involving P=NP)</li> <li><a href="https://groups.google.com/group/everything-list/browse_thread/thread/c7442c13ff1396ec">uncertainty about hypercomputation</a> (how to avoid assuming we must be living in a computable universe)</li> <li><a href="/lw/1iy/what_are_probabilities_anyway/">What are probabilities?</a></li> <li><a href="/lw/aq9/decision_theories_a_less_wrong_primer/606k">What are decisions</a> and what kind of consequences should be considered when making decisions?</li> <li>Newcomb's Problem</li> <li><a href="/lw/az7/video_paul_christianos_impromptu_tutorial_on_aixi/6cv7">Smoking Lesion</a></li> <li>Prisoner's&nbsp;Dilemma</li> <li><a href="http://wiki.lesswrong.com/wiki/Counterfactual_mugging">Counterfactual Mugging</a></li> <li><a href="http://extropians.weidai.com/extropians/0302/2567.html">FAI</a></li> </ul> wei_dai 4kvaocbkDDS2AMoPG 2012-06-06T00:26:00.625Z How can we ensure that a Friendly AI team will be sane enough? https://www.lesswrong.com/posts/WCYK7B28SZ7uJxftD/how-can-we-ensure-that-a-friendly-ai-team-will-be-sane <p>One possible answer to the <a href="/lw/6mi/some_thoughts_on_singularity_strategies/">argument</a> "attempting to build FAI based on Eliezer's ideas seems infeasible and increases the risk of UFAI without helping much to increase the probability of a good outcome, and therefore we should try to achieve a positive Singularity by other means" is that it's too early to decide this. Even if our best current estimate is that trying to build such an FAI increases risk, there is still a reasonable chance that this estimate will turn out to be wrong after further investigation. 
Therefore, the counter-argument goes, we ought to mount a serious investigation into the feasibility and safety of Eliezer's design (as well as other possible FAI approaches), before deciding to either move forward or give up.</p> <p>(I've been given to understand that this is a standard belief within SI, except possibly for Eliezer, which makes me wonder why nobody gave this&nbsp;counter-argument in response to my post linked above. ETA: Carl Shulman did subsequently give me a version of this argument <a href="/r/discussion/lw/8c3/qa_with_new_executive_director_of_singularity/596l">here</a>.)</p> <p>This answer makes sense to me, except for the concern that even seriously investigating the feasibility of FAI is risky, if the team doing so isn't fully rational. For example, they may be overconfident about their abilities and thereby overestimate the feasibility and safety, or commit the sunk cost fallacy once they have developed lots of FAI-relevant theory in the attempt to study feasibility, or become too attached to their status and identity as FAI researchers, or some team members may disagree with a consensus of "give up" and leave to form their own AGI teams and take the dangerous knowledge developed with them.</p> <p>So the question comes down to, how rational is such an FAI feasibility team likely to be, and is that enough for the benefits to exceed the costs? I don't have a lot of good ideas about how to answer this, but the question seems really important to bring up. I'm hoping this post will trigger SI people to tell us their thoughts, and maybe other LWers have ideas they can share.</p> wei_dai WCYK7B28SZ7uJxftD 2012-05-16T21:24:58.681Z Neuroimaging as alternative/supplement to cryonics? 
https://www.lesswrong.com/posts/iKrmihLBHTRmuWHuZ/neuroimaging-as-alternative-supplement-to-cryonics <p>Paul Christiano recently <a href="/r/discussion/lw/c0k/formalizing_value_extrapolation/">suggested</a> that we can use neuroimaging to form a complete mathematical characterization of a human brain, which a sufficiently powerful superintelligence would be able to reconstruct into a working mind, and the neuroimaging part is already possible today, or close to being possible.</p> <blockquote> <p>In fact, this project may be possible using existing resources. The complexity of the human brain is not as unapproachable as it may at first appear: though it may contain 10<sup>14</sup> synapses, each described by many parameters, it can be specified much more compactly. A newborn&rsquo;s brain can be specified by about 10<sup>9</sup> bits of genetic information, together with a recipe for a physical simulation of development. The human brain appears to form new long-term memories at a rate of 1-2 bits per second, suggesting that it may be possible to specify an adult brain using 10<sup>9</sup> additional bits of experiential information. This suggests that it may require only about 10<sup>10</sup> bits of information to specify a human brain, which is at the limits of what can be reasonably collected by existing technology for functional neuroimaging.</p> </blockquote> <p>Paul was using this idea as part of an FAI design proposal, but I'm highlighting it here since it seems to have independent value as an alternative or supplement to cryonics. 
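The quoted estimate can be sanity-checked with back-of-the-envelope arithmetic. The 1-2 bits/s rate and the 10^9 and 10^10 bit totals come from the quote; the ~16 waking hours per day and ~30 years of accumulated experience are my own assumed fill-ins:

```python
# Assumed figures (not from the post): 16 waking hours/day, 30 years of experience.
bits_per_sec = 2                       # upper end of the quoted 1-2 bits/s rate
waking_secs_per_year = 16 * 3600 * 365
experiential_bits = bits_per_sec * waking_secs_per_year * 30

genetic_bits = 10**9                   # quoted bound on genetic information
total_bits = genetic_bits + experiential_bits

gibibytes = 10**10 / 8 / 2**30         # the quoted 10^10-bit total, in GiB

print(f"experiential: {experiential_bits:.2e} bits")  # ~1.26e9, i.e. order 10^9
print(f"10^10 bits = {gibibytes:.2f} GiB")            # ~1.16, i.e. "about 1 gigabyte"
```

Under these assumptions the experiential term indeed comes out on the order of 10^9 bits, and the 10^10-bit total fits in roughly a gigabyte of storage, modulo the caveat (in the ETA below) that 10^10 bits is only a rough lower bound on the raw data that must be collected.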
That is, instead of (or in addition to) trying to get your body to be frozen and then preserved in liquid nitrogen after you die, you periodically take&nbsp;neuroimaging scans of your brain and save them to multiple backup locations (10<sup>10</sup>&nbsp;bits is only about 1 gigabyte), in the hope that a friendly AI or posthuman will eventually use the scans to reconstruct your mind.</p> <p>Are there any neuroimaging experts around who can tell us how feasible this really is, and how much such a scan might cost, now or in the near future?</p> <p>ETA: Given the presence of thermal noise and the fact that a set of neuroimaging data may contain redundant or irrelevant information, 10<sup>10</sup>&nbsp;bits ought to be regarded as just a rough lower bound on how much data needs to be collected and stored. Thanks to commenters who pointed this out.&nbsp;</p> wei_dai iKrmihLBHTRmuWHuZ 2012-05-12T23:26:28.429Z Strong intuitions. Weak arguments. What to do? https://www.lesswrong.com/posts/GyDHM7fG47mkmFD4x/strong-intutions-weak-arguments-what-to-do <p style="margin: 0px 0px 1em; color: #000000; font-family: Arial,Helvetica,sans-serif; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 19px; orphans: 2; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; background-color: #ffffff; font-size: small;">I thought Ben Goertzel made an interesting point at the end of his <a href="/lw/c7h/muehlhausergoertzel_dialogue_part_2/">dialog</a> with Luke Muehlhauser, about how the strengths of both sides' arguments do not match up with the strengths of their intuitions:</p> <blockquote> <p style="margin: 0px 0px 1em; color: #000000; font-family: Arial,Helvetica,sans-serif; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 19px; orphans: 2; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; 
word-spacing: 0px; background-color: #ffffff; font-size: small;">One thing I'm repeatedly struck by in discussions on these matters with you and other SIAI folks, is the way the strings of reason are pulled by the puppet-master of intuition. With so many of these topics on which we disagree -- for example: the Scary Idea, the importance of optimization for intelligence, the existence of strongly convergent goals for intelligences -- you and the other core SIAI folks share a certain set of intuitions, which seem quite strongly held. Then you formulate rational arguments in favor of these intuitions -- but the conclusions that result from these rational arguments are very weak. For instance, the Scary Idea intuition corresponds to a rational argument that "superhuman AGI might plausibly kill everyone." The intuition about strongly convergent goals for intelligences, corresponds to a rational argument about goals that are convergent for a "wide range" of intelligences. Etc.</p> <p style="margin: 0px 0px 1em; color: #000000; font-family: Arial,Helvetica,sans-serif; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 19px; orphans: 2; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; background-color: #ffffff; font-size: small;">On my side, I have a strong intuition that OpenCog can be made into a human-level general intelligence, and that if this intelligence is raised properly it will turn out benevolent and help us launch a positive Singularity. However, I can't fully rationally substantiate this intuition either -- all I can really fully rationally argue for is something weaker like "It seems plausible that a fully implemented OpenCog system might display human-level or greater intelligence on feasible computational resources, and might turn out benevolent if raised properly." 
In my case just like yours, reason is far weaker than intuition.</p> </blockquote> <p style="margin: 0px 0px 1em; color: #000000; font-family: Arial,Helvetica,sans-serif; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 19px; orphans: 2; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; background-color: #ffffff; font-size: small;">What do we do about this <a href="http://wiki.lesswrong.com/wiki/Disagreement">disagreement</a> and other similar situations, both as bystanders (who may not have strong intuitions of their own) and as participants (who do)?</p> <p style="margin: 0px 0px 1em; color: #000000; font-family: Arial,Helvetica,sans-serif; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 19px; orphans: 2; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; background-color: #ffffff; font-size: small;">I guess what bystanders typically do (although not necessarily consciously) is evaluate how reliable each party's intuitions are likely to be, and then use that to form a probabilistic mixture of the two sides' positions. The information that goes into such evaluations could include things like what cognitive processes likely came up with the intuitions, how many people hold each intuition, and how accurate each individual's past intuitions were.</p> <p style="margin: 0px 0px 1em; color: #000000; font-family: Arial,Helvetica,sans-serif; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 19px; orphans: 2; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; background-color: #ffffff; font-size: small;">If this is the best we can do (at least in some situations), participants could help by providing more information that might be relevant to the 
reliability evaluations, and bystanders should pay more conscious attention to such information instead of focusing purely on each side's arguments. The participants could also pretend that they are just bystanders, for the purpose of making important decisions, and base their beliefs on "reliability-adjusted" intuitions instead of their raw intuitions.</p> <p style="margin: 0px 0px 1em; color: #000000; font-family: Arial,Helvetica,sans-serif; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 19px; orphans: 2; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; background-color: #ffffff; font-size: small;">Questions: Is this a good idea? Any other ideas about what to do when strong intuitions meet weak arguments?</p> <p style="margin: 0px 0px 1em; color: #000000; font-family: Arial,Helvetica,sans-serif; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 19px; orphans: 2; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; background-color: #ffffff; font-size: small;">Related Post: Kaj Sotala's <a href="/lw/19v/intuitive_differences_when_to_agree_to_disagree/">Intuitive differences: when to agree to disagree</a>, which is about a similar problem, but mainly from the participant's perspective instead of the bystander's.</p> wei_dai GyDHM7fG47mkmFD4x 2012-05-10T19:27:00.833Z How can we get more and better LW contrarians? https://www.lesswrong.com/posts/HBupaMix2NZLngQuJ/how-can-we-get-more-and-better-lw-contrarians <p>I'm worried that LW doesn't have enough good contrarians and skeptics, people who disagree with us or like to find fault in every idea they see, but do so in a way that is often right and can change our minds when they are. 
I fear that when contrarians/skeptics join us but aren't "good enough", we tend to drive them away instead of improving them.</p> <p>For example, I know a couple of people who occasionally had interesting ideas that were contrary to the local LW consensus, but were (or appeared to be) too confident in their ideas, both good and bad. Both people ended up being repeatedly downvoted and left our community a few months after they arrived. This must have happened more often than I have noticed (partly evidenced by the large number of comments/posts now marked as written by&nbsp;<strong>[deleted]</strong>, sometimes with whole threads written&nbsp;entirely&nbsp;by deleted accounts). I feel that this is a waste that we should try to prevent (or at least think about how we might). So here are some ideas:</p> <ul> <li>Try to "fix" them by telling them that they are overconfident and give them hints about how to get LW to take their ideas seriously.&nbsp;Unfortunately, from their perspective such advice must appear to come from someone who is themselves overconfident and wrong, so they're not likely to be very inclined to accept the advice.</li> <li>Create a&nbsp;separate section with different social norms, where people are not expected to maintain the "proper" level of confidence and niceness (on pain of being downvoted), and direct overconfident newcomers to it. Perhaps through no-holds-barred debate we can convince them that we're not as crazy and wrong as they thought, and<em>&nbsp;then</em>&nbsp;give them the above-mentioned advice and move them to the main sections.</li> <li>Give newcomers some sort of honeymoon period (marked by color-coding of their usernames or something like that), where we ignore their overconfidence and associated social transgressions (or just be extra nice and tolerant towards them), and take their ideas on their own merits. 
Maybe if they see us take their ideas seriously, that will cause them to reciprocate and take us more seriously when we point out that they may be wrong or overconfident.</li> </ul> <div>I guess these ideas sounded better in my head than written down, but maybe they'll inspire other people to think of better ones. And it might help a bit just to keep this issue in the back of one's mind and occasionally think strategically about how to improve the person you're arguing against, instead of only trying to win the particular argument at hand or downvoting them into leaving.</div> <div>P.S., after writing most of the above, I saw &nbsp;<a href="http://wallowinmaya.wordpress.com/2012/04/17/youre-calling-who-a-cult-leader/">this post</a>:</div> <blockquote> <div>OTOH, I don&rsquo;t think group think is a big problem. Criticism by folks like Will Newsome, Vladimir Slepnev and especially Wei Dai is often upvoted. (I upvote almost every comment of Dai or Newsome if I don&rsquo;t forget it. Dai makes always very good points and Newsome is often wrong but also hilariously funny or just brilliant and right.) Of course, folks like this Dymytry guy are often downvoted, but IMO with good reason.</div> </blockquote> <div>To be clear, I don't think "group think" is the problem. In other words, it's not that we're refusing to accept valid criticisms, but more like our group dynamics (and other factors) cause there to be fewer good contrarians in our community than is optimal. Of course&nbsp;what is optimal&nbsp;might be open to debate, but from my perspective, it can't be right that my own criticisms are valued so highly (especially since I've been moving closer to the SingInst "inner circle" and my critical tendencies have been decreasing). 
In the spirit of making oneself redundant, I'd feel much better if my occasional voice of dissent is just considered one amongst many.</div> wei_dai HBupaMix2NZLngQuJ 2012-04-18T22:01:12.772Z Reframing the Problem of AI Progress https://www.lesswrong.com/posts/nGP4soWSbFmzemM4i/reframing-the-problem-of-ai-progress <blockquote> <p>"Fascinating! You should definitely look into this. Fortunately, my own research has no chance of producing a super intelligent AGI, so I'll continue. Good luck son! The government should give you more money."</p> </blockquote> <p>Stuart Armstrong <a href="/lw/bfj/evidence_for_the_orthogonality_thesis/68hc">paraphrasing</a> a typical AI researcher</p> <p>I forgot to mention in my <a href="/r/discussion/lw/bnc/against_ai_risk/">last post</a>&nbsp;why "AI risk" might be a bad phrase even to denote the problem of UFAI.&nbsp;It brings to mind analogies like physics catastrophes or astronomical disasters, and lets AI researchers think that their work is ok as long as they have little chance of immediately destroying Earth. 
But the real problem we face is how to build or become a superintelligence that shares our values, and given that this seems very difficult, any progress that doesn't contribute to the solution but brings forward the date by which we <em>must </em>solve it (or be stuck with something very suboptimal even if it doesn't kill us), is bad.&nbsp;The word "risk" connotes a small chance of something bad suddenly happening, but slow steady progress towards&nbsp;losing&nbsp;the future is just as worrisome.</p> <p>The <a href="http://facingthesingularity.com/2011/not-built-to-think-about-ai/">usual way</a> of stating the problem also invites lots of debate that is largely beside the point (as far as determining how serious the problem is), like <a href="/lw/8j7/criticisms_of_intelligence_explosion/">whether intelligence explosion is possible</a>, or <a href="/lw/bfj/evidence_for_the_orthogonality_thesis/">whether a superintelligence can have arbitrary goals</a>, or <a href="/lw/bl2/a_belief_propagation_graph/">how sure we are that a non-Friendly superintelligence will destroy human civilization</a>. 
If someone wants to question the importance of facing this problem, they really&nbsp;instead&nbsp;need to argue that a superintelligence isn't possible&nbsp;(not even a <a href="/lw/b10/modest_superintelligences/">modest one</a>), or that the future will turn out to be close to the best possible just by everyone pushing forward their own research without any concern for the big picture, or perhaps that we really&nbsp;don't&nbsp;care very much about the far future and distant strangers and should pursue AI progress just for the immediate benefits.</p> <p>(This is an expanded version of a <a href="/lw/bnc/against_ai_risk/6b4p">previous comment</a>.)</p> wei_dai nGP4soWSbFmzemM4i 2012-04-12T19:31:04.829Z against "AI risk" https://www.lesswrong.com/posts/ctWGEQznumzyTRGFs/against-ai-risk <p>Why does SI/LW focus so much on <a href="http://wiki.lesswrong.com/wiki/FOOM">AI-FOOM</a> disaster, with apparently much less concern for things like</p> <ul> <li>bio/nano-tech disaster</li> <li>Malthusian upload scenario</li> <li>highly destructive war</li> <li>bad memes/philosophies spreading among humans or posthumans and overriding our values</li> <li>upload singleton ossifying into a suboptimal form compared to the kind of superintelligence that our universe could support</li> </ul> <p><a href="/lw/ajm/ai_risk_and_opportunity_a_strategic_analysis/5yvd">Why</a>, for example, is lukeprog's strategy sequence titled "AI Risk and Opportunity", instead of "The Singularity, Risks and Opportunities"? Doesn't it seem strange to assume that both the risks and opportunities must be AI related, before the analysis even begins? Given our current state of knowledge, I don't see how we can make such conclusions with any confidence even <em>after</em> a thorough analysis.</p> <p>SI/LW sometimes gives the <a href="/lw/atm/cult_impressions_of_less_wrongsi/">impression</a> of being a doomsday cult, and it would help if we didn't concentrate so much on a particular doomsday scenario. 
(Are there any doomsday cults that say "doom is probably coming, we're not sure how but here are some likely possibilities"?)</p> wei_dai ctWGEQznumzyTRGFs 2012-04-11T22:46:10.533Z Modest Superintelligences https://www.lesswrong.com/posts/KuBMKQnAsYBGP4rkZ/modest-superintelligences <p>I'm skeptical about trying to build FAI, but not about trying to influence the Singularity in a positive direction. Some people may be skeptical even of the latter because they don't think the possibility of an intelligence explosion is a very likely one. I suggest that even if intelligence explosion turns out to be impossible, we can still reach a positive Singularity by building what I'll call "modest superintelligences", that is, superintelligent entities, capable of taking over the universe and preventing existential risks and Malthusian outcomes, whose construction does not require fast recursive self-improvement or other questionable assumptions about the nature of intelligence. This helps to establish a lower bound on the benefits of an organization that aims to strategically influence the outcome of the Singularity.</p> <ul> <li>MSI-1: 10<sup>5</sup> biologically cloned humans of von Neumann-level intelligence, highly educated and indoctrinated from birth to work collaboratively towards some goal, such as building MSI-2 (or equivalent)</li> <li>MSI-2: 10<sup>10</sup> whole brain emulations of von Neumann, each running at ten times human speed, with WBE-enabled institutional controls that increase group coherence/rationality (or equivalent)</li> <li>MSI-3: 10<sup>20</sup> copies of von Neumann WBE, each running at a thousand times human speed, with more advanced (to be invented) institutional controls and collaboration tools (or equivalent)</li> </ul> <p>(To recall what the actual von Neumann, whom we might call MSI-0, accomplished, open his <a href="http://en.wikipedia.org/wiki/John_von_Neumann">Wikipedia page</a> and scroll through the "known for" sidebar.)</p> <p>Building an 
MSI-1 seems to require a total cost on the order of $100 billion (assuming $1 million for each clone), which is comparable to the Apollo project, and about 0.25% of the annual Gross World Product. (For further comparison, note that Apple has a market capitalization of $561 billion, and annual profit of $25 billion.) In exchange for that cost, any nation that undertakes the project has a reasonable chance of obtaining an insurmountable lead in whatever technologies end up driving the Singularity, and with that a large measure of control over its outcome. If no better strategic options come along, lobbying a government to build MSI-1 and/or influencing its design and aims seems to be the least that a Singularitarian organization could do.</p> wei_dai KuBMKQnAsYBGP4rkZ 2012-03-22T00:29:03.184Z A Problem About Bargaining and Logical Uncertainty https://www.lesswrong.com/posts/oZwxY88NCCHffJuxM/a-problem-about-bargaining-and-logical-uncertainty <p>Suppose you wake up as a paperclip maximizer. Omega says "I calculated the millionth digit of pi, and it's odd. If it had been even, I would have made the universe capable of producing either 10<sup>20</sup> paperclips or 10<sup>10</sup> staples, and given control of it to a staples maximizer. But since it was odd, I made the universe capable of producing 10<sup>10</sup> paperclips or 10<sup>20</sup> staples, and gave you control." You double-check Omega's pi computation and your internal calculator gives the same answer.</p> <p>Then a staples maximizer comes to you and says, "You should give me control of the universe, because before you knew the millionth digit of pi, you would have wanted to pre-commit to a deal where each of us would give the other control of the universe, since that gives you 1/2 probability of 10<sup>20</sup> paperclips instead of 1/2 probability of 10<sup>10</sup> paperclips."</p> <p>Is the staples maximizer right? 
If so, the general principle seems to be that we should act as if we had precommitted to a deal we would have made in ignorance of logical facts we actually possess. But how far are we supposed to push this? What deal would you have made if you didn't know that the first digit of pi was odd, or if you didn't know that 1+1=2?</p> <p>On the other hand, suppose the staples maximizer is wrong. Does that mean you also shouldn't agree to exchange control of the universe before you knew the millionth digit of pi?</p> <p>To make this more relevant to real life, consider two humans negotiating over the goal system of an AI they're jointly building. They have a lot of ignorance about the relevant logical facts, like how smart/powerful the AI will turn out to be and how efficient it will be in implementing each of their goals. They could negotiate a solution now in the form of a weighted average of their&nbsp;utility&nbsp;functions, but the weights they choose now will likely turn out to be "wrong" in full view of the relevant logical facts (e.g., the actual shape of the utility-possibility frontier). Or they could program their utility functions into the AI separately, and let the AI determine the weights later using some formal <a href="/lw/2x8/lets_split_the_cake_lengthwise_upwise_and/">bargaining solution</a> when it has more knowledge about the relevant logical facts. Which is the right thing to do? Or should they follow the staples maximizer's reasoning and bargain under the pretense that they know even less than they actually do?</p> <p><strong>Other Related Posts:</strong> <a href="/lw/179/counterfactual_mugging_and_logical_uncertainty/">Counterfactual Mugging and Logical Uncertainty</a>, <a href="/lw/2xb/if_you_dont_know_the_name_of_the_game_just_tell/">If you don't know the name of the game, just tell me what I mean to you</a></p> wei_dai oZwxY88NCCHffJuxM 2012-03-21T21:03:17.051Z
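The staples maximizer's ex-ante argument in the bargaining post above reduces to a one-line expected-value comparison. A minimal sketch, using the numbers from the post and treating the two parities of the millionth digit as having the 1/2 credence each that the agents held before computing it:

```python
# Ex-ante expected paperclips for the paperclip maximizer,
# before the millionth digit of pi is computed.
p_odd = 0.5                # credence that the digit is odd (you get control)
small, large = 1e10, 1e20  # paperclip capacities of the two possible universes

# No deal: paperclips are produced only in the odd-digit branch,
# where you control the small-capacity universe.
ev_no_deal = p_odd * small

# Deal (swap control): you instead receive the 1e20-paperclip
# universe in the even-digit branch.
ev_deal = (1 - p_odd) * large

print(f"no deal: {ev_no_deal:.1e} expected paperclips")  # 5.0e9
print(f"deal:    {ev_deal:.1e} expected paperclips")     # 5.0e19
```

Ex ante, the deal dominates by ten orders of magnitude; the post's puzzle is whether that pre-computation perspective still has any normative force once the digit is known.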