capybaralet feed - LessWrong 2.0 Reader: capybaralet’s posts and comments on LessWrong Comment by capybaralet on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#GwfgJhEn5yaGS3ENr <p>Yes, that&#x27;s basically what I mean. I think I&#x27;m trying to refer to the same issue that Paul mentioned here: <a href="https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#ZWtTvMdL8zS9kLpfu">https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#ZWtTvMdL8zS9kLpfu</a></p> capybaralet GwfgJhEn5yaGS3ENr 2019-04-01T03:42:25.538Z Comment by capybaralet on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#7KP6kfs3fRokcmZ62 <p>I like that you emphasize and discuss the need for the AI to not believe that it can influence the outside world, and cleanly distinguish this from it actually being able to influence the outside world. I wonder if you can get any of the benefits here without needing the box to actually work (i.e. can you just get the agent to believe it does? And is that enough for some form/degree of benignity?)</p> capybaralet 7KP6kfs3fRokcmZ62 2019-04-01T03:18:13.818Z Comment by capybaralet on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#SKtPP652mQYrtC7P5 <p>This doesn&#x27;t seem to address what I view as the heart of Joe&#x27;s comment. Quoting from the paper:</p><p>&quot;Now we note that µ* is the fastest world-model for on-policy prediction, and it does not simulate post-episode events until it has read access to the random action&quot;.</p><p>It seems like simulating *post-episode* events in particular would be useful for predicting the human&#x27;s responses, because they will be simulating post-episode events when they choose their actions. Intuitively, it seems like we *need* to simulate post-episode events to have any hope of guessing how the human will act. I guess the obvious response is that we can instead simulate the internal workings of the human in detail, and thus uncover their simulation of post-episode events (as a past event). That seems correct, but also a bit troubling (again, probably just for &quot;revealed preferences&quot; reasons, though).</p><p>Moreover, I think in practice we&#x27;ll want to use models that make good, but not perfect, predictions. That means that we trade off accuracy with description length, and I think this makes modeling the outside world (instead of the human&#x27;s model of it) potentially more appealing, at least in some cases. </p> capybaralet SKtPP652mQYrtC7P5 2019-04-01T02:11:38.623Z Comment by capybaralet on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#RYeewiFpjQkwvswMF <p>I&#x27;m calling this the &quot;no grue assumption&quot; (<a href="https://en.wikipedia.org/wiki/New_riddle_of_induction">https://en.wikipedia.org/wiki/New_riddle_of_induction</a>).</p><p>My concern here is that this assumption might be false, even in the strong sense of &quot;There is no such U&quot;. </p><p>Have you proven the existence of such a U? Do you agree it might not exist?
It strikes me as potentially running up against issues of NFL / self-reference.</p> capybaralet RYeewiFpjQkwvswMF 2019-04-01T01:55:31.155Z Comment by capybaralet on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#jssTcepCmcP6PNPma <p>Also, it&#x27;s worth noting that this assumption (or rather, Lemma 3) also seems to preclude BoMAI optimizing anything *other* than revealed preferences (which others have noted seems problematic, although I think it&#x27;s definitely out of scope).</p> capybaralet jssTcepCmcP6PNPma 2019-04-01T01:51:13.044Z Comment by capybaralet on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#2PsQaibrCEMr4DzLq <p>Still wrapping my head around the paper, but...</p><p>1) It seems too weak: In the motivating scenario of Figure 3, isn&#x27;t is the case that &quot;what the operator inputs&quot; and &quot;what&#x27;s in the memory register after 1 year&quot; are &quot;historically distributed identically&quot;?</p><p>2) It seems too strong: aren&#x27;t real-world features and/or world-models &quot;dense&quot;? Shouldn&#x27;t I be able to find features arbitrarily close to F*? If I can, doesn&#x27;t that break the assumption?</p><p>3) Also, I don&#x27;t understand what you mean by: &quot;it&#x27;s on policy behavior [is described as] simulating X&quot;. It seems like you (rather/also) want to say something like &quot;associating reward with X&quot;? </p> capybaralet 2PsQaibrCEMr4DzLq 2019-04-01T01:46:13.988Z Comment by capybaralet on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#EpiKMMDwGB5wnA6qu <p>Just exposition-wise, I&#x27;d front-load pi^H and pi^* when you define pi^B, and also clarify then that pi^B considers human-exploration as part of it&#x27;s policy.</p> capybaralet EpiKMMDwGB5wnA6qu 2019-04-01T00:53:41.153Z Comment by capybaralet on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#AA7DsdoBebPwrwWha <p>&quot; This result is independently interesting as one solution to the problem of safe exploration with limited oversight in nonergodic environments, which [Amodei et al., 2016] discus &quot;</p><p>^ This wasn&#x27;t super clear to me.... maybe it should just be moved somewhere else in the text?</p><p>I&#x27;m not sure what you&#x27;re saying is interesting here. I guess it&#x27;s the same thing I found interesting, which is that you can get sufficient (and safe-as-a-human) exploration using the human-does-the-exploration scheme you propose. Is that what you mean to refer to? </p><p></p> capybaralet AA7DsdoBebPwrwWha 2019-04-01T00:52:30.094Z Comment by capybaralet on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#mCZ8dLjT8HisxHE56 <p>Maybe &quot;promotional of&quot; would be a good phrase for this.</p><p></p> capybaralet mCZ8dLjT8HisxHE56 2019-04-01T00:19:30.200Z Comment by capybaralet on Asymptotically Benign AGI https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#YQAty2HkgLRqzY3cT <p>ETA: NVM, what you said is more descriptive (I just looked in the appendix).</p><p>RE footnote 2: maybe you want to say &quot;monotonically increasing as a function of&quot; rather than &quot;proportional to&quot;. 
(It&#x27;s a shame there doesn&#x27;t seem to be a shorter way of saying the first one, which seems to be more often what people actually want to say...)</p><p></p><p></p> capybaralet YQAty2HkgLRqzY3cT 2019-04-01T00:16:01.463Z Comment by capybaralet on X-risks are a tragedies of the commons https://www.lesswrong.com/posts/P4sn9MNrFv6RR3moc/x-risks-are-a-tragedies-of-the-commons#ikQActXTFRE62t6dq <p>I&#x27;m not sure. I was trying to disagree with your top level comment :P</p> capybaralet ikQActXTFRE62t6dq 2019-02-14T04:56:34.573Z Comment by capybaralet on How much can value learning be disentangled? https://www.lesswrong.com/posts/Q7WiHdSSShkNsgDpa/how-much-can-value-learning-be-disentangled#9reWkDXMBoTmEGoS9 <p>FWICT, both of your points are actually responses to be point (3).</p><p>RE &quot;re: #2&quot;, see: https://en.wikipedia.org/wiki/Value_of_information#Characteristics</p><p>RE &quot;re: #3&quot;, my point was that it doesn&#x27;t seem like VoI is the correct way for one agent to think about informing ANOTHER agent. You could just look at the change in expected utility for the receiver after updating on some information, but I don&#x27;t like that way of defining it.</p> capybaralet 9reWkDXMBoTmEGoS9 2019-02-11T22:53:59.141Z Comment by capybaralet on X-risks are a tragedies of the commons https://www.lesswrong.com/posts/P4sn9MNrFv6RR3moc/x-risks-are-a-tragedies-of-the-commons#58TQguB2p2k4xMyyT <p>I think it is rivalrous.</p><p>Xrisk mitigation isn&#x27;t the resource; risky behavior is the resource. If you engage in more risky behavior, then I can&#x27;t engage in as much risky behavior without pushing us over into a socially unacceptable level of total risky behavior.</p><p></p> capybaralet 58TQguB2p2k4xMyyT 2019-02-11T22:43:34.891Z Comment by capybaralet on X-risks are a tragedies of the commons https://www.lesswrong.com/posts/P4sn9MNrFv6RR3moc/x-risks-are-a-tragedies-of-the-commons#kQ6o5D8a6J6myWz5L <p>If there is a cost to reducing Xrisk (which I think is a reasonable assumption), then there will be an incentive to defect, i.e. to underinvest in reducing Xrisk. There&#x27;s still *some* incentive to prevent Xrisk, but to some people everyone dying is not much worse than just them dying.</p> capybaralet kQ6o5D8a6J6myWz5L 2019-02-11T22:38:38.288Z Comment by capybaralet on The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk https://www.lesswrong.com/posts/SwPQmJD2Davq3hQgn/the-role-of-epistemic-vs-aleatory-uncertainty-in-quantifying#Ba7BpytZqHPzoppMK <p>1) Yep, independence.</p><p>2) Seems right as well.</p><p>3) I think it&#x27;s important to consider &quot;risk per second&quot;, because </p><p> (i) I think many AI systems could eventually become dangerous, just not on reasonable time-scales.</p><p> (ii) I think we might want to run AI systems which have the potential to become dangerous for limited periods of time.</p><p> (iii) If most of the risk is far in the future, we can hope to become more prepared in the meanwhile</p><p></p> capybaralet Ba7BpytZqHPzoppMK 2019-02-11T22:24:45.850Z Comment by capybaralet on Predictors as Agents https://www.lesswrong.com/posts/G2EupNcdtigdyNhL2/predictors-as-agents#q6Jb6TXKiaN3dQu4z <p>Whether or not this happens depends on the learning algorithm. Let&#x27;s assume an IID setting. Then an algorithm that evaluates many random parameter settings and choses the one that gives the best performance would have this effect. 
But a gradient-based learning algorithm wouldn&#x27;t necessarily, since it only aims to improve its predictions locally (so what you say in the ETA is more accurate, **in this case**, I think).</p><p>Also, I just wanted to mention that Stuart Armstrong&#x27;s paper &quot;Good and safe uses of AI oracles&quot; discusses self-fulfilling prophecies as well; Stuart provides a way of training a predictor that won&#x27;t be victim to such effects (just don&#x27;t reveal its predictions when training). But then it also fails to account for the effect its predictions actually have, which can be a source of irreducible error... The example is a (future) stock-price predictor: making its predictions public makes them self-refuting to some extent, as they influence market actors decisions.</p> capybaralet q6Jb6TXKiaN3dQu4z 2019-02-10T22:39:31.662Z Comment by capybaralet on X-risks are a tragedies of the commons https://www.lesswrong.com/posts/P4sn9MNrFv6RR3moc/x-risks-are-a-tragedies-of-the-commons#dovEqPoAWJCu5KT29 <p>I dunno... I think describing them as a tragedy of the commons can help people understand why the problems are challenging and deserving of attention.</p> capybaralet dovEqPoAWJCu5KT29 2019-02-07T19:42:31.506Z X-risks are a tragedies of the commons https://www.lesswrong.com/posts/P4sn9MNrFv6RR3moc/x-risks-are-a-tragedies-of-the-commons <ul><li>Safety from Xrisk is a common good: We all benefit by making it less likely that we will all die. </li><li>In general, people are somewhat selfish, and value their own personal safety over that of another (uniform) randomly chosen person.</li><li>Thus individuals are not automatically properly incentivized to safeguard the common good of safety from Xrisk.</li></ul><p>I hope you all knew that already ;)</p><p></p> capybaralet P4sn9MNrFv6RR3moc 2019-02-07T02:48:25.825Z Comment by capybaralet on Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" https://www.lesswrong.com/posts/AE9yM7ZaPiZ662BF8/thoughts-on-ben-garfinkel-s-how-sure-are-we-about-this-ai#ZC2GLEgupnjkfjvzd <p>RE Sarah: Longer timelines don&#x27;t change the picture that much, in my mind. I don&#x27;t find this article to be addressing the core concerns. Can you recommend one that&#x27;s more focused on &quot;why AI-Xrisk isn&#x27;t the most important thing in the world&quot;?</p><p>RE Robin Hanson: I don&#x27;t really know much of what he thinks, but IIRC his &quot;urgency of AI depends on FOOM&quot; was not compelling.</p><p>What I&#x27;ve noticed is that critics are often working from very different starting points, e.g. being unwilling to estimate probabilities of future events, using common-sense rather than consequentialist ethics, etc. </p> capybaralet ZC2GLEgupnjkfjvzd 2019-02-07T02:10:29.396Z My use of the phrase "Super-Human Feedback" https://www.lesswrong.com/posts/naccwaCQEEBXK7hiJ/my-use-of-the-phrase-super-human-feedback <p>I&#x27;ve taken to calling Debate, Amplification, and Recursive Reward Modeling <strong>&quot;Super-human feedback&quot; (SHF)</strong> techniques. The point of this post is just to introduce that terminology and explain a bit why I like it and what I mean by it. </p><p>By calling something SHF I mean that it aims to outperform a single, unaided human H at the task of providing feedback about H&#x27;s intentions for training an AI system. 
I like thinking of it this way, because I think it makes it clear that these three approaches are naturally grouped together like this, and might inspire us to consider what else could fall into that category (a simple example is just using a team of humans).</p><p>I think this is very similar to &quot;scalable oversight&quot; (as discussed in Concrete Problems), but maybe different because:</p><p>1) It doesn&#x27;t imply that the approach must be scalable</p><p>2) It doesn&#x27;t require that feedback is expensive, i.e. it applies to things where human feedback is cheap, but we can do better than the cheap human feedback with SHF.</p><p></p><p></p><p></p> capybaralet naccwaCQEEBXK7hiJ 2019-02-06T19:11:11.734Z Thoughts on Ben Garfinkel's "How sure are we about this AI stuff?" https://www.lesswrong.com/posts/AE9yM7ZaPiZ662BF8/thoughts-on-ben-garfinkel-s-how-sure-are-we-about-this-ai <p>I liked <a href="https://www.youtube.com/watch?v=E8PGcoLDjVk&fbclid=IwAR12bx55ySVUShFMbMSuEMFi1MGe0JeoEGe_Jh1YUaWtrI_kP49pJQTyT40">this talk by Ben</a>.</p><p>I think it raises some very important points. OTTMH, I think the most important one is: <strong>We have no good critics. </strong>There is nobody I&#x27;m aware of who is seriously invested in knocking down AI-Xrisk arguments and qualified to do so. For many critics in machine learning (like Andrew Ng and Yann Lecun), the arguments seem obviously wrong or misguided, and so they do not think it&#x27;s worth their time to engage beyond stating that. </p><p>A related point which is also important is: <strong>We need to clarify and strengthen the case for AI-Xrisk. </strong>Personally, I think I have a very good internal map of the path arguments about AI-Xrisk can take, and the type of objections one encounters. It would be good to have this as some form of flow-chart. <strong>Let me know if you&#x27;re interested in helping make one.</strong></p><p>Regarding machine learning, I think he made some very good points about how the the way ML works doesn&#x27;t fit with the paperclip story. I think <strong>it&#x27;s worth exploring the disanalogies more and seeing how that affects various Xrisk arguments.</strong></p><p>As I reflect on what&#x27;s missing from the conversation, I always feel the need to make sure it hasn&#x27;t been covered in <em>Superintelligence</em>. When I read it several years ago, I found <em>Superintelligence </em>to be remarkably thorough. For example, I&#x27;d like to point out that FOOM isn&#x27;t necessary for a unilateral AI-takeover, since an AI could be progressing gradually in a box, and then break out of the box already superintelligent; I don&#x27;t remember if Bostrom discussed that.</p><p>The point about <strong>justification drift</strong> is quite apt. For instance, I think the case for MIRI&#x27;s veiwpoint increasingly relies on:</p><p>1) optimization daemons (aka &quot;inner optimizers&quot;)</p><p>2) adversarial examples (i.e. current ML systems seem to learn superficially similar but deeply flawed versions of our concepts)</p><p>TBC, I think these are quite good arguments, and I personally feel like I&#x27;ve come to appreciate them much more as well over the last several years. But I consider them far from conclusive, due to our current lack of knowledge/understanding. </p><p><strong>One thing I didn&#x27;t quite agree with in the talk: </strong>I think he makes a fairly general case against trying to impact the far future. 
I think the magnitude of impact and uncertainty we have about the direction of impact mostly cancel each other out, so even if we are highly uncertain about what effects our actions will have, it&#x27;s often still worth making guesses and using them to inform our decisions. He basically acknowledges this.</p><p></p> capybaralet AE9yM7ZaPiZ662BF8 2019-02-06T19:09:20.809Z Comment by capybaralet on How much can value learning be disentangled? https://www.lesswrong.com/posts/Q7WiHdSSShkNsgDpa/how-much-can-value-learning-be-disentangled#8gqtnwBNNzhTxC47C <p>IMO, VoI is also not a sufficient criteria for defining manipulation... I&#x27;ll list a few problems I have with it, OTTMH:</p><p>1) It seems to reduce it to &quot;providing misinformation, or providing information to another agent that is not maximally/sufficiently useful for them (in terms of their expected utility)&quot;. An example (due to Mati Roy) of why this doesn&#x27;t seem to match our intuition is: what if I tell someone something true and informative that serves (only) to make them sadder? That doesn&#x27;t really seem like manipulation (although you could make a case for it). </p><p>2) I don&#x27;t like the &quot;maximally/sufficiently&quot; part; maybe my intuition is misleading, but manipulation seems like a <em>qualitative</em> thing to me. Maybe we should just constrain VoI to be positive? </p><p>3) Actually, it seems weird to talk about VoI here; VoI is prospective and subjective... it treats an agent&#x27;s beliefs as real and asks how much value they should expect to get from samples or perfect knowledge, assuming these samples or the ground truth would be distributed according to their beliefs; this makes VoI strictly non-negative. But when we&#x27;re considering whether to inform an agent of something, we might recognize that certain information we&#x27;d provide would actually be net negative (see my top level comment for an example). Not sure what to make of that ATM...</p><p></p> capybaralet 8gqtnwBNNzhTxC47C 2019-01-31T19:58:48.431Z Comment by capybaralet on The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk https://www.lesswrong.com/posts/SwPQmJD2Davq3hQgn/the-role-of-epistemic-vs-aleatory-uncertainty-in-quantifying#peK8JSWSdc4qgPeW2 <p>Agree, good point. I&#x27;d say it&#x27;s aleatoric risk is necessary to produce compounding, but not sufficient, but maybe I&#x27;m still looking at this the wrong way. </p><p></p><p></p> capybaralet peK8JSWSdc4qgPeW2 2019-01-31T19:46:50.359Z Comment by capybaralet on How much can value learning be disentangled? https://www.lesswrong.com/posts/Q7WiHdSSShkNsgDpa/how-much-can-value-learning-be-disentangled#7YeCFrhGc3vFJchcw <p>Haha no not at all ;)</p><p>I&#x27;m not actually trying to recruit people to work on that, just trying to make people aware of the idea of doing such projects. I&#x27;d suggest it to pretty much anyone who wants to work on AI-Xrisk without diving deep into math or ML. </p> capybaralet 7YeCFrhGc3vFJchcw 2019-01-31T19:41:05.045Z The role of epistemic vs. aleatory uncertainty in quantifying AI-Xrisk https://www.lesswrong.com/posts/SwPQmJD2Davq3hQgn/the-role-of-epistemic-vs-aleatory-uncertainty-in-quantifying <h1>(Optional) Background: what are epistemic/aleatory uncertainty?</h1><p>Epistemic uncertainty is uncertainty about <strong>which model of a phenomenon is correct.</strong> It can be reduced by learning more about how things work. 
An example is distinguishing between a fair coin and a coin that lands heads 75% of the time; these correspond to two different models of reality, and you may have uncertainty over which of these models is correct. </p><p>Aleatory uncertainty can be thought of as &quot;intrinsic&quot; randomness, and as such is <strong>irreducible. </strong>An example is the randomness in the outcome of a fair coin flip.</p><p>In the context of machine learning, aleatoric randomness can be thought of as irreducible <strong>under the modelling assumptions you&#x27;ve made.</strong> It may be that there is no such thing as intrinsic randomness, and everything is deterministic, if you have the right model and enough information about the state of the world. But if you&#x27;re restricting yourself to a simple class of models, there will still be many things that appear random (i.e. unpredictable) to your model.</p><p>Here&#x27;s the paper that introduced me to these terms: https://arxiv.org/abs/1703.04977 </p><h1>Relevance for modelling AI-Xrisk </h1><p>I&#x27;ve previously claimed something like &quot;If running a single copy of a given AI system (let&#x27;s call it SketchyBot) for 1 month has a 5% chance of destroying the world, then running it for 5 years has a 1 - .95**60 ~= ~95% chance of destroying the world&quot;. A similar argument applies to running many copies of SketchyBot in parallel. But I&#x27;m suddenly surprised that nobody has called me out on this (that I recall), because this reasoning is valid <strong>only if this 5% risk is an expression of <em>only</em> aleatoric uncertainty.</strong> </p><p>In fact, this &quot;5% chance&quot; is best understood as combining epistemic and aleatory uncertainty (by integrating over all possible models, according to their subjective probability). </p><p>Significantly, epistemic uncertainty <strong>doesn&#x27;t have this compounding effect! </strong>For example, you could have two models of how the world could work: one where we are lucky (L) and SketchyBot is completely safe, and another where we are unlucky (U) and running SketchyBot for even 1 day destroys the world (immediately). If you believe we have a 5% chance of being in world U and a 95% chance of being in world L, then you can roll the dice and run SketchyBot and not incur more than a 5% Xrisk. </p><p>Moreover, once you&#x27;ve actually run SketchyBot for 1 day, if you&#x27;re still alive, you can conclude that you were lucky (i.e. we live in world L), and SketchyBot is in fact completely safe. To be clear, I don&#x27;t think that absence of evidence of danger is <strong>strong</strong> evidence of safety in advanced AI systems (because of things like treacherous turns), but I&#x27;d say it&#x27;s a nontrivial amount of evidence. And it seems clear that <strong>I was overestimating Xrisk by naively compounding my subjective Xrisk estimates.</strong></p><p>Overall, I think the main takeaway is that <strong>there are plausible models in which we basically just get lucky, and fairly naive approaches to alignment just work. </strong>I don&#x27;t think we should bank on that, but I think it&#x27;s worth asking yourself where your uncertainty about Xrisk is coming from. Personally, I still put a lot of weight on models where the kind of advanced AI systems we&#x27;re likely to build are not dangerous by default, but carry some ~constant risk of becoming dangerous for every second they are turned on (e.g. by breaking out of a box, having critical insights about the world, instantiating inner optimizers, etc.). But I also put some weight on more FOOM-y things and at least a little bit of weight on us getting lucky.</p> capybaralet SwPQmJD2Davq3hQgn 2019-01-31T06:13:35.321Z
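To make the difference concrete, here is a minimal sketch of the arithmetic (plain Python; the only numbers taken from the post are the 5% and the 60 months, and the 1% residual per-month risk in the mixture case is just an illustrative value, not something claimed in the post):

```python
# Contrast how a "5% per month" risk compounds depending on whether it is
# aleatoric (an independent chance of doom each month) or epistemic
# (uncertainty over which world we are in).

def aleatoric_risk(per_month: float, months: int) -> float:
    """P(doom) if each month carries an independent per-month chance of doom."""
    return 1 - (1 - per_month) ** months

def epistemic_risk(p_unlucky: float, months: int) -> float:
    """P(doom) if with probability p_unlucky we are in world U (doomed on day 1)
    and otherwise in world L (never doomed); running longer adds nothing."""
    return p_unlucky if months >= 1 else 0.0

print(aleatoric_risk(0.05, 60))   # ~0.954: the naive compounding over 5 years
print(epistemic_risk(0.05, 60))   # 0.05: pure epistemic uncertainty doesn't compound

# Mixture: 5% chance of world U, 95% chance of a world with a small residual
# per-month aleatoric risk (1% here, an arbitrary illustrative number).
print(0.05 + 0.95 * aleatoric_risk(0.01, 60))  # ~0.48: in between the two extremes
```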
Comment by capybaralet on How much can value learning be disentangled? https://www.lesswrong.com/posts/Q7WiHdSSShkNsgDpa/how-much-can-value-learning-be-disentangled#DFkrRvNiNZFHSR7Gg <p>So I want to emphasize that I&#x27;m only saying it&#x27;s *plausible* that *there exists* a specification of &quot;manipulation&quot;. This is my default position on all human concepts. I also think it&#x27;s plausible that there does not exist such a specification, or that the specification is too complex to grok, or that there end up being multiple conflicting notions we conflate under the heading of &quot;manipulation&quot;. See <a href="https://www.lesswrong.com/posts/j79pzvuYM8hC9Tfzc/conceptual-analysis-for-ai-alignment">this post</a> for more.</p><p>Overall, I understand and appreciate the issues you&#x27;re raising, but I think all this post does is show that naive attempts to specify &quot;manipulation&quot; fail; I think it&#x27;s quite difficult to argue compellingly that no such specification exists ;)</p><p>&quot;It seems that the only difference between manipulation and explanation is whether we end up with a better understanding of the situation at the end. And measuring understanding is very subtle.&quot;</p><p>^ Actually, I think &quot;ending up with a better understanding&quot; (in the sense I&#x27;m reading it) is probably not sufficient to rule out manipulation; what I mean is that I can do something which actually improves your model of the world, but leads you to follow a policy with worse expected returns. A simple example would be if you are doing Bayesian updating and your prior over returns for two bandit arms is P(r|a_1) = N(1,1), P(r|a_2) = N(2, 1), while the true returns are 1/2 and 2/3 (respectively). So your current estimates are optimistic, but they are ordered correctly, and so induce the optimal (greedy) policy. </p><p>Now if I give you a bunch of observations of a_2, I will be giving you true information that will lead you to learn, correctly and with high confidence, that the expected reward for a_2 is ~2/3, improving your model of the world. But since you haven&#x27;t updated your estimate for a_1, you will now prefer a_1 to a_2 (if acting greedily), which is suboptimal. So overall I&#x27;ve informed you with true information, but disadvantaged you nonetheless. I&#x27;d argue that if I did this intentionally, it should count as a form of manipulation.</p> capybaralet DFkrRvNiNZFHSR7Gg 2019-01-31T04:58:48.703Z
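A quick numerical sketch of that bandit story (just standard conjugate-Gaussian updating with the priors stated above; the choice of 100 observations and the random seed are arbitrary, not anything from the original comment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior over returns: a_1 ~ N(1, 1), a_2 ~ N(2, 1); true mean returns are 1/2 and 2/3.
prior_mean = np.array([1.0, 2.0])
prior_var = np.array([1.0, 1.0])
true_mean = np.array([0.5, 2.0 / 3.0])

print("greedy choice before:", prior_mean.argmax())  # arm 1 (a_2), which really is the better arm

# "Helpful" true information: 100 honest, unit-noise observations of a_2's return.
obs = rng.normal(true_mean[1], 1.0, size=100)

# Conjugate normal update (known observation variance = 1), applied to a_2 only.
post_precision = 1.0 / prior_var[1] + len(obs)
post_mean_a2 = (prior_mean[1] / prior_var[1] + obs.sum()) / post_precision

post_mean = np.array([prior_mean[0], post_mean_a2])
print("posterior means:", post_mean)               # roughly [1.0, 0.67]
print("greedy choice after:", post_mean.argmax())  # arm 0 (a_1), the worse arm
```

The receiver's model of a_2 genuinely improved, yet the greedy policy it induces got worse, which is the point of the example.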
Comment by capybaralet on Imitation learning considered unsafe? https://www.lesswrong.com/posts/whRPLBZNQm3JD5Zv8/imitation-learning-considered-unsafe#B2eetE9FjHyYWSAs3 <p>I don&#x27;t think I&#x27;d put it that way (although I&#x27;m not saying it&#x27;s inaccurate). See my comments RE &quot;safety via myopia&quot; and &quot;inner optimizers&quot;.</p> capybaralet B2eetE9FjHyYWSAs3 2019-01-09T18:27:26.934Z Comment by capybaralet on Imitation learning considered unsafe? https://www.lesswrong.com/posts/whRPLBZNQm3JD5Zv8/imitation-learning-considered-unsafe#YhjLg2Q8PFByyc9s7 <p>Yes, maybe? Elaborating...</p><p>I&#x27;m not sure how well this fits into the category of &quot;inner optimizers&quot;; I&#x27;m still organizing my thoughts on that (aiming to finish doing so within the week...). I&#x27;m also not sure that people are thinking about inner optimizers in the right way.</p><p>Also, note that the thing being imitated doesn&#x27;t have to be a human.</p><p>OTTMH, I&#x27;d say:</p><ul><li>This seems more general in the sense that it isn&#x27;t some &quot;subprocess&quot; of the whole system that becomes a dangerous planning process.</li><li>This seems more specific in the sense that the boldest argument for inner optimizers is, I think, that they should appear in effectively any optimization problem when there&#x27;s enough optimization pressure.</li></ul> capybaralet YhjLg2Q8PFByyc9s7 2019-01-09T18:22:42.679Z Comment by capybaralet on Imitation learning considered unsafe? https://www.lesswrong.com/posts/whRPLBZNQm3JD5Zv8/imitation-learning-considered-unsafe#fp6SsYrxZdgjrsKc9 <p>See the clarifying note in the OP. I don&#x27;t think this is about imitating humans, per se.</p><p>The more general framing I&#x27;d use is WRT &quot;safety via myopia&quot; (something I&#x27;ve been working on in the past year). There is an intuition that supervised learning (e.g. via SGD as is common practice in current ML) is quite safe, because it doesn&#x27;t have any built-in incentive to influence the world (resulting in instrumental goals); it just seeks to yield good performance on the training data, learning in a myopic sense to improve its performance on the present input. I think this intuition has some validity, but also might lead to a false sense of confidence that such systems are safe, when in fact they may end up behaving as if they *do* seek to influence the world, depending on the task they are trained on (ETA: and other details of the learning algorithm, e.g. outer-loop optimization and model choice).</p> capybaralet fp6SsYrxZdgjrsKc9 2019-01-07T15:55:46.050Z Comment by capybaralet on Assuming we've solved X, could we do Y... https://www.lesswrong.com/posts/95i5B78uhqyB3d6Xc/assuming-we-ve-solved-x-could-we-do-y#SbLwB4ywmyY73v8pa <p>Aha, OK. So I either misunderstand or disagree with that.</p><p>I think SHF (at least most examples) have the human as &quot;CEO&quot; with AIs as &quot;advisers&quot;, and thus the human can choose to ignore all of the advice and make the decision unaided.</p> capybaralet SbLwB4ywmyY73v8pa 2019-01-07T15:39:48.509Z Comment by capybaralet on Imitation learning considered unsafe? https://www.lesswrong.com/posts/whRPLBZNQm3JD5Zv8/imitation-learning-considered-unsafe#y7WqX8irAeR8aFF2b <p>I think I disagree pretty broadly with the assumptions/framing of your comment, although not necessarily the specific claims.</p><p>1) I don&#x27;t think it&#x27;s realistic to imagine we have &quot;indistinguishable imitation&quot; with an idealized discriminator. It might be possible in the future, and it might be worth considering in order to make intellectual progress, but I&#x27;m not expecting it to happen on a deadline. So I&#x27;m talking about what I expect might be a practical problem if we actually try to build systems that imitate humans in the coming decades.</p><p>2) I wouldn&#x27;t say &quot;decision theory&quot;; I think that&#x27;s a bit of a red herring. What I&#x27;m talking about is the policy.</p><p>3) I&#x27;m not sure what link you are trying to make to the &quot;universal prior is malign&quot; ideas.
But I&#x27;ll draw my own connection. I do think the core of the argument I&#x27;m making results from an intuitive idea of what a simplicity prior looks like, and its propensity to favor something more like a planning process over something more like a lookup table. </p> capybaralet y7WqX8irAeR8aFF2b 2019-01-07T15:31:57.847Z Imitation learning considered unsafe? https://www.lesswrong.com/posts/whRPLBZNQm3JD5Zv8/imitation-learning-considered-unsafe <p>This post states an observation which I think a number of people have had, but which hasn&#x27;t been written up (AFAIK). I find it one of the more troubling outstanding issues with a number of proposals for AI alignment.</p><p>1) Training a flexible model with a reasonable simplicity prior to imitate (e.g.) human decisions (e.g. via behavioral cloning) should presumably yield a good approximation of the process by which human judgments arise, which involves a planning process.</p><p>2) We shouldn&#x27;t expect to learn exactly the correct process, though.</p><p>3) Therefore imitation learning might produce an AI which implements an unaligned planning process, which seems likely to have instrumental goals, and be dangerous.</p><p><strong>Example:</strong> The human might be doing planning over a bounded horizon of time-steps, or with a bounded utility function, and the AI might infer a version of the planning process that doesn&#x27;t bound horizon or utility.</p><p><strong>Clarifying note:</strong> Imitating a human is just one example; the key feature of the human is that the process generating their decisions is (arguably) well-modeled as involving planning over a long horizon.</p><p><strong>Counter-argument(s):</strong></p><ul><li>The human may have privileged access to context informing their decision; without that context, the solution may look very different</li><li>Mistakes in imitating the human may be relatively harmless; the approximation may be good enough</li><li>We can restrict the model family with the specific intention of preventing planning-like solutions</li></ul><p><strong>Overall, I have a significant amount of uncertainty about the significance of this issue, and I would like to see more thought regarding it.</strong></p> capybaralet whRPLBZNQm3JD5Zv8 2019-01-06T15:48:36.078Z Comment by capybaralet on Assuming we've solved X, could we do Y... https://www.lesswrong.com/posts/95i5B78uhqyB3d6Xc/assuming-we-ve-solved-x-could-we-do-y#pK4bGGuwKcziGgumi <p>OK, so it sounds like your argument why SHF can&#x27;t do ALD is (a specific, technical version of) the same argument that I mentioned in my last response. Can you confirm?</p> capybaralet pK4bGGuwKcziGgumi 2019-01-06T15:13:06.375Z Comment by capybaralet on Conceptual Analysis for AI Alignment https://www.lesswrong.com/posts/j79pzvuYM8hC9Tfzc/conceptual-analysis-for-ai-alignment#to6q9pWX2MiM73cRc <p>I intended to make that clear in the &quot;<strong>Concretely, I imagine a project around this with the following stages (each yielding at least one publication)</strong>&quot; section. The TL;DR is: do a literature review of analytic philosophy research on (e.g.) honesty. </p> capybaralet to6q9pWX2MiM73cRc 2018-12-30T21:58:25.522Z Comment by capybaralet on Assuming we've solved X, could we do Y... https://www.lesswrong.com/posts/95i5B78uhqyB3d6Xc/assuming-we-ve-solved-x-could-we-do-y#mude3AH2aqnRZ6Zt3 <p>Yes, please try to clarify. 
In particular, I don&#x27;t understand your &quot;|&quot; notation (as in &quot;S|Output&quot;).</p><p>I realized that I was a bit confused in what I said earlier. I think it&#x27;s clear that (proposed) SHF schemes should be able to do at least as well as a human, given the same amount of time, because they have human &quot;on top&quot; (as &quot;CEO&quot;) who can merely ignore all the AI helpers(/underlings). </p><p>But now I can also see an argument for why SHF couldn&#x27;t do ALD, if it doesn&#x27;t have arbitrarily long to deliberate: there would need to be some parallelism/decomposition in SHF, and that might not work well/perfectly for all problems.</p> capybaralet mude3AH2aqnRZ6Zt3 2018-12-30T21:56:30.356Z Conceptual Analysis for AI Alignment https://www.lesswrong.com/posts/j79pzvuYM8hC9Tfzc/conceptual-analysis-for-ai-alignment <p>TL; DR - <a href="https://en.wikipedia.org/wiki/Philosophical_analysis">Conceptual Analysis</a> is highly relevant for AI alignment, and is also a way in which someone with less technical skills can contribute to alignment research. This suggests there should be at least one person working full-time on reviewing existing philosophy literature for relevant insights, and summarizing and synthesizing these results for the safety community.</p><p>There are certain &quot;primitive concepts&quot; that we are able to express in mathematics, and it is <em>relatively</em> straightforward to program AIs to deal with those things. Naively, alignment requires understanding *all* morally significant human concepts, which seems daunting. However, the &quot;<a href="https://ai-alignment.com/corrigibility-3039e668638">argument from corrigibility</a>&quot; suggests that there may be small sets of human concepts which, if properly understood, are sufficient for &quot;<a href="https://www.lesswrong.com/posts/FTpPC4umEiREZMMRu/disambiguating-alignment-and-related-notions">benignment</a>&quot;. We should seek to identify what these concepts are, and make a best-effort to perform thorough and reductive conceptual analyses on them. But we should also look at what has already been done! </p><h2>On the coherence of human concepts</h2><p>For human concepts which *haven&#x27;t* been formalized, it&#x27;s unclear whether there is a simple &quot;coherent core&quot; to the concept. Careful analysis may also reveal that there are several coherent concepts worth distinguishing, e.g. cardinal vs. ordinal numbers. If we find there is a coherent core, we can attempt to build algorithms around it. </p><p>If there isn&#x27;t a simple coherent core, there may be a more complex one, or it may be that the concept just isn&#x27;t coherent (i.e. that it&#x27;s the product of a confused way of thinking). Either way, in the near term we&#x27;d probably have to use machine learning if we wanted to include these concepts in our AI&#x27;s lexicon.</p><p>A serious attempt at conceptual analysis could help us decide whether we should attempt to learn or formalize a concept. </p><h2><strong>Concretely, I imagine a project around this with the following stages (each yielding at least one publication):</strong></h2><p>1) A &quot;brainstormy&quot; document which attempts to enumerate all the concepts that are relevant to safety and presents the arguments for their specific relevance and relation to other relevant concepts. This should also specifically indicate how a combination of concepts, if rigorously analyzed, could be along the line of the argument from corrigibility. 
Besides corrigibility, two examples that jump to mind are &quot;reduced impact&quot; (or &quot;<a href="https://arxiv.org/abs/1806.01186">side effects</a>&quot;), and <a href="https://arxiv.org/abs/1606.03490">interpretability</a>.</p><p>2) A deep dive into the relevant literature (I imagine mostly in analytic philosophy) on each of these concepts (or sets of concepts). These should summarize the state of research on these problems in the relevant fields, and potentially inspire safety researchers, or at least help them frame their work for these audiences and find potential collaborators within these fields. It *might* also do some &quot;legwork&quot; in terms of formalizing logically rigorous notions in terms of mathematics or machine learning.</p><p>3) Attempting to transferring insights or ideas from these fields into technical AI safety or machine learning papers, if applicable.</p><p></p><p></p><p></p><p></p> capybaralet j79pzvuYM8hC9Tfzc 2018-12-30T00:46:38.014Z Comment by capybaralet on Assuming we've solved X, could we do Y... https://www.lesswrong.com/posts/95i5B78uhqyB3d6Xc/assuming-we-ve-solved-x-could-we-do-y#5EMzeqoP9r37SwaAo <p>Regarding the question of how to do empirical work on this topic: I remember there being one thing which seemed potentially interesting, but I couldn&#x27;t find it in my notes (yet).</p><p>RE the rest of your comment: I guess you are taking issue with the complexity theory analogy; is that correct? An example hypothetical TDMP I used is &quot;arbitrarily long deliberation&quot; (ALD), i.e. a single human is allowed as long as they want to make the decision (I don&#x27;t think that&#x27;s a perfect &quot;target&quot; for alignment, but it seems like a reasonable starting point). I don&#x27;t see why ALD would (even potentially) &quot;do something that can&#x27;t be approximated by SHF-schemes&quot;, since those schemes still have the human making a decision. </p><p>&quot;Or was the discussion more about, assuming we have theoretical reasons to think that SHF-schemes can approximate TDMP, how to test it empirically?&quot; &lt;-- yes, IIUC.</p><p></p><p></p><p></p><p></p><p></p> capybaralet 5EMzeqoP9r37SwaAo 2018-12-27T04:42:09.568Z Comment by capybaralet on Survey: What's the most negative*plausible cryonics-works story that you know? https://www.lesswrong.com/posts/MzmMZmSzyeLPjLQjD/survey-what-s-the-most-negative-plausible-cryonics-works#NSSmfS7Fqm2WFwALv <p>I&#x27;d suggest separating these two scenarios, based on the way the comments are meant to work according to the OP.</p> capybaralet NSSmfS7Fqm2WFwALv 2018-12-19T22:42:54.714Z Comment by capybaralet on Assuming we've solved X, could we do Y... https://www.lesswrong.com/posts/95i5B78uhqyB3d6Xc/assuming-we-ve-solved-x-could-we-do-y#RTrjtCsz5YiZLadT9 <p>I actually don&#x27;t understand why you say they can&#x27;t be fully disentangled. </p><p>IIRC, it seemed to me during the discussion that your main objection was around whether (e.g.) &quot;arbitrarily long deliberation (ALD)&quot; was (or could be) fully specified in a way that accounts properly for things like deception, manipulation, etc. More concretely, I think you mentioned the possibility of an AI affecting the deliberation process in an undesirable way. </p><p>But I think it&#x27;s reasonable to assume (within the bounds of a discussion) that there is a non-terrible way (in principle) to specify things like &quot;manipulation&quot;. So do you disagree? 
Or is your objection something else entirely?</p> capybaralet RTrjtCsz5YiZLadT9 2018-12-17T04:43:41.963Z Comment by capybaralet on Assuming we've solved X, could we do Y... https://www.lesswrong.com/posts/95i5B78uhqyB3d6Xc/assuming-we-ve-solved-x-could-we-do-y#G3bmPDi9BXJA5qamX <p>Hey, David here!</p><p>Just writing to give some context... The point of this session was to discuss an issue I see with &quot;super-human feedback (SHF)&quot; schemes (e.g. debate, amplification, recursive reward modelling) that use helper AIs to inform human judgments. I guess there was more of an inferential gap going into the session than I expected, so for background: let&#x27;s consider the complexity theory viewpoint in feedback (as discussed in section 2.2 of &quot;<a href="https://arxiv.org/abs/1805.00899">AI safety via debate</a>&quot;). This implicitly assumes that we have access to a trusted (e.g. human) decision making process (TDMP), sweeping the issues that Stuart mentions under the rug. </p><p>Under this view, the goal of SHF is to efficiently emulate the TDMP, accelerating the decision-making. For example, we&#x27;d like an agent trained with SHF to be able to quickly (e.g. in a matter of seconds) make decisions that would take the TDMP billions of years to decide. But we don&#x27;t aim to change the decisions. </p><p>Now, the issue I mentioned is: there doesn&#x27;t seem to be any way to evaluate whether the SHF-trained agent is faithfully emulating the TDMP&#x27;s decisions on such problems. It seems like, naively, the best we can do is train on problems where the TDMP can make decisions quickly, so that we can use its decisions as ground truth; then we just hope that it generalizes appropriately to the decisions that take TDMP billions of years. And the point of the session was to see if people have ideas for how to do less naive experiments that would allow us to increase our confidence that a SHF-scheme would yield safe generalization to these more difficult decisions. </p><p>Imagine there are 2 copies of me, A and B. A makes a decision with some helper AIs, and independently, B makes a decision without their help. A and B make different decisions. Who do we trust? I&#x27;m more ready to trust B, since I&#x27;m worried about the helper AIs having an undesirable influence on A&#x27;s decision-making.</p><p>-------------------------------------------------------------------- </p><p>...So questions of how to define human preferences or values seem mostly orthogonal to this question, which is why I want to assume them away. However, our discussion did make me consider more that I was making an implicit assumption (and this seems hard to avoid), that there was some idealized decision-making process that is assumed to be &quot;what we want&quot;. 
I&#x27;m relatively comfortable with trusting idealized versions of &quot;behavioral cloning/imitation/supervised learning&quot; (P) or &quot;(myopic) reinforcement learning/preference learning&quot; (NP), compared with the SHF-schemes (PSPACE).</p><p>One insight I gleaned from our discussion is the usefulness of disentangling:</p><ul><li>an idealized process for *defining* &quot;what we want&quot; (HCH was mentioned as potentially a better model of this than &quot;a single human given as long as they want to think about the decision&quot; (which was what I proposed using, for the purposes of the discussion)).</li><li>a means of *approximating* that definition.</li></ul><p>From this perspective, the discussion topic was: how can we gain empirical evidence for/against this question: &quot;Assuming that the output of a human&#x27;s indefinite deliberation is a good definition of &#x27;what they want&#x27;, do SHF-schemes do a good/safe job of approximating that?&quot; </p><p></p><p></p> capybaralet G3bmPDi9BXJA5qamX 2018-12-12T19:20:36.102Z Comment by capybaralet on Disambiguating "alignment" and related notions https://www.lesswrong.com/posts/FTpPC4umEiREZMMRu/disambiguating-alignment-and-related-notions#TGuy8E7tQoqfJPZkY <p>So I discovered that Paul Christiano already made a very similar distinction to the holistic/parochial one here:</p><p>https://ai-alignment.com/ambitious-vs-narrow-value-learning-99bd0c59847e</p><p>ambitious ~ holistic</p><p>narrow ~ parochial</p><p>Someone also suggested simply using general/narrow instead of holistic/parochial.</p><p></p><p></p> capybaralet TGuy8E7tQoqfJPZkY 2018-11-26T06:55:58.050Z Comment by capybaralet on Notification update and PM fixes https://www.lesswrong.com/posts/HKbqFR8rz2BWuQFHL/notification-update-and-pm-fixes#Eqstr5FbGvCjM34Ry <p>Has it been rolled out yet? I would really like this feature.</p><p>RE spamming: certainly they can be disabled by default, and you can have an unsubscribe button at the bottom of every email?</p> capybaralet Eqstr5FbGvCjM34Ry 2018-08-15T16:01:45.520Z Comment by capybaralet on Safely and usefully spectating on AIs optimizing over toy worlds https://www.lesswrong.com/posts/ikN9qQEkrFuPtYd6Y/safely-and-usefully-spectating-on-ais-optimizing-over-toy#eNLujgQsKcSdvLopd <p>I view this as a capability control technique, highly analogous to running a supervised learning algorithm where a reinforcement learning algorithm is expected to perform better. Intuitively, it seems like there should be a spectrum of options between (e.g.) supervised learning and reinforcement learning that would allow one to make more fine-grained safety-performance trade-offs.</p><p>I&#x27;m very optimistic about this approach of doing &quot;capability control&quot; by making less agent-y AI systems. If done properly, I think it could allow us to build systems that have no instrumental incentives to create subagents (although we&#x27;d still need to worry about &quot;accidental&quot; creation of subagents and (e.g. evolutionary) optimization pressures for their creation).</p><p>I would like to see this fleshed out as much as possible. This idea is somewhat intuitive, but it&#x27;s hard to tell if it is coherent, or how to formalize it. </p><p>P.S. Is this the same as &quot;platonic goals&quot;? 
Could you include references to previous thought on the topic?</p><p></p><p></p> capybaralet eNLujgQsKcSdvLopd 2018-08-15T15:49:12.296Z Comment by capybaralet on Disambiguating "alignment" and related notions https://www.lesswrong.com/posts/FTpPC4umEiREZMMRu/disambiguating-alignment-and-related-notions#CgtjK8XRnmJwh4v6M <p>I realized it&#x27;s unclear to me what &quot;trying&quot; means here, and in your definition of intentional alignment. I get the sense that you mean something much weaker than MIRI does by &quot;(actually) trying&quot;, and/or that you think this is a lot easier to accomplish than they do. Can you help clarify?</p> capybaralet CgtjK8XRnmJwh4v6M 2018-06-10T14:31:55.309Z Comment by capybaralet on Disambiguating "alignment" and related notions https://www.lesswrong.com/posts/FTpPC4umEiREZMMRu/disambiguating-alignment-and-related-notions#ua4vTHXt6M8bLg5w8 <p>It seems like you are referring to daemons. </p><p>To the extent that daemons result from an AI actually doing a good job of optimizing the right reward function, I think we should just accept that as the best possible outcome.</p><p>To the extent that daemons result from an AI doing a bad job of optimizing the right reward function, that can be viewed as a problem with capabilities, not alignment. That doesn&#x27;t mean we should ignore such problems; it&#x27;s just out of scope.</p><blockquote>Indeed, most people at MIRI seem to think that most of the difficulty of alignment is getting from &quot;has X as explicit terminal goal&quot; to &quot;is actually trying to achieve X.&quot;</blockquote><p> That seems like the wrong way of phrasing it to me. I would put it like &quot;MIRI wants to figure out how to build properly &#x27;consequentialist&#x27; agents, a capability they view us as currently lacking&quot;. </p> capybaralet ua4vTHXt6M8bLg5w8 2018-06-10T14:26:06.488Z Comment by capybaralet on Disambiguating "alignment" and related notions https://www.lesswrong.com/posts/FTpPC4umEiREZMMRu/disambiguating-alignment-and-related-notions#RsaA8LvNJk4kSD4NS <p>Can you please explain the distinction more succinctly, and say how it is related?</p> capybaralet RsaA8LvNJk4kSD4NS 2018-06-10T14:14:01.206Z Comment by capybaralet on Disambiguating "alignment" and related notions https://www.lesswrong.com/posts/FTpPC4umEiREZMMRu/disambiguating-alignment-and-related-notions#KMAJveXHxuxGRqBYP <p>I don&#x27;t think I was very clear; let me try to explain. </p><p>I mean different things by &quot;intentions&quot; and &quot;terminal values&quot; (and I think you do too?)</p><p>By &quot;terminal values&quot; I&#x27;m thinking of something like a reward function. If we literally just program an AI to have a particular reward function, then we know that it&#x27;s terminal values are whatever that reward function expresses. </p><p>Whereas &quot;trying to do what H wants it to do&quot; I think encompasses a broader range of things, such as when R has uncertainty about the reward function, but &quot;wants to learn the right one&quot;, or really just any case where R could reasonably be described as &quot;trying to do what H wants it to do&quot;. 
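One toy way to cash out that contrast, entirely my own illustration rather than anything from the thread (in the spirit of reward-uncertainty formulations; the names and the example actions are made up):

```python
from typing import Callable, Dict, List

State = str

def fixed_value_agent(reward: Callable[[State], float],
                      actions: Dict[str, State]) -> str:
    """'Terminal values' reading: the programmed reward function simply is R's values."""
    return max(actions, key=lambda a: reward(actions[a]))

def reward_uncertain_agent(hypotheses: List[Callable[[State], float]],
                           posterior: List[float],
                           actions: Dict[str, State]) -> str:
    """One version of 'trying to do what H wants': R keeps a posterior over which
    reward function is H's, maximizes expected reward under that posterior,
    and would update the posterior from H's feedback."""
    def expected_reward(s: State) -> float:
        return sum(p * r(s) for p, r in zip(posterior, hypotheses))
    return max(actions, key=lambda a: expected_reward(actions[a]))

# Tiny usage: two hypotheses about what H values in the resulting state.
actions = {"make_coffee": "coffee_made", "clean_desk": "desk_clean"}
r1 = lambda s: 1.0 if s == "coffee_made" else 0.0
r2 = lambda s: 1.0 if s == "desk_clean" else 0.0
print(reward_uncertain_agent([r1, r2], [0.7, 0.3], actions))  # "make_coffee"
```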
</p><p>Talking about a &quot;black box system&quot; was probably a red herring.</p><p></p><p></p> capybaralet KMAJveXHxuxGRqBYP 2018-06-07T19:35:35.554Z Comment by capybaralet on Disambiguating "alignment" and related notions https://www.lesswrong.com/posts/FTpPC4umEiREZMMRu/disambiguating-alignment-and-related-notions#9o6xeNCgZiGb6aYN6 <p>Another way of putting it: A parochially aligned AI (for task T) needs to understand task T, but doesn&#x27;t need to have common sense &quot;background values&quot; like &quot;don&#x27;t kill anyone&quot;. </p><p>Narrow AIs might require parochial alignment techniques in order to learn to perform tasks that we don&#x27;t know how to write a good reward function for. And we might try to combine parochial alignment with capability control in order to get something like a genie without having to teach it background values. When/whether that would be a good idea is unclear ATM.</p><p></p> capybaralet 9o6xeNCgZiGb6aYN6 2018-06-07T18:47:43.057Z Comment by capybaralet on Disambiguating "alignment" and related notions https://www.lesswrong.com/posts/FTpPC4umEiREZMMRu/disambiguating-alignment-and-related-notions#S9G5ixgXPTfhjcrA5 <p>It doesn&#x27;t *necessarily*. But it sounds like what you&#x27;re thinking of here is some form of &quot;sufficient alignment&quot;.</p><p>The point is that you could give an AI a reward function that leads it to be a good personal assistant program, so long as it remains restricted to doing the sort of things we expect a personal assistant program to do, and isn&#x27;t doing things like manipulating the stock market when you ask it to invest some money for you (unless that&#x27;s what you expect from a personal assistant). If it knows it could do things like that, but doesn&#x27;t want to, then it&#x27;s more like something sufficiently aligned. If it doesn&#x27;t do such things because it doesn&#x27;t realize they are possibilities (yet), or because it hasn&#x27;t figured out a good way to use it&#x27;s actuators to have that kind of effect (yet), because you&#x27;ve done a good job boxing it, then it&#x27;s more like &quot;parochially aligned&quot;. </p><p></p><p></p> capybaralet S9G5ixgXPTfhjcrA5 2018-06-07T18:43:13.704Z Comment by capybaralet on Amplification Discussion Notes https://www.lesswrong.com/posts/LbJHizyfAsDYeETBq/amplification-discussion-notes#LEBKGekt5CaLFehp7 <p>This is one of my main cruxes. I have 2 main concerns about honest mistakes:</p><p>1) Compounding errors: IIUC, Paul thinks we can find a basin of attraction for alignment (or at least corrigibility...) so that an AI can help us correct it online to avoid compounding errors. This seems plausible, but I don&#x27;t see any strong reasons to believe it will happen or that we&#x27;ll be able to recognize whether it is or not.</p><p>2) The &quot;progeny alignment problem&quot; (PAP): An honest mistake could result in the creation an unaligned progeny. I think we should expect that to happen quickly if we don&#x27;t have a good reason to believe it won&#x27;t. You could argue that humans recognize this problem, so an AGI should as well (and if it&#x27;s aligned, it should handle the situation appropriately), but that begs the question of how we got an aligned AGI in the first place. 
There are basically 3 subconcerns here (call the AI we&#x27;re building &quot;R&quot;):</p><p>2a) R can make an unaligned progeny before it&#x27;s &quot;smart enough&quot; to realize it needs to exercise care to avoid doing so.</p><p>2b) R gets smart enough to realize that solving PAP (e.g. doing something like MIRI&#x27;s AF) is necessary in order to develop further capabilities safely, and that ends up being a huge roadblock that makes R uncompetitive with less safe approaches.</p><p>2c) If R has gamma &lt; 1, it could knowingly, rationally decide to build a progeny that is useful through R&#x27;s effective horizon, but will take over and optimize a different objective after that. </p><p>2b and 2c are *arguably* &quot;non-problems&quot; (although they&#x27;re at least worth taking into consideration). 2a seems like a more serious problem that needs to be addressed.</p> capybaralet LEBKGekt5CaLFehp7 2018-06-06T12:06:55.580Z Comment by capybaralet on Disambiguating "alignment" and related notions https://www.lesswrong.com/posts/FTpPC4umEiREZMMRu/disambiguating-alignment-and-related-notions#NkK5pgYtLMCXnriHg <p>This is not what I meant by &quot;the same values&quot;, but the comment points towards an interesting point.</p><p>When I say &quot;the same values&quot;, I mean the same utility function, as a function over the state of the world (and the states of &quot;R is having sex&quot; and &quot;H is having sex&quot; are different). </p><p>The interesting point is that states need to be inferred from observations, and it seems like there are some fundamentally hard issues around doing that in a satisfying way.</p> capybaralet NkK5pgYtLMCXnriHg 2018-06-05T19:59:10.666Z Comment by capybaralet on Funding for AI alignment research https://www.lesswrong.com/posts/DbPJGNS79qQfZcDm7/funding-for-ai-alignment-research#3yu4ckX8wBrNcYskz <p>So my original response was to the statement:</p><blockquote>Differential research that advances safety more than AI capability still advances AI capability.</blockquote><p>Which seems to suggest that advancing AI capability is sufficient reason to avoid technical safety that has non-trivial overlap with capabilities. I think that&#x27;s wrong.</p><p>RE the necessary and sufficient argument:</p><p>1) Necessary: it&#x27;s unclear that a technical solution to alignment would be sufficient, since our current social institutions are not designed for superintelligent actors, and we might not develop effective new ones quickly enough</p><p>2) Sufficient: I agree that never building AGI is a potential Xrisk (or close enough). I don&#x27;t think it&#x27;s entirely unrealistic &quot;to shoot for levels of coordination like &#x27;let&#x27;s just never build AGI&#x27;&quot;, although I agree it&#x27;s a long shot. Supposing we have that level of coordination, we could use &quot;never build AGI&quot; as a backup plan while we work to solve technical safety to our satisfaction, if that is in fact possible.</p><p></p><p></p> capybaralet 3yu4ckX8wBrNcYskz 2018-06-05T16:07:24.487Z Comment by capybaralet on Funding for AI alignment research https://www.lesswrong.com/posts/DbPJGNS79qQfZcDm7/funding-for-ai-alignment-research#PHrwvg7zNzjna2kjK <blockquote>Moving on from that I&#x27;m thinking that we might need a broad base of support from people (depending upon the scenario) so being able to explain how people could still have meaningful lives post AI is important for building that support. 
So I&#x27;ve been thinking about that.</blockquote><p></p><p>This sounds like it would be useful for getting people to support the development of AGI, rather than effective global regulation of AGI. What am I missing?</p> capybaralet PHrwvg7zNzjna2kjK 2018-06-05T16:01:44.240Z Disambiguating "alignment" and related notions https://www.lesswrong.com/posts/FTpPC4umEiREZMMRu/disambiguating-alignment-and-related-notions <p>I recently had an ongoing and undetected inferential gap with someone over our use of the term &quot;value aligned&quot;.</p><h2><strong>Holistic alignment vs. parochial alignment</strong></h2><p>I was talking about &quot;holistic alignment&quot;: <br/><em>Agent R is <strong>holistically aligned</strong> with agent H iff R and H have the same terminal values.</em><br/>This is the &quot;classic AI safety (CAIS)&quot; (as exemplified by Superintelligence) notion of alignment, and the CAIS view is roughly: &quot;a superintelligent AI (ASI) that is not holistically aligned is an Xrisk&quot;; this view is supported by <a href="https://en.wikipedia.org/wiki/Instrumental_convergence#Instrumental_convergence_thesis">the instrumental convergence thesis</a>.</p><p>My friend was talking about &quot;parochial alignment&quot;. I&#x27;m lacking a satisfyingly crisp definition of parochial alignment, but intuitively, it refers to how you&#x27;d want a &quot;<a href="https://arbital.com/p/task_agi/">genie</a>&quot; to behave:<br/><em>R is <strong>parochially aligned</strong> with agent H and task T iff R&#x27;s terminal values are to accomplish T in accordance with H&#x27;s preferences over the intended task domain.</em><br/>We both agreed that a parochially aligned ASI is not safe by default (it might <a href="https://wiki.lesswrong.com/wiki/Paperclip_maximizer">paperclip</a>), but that it might be possible to make one safe using various capability control mechanisms (for instance, anything that effectively restricts it to operating within the task domain). </p><h2><strong>Sufficient alignment</strong></h2><p>We might further consider a notion of &quot;sufficient alignment&quot;:<br/><em>R is <strong>sufficiently aligned </strong>with H iff optimizing R&#x27;s terminal values would not induce a nontrivial Xrisk (according to H&#x27;s definition of Xrisk).</em><br/>For example, an AI whose terminal values are &quot;maintain meaningful human control over the future&quot; is plausibly sufficiently aligned. A sufficiently aligned ASI is safe even in the absence of capability control. It&#x27;s worth considering what might constitute sufficient alignment, short of holistic alignment. For instance, <a href="https://ai-alignment.com/corrigibility-3039e668638">Paul seems to argue that corrigible agents are sufficiently aligned</a>. As another example, we don&#x27;t know how bad of a <a href="https://en.wikipedia.org/wiki/AI_control_problem#The_problem_of_perverse_instantiation:_%22be_careful_what_you_wish_for%22">perverse instantiation</a> to expect from an ASI whose values are almost correct (e.g.
within epsilon in L-infinity norm over possible futures).</p><h2><strong>Intentional alignment and &quot;benignment&quot;</strong></h2><p>Paul Christiano&#x27;s <a href="https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6">version of alignment</a>, which I&#x27;m calling &quot;intentional alignment&quot;: <br/><em>R is <strong>intentionally aligned </strong>with H if R is trying to do what H wants it to do.</em><br/>Although it feels intuitive, I&#x27;m not satisfied with the crispness of this definition, since we don&#x27;t have a good way of determining a black box system&#x27;s intentions. We can apply <a href="https://en.wikipedia.org/wiki/Intentional_stance">the intentional stance</a>, but that doesn&#x27;t provide a clear way of dealing with irrationality.</p><p>Paul also talks about <a href="https://ai-alignment.com/benign-ai-e4eb6ec6d68e">benign AI</a>, which is about what an AI is optimized for (which is closely related to what it &quot;values&quot;). Inspired by this, I&#x27;ll define a complementary notion to Paul&#x27;s notion of alignment:<br/><em>R is <strong>benigned </strong>with H if R is not actively trying to do something that H doesn&#x27;t want it to do.</em></p><h2><strong>Take-aways</strong></h2><p>1) Be clear about which notion of alignment you are using.</p><p>2) There might be sufficiently aligned ASIs that are not holistically aligned.</p><p>3) Try to come up with crisper definitions of parochial and intentional alignment.</p><p></p><p></p> capybaralet FTpPC4umEiREZMMRu 2018-06-05T15:35:15.091Z Comment by capybaralet on Funding for AI alignment research https://www.lesswrong.com/posts/DbPJGNS79qQfZcDm7/funding-for-ai-alignment-research#S2G5fnJ9EX7bFzFwr <p>Can you give some arguments for these views? </p><p>I think the best argument against institution-oriented work is that it might be harder to make a big impact. But more importantly, I think strong global coordination is necessary and sufficient, whereas technical safety is plausibly neither. </p><p>I also agree that one should consider tradeoffs, sometimes. But every time someone has raised this concern to me (I think it&#x27;s been 3x?) I think it&#x27;s been a clear cut case of &quot;why are you even worrying about that&quot;, which leads me to believe that there are a lot of people who are overconcerned about this. </p><p></p> capybaralet S2G5fnJ9EX7bFzFwr 2018-06-05T14:38:21.600Z Comment by capybaralet on When is unaligned AI morally valuable? https://www.lesswrong.com/posts/3kN79EuT27trGexsq/when-is-unaligned-ai-morally-valuable#Bex8icDTdJibtataM <blockquote>It seems like the preferences of the AI you build are way more important than its experience (not sure if that&#x27;s what you mean).</blockquote><p></p><p>This is because the AI&#x27;s preferences are going to have a much larger downstream impact? </p><p></p><p>I&#x27;d agree, but caveat that there may be likely futures which don&#x27;t involve the creation of hyper-rational AIs with well-defined preferences, but rather artificial life with messy, incomplete, inconsistent preferences but morally valuable experiences.
More generally, the future of the light cone could be determined by societal/evolutionary factors rather than any particular agent or agent-y process.</p><p></p><p>I found your 2nd paragraph unclear...</p><blockquote>the goals happen to overlap enough</blockquote><p>Is this referring to the goals of having &quot;AIs that have good preferences&quot; and &quot;AIs that have lots of morally valuable experience&quot;?</p><p></p> capybaralet Bex8icDTdJibtataM 2018-06-05T14:33:53.494Z Comment by capybaralet on Funding for AI alignment research https://www.lesswrong.com/posts/DbPJGNS79qQfZcDm7/funding-for-ai-alignment-research#Ee6bccvcuvYdRSkCa <p>Are you funding constrained? Would you give out more money if you had more? </p> capybaralet Ee6bccvcuvYdRSkCa 2018-06-04T13:06:14.377Z Problems with learning values from observation https://www.lesswrong.com/posts/G8JEMZgDgFPss7Cm7/problems-with-learning-values-from-observation <p>I dunno if this has been discussed elsewhere (pointers welcome).<br /><br />Observational data doesn't allow one to distinguish correlation and causation.<br />This is a problem for an agent attempting to learn values without being allowed to make interventions.<br /><br />For example, suppose that happiness is just a linear function of how much Utopamine is in a person's brain.<br />If a person smiles only when their Utopamine concentration is above 3 ppm, then a value-learner which observes both someone's Utopamine levels and facial expression, and tries to predict their reported happiness on the basis of these features, will notice that smiling is correlated with higher levels of reported happiness and thus erroneously believe that it is partially responsible for the happiness (a toy numerical illustration follows this post).<br /><br />------------------<br />An IMPLICATION:<br />I have a picture of value learning where the AI learns via observation (since we don't want to give an unaligned AI access to actuators!).<br />But this makes it seem&nbsp;important to consider how to make an unaligned AI safe enough to perform value-learning-relevant interventions.</p> capybaralet G8JEMZgDgFPss7Cm7 2016-09-21T00:40:49.102Z
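<p>A toy numerical illustration of the failure mode described in the post above. The setup, numbers, and variable names are invented purely for illustration (a minimal sketch assuming Python with numpy): an observational fit credits the correlated-but-non-causal feature (smiling), while a simulated intervention shows it has no effect on happiness.</p>
<pre><code>
# A toy version of the Utopamine example: happiness is caused by utopamine alone;
# smiling is merely correlated with it. Purely observational fitting still credits smiling.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
utopamine = rng.uniform(0, 6, size=n)              # the actual cause of happiness
happiness = 2.0 * utopamine + rng.normal(0, 0.5, size=n)
smiling = (utopamine > 3.0).astype(float)          # correlated with happiness, not a cause

# Observational "value learning": regress reported happiness on smiling.
X = np.column_stack([smiling, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, happiness, rcond=None)
print("observational weight on smiling:", round(coef[0], 2))        # comes out large (~6)

# Intervention: force a smile without touching utopamine; happiness is unchanged.
happiness_forced_smile = 2.0 * utopamine + rng.normal(0, 0.5, size=n)
print("causal effect of forcing a smile:", round(happiness_forced_smile.mean() - happiness.mean(), 2))  # ~0
</code></pre>
<p>Only the intervention in the last two lines separates the two stories; the observational fit alone looks like evidence that smiling partly causes happiness, which is exactly the mistake the post describes.</p>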
Risks from Approximate Value Learning https://www.lesswrong.com/posts/rLTv9Sx3A79ijoonQ/risks-from-approximate-value-learning <p>Solving the value learning problem is (IMO) the key technical challenge for AI safety.<br />How good or bad is an approximate solution?<br /><br />EDIT for clarity:<br />By "approximate value learning" I mean something which does a good (but suboptimal from the perspective of safety) job of learning values. &nbsp;So it may do a good enough job of learning values to behave well most of the time, and be useful for solving tasks, but it still has a non-trivial chance of developing dangerous instrumental goals, and is hence an Xrisk.<br /><br /><strong>Considerations:</strong><br /><br /><em>1. How would developing good approximate value learning algorithms affect AI research/deployment?<br /></em>It would enable more AI applications. &nbsp;For instance, many robotics tasks, such as "smooth grasping motion", are difficult to manually specify a utility function for. &nbsp;This could have positive or negative effects:<br /><br />Positive:<br />* It could encourage more mainstream AI researchers to work on value-learning.<br /><br />Negative:<br />* It could encourage more mainstream AI developers to use reinforcement learning to solve tasks for which "good-enough" utility functions can be learned.<br />Consider a value-learning algorithm which is "good-enough" to learn how to perform complicated, ill-specified tasks (e.g. <a href="https://www.youtube.com/watch?v=gy5g33S0Gzo">folding a towel</a>). &nbsp;But it's still not quite perfect, and so every second, there is a 1/100,000,000 chance that it decides to take over the world. A robot using this algorithm would likely pass a year-long series of safety tests and seem like a viable product, but would be expected to decide to take over the world in ~3 years (a quick check of these numbers follows this post).<br />Without good-enough value learning, these tasks might just not be solved, or might be solved with safer approaches involving more engineering and less performance, e.g. using a collection of supervised learning modules and hand-crafted interfaces/heuristics.</p> <p><em>2. What would a partially aligned AI do?&nbsp;<br /></em>An AI programmed with an approximately correct value function might fail&nbsp;<br />* dramatically (see, e.g.&nbsp;<a href="https://intelligence.org/files/CEV.pdf">Eliezer</a>, on AIs "tiling the solar system with tiny smiley faces.")<br />or<br />* relatively benignly (see, e.g.&nbsp;<a href="http://graphitepublications.com/the-beginning-of-the-end-or-the-end-of-beginning-what-happens-when-ai-takes-over/">my example of an AI that doesn't understand gustatory pleasure</a>)<br /><br />Perhaps a more significant example of benign partial-alignment would be an AI that has not learned all human values, but is corrigible and handles its uncertainty about its utility in a desirable way.</p> capybaralet rLTv9Sx3A79ijoonQ 2016-08-27T19:34:06.178Z
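<p>A quick check of the back-of-the-envelope numbers in the post above, assuming the per-second failure chance is independent across seconds (a small sketch, not anything from the post):</p>
<pre><code>
# Sanity check: a 1e-8 per-second chance of a treacherous turn, independent across seconds.
p_per_second = 1e-8
seconds_per_year = 365 * 24 * 3600          # 31,536,000

expected_years_to_takeover = 1 / p_per_second / seconds_per_year
p_pass_year_long_test = (1 - p_per_second) ** seconds_per_year

print(f"expected time to takeover: {expected_years_to_takeover:.1f} years")   # ~3.2 years
print(f"chance of passing a year-long test: {p_pass_year_long_test:.0%}")     # ~73%
</code></pre>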
Inefficient Games https://www.lesswrong.com/posts/7m6SCnR4nMWzAyL8d/inefficient-games <p>There are several well-known games in which the Pareto optima and Nash equilibria are disjoint sets.<br />The most famous is probably the prisoner's dilemma (a short worked example appears below). &nbsp;Races to the bottom or tragedies of the commons typically have this feature as well.<br /><br />I <a href="https://economics.stackexchange.com/questions/9514/is-there-a-term-for-a-game-whose-pareto-optimal-solutions-and-nash-equilibria-ar">proposed</a>&nbsp;calling these <em>inefficient games.</em>&nbsp; More generally, games where the sets of Pareto optima and Nash equilibria are distinct (but not disjoint), such as a <a href="https://en.wikipedia.org/wiki/Stag_hunt">stag hunt</a>, could be called <em>potentially inefficient games.</em></p> <p>It seems worthwhile to study <em>(potentially)&nbsp;inefficient games </em>as a class and see what can be discovered about them, but I don't know of any such work (pointers welcome!)<br /></p> capybaralet 7m6SCnR4nMWzAyL8d 2016-08-23T17:47:02.882Z Should we enable public binding precommitments? https://www.lesswrong.com/posts/5bJd553w4YZKSxy6g/should-we-enable-public-binding-precommitments <p>The ability to make arbitrary public binding precommitments seems like a powerful tool for solving coordination problems.<br /><br />We'd like to be able to commit to cooperating with anyone who will cooperate with us, as in the open-source prisoner's dilemma (although this simple case is still an open problem, AFAIK). &nbsp;But we should be able to do this piecemeal.<br /><br />It seems like we are moving in this direction, with things like Ethereum that enable smart contracts. &nbsp;Technology should enable us to enforce more real-world precommitments, since we'll be able to more easily monitor and make public our private data.<br /><br />Optimistically, I think this could allow us to solve coordination issues robustly enough to have a very low probability of any individual actor making an unsafe AI. &nbsp;This would require a lot of people to make the right kind of precommitments.<br /><br />I'm guessing there are a lot of potential downsides and ways it could go wrong, which y'all might want to point out.<br /><br /></p> capybaralet 5bJd553w4YZKSxy6g 2016-07-31T19:47:05.588Z
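<p>A small worked example of the "inefficient games" idea from the posts above, using standard prisoner's dilemma payoffs (the particular payoff matrix and code are illustrative only): enumerating the four outcomes shows that the unique pure-strategy Nash equilibrium and the set of Pareto optima are disjoint.</p>
<pre><code>
# Pure-strategy Nash equilibria vs. Pareto optima for a standard prisoner's dilemma.
# Payoffs are (row player, column player); "C" = cooperate, "D" = defect.
payoffs = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
actions = ["C", "D"]

def is_nash(profile):
    # No player can gain by unilaterally deviating.
    a1, a2 = profile
    u1, u2 = payoffs[profile]
    row_ok = all(u1 >= payoffs[(d, a2)][0] for d in actions)
    col_ok = all(u2 >= payoffs[(a1, d)][1] for d in actions)
    return row_ok and col_ok

def is_pareto_optimal(profile):
    # No other outcome makes one player better off without making the other worse off.
    u1, u2 = payoffs[profile]
    for v1, v2 in payoffs.values():
        if v1 >= u1 and v2 >= u2 and (v1 > u1 or v2 > u2):
            return False
    return True

profiles = list(payoffs)
print("Nash equilibria:", [p for p in profiles if is_nash(p)])              # [('D', 'D')]
print("Pareto optima:  ", [p for p in profiles if is_pareto_optimal(p)])    # the other three outcomes
</code></pre>
<p>The Nash equilibrium (D, D) is the one outcome that is not Pareto optimal, which is what makes the game "inefficient" in the post's sense; conditional commitments of the kind discussed above are one way to move play to (C, C).</p>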
A Basic Problem of Ethics: Panpsychism? https://www.lesswrong.com/posts/gLzdsHQHthb7yWXF9/a-basic-problem-of-ethics-panpsychism <p><a href="http://en.wikipedia.org/wiki/Panpsychism">Panpsychism</a>&nbsp;seems like a plausible theory of consciousness. &nbsp;It raises extreme challenges for establishing reasonable ethical criteria.<br /><br />It seems to suggest that our ethics is very subjective: the "expanding circle" of Peter Singer would eventually (ideally) stretch to encompass all matter. &nbsp;But how are we to communicate with, e.g., rocks? &nbsp;Our ability to communicate with one another and our presumed ability to detect falsehood and empathize in a meaningful way allow us to ignore this challenge wrt other people.</p> <p>One way to argue that this is not such a problem is to suggest that humans are simply very limited in our capacity as ethical beings, and that our perception of ethical truth is fundamentally limited, so that we can only draw conclusions with any meaningful degree of certainty about other humans or animals (or maybe even life-forms, if you are optimistic). &nbsp;<br /><br />But this is not very satisfying if we consider transhumanism. &nbsp;Are we to rely on AI to extrapolate our intuitions to the rest of matter? &nbsp;How do we know that our intuitions are correct (or do we even care? &nbsp;I do, personally...)? &nbsp;How can we tell if an AI is correctly extrapolating?<br /></p> capybaralet gLzdsHQHthb7yWXF9 2015-01-27T06:27:20.028Z A Somewhat Vague Proposal for Grounding Ethics in Physics https://www.lesswrong.com/posts/QotHGrSEaBPCoQQt5/a-somewhat-vague-proposal-for-grounding-ethics-in-physics <p>As <a href="http://arxiv.org/pdf/1409.0813.pdf">Tegmark argues</a>, the idea of a "final goal" for AI is likely incoherent, at least if (as he states), "Quantum effects aside, a truly well-defined goal would specify how all particles in our Universe should be arranged at the end of time." &nbsp;<br /><br />But "life is a journey, not a destination". &nbsp;So really, what we should be specifying is the entire <em>evolution</em>&nbsp;of the universe through its lifespan. &nbsp;So how can the universe "enjoy itself" as much as possible before the big crunch (or before and during the heat death)?*<br /><br />I hypothesize that experience is related to, if not a product of, change. &nbsp;I further propose (counter-intuitively, and with an eye towards "refinement" (to put it mildly))** that we treat experience as inherently positive and not try to distinguish between positive and negative experiences.<br /><br />Then it seems to me the (still rather intractable) question is: how does the rate of entropy's increase relate to the quantity of experience produced? &nbsp;Is it simply linear (in which case the rate doesn't matter, ethically, since the total experience would be fixed by the total entropy produced)? &nbsp;My intuition is that it is more like the fuel efficiency of a car, non-linear and with a sweet spot somewhere between a lengthy boredom and a flash of intensity (a toy illustration follows this post).<br /><br />*I'm not super up on cosmology; are there other theories I ought to be considering?<br /><br />**One idea for refinement: successful "prediction" (undefined here) creates positive experiences; frustrated expectations negative ones.<br /></p> capybaralet QotHGrSEaBPCoQQt5 2015-01-27T05:45:52.991Z
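<p>A made-up toy model of the fuel-efficiency intuition in the post above (the efficiency curve below is purely illustrative, not anything from the post): with a fixed total entropy budget, a constant experience-per-entropy rate makes the spending rate ethically irrelevant, while a peaked curve has a sweet spot.</p>
<pre><code>
# Toy comparison: fix a total entropy budget and vary the rate at which it is spent.
import numpy as np

budget = 1.0                              # total entropy ever produced (arbitrary units)
rates = np.linspace(0.01, 2.0, 200)       # candidate rates of entropy production

# Linear hypothesis: experience per unit entropy is constant,
# so total experience is the same no matter how fast the budget is spent.
linear_total = 1.0 * budget * np.ones_like(rates)

# "Fuel efficiency" hypothesis: experience per unit entropy peaks at an intermediate rate.
efficiency = rates * np.exp(1 - rates)    # illustrative curve with a maximum at rate = 1
peaked_total = efficiency * budget

best_rate = rates[np.argmax(peaked_total)]
print(f"linear case: total experience is {linear_total[0]:.2f} at every rate")
print(f"peaked case: total experience is maximized at rate {best_rate:.2f}")
</code></pre>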