william_s feed - LessWrong 2.0 Reader william_s’s posts and comments on the Effective Altruism Forum en-us Comment by William_S on The Main Sources of AI Risk? https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk#Lnp2QpWQmPyGk52dz <ul><li>AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes values on everyone else). While not caused only by AI design, it is possible that design decisions could impact the likelihood of this scenario (ie. at what point are values loaded into the system/how many people&#x27;s values are loaded into the system), and this is relevant for overall strategy.</li></ul> william_s Lnp2QpWQmPyGk52dz 2019-03-22T18:55:56.945Z Comment by William_S on The Main Sources of AI Risk? https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk#yDJ95p8C3Gvg7zdpT <ul><li>Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about). For example, AIs negotiating on behalf of humans take the stance described in <a href="https://arxiv.org/abs/1711.00363">https://arxiv.org/abs/1711.00363</a> of agreeing to split control of the future according to which human&#x27;s priors are most accurate (on potentially irrelevant issues) if this isn&#x27;t what humans actually want.</li></ul> william_s yDJ95p8C3Gvg7zdpT 2019-03-22T18:52:24.938Z Comment by William_S on Some Thoughts on Metaphilosophy https://www.lesswrong.com/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy#cvqf6QbWCMA6DZti7 <p>Maybe one AI philosophy service could look like this: it would ask you a bunch of other questions that are simpler than the problem of qualia, then show you what those answers imply about the problem of qualia if you use some method of reconciling those answers.</p> william_s cvqf6QbWCMA6DZti7 2019-03-08T18:58:55.169Z Comment by William_S on Some Thoughts on Metaphilosophy https://www.lesswrong.com/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy#eLTuDtXZmxevxBnPd <p>Re: Philosophy as interminable debate, another way to put the relationship between math and philosophy:</p><p>Philosophy as weakly verifiable argumentation</p><p>Math is solving problems by looking at the consequences of a small number of axiomatic reasoning steps. For something to be math, we have to be able to ultimately cash out any proof as a series of these reasoning steps. Once something is cashed out in this way, it takes a small constant amount of time to verify any reasoning step, so we can verify a proof given polynomial time.</p><p>Philosophy is solving problems where we haven&#x27;t figured out a set of axiomatic reasoning steps. Any non-axiomatic reasoning step we propose could end up having arguments that we hadn&#x27;t thought of that would lead us to reject that step. And those arguments themselves might be undermined by other arguments, and so on. Each round of debate lets us add another level of counter-arguments. Philosophers can make progress when they have some good predictor of whether arguments are good or not, but they don&#x27;t have access to certain knowledge of arguments being good.</p><p>Another difference between mathematics and philosophy is that in mathematics we have a well-defined set of objects and a well-defined problem we are asking about.
Whereas in philosophy we are trying to ask questions about things that exist in the real world and/or we are asking questions that we haven&#x27;t crisply defined yet.</p><p>When we come up with a set of axioms and a description of a problem, we can move that problem from the realm of philosophy to the realm of mathematics. When we come up with some method we trust of verifying arguments (ie. replicating scientific experiments), we can move problems out of philosophy to other sciences.</p><p>It could be the case that philosophy grounds out in some reasonable set of axioms which we don&#x27;t have access to now for computational reasons - in which case it could all end up in the realm of mathematics. It could be the case that, for all practical purposes, we will never reach this state, so it will remain in the &quot;potentially unbounded DEBATE round case&quot;. I&#x27;m not sure what it would look like if it could never ground out - one model could be that we have a black box function that performs a probabilistic evaluation of argument strength given counter-arguments, and we go through some process to get the consequences of that, but it never looks like &quot;here is a set of axioms&quot;.</p> william_s eLTuDtXZmxevxBnPd 2019-03-08T18:53:49.848Z Comment by William_S on Some Thoughts on Metaphilosophy https://www.lesswrong.com/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy#B2DKZqD98gan7nXQK <p>I guess it feels like I don&#x27;t know how we could know that we&#x27;re in the position that we&#x27;ve &quot;solved&quot; meta-philosophy. It feels like the thing we could do is build a set of better and better models of philosophy and check their results against held-out human reasoning and against each other.</p><p>I also don&#x27;t think we know how to specify a ground truth reasoning process that we could try to protect and run forever which we could be completely confident would come up with the right outcome (where something like HCH is a good candidate but potentially with bugs/subtleties that need to be worked out).</p><p>I feel like I have some (not well justified and possibly motivated) optimism that this process yields something good fairly early on. We could gain confidence that we are in this world if we build a bunch of better and better models of meta-philosophy and observe at some point the models continue agreeing with each other as we improve them, and that they agree with various instantiations of protected human reasoning that we run. If we are in this world, the thing we need to do is just spend some time building a variety of these kinds of models and produce an action that looks good to most of them. (Where agreement is not &quot;comes up with the same answer&quot; but more like &quot;comes up with an answer that other models think is okay and not disastrous to accept&quot;).</p><p>Do you think this would lead to &quot;good outcomes&quot;? Do you think some version of this approach could be satisfactory for solving the problems in <a href="https://www.lesswrong.com/posts/HTgakSs6JpnogD6c2/two-neglected-problems-in-human-ai-safety">Two Neglected Problems in Human-AI Safety</a>?</p><p>Do you think there&#x27;s a different kind of thing that we would need to do to &quot;solve metaphilosophy&quot;? 
Or do you think that working on &quot;solving metaphilosophy&quot; roughly cashes out as &quot;work on coming up with better and better models of philosophy in the model I&#x27;ve described here&quot;?</p> william_s B2DKZqD98gan7nXQK 2019-03-08T17:41:23.465Z Comment by William_S on Three AI Safety Related Ideas https://www.lesswrong.com/posts/vbtvgNXkufFRSrx4j/three-ai-safety-related-ideas#Koqa4intfDbf85BQY <p>A couple ways to implement a hybrid approach with existing AI safety tools:</p><p>Logical Induction: Specify some computationally expensive simulation of idealized humans. Run a logical inductor with the deductive process running the simulation and outputting what the humans say after time x in simulation, as well as statements about what non-idealized humans are saying in the real world. The inductor should be able to provide beliefs about what the idealized humans will say in the future informed by information from the non-idealized humans.</p><p>HCH/IDA: The HCH-humans demonstrate a reasoning process which aims to predict the output of a set of idealized humans using all available information (which can include running simulations of idealized humans or information from real humans). The way that the HCH tree uses information about real humans involves looking carefully at their circumstances and asking things like &quot;how do the real human&#x27;s circumstances differ from the idealized human&quot; and &quot;is the information from the real human compromised in some way?&quot;</p> william_s Koqa4intfDbf85BQY 2019-03-08T17:30:25.571Z Comment by William_S on Can HCH epistemically dominate Ramanujan? https://www.lesswrong.com/posts/4qY9zEHLa2su4PkQ4/can-hch-epistemically-dominate-ramanujan#KkxSbsdWJtA8aHruZ <p>It seems like for Filtered-HCH, the application in the post you linked to, you might be able to do a weaker version where you label any computation that you can&#x27;t understand in kN steps as problematic, only accepting things you think you can efficiently understand. (But I don&#x27;t think Paul is arguing for this weaker version).</p> william_s KkxSbsdWJtA8aHruZ 2019-02-27T22:58:06.230Z Comment by William_S on Reinforcement Learning in the Iterated Amplification Framework https://www.lesswrong.com/posts/fq7Ehb2oWwXtZic8S/reinforcement-learning-in-the-iterated-amplification#gCPbsh3BFc26zdHGC <blockquote>RL is typically about sequential decision-making, and I wasn&#x27;t sure where the &quot;sequential&quot; part came in).</blockquote><p>I guess I&#x27;ve used the term &quot;reinforcement learning&quot; to refer to a broader class of problems including both one-shot bandit problems and sequential decision making problems.
In this view, the feature that makes RL different from supervised learning is not that we&#x27;re trying to figure out how to act in an MDP/POMDP, but instead that we&#x27;re trying to optimize a function that we can&#x27;t take the derivative of (in the MDP case, it&#x27;s because the environment is non-differentiable, and in the approval learning case, it&#x27;s because the overseer is non-differentiable).</p> william_s gCPbsh3BFc26zdHGC 2019-02-18T21:09:09.756Z Comment by William_S on Some disjunctive reasons for urgency on AI risk https://www.lesswrong.com/posts/8oSCw3z2dZgWjanqB/some-disjunctive-reasons-for-urgency-on-ai-risk#bCm2FmZq8quQ7bDyP <p>Re: scenario 3, see <a href="https://en.wikipedia.org/wiki/The_Evitable_Conflict">The Evitable Conflict</a>, the last story in Isaac Asimov&#x27;s &quot;I, Robot&quot;:</p><blockquote>&quot;Stephen, how do we know what the ultimate good of Humanity will entail? We haven&#x27;t at our disposal the infinite factors that the Machine has at its! Perhaps, to give you a not unfamiliar example, our entire technical civilization has created more unhappiness and misery than it has removed. Perhaps an agrarian or pastoral civilization, with less culture and less people would be better. If so, the Machines must move in that direction, preferably without telling us, since in our ignorant prejudices we only know that what we are used to, is good – and we would then fight change. Or perhaps a complete urbanization, or a completely caste-ridden society, or complete anarchy, is the answer. We don&#x27;t know. Only the Machines know, and they are going there and taking us with them.&quot;</blockquote> william_s bCm2FmZq8quQ7bDyP 2019-02-15T21:59:17.046Z Comment by William_S on HCH is not just Mechanical Turk https://www.lesswrong.com/posts/4JuKoFguzuMrNn6Qr/hch-is-not-just-mechanical-turk#wqivTcZvP48kXvPsH <p>Yeah, to some extent. In the Lookup Table case, you need to have a (potentially quite expensive) way of resolving all mistakes. In the Overseer&#x27;s Manual case, you can also leverage humans to do some kind of more robust reasoning (for example, they can notice a typo in a question and still respond correctly, even if the Lookup Table would fail in this case). Though in low-bandwidth oversight, the space of things that participants could notice and correct is fairly limited.</p><p>Though I think this still differs from HRAD in that it seems like the output of HRAD would be a much smaller thing in terms of description length than the Lookup Table, and you can buy extra robustness by adding many more human-reasoned things into the Lookup Table (ie. automatically add versions of all questions with typos that don&#x27;t change the meaning of a question into the Lookup Table, add 1000 different sanity check questions to flag that things can go wrong).</p><p>So I think there are additional ways the system could correct mistaken reasoning relative to what I would think the output of HRAD would look like, but you do need to have processes that you think can correct any way that reasoning goes wrong.
So the problem could be less difficult than HRAD, but still tricky to get right.</p> william_s wqivTcZvP48kXvPsH 2019-02-13T00:15:16.037Z Comment by William_S on The Argument from Philosophical Difficulty https://www.lesswrong.com/posts/w6d7XBCegc96kz4n3/the-argument-from-philosophical-difficulty#7uWrcCd6PAvDn72f8 <p>Thanks, this position makes more sense in light of Beyond Astronomical Waste (I guess I have some concept of &quot;a pretty good future&quot; that is fine with something like a bunch of human-descended beings living happy lives that misses out on the sort of things mentioned in Beyond Astronomical Waste, and &quot;optimal future&quot; which includes those considerations). I buy this as an argument that &quot;we should put more effort into making philosophy work to make the outcome of AI better, because we risk losing large amounts of value&quot; rather than &quot;our efforts to get a pretty good future are doomed unless we make tons of progress on this&quot; or something like that.</p><p>&quot;Thousands of millions&quot; was a typo.</p> william_s 7uWrcCd6PAvDn72f8 2019-02-11T17:47:18.818Z Comment by William_S on Thoughts on reward engineering https://www.lesswrong.com/posts/NtX7LKhCXMW2vjWx6/thoughts-on-reward-engineering#4ZnrfC673yBKh2kA3 <blockquote>What is the motivation for using RL here?</blockquote><p>I see the motivation as: given practical compute limits, it may be much easier to have the system find an action the overseer approves of instead of imitating the overseer directly. Using RL also allows you to use any advances that are made in RL by the machine learning community to try to remain competitive.</p> william_s 4ZnrfC673yBKh2kA3 2019-02-10T22:31:38.378Z Comment by William_S on Thoughts on reward engineering https://www.lesswrong.com/posts/NtX7LKhCXMW2vjWx6/thoughts-on-reward-engineering#cmJfe4rZWdEHsWZ6t <blockquote>Would this still be a problem if we were training the agent with SL instead of RL?</blockquote><p>Maybe this could happen with SL if SL does some kind of large search and finds a solution that looks good but is actually bad. The distilled agent would then learn to identify this action and reproduce it, which implies the agent learning some facts about the action to efficiently locate it with much less compute than the large search process.
Knowing what the agent knows would allow the overseer to learn those facts, which might help in identifying this action as bad.</p> william_s cmJfe4rZWdEHsWZ6t 2019-02-10T22:28:42.330Z Comment by William_S on Reinforcement Learning in the Iterated Amplification Framework https://www.lesswrong.com/posts/fq7Ehb2oWwXtZic8S/reinforcement-learning-in-the-iterated-amplification#zTfj4zXHTAmZ9b9sn <blockquote>I don&#x27;t understand why we want to find this X* in the imitation learning case.</blockquote><p>Ah, with this example the intent was more like &quot;we can frame what the RL case is doing as finding X*, let&#x27;s show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)&quot;.</p><p>The reverse mapping (imitation to RL) just consists of applying reward 1 to M2&#x27;s demonstrated behaviour (which could be &quot;execute some safe search and return the results&quot;), and reward 0 to everything else.</p><blockquote>What is p_M(X*)?</blockquote>
<p>p_M(X*) is the probability of outputting X* (where p_M is a stochastic policy).</p><blockquote>M2(&quot;How good is answer X to Y?&quot;) * ∇log(p_M(X))</blockquote><p>This is the REINFORCE gradient estimator (which tries to increase the log probability of actions that were rated highly).</p> william_s zTfj4zXHTAmZ9b9sn 2019-02-10T22:09:27.681Z Comment by William_S on Announcement: AI alignment prize round 4 winners https://www.lesswrong.com/posts/nDHbgjdddG5EN6ocg/announcement-ai-alignment-prize-round-4-winners#w5CTff8hGAFk5Pf7o <p>I guess the question was more from the perspective of: if the cost was zero then it seems like it would be
worth running, so what part of the cost makes it not worth running (where I would think of cost as probably time to judge or availability of money to fund the contest).</p> william_s w5CTff8hGAFk5Pf7o 2019-02-10T19:06:00.792Z Comment by William_S on The Argument from Philosophical Difficulty https://www.lesswrong.com/posts/w6d7XBCegc96kz4n3/the-argument-from-philosophical-difficulty#qWHatXNamrRoCjSna <p>One important dimension to consider is how hard it is to solve philosophical problems well enough to have a pretty good future (which includes avoiding bad futures). It could be the case that this is not so hard, but fully resolving questions so we could produce an optimal future is very hard or impossible. It feels like this argument implicitly relies on assuming that &quot;solve philosophical problems well enough to have a pretty good future&quot; is hard (ie. takes thousands of millions of years in scenario 4) - can you provide further clarification on whether/why you think that is the case?</p> william_s qWHatXNamrRoCjSna 2019-02-10T19:02:57.074Z Comment by William_S on Announcement: AI alignment prize round 4 winners https://www.lesswrong.com/posts/nDHbgjdddG5EN6ocg/announcement-ai-alignment-prize-round-4-winners#v7e6BRGBDxWZGqEAZ <p>Slightly disappointed that this isn&#x27;t continuing (though I didn&#x27;t submit to the prize, I submitted to Paul Christiano&#x27;s call for possible problems with his approach, which was similarly structured). Was hoping that once I got further into my PhD, I&#x27;d have some more things worth writing up, and the recognition/a bit of prize money would provide some extra motivation to get them out the door.</p><p>What do you feel is the limiting resource that keeps this from being useful to continue in its current form?</p> william_s v7e6BRGBDxWZGqEAZ 2019-02-09T17:43:58.790Z Comment by William_S on HCH is not just Mechanical Turk https://www.lesswrong.com/posts/4JuKoFguzuMrNn6Qr/hch-is-not-just-mechanical-turk#R335SY48K3HaXn4p7 <p>Yeah, this is a problem that needs to be addressed. It feels like in the Overseer&#x27;s Manual case you can counteract this by giving definitions/examples of how you want questions to be interpreted, and in the Lookup Table case this can be addressed by coordination within the team creating the lookup table.</p> william_s R335SY48K3HaXn4p7 2019-02-09T17:10:56.001Z Reinforcement Learning in the Iterated Amplification Framework https://www.lesswrong.com/posts/fq7Ehb2oWwXtZic8S/reinforcement-learning-in-the-iterated-amplification <p>When I think about <a href="https://blog.openai.com/amplifying-ai-training/">Iterated</a> <a href="https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616">Amplification</a> (IA), I usually think of a version that uses imitation learning for distillation.</p><p>This is the version discussed in <a href="https://arxiv.org/abs/1811.07871">Scalable agent alignment via reward modeling: a research direction</a> as &quot;Imitating expert reasoning&quot;, in contrast to the proposed approach of &quot;Recursive Reward Modelling&quot;. The approach works roughly as follows:</p><p>1. Gather training data from experts on how to break problems into smaller pieces and combine the results</p><p>2. Train a model to imitate what the expert would do at every step</p><p>3. Amplification: Run a collaboration of a large number of copies of the learned model.</p><p>4. Distillation: Train a model to imitate what the collaboration did.</p><p>5.
Repeat steps 3 and 4, increasing performance at every step</p><p>However, Paul has also talked about IA using reinforcement learning (RL) to maximize the approval of the amplified model. What does this approach (RL-IA) look like? How does it relate to Imitation-IA and Recursive Reward Modelling?</p><p><strong>Puzzling about RL-IA</strong></p><p>To get an agent that takes good actions in an Atari game, we use Imitation-IA to build a system that answers the question &quot;how good is it to take actions from this state&quot;, then train a reinforcement learner to &quot;output the best action to take from a given state&quot;.</p><p>But it seems like the improvement stops there - it&#x27;s not clear how &quot;ability to output the best action to take from a given state&quot; could improve &quot;ability to evaluate how good actions are from a state&quot; in any way that&#x27;s different from running a traditional reinforcement learning algorithm (which usually involves taking some policy/value estimate and gradually improving it).</p><p><strong>Clarifying what RL-IA does</strong></p><p>Claim: There is a fairly straightforward correspondence between how Imitation-IA and RL-IA perform a task (given no computational limits). RL-IA does not change the class of tasks that Imitation-IA can perform or perform them in a radically different way.</p><p>Suppose we have a current version of the model M1 that takes questions and produces a distribution over answers. Let M2 be an amplified version of that model (ie. produced by running a number of copies of M1). Let Y be some question, with domain of answers D. We want to find the answer X*, the answer in D that maximizes the approval of the amplified overseer, M2(&quot;How good is answer X to Y?&quot;).
Y could be</p><ul><li>&quot;What action is best to take from this state in this atari game?&quot; where D is a small discrete set of possible actions</li><li>&quot;What answer of less than 100 characters should I give to this question?&quot; where D is a large discrete set of possible answers</li><li>&quot;What answer of unbounded length should I give to this question?&quot; where D is an infinite discrete set</li><li>&quot;What is the probability that event E will happen tomorrow?&quot; where D is the continuous space of probabilities</li></ul><p>An update using imitation learning would have the form:</p><ul><li>X* = M1(Y)</li><li>For: number of samples</li><ul><li>Sample an answer X from D</li><li>Evaluate M2(&quot;How good is answer X to Y?&quot;)</li><li>If M2(&quot;How good is answer X to Y?&quot;) &gt; M2(&quot;How good is answer X* to Y?&quot;), then set X* = X</li></ul><li>Perform gradient descent to maximize the probability of outputting X*, using gradient ∇p_M(X*)</li></ul>
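To make the imitation-style update above concrete, here is a minimal sketch in Python (my own illustration, not code from the post), assuming a toy setup: a single fixed question Y, a small discrete answer space D, M1 as a softmax policy over D, and a fixed black-box scoring function standing in for the amplified overseer M2("How good is answer X to Y?"). The names `M1_probs`, `M2`, `true_scores` and `imitation_update` are all hypothetical.

```python
# Minimal sketch of the imitation-style update above -- an illustration only,
# not code from the post. Assumptions: one fixed question Y, a small discrete
# answer space D, M1 as a softmax policy over D parameterized by `logits`,
# and M2 as a fixed black-box scorer standing in for the amplified overseer.
import numpy as np

rng = np.random.default_rng(0)
D = 10                             # size of the answer space
logits = np.zeros(D)               # parameters of the current model M1(Y)
true_scores = rng.normal(size=D)   # toy stand-in for the overseer's ratings

def M1_probs(logits):
    """Stochastic policy p_M(X): softmax over the answer space."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def M2(x):
    """Toy stand-in for M2("How good is answer x to Y?")."""
    return true_scores[x]

def imitation_update(logits, n_samples=20, lr=0.5):
    p = M1_probs(logits)
    x_star = int(np.argmax(p))         # X* = M1(Y): the model's current answer
    for _ in range(n_samples):
        x = int(rng.integers(D))       # sample an answer X from D
        if M2(x) > M2(x_star):         # keep it if the overseer rates it higher
            x_star = x
    # Gradient ascent on p_M(X*): for a softmax policy,
    # d p(X*) / d logits = p(X*) * (one_hot(X*) - p)
    grad = p[x_star] * (np.eye(D)[x_star] - p)
    return logits + lr * grad          # increase the probability of X*
```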
\nabla p_M(X^*)</li></ul><p>An update using the REINFORCE policy gradient estimator would have the form:</p><ul><li>Sample X from a stochastic policy M1(Y)</li><li>Perform gradient descent using the gradient \text{M2("How good is answer X to Y?")} \cdot \nabla \log(p_M(X))</li></ul><p>If we have a perfect distillation algorithm, these both converge to \mathrm{argmax}_X(M2(X)) in the limit of infinite computation.</p><p><strong>Practical Differences</strong></p><p>Outside of this idealized situation, circumstances could make one or the other a better update to use.</p><p>The imitation update could converge more quickly if we have a good initialization for M(Y) from human data, as it bypasses the need to explore.</p>
<p>The imitation update could also be less surprising, using only processes that the humans originally demonstrated.</p><p>The REINFORCE update could converge more quickly if the human initialization is suboptimal, or if it&#x27;s hard to exactly reproduce the human demonstration.</p><p>In general, it seems like the system could use an algorithm that combines reinforcement learning updates with imitation learning updates, ie. <a href="https://arxiv.org/abs/1704.03732">Deep Q Learning from Demonstrations</a>.</p><p><strong>Returning to the original puzzle</strong></p><p>I don&#x27;t think the solution is that &quot;ability to output good actions at this timestep&quot; translates directly into &quot;ability to evaluate which actions are good&quot;. Rather, I think that the decomposition of &quot;evaluate which actions are good&quot; contains some questions which might perform a search over an answer space; the answers to these questions are improved by reinforcement learning, and this improves the evaluation of Atari actions. This can produce a model which uses a mix of imitation learning and reinforcement learning.</p><p>For example:</p><p>&quot;What is a good action to take from state S?&quot; could be learned to maximize &quot;How good is it to take action A from this state S?&quot;</p><p>&quot;How good is it to take action A from this state S?&quot; could be learned by imitating an amplified reasoner that asks the subquestion &quot;What is the most useful information to provide about the consequences of action A from state S?&quot;</p><p>&quot;What is the most useful information to provide about the consequences of action A from state S?&quot; could be learned to maximize &quot;How useful is information I about the consequences of action A in state S?&quot;</p><p>A modified version of the question, &quot;How good is it to take action A from this state S, and include an explanation of your reasoning?&quot; could also be reinforcement learned to maximize &quot;How good is the explanation of how good it is to take action A in state S?&quot;</p><p><strong>Concluding Thoughts</strong></p><p>Indeed, I think we could view <em>every</em> question answerable by an IA system as having the form &quot;select the answer to question Y that the overseer approves of most&quot;, and use both demonstrations from the amplified reasoner and the amplified reasoner&#x27;s evaluation to improve the answer. This perspective allows the system to learn to decompose problems better than the original humans could. But it might also cause problems if we can make a series of updates that cause the learned answering system to behave very differently from the original human demonstrators. We might want to be careful about the degree to which an RL-learned policy can differ from the original demonstration.</p><p>In terms of getting a system to be capable of doing some task, I&#x27;d be most optimistic about systems that could combine RL-IA and Imitation-IA depending on the situation.</p>
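<p>Restating this example as a sketch (the table below is purely illustrative; it only encodes the three questions above together with their training signals, and is not a description of any existing system):</p>

```python
# Hypothetical sketch restating the example above: each question in the decomposition
# gets either an imitation signal (copy the amplified reasoner's answer) or an RL
# signal (maximize the amplified answer to an evaluation question).

TRAINING_SCHEME = {
    "What is a good action to take from state S?":
        ("rl", "How good is it to take action A from this state S?"),
    "How good is it to take action A from this state S?":
        ("imitation", "amplified reasoner asks: What is the most useful information to "
                      "provide about the consequences of action A from state S?"),
    "What is the most useful information to provide about the consequences of action A from state S?":
        ("rl", "How useful is information I about the consequences of action A in state S?"),
}

def training_signal(question: str) -> str:
    """Describe how the model's answer to `question` would be trained in this sketch."""
    method, source = TRAINING_SCHEME[question]
    if method == "imitation":
        return f"imitate demonstrations ({source})"
    return f"REINFORCE against the amplified answer to: {source}"
```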
<p>But I still think there&#x27;s usefulness in thinking about the pure Imitation-IA perspective to try to reason about the alignment properties of the system.</p><p>(Thanks to Andreas Stuhlmüller and Owain Evans for feedback on a draft of this post)</p> william_s fq7Ehb2oWwXtZic8S 2019-02-09T00:56:08.256Z HCH is not just Mechanical Turk https://www.lesswrong.com/posts/4JuKoFguzuMrNn6Qr/hch-is-not-just-mechanical-turk <p>HCH, introduced in <a href="https://ai-alignment.com/humans-consulting-hch-f893f6051455"><u>Humans consulting HCH</u></a>, is a computational model in which a human answers questions using answers to subquestions provided by other humans, who can in turn consult further humans, and so on. Each step in the process consists of a human taking in a question, optionally asking one or more subquestions of other humans, and returning an answer based on the answers to those subquestions. HCH can be used as a model for what Iterated Amplification would be able to do in the limit of infinite compute. HCH can also be used to decompose the question of &quot;is Iterated Amplification safe&quot; into “is HCH safe” and “If HCH is safe, will Iterated Amplification approximate the behaviour of HCH in a way that is also safe”. </p><p>I think there&#x27;s a way of interpreting HCH that leads to incorrect intuitions about why we would expect it to be safe. Here, I describe three models of how one could think HCH would work, and why we might expect them to be safe.</p><p><strong>Mechanical Turk:</strong> A human, Bob, is hired on Mechanical Turk to act as a component of HCH. Bob takes in some reasonable-length natural language question, formulates subquestions to ask other Turkers, and turns the responses from those Turkers into an answer to the original question. Bob only sees the question he is asked and thinks for a short period of time before asking subquestions or returning an answer. The question of &quot;is HCH corrigible&quot; is about &quot;how does the corrigibility of Bob translate into corrigibility of the overall system&quot;? To claim that HCH is safe in this scenario, we could point to Bob being well-intentioned, having human-like concepts and reasoning in a human-like way. Also, since Bob has to communicate in natural language with other humans, those communications could be monitored or reflected upon. We could claim that this keeps the reasoning that produces the answer within the space of reasoning that humans use, and so the process is more likely to reflect our values and less likely to yield unexpected outcomes that misinterpret our values.</p><p><strong>Lookup Table:</strong> An AI safety research team led by Alice writes down a set of 100 million possible queries that they claim capture all human reasoning. For each of these queries, they then write out the subquestions that would need to be asked, along with simple computer code that combines the answers to the subquestions into an answer to the original question. This produces a large lookup table, and the &quot;human&quot; in HCH is just a call to this lookup table. The question of &quot;is HCH corrigible&quot; is about &quot;has Alice&#x27;s team successfully designed a set of rules that perform corrigible reasoning&quot;?</p>
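<p>As a toy illustration of the Lookup Table model (the table contents, question format, and depth limit below are invented for illustration only):</p>

```python
# Toy sketch of HCH where the "human" H is a lookup table from a question to
# (subquestions, recipe for combining their answers). Entirely illustrative.

LOOKUP_TABLE = {
    "Is it safe to deploy system S?": (
        ["What could go wrong if S is deployed?", "Can each failure be mitigated?"],
        lambda answers: "yes" if answers[1] == "yes" else "no",
    ),
    "What could go wrong if S is deployed?": ([], lambda answers: "unknown failure modes"),
    "Can each failure be mitigated?": ([], lambda answers: "no"),
}

def hch(question: str, depth: int = 10) -> str:
    """Answer a question by consulting the lookup-table 'human', recursing on subquestions."""
    if depth == 0 or question not in LOOKUP_TABLE:
        return "cannot answer"
    subquestions, combine = LOOKUP_TABLE[question]
    sub_answers = [hch(q, depth - 1) for q in subquestions]
    return combine(sub_answers)

print(hch("Is it safe to deploy system S?"))  # -> "no" in this toy table
```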
<p>To justify the claim that these rules perform corrigible reasoning, we point to Alice&#x27;s team having a large body of AI safety knowledge, proofs of properties of the system, demonstrations of the system working in practice, etc.</p><p><strong>Overseer&#x27;s Manual:</strong> An AI safety research team led by Alice has written a manual on how to corrigibly answer questions by decomposing them into subquestions. This manual is handed to Bob, who was hired to decompose tasks. Bob carefully studies the manual and applies the rules in it when he is performing his task (and the quality of his work is monitored by the team). Alice&#x27;s team has carefully thought about how to decompose tasks, and <a href="https://ought.org/"><u>performed many experiments with people like Bob trying to decompose tasks</u></a>. So they understand the space of strategies and outputs that Bob will produce given the manual. The &quot;human&quot; in HCH is actually a human (Bob), but in effect Bob is acting as a compressed lookup table, and is only necessary because the lookup table is too large to write down. An analogy is that it would take too much space and time to write down a list of translations of all possible 10-word sentences from English to German, but it is possible to train humans who, given any 10-word English sentence, can produce the German translation. The safety properties come from Alice&#x27;s team&#x27;s preparations, which include modelling how Bob would produce answers after reading the manual. To justify the safety of the system, we again point to Alice&#x27;s team having a large body of AI safety knowledge, proofs of properties of the system, demonstrations of the system working in practice, etc.</p><p>I claim that the Mechanical Turk scenario gives an incomplete picture of why we might hope for an HCH system to be safe. Though it might be safer than a computation without human involvement, I would find it hard to trust that this system would continue to scale without running into problems, like handing over control deliberately or <u><a href="https://www.lesswrong.com/posts/SqcPWvvJJwwgZb6aH/prize-for-probable-problems#8dHK3dCwrjtHrqyae">accidentally</a></u> to some unsafe computational process. The Mechanical Turk scenario leaves out the design process that Alice’s team carries out in the Lookup Table and Overseer’s Manual scenarios, which can include at least some consideration of AI safety issues (though how much of this is necessary is an open question). I think this design process, if done right, is the thing that could give the system the ability to avoid these problems as it scales. I think that we should keep these stronger Lookup Table and Overseer’s Manual scenarios in mind when considering whether HCH might be safe.</p><p>(Thanks to Andreas Stuhlmüller and Owain Evans for feedback on a draft of this post)</p> william_s 4JuKoFguzuMrNn6Qr 2019-02-09T00:46:25.729Z Comment by William_S on Can there be an indescribable hellworld? https://www.lesswrong.com/posts/rArsypGqq49bk4iRr/can-there-be-an-indescribable-hellworld#bZmTdyXqNTwac5N2z <p>Do you think you&#x27;d agree with a claim of this form applied to corrigibility of plans/policies/actions? 
</p><p>That is: If some plan/policy/action is incorrigible, then A can provide some description of how the action is incorrigible.</p> william_s bZmTdyXqNTwac5N2z 2019-01-31T20:03:23.882Z Comment by William_S on Why we need a *theory* of human values https://www.lesswrong.com/posts/zvrZi95EHqJPxdgps/why-we-need-a-theory-of-human-values#mzWjjP4zghP4q3Coa <blockquote>The better we can solve the key questions (&quot;what are these &#x27;wiser&#x27; versions?&quot;, &quot;how is the whole setup designed?&quot;, &quot;what questions exactly is it trying to answer?&quot;), the better the wiser ourselves will be at their tasks.</blockquote><p>I feel like this statement suggests that we might not be doomed if we make a bunch of progress, but not full progress, on these questions. I agree with that assessment, but on reading the post it felt like it was making the claim &quot;Unless we fully specify a correct theory of human values, we are doomed&quot;. </p><p>I think that I&#x27;d view something like Paul&#x27;s indirect normativity approach as requiring that we do enough thinking in advance to get some critical set of considerations known by the participating humans, but once that&#x27;s in place we should be able to go from this core set to get the rest of the considerations. And it seems possible that we can do this without a fully-solved theory of human values (but any theoretical progress we can make in advance on defining human values is quite useful).</p> william_s mzWjjP4zghP4q3Coa 2018-12-29T00:01:46.830Z Comment by William_S on Three AI Safety Related Ideas https://www.lesswrong.com/posts/vbtvgNXkufFRSrx4j/three-ai-safety-related-ideas#EMQ2XpHsgKJMtRuGC <p>My interpretation of what you&#x27;re saying here is that the overseer in step #1 can do a lot of things to bake in having the AI interpret &quot;help the user get what they really want&quot; in ways that get the AI to try to eliminate human safety problems for the step #2 user (possibly entirely), but problems might still occur in the short term before the AI is able to think/act to remove those safety problems.</p><p>It seems to me that this implies that IDA essentially solves the AI alignment portion of points 1 and 2 in the original post (modulo things happening before the AI is in control).</p> william_s EMQ2XpHsgKJMtRuGC 2018-12-20T21:36:23.176Z Comment by William_S on A comment on the IDA-AlphaGoZero metaphor; capabilities versus alignment https://www.lesswrong.com/posts/yXFKh2jGysQNfX2NM/a-comment-on-the-ida-alphagozero-metaphor-capabilities#PsRpTTaTaDcDQ6tit <p>Correcting all problems in the subsequent amplification stage would be a nice property to have, but I think IDA can still work even if it corrects errors with multiple A/D steps in between (as long as all catastrophic errors are caught before deployment). 
For example, I could think of the agent initially using some rules for how to solve math problems where distillation introduces some mistake, but later in the IDA process the agent learns how to rederive those rules and realizes the mistake.</p> william_s PsRpTTaTaDcDQ6tit 2018-07-19T21:41:18.523Z Comment by William_S on A general model of safety-oriented AI development https://www.lesswrong.com/posts/idb5Ppp9zghcichJ5/a-general-model-of-safety-oriented-ai-development#4gAePd76MdoTfH4oD <p>Shorter name candidates:</p><p>Inductively Aligned AI Development</p><p>Inductively Aligned AI Assistants</p><p></p> william_s 4gAePd76MdoTfH4oD 2018-06-13T20:21:35.174Z Comment by William_S on A general model of safety-oriented AI development https://www.lesswrong.com/posts/idb5Ppp9zghcichJ5/a-general-model-of-safety-oriented-ai-development#fSPQj44Sp594qx7iw <p>It&#x27;s a nice property of this model that it prompts consideration of the interaction between humans and AIs at every step (to highlight things like risks of the humans having access to some set of AI systems for manipulation or moral hazard reasons).</p> william_s fSPQj44Sp594qx7iw 2018-06-13T20:20:03.086Z Comment by William_S on Poker example: (not) deducing someone's preferences https://www.lesswrong.com/posts/b3TLbcwfJK5rmcwXc/poker-example-not-deducing-someone-s-preferences#DHjzfHWq63HDRbAyC <p>In the higher-dimensional belief/reward space, do you think that it would be possible to significantly narrow down the space of possibilities (so this argument is saying &quot;be Bayesian with respect to rewards/beliefs, picking policies that work over a distribution&quot;), or are you more pessimistic than that, thinking that the uncertainty would be so great in higher-dimensional spaces that it would not be possible to pick a good policy?</p> william_s DHjzfHWq63HDRbAyC 2018-06-13T18:53:25.062Z Comment by William_S on Amplification Discussion Notes https://www.lesswrong.com/posts/LbJHizyfAsDYeETBq/amplification-discussion-notes#5cyk2GfAetSD9keAD <p>Open Question: Working with concepts that the human can’t understand</p><p>Question: When we need to assemble complex concepts by learning/interacting with the environment, rather than using H&#x27;s concepts directly, and when those concepts influence reasoning in subtle/abstract ways, how do we retain corrigibility/alignment?</p><p>Paul: I don&#x27;t have any general answer to this, seems like we should probably choose some example cases. I&#x27;m probably going to be advocating something like &quot;Search over a bunch of possible concepts and find one that does what you want / has the desired properties.&quot;</p><p>E.g. for elegant proofs, you want a heuristic that gives successful lines of inquiry higher scores. You can explore a bunch of concepts that do that, evaluate each one according to how well it discriminates good from bad lines of inquiry, and also evaluate other stuff like &quot;What would I infer from learning that a proof is &#x60;elegant&#x60; other than that it will work&quot; and make sure that you are OK with that.</p><p>Andreas: Suppose you don&#x27;t have the concepts of &quot;proof&quot; and &quot;inquiry&quot;, but learned them (or some more sophisticated analogs) using the sort of procedure you outlined below. I guess I&#x27;m trying to see in more detail that you can do a good job at &quot;making sure you&#x27;re OK with reasoning in ways X&quot; in cases where X is far removed from H&#x27;s concepts. 
(Unfortunately, it seems to be difficult to make progress on this by discussing particular examples, since examples are necessarily about concepts we know pretty well.)</p><p>This may be related to the more general question of what sorts of instructions you&#x27;d give H to ensure that if they follow the instructions, the overall process remains corrigible/aligned.</p> william_s 5cyk2GfAetSD9keAD 2018-06-01T19:04:19.114Z Comment by William_S on Amplification Discussion Notes https://www.lesswrong.com/posts/LbJHizyfAsDYeETBq/amplification-discussion-notes#QCvx5eMw2S2WZBjQ5 <p>Open Question: Severity of “Honest Mistakes”</p><p>In the discussion about creative problem solving, Paul said that he was concerned about problems arising when the solution generator was deliberately searching for a solution with harmful side effects. Other failures could occur where the solution generator finds a solution with harmful side effects without “deliberately searching” for it. The question is how bad these “honest mistakes” would end up being.</p><p>Paul: I also want to make the further claim that such failures are much less concerning than what-I&#x27;m-calling-alignment failures, which is a possible disagreement we could dig into (I think Wei Dai disagrees or is very unsure).</p> william_s QCvx5eMw2S2WZBjQ5 2018-06-01T19:04:01.100Z Amplification Discussion Notes https://www.lesswrong.com/posts/LbJHizyfAsDYeETBq/amplification-discussion-notes <p>Paul Christiano, Wei Dai, Andreas Stuhlmüller and I had an online chat discussion recently; <u><a href="https://dynalist.io/d/YGwuNf1YmxOmMMMg4f8t98Sq">the transcript of the discussion is available here</a></u>. (Disclaimer that it’s a nonstandard format and we weren&#x27;t optimizing for ease of understanding the transcript). This discussion was primarily focused on amplification of humans (not later amplification steps in IDA). Below are some highlights from the discussion, and I’ll include some questions that were raised that might merit further discussion in the comments.</p><h2>Highlights</h2><p><strong>Strategies for sampling from a human distribution of solutions:</strong></p><blockquote>Paul: For example you can use &quot;Use random human example,&quot; or &quot;find an analogy to another example you know and use it to generate an example,&quot; or whatever. </blockquote><blockquote>There is some subtlety there, where you want to train the model that sample from the real human distribution rather than from the empirical distribution of 10 proposals you happen to have collected so far. If samples are cheap that&#x27;s fine. Otherwise you may need to go further to &quot;Given that [X1, X2, ...] are successful designs, what is a procedure that can produce additional successful designs?&quot; or something like that. Not sure.</blockquote><p><strong>Dealing with unknown concepts</strong></p><blockquote>Andreas: Suppose you get a top-level command that contains words that H doesn&#x27;t understand (or just doesn&#x27;t look at), say something like &quot;Gyre a farbled bleg.&quot;. You have access to some data source that is in principle enough to learn the meanings of those words. 
What might the first few levels of questions + answers look like?</blockquote><blockquote>Paul: possible questions: &quot;What&#x27;s the meaning of the command&quot;, which goes to &quot;What&#x27;s the meaning of word X&quot; for the words X in the sentence, &quot;What idiomatic constructions are involved in this sentence?&quot;, &quot;What grammatical constructions are involved in the sentence&quot;</blockquote><blockquote>Answers to those questions are big trees representing meanings, e.g. a list of properties of &quot;gyre&quot; (what properties the subject and object typically have, under what conditions it is said to have occurred, why someone might want you to do it, tons of stuff most of which will be irrelevant for the query)</blockquote><blockquote>Which come from looking up definitions, proposing definitions and seeing how well they match with usage in the cases you can look at, etc.</blockquote><p><strong>Limits on what amplification can accomplish</strong></p><blockquote>Paul: In general, if ML can&#x27;t learn to do a task, then that&#x27;s fine with me. And if ML can learn to do a task but only using data source X, then we are going to have to integrate data source X into the amplification process in order for amplification to be able to solve it, there is no way to remove the dependence on arbitrary data sources. And there will exist data sources which pose alignment issues, independent of any alignment issues posed by the ML.</blockquote><p><strong>Alignment search for creative solutions</strong></p><p>The task of generating a solution to a problem that requires creativity can be decomposed into:</p><p>Generate solutions</p><p>Evaluate those solutions</p><p>For solution generation, one idea is to shape the distribution of proposals so you are less likely to get malign answers (ie. sample from the distribution of answers a human would give, which would hopefully be more likely to be safe/easily evaluated compared to some arbitrary distribution). </p><p>I asked Paul if he thought that safe creative solution generation would require sampling from a less malign distribution, or whether he thought we could solve evaluation (“secure-X-evaluation”, ie. testing whether the solution fulfills property X) well enough to use an arbitrary distribution/brute force search.</p><blockquote>Paul: I don&#x27;t see a good way to avoid solving secure X-evaluation anyway. It seems to me like we can generate solutions in ways that put much lower probability on malign answers, but it neither seems like we can totally eliminate that (I don&#x27;t think human creativity totally eliminates that either), nor that we will always have access to some more-aligned human generator </blockquote><blockquote>The best I&#x27;d probably say is that we can have a generation process that is not itself malign, not clear if that is helpful at all though.</blockquote><p>We then dived into how well we could solve secure X-evaluation. I was particularly interested in questions like how we could evaluate whether a design had potentially harmful side-effects.</p><blockquote>Paul: I think what we want is something like: if the designing process knows that X is bad, then the evaluator will also know it. 
If the designing process doesn&#x27;t know that X is bad, then that&#x27;s not malign.</blockquote><blockquote>[to be clear, for this discussion we only need security in the infinite limit; in practice the capability of both sides will be limited by the capability of the ML, so we&#x27;ll also need something to make sure the evaluating-ML does better than the generator-ML, but that seems like a separate issue.]</blockquote><blockquote>William: If you imagine slowly increasing the intelligence of the generator, then for any heuristic, it might first start picking solutions that fulfill that heuristic more often before actually understanding the heuristic, and it might take longer after that before the generator understands that the heuristic works because of a causal pathway that involves negative side effects. Is it the case that you&#x27;d say that this is an acceptable outcome/something that we can&#x27;t really get past?</blockquote><blockquote>Paul: If neither the evaluator nor generator knows about the negative side effect, it&#x27;s hard for the negative side effect to lead to higher evaluations. I agree this can happen sometimes (I wrote the implicit extortion post to give an example, there are certainly others), but they seem OK to accept as &quot;honest mistakes&quot; so far, i.e. none of them pose an existential risk.</blockquote><blockquote>in terms of &quot;what amplification is supposed to accomplish,&quot; if there is a problem that could just as well afflict a human who is trying their best to help me get what I want (and has the AI&#x27;s profile of abilities), then I&#x27;m basically considering that out of scope.</blockquote><blockquote>Whatever we could tell to a human, to help them avoid this kind of problem, we could also tell to an aligned AI, so the problem is factored into (a) help a human avoid the problem, (b) build aligned AI.</blockquote> william_s LbJHizyfAsDYeETBq 2018-06-01T19:03:35.294Z Comment by William_S on Challenges to Christiano’s capability amplification proposal https://www.lesswrong.com/posts/S7csET9CgBtpi7sCh/challenges-to-christiano-s-capability-amplification-proposal#aLvBXSQsaLWjRfZd5 <blockquote>I would solve X-and-only-X in two steps:</blockquote><blockquote>First, given an agent and an action which has been optimized for undesirable consequence Y, we&#x27;d like to be able to tell that the action has this undesirable side effect. I think we can do this by having a smarter agent act as an overseer, and giving the smarter agent suitable insight into the cognition of the weaker agent (e.g. by sharing weights between the weak agent and an explanation-generating agent). This is what I&#x27;m calling informed oversight.</blockquote><blockquote>Second, given an agent, identify situations in which it is especially likely to produce bad outcomes, or proofs that it won&#x27;t, or enough understanding of its internals that you can see why it won&#x27;t. 
This is discussed in “<u><a href="https://ai-alignment.com/techniques-for-optimizing-worst-case-performance-39eafec74b99">Techniques for Optimizing Worst-Case Performance</a></u>.”</blockquote><p>Paul, I&#x27;m curious whether you&#x27;d see it as necessary for these techniques to work that the optimization target is pretty good/safe (but not perfect): ie. some safety comes from the fact that the agents optimized for approval or imitation only have a limited class of Y&#x27;s that they might also end up being optimized for.</p> william_s aLvBXSQsaLWjRfZd5 2018-05-26T22:58:13.323Z Comment by William_S on Challenges to Christiano’s capability amplification proposal https://www.lesswrong.com/posts/S7csET9CgBtpi7sCh/challenges-to-christiano-s-capability-amplification-proposal#8tDzGKpDuz72oZxJs <blockquote>So I also don&#x27;t see how Paul expects the putative alignment of the little agents to pass through this mysterious aggregation form of understanding, into alignment of the system that understands Hessian-free optimization.</blockquote><p>My model of Paul&#x27;s approach sees the alignment of the subagents as just telling you that no subagent is trying to actively sabotage your system (ie. by optimizing to find the worst possible answer to give you), and that the alignment comes from having thought carefully about how the subagents are supposed to act in advance (in a way that could potentially be run just by using a lookup table).</p> william_s 8tDzGKpDuz72oZxJs 2018-05-26T22:54:32.134Z Comment by William_S on Resolving human values, completely and adequately https://www.lesswrong.com/posts/Y2LhX3925RodndwpC/resolving-human-values-completely-and-adequately#g6fhNfNeBKsRooYsx <p>Glad to see this work on possible structure for representing human values, which can include disagreement between values and structured biases.</p><p>I had some half-formed ideas vaguely related to this, which I think map onto an alternative way to resolve self-reference.</p><p>Rather than just having one level of values that can refer to other values on the same level (which potentially leads to a self-reference cycle), you could instead explicitly represent each level of value, with level 0 values referring to concrete reward functions, level 1 values endorsing or negatively endorsing level 0 values, and generally level n values only endorsing or negatively endorsing level n-1 values. This might mean that you have some kinds of values that end up being duplicated between multiple levels. For any n, there&#x27;s a unique solution to the level of endorsement for every concrete value. We can then consider the limit as n-&gt;infinity as the true level of endorsement. This allows for situations where the limit fails to converge (ie. it alternates between different values at odd and even levels), which seems like a way to handle self-reference contradictions (possibly also the all-or-nothing problem if it results from a conflict between meta-levels). </p><p>I think this maps into the case where we don&#x27;t distinguish between value levels if we define a function that just adjusts the endorsement of each value by the values that directly refer to it.</p>
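<p>A rough sketch of what this adjustment could look like (the value graph, signs, and update rule below are hypothetical choices made purely for illustration):</p>

```python
# Toy sketch of the endorsement-adjustment idea: values refer to other values with a
# positive or negative sign, and one step of the update nudges each value's weight by
# the signed weights of the values that directly refer to it. (A real version would
# presumably normalize or damp the update; none of this is specified in the comment.)

# endorsements[a] = {b: +1 or -1}: value `a` endorses (+1) or negatively endorses (-1) value `b`.
endorsements = {
    "honesty-meta": {"always-tell-truth": +1, "white-lies-ok": -1},
    "kindness-meta": {"white-lies-ok": +1},
}

def step(weights):
    """One application of the adjustment function."""
    new = dict(weights)
    for source, targets in endorsements.items():
        for target, sign in targets.items():
            new[target] = new[target] + sign * weights[source]
    return new

weights = {"always-tell-truth": 1.0, "white-lies-ok": 1.0, "honesty-meta": 1.0, "kindness-meta": 1.0}
for _ in range(10):   # repeated application; oscillation here mirrors the non-convergent case
    weights = step(weights)
```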
<p>Then iterating this function n times gives the equivalent of having an n-level meta-hierarchy.</p><p>I think there might be interesting work in mapping this strategy into some simple value problem, and then trying to perform Bayesian value learning in that setting with some reasonable prior over values/value endorsements.</p> william_s g6fhNfNeBKsRooYsx 2018-05-16T18:23:05.370Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#8tioNjbyKpeF5wbuM <p>Ah, right. I guess I was balking at moving from exorbitant to exp(exorbitant). Maybe it&#x27;s better to think of this as reducing the size of fully worked initial overseer example problems that can be produced for training/increasing the number of amplification rounds that are needed.</p><p>So my argument is more an example of what a distilled overseer could learn as an efficient approximation.</p> william_s 8tioNjbyKpeF5wbuM 2018-04-24T18:18:04.818Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#CDCmzPbp9Lgz8vcdn <p>I guess what we&#x27;re trying to unpack is &quot;the mechanism that makes decisions from that database&quot;, and whether it can be efficient. If you store all experience with no editing, you&#x27;re left with the same problem of &quot;what decisions in the tree do you change based on experience (X,y) and how do you change them?&quot; And it seems like the most naive way of doing this is too inefficient (looking at all data at all decision nodes).</p><p>If you summarize experience (what I&#x27;m interpreting &quot;decide how to update some small sketch&quot; as meaning), then the question is can you produce that summary in an efficient way, and efficiently use it at runtime?</p><p>The gradient-descent style solution is meant to gesture at how this could be done in a way analogous to how we use parametric models rather than non-parametric models in ML for decently sized datasets. In advance, bake in a set of parameters/information that could influence each decision, then do updates by having the overseer look for ways to change the decision parameters. But when we do this in an oversight model, we can make the decisions/parameters interpretable sentences.</p><p>Rough example of how I see the overall process going:</p><p>1. The amplified overseer acts in an environment with a breakable vase, but the overseer does not know that the vase is breakable</p><p>2. First time it acts, some overseer node N looks into the database for any information related to the vase, and returns nothing, and proceeds with its decision</p><p>3. The vase breaks, and the negative feedback is given to the amplified overseer</p><p>4. Amplified overseer runs some process to figure out that it could have made a better decision if the database contained the knowledge that vases are breakable, and so updates the database. (The thing that Wei Dai and I are trying to unpack is, can this step be run in a reasonable amount of time, given that each node of the overseer has only a limited understanding of the original decision process)</p><p>5. Second time it acts, the amplified overseer process is the same, up until the node N that does the database lookup is run. 
This time, the lookup returns that the vase is breakable, so the overseer runs a different decision path and treats the vase with more care.</p> william_s CDCmzPbp9Lgz8vcdn 2018-04-24T16:42:01.486Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#hge2qutvMDEKgHvf9 <blockquote>What if the current node is responsible for the error instead of one of the subqueries, how do you figure that out?</blockquote><p>I think you&#x27;d need to form the decomposition in such a way that you could fix any problem through perturbing something in the world representation (an extreme version is you have the method for performing every operation contained in the world representation and looked up, so you can adjust it in the future).</p><blockquote> When you do backprop, you propagate the error signal through all the nodes, not just through a single path that is &quot;most responsible&quot; for the error, right? If you did this with meta-execution, wouldn&#x27;t it take an exponential amount of time?</blockquote><p>One step of this method, as in backprop, has the same time complexity as the forward pass (running meta-execution forward, which I wouldn&#x27;t call exponential complexity, as I think the relevant baseline is the number of nodes in the meta-execution forward tree). You only need to process each node once (when the backprop signal for its output is ready), and need to do a constant amount of work at each node (figure out all the ways to perturb the node&#x27;s input).</p><p>The catch is that, as with backprop, maybe you need to run multiple steps to get it to actually work.</p><blockquote>And what about nodes that are purely symbolic, where there are multiple ways the subnodes (or the current node) could have caused the error, so you couldn&#x27;t use the right answer for the current node to figure out what the right answer is from each subnode? (Can you in general structure the task tree to avoid this?)</blockquote><p>The default backprop answer to this is to shrug and adjust all of the inputs (which is what you get from taking the first order gradient). If this causes problems, then you can fix them in the next gradient step. That seems to work in practice for backprop in continuous models. For discrete models like this it might be a bit more difficult - if you start to try out different combinations to see if they work, that&#x27;s where you&#x27;d get exponential complexity. But we&#x27;d get to counter this by potentially having cases where, based on understanding the operation, we could intelligently avoid some branches - I think this could potentially wash out to linear complexity in the number of forward nodes if it all works well.</p><blockquote>I wonder if we&#x27;re on the right track at all, or if Paul has an entirely different idea about this.</blockquote><p>So do I :)</p> william_s hge2qutvMDEKgHvf9 2018-04-23T23:57:13.728Z Comment by William_S on Can corrigibility be learned safely? 
https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#w6wPPkH8PwPRH9RWp <p>Huh, I hadn&#x27;t thought of this as trying to be a direct analogue of gradient descent, but now that I think about your comment that seems like an interesting way to approach it.</p><blockquote>A human debugging a translation software could look at the return value of some high-level function and ask &quot;is this return value sensible&quot; using their own linguistic intuition, and then if the answer is &quot;no&quot;, trace the execution of that function and ask the same question about each of the function it calls. This kind of debugging does not seem available to meta-execution trying to debug itself, so I just don&#x27;t see any way this kind of learning / error correction could work.</blockquote><p></p><p>I think instead of asking &quot;is this return value sensible&quot;, the debugging overseer process could start with some computation node where it knows what the return value should be (the final answer), and look at each of the subqueries of that node and ask for each subquery &quot;how can I modify this subquery&#x27;s answer to make the query&#x27;s answer more correct&quot;, then recurse into the subquery. This seems pretty analogous to gradient descent, with the potential advantage that the overseer&#x27;s understanding of the function at each node could be better than naively taking the gradient (understanding the operation could yield something that takes into account higher-order terms in the operation).</p><p>I&#x27;m curious now whether you could run a more efficient version of gradient descent if you replace the gradient at each step with an overseer human who can harness some intuition to try to do better than the gradient.</p> william_s w6wPPkH8PwPRH9RWp 2018-04-23T18:33:51.958Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#5v7Q6iJptCEH7upeH <blockquote>What if the field of linguistics as a whole is wrong about some concept or technique, and as a result all of the humans are wrong about that? It doesn&#x27;t seem like using different random seeds would help, and there may not be another approach that can be taken that avoids that concept/technique.</blockquote><p>Yeah, I don&#x27;t think simple randomness would recover from this level of failure (only that it would help with some kinds of errors, where we can sample from a distribution that doesn&#x27;t make that error sometimes). I don&#x27;t know if anything could recover from this error in the middle of a computation without reinventing the entire field of linguistics from scratch, which might be too much to ask. However, I think it could be possible to recover from this error if you get feedback about the final output being wrong.</p><p></p><blockquote>But in IDA, H is fixed and there&#x27;s no obvious way to figure out which parts of a large task decomposition tree was responsible for the badly translated sentence and therefore need to be changed for next time.</blockquote><p>I think that the IDA task decomposition tree could be created in such a way that you can reasonably trace back which part was responsible for the misunderstanding/that needs to be changed. The structure you&#x27;d need for this is that given a query, you can figure out which of its children would need to be corrected to get the correct result.</p>
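<p>A toy sketch of this trace-back process (the node structure and the idea of asking for corrections per child are stand-ins for "ask the overseer how to perturb this node's inputs"; this is not meta-execution itself):</p>

```python
# Toy sketch: starting from a node whose correct output is known, record the correction,
# ask which children's answers would need to change, and recurse (each node visited once).
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Node:
    question: str
    answer: str
    children: List["Node"] = field(default_factory=list)
    # Given a corrected answer for this node, return {child_index: corrected child answer};
    # None means this node has no idea / is a leaf.
    propose_child_corrections: Optional[Callable[[str], Dict[int, str]]] = None

def trace_back(node: Node, corrected_answer: str, corrections: Dict[str, str]) -> None:
    """Record the correction for this node, then recurse into the children that
    the overseer thinks need different answers (analogous to one backprop pass)."""
    corrections[node.question] = corrected_answer
    if node.propose_child_corrections is None:
        return
    for child_index, child_correction in node.propose_child_corrections(corrected_answer).items():
        trace_back(node.children[child_index], child_correction, corrections)

# Usage sketch: after learning the correct final answer, run
#   corrections = {}
#   trace_back(root, correct_final_answer, corrections)
# and apply `corrections` to whatever store (e.g. the database of facts) feeds those nodes.
```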
<p>So if you have a specific word to correct, you can find the subagent that generated that word, then look at its inputs, see which input is correct, trace where that came from, etc. This might need to be deliberately engineered into the task decomposition (in the same way that differently written programs accomplishing the same task could be easier or harder to debug).</p> william_s 5v7Q6iJptCEH7upeH 2018-04-23T15:55:45.522Z Understanding Iterated Distillation and Amplification: Claims and Oversight https://www.lesswrong.com/posts/yxzrKb2vFXRkwndQ4/understanding-iterated-distillation-and-amplification-claims <p>[Background: Intended for an audience that has some familiarity with Paul Christiano’s approach to AI Alignment. Understanding <u><a href="https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616">Iterated Distillation and Amplification</a></u> should provide sufficient background.]</p><p>[Disclaimer: When I talk about “what Paul claims”, I am only summarizing what I think he means through reading his blog and participating in discussions on his posts. I could be mistaken/misleading in these claims]</p><p>I’ve recently updated my mental model of how Paul Christiano’s approach to AI alignment works, based on recent blog posts and discussions around them (in which I found Wei Dai’s comments particularly useful). I think that the update that I made might be easy to miss if you haven’t read the right posts/comments, so I think it’s useful to lay it out here. I cover two parts: understanding the limits on what Paul’s approach claims to accomplish, and understanding the role of the overseer in Paul’s approach. These considerations are important to understand if you’re trying to evaluate how likely this approach is to work, or trying to make technical progress on it.</p><h2>What does Paul’s approach claim to accomplish?</h2><p>First, it’s important to understand what “Paul’s approach to AI alignment” claims to accomplish if it were carried out. The term “approach to AI alignment” can sound like it means “recipe for building a superintelligence that safely solves all of your problems”, but this is not how Paul intends to use this term. Paul goes into this in more detail in <u><a href="https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6">Clarifying “AI alignment”</a></u>.</p><p>A rough summary is that his approach will only build an agent that is as capable as some known unaligned machine learning algorithm. </p><p>He does not claim that the end result of his approach is an agent that:</p><ul><li>Can directly solve all problems which can be solved by a human</li><li>Will never take an unsafe catastrophic action</li><li>Will never take an action based on a misunderstanding of your commands or your values</li><li>Could safely design successor agents or self-improve</li><li>Will have higher capability than an unaligned competitor</li></ul><p>It’s important to understand the limits of what Paul’s approach claims in order to understand what it would accomplish, and the strategic situation that would result.</p><h2>What is the Overseer?</h2><p><u><a href="https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616">Iterated Distillation and Amplification</a></u> (IDA) describes a procedure that tries to take an overseer and produce an agent that does what the overseer would want it to do, with a reasonable amount of training overhead. “What the overseer would want it to do” is defined by repeating the amplification procedure. 
The post refers to amplification as the overseer using a number of machine-learned assistants to solve problems. We can bound what IDA could accomplish by thinking about what the overseer could do if it could delegate to a number of copies of itself to solve problems (for a human overseer, this corresponds to HCH). To understand what this approach can accomplish, it’s important to understand what the overseer is doing. I think there are two different models of the overseer that could be inferred from different parts of the discussion around Paul’s work, which I label high bandwidth oversight and low bandwidth oversight.</p><h2>High Bandwidth Overseer</h2><p>The impression that I get from many of Paul’s posts is the following:</p><p>A high bandwidth overseer is a human that takes in an input and has some amount of time (between 15 minutes and a day) to process it. The human can look at as much of the input as it can within the allowed time, and when it delegates a sub-task it can look at the results. The main requirement for a human acting as an overseer is that they are trying to be helpful. The overseer can take in arbitrary natural-language requests and attempt to fulfill them.</p><p>The main question that comes to mind considering this model of the overseer is the problem of competent task decomposition:</p><p>Can we break down tasks into reasonably sized pieces (ie. can you solve a math problem from a textbook in 15 minutes by delegating to a number of people who don’t understand the material and only have 15 minutes each to interact with the textbook)?</p><p>This high bandwidth oversight model can be easier to use when trying to get a rough idea of how oversight works in order to understand how IDA works as a whole.</p><h2>Low Bandwidth Overseer</h2><p>However, looking at <u><a href="https://ai-alignment.com/universality-and-security-amplification-551b314a3bab">Universality and Security Amplification</a></u> I found a different picture of what the overseer looks like:</p><p>A low bandwidth overseer is only allowed to process inputs from a finite input set. This includes both the original input query, and the results of subqueries. The overseer can pass information about the input or from a subquery to other copies of itself without looking at them.</p><p>To give a sense for the size of the input set, in <u><a href="https://ai-alignment.com/universality-and-security-amplification-551b314a3bab">Universality and Security Amplification</a></u>, Paul speculates that input sets of size 10⁸ would probably be safe and input sets of size 10³⁰ would probably be unsafe.</p><p>To give a sense of what this implies, the range of input sizes 10⁸ to 10³⁰ roughly corresponds to (a rough numerical check of these ranges is sketched after this list):</p><ul><li>6 to 24 pixels of arbitrary 4-bit greyscale images</li><li>10 to 38 words from typical English text</li><li>arbitrary phrases of 2 to 7 words from a vocabulary of 10000 words</li><li>arbitrary strings of 5 to 20 lowercase alphabetic characters</li></ul><p>A source of potential confusion is that Paul still talks about a human fulfilling the role of the low bandwidth overseer (there are reasons for this that I won’t cover here). But when the space of information the human overseer can consider is reduced to a finite set, we could simply evaluate what the human does on every element in that set and produce a lookup table that replaces the human.</p>
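<p>A quick numerical check of three of these ranges (pure counting; the typical-English-text case depends on an entropy-per-word assumption and is omitted):</p>

```python
# Rough check of the counting behind three of the bullets above.
import math

def log10_count(alphabet_size: int, length: int) -> float:
    """log10 of the number of arbitrary sequences of `length` symbols."""
    return length * math.log10(alphabet_size)

print(log10_count(16, 6), log10_count(16, 24))        # 4-bit greyscale pixels: ~10^7.2 to ~10^28.9
print(log10_count(10_000, 2), log10_count(10_000, 7)) # phrases from a 10,000-word vocabulary: 10^8 to 10^28
print(log10_count(26, 5), log10_count(26, 20))        # lowercase alphabetic strings: ~10^7.1 to ~10^28.3
```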
<p>In other words, if you don’t think that some task could be accomplished by an amplified lookup table as overseer, then you shouldn’t think it could be accomplished with a low bandwidth overseer.</p><p>This system is introduced in order to avoid security issues as the system scales. The fear is that there would be an “attack” on the system: an input that could be shown to an overseer that would cause the overseer to become corrupted and try to sabotage the system. This could be some kind of misleading philosophical argument, some form of blackmail, a human adversarial example, etc. If an input like this exists, then as soon as the first agent is corrupted, it can try to spread the attack to other agents. The first agent could be corrupted either by chance, or through an attack being included in the input. Paul’s proposal in <u><a href="https://ai-alignment.com/universality-and-security-amplification-551b314a3bab">Universality and Security Amplification</a></u> is to avoid this by limiting the set of inputs that the overseer can process, and then use various techniques to verify that the overseer is not attackable on that set of inputs. I read this as implying that he thinks that a high bandwidth overseer will be sufficiently vulnerable to attack that after some number of rounds of amplification, the overseer would be likely to become misaligned.</p><h3>How does a low bandwidth overseer impact the difficulty of IDA?</h3><p>Previously, I was mostly starting with the model of a high bandwidth overseer when thinking about IDA. I now think that Paul’s approach is intended to be using low bandwidth oversight, and that reasoning about the difficulty of the approach requires using this model.</p><p>Using a low bandwidth overseer more clearly requires solving a set of hard problems prior to using IDA at all:</p><ul><li>Figuring out how to decompose tasks into low-bandwidth pieces. The task decomposition needs to be both corrigible and capable. </li><ul><li>Corrigibility requires understanding of AI safety concepts. For example, breaking down the task “What action does the user want me to take?” into the two subtasks “What are the user’s values?” and “What action is best according to these values?” is not corrigible. It produces an action optimised for some approximate model of the user’s values, which could be misaligned.</li><li>Capability requires having the ability to complete “enough” tasks. Paul acknowledges that the overseer will no longer be able to directly perform some tasks as well as a human (ie. they won’t be able to translate sentences as well if they only are able to look at small pieces of the original text). But in this case, he claims that it would be okay for the overseer to instead understand how to safely use information provided by external resources (ie. human translators).</li></ul><li>Clarifying the currently imprecise definition of “corrigibility” in order to be able to reason about whether a given task decomposition is corrigible.</li><li>Sufficiently understanding meta-philosophy to implement it explicitly using collaborations of “lookup table overseers”. Even if we have an overseer that is corrigible and capable of performing well-understood and well-defined tasks, it’s not clear that if we ask it to start evaluating tricky philosophical questions it will come up with aligned answers. This roughly looks like figuring out an intractably large algorithm that will output the products of MIRI’s Highly Reliable Agent Design research (ie. 
produce a set of criteria for what makes a good decision theory, a search space of possible decision theories, a prior over that space).</li></ul><p>When considering the high bandwidth oversight regime, these problems seem more like they could be (at least partially) delegated to the high bandwidth overseer at runtime. For example, one could think of preparing a summary of the relevant factors of a decision and giving it to a highly competent human overseer. The human overseer might be able to make good enough decisions without having an explicit model of decision theory (especially if odd edge cases don’t come up in practice). </p><h2>Evaluating IDA</h2><p>There are a number of lines of reasoning that one could use to try to get a sense of whether IDA could work, or how it should be prioritized relative to other approaches. Here, I summarize some of these considerations and indicate which versions of oversight they apply to, as I think this is important to be clear about. I don’t make claims about how these should be weighed, or claim to be exhaustive (there are additional considerations that would weigh on both forms of oversight, or other parts of IDA).</p><h3>Evaluating High Bandwidth Oversight</h3><p><strong>Positive Considerations</strong></p><ul><li>Reasons for general optimism about “learning to reason from humans”, such as those that Daniel Dewey discusses in <u><a href="http://effective-altruism.com/ea/1ca/my_current_thoughts_on_miris_highly_reliable/">his thoughts on MIRI&#x27;s &quot;highly reliable agent design&quot; work</a></u>. For example, the individual overseers are acting like humans. This would lead the system to consider problems in the way that humans would, incorporating common sense about human motivations, coming up with solutions that are easier for humans to understand and less likely to contain weird surprises, and operating on inputs that are understandable.</li><li>If it seems like high bandwidth task decomposition could solve difficult problems (which is the question that Ought is running experiments to investigate)</li><li>It seems like some version of IDA with high bandwidth oversight could be implemented today</li><li>High bandwidth oversight seems like it could be implemented without solving many of the hard philosophical/mathematical problems that are faced by HRAD (decision theory, induction, etc.), instead trading them for difficult machine learning problems. </li></ul><p><strong>Negative Consideration: Not Scalable and Secure</strong></p><p>In high bandwidth oversight, if one copy of the overseer is corrupted, this corruption could easily spread. As more copies of the overseer are used to solve a problem, there is a higher chance of one copy becoming corrupted. Or an adversary could introduce a corrupting input into the system.</p><h3>Evaluating Low Bandwidth Oversight</h3><p><strong>Positive Considerations</strong></p><ul><li>Paul’s intuitions suggest that it is possible to work with low bandwidth oversight, and he has discussed some aspects of how various problems might be handled.</li><li>If it seems plausible that meta-philosophy might be easier than object-level philosophy - ie. 
defining an algorithm that we can’t run directly but that outputs the correct decision theory might be easier than specifying the correct decision theory directly.</li><li>Paul claims that this approach will scale with increasing capability for any distillation algorithm.</li></ul><p><strong>Negative Consideration: No longer “learning to reason from humans”</strong></p><p>I think that IDA with a low bandwidth overseer is not accurately described as “AI learns to reason from humans”, but rather as “Humans figure out how to reason explicitly, then the AI learns from the explicit reasoning”. As Wei Dai has pointed out, amplified low bandwidth oversight will not actually end up reasoning like a human. Humans have implicit knowledge that helps them perform tasks when they see the whole task. But not all of this knowledge can be understood and broken into smaller pieces. Low bandwidth oversight requires that the overseer not use any of this knowledge.</p><p>Now, it’s quite possible that performance could still be recovered by doing things like searching over a solution space, or by reasoning about when it is safe to use training data from insecure humans. But these solutions could look quite different from human reasoning. In discussion on Universality and Security Amplification, Paul describes why he thinks that a low bandwidth overseer could still perform image classification, but the process looks very different from a human using their visual system to interpret the image:</p><p>“I’ve now played three rounds of the following game (inspired by Geoffrey Irving who has been thinking about debate): two debaters try to convince a judge about the contents of an image, e.g. by saying “It’s a cat because it has pointy ears.” To justify these claims, they make still simpler claims, like “The left ears is approximately separated from the background by two lines that meet at a 60 degree angle.” And so on. Ultimately if the debaters disagree about the contents of a single pixel then the judge is allowed to look at that pixel. This seems to give you a tree to reduce high-level claims about the image to low-level claims (which can be followed in reverse by amplification to classify the image). I believe the honest debater can quite easily win this game, and that this pretty strongly suggests that amplification will be able to classify the image.”</p>
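<p>A heavily simplified sketch of the final step of this game (illustrative toy code only, not Paul’s or Geoffrey Irving’s actual setup; the recursive reduction from high-level claims down to individual pixels is elided). Once the debaters disagree about a single pixel, the judge checks that pixel directly, so the honest debater wins the ground-level dispute:</p><pre><code>image = [[0, 0, 1],
         [0, 1, 1],
         [1, 1, 1]]   # the judge never looks at the whole image at once

def honest_debater(pixel):
    row, col = pixel
    return image[row][col]        # reports the disputed pixel truthfully

def dishonest_debater(pixel):
    row, col = pixel
    return 1 - image[row][col]    # lies about the disputed pixel

def judge(disputed_pixel, claim_a, claim_b):
    # The judge resolves the disagreement by looking at that one pixel only.
    row, col = disputed_pixel
    truth = image[row][col]
    return "debater A wins" if claim_a == truth else "debater B wins"

# A high-level disagreement ("mostly ones" vs "mostly zeros") gets reduced,
# claim by claim, to a dispute about one pixel, which the judge can settle.
disputed = (0, 2)
print(judge(disputed, honest_debater(disputed), dishonest_debater(disputed)))
</code></pre>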
<h2>Conclusion: Weighing Evidence for IDA</h2><p>The important takeaway is that considering IDA requires clarifying whether you are considering IDA with high or low bandwidth oversight. Then, only count considerations that actually apply to that approach. I think there’s a way to misunderstand the approach where you mostly think about high bandwidth oversight and count the fact that it feels somewhat understandable, seems plausible to you that it could work, and appears to avoid some hard problems. But if you then also count Paul’s opinion that it could work, you may be overconfident - the approach that Paul claims is most likely to work is the low bandwidth oversight approach.</p><p>Additionally, I think it’s useful to consider both models as alternative tools for understanding oversight: for example, the problems in low bandwidth oversight might be less obvious but still important to consider in the high bandwidth oversight regime.</p><p>After understanding this, I am more nervous about whether Paul’s approach would work if implemented, due to the additional complications of working with low bandwidth oversight. I am somewhat optimistic that further work (such as fleshing out how particular problems could be addressed through low bandwidth oversight) will shed light on this issue, and either make it seem more likely to succeed or yield more understanding of why it won’t succeed. I’m also still optimistic about Paul’s approach yielding ideas or insights that could be useful for designing aligned AIs in different ways.</p><h2>Caveat: high bandwidth oversight could still be useful to work on</h2><p>High bandwidth oversight could still be useful to work on for the following reasons:</p><ul><li>If you think that other solutions could be found to the security problem in high bandwidth oversight. Paul claims that low bandwidth oversight is the most likely solution to security issues within the overseer, but he thinks it may be possible to make IDA with high bandwidth oversight secure using various techniques for <u><a href="https://ai-alignment.com/techniques-for-optimizing-worst-case-performance-39eafec74b99">optimizing worst-case performance</a></u> on the final distilled agent, even if the overseer is insecure (see <u><a href="https://ai-alignment.com/two-guarantees-c4c03a6b434f">https://ai-alignment.com/two-guarantees-c4c03a6b434f</a></u>).</li><li>It could help make progress on low bandwidth oversight. If high bandwidth oversight fails, then so will low bandwidth oversight. If high bandwidth oversight succeeds, then we might be able to break down each of the subtasks into low bandwidth tasks, directly yielding a low bandwidth overseer. I think the <u><a href="https://ought.org/projects/factored-cognition">factored cognition experiments</a></u> planned by Ought plausibly fall into this category.</li><li>If you think it could be used as a medium-term alignment solution or a fallback plan if no other alignment approach is ready in time. This seems like it would only work if it is used for limited tasks and a limited amount of time, in order to extend the time window for preparing a truly scalable approach. 
In this scenario, it would be very useful to have techniques that could help us understand how far the approach could be scaled before failure.</li></ul> william_s yxzrKb2vFXRkwndQ4 2018-04-17T22:36:29.562Z Comment by William_S on Utility versus Reward function: partial equivalence https://www.lesswrong.com/posts/JewWDfLoxgFtJhNct/utility-versus-reward-function-partial-equivalence#fKAPHmjRbuYgQBChW <p>Ah, misunderstood that, thanks.</p> william_s fKAPHmjRbuYgQBChW 2018-04-16T17:03:02.145Z Comment by William_S on Utility versus Reward function: partial equivalence https://www.lesswrong.com/posts/JewWDfLoxgFtJhNct/utility-versus-reward-function-partial-equivalence#DL3EayMZDBs9QNJ42 <p>Say w2a is the world where the agent starts in w2 and w2b is the world that results after the agent moves from w1 to w2.</p><p>Without considering the agent&#x27;s memory as part of the world, it seems like the problem is worse: the only way to distinguish between w2a and w2b is the agent&#x27;s memory of past events, so it seems that leaving the agent&#x27;s memory of the past out of the utility function requires U(w2a) = U(w2b).</p> william_s DL3EayMZDBs9QNJ42 2018-04-16T15:08:30.412Z Comment by William_S on Two clarifications about "Strategic Background" https://www.lesswrong.com/posts/hL9ennoEfJXMj7r2D/two-clarifications-about-strategic-background#qpRYqwQkSE9bBZEhg <p>Would you think that the following approach would fit within &quot;in addition to making alignment your top priority and working really hard to over-engineer your system for safety, <em>also</em> build the system to have the bare minimum of capabilities&quot; and possibly work, or would you think that it would be hopelessly doomed?</p><ul><li>Work hard on designing the system to be safe</li><li>But there&#x27;s some problem left over that you haven&#x27;t been able to fully solve, and think will manifest at a certain scale (level of intelligence/optimization power/capabilities)</li><li>Run the system, but limit scale to stay well within the range where you expect it to behave well</li></ul> william_s qpRYqwQkSE9bBZEhg 2018-04-15T17:01:39.498Z Comment by William_S on Utility versus Reward function: partial equivalence https://www.lesswrong.com/posts/JewWDfLoxgFtJhNct/utility-versus-reward-function-partial-equivalence#dAxkvohviWFNoZoXM <p>I&#x27;m trying to wrap my head around the case where there are two worlds, w1 and w2; w2 is better than w1, but moving from w1 to w2 is bad (ie. killing everyone and replacing them with different people who are happier, and we think this is bad).</p><p>I think for the equivalence to work in this case, the utility function U also needs to depend on your current state - if it&#x27;s the same for all states, then the agent would always prefer to move from w1 to w2 and erase its memory of the past when maximizing the utility function, whereas it would act correctly with the reward function.</p> william_s dAxkvohviWFNoZoXM 2018-04-13T16:02:55.058Z
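<p>A toy worked example with made-up numbers (not from the thread) of the point in the two comments above: a utility function that looks only at the current world state is forced to treat w2a and w2b the same, so it endorses the move-and-forget plan, whereas a reward function defined on the transition itself can penalise it:</p><pre><code>U = {"w1": 1.0, "w2a": 2.0, "w2b": 2.0}   # forced: U(w2a) == U(w2b)

R = {("w1", "w1"): 0.0,                   # staying put is neutral
     ("w1", "w2b"): -10.0,                # the killing-and-replacing move is judged bad
     ("w2a", "w2a"): 0.0}

options_from_w1 = {"stay": "w1", "move_and_forget": "w2b"}

best_by_utility = max(options_from_w1, key=lambda a: U[options_from_w1[a]])
best_by_reward = max(options_from_w1, key=lambda a: R[("w1", options_from_w1[a])])

print(best_by_utility)   # "move_and_forget" -- the unwanted behaviour
print(best_by_reward)    # "stay"            -- matches the intended judgement
</code></pre>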
Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#dduayscqDFjzbrxFb <blockquote>how does IDA recover from an error on H&#x27;s part?</blockquote><p>Error recovery could be supported by having a parent agent run multiple versions of a query in parallel with different approaches (or different random seeds).</p><blockquote>And also, how does it improve itself using external feedback </blockquote><p>I think this could be implemented as: part of the input for a task is a set of information on background knowledge relevant to the task (ie. a model of what the user wants, background information about translating the language). The agent can have a task &quot;Update [background knowledge] after receiving [feedback] after providing [output] for task [input]&quot;, which outputs a modified version of [background knowledge], based on the feedback.</p> william_s dduayscqDFjzbrxFb 2018-04-10T15:09:59.476Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#5KobFNgwauuM3nFWv <blockquote>The only way I know how to accomplish this is to have IDA emulate the deep learning translator at a very low level, with H acting as a &quot;human transistor&quot; or maybe a &quot;human neuron&quot;, and totally ignore what H knows about translation including the meanings of words.</blockquote><p>The human can understand the meaning of the word they see; they just can&#x27;t know the context (the words that they don&#x27;t see), and so can&#x27;t use their understanding of that context. </p><p>They could try to guess possible contexts for the word and leverage their understanding of those contexts (&quot;what are some examples of sentences where the word could be used ambiguously?&quot;), but they aren&#x27;t allowed to know if any of their guesses actually apply to the text they are currently working on (and so their answer is independent of the actual text they are currently working on).</p> william_s 5KobFNgwauuM3nFWv 2018-04-09T17:11:24.636Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#zenwbzDTKoBNhn92G <p>Okay, I agree that we&#x27;re on the same page. Amplify(X,n) is what I had in mind.</p> william_s zenwbzDTKoBNhn92G 2018-04-07T14:40:23.513Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#wrKz4FcY2zakkBDsc <p>I was thinking of things more in line with Paul&#x27;s version, not this finding-ambiguity definition, where the goal is to avoid doing some kind of malign optimization during search (ie. an untrained assistant thinks it&#x27;s a good idea to use the universal prior, then you show them <a href="https://ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/">What does the universal prior actually look like?</a>, and afterwards they know not to do that).</p> william_s wrKz4FcY2zakkBDsc 2018-04-05T20:29:22.676Z
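<p>A toy sketch (made-up function names, not from the thread) of the two mechanisms described in the comment above on recovering from an error on H&#x27;s part and improving from external feedback: running several copies of a subquery in parallel and reconciling the answers, and a task that returns an updated copy of [background knowledge] given [feedback]:</p><pre><code>import random
from collections import Counter

def run_copy(agent, query, seed):
    rng = random.Random(seed)            # each copy gets a different seed/approach
    return agent(query, rng)

def reconcile(answers):
    # The parent agent reconciles the parallel attempts; majority vote here.
    return Counter(answers).most_common(1)[0][0]

def robust_query(agent, query, copies=5):
    return reconcile([run_copy(agent, query, seed) for seed in range(copies)])

def update_background_knowledge(knowledge, task_input, output, feedback):
    # "Update [background knowledge] after receiving [feedback] after providing
    # [output] for task [input]" -- returns a modified copy of the knowledge.
    updated = dict(knowledge)
    updated["history"] = list(knowledge.get("history", [])) + [
        {"input": task_input, "output": output, "feedback": feedback}]
    return updated

# Example: an agent that occasionally errs; the parallel vote usually recovers.
def flaky_agent(query, rng):
    return rng.choice(["right", "right", "right", "right", "wrong"])

print(robust_query(flaky_agent, "some subquestion"))
</code></pre>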
Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#4zArriiZx2A7yoxMq <blockquote>Can you give an example of natural language instruction (for humans operating on small inputs) that can&#x27;t be turned into a formal algorithm easily?</blockquote><p>Any set of natural language instructions for humans operating on small inputs can be turned into a lookup table by executing the human on all possible inputs (multiple times on each input, if you want to capture a stochastic policy).</p><p>For example, present the human with the following: &quot;Consider the sentence [s1] w [s2]&quot;, and have the agent launch queries of the form &quot;Consider the sentence [s1] w [s2], where we take w to have meaning m&quot;. Now, you could easily produce this behaviour algorithmically if you have a dictionary. But in a world without dictionaries, suitably preparing a human to answer this query takes much less effort than producing a dictionary.</p> william_s 4zArriiZx2A7yoxMq 2018-04-04T22:04:23.262Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#5R4EQZumh7YMKwnHF <blockquote>By &quot;corrigible&quot; here did you mean Paul&#x27;s definition which doesn&#x27;t include competence in modeling the user and detecting ambiguities, or what we thought &quot;corrigible&quot; meant (where it does include those things)?</blockquote><p>Thinking of &quot;corrigible&quot; as &quot;whatever Paul means when he says corrigible&quot;. The idea applies to any notion of corrigibility which allows for multiple actions and does not demand that the action returned be the best possible one for the user.</p> william_s 5R4EQZumh7YMKwnHF 2018-04-04T21:52:57.420Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#58BiCCunnNptn2dhQ <p>What is the difference between &quot;core after a small number of amplification steps&quot; and &quot;core after a large number of amplification steps&quot; that isn&#x27;t captured in &quot;larger effective computing power&quot; or &quot;larger set of information about the world&quot;, and allows the highly amplified core to solve these problems?</p> william_s 58BiCCunnNptn2dhQ 2018-04-04T19:49:11.861Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#tNkv3Dcis7bjTF2Ka <blockquote>I don&#x27;t have optimism about finding a core which is already highly competent at these tasks.</blockquote><p>I&#x27;m a little confused about what this statement means. I thought that if you have an overseer that implements some reasoning core, and consider amplify(overseer) with infinite computation time and unlimited ability to query the world (ie. 
for background information on what humans seem to want, how they behave, etc.), then amplify(overseer) should be able to solve any problem that an agent produced by iterating IDA could solve.</p><p>Did you mean to say that:</p><ul><li>&quot;already highly competent at these tasks&quot; means that the core should be able to solve these problems without querying the world at all, and this is not likely to be possible?</li><li>you don&#x27;t expect to find a core such that only one round of amplification of amplify(overseer) can solve practical tasks in any reasonable amount of time/number of queries?</li><li>there is some other way that the agent produced by IDA would be more competent than the original amplified overseer?</li></ul> william_s tNkv3Dcis7bjTF2Ka 2018-04-04T17:45:01.393Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#EZeCRbAWawu2fHnyr <blockquote>Among people I&#x27;ve had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand.</blockquote><p>Additionally, I think that there are ways to misunderstand the IDA approach that leave out significant parts of the complexity (ie. IDA based off of humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand), but that can seem to be plausible things to talk about in terms of &quot;solving the AI alignment problem&quot; if one hasn&#x27;t understood the more subtle problems that would occur. It&#x27;s then easy to miss the problems and feel optimistic about IDA working while underestimating the amount of hard philosophical work that needs to be done, or to incorrectly attack the approach for missing the problems completely.</p><p>(I think that these simpler versions of IDA might be worth thinking about as a plausible fallback plan if no other alignment approach is ready in time, but only if they are restricted to accomplishing specific tasks to stabilise the world, restricted in how far the amplification is taken, replaced with something better as soon as possible, etc. I also think that working on simple versions of IDA can help make progress on issues that would be required for fully scalable IDA, ie. the experiments that Ought is running.)</p> william_s EZeCRbAWawu2fHnyr 2018-04-04T15:45:26.875Z Comment by William_S on Can corrigibility be learned safely? https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#nR5X6pnuYPLPjGgSc <p>I would see the benefits of humans vs. algorithms being that giving a human a bunch of natural language instructions would be much easier (but harder to verify) than writing down a formal algorithm. Also, the training could just cover how to avoid taking incorrigible actions, and the Overseer could still use their judgement of how to perform competently within the space of corrigible outputs.</p> william_s nR5X6pnuYPLPjGgSc 2018-04-03T19:10:32.587Z Comment by William_S on Can corrigibility be learned safely? 
https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-corrigibility-be-learned-safely#XERa63K5H9cAawoeb <p>Trying to understand the boundary lines around incorrigibility, looking again at this example from Universality and Security Amplification:</p><blockquote>For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.</blockquote><p>From this, it sounds like this only counts as incorrigible if the optimization in “What behavior is best according to those values?” is effectively optimizing for something that the user doesn&#x27;t want, but is not incorrigible if it is optimizing for something that the user doesn&#x27;t want in a way that the user can easily correct? (so incorrigibility requires something more than just being malign)</p><p>One way to describe this is that the decomposition is incorrigible if the models of the user that are used in “What behavior is best according to those values?” are better than the models used in “What does the user want?” (as this could lead the AI to maximize an approximation V* of the user&#x27;s values V and realize that if the AI reveals to the user that they are maximizing V*, the user will try to correct what the AI is doing, which will perform worse on V*).</p><p>So acceptable situations are where both subqueries get the same user models, the first subquery gets a better user model than the second, or the situation where “What behavior is best according to those values?” is performing some form of mild optimization. Is that roughly correct?</p> william_s XERa63K5H9cAawoeb 2018-04-03T19:08:22.234Z Improbable Oversight, An Attempt at Informed Oversight https://www.lesswrong.com/posts/5bd75cc58225bf06703751d5/improbable-oversight-an-attempt-at-informed-oversight william_s 5bd75cc58225bf06703751d5 2017-05-24T17:43:53.000Z Informed Oversight through Generalizing Explanations https://www.lesswrong.com/posts/5bd75cc58225bf06703753ab/informed-oversight-through-generalizing-explanations william_s 5bd75cc58225bf06703753ab 2017-05-24T17:43:39.000Z Proposal for an Implementable Toy Model of Informed Oversight https://www.lesswrong.com/posts/5bd75cc58225bf06703753ac/proposal-for-an-implementable-toy-model-of-informed-oversight william_s 5bd75cc58225bf06703753ac 2017-05-24T17:43:13.000Z