Fuzzy Boundaries, Real Concepts 2018-05-07T03:39:33.033Z · score: 58 (16 votes)
Roleplaying As Yourself 2018-01-06T06:48:03.510Z · score: 85 (31 votes)
The Loudest Alarm Is Probably False 2018-01-02T16:38:05.748Z · score: 162 (63 votes)
Value Learning for Irrational Toy Models 2017-05-15T20:55:05.000Z · score: 0 (0 votes)
HCH as a measure of manipulation 2017-03-11T03:02:53.000Z · score: 1 (1 votes)
Censoring out-of-domain representations 2017-02-01T04:09:51.000Z · score: 2 (2 votes)
Vector-Valued Reinforcement Learning 2016-11-01T00:21:55.000Z · score: 2 (2 votes)
Cooperative Inverse Reinforcement Learning vs. Irrational Human Preferences 2016-06-18T00:55:10.000Z · score: 2 (2 votes)
Proof Length and Logical Counterfactuals Revisited 2016-02-10T18:56:38.000Z · score: 3 (3 votes)
Obstacle to modal optimality when you're being modalized 2015-08-29T20:41:59.000Z · score: 3 (3 votes)
A simple model of the Löbstacle 2015-06-11T16:23:22.000Z · score: 2 (2 votes)
Agent Simulates Predictor using Second-Level Oracles 2015-06-06T22:08:37.000Z · score: 2 (2 votes)
Agents that can predict their Newcomb predictor 2015-05-19T10:17:08.000Z · score: 1 (1 votes)
Modal Bargaining Agents 2015-04-16T22:19:03.000Z · score: 3 (3 votes)
[Clearing out my Drafts folder] Rationality and Decision Theory Curriculum Idea 2015-03-23T22:54:51.241Z · score: 6 (7 votes)
An Introduction to Löb's Theorem in MIRI Research 2015-03-23T22:22:26.908Z · score: 16 (17 votes)
Welcome, new contributors! 2015-03-23T21:53:20.000Z · score: 4 (4 votes)
A toy model of a corrigibility problem 2015-03-22T19:33:02.000Z · score: 4 (4 votes)
New forum for MIRI research: Intelligent Agent Foundations Forum 2015-03-20T00:35:07.071Z · score: 36 (37 votes)
Forum Digest: Updateless Decision Theory 2015-03-20T00:22:06.000Z · score: 5 (5 votes)
Meta- the goals of this forum 2015-03-10T20:16:47.000Z · score: 3 (3 votes)
Proposal: Modeling goal stability in machine learning 2015-03-03T01:31:36.000Z · score: 1 (1 votes)
An Introduction to Löb's Theorem in MIRI Research 2015-01-22T20:35:50.000Z · score: 2 (2 votes)
Robust Cooperation in the Prisoner's Dilemma 2013-06-07T08:30:25.557Z · score: 73 (71 votes)
Compromise: Send Meta Discussions to the Unofficial LessWrong Subreddit 2013-04-23T01:37:31.762Z · score: -2 (18 votes)
Welcome to Less Wrong! (5th thread, March 2013) 2013-04-01T16:19:17.933Z · score: 27 (28 votes)
Robin Hanson's Cryonics Hour 2013-03-29T17:20:23.897Z · score: 29 (34 votes)
Does My Vote Matter? 2012-11-05T01:23:52.009Z · score: 19 (37 votes)
Decision Theories, Part 3.75: Hang On, I Think This Works After All 2012-09-06T16:23:37.670Z · score: 23 (24 votes)
Decision Theories, Part 3.5: Halt, Melt and Catch Fire 2012-08-26T22:40:20.388Z · score: 31 (32 votes)
Posts I'd Like To Write (Includes Poll) 2012-05-26T21:25:31.019Z · score: 14 (15 votes)
Timeless physics breaks T-Rex's mind [LINK] 2012-04-23T19:16:07.064Z · score: 22 (29 votes)
Decision Theories: A Semi-Formal Analysis, Part III 2012-04-14T19:34:38.716Z · score: 23 (28 votes)
Decision Theories: A Semi-Formal Analysis, Part II 2012-04-06T18:59:35.787Z · score: 16 (19 votes)
Decision Theories: A Semi-Formal Analysis, Part I 2012-03-24T16:01:33.295Z · score: 23 (25 votes)
Suggestions for naming a class of decision theories 2012-03-17T17:22:54.160Z · score: 5 (8 votes)
Decision Theories: A Less Wrong Primer 2012-03-13T23:31:51.795Z · score: 72 (76 votes)
Baconmas: The holiday for the sciences 2012-01-05T18:51:10.606Z · score: 5 (5 votes)
Advice Request: Baconmas Website 2012-01-01T19:25:40.308Z · score: 11 (11 votes)
[LINK] "Prediction Audits" for Nate Silver, Dave Weigel 2011-12-30T21:07:50.916Z · score: 12 (13 votes)
Welcome to Less Wrong! (2012) 2011-12-26T22:57:21.157Z · score: 26 (27 votes)
Improving My Writing Style 2011-10-11T16:14:40.907Z · score: 6 (9 votes)
Decision Theory Paradox: Answer Key 2011-09-05T23:13:33.256Z · score: 6 (6 votes)
Consequentialism Need Not Be Nearsighted 2011-09-02T07:37:08.154Z · score: 55 (55 votes)
Decision Theory Paradox: PD with Three Implies Chaos? 2011-08-27T19:22:15.046Z · score: 19 (29 votes)
Why are certain trends so precisely exponential? 2011-08-06T17:38:42.140Z · score: 16 (17 votes)
Nature: Red, in Truth and Qualia 2011-05-29T23:50:28.495Z · score: 44 (38 votes)
A Study of Scarlet: The Conscious Mental Graph 2011-05-27T20:13:26.876Z · score: 29 (34 votes)
Seeing Red: Dissolving Mary's Room and Qualia 2011-05-26T17:47:55.751Z · score: 39 (43 votes)
Perspectivism and the Real World 2011-01-10T23:33:23.077Z · score: 1 (6 votes)


Comment by orthonormal on "Can We Survive Technology" by von Neumann · 2019-08-18T19:02:18.688Z · score: 53 (20 votes) · LW · GW

Quick approximate summary:

  • John von Neumann first says, essentially, that the Industrial Revolution has made the world smaller, and that in earlier centuries the problems it caused were confined to individual nations, but that now they extend to the entire globe.
    • For a first example, he talks about cheap energy and industrial synthesis, though he predicts that nuclear fusion and transmutation would be much more practically feasible than they have turned out to be.
    • He briefly mentions expected major improvement in automation, communication, and transportation.
    • He then talks about anthropogenic climate change and the broad possibilities of geoengineering the climate.
    • All of these technologies can vastly improve human life, or destroy it.
  • He immediately rules out the "solution" of [preventing advances in technology] as both undesirable (because it blocks the positive uses) and impossible (because it would require total coordination and a total change in human values).
  • He next considers the possibility of permanently avoiding war through diplomacy, etc., and does not think that the 1950s drive for world peace will last long; furthermore, such an initiative would need to adapt to ever-more-powerful technologies as fast as they are introduced.
  • He frames the upcoming decades as a dangerous but useful evolution, where we will either succeed or fail catastrophically, and doesn't sound especially optimistic. Our best hope is to innovate new political forms that are capable of handling major threats with patience, flexibility, and intelligence.
Comment by orthonormal on Keeping Beliefs Cruxy · 2019-07-29T23:18:54.604Z · score: 7 (4 votes) · LW · GW

Alas, double-cruxing is a two-person game: you can't make another person do it; you can only do it with someone who genuinely wants to. For everyone else, the best related trick I know is nonviolent communication.

Comment by orthonormal on The Real Rules Have No Exceptions · 2019-07-23T22:17:16.796Z · score: 19 (12 votes) · LW · GW

Meta: I approve of the practice of arguing against your own post in a comment.

Comment by orthonormal on Simple Rules of Law · 2019-07-16T04:33:44.801Z · score: 2 (1 votes) · LW · GW

How much less do you expect this to happen under the current system?

Comment by orthonormal on The AI Timelines Scam · 2019-07-11T04:59:45.435Z · score: 8 (6 votes) · LW · GW
both historically and now, criticism is often met with counterarguments based on "style" rather than engaging with the technical meat of the criticism

Is there any group of people who reliably don't do this? Is there any indication that AI researchers do this more often than others?

Comment by orthonormal on If physics is many-worlds, does ethics matter? · 2019-07-11T02:29:10.515Z · score: 13 (8 votes) · LW · GW

Eliezer's real answer to this question is discussed in Timeless Control. Basically, choice is still meaningful in many-worlds or any other physically deterministic universe. There are incredibly few Everett branches starting from here where tomorrow I go burn down an orphanage, and this is genuinely caused by the fact that I robustly do not want to do that sort of thing.

If you have altruistic motivation, then the Everett branches starting from here are in fact better (in expectation) than the branches starting from a similar universe with a version of you that has no altruistic motivation. By working to do good, you are in a meaningful sense causing the multiverse to contain a higher proportion of good worlds than it otherwise would.

It really does all add up to normality, even if it feels counterintuitive.

Comment by orthonormal on How I Ended Up Non-Ambitious · 2019-07-08T06:03:44.437Z · score: 4 (2 votes) · LW · GW

Well, this post aged interestingly for those of us who know the author (who ended up working for a high-profile EA organization for some time).

Comment by orthonormal on Causal Reality vs Social Reality · 2019-06-27T22:48:38.979Z · score: 2 (1 votes) · LW · GW


it is not the done the thing

it is not the done thing, perhaps?

Comment by orthonormal on Writing children's picture books · 2019-06-27T19:10:54.865Z · score: 14 (5 votes) · LW · GW

Maybe you can get the best of both worlds by imagining you're writing a children's book, but that your editor is in fact an expert on the subject and you don't want to embarrass yourself in front of them.

Comment by orthonormal on Quotes from Moral Mazes · 2019-06-04T00:55:40.036Z · score: 15 (5 votes) · LW · GW

And Robin Hanson was surprised that no big corporation wanted to implement a real prediction market?

Comment by orthonormal on Egoism In Disguise · 2019-06-04T00:37:18.486Z · score: 21 (5 votes) · LW · GW

This strongly resembles the argument given by Subhan in EY's post Is Morality Preference?, with a side order of Fake Selfishness. You might enjoy reading those posts along with others in their respective sequences. ("Is Morality Preference?" was part of the original metaethics sequence but didn't make the cut for Rationality: AI to Zombies.)

More to the point, the biggest mistake I see here is the one addressed in The Domain of Your Utility Function: yes, my moral preferences are a part of my map rather than the territory, but there's still a damn meaningful difference between egoism (preferences that point only to the part of my map labeled "my future experiences") and my actual moral preferences, which point to many other parts of the map as well.

Comment by orthonormal on Yes Requires the Possibility of No · 2019-05-21T03:02:07.816Z · score: 17 (11 votes) · LW · GW
I am struggling to understand the goal of the post.

The title was helpful to me in that regard. Each of these examples shows an agent who could run an honest process to get evidence on a question, but who prefers one answer so much that they try to stack the deck in that direction, and thereby lose the hoped-for benefits of that process.

Getting an honest Yes requires running the risk of getting a No instead.

Comment by orthonormal on Coherent decisions imply consistent utilities · 2019-05-14T00:01:56.252Z · score: 2 (1 votes) · LW · GW

Formatting request: can the footnote numbers be augmented with links that jump to the footnote text? (I presume this worked in Arbital but broke when it was moved here.)

Comment by orthonormal on Complex Behavior from Simple (Sub)Agents · 2019-05-12T06:06:31.711Z · score: 13 (4 votes) · LW · GW
I had a notion here that I could stochastically introduce a new goal that would minimize total suffering over an agent's life-history. I tried this, and the most stable solution turned out to be thus: introduce an overwhelmingly aversive goal that causes the agent to run far away from all of its other goals screaming.

did you mean: anhedonia

(No, seriously, your paragraph is an apt description of a long bout I had of depression-induced anhedonia; I felt so averse to every action that I ceased to feel wants, and I consistently marked my mood as neutral rather than negative despite being objectively more severely depressed than I was at other times when I put negative numbers in my mood tracker.)

Comment by orthonormal on Dishonest Update Reporting · 2019-05-05T23:41:37.310Z · score: 21 (7 votes) · LW · GW

The ideal thing is to judge Bob as if he were making the same prediction every day until he makes a new one, and log-score all of them when the event is revealed. (That is, if Bob says 75% on January 1st and 60% on February 1st, and then on March 1st the event is revealed to have happened, Bob's score equals 31*log(.75) + 28*log(.6).) Then Bob's best strategy is to update his prediction to his actual current estimate as often as possible; past predictions are sunk costs.
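
A minimal sketch of that scoring rule in Python (the `daily_log_score` helper and the dates are my own illustration, not anything any forecasting site actually uses): each stated probability is treated as re-asserted every day until it's replaced or the event resolves, and each day is log-scored against the outcome.

```python
import math
from datetime import date

def daily_log_score(predictions, reveal_date, outcome):
    """Score a forecaster as if each stated probability were re-asserted
    every day until it's replaced or the event resolves.

    predictions: list of (date, probability_of_event) pairs in order.
    reveal_date: date the outcome became known.
    outcome: True if the event happened, False otherwise.
    """
    score = 0.0
    # Pair each prediction with the date it stops applying.
    spans = zip(predictions, predictions[1:] + [(reveal_date, None)])
    for (start, p), (end, _) in spans:
        days = (end - start).days
        prob_assigned_to_outcome = p if outcome else 1 - p
        score += days * math.log(prob_assigned_to_outcome)
    return score

# Bob: 75% on Jan 1, updated to 60% on Feb 1; event resolves true on Mar 1.
bob = [(date(2019, 1, 1), 0.75), (date(2019, 2, 1), 0.60)]
print(daily_log_score(bob, date(2019, 3, 1), outcome=True))
# = 31*log(0.75) + 28*log(0.60)
```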

The real-world version is remembering to dock people's bad predictions more, the longer they persisted in them. But of course this is hard.

538 did do this with their self-evaluation, which is a good way to try and establish a norm in the domain of model-driven reporting.

Comment by orthonormal on Pecking Order and Flight Leadership · 2019-05-04T14:57:15.433Z · score: 4 (2 votes) · LW · GW

Let's note differences of degree here. Political systems differ massively in how easily decisionmakers can claim large spoils for themselves, and these differences seem to correlate with how pro-social the decisions tend to be. In particular, the dollar amounts of graft being alleged for politicians in liberal democracies are usually small compared to what despots regularly claim without consequence. (Which is not to say that it would be wise to ignore corruption in liberal democracies!)

Comment by orthonormal on [Answer] Why wasn't science invented in China? · 2019-04-26T04:24:39.155Z · score: 30 (12 votes) · LW · GW

I'm not convinced that Europe had more intellectual freedom on average than China, but because of the patchwork of principalities, it certainly had more variation in intellectual freedom than did a China that was at any given time either mostly unified or mostly at war; and all that you need for an intellectual revolution is the existence of a bastion of intellectual freedom somewhere.

Comment by orthonormal on How does OpenAI's language model affect our AI timeline estimates? · 2019-02-15T22:29:03.024Z · score: 17 (6 votes) · LW · GW

It doesn't move much probability mass to the very near term (i.e. 1 year or less), because both this and AlphaStar aren't really doing consequentialist reasoning, they're just able to get a surprising performance with simpler tricks (the very Markovian nature of human writing, a good position evaluation function) given a whole lot of compute.

However, it does shift my probabilities forward in time, in the sense that one new weird trick to do deductive or consequentialist reasoning, plus a lot of compute, might get you there really quickly.

Comment by orthonormal on The Rocket Alignment Problem · 2018-10-05T18:56:26.687Z · score: 53 (21 votes) · LW · GW

I expect that some otherwise convinceable readers are not going to realize that in this fictional world, people haven't discovered Newton's physics or calculus, and those readers are therefore going to miss the analogy of "this is how MIRI would talk about the situation if they didn't already know the fundamental concepts but had reasons for searching in the right direction". (I'm not thinking of readers incapable of handling that counterfactual, but of readers who aren't great at inferring implicit background facts from a written dialogue. Such readers might get very confused at the unexpected turns of the dialogue and quit rather than figure out what they're baffled by.)

I'd suggest adding to the preamble something like "In a weird world where people had figured out workable aeronautics and basic rocket propulsion by trial and error, but hadn't figured out Newton's laws or calculus".

Comment by orthonormal on Fuzzy Boundaries, Real Concepts · 2018-05-07T20:51:32.092Z · score: 9 (2 votes) · LW · GW
I like your definition, though, and want to try to make a better one (and I acknowledge this is not the point of this post).

I think that's a perfectly valid thing to do in the comments here! However, I think your attempt,

My stab at a refinement of "consent" is "respect for another's choices", where "disrespect" is "deliberately(?) doing something to undermine"

is far too vague to be a useful concept.

In most realistic cases, I can give a definite answer to whether A touched B in a way B clearly did not want to be touched. In the case of my honesty definition, it does involve intent and so I can only infer statistically when someone else is being dishonest vs mistaken, but for myself I usually have an answer about whether saying X to person C would be honest or not.

I don't think I could do the same for your definition; "am I respecting their choices" is a tough query to bottom out in basic facts.

Comment by orthonormal on Local Validity as a Key to Sanity and Civilization · 2018-05-06T21:21:42.045Z · score: 15 (5 votes) · LW · GW

My comment was meant to explain what I understood Eliezer to be saying, because I think you had misinterpreted that. The OP is simply saying "don't give weight to arguments that are locally invalid, regardless of what else you like about them". Of course you need to use priors, heuristics, and intuitions in areas where you can't find an argument that carries you from beginning to end. But being able to think "oh, if I move there, then they can take my queen, and I don't see anything else good about that position, so let's not do that then" is a fair bit easier than proving your move optimal.

Comment by orthonormal on Local Validity as a Key to Sanity and Civilization · 2018-04-22T03:14:38.579Z · score: 24 (7 votes) · LW · GW
Relying purely on local validity won't get you very far in playing chess

The equivalent of local validity is just mechanically checking "okay, if I make this move, then they can make that move" for a bunch of cases. Which, first, is a major developmental milestone for kids learning chess. So we only think it "won't get you very far" because all the high-level human play explicitly or implicitly takes it for granted.

And secondly, it's pretty analogous to doing math; proving theorems is based on the ability to check the local validity of each step, but mathematicians aren't just brute-forcing their way to proofs. They have to develop higher-level heuristics, some of which are really hard to express in language, to suggest avenues, and then check local validity once they have a skeleton of some part of the argument. But if mathematicians stopped doing that annoying bit, well, then after a while you'll end up with another crisis of analysis when the brilliant intuitions are missing some tiny ingredient.

Local validity is an incredibly important part of any scientific discipline; the fact that it's not a part of most political discourse is merely a reflection that our society is at about the developmental level of a seven-year-old when it comes to political reasoning.

Comment by orthonormal on Non-Adversarial Goodhart and AI Risks · 2018-04-04T05:16:34.488Z · score: 4 (1 votes) · LW · GW

Broken link on the text "real killing of birds to reduce pests in China has never been tried".

Comment by orthonormal on The Costly Coordination Mechanism of Common Knowledge · 2018-04-04T04:52:11.400Z · score: 4 (1 votes) · LW · GW

Much of this material is covered very similarly in Melting Asphalt, especially the posts Ads Don't Work That Way and Doesn't Matter, Warm Fuzzies.

Comment by orthonormal on LessWrong Diaspora Jargon Survey · 2018-04-04T04:27:43.733Z · score: 14 (3 votes) · LW · GW

If you do future surveys of this sort, I'd like you to ask people for their probabilities rather than just their best guesses. If people are uncertain but decently calibrated, I'd argue there's not much of a problem; if people are confidently wrong, I'd argue there's a real problem.

Comment by orthonormal on The Meaning of Right · 2018-04-04T00:40:03.309Z · score: 10 (2 votes) · LW · GW

This comment got linked a decade later, and so I thought it's worth stating my own thoughts on the question:

We can consider a reference class of CEV-seeking procedures; one (massively-underspecified, but that's not the point) example is "emulate 1000 copies of Paul Christiano living together comfortably and immortally and discussing what the AI should do with the physical universe; once there's a large supermajority in favor of an enactable plan (which can include further such delegated decisions), the AI does that".

I agree that this is going to be chaotic, in the sense that even slightly different elements of this reference class might end up steering the AI to different basins of attraction.

I assert, however, that I'd consider it a pretty good outcome overall if the future of the world were determined by a genuinely random draw from this reference class, honestly instantiated. (Again with the massive underspecification, I know.)

CEV may be underdetermined and many-valued, but that doesn't mean paperclipping is as good an answer as any.

Re: no basins, it would be a bad situation indeed if the vast majority of the reference class never ended up outputting an action plan, instead deferring and delegating forever. I don't have cached thoughts about that.

Comment by orthonormal on April Fools: Announcing: Karma 2.0 · 2018-04-01T20:41:28.136Z · score: 32 (8 votes) · LW · GW

I for one welcome our new typographical overlords.

Comment by orthonormal on A model I use when making plans to reduce AI x-risk · 2018-03-31T22:47:45.460Z · score: 10 (2 votes) · LW · GW

That's a legit thing to be frustrated by, but I think you know the reason why AI safety researchers don't want "we don't see a way to get to a good outcome except for an aligned project to grab a decisive strategic advantage" to filter into public discourse: it pattern-matches too well to "trust us, you need to let us run the universe".

Comment by orthonormal on A model I use when making plans to reduce AI x-risk · 2018-03-31T19:51:25.352Z · score: 10 (2 votes) · LW · GW

To be clear, I am making the claim that, of the people who have made useful advances on Oracle AI safety research (Armstrong counts here; I don't think Yampolskiy does), all of them believe that the goal of having a safe Oracle AI is to achieve a decisive strategic advantage quickly and get to an aligned future. I recognize that this is a hard claim to evaluate (e.g. because this isn't a statement one could put in a Serious Academic Journal Article in the 2010s, it would have to be discussed on their blog or in private correspondence), but if anyone has a clear counterexample, I'd be interested in seeing it.

Comment by orthonormal on Circling · 2018-03-31T19:36:44.804Z · score: 9 (2 votes) · LW · GW

Yes, this. NVC should be treated with parameters similar to Crocker's Rules: you can declare it for yourself at any time, and you can invite people to a conversation where it's known that everyone will be using it, but you cannot hold it against anyone if you invite them to declare Crocker's Rules and they refuse.

Comment by orthonormal on The abruptness of nuclear weapons · 2018-03-31T17:39:35.539Z · score: 10 (2 votes) · LW · GW

There's a lot of Actually Bad things an AI can do just by making electrons move.

Comment by orthonormal on A model I use when making plans to reduce AI x-risk · 2018-03-31T16:35:35.079Z · score: 12 (3 votes) · LW · GW

I'd be interested in a list of well-managed government science and engineering projects if one exists. The Manhattan Project and the Apollo Project both belong on that list (despite both having their flaws: leaks to the USSR from the former, and the Apollo 1 disaster from the latter); what are other examples?

Comment by orthonormal on A model I use when making plans to reduce AI x-risk · 2018-03-31T16:32:04.175Z · score: 2 (2 votes) · LW · GW

I'm pretty sure that, without exception, anyone who's made a useful contribution on Oracle AI recognizes that "let several organizations have an Oracle AI for a significant amount of time" is a world-ending failure, and that their work is instead progress on questions like "if you can have the only Oracle AI for six months, can you save the world rather than end it?"

Correct me if I'm wrong.

Comment by orthonormal on Roleplaying As Yourself · 2018-01-07T02:36:12.555Z · score: 4 (1 votes) · LW · GW


Comment by orthonormal on The Loudest Alarm Is Probably False · 2018-01-05T02:10:31.451Z · score: 3 (1 votes) · LW · GW

I agree there are broken alarms that are quiet (including those that are broken in the direction of failing to go off, which leads to a blind spot of obliviousness!), and that there are people stuck in situations where there is a correct loud alarm that happens most of the time.

I said that habits are easier to change than alarms, not that they're easy in an absolute sense.

Comment by orthonormal on The Loudest Alarm Is Probably False · 2018-01-04T02:40:48.216Z · score: 18 (6 votes) · LW · GW

It's because the non-broken alarms, which also start out loud, get quieter throughout your life as they calibrate themselves, and as one's habits fix the situations that make them correctly go off. So given a random initial distribution of loudness, eventually the alarm that's loudest on average will probably be a broken one.
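
A toy simulation of that selection effect (all numbers here are arbitrary assumptions, purely to illustrate the argument): every alarm starts at a random loudness, the working alarms get quieter as they recalibrate over the years, and one broken alarm stays where it started; the loudest alarm at the end is then usually the broken one.

```python
import random

def loudest_alarm_is_broken(n_alarms=10, years=30, trials=10_000):
    """Toy model: each alarm starts at a random loudness in [0, 1].
    Working alarms decay toward an appropriate level as they calibrate;
    one randomly chosen broken alarm keeps its initial loudness.
    Returns how often the loudest alarm at the end is the broken one."""
    broken_wins = 0
    for _ in range(trials):
        loudness = [random.random() for _ in range(n_alarms)]
        broken = random.randrange(n_alarms)
        for i in range(n_alarms):
            if i != broken:
                loudness[i] *= 0.9 ** years  # calibration quiets working alarms
        if max(range(n_alarms), key=lambda i: loudness[i]) == broken:
            broken_wins += 1
    return broken_wins / trials

print(loudest_alarm_is_broken())  # close to 1.0 under these toy assumptions
```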

Comment by orthonormal on My Predictions for 2018 (& a Template for Yours) · 2018-01-02T15:07:36.375Z · score: 8 (3 votes) · LW · GW

The formatting didn't come through when importing this post from your blog, especially the strikethroughs of failed predictions and the graphs!

Comment by orthonormal on [deleted post] 2017-05-25T22:13:07.522Z

In the spirit of Murphyjitsu, the most obvious failure mode that you didn't mention is that I expect you to burn out dramatically after a few weeks, from exhaustion or the psychological strain of trying to optimize the experiences of N people. The bootcamp phase is not analogous to anything I've heard of you doing sustainably for an extended period of time.

So, do you expect Dragon Army Barracks to work if Eli has to take over for you in Week Four?

Comment by orthonormal on Proposal for an Implementable Toy Model of Informed Oversight · 2017-04-14T20:36:59.000Z · score: 0 (0 votes) · LW · GW

I like this suggestion of a more feasible form of steganography for NNs to figure out! But I think you'd need further advances in transparency to get useful informed oversight capabilities from (transformed or not) copies of the predictive network.

Comment by orthonormal on HCH as a measure of manipulation · 2017-03-14T01:49:40.000Z · score: 1 (1 votes) · LW · GW

I should have said "reliably estimate HCH"; I'd also want quite a lot of precision in addition to calibration before I trust it.

Comment by orthonormal on HCH as a measure of manipulation · 2017-03-13T21:41:41.000Z · score: 0 (0 votes) · LW · GW

Re #2, I think this is an important objection to low-impact-via-regularization-penalty in general.

Comment by orthonormal on HCH as a measure of manipulation · 2017-03-13T21:29:23.000Z · score: 0 (0 votes) · LW · GW

Re #1, an obvious set of questions to include are questions of approval for various aspects of the AI's policy. (In particular, if we want the AI to later calculate a human's HCH and ask it for guidance, then we would like to be sure that HCH's answer to that question is not manipulated.)

Comment by orthonormal on HCH as a measure of manipulation · 2017-03-13T21:27:33.000Z · score: 0 (0 votes) · LW · GW

There's the additional objection of "if you're doing this, why not just have the AI ask HCH what to do?"

Overall, I'm hoping that it could be easier for an AI to robustly conclude that a certain plan only changes a human's HCH via certain informational content, than for the AI to reliably calculate the human's HCH. But I don't have strong arguments for this intuition.

Comment by orthonormal on All the indifference designs · 2017-03-11T03:28:48.000Z · score: 0 (0 votes) · LW · GW

Question that I haven't seen addressed (and haven't worked out myself): which of these indifference methods are reflectively stable, in the sense that the AI would not push a button to remove them (or switch to a different indifference method)?

Comment by orthonormal on Modal Combat for games other than the prisoner's dilemma · 2017-03-06T23:16:21.000Z · score: 0 (0 votes) · LW · GW

This is a lot of good work! That said, modal combat is increasingly deprecated (in my opinion) compared to studying decision theory with logical inductors, for reasons like the ones you noted in this post; so I'm not sure this is worth developing further.

Comment by orthonormal on Censoring out-of-domain representations · 2017-02-11T01:42:06.000Z · score: 0 (0 votes) · LW · GW

Yup, this isn't robust to extremely capable systems; it's a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one.

(In the example with the agent doing engineering in a sandbox that doesn't include humans or general computing devices, there could be a strong internal gradient to learn obvious details about the things immediately outside its sandbox, and a weaker gradient for learning more distant or subtle things before you know the nearby obvious ones.)

A whitelisting variant would be way more reliable than a blacklisting one, clearly.

Comment by orthonormal on The Pascal's Wager Fallacy Fallacy · 2017-01-10T23:49:12.184Z · score: 1 (1 votes) · LW · GW

How did this post get attributed to [deleted] instead of to Eliezer? I'm 99% sure this post was by him, and the comments seem to bear it out.

Comment by orthonormal on Suggested solution to The Naturalized Induction Problem · 2016-12-27T19:11:42.781Z · score: 2 (2 votes) · LW · GW

This sweeps some of the essential problems under the rug; if you formalize it a bit more, you'll see them.

It's not an artificial restriction, for instance, that a Solomonoff Induction oracle machine doesn't include things like itself in its own hypothesis class, since the question of "whether a given oracle machine matches the observed data" is a question that sometimes cannot be answered by an oracle machine of equivalent power. (There are bounded versions of this obstacle as well.)

Now, there are some ways around this problem (all of them, so far as I know, found by MIRI): modal agents, reflective oracle machines and logical inductors manage to reason about hypothesis classes that include objects like themselves. Outside of MIRI, people working on multiagent systems make do with agents that each assume the other is smaller/simpler/less meta than itself (so at least one of those agents is going to be wrong).

But this entire problem is hidden in your assertion that the agent, which is a Turing machine, "models the entire world, including the agent itself, as one unknown, output-only Turing machine". The only way to find the other problems swept under the rug here is to formalize or otherwise unpack your proposal.

Comment by orthonormal on CFAR’s new focus, and AI Safety · 2016-12-03T05:58:39.623Z · score: 12 (12 votes) · LW · GW

If CFAR will be discontinuing/de-emphasizing rationality workshops for the general educated public, then I'd like to see someone else take up that mantle, and I'd hope that CFAR would make it easy for such a startup to build on what they've learned so far.

Comment by orthonormal on (Non-)Interruptibility of Sarsa(λ) and Q-Learning · 2016-11-28T22:24:58.000Z · score: 0 (0 votes) · LW · GW

Nice! One thing that might be useful for context: what's the theoretically correct amount of time that you would expect an algorithm to spend on the right vs. the left if the session gets interrupted each time it goes 1 unit to the right? (I feel like there should be a pretty straightforward way to calculate the heuristic version where the movement is just Brownian motion that gets interrupted early if it hits +1.)
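
In case a quick numerical check is useful, here's a rough Monte Carlo sketch of that heuristic version (a discrete symmetric random walk standing in for Brownian motion; the step size, session length, and number of sessions are arbitrary assumptions):

```python
import random

def interrupted_walk_times(n_sessions=2_000, steps_per_session=1_000, step=0.05):
    """Each session: a symmetric random walk starts at 0 and is interrupted
    early if it reaches +1. Tally steps spent at positive vs. negative
    positions across all sessions."""
    time_right = 0
    time_left = 0
    for _ in range(n_sessions):
        x = 0.0
        for _ in range(steps_per_session):
            x += step if random.random() < 0.5 else -step
            if x >= 1.0:
                break  # interruption: the session ends here
            if x > 0:
                time_right += 1
            elif x < 0:
                time_left += 1
    return time_right, time_left

right, left = interrupted_walk_times()
print(f"fraction of time spent on the right: {right / (right + left):.3f}")
```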