What Would I Do? Self-prediction in Simple Algorithms 2020-07-20T04:27:25.490Z · score: 50 (13 votes)
Does Agent-like Behavior Imply Agent-like Architecture? 2019-08-23T02:01:09.651Z · score: 45 (21 votes)
Intentional Bucket Errors 2019-08-22T20:02:11.357Z · score: 61 (23 votes)
Risks from Learned Optimization: Conclusion and Related Work 2019-06-07T19:53:51.660Z · score: 65 (19 votes)
Deceptive Alignment 2019-06-05T20:16:28.651Z · score: 64 (18 votes)
The Inner Alignment Problem 2019-06-04T01:20:35.538Z · score: 73 (19 votes)
Conditions for Mesa-Optimization 2019-06-01T20:52:19.461Z · score: 59 (20 votes)
Risks from Learned Optimization: Introduction 2019-05-31T23:44:53.703Z · score: 130 (39 votes)
Yes Requires the Possibility of No 2019-05-17T22:39:32.879Z · score: 129 (57 votes)
Thoughts on Human Models 2019-02-21T09:10:43.943Z · score: 125 (36 votes)
Epistemic Tenure 2019-02-18T22:56:03.158Z · score: 68 (31 votes)
How the MtG Color Wheel Explains AI Safety 2019-02-15T23:42:59.637Z · score: 68 (31 votes)
How does Gradient Descent Interact with Goodhart? 2019-02-02T00:14:51.673Z · score: 71 (20 votes)
Formal Open Problem in Decision Theory 2018-11-29T03:25:46.134Z · score: 32 (17 votes)
The Ubiquitous Converse Lawvere Problem 2018-11-29T03:16:16.453Z · score: 20 (10 votes)
Hyperreal Brouwer 2018-11-29T03:15:23.650Z · score: 29 (11 votes)
Fixed Point Discussion 2018-11-24T20:53:39.545Z · score: 36 (8 votes)
Iteration Fixed Point Exercises 2018-11-22T00:35:09.885Z · score: 36 (11 votes)
Diagonalization Fixed Point Exercises 2018-11-18T00:31:19.683Z · score: 42 (12 votes)
Topological Fixed Point Exercises 2018-11-17T01:40:06.342Z · score: 71 (28 votes)
Fixed Point Exercises 2018-11-17T01:39:50.233Z · score: 52 (22 votes)
Embedded Agency (full-text version) 2018-11-15T19:49:29.455Z · score: 103 (42 votes)
Embedded Curiosities 2018-11-08T14:19:32.546Z · score: 86 (34 votes)
Subsystem Alignment 2018-11-06T16:16:45.656Z · score: 121 (39 votes)
Robust Delegation 2018-11-04T16:38:38.750Z · score: 120 (39 votes)
Embedded World-Models 2018-11-02T16:07:20.946Z · score: 91 (28 votes)
Decision Theory 2018-10-31T18:41:58.230Z · score: 101 (37 votes)
Embedded Agents 2018-10-29T19:53:02.064Z · score: 196 (86 votes)
(A -> B) -> A 2018-09-11T22:38:19.866Z · score: 46 (20 votes)
History of the Development of Logical Induction 2018-08-29T03:15:51.889Z · score: 94 (32 votes)
Bayesian Probability is for things that are Space-like Separated from You 2018-07-10T23:47:49.130Z · score: 79 (35 votes)
Optimization Amplifies 2018-06-27T01:51:18.283Z · score: 102 (38 votes)
Counterfactual Mugging Poker Game 2018-06-13T23:34:59.360Z · score: 79 (30 votes)
On the Chatham House Rule 2018-06-13T21:41:05.057Z · score: 67 (34 votes)
Prisoners' Dilemma with Costs to Modeling 2018-06-05T04:51:30.700Z · score: 173 (67 votes)
The Chromatic Number of the Plane is at Least 5 - Aubrey de Grey 2018-04-11T18:19:50.419Z · score: 127 (39 votes)
New Paper Expanding on the Goodhart Taxonomy 2018-03-14T09:01:59.735Z · score: 50 (12 votes)
Is there a Connection Between Greatness in Math and Philosophy? 2018-03-03T23:25:51.206Z · score: 40 (10 votes)
Robustness to Scale 2018-02-21T22:55:19.155Z · score: 177 (52 votes)
Don't Condition on no Catastrophes 2018-02-21T21:50:31.077Z · score: 82 (26 votes)
A Proper Scoring Rule for Confidence Intervals 2018-02-13T01:45:06.341Z · score: 108 (32 votes)
Knowledge is Freedom 2018-02-09T05:24:54.932Z · score: 63 (16 votes)
Sources of intuitions and data on AGI 2018-01-31T23:30:17.176Z · score: 158 (50 votes)
The Three Levels of Goodhart's Curse 2017-12-30T16:41:25.000Z · score: 3 (3 votes)
Goodhart Taxonomy 2017-12-30T16:38:39.661Z · score: 207 (75 votes)
Logical Updatelessness as a Robust Delegation Problem 2017-10-27T21:16:18.076Z · score: 48 (14 votes)
Conditioning on Conditionals 2017-08-17T01:15:08.000Z · score: 7 (1 votes)
Cooperative Oracles: Nonexploited Bargaining 2017-06-03T00:39:55.000Z · score: 4 (4 votes)


Comment by scott-garrabrant on What Would I Do? Self-prediction in Simple Algorithms · 2020-07-21T18:31:38.436Z · score: 6 (3 votes) · LW · GW

Somewhere in between? I have reliable intuitions about what will happen that come before being able to construct the proof, but that can reliably be turned into the proof. All of the proofs that these agents do what I say they do can be found by asking:

Assume that the probability does not converge as I say it does. How can I use this to make money if I am allowed to see (continuously) the logical inductor's beliefs and bet against them?

For example, in the first example: if the probability were greater than some fixed threshold infinitely often, I could wait until the probability is greater than that threshold, then bet that the agent goes right. This bet will always pay out and double my money, and I can do this forever.
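A minimal sketch of this betting strategy (the price sequence, threshold, and payout accounting are all my own illustrative assumptions, not the actual logical-inductor machinery):

```python
# Hypothetical sketch: exploiting an inductor whose price for "agent goes
# Left" exceeds 1/2 infinitely often even though the agent always goes Right.

def exploit(price_left_stream, threshold=0.5, bankroll=1.0):
    """Whenever the price of "agent goes Left" exceeds `threshold`, spend
    the whole bankroll on "agent goes Right" shares at price
    (1 - price_left).  Since the agent in fact always goes Right, each
    share pays out 1, multiplying the bankroll by 1/(1 - price_left) >= 2.
    """
    for price_left in price_left_stream:
        if price_left > threshold:
            bankroll *= 1.0 / (1.0 - price_left)
    return bankroll

# A price sequence that exceeds 1/2 three times: the bankroll at least
# doubles on each of those days, so the trader's wealth grows without
# bound if the pattern recurs forever.
print(exploit([0.1, 0.6, 0.2, 0.75, 0.05, 0.6]))
```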

Comment by scott-garrabrant on What Would I Do? Self-prediction in Simple Algorithms · 2020-07-21T18:24:04.986Z · score: 4 (2 votes) · LW · GW

Having  be  won't work.

Surprisingly, having the exploration probability go to 0 at any quickly computable rate won't work. For example, you could imagine a logical inductor built out of a collection of traders where one trader has almost all the money and says that, on days given by some fast-growing function, utility conditioned on going left is 0. Then, you have a second trader that forces the probability, on those days, of the statement that the agent goes left to be slightly higher than the exploration probability. Finally, you have a third trader that forces the expected utility conditioned on right to be very slightly above 0 on those same days.

The first trader never loses money, since the condition is never met. The second trader only loses a bounded amount of money, since it is forcing the probability of a sentence that will be false to be very small. The third trader similarly only loses a bounded amount of money. The exploration clause will never trigger, and the agent will never go left on any of those days.

The issue here is that we not only need to explore infinitely often; we need to explore infinitely often on all simple subsets of days. If the probability goes to 0 slowly, you can just look at a subset of days that is sufficiently sparse.

There are ways around this that allow us to make a logical induction agent that explores with density 0 (meaning that the limit, as n goes to infinity, of the proportion of days up to n on which the agent explores is 0). This is done by explicitly exploring infinitely often on every quickly computable subset of days, while still having the probability of exploring go to 0.
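This last construction can be illustrated with a toy sketch, using arithmetic progressions to stand in for "quickly computable subsets of days" (the diagonalization and the geometric bound growth are my own illustrative choices, not the actual logical-inductor construction):

```python
from itertools import count

def exploration_days(n_max):
    """Return a density-0 set of days that still hits every arithmetic
    progression a + b*t (b >= 1) infinitely often.  Diagonalize over
    progressions: at stage s = 1, 2, ..., pick one day from each of the
    first few progressions, but only beyond a rapidly growing bound."""
    days = set()
    bound = 1
    for s in count(1):
        if bound > n_max:
            break
        for a in range(s):
            for b in range(1, s + 1):
                # Smallest element of the progression a + b*t past `bound`.
                t = (bound - a) // b + 1
                d = a + b * t
                if d <= n_max:
                    days.add(d)
        # Each stage adds at most s*s days but quadruples the bound, so
        # the proportion of exploration days goes to 0.
        bound *= 4
    return days

days = exploration_days(10**6)
print(len(days) / 10**6)  # a tiny fraction of all days
```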

Comment by scott-garrabrant on What Would I Do? Self-prediction in Simple Algorithms · 2020-07-20T23:53:39.853Z · score: 6 (3 votes) · LW · GW

It does not approach it from above or below. As n goes to infinity, the proportion of rounds up to n on which the agent goes "Left" need not converge to 1/2, but it must have 1/2 as a limit point, so that proportion is arbitrarily close to 1/2 infinitely often. Further, the same is true for any easy-to-compute subsequence of rounds.

So, unfortunately, it might be that the agent goes left many, many times in a row (e.g., on every round in some long stretch), but it will still be unpredictable, just not locally independent.
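A toy sequence with exactly this behavior (the doubling-block pattern is my own illustrative choice, not anything derived from the actual agent):

```python
# A 0/1 sequence whose running frequency of "Left" (1) does not converge,
# yet has 1/2 as a limit point: alternate Left/Right blocks that double
# in length, so the agent goes left many times in a row.

def choices(n):
    out, bit, block = [], 1, 1
    while len(out) < n:
        out.extend([bit] * block)  # a run of identical choices
        bit, block = 1 - bit, block * 2
    return out[:n]

seq = choices(2**16)
freqs = []
lefts = 0
for i, c in enumerate(seq, 1):
    lefts += c
    freqs.append(lefts / i)

# The running frequency keeps oscillating (roughly between 1/3 and 2/3)
# but crosses 1/2 at every block switch, so it is arbitrarily close to
# 1/2 infinitely often without converging.
print(min(freqs[1000:]), max(freqs[1000:]))
```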

Comment by scott-garrabrant on What Would I Do? Self-prediction in Simple Algorithms · 2020-07-20T18:22:20.223Z · score: 2 (1 votes) · LW · GW

Not really.

Comment by scott-garrabrant on Sunday July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stuart_Armstrong · 2020-07-08T08:09:34.010Z · score: 2 (1 votes) · LW · GW

Date is wrong. It says June 28. 

Comment by scott-garrabrant on [Site Meta] Feature Update: More Tags! (Experimental) · 2020-04-25T00:25:11.231Z · score: 29 (9 votes) · LW · GW

Here are some maybe useful tags. Interpret these as ideas, not requests.

Mechanism Design (I think I am imagining including systemization that aligns incentives within yourself in here, which maybe means you would want a more general name like "Aligning Incentives" but I think I prefer "Mechanism Design")

Fake Frameworks (When I first thought of this, I was thinking of people tagging their own posts. Maybe it is a little weird to have people tagging each other's posts as fake. )

Embedded Agency (Where I am imagining this as being largely for technical work.) (In particular, I personally would get more use out of one big embedded agency tag than a bunch of smaller tags, since I feel like all the most interesting stuff in embedded agency cuts across tags like "decision theory.")

Something like the class including: Toward a New Technical Explanation of Technical Explanation, Embedded World Models, technical logical uncertainty work, and things about dealing with the fact that Bayes is not a viable strategy for embedded agents. Possible names: "Embedded World Models," "Resource Bounded Epistemics," "Embedded Epistemics," "Post-Bayesianism." I would hope the name here does not make people think it should only be for technical things.

Something like the class including: How I Lost 100 Pounds Using TDT, Humans Are Embedded Agents Too, Inner alignment in the brain, Sources of intuitions and data on AGI, and things about applying AI alignment theory to human rationality and vice versa. Maybe more generally, about applying results from one field to another. "Interdisciplinary Analogies"?

Comment by scott-garrabrant on A method for fair bargaining over odds in 2 player bets! · 2020-04-24T20:18:33.933Z · score: 5 (3 votes) · LW · GW

Here is another proposal that gives incentive-compatible betting by having the bet be smaller than the maximum. (Maybe it's the same; I haven't checked.)

Comment by scott-garrabrant on What's the upper bound of how long COVID is contagious? · 2020-03-22T03:26:34.201Z · score: 5 (3 votes) · LW · GW

You should (more strongly?) disambiguate between how long after getting sick you are safe and how long after becoming 100% isolated you are safe.

Comment by scott-garrabrant on Voting Phase of 2018 LW Review · 2020-01-09T22:13:55.234Z · score: 8 (4 votes) · LW · GW

Is it pro-social or anti-social to vote on posts I have skimmed but not read?

Comment by scott-garrabrant on Humans Are Embedded Agents Too · 2019-12-24T20:38:04.732Z · score: 21 (5 votes) · LW · GW

We actually avoided talking about AI in most of the cartoon, and tried to just imply it by having a picture of a robot.

The first time (I think) I presented the factoring in the embedded agency sequence was at a MIRI CFAR collaboration workshop, so parallels with humans were live in my thinking.

The first time we presented the cartoon in roughly its current form was at MSFP 2018, where we purposely did it on the first night before a CFAR workshop, so people could draw analogies that might help them transfer their curiosity in both directions.

Comment by scott-garrabrant on Honoring Petrov Day on LessWrong, in 2019 · 2019-09-27T00:09:54.038Z · score: 15 (7 votes) · LW · GW
Conspiracy theory: There are no launch codes. People who claim to have launch codes are lying. The real test is whether people will press the button at all. I have failed that test. I came up with this conspiracy theory ~250 milliseconds after pressing the button.

Oh no! Someone is wrong on the internet, and I have the ability to prove them wrong...

Comment by scott-garrabrant on Honoring Petrov Day on LessWrong, in 2019 · 2019-09-26T21:49:59.604Z · score: 23 (7 votes) · LW · GW

Did you consider the unilateralist curse before making this comment?

Do you consider it to be a bad idea, conditioning on the assumption that only one other person with launch access who sees this post in the time window chooses to say it was a bad idea?

Comment by scott-garrabrant on Honoring Petrov Day on LessWrong, in 2019 · 2019-09-26T21:32:05.699Z · score: 42 (19 votes) · LW · GW
If any users do submit a set of launch codes, tomorrow I’ll publish their identifying details.

If we make it through this, here are some ideas to make it more realistic next year:

1) Anonymous codes.

2) Karma bounty for the first person to press the button.

1+2) Randomly and publicly give some people the same code as each other, and give a karma bounty to everyone who had the code that took down the site.

3) Anyone with button rights can share button rights with anyone, and a karma bounty for sharing with the most other people that only pays out if nobody presses the button.

Comment by scott-garrabrant on Why Subagents? · 2019-09-03T00:07:36.791Z · score: 12 (6 votes) · LW · GW

Not sure if you've seen it, but this paper by Critch and Russell might be relevant when you start thinking about uncertainty.

Comment by scott-garrabrant on AI Alignment Writing Day Roundup #1 · 2019-09-02T23:59:09.404Z · score: 8 (4 votes) · LW · GW

This is my favorite comment. Thank you.

Comment by scott-garrabrant on Does Agent-like Behavior Imply Agent-like Architecture? · 2019-08-23T17:45:25.579Z · score: 6 (3 votes) · LW · GW

I think I do want to make my agent-like architecture general enough to include evolution. However, there might be a spectrum of agent-like-ness such that you can't get much more than Sphex behavior with just evolution (without having a mesa-optimizer in there).

I think you can guarantee that, probabilistically, getting a specific outcome requires information about that outcome (no free lunch), which implies "search" on a "world model."

Yeah, but do you think you can make it feel more like a formal proof?

Comment by scott-garrabrant on Intentional Bucket Errors · 2019-08-23T17:25:35.138Z · score: 29 (7 votes) · LW · GW

I think there is a possible culture where people say a bunch of inside-view things and run with speculations all the time, and another possible culture where people mostly only say literally true things that can be put into the listener's head directly. (I associate these cultures with the books R:A-Z and Superintelligence, respectively.) In the first culture, I don't feel the need to defend myself. However, I feel like I am often also interacting with people from the second culture, and that makes me feel like I need a disclaimer before I think in public with speculation that conflates a bunch of concepts.

Comment by scott-garrabrant on Computational Model: Causal Diagrams with Symmetry · 2019-08-22T21:42:24.761Z · score: 19 (8 votes) · LW · GW

Were you summoned by this post accidentally using your true name?

Comment by scott-garrabrant on Steelmanning Divination · 2019-06-06T02:22:56.552Z · score: 7 (6 votes) · LW · GW

Nitpick: conservation of expected evidence does not seem to me like why you can’t do divination with a random number generator.

Comment by scott-garrabrant on Is there a difference between uncertainty over your utility function and uncertainty over outcomes? · 2019-03-18T23:23:03.184Z · score: 8 (6 votes) · LW · GW

You are just normalizing on the dollar. You could ask "how many chickens would I kill to save a human life" instead, and you would normalize on a chicken.

Comment by scott-garrabrant on Is there a difference between uncertainty over your utility function and uncertainty over outcomes? · 2019-03-18T20:23:55.447Z · score: 19 (8 votes) · LW · GW

Utility functions are invariant up to affine transformation. I don't need to say how much I value a human life or how much I value a chicken life to make decisions in weird trolly problems involving humans and chickens. I only need to know relative values. However, utility uncertainty messes this up. Say I have two hypotheses: one in which human and chicken lives have the same value, and one in which humans are a million times more valuable. I assign the two hypotheses equal weight.

I could normalize and say that in both cases a human is worth 1 util. Then, when I average across utility functions, humans are about twice as valuable as chickens. But if I normalize and say that in both cases a chicken is worth 1 util, then when I average, the human is worth about 500,000 times as much. (You can still treat it like other uncertainty, but you have to make this normalization choice.)
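The arithmetic from the example above, as a minimal sketch (numbers taken directly from the two-hypothesis setup; the variable names are mine):

```python
# Two equally weighted hypotheses: H1 says a human life and a chicken
# life are equally valuable; H2 says a human is a million times more
# valuable than a chicken.
weights = [0.5, 0.5]

def avg(vals):
    """Average a value across hypotheses, weighted by credence."""
    return sum(w * v for w, v in zip(weights, vals))

# Normalization 1: a human is worth 1 util under both hypotheses.
human, chicken = [1.0, 1.0], [1.0, 1e-6]
print(avg(human) / avg(chicken))  # ~2: a human is worth about 2 chickens

# Normalization 2: a chicken is worth 1 util under both hypotheses.
human, chicken = [1.0, 1e6], [1.0, 1.0]
print(avg(human) / avg(chicken))  # ~500000.5: a human is worth ~500,000 chickens
```

The two normalizations give wildly different exchange rates from the same credences, which is the choice the comment is pointing at.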

Comment by scott-garrabrant on Epistemic Tenure · 2019-02-19T21:27:56.142Z · score: 22 (7 votes) · LW · GW

I think it was wrong about the MtG post. I mostly think the negative effects of posting ideas (related to technical topics) that people think are bad are small enough to ignore, except insofar as it messes with my internal state. My system 2 thinks my system 1 is wrong about the external effects, but intends to cooperate with it anyway, because not cooperating with it could be internally bad.

As another example, months ago, you asked me to talk about how embedded agency fits in with the rest of AI safety, and I said that I didn't want to force myself to make any public arguments for or against the usefulness of agent foundations. This is because I think research prioritization is especially prone to rationalization, so it is important to me that any thoughts about research prioritization not be pressured by downstream effects on what I am allowed to work on. (They can still change what I decide to work on, but only through channels that are entirely internal.)

Comment by scott-garrabrant on Epistemic Tenure · 2019-02-19T20:12:21.109Z · score: 37 (10 votes) · LW · GW

So, I feel like I am concerned for everyone, including myself, but also including people who do not think that it would affect them. A large part of what concerns me is that the effects could be invisible.

For example, I think that I am not very affected by this, but I recently noticed a connection between how difficult it is to get to work on writing a blog post that I think it is good to write, and how much my system 1 expects some people to receive the post negatively. (This happened when writing the recent MtG post.) This is only anecdotal, but I think that posts that seemed like bad PR caused akrasia, even when controlling for how good I think the post is on net. The scary part is that there was a long time before I noticed this. If I believed that there was a credible way to detect when there are thoughts you can't have in the first place, I would be less worried.

I didn't have many data points, and the above connection might have been a coincidence, but the point I am trying to make is that I don't feel like I have good enough introspective access to rule out a large, invisible effect. Maybe others do have enough introspective access, but I do not think that just not seeing the outer incentives pulling on you is enough to conclude that they are not there.

Comment by scott-garrabrant on Epistemic Tenure · 2019-02-19T00:33:21.721Z · score: 14 (6 votes) · LW · GW

I am not saying to falsely encourage him, I think I am mostly saying to continue giving him some attention/platform to get his ideas out in a way that would be heard. The real thing that I want is whatever will cause Bob to not end up back propagating from the group epistemics into his individual idea generation.

Comment by scott-garrabrant on Epistemic Tenure · 2019-02-18T22:58:09.971Z · score: 9 (5 votes) · LW · GW

I apologize for using the phrase "epistemic status" in a way that disagrees with the accepted technical term.

Comment by scott-garrabrant on How the MtG Color Wheel Explains AI Safety · 2019-02-17T02:07:14.332Z · score: 16 (6 votes) · LW · GW

I think informed oversight fits better with MtG white than it does with boxing. I agree that the three main examples are boxing like, and informed oversight is not, but it still feels white to me.

I do think that corrigibility done right is a thing that is in some sense less agentic. I think that things that have goals outside of them are less agentic than things that have their goals inside of them, but I think corrigibility is stronger than that. I want to say something like a corrigible agent not only has its goals partially on the outside (in the human), but also partially has its decision theory on the outside. Idk.

Comment by scott-garrabrant on How does Gradient Descent Interact with Goodhart? · 2019-02-04T03:44:54.558Z · score: 4 (2 votes) · LW · GW

Fixed, thanks.

Comment by scott-garrabrant on Announcement: AI alignment prize round 3 winners and next round · 2018-12-20T03:43:49.952Z · score: 12 (3 votes) · LW · GW

Abram and I submit Embedded Agency.

Comment by scott-garrabrant on Diagonalization Fixed Point Exercises · 2018-11-22T16:06:42.235Z · score: 6 (3 votes) · LW · GW

Yeah, it is just functions that take in two sentences and put both their Gödel numbers into a fixed formula (with 2 inputs).

Comment by scott-garrabrant on Iteration Fixed Point Exercises · 2018-11-22T16:02:32.661Z · score: 6 (3 votes) · LW · GW

Thanks, I actually wanted to get rid of the earlier "for all" condition, and I did that.

Comment by scott-garrabrant on Embedded Agents · 2018-11-07T19:45:34.084Z · score: 42 (11 votes) · LW · GW

This is not a complete answer, but it is part of my picture:

(It is the part of the picture that I can give while being only descriptive, and not prescriptive. For epistemic hygiene reasons, I want to avoid discussions of how much of different approaches we need in contexts (like this one) that would make me feel like I was justifying my research in a way that people might interpret as an official statement from the agent foundations team lead.)

I think that Embedded Agency is basically a refactoring of Agent Foundations in a way that gives one central curiosity-based goalpost, rather than making it look like a bunch of independent problems. It is mostly all the same problems, but it was previously packaged as "Here are a bunch of things we wish we understood about aligning AI," and is repackaged as "Here is a central mystery of the universe, and here are a bunch of things we don't understand about it." It is not a coincidence that they are the same problems, since they were generated in the first place by people paying close attention to which mysteries of the universe related to AI we haven't solved yet.

I think of Agent Foundations research as having a different type signature than most other AI alignment research, in a way that looks kind of like Agent Foundations:other AI alignment::science:engineering. I think of AF as more forward-chaining and other stuff as more backward-chaining. This may seem backwards if you think about AF as reasoning about superintelligent agents, and other research programs as thinking about modern ML systems, but I think it is true. We are trying to build up a mountain of understanding, until we collect enough that the problem seems easier. Others are trying to make direct plans for what we need to do, see what is wrong with those plans, and try to fix the problems. One consequence of this is that AF work is more likely to be helpful given long timelines, partially because AF is trying to be the start of a long journey of figuring things out, but also because AF is more likely to be robust to huge shifts in the field.

I actually like to draw an analogy with this: (taken from this post by Evan Hubinger)

I was talking with Scott Garrabrant late one night recently and he gave me the following problem: how do you get a fixed number of DFA-based robots to traverse an arbitrary maze (if the robots can locally communicate with each other)? My approach to this problem was to come up with and then try to falsify various possible solutions. I started with a hypothesis, threw it against counterexamples, fixed it to resolve the counterexamples, and iterated. If I could find a hypothesis which I could prove was unfalsifiable, then I'd be done.
When Scott noticed I was using this approach, he remarked on how different it was than what he was used to when doing math. Scott's approach, instead, was to just start proving all of the things he could about the system until he managed to prove that he had a solution. Thus, while I was working backwards by coming up with possible solutions, Scott was working forwards by expanding the scope of what he knew until he found the solution.

(I don't think it quite communicates my approach correctly, but I don't know how to do better.)

A consequence of the type signature of Agent Foundations is that my answer to "What are the other major chunks of the larger problem?" is "That is what I am trying to figure out."

Comment by scott-garrabrant on Subsystem Alignment · 2018-11-07T18:52:53.598Z · score: 14 (9 votes) · LW · GW

So if we view an epistemic subsystem as a superintelligent agent who has control over the map and has the goal of making the map match the territory, one extreme failure mode is that it takes a hit to short-term accuracy by slightly modifying the map in such a way as to trick the things looking at the map into giving the epistemic subsystem more control. Then, once it has more control, it can use it to manipulate the territory to make the territory more predictable. If your goal is to minimize surprise, you should destroy all the surprising things.

Note that we would not make an epistemic system this way; a more realistic model of the goal of an epistemic system we would build is "make the map match the territory better than any other map in a given class," or even "make the map match the territory better than any small modification to the map." But a large point of the section is that if you search for strategies that "make the map match the territory better than any other map in a given class," at small scales this is the same as "make the map match the territory." So you might find "make the map match the territory" optimizers, and then go wrong in the way above.

I think all this is pretty unrealistic, and I expect you are much more likely to go off in a random direction than to have something that looks like a specific subsystem the programmers put in get too much power and optimize stably for what the programmers said. We would need to understand a lot more before we would even hit the failure mode of making a system where the epistemic subsystem was agentically optimizing what it was supposed to be optimizing.

Comment by scott-garrabrant on Robust Delegation · 2018-11-05T19:00:14.227Z · score: 22 (6 votes) · LW · GW

Some last minute emphasis:

We kind of open with how agents have to grow and learn and be stable, but spend most of the time on this two-agent problem, where there is an initial agent and a successor agent. When thinking about it as the succession problem, it seems like a bit of a stretch as a fundamental part of agency. The first two sections were about how agents have to make decisions and have models, and choosing a successor does not seem like as much of a fundamental part of agency. However, when you think of it as an agent having to stably continue to optimize over time, it seems a lot more fundamental.

So, I want to emphasize that when we say there are multiple forms of the problem, like choosing successors or learning/growing over time, the view in which these are different at all is a dualistic view. To an embedded agent, the future self is not privileged, it is just another part of the environment, so there is no difference between making a successor and preserving your own goals.

It feels very different to humans. This is because it is much easier for us to change ourselves over time than it is to make a clone of ourselves and change the clone, but that difference is not fundamental.

Comment by scott-garrabrant on Decision Theory · 2018-11-02T16:53:31.594Z · score: 5 (3 votes) · LW · GW

But how do you avoid proving with certainty that p=1/2?

Since your proposal does not say what to do if you find inconsistent proofs that the linear function is two different things, I will assume that if it finds multiple different proofs, it defaults to 5 for the following.

Here is another example:

You are in a 5 and 10 problem. You have a twin who is also in a 5 and 10 problem. You have exactly the same source code. There is a consistency checker, and if you and your twin do different things, you both get 0 utility.

You can prove that you and your twin do the same thing. Thus, you can prove that the function is 5+5p. You can also prove that your twin takes 5 by Löb's theorem. (You can also prove that you take 5 by Löb's theorem, but you ignore that proof, since "there is always a chance.") Thus, you can prove that the function is 5-5p. Your system doesn't know what to do with two functions, so it defaults to 5. (If it is provable that you both take 5, you both take 5, completing the proof by Löb.)

I am doing the same thing as before, but because I put it outside of the agent, it does not get flagged with the "there is always a chance" module. This is trying to illustrate that your proposal takes advantage of a separation between the agent and the environment that was snuck in, and could be done incorrectly.

Two possible fixes:

1) You could say that the agent, instead of taking 5 when finding an inconsistency, takes some action that exhibits the inconsistency (something to which the two functions assign different values). This is very similar to the chicken rule, and if you add something like this, you don't really need the rest of your system. If you take an agent that, whenever it proves it does something, does something else, this agent will prove (given enough time) that if it takes 5 it gets 5, and if it takes 10 it gets 10.

2) I had one proof system, and just ignored the proofs that I found that I did a thing. I could instead give the agent a special proof system that is incapable of proving what it does, but how do you do that? Chicken rule seems like the place to start.

One problem with the chicken rule is that it was developed in a system that was deductively closed, so you can't prove something that passes through a proof of P without proving P. If you violate this, by having a random theorem prover, you might have a system that fails to prove "I take 5" but proves "I take 5 and 1+1=2" and uses this to complete the Löb loop.
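The chicken rule mentioned in fix (1) can be sketched as a toy (the `provable` oracle is a stand-in assumption for a bounded proof search; action names and values are illustrative only):

```python
# Toy chicken rule: an agent that, whenever its proof search "proves"
# that it takes some action, takes a different action instead.  A sound
# proof system can therefore never correctly prove what the agent does.

ACTIONS = [5, 10]

def chicken_agent(provable):
    """`provable(sentence)` stands in for bounded proof search."""
    for a in ACTIONS:
        if provable(f"agent takes {a}"):
            # Diagonalize against the proof: do something else.
            return next(x for x in ACTIONS if x != a)
    # No proofs about our own action: pick the apparently best option.
    return max(ACTIONS)

print(chicken_agent(lambda s: False))                   # no proofs -> takes 10
print(chicken_agent(lambda s: s == "agent takes 10"))   # "proof" it takes 10 -> takes 5
```

Any "proof" of the agent's action is falsified by its behavior, which is what breaks the Löbian loop.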

Comment by scott-garrabrant on Decision Theory · 2018-11-02T03:23:17.816Z · score: 6 (3 votes) · LW · GW

Sure. How do you do that?

Comment by scott-garrabrant on Decision Theory · 2018-11-01T23:53:25.250Z · score: 9 (5 votes) · LW · GW

My point was that I don't know where to assume the linearity is. Whenever I have private randomness, I have linearity over what I end up choosing with that randomness, but not linearity over what probability I choose. But I think this is not getting at the disagreement, so I pivot to:

In your model, what does it mean to prove that U is some affine function? If I prove that my probability p is 1/2 and that U=7.5, have I proven that U is the constant function 7.5? If there is only one value of p, it is not well-defined what the utility function is, unless I successfully carve the universe in such a way as to let me replace the action with various things and see what happens (or, assuming linearity, replace the probability with enough linearly independent values (in this case 2) to define the function).

Comment by scott-garrabrant on Decision Theory · 2018-11-01T21:43:56.035Z · score: 11 (6 votes) · LW · GW

Yeah, so its like you have this private data, which is an infinite sequence of bits, and if you see all 0's you take an exploration action. I think that by giving the agent these private bits and promising that the bits do not change the rest of the world, you are essentially giving the agent access to a causal counterfactual that you constructed. You don't even have to mix with what the agent actually does, you can explore with every action and ask if it is better to explore and take 5 or explore and take 10. By doing this, you are essentially giving the agent access to a causal counterfactual, because conditioning on these infinitesimals is basically like coming in and changing what the agent does. I think giving the agent a true source of randomness actually does let you implement CDT.

If the environment learns from the other possible worlds, it might punish or reward you in one world for stuff that you do in the other world, so you can't just ask which world is best to figure out what to do.

I agree that that is how you want to think about the matching pennies problem. However the point is that your proposed solution assumed linearity. It didn't empirically observe linearity. You have to be able to tell the difference between the situations in order to know not to assume linearity in the matching pennies problem. The method for telling the difference is how you determine whether or not and in what ways you have logical control over Omega's prediction of you.

Comment by scott-garrabrant on What is ambitious value learning? · 2018-11-01T19:15:06.616Z · score: 16 (7 votes) · LW · GW

A conversation that just went down in my head:

Me: "You observe that a bunch of attempts to write down what we want get Goodharted, and so you suggest writing down what we want using data. This seems like it will have all the same problems."

Straw You: "The reason you fail is because you can't specify what we really want, because value is complex. Trying to write down human values is qualitatively different from trying to write down human values using a pointer to all the data that happened in the past. That pointer cheats the argument from complexity, since it lets us fit lots of data into a simple instruction."

Me: "But the instruction is not simple! Pointing at what the "human" is is hard. Dealing with the fact that the human is inconsistent with itself gives more degrees of freedom. If you just look at the human actions, and don't look inside the brain, there are many many goals consistent with the actions you see. If you do look inside the brain, you need to know how to interpret that data. None of these are objective facts about the universe that you can just learn. You have to specify them, or specify a way to specify them, and when you do that, you do it wrong and you get Goodharted."

Comment by scott-garrabrant on Decision Theory · 2018-11-01T18:26:10.788Z · score: 19 (7 votes) · LW · GW

So, your suggestion is not just an inconsequential grain of uncertainty; it is a grain of exploration. The agent actually does take 10 with some small probability. If you tried to do this with just uncertainty, things would be worse, since that uncertainty would not be justified.

One problem is that you actually do explore a bunch, and since you don't get a reset button, you will sometimes explore into irreversible actions, like shutting yourself off. However, if the agent has a source of randomness, and also the ability to simulate worlds in which that randomness went another way, you can have an agent that with probability 1 never explores, and learns from the other worlds in which it does explore. So, you can either explore forever and shut yourself off, or you can explore very very rarely and learn from other possible worlds.
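The private-bits setup above can be sketched as follows (my framing, not a full embedded agent; the action names and value estimates are illustrative). With a finite truncation of the bit sequence, exploration happens with probability 2^-n, and in the infinite limit it is a probability-zero event:

```python
import random

# Sketch: an agent with private random bits explores only when the first
# n bits are all zero, so it explores with probability 2**-n, and in
# principle learns from the counterfactual worlds where the bits differ.

def should_explore(bits, n=30):
    """Explore iff the first n private bits are all zero (prob 2**-n)."""
    return all(b == 0 for b in bits[:n])

def act(bits, estimated_values, actions):
    if should_explore(bits):
        # Exploration: pick an action ignoring current value estimates.
        return random.choice(actions)
    # Exploitation: take the action currently estimated to be best.
    return max(actions, key=estimated_values.get)

bits = [random.getrandbits(1) for _ in range(30)]
action = act(bits, {"take_5": 5.0, "take_10": 10.0}, ["take_5", "take_10"])
```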

The problem with learning from other possible worlds is that, to get good results out of it, you have to assume that the environment does not also learn from other possible worlds, which is not very embedded.

But you are suggesting actually exploring a bunch, and there is a problem other than just shutting yourself off. You are getting past this problem in this case by only allowing linear functions, but that is not an accurate assumption. Let's say you are playing matching pennies with Omega, who has the ability to predict what probability you will pick but not what action you will pick.

(In matching pennies, you each choose H or T, you win if they match, they win if they don't.)

Omega will pick H if your probability of H is less than 1/2 and T otherwise. Your utility as a function of your probability is piecewise linear with two parts. Trying to assume that it will be linear will make things messy.
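A short sketch of that payoff curve, assuming illustrative payoffs of 1 for a win and 0 for a loss:

```python
# Sketch: matching pennies against an Omega that sees your probability but
# not your action. Omega plays H when your probability of H is below 1/2
# and T otherwise, so your expected utility as a function of p is
# piecewise linear with two pieces, not linear.

def expected_utility(p):
    """Expected utility of playing H with probability p against Omega."""
    omega = "H" if p < 0.5 else "T"
    if omega == "H":
        return p        # you match only when you play H
    return 1.0 - p      # you match only when you play T

print(expected_utility(0.25))  # 0.25 (on the first linear piece)
print(expected_utility(0.75))  # 0.25 (on the second linear piece)
```

A single linear fit through any two sampled points misrepresents the rest of the curve, which is the messiness in question.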

There is this problem where sometimes the outcome of exploring into taking 10, and the outcome of actually taking 10 because it is good, are different. More on this here.

Comment by scott-garrabrant on Preface to the sequence on value learning · 2018-11-01T00:00:44.544Z · score: 12 (5 votes) · LW · GW

I don't think this is relevant, but there are theoretical uses for maximizing expected log probability, and maximizing expected log probability is not the same as maximizing expected probability, since they interact with the expectation differently.
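A toy illustration of that difference (the numbers are made up): two predictors can have the same expected probability assigned to the truth while differing in expected log probability, so the two objectives rank them differently.

```python
import math

# Two predictors' probabilities on the truth across two equally likely worlds.
worlds_a = [0.5, 0.5]   # predictor A: 0.5 on the truth in both worlds
worlds_b = [0.9, 0.1]   # predictor B: confident in one world, badly off in the other

def expected(xs):
    return sum(xs) / len(xs)

# Same expected probability...
print(expected(worlds_a), expected(worlds_b))      # 0.5 0.5

# ...but different expected log probability (A wins under log scoring,
# because log is concave and penalizes the 0.1 world heavily).
print(expected([math.log(p) for p in worlds_a]))   # ≈ -0.693
print(expected([math.log(p) for p in worlds_b]))   # ≈ -1.204
```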

Comment by scott-garrabrant on History of the Development of Logical Induction · 2018-08-29T08:40:47.790Z · score: 3 (2 votes) · LW · GW


Comment by scott-garrabrant on Bayesian Probability is for things that are Space-like Separated from You · 2018-08-02T22:00:23.389Z · score: 11 (3 votes) · LW · GW

I think you are correct that I cannot cleanly separate the things that are in my past that I know from the things that are in my past that I do not know. For example, if a probability is chosen uniformly at random in the unit interval, and then a coin with that probability is flipped a large number of times and I see some of the results, I do not know the true probability, but the coin flips that I see really should come after the thing that determines the probability in my Bayes' net.

Comment by scott-garrabrant on Probability is Real, and Value is Complex · 2018-07-20T22:00:10.074Z · score: 6 (3 votes) · LW · GW

The uniqueness of 0 is only roughly equivalent to the half-plane definition if you also assume convexity (i.e. the existence of independent coins of no value).

Comment by scott-garrabrant on Optimization Amplifies · 2018-07-11T15:54:09.711Z · score: 2 (1 votes) · LW · GW

I added the word unit.

Comment by scott-garrabrant on The Alignment Newsletter #1: 04/09/18 · 2018-06-28T21:41:06.755Z · score: 7 (1 votes) · LW · GW

I think these titles should have dates instead of or in addition to numbers for historical context.

Comment by scott-garrabrant on Optimization Amplifies · 2018-06-27T02:32:48.368Z · score: 18 (6 votes) · LW · GW

I think this is similar to Security Mindset, so you might want to think about this post in relation to that.

Comment by scott-garrabrant on Announcement: AI alignment prize round 2 winners and next round · 2018-06-27T02:08:35.622Z · score: 17 (3 votes) · LW · GW

Ok, I have two other things to submit:

Counterfactual Mugging Poker Game and Optimization Amplifies.

I hope that your decision procedure includes a part where, if I win, you choose whichever subset of my posts you most want to draw attention to. I think that a single post would get a larger signal boost than each post in a group of three, and I would not be offended if one or two of my posts were cut from the announcement post to increase the signal for other things.

Comment by scott-garrabrant on Prisoners' Dilemma with Costs to Modeling · 2018-06-27T00:47:21.925Z · score: 7 (1 votes) · LW · GW

No, sorry. It wouldn't be very readable, and it is easy to do yourself.

Comment by scott-garrabrant on Prisoners' Dilemma with Costs to Modeling · 2018-06-21T22:11:37.564Z · score: 39 (13 votes) · LW · GW

I am actually worried that because I posted it, people will think it is more relevant to AI safety than it really is. I think it is a little related, but not strongly.

I do think it is surprising and interesting. I think it is useful for thinking about civilization and civilizational collapse and what aliens (or maybe AI or optimization daemons) might look like. My inner Andrew Critch also thinks it is more directly related to AI safety than I do. Also if I thought multipolar scenarios were more likely, I might think it is more relevant.

Also, it is made out of pieces such that thinking about it was a useful exercise. I am thinking a lot about Nash equilibria and dynamics. I think the fact that Nash equilibria are not exactly a dynamic type of object and are not easy to find is very relevant to understanding embedded agency. Also, I think that modal combat is relevant, because I think that Löbian handshakes are pointing at an important part of reasoning about oneself.

I think it is relevant enough that it was worth doing, and such that I would be happy if someone expanded on it, but I am not planning on thinking about it much more because it does feel only tangentially related.

That being said, many times I have explicitly thought that I was thinking about a thing that was not really related to the bigger problems I wanted to be working on, only to later see a stronger connection.

Comment by scott-garrabrant on Prisoners' Dilemma with Costs to Modeling · 2018-06-14T19:16:42.995Z · score: 9 (2 votes) · LW · GW

That was wrong. Fixed it. Thanks.