## Posts

## Comments

**Scott Garrabrant**on Finite Factored Sets: Introduction and Factorizations · 2021-06-20T22:01:58.516Z · LW · GW

Fixed, Thanks.

**Scott Garrabrant**on Finite Factored Sets · 2021-06-05T02:45:20.444Z · LW · GW

Sure, if you want to send me an email and propose some times, we could set up a half hour chat. (You could also wait until I post all the math over the next couple weeks.)

**Scott Garrabrant**on Finite Factored Sets · 2021-06-01T07:17:27.220Z · LW · GW

Looks like you copied it wrong. Your B only has one 4.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-28T16:52:27.864Z · LW · GW

I have not thought much about applying to things other than finite sets. (I looked at infinite sets enough to know there is nontrivial work to be done.) I do think it is good that you are thinking about it, but I don't have any promises that it will work out.

What I meant when I said that this can be done in a categorical way is that I think I can define a nice symmetric monoidal category of finite factored sets such that things like orthogonality can be given nice categorical definitions. (I see why this was a confusing thing to say.)

**Scott Garrabrant**on Finite Factored Sets · 2021-05-28T16:45:08.113Z · LW · GW

If I understand correctly, that definition is not the same. In particular, it would say that you can get nontrivial factorizations of a 5 element set: {{{0,1},{2,3,4}},{{0,2,4},{1,3}}}.
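A minimal sketch of why this 5 element example fails as a factorization, assuming the definition from the talk: a set of nontrivial partitions is a factorization when every choice of one part from each partition meets in exactly one element. The helper name `is_factorization` and the variables `five` and `four` are hypothetical, introduced only for illustration.

```python
from itertools import product

def is_factorization(universe, partitions):
    """True if each choice of one part per partition meets in exactly one element."""
    universe = set(universe)
    for partition in partitions:
        parts = [set(p) for p in partition]
        # Each factor must be a nontrivial partition of the universe.
        if len(parts) < 2:
            return False
        if set().union(*parts) != universe:
            return False
        if sum(len(p) for p in parts) != len(universe):
            return False
    for choice in product(*partitions):
        common = universe.intersection(*map(set, choice))
        if len(common) != 1:
            return False
    return True

# The 5 element example from the comment above.
five = [[{0, 1}, {2, 3, 4}], [{0, 2, 4}, {1, 3}]]
# A 2x2 grid on 4 elements, for comparison.
four = [[{0, 1}, {2, 3}], [{0, 2}, {1, 3}]]

print(is_factorization(range(5), five))
print(is_factorization(range(4), four))
```

In the 5 element case the parts {2,3,4} and {0,2,4} meet in two elements, so the unique-intersection condition fails, while the 2x2 grid passes.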

**Scott Garrabrant**on Finite Factored Sets · 2021-05-28T16:39:02.392Z · LW · GW

When I prove it, I prove and use (a slight notational variation on) these two lemmas.

- If , then for all .
- .

(These are also the two lemmas that I have said elsewhere in the comments look suspiciously like entropy.)

These are not trivial to prove, but they might help.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-27T18:44:27.074Z · LW · GW

I think that you are pointing out that you might get a bunch of false positives in your step 1 after you let a thermodynamical system run for a long time, but they are only approximate false positives.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-27T16:51:24.151Z · LW · GW

I think my model has macro states. In game of life, if you take the entire grid at time t, that will have full history regardless of t. It is only when you look at the macro states (individual cells) that my time increases with game of life time.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-27T16:49:10.163Z · LW · GW

As for entropy, here is a cute observation (with unclear connection to my framework): whenever you take two independent coin flips (with probabilities not 0, 1, or 1/2), their xor will always have higher entropy than either of the individual coin flips.
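A quick numerical check of this observation, as a sketch. If the coins have biases p and q, the xor has bias r = p(1-q) + q(1-p), and r - 1/2 = -2(p - 1/2)(q - 1/2), so r is strictly closer to 1/2 than either p or q. The function names and the sample biases are illustrative choices.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a coin with probability p of landing 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def xor_bias(p, q):
    """Probability that the xor of two independent coins equals 1."""
    return p * (1 - q) + q * (1 - p)

# Sample biases, none of them equal to 0, 1, or 1/2.
pairs = [(0.1, 0.3), (0.2, 0.9), (0.45, 0.7)]
for p, q in pairs:
    h_xor = entropy(xor_bias(p, q))
    print(p, q, round(entropy(p), 4), round(entropy(q), 4), round(h_xor, 4))
```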

**Scott Garrabrant**on Finite Factored Sets · 2021-05-27T16:43:08.629Z · LW · GW

Wait, I misunderstood, I was just thinking about the game of life combinatorially, and I think you were thinking about temporal inference from statistics. The reversible cellular automaton story is a lot nicer than you'd think.

If you take a general reversible cellular automaton (critters for concreteness), and have a distribution over computations in general position in which the initial-condition cells are independent, the cells may not be independent at future time steps.

If all of the initial probabilities are 1/2, you will stay in the uniform distribution, but if the probabilities are in general position, things will change, and time 0 will be special because of the independence between cells.

There will be other events at later times that will be independent, but those later time events will just represent "what was the state at time 0."

For a concrete example consider the reversible cellular automaton that just has 2 cells, X and Y, and each time step it keeps X constant and replaces Y with X xor Y.
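The two cell automaton can be simulated directly. This is a sketch under assumed initial biases (0.3 and 0.2 are arbitrary stand-ins for "general position"), and the helper names are hypothetical.

```python
from itertools import product

def step(state):
    """One update: keep X, replace Y with X xor Y."""
    x, y = state
    return (x, x ^ y)

def joint_at(t, px, py):
    """Joint distribution over (X, Y) after t steps, starting from independent cells."""
    dist = {}
    for x, y in product((0, 1), repeat=2):
        prob = (px if x else 1 - px) * (py if y else 1 - py)
        state = (x, y)
        for _ in range(t):
            state = step(state)
        dist[state] = dist.get(state, 0.0) + prob
    return dist

def cells_independent(dist, tol=1e-12):
    """Check P(X=x, Y=y) == P(X=x) * P(Y=y) for all x, y, up to tol."""
    for x, y in product((0, 1), repeat=2):
        p_x = sum(p for (a, _), p in dist.items() if a == x)
        p_y = sum(p for (_, b), p in dist.items() if b == y)
        if abs(dist.get((x, y), 0.0) - p_x * p_y) > tol:
            return False
    return True

for t in range(3):
    print(t, cells_independent(joint_at(t, 0.3, 0.2)))
```

With these biases the cells are independent at time 0 and dependent at time 1. This particular toy has period 2, so time 2 looks like time 0 again. With both biases set to 1/2 the distribution stays uniform and the cells stay independent at every step.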

**Scott Garrabrant**on Finite Factored Sets · 2021-05-27T00:38:00.674Z · LW · GW

Yep, there is an obnoxious number of factorizations of a large game of life computation, and they all give different definitions of "before."

**Scott Garrabrant**on Finite Factored Sets · 2021-05-26T22:00:04.291Z · LW · GW

I don't have a great answer, which isn't a great sign.

I think the scientist can infer things like "algorithms reasoning about the situation are more likely to know X but not Y than they are to know Y but not X, because reasonable processes for learning Y tend to learn enough information to determine X, but then forget some of that information." But why should I think of that as time?

I think the scientist can infer things like "If I were able to factor the world into variables, and draw a DAG (without determinism) that is consistent with the distribution with no spurious independencies (including in deterministic functions of the variables), and X and Y happen to be variables in that DAG, then there will be a path from X to Y."

The scientist can infer that if Z is orthogonal to Y, then Z is also orthogonal to X, where this is important because Z is orthogonal to Y can be thought of as saying that Z is useless for learning about Y. (and importantly a version of useless for learning that is closed under common refinement, so if you collect a bunch of different Z orthogonal to Y, you can safely combine them, and the combination will be orthogonal to Y.)

This doesn't seem to get at why we want to call it before. Hmm.

Maybe I should just list a bunch of reasons why it feels like time to me (in no particular order):

- It seems like it gets a very reasonable answer in the Game of Life example
- Prior to this theory, I thought that it made sense to think of time as a closure property on orthogonality, and this definition of time is exactly that closure property on orthogonality, where X is weakly before Y if whenever Z is orthogonal to Y, Z is also orthogonal to X. (where the definition of orthogonality is justified with the fundamental theorem.)
- If Y is a refinement of X, then Y cannot be strictly before X. (I notice that I don't have a thing to say about why this feels like time to me, and indeed it feels like it is in direct opposition to your "doesn't agree with what can be computed from what," but it does agree with the way I feel like I want to intuitively describe time in the stories told in the "Saving Time" post.) (I guess one thing I can say is that as an agent learns over time, we think of the agent as collecting information, so later=more information makes sense.)
- History looks a lot like a non-quantitative version of entropy, where instead of thinking of how much randomness goes into a variable, we think of which randomness goes into the variable. There are lemmas towards proving the semigraphoid axioms which look like theorems about entropy modified to replace sums/expectations with unions. Then, "after" exactly corresponds to "greater entropy" in this analogy.
- If I imagine X and Z being computed independently, and Y as being computed from X and Z, it will say that X is before Y, which feels right to me (and indeed this property is basically the definition. It seems like my time is maybe the unique thing that gets the right answer on this simple story and also treats variables with the same info content as the same.)
- We can convert a Pearlian DAG to a FFS, and under this conversion, d-separation is sent to conditional orthogonality, and paths between nodes are sent to time. (This holds on the questions Pearl knows how to ask; we also generalize the definition to all variables.)
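The simple story in the list above (X and Z computed independently, Y computed from both) can be sketched in code. The assumptions here: the factored set is a product of finite factors, a variable is a function on that product, its (unconditioned) history is the set of factors it actually depends on, X is weakly before Y when the history of X is a subset of the history of Y, and X is orthogonal to Y when their histories are disjoint. All function names are hypothetical.

```python
from itertools import product

def history(variable, factor_sizes):
    """Indices of the factors that `variable` actually depends on."""
    worlds = list(product(*(range(n) for n in factor_sizes)))
    relevant = set()
    for i, size in enumerate(factor_sizes):
        for world in worlds:
            for value in range(size):
                # Change only coordinate i and see whether the variable changes.
                other = world[:i] + (value,) + world[i + 1:]
                if variable(world) != variable(other):
                    relevant.add(i)
    return relevant

def weakly_before(x, y, factor_sizes):
    return history(x, factor_sizes) <= history(y, factor_sizes)

def orthogonal(x, y, factor_sizes):
    return history(x, factor_sizes).isdisjoint(history(y, factor_sizes))

sizes = (2, 2)                  # two binary factors
X = lambda w: w[0]              # computed from factor 0
Z = lambda w: w[1]              # computed from factor 1
Y = lambda w: w[0] ^ w[1]       # computed from X and Z
X_copy = lambda w: 1 - w[0]     # same information content as X

print(history(X, sizes), history(Y, sizes))
print(weakly_before(X, Y, sizes), weakly_before(Y, X, sizes))
```

Here X is before Y but not the other way around, X is orthogonal to Z, and `X_copy` has the same history as X, which matches the point that variables with the same information content get the same temporal properties.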

**Scott Garrabrant**on Finite Factored Sets · 2021-05-26T17:50:11.551Z · LW · GW

I partially agree, which is partially why I am saying time rather than causality.

I still feel like there is an ontological disagreement in that it feels like you are objecting to saying the physical thing that is Alice's knowledge is (not) before the physical thing that is Bob's knowledge.

In my ontology:

1) the information content of Alice's knowledge is before the information content of Bob's knowledge. (I am curious if this part is controversial.)

and then,

2) there is in some sense no more to say about the physical thing that is e.g. Alice's knowledge beyond the information content.

So, I am not just saying Alice is before Bob, I am also saying e.g. Alice is before Alice+Bob, and I can't disentangle these statements because Alice+Bob=Bob.

I am not sure what to say about the second example. I am somewhat rejecting the dynamics. "Alice travels back in time" is another way of saying that the high level FFS time disagrees with the standard physical time, which is true. The "high level" here is pointing to the fact that we are only looking at the part of Alice's brain that is about the envelopes, and thus talking about coarser variables than e.g. Alice's entire brain state in physical time. And if we are in the ontology where we are only looking at the information content, taking a high level version of a variable is the kind of thing that can change its temporal properties, since you get an entirely new variable.

I suspect most of the disagreement is in the sort of "variable nonrealism" of reducing the physical thing that is Alice's knowledge to its information content?

**Scott Garrabrant**on Finite Factored Sets · 2021-05-26T17:06:45.886Z · LW · GW

Now on OEIS: https://oeis.org/A338681

**Scott Garrabrant**on Finite Factored Sets · 2021-05-26T15:47:00.722Z · LW · GW

is the event you are conditioning on, so the thing you should expect is that , which does indeed hold.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-25T16:27:37.078Z · LW · GW

I think I (at least locally) endorse this view, and I think it is also a good pointer to what seems to me to be the largest crux between my theory of time and Pearl's theory of time.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-25T04:46:27.616Z · LW · GW

Hmm, I am not sure what to say about the fundamental theorem, because I am not really understanding the confusion. Is there something less motivating about the fundamental theorem than the analogous theorem about d-separation being equivalent to conditional independence in all distributions compatible with the DAG?

Maybe this helps? (probably not): I am mostly imagining interacting with only a single distribution in the class, and the claim about independence in all probability distributions compatible with the structure can be replaced with independence in a general position probability distribution compatible with the structure.

I am not thinking of it as related to a maximum entropy argument.

The point about SEMs having more structure that I am ignoring is correct. I think that the largest philosophical difference between my framework and Pearlian one is that I am denying realism of anything beyond the "apparently identical." Another way to think about it is that I am denying realism about there being anything about the variables beyond their information. All of my definitions are only based on the information content of the variables, and so, for example, if you have two variables that are deterministic copies of each other, they will have all the same temporal properties, while in an SEM, they could be different. The surprising thing is that even without intervention data, this variable non-realism allows us to define and infer something that looks a lot like time.

I have a lot of uncertainty about learning algorithms. On the surface, it looks like my structure just has so much to check, and is going to have a hard time being practical, but I could see it going either way. Especially if you imagine it as a minor modification to graph learning, where maybe you don't consider all re-factorizations, but you do consider e.g. taking a pair of nodes and replacing one of them with the XOR.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T21:49:00.901Z · LW · GW

Makes sense. I think a bit of my naming and presentation was biased by being so surprised by the not-on-OEIS fact.

I think I disagree about the bipartite graph thing. I think it only feels more natural when comparing to Pearl. The talk frames everything in comparison to Pearl, but if you are not looking at Pearl, I think graphs don’t feel like the right representation here. Comparing to Pearl is obviously super important, and maybe the first introduction should just be about the path from Pearl to FFS, but once we are working within the FFS ontology, graphs feel not useful. One crux might be about how I am excited for directions that are not temporal inference from statistical data.

My guess is that if I were putting a lot of work into a very long introduction for e.g. the structure learning community, I might start the way you are emphasizing, but then eventually convert to throwing all the graphs away.

(The paper draft I have basically only ever mentions Pearl/graphs for motivation at the beginning and in the applications section.)

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T19:30:37.783Z · LW · GW

I was originally using the name Time Cube, but my internal PR center wouldn't let me follow through with that :)

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T19:19:48.104Z · LW · GW

Thanks Paul, this seems really helpful.

As for the name, I feel like "FFS" is a good name for the analog of "DAG", which also doesn't communicate that much of the intuition, but maybe doesn't make as much sense for the name of the framework.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T17:47:50.483Z · LW · GW

Here is a more concrete example of me using FFS the way I intend them to be used outside of the inference problem. (This is one specific application, but maybe it shows how I intend the concepts to be manipulated).

I can give an example of embedded observation maybe, but it will have to come after a more formal definition of observation (This is observation of a variable, rather than the observation of an event above):

Definition: Given a FFS , and , , , which are partitions of , where , we say observes relative to W if:

1) ,

2) can be expressed in the form , and

3) .

(This should all be interpreted combinatorially, not probabilistically.)

The intuition of what is going on here is that to observe an event, you are being promised that you 1) do not change whether the event holds, and 3) do not change anything that matters in the case where that event does not hold. Then, to observe a variable, you can basically 2) split yourself up into different fragments of your policy, where each policy fragment observes a different value of that variable. (This whole thing is very updateless.)

Example 1: (non-observation)

An agent does not observe a coinflip , and chooses to raise either his left or right hand. Our FFS is given by , and . (I am abusing notation here slightly by conflating with the partition you get on by projecting onto the coordinate.) Then W is the discrete partition on .

In this example, we do not have observation. Proof: A only has two parts, so if we express A as a common refinement of 2 partitions, at least one of these two partitions must be equal to A. However, A is not orthogonal to W given H and A is not orthogonal to W given T. (). Thus we must violate condition 3.

Example 2: (observation)

An agent does observe a coinflip , and chooses to raise either his left or right hand. We can think of as actually choosing a policy that is a function from to , where the two character string in the parts in are the result of H followed by the result of T.

Our FFS is given by , and , where represents what the agent would do seeing heads, and represents what the agent would do given seeing tails. . We also have a partition representing what the agent actually does , where and are each four element sets in the obvious way. We will then say , so W does not get to see what would have done, it only gets to see the coin flip and what actually did.

Now I will prove that observes relative to in this example. First, , and , so we get the first condition, . We will break up A in the obvious way set up in the problem for condition 2, so it suffices now to show that , (and it will follow symmetrically that .)

I'm not going to go through the details, but , while , which are disjoint. The important thing here is that doesn't care about in worlds in which holds.

Discussion:

So largely I am sharing this to give an example for how you can manipulate FFS combinatorially, and how you can use this to say things that you might otherwise want to say using graphs. Granted, you could also say the above things using graphs, but now you can say more things, because you are not restricted to the nodes you choose; you can ask the same combinatorial question about any of the other partitions. The benefit is largely about not being dependent on our choice of variables.

It is interesting to try to translate this definition of observation to transparent Newcomb or counterfactual mugging, and see how some of the orthogonalities are violated, and thus it does not count as an observation.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T17:45:00.595Z · LW · GW

I'll try. My way of thinking doesn't use the examples, so I have to generate them for communication.

I can give an example of embedded observation maybe, but it will have to come after a more formal definition of observation (This is observation of a variable, rather than the observation of an event above):

Definition: Given a FFS , and , , , which are partitions of , where , we say observes relative to W if:

1) ,

2) can be expressed in the form , and

3) .

(This should all be interpreted combinatorially, not probabilistically.)

The intuition of what is going on here is that to observe an event, you are being promised that you 1) do not change whether the event holds, and 3) do not change anything that matters in the case where that event does not hold. Then, to observe a variable, you can basically 2) split yourself up into different fragments of your policy, where each policy fragment observes a different value of that variable. (This whole thing is very updateless.)

Example 1: (non-observation)

An agent does not observe a coinflip , and chooses to raise either his left or right hand. Our FFS is given by , and . (I am abusing notation here slightly by conflating with the partition you get on by projecting onto the coordinate.) Then W is the discrete partition on .

In this example, we do not have observation. Proof: A only has two parts, so if we express A as a common refinement of 2 partitions, at least one of these two partitions must be equal to A. However, A is not orthogonal to W given H and A is not orthogonal to W given T. (). Thus we must violate condition 3.

Example 2: (observation)

An agent does observe a coinflip , and chooses to raise either his left or right hand. We can think of as actually choosing a policy that is a function from to , where the two character string in the parts in are the result of H followed by the result of T.

Our FFS is given by , and , where represents what the agent would do seeing heads, and represents what the agent would do given seeing tails. . We also have a partition representing what the agent actually does , where and are each four element sets in the obvious way. We will then say , so W does not get to see what would have done, it only gets to see the coin flip and what actually did.

Now I will prove that observes relative to in this example. First, , and , so we get the first condition, . We will break up A in the obvious way set up in the problem for condition 2, so it suffices now to show that , (and it will follow symmetrically that .)

I'm not going to go through the details, but , while , which are disjoint. The important thing here is that doesn't care about in worlds in which holds.

Discussion:

So largely I am sharing this to give an example for how you can manipulate FFS combinatorially, and how you can use this to say things that you might otherwise want to say using graphs. Granted, you could also say the above things using graphs, but now you can say more things, because you are not restricted to the nodes you choose; you can ask the same combinatorial question about any of the other partitions. The benefit is largely about not being dependent on our choice of variables.

It is interesting to try to translate this definition of observation to transparent Newcomb or counterfactual mugging, and see how some of the orthogonalities are violated, and thus it does not count as an observation.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T16:35:24.657Z · LW · GW

Hmm, first I want to point out that the talk here sort of has natural boundaries around inference, but I also want to work in a larger frame that uses FFS for stuff other than inference.

If I focus on the inference question, one of the natural questions that I answer is where I talk about grue/bleen in the talk.

I think for inference, it makes the most sense to think about FFS relative to Pearl. We have this problem with looking at smoking/tar/cancer, which is: what if we carved into variables the wrong way? What if instead of tar/cancer, we had a variable for "How much bad stuff is in your body?" and "What is the ratio of tar to cancer?" The point is that our choice of variables both matters for the Pearlian framework, and is not empirically observable. I am trying to do all the good stuff in Pearl without the dependence on the variables.

Indeed, I think the largest crux between FFS and Pearl is something about variable realism. To FFS, there is no realism to a variable beyond its information content, so it doesn't make sense to have two variables X, X' with the same information, but different temporal properties. Pearl's ontology, on the other hand, has these graphs with variables and edges that say "who listens to whom," which sets us up to be able to have e.g. a copy function from X to X', and an arrow from X to Y, which makes us want to say X is before Y, but X' is not.

For the more general uses of FFS, which are not about inference, my answer is something like "the same kind of stuff as Cartesian frames." e.g. specifying embedded observations. (A partition observes a subset relative to a high level world model if and . Notice the first condition is violated by transparent Newcomb, and the second condition is violated by counterfactual mugging. (The symbols here should be read as combinatorial properties, there are no probabilities involved.))

I want to be able to tell the stories like in the Saving Time post, where there are abstract versions of things that are temporally related.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T16:11:48.089Z · LW · GW

I am not sure what you are asking (indeed I am not sure if you are responding to me or cousin_it).

One thing that I think is going on is that I use "factorization" in two places. Once when I say Pearl is using factorization data, and once where I say we are inferring a FFS. I think this is a coincidence. "Factorization" is just a really general and useful concept.

So the carving into A and B and C is a factorization of the world into variables, but it is not the kind of factorization that shows up in the FFS, because disjoint factors should be independent in the FFS.

As for why to switch to this framework, the main reason (to me) is that it has many of the advantages of Pearl with also being able to talk about some variables being coarse abstract versions of other variables. This is largely because I am interested in embedded agency applications.

Another reason is that we can't tell a compelling story about where the variables came from in the Pearlian story. Another reason is that sometimes we can infer time where Pearl cannot.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T16:00:52.544Z · LW · GW

I think that the answers to both the concern about 7 elements, and the desire to have questions depend on previous questions, come out of thinking about FFS models, rather than FFS.

If you want to have 7 elements in , that just means you will probably have more than 7 elements in .

If I want to model a situation where some questions I ask depend on other questions, I can just make a big FFS that asks all the questions, and have the model hide some of the answers.

For example, let's say I flip a biased coin, and then if heads I roll a biased 6 sided die, and if tails I roll a biased 10 sided die. There are 16 outcomes in .

I can build a 3 dimensional factored set 2x6x10, which I will imagine as sitting on my table with height 2. Heads is on the bottom, and tails is on the top. will then merge together the rows on the bottom, and the columns on the top, so it will look a little like the game Jenga.

In this way, I am imagining there is some hidden data about each world in which I get heads and roll the 6 sided die, which is the answer to the question "what would have happened if I rolled the 10 sided die?" Adding in all this counterfactual data gives a latent structure of 120 possible worlds, even though we can only distinguish 16 possible worlds.
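The counting in this example can be checked with a short sketch. The encoding of worlds as `(coin, d6, d10)` tuples and the name `observe` are illustrative assumptions, not notation from the post.

```python
from itertools import product
from collections import Counter

# Latent worlds: coin result, 6 sided die result, 10 sided die result.
latent_worlds = list(product(("H", "T"), range(1, 7), range(1, 11)))

def observe(world):
    """Hide the die that was not rolled: heads shows the d6, tails shows the d10."""
    coin, d6, d10 = world
    return (coin, d6) if coin == "H" else (coin, d10)

# How many latent worlds are merged into each distinguishable outcome.
outcome_sizes = Counter(observe(w) for w in latent_worlds)

print(len(latent_worlds), len(outcome_sizes))
print(outcome_sizes[("H", 3)], outcome_sizes[("T", 7)])
```

Each heads outcome merges the 10 latent worlds that differ only in the unrolled 10 sided die, and each tails outcome merges the 6 latent worlds that differ only in the unrolled 6 sided die, which is the Jenga-like picture described above.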

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T15:49:35.109Z · LW · GW

Yep, this all seems like a good way of thinking about it.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T15:48:28.102Z · LW · GW

Hmm, I doubt the last paragraph about sets of partitions is going to be valuable, but the eigenspace thinking might be useful.

Note that I gave my thoughts about how to deal with the uniform distribution over 4 elements in the thread responding to cousin_it.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T15:44:59.504Z · LW · GW

So we are allowing S to have more than 4 elements (although we don't need that in this case), so it is not just looking at a small number of factorizations of a 4 element set. This is because we want an FFS model, not just a factorization of the sample space.

If you factor in a different way, X will not be before Y, but if you do this it will not be the case that X is orthogonal to X XOR Y. The theorem in this example is saying that X being orthogonal to X XOR Y implies that X is before Y.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T15:38:37.019Z · LW · GW

Ok, makes sense. I think you are just pointing out that when I am saying "general position," that is relative to a given structure, like FFS or DAG or symmetric FFS.

If you have a probability distribution, it might be well modeled by a DAG, or a weaker condition is that it is well modeled by a FFS, or an even weaker condition is that it is well modeled by a SFFS (symmetric finite factored set).

We have a version of the fundamental theorem for DAGs and d-separation, we have a version of the fundamental theorem for FFS and conditional orthogonality, and we might get a version of the fundamental theorem for SFFS and whatever corresponds to conditional independence in that world.

However, I claim that even if we can extend to a fundamental theorem for SFFS, I still want to think of the independences in a SFFS as having different sources. There are the independences coming from orthogonality, and there are the independences coming from symmetry (or symmetry together with orthogonality).

In this world, orthogonality won't be as inferable because it will only be a subset of independence, but it will still be an important concept. This is similar to what I think will happen when we go to the infinite dimensional factored sets case.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T15:25:15.287Z · LW · GW

It looks like X and V are independent binary variables with different probabilities in general position, and Y is defined to be X XOR V. (and thus V=X XOR Y).
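A small exact check of this setup, as a sketch with assumed biases (3/10 and 1/5 are arbitrary stand-ins for "general position"). It shows that X is independent of X XOR Y, which equals V, while X is not independent of Y.

```python
from fractions import Fraction
from itertools import product

px, pv = Fraction(3, 10), Fraction(1, 5)

# Joint distribution over worlds (x, v); Y is derived as x ^ v.
worlds = {
    (x, v): (px if x else 1 - px) * (pv if v else 1 - pv)
    for x, v in product((0, 1), repeat=2)
}

def independent(f, g):
    """Exact test of P(f=a, g=b) == P(f=a) * P(g=b) for all binary a, b."""
    def prob(event):
        return sum(p for w, p in worlds.items() if event(w))
    return all(
        prob(lambda w: f(w) == a and g(w) == b)
        == prob(lambda w: f(w) == a) * prob(lambda w: g(w) == b)
        for a, b in product((0, 1), repeat=2)
    )

X = lambda w: w[0]
Y = lambda w: w[0] ^ w[1]
X_xor_Y = lambda w: X(w) ^ Y(w)  # equals V

print(independent(X, X_xor_Y), independent(X, Y))
```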

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T03:25:07.689Z · LW · GW

Yeah, you are right. I will change it. Thanks.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T01:46:48.626Z · LW · GW

I don't understand what conspiracy is required here.

X being orthogonal to X XOR Y implies X is before Y, we don't get the converse.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T01:13:07.524Z · LW · GW

The swapping within a factor allows for considering rational probabilities to be in general position, and the swapping factors allows IID samples to be considered in general position. I think this is an awesome research direction to go in, but it will make the story more complicated, since we will not be able to depend on the fundamental theorem, since we are allowing for a new source of independence that is not orthogonality. (I want to keep calling the independence that comes from disjoint factors orthogonality, and not use "orthogonality" to describe the new independences that come from the principle of indifference.)

**Scott Garrabrant**on Finite Factored Sets · 2021-05-24T01:06:16.273Z · LW · GW

So you should probably not work with probabilities equal to 1/2 in this framework, unless you are doing so for a specific reason. Just like in Pearlian causality, we are mostly talking about probabilities in general position. I have some ideas about how to deal with probability 1/2 (Have a FFS, together with a group of symmetry constraints, which could swap factors, or swap parts within a factor), but that is outside of the scope of what I am doing here.

To give more detail, the uniform distribution on four elements does not satisfy the compositional semigraphoid axioms, since if we take X, Y, Z to be the three partitions into two parts of size two, X is independent with Y and X is independent with Z, but X is not independent with the common refinement of Y and Z. Thus, if we take the orthogonality database generated by this probability distribution, you will find that it is not satisfied by any models.
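The failure of composition in the uniform case can be verified exactly. This is a sketch: the labelings `X`, `Y`, `Z` are one concrete choice of the three partitions of a 4 element set into two parts of size two, and `YZ` is the common refinement of Y and Z.

```python
from fractions import Fraction
from itertools import product

S = range(4)
P = {s: Fraction(1, 4) for s in S}  # uniform distribution

# The three partitions into two parts of size two, written as labelings.
X = {0: 0, 1: 0, 2: 1, 3: 1}
Y = {0: 0, 1: 1, 2: 0, 3: 1}
Z = {0: 0, 1: 1, 2: 1, 3: 0}
YZ = {s: (Y[s], Z[s]) for s in S}  # common refinement of Y and Z

def independent(f, g):
    """Exact test of P(f=a, g=b) == P(f=a) * P(g=b) over all value pairs."""
    def prob(pred):
        return sum(P[s] for s in S if pred(s))
    return all(
        prob(lambda s: f[s] == a and g[s] == b)
        == prob(lambda s: f[s] == a) * prob(lambda s: g[s] == b)
        for a, b in product(set(f.values()), set(g.values()))
    )

print(independent(X, Y), independent(X, Z), independent(X, YZ))
```

X is independent of Y and of Z, but the common refinement of Y and Z is the discrete partition, which determines X, so X cannot be independent of it.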

**Scott Garrabrant**on Finite Factored Sets · 2021-05-23T23:44:26.985Z · LW · GW

If you look at the draft edits for this sequence that are still awaiting approval on OEIS, you'll find some formulas.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-23T22:51:00.295Z · LW · GW

Nope, we have , but not . That breaks the symmetry.

**Scott Garrabrant**on Finite Factored Sets · 2021-05-23T22:31:02.576Z · LW · GW

Indeed, if X is independent of both Y and X xor Y, that violates the compositional semigraphoid axioms (assuming X is nondeterministic). Although it could still happen e.g. in the uniform distribution on X x Y. In the example in the post, I mean for X to be independent of X xor Y and for X to not be independent of Y.

**Scott Garrabrant**on Overconfidence is Deceit · 2021-02-22T04:09:22.646Z · LW · GW

And I mean the word "maybe" in the above sentence. I am saying the sentence not to express any disagreement, but to play with a conjecture that I am curious about.

**Scott Garrabrant**on Overconfidence is Deceit · 2021-02-22T04:01:59.785Z · LW · GW

Anyway, my reaction to the actual post is:

"Yep, Overconfidence is Deceit. Deceit is bad."

However, reading your post made me think about how maybe your right to not be deceived is trumped by my right to be incorrect.

**Scott Garrabrant**on Overconfidence is Deceit · 2021-02-22T03:55:15.146Z · LW · GW

I believe that I could not pass your ITT. I believe I am projecting some views onto you, in order to engage with them in my head (and publicly so you can engage if you want). I guess I have a Duncan-model that I am responding to here, but I am not treating that Duncan-model as particularly truth tracking. It is close enough that it makes sense (to me) to call it a Duncan-model, but its primary purpose in me is not for predicting Duncan, but rather for being there to engage with on various topics.

I suspect that being a better model would help it serve this purpose, and would like to make it better, but I am not requesting that.

I notice that I used different words in my header, "Scott's model of Duncan's beliefs." I think that this reveals something, but it certainly isn't clear. "Belief" is for true things; "models" are toys for generating things.

I think that in my culture, having a not-that-truth-tracking Duncan-model that I want to engage my ideas with is a sign of respect. I think I don't do that with that many people (more than 10, but less than 50, I think). I also do it with a bunch of concepts, like "Simic," or "Logical Induction." The best models according to me are not the ones that are the most accurate, as much as the ones that are most generally applicable. Rounding off the model makes it fit in more places.

However, I can imagine that maybe in your culture it is something like objectification, which causes you to not be taken seriously. Is this true?

If you are curious about what kind of things my Duncan-model says, I might be able to help you build a (Scott's-Duncan-Model)-Model. In one short phrase, I think I often round you off as an avatar of "respect," but even my bad model has more nuance than just the word "respect".

I imagine that you are imagining my comment as a minor libel about you, by contributing to a shared narrative in which you are something that you are not. I am sad to the extent that it has that effect. I am not sure what to do about that. (I could send things like this in private messages, that might help).

However, I want to point out that I am often not asking people to update from my claims. That is often an unfortunate side effect. I want to play with my Duncan-model. I want you to see what I build with it, and point out where it is not correctly tracking what Duncan would actually say (if that is something you want). I also want to do this in a social context. I want my model to be correct, so that I can learn more from it, but I want to relinquish any responsibility for it being correct. (I am up for being convinced that I should take on that responsibility, either as a general principle, or as a cooperative action towards you.)

Feel free to engage or not.

PS: The above is very much responding to my Duncan-model, rather than what you are actually saying. I reread your above comment, and my comment, and it seems like I am not responding to you at all. I still wanted to share the above text with you.

**Scott Garrabrant** on Overconfidence is Deceit · 2021-02-22T02:40:52.263Z · LW · GW

Yep, I totally agree that it is a riff. I think that I would have put it in response to the poll about how important it is for karma to track truth, if not for the fact that I don't like to post on Facebook.

**Scott Garrabrant** on Overconfidence is Deceit · 2021-02-21T12:18:53.849Z · LW · GW

This comment is not only about this post, but is also a response to Scott's model of Duncan's beliefs about how epistemic communities work, and a couple of Duncan's recent Facebook posts. It is also a mostly unedited rant. Sorry.

I grant that overconfidence is in a similar reference class as saying false things. (I think there is still a distinction worth making, similar to the difference between lying directly and trying to mislead by saying true things, but I am not really talking about that distinction here.)

I think society needs to be robust to people saying false things, and thus have mechanisms that prevent those false things from becoming widely believed. I think that as little as possible of that responsibility should be placed on the person saying the false things, in order to make it more strategy-proof. (I think that it is also useful for the speaker to help by trying not to say false things, but I am more putting the responsibility on the listener.)

I think there should be pockets of society (e.g. collections of people, specific contexts or events) that can collect true beliefs and reliably significantly decrease the extent to which they put trust in the claims of people who say false things. Call such contexts "rigorous."

I think that it is important that people look to the output of these rigorous contexts when e.g. deciding on COVID policy.

I think it is extremely important that the rigorous pockets of society are not "everyone in all contexts."

I think that society is very much lacking reliable rigorous pockets.

I have this model where in a healthy society, there can be contexts where people generate all sorts of false beliefs, but also sometimes generate gold (e.g. new ontologies that can vastly improve the collective map). If this context is generating a sufficient supply of gold, you DO NOT go in and punish their false beliefs. Instead, you quarantine them. You put up a bunch of signs that point to them and say e.g. "80% boring true beliefs 19% crap 1% gold," then you have your rigorous pockets watch them, and try to learn how to efficiently distinguish between the gold and the crap, and maybe see if they can generate the gold without the crap. However sometimes they will fail and will just have to keep digging through the crap to find the gold.

One might look at lesswrong, and say "We are trying to be rigorous here. Let's push stronger on the gradient of throwing out all the crap." I can see that. I want to be able to say that. I look at the world, and I see all the crap, and I want there to be a good pocket that can be about "true=good", "false=bad", and there isn't one. Science can't do it, and maybe lesswrong can.

Unfortunately, I also look at the world and see a bunch of boring processes that are never going to find gold. Science can't do it, and maybe lesswrong can.

And, maybe there is no tradeoff here. Maybe it can do both. Maybe at our current level of skills, we find more gold in the long run by being better at throwing out the crap.

I don't know what I believe about how much tradeoff there is. I am writing this, and I am not trying to evaluate the claims. I am imagining inhabiting the world where there is a huge tradeoff. Imagining the world where lesswrong is the closest thing we have to being able to have a rigorous pocket of society, but we have to compromise, because we need a generative pocket of society even more. I am overconfidently imagining lesswrong as better than it is at both tasks, so that the tradeoff feels more real, and I am imagining the world failing to pick up the slack of whichever one it lets slide. I am crying a little bit.

And I am afraid. I am afraid of being the person who overconfidently says "We need less rigor," and sends everyone down the wrong path. I am also afraid of being the person who overconfidently says "We need less rigor," and gets flagged as a person who says false things. I am not afraid of saying "We need more rigor." The fact that I am not afraid of saying "We need more rigor" scares me. I think it makes me feel that if I look too closely, I will conclude that "We need more rigor" is true. Specifically, I am afraid of concluding that and being wrong.

In my own head, I have a part of me that is inhabiting the world where there is a large tradeoff, and we need less rigor. I have another part that is trying to believe true things. The second part is making space for the first part, and letting it be as overconfident as it wants. But it is also quarantining the first part. It is not making the claim that we need more space and less rigor. This quarantine action has two positive effects. It helps the second part have good beliefs, but it also protects the first part from having to engage with the hammer of truth until it has grown.

I conjecture that to the extent that I am good at generating ideas, it is partially because I quarantine, but do not squash, my crazy ideas. (Where ignoring the crazy ideas counts as squashing them.) I conjecture further that an ideal society needs to do similar motions at the group level, not just the individual level. I said at the beginning that you need to put the responsibility for distinguishing on the listener for strategy-proofness. This was not the complete story. I conjecture that you need to put the responsibility in the hands of the listener, because you need to have generators that are not worried about accidentally having false/overconfident beliefs. You are not supposed to put policy decisions in the hands of the people/contexts that are not worried about having false beliefs, but you are supposed to keep giving them attention, as long as they keep occasionally generating gold.

Personal Note: If you have the attention for it, I ask that anyone who sometimes listens to me keeps (at least) two separate buckets: one for "Does Scott sometimes say false things?" and one for "Does Scott sometimes generate good ideas?", and decide whether to give me attention based on these two separate scores. If you don't have the attention for that, I'd rather you just keep the second bucket. I concede the first bucket (for now), and think my comparative advantage is to be judged according to the second one, and never be trusted as epistemically sound. (I don't think I am horrible at being epistemically sound, at least in some domains, but if I only get a one dimensional score, I'd rather relinquish the right to be epistemically trusted, in order to absolve myself of the responsibility to not share false beliefs, so my generative parts can share more freely.)

**Scott Garrabrant** on 2019 Review: Voting Results! · 2021-02-05T07:22:10.144Z · LW · GW

Unedited stream of thought:

Before trying to answer the question, I'm just gonna say a bunch of things that might not make sense (either because I am being unclear or being stupid).

So, I think the debate example is much more *about* manipulation than the iterated amplification example, so I was largely replying to the class that includes IA and debate. I can imagine saying that iterated amplification done right does not provide an incentive to manipulate the human.

I think that a process that was optimizing directly for finding a fixed point of does have an incentive to manipulate the human, however this is not exactly what IA is doing, because it is only having the gradients pass through the first in the fixed point equation, and I can imagine arguing that the incentive to manipulate comes from having the gradient pass through the second . If you iterate enough times, I think you might effectively have some optimization juice passing through modifying the second , but it might be much less. I am confused about how to think about optimization towards a moving target being different from optimization towards finding a fixed point.
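To make the "which occurrence does the gradient pass through" distinction concrete, here is a toy sketch, not a model of IA itself. It assumes a scalar parameter `theta` and a hypothetical linear `target` that depends on `theta`; all names and constants are invented for illustration. The "semi" gradient treats the target as a constant (a moving target), while the "full" gradient also differentiates through the target (direct optimization of the fixed-point residual).

```python
def target(theta, a=0.5, b=1.0):
    # Hypothetical stand-in for the answer being imitated, which itself depends on theta.
    return a * theta + b

def semi_gradient(theta, a=0.5, b=1.0):
    # Gradient of (theta - target)^2 with the target held constant:
    # the gradient passes only through the first occurrence of theta.
    return 2.0 * (theta - target(theta, a, b))

def full_gradient(theta, a=0.5, b=1.0):
    # Gradient of (theta - target(theta))^2 through both occurrences of theta;
    # the extra factor (1 - a) comes from differentiating the target.
    return 2.0 * (theta - target(theta, a, b)) * (1.0 - a)

def descend(grad_fn, theta=0.0, lr=0.1, steps=500):
    # Plain gradient descent using whichever gradient rule is supplied.
    for _ in range(steps):
        theta -= lr * grad_fn(theta)
    return theta

print(descend(semi_gradient))  # approaches the fixed point theta = 2
print(descend(full_gradient))  # approaches the same fixed point, more slowly
```

In this linear toy both rules end at the same fixed point, so it only shows which term gets differentiated. Any difference in incentives would have to come from a setting where changing the target is itself a way to lower the loss, which this sketch does not attempt to model.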

I think that even if you only look at the effect of following the gradients coming from the effect of changing the first , you are at least providing an incentive to predict the human on a wide range of inputs. In some cases, your range of inputs might be such that there isn't actually information about the human in the answers, which I think is where you are trying to get with the automated decomposition strategies. If humans have some innate ability to imitate some non-human process, and use that ability to answer the questions, and thinking about humans does not aid in thinking about that non-human process, I agree that you are not providing any incentive to think about the humans. However, it feels like a lot has to go right for that to work.

On the other hand, maybe we just think it is okay to predict, but not manipulate, the humans, while they are answering questions with a lot of common information about humans' work, which is what I think IA is supposed to be doing. In this case, even if I were to say that there is no incentive to "manipulate the human", I still argue that there is "incentive to learn how to manipulate the human," because predicting the human (on a wide range of inputs) is a very similar task to manipulating the human.

Okay, now I'll try to answer the question. I don't understand the question. I assume you are talking about incentive to manipulate in the simple examples with permutations etc in the experiments. I think there is no ability to manipulate those processes, and thus no gradient signal towards manipulation of the automated process. I still feel like there is some weird counterfactual incentive to manipulate the process, but I don't know how to say what that means, and I agree that it does not affect what actually happens in the system.

I agree that changing to a human will not change anything (except via also adding the change where the system is told (or can deduce) that it is interacting with the human, and thus ignores the gradient signal, to do some treacherous turn). Anyway, in these worlds, we likely already lost, and I am not focusing on them. I think the short answer to your question is in practice no, there is no difference, and there isn't even incentive to predict humans in strong generality, much less manipulate them, but that is because the examples are simple and not trying to have common information with how humans work.

I think that there are two crux opportunities for me here, and I'm sure we could find more: 1) being convinced that there is not an incentive to predict humans in generality (predicting humans only when they are very strictly following a non-humanlike algorithm doesn't count as predicting humans in generality), or 2) being convinced that this incentive to predict the humans is sufficiently far from incentive to manipulate.

**Scott Garrabrant** on 2019 Review: Voting Results! · 2021-02-04T22:10:36.384Z · LW · GW

BTW, I would be down for something like a facilitated double crux on this topic, possibly in the form of a weekly LW meetup. (But I think it would be a mistake to stop talking about it in this thread just to save it for that.)

**Scott Garrabrant** on 2019 Review: Voting Results! · 2021-02-04T22:06:28.702Z · LW · GW

I am having a hard time generating any ontology that says:

I don't see [let's try to avoid giving models strong incentives to learn how to manipulate humans] as particularly opposed to methods like iterated amplification or debate.

Here are some guesses:

You are distinguishing between an incentive to manipulate real life humans and an incentive to manipulate human models?

You are claiming that the point of e.g. debate is that when you do it right there is no incentive to manipulate?

You are focusing on the task/output of the system, and internal incentives to learn how to manipulate don't count?

These are just guesses.

**Scott Garrabrant** on 2019 Review: Voting Results! · 2021-02-04T07:40:58.322Z · LW · GW

(Where by "my true reason", I mean what feels live to me right now. There is also all the other stuff from the post, and the neglectedness argument.)

**Scott Garrabrant** on 2019 Review: Voting Results! · 2021-02-04T07:39:19.234Z · LW · GW

Yeah, looking at this again, I notice that the post probably failed to communicate my true reason. I think my true reason is something like:

I think that drawing a boundary around good and bad behavior is very hard. Luckily, we don't have to draw a boundary between good and bad behavior, we need to draw a boundary that has bad behavior on the outside, and *enough* good behavior on the inside to bootstrap something that can get us through the X-risk. Any distinction between good and bad behavior with any nuance seems very hard to me. However the boundary of "just think about how to make (better transistors and scanning tech and quantum computers or whatever) and don't even start thinking about (humans or agency or the world or whatever)" seems like it might be carving reality at the joints enough for us to be able to watch a system that is much stronger than us and know that it is not misbehaving.

i.e., I think my true reason is not that all reasoning about humans is dangerous, but that it seems very difficult to separate out safe reasoning about humans from dangerous reasoning about humans, and it seems more possible to separate out dangerous reasoning about humans from some sufficiently powerful subset of safe reasoning (but it seems likely that that subset needs to have humans on the outside).

**Scott Garrabrant** on 2019 Review: Voting Results! · 2021-02-03T00:40:08.129Z · LW · GW

Yeah, I am sad, but not surprised, because I have been trying to push this idea (e.g. at conferences) for a few years.

Guesses as to why I'm failing?

I think that we actually undersold the neglectedness point in this post, but I don't think that is the main problem. I think the main problem is that the post (and I) do not give viable alternatives. It's like:

"Have you noticed that the CHAI ontology, the Paul ontology, and basically all the concrete plans for safe AGI are trying to get safety out of superhuman models of humans, and there are plausible worlds in which this is on net actively harmful for safety? Further, the few exceptions to this mostly involve AGI interacting directly with the world agentically in such a way as to create an instrumental incentive for human modeling."

"Okay, what do we do instead?"

*shrug*

Perhaps it would go better if I gave any concrete plan at all, even if it is unrealistic like:

1. Understand agency/intelligence/optimization/corrigibility to the point where we could do something good and safe if we had unlimited compute, and maybe reliable brain scans.

2. (in parallel) Build safe enough AGI that can do science/engineering to the point of being able to generate plans to turn Jupiter into a computer, without relying on human models at all.

3. Turn Jupiter into a computer.

4. Do the good and safe thing on the Jupiter computer, or if no better ideas are found, run a literal HCH on the Jupiter computer.

The problem is that this draws attention to how unrealistic this plan is, and not to the open question of "What *do* we do instead?"

**Scott Garrabrant** on Eight Definitions of Observability · 2021-01-28T23:24:01.157Z · LW · GW

I am confused. Why is it not identical to your other comment?

**Scott Garrabrant** on Eight Definitions of Observability · 2021-01-27T08:46:37.001Z · LW · GW

I think I fixed it. Thanks.