## Posts

## Comments

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-06-01T11:29:00.344Z · LW · GW

I assume (6) means that your "anthropic update" scans across possible universes to find those that contain important decisions you might want to influence?

Yes, and then outputs strings from that set with probability proportional to their weight in the universal prior.

By (3) do you mean the same thing as "Simplest output channel that is controllable by advanced civilization with modest resources"?

I would say "successfully controlled" instead of controllable, although that may be what you meant by the term. (I decomposed this as controllable + making good guesses.) For some definitions of controllable, I might have given a point estimate of maybe 1 or 5 bits. But there has to be an output channel for which the way you transmit a bitstring out is the way the evolved consequentialists expect. But recasting it in these terms implicitly suggests that the specification of the output channel can take on some of the character of (6'), which makes me want to put my range down to 15-60; point estimate 25.

instead of using (1)+(2)+(3) you should compare to (6') = "Simplest program that scans across many possible worlds to find those that contain some pattern that can be engineered by consequentialists trying to influence prior."

Similarly, I would replace "can be" with "seems to have been". And just to make sure we're talking about the same thing, it takes this list of patterns, and outputs them with probability proportional to their weight in the universal prior.

Yeah, this seems like it would make some significant savings compared to (1)+(2)+(3). I think replacing parts of the story from being specified as [arising from natural world dynamics] to being specified as [picked out "deliberately" by a program] generally leads to savings.

Then the comparison is between specifying "important predictor to influence" and whatever the easiest-to-specify pattern that can be engineered by a consequentialist. It feels extremely likely to me that the second category is easier, indeed it's kind of hard for me to see any version of (6) that doesn't have an obviously simpler analog that could be engineered by a sophisticated civilization.

I don't quite understand the sense in which [worlds with consequentialist beacons/geoglyphs] can be described as [easiest-to-specify controllable pattern]. (And if you accept the change of "can be" to "seems to have been", it propagates here). Scanning for important predictors to influence does feel very similar to me to scanning for consequentialist beacons, especially since the important worlds are plausibly the ones with consequentialists.

There's a bit more work to be done in (6') besides just scanning for consequentialist beacons. If the output channel is selected "conveniently" for the consequentialists, since the program is looking for the beacons, instead of the consequentialists giving it their best guess(es) and putting up a bunch of beacons, there has to be some part of the program which aggregates the information of multiple beacons (by searching for coherence, e.g.), or else determines which beacon takes precedence, and then also determines how to interpret their physical signature as a bitstring.

Tangent: in heading down a path trying to compare [scan for "important to influence"] vs. [scan for "consequentialist attempted output messages"] just now, my first attempt had an error, so I'll point it out. It's not necessarily harder to specify "scan for X" than "scan for Y" when X is a subset of Y. For instance "scan for primes" is probably simpler than "scan for numbers with fewer than 6 factors".

Maybe clarifying or recasting the language around "easiest-to-specify controllable pattern" will clear this up, but can you explain more why it feels to you that [scan for "consequentialists' attempted output messages"] is so much simpler than [scan for "important-to-influence data streams"]? My very preliminary first take is that they are within 8-15 bits.

I also don't really see why you are splitting them [(4) + (5)] apart, shouldn't we just combine them into "wants to influence predictors"? If you're doing that presumably you'd both use the anthropic prior and then the treacherous turn.

I split them in part in case there is a contingent of consequentialists who believe that outputting the right bitstring is key to their continued existence, believing that they stop being simulated if they output the wrong bit. I haven't responded to your claim that this would be faulty metaphysics on their part; it still seems fairly tangential to our main discussion. But you can interpret my 5 bit point estimate for (5) as claiming that 31 times out of 32 that a civilization of consequentialists tries to influence their world's output, it is in an attempt to survive. Tell me if you're interested in a longer justification that responds to your original "line by line comments" comment.

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-27T17:04:31.346Z · LW · GW

Yeah, seems about right.

I think with 4, I've been assuming for the sake of argument that manipulators get free access to the right prior, and I don't have a strong stance on the question, but it's not complicated for a directly programmed anthropic update to be built on that right prior too.

I guess I can give some estimates for how many bits I think are required for each of the rows in the table. I'll give a point estimate, and a range for a 50% confidence interval for what my point estimate would be if I thought about it for an hour by myself and had to write up my thinking along the way.

I don't have a good sense for how many bits it takes to get past things that are just extremely basic, like an empty string, or an infinite string of 0s. But whatever that number is, add it to both 1 and 6.

1) Consequentialists emerge, 10 - 50 bits; point estimate 18

2) TM output has not yet begun, 10 - 30 bits; point estimate 18

3) make good guesses about controllable output, 18 - 150 bits; point estimate 40

4) decide to output anthropically updated prior, 8 - 35 bits; point estimate 15

5) decide to do a treacherous turn, 1 - 12 bits; point estimate 5

vs. 6) direct program for anthropic update, 18 - 100 bits; point estimate 30

The ranges are fairly correlated.
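
As a rough illustration of how these point estimates compare, here is a minimal Python sketch. It assumes the costs of rows 1-5 simply add (the ranges are correlated, so only point estimates are summed) and that the unknown "extremely basic" baseline cancels because it is added to both sides. The variable names are hypothetical and not part of the original estimates.

```python
# Point estimates in bits, copied from the list above.
consequentialist_route = {
    "1) consequentialists emerge": 18,
    "2) TM output has not yet begun": 18,
    "3) make good guesses about controllable output": 40,
    "4) decide to output anthropically updated prior": 15,
    "5) decide to do a treacherous turn": 5,
}
direct_route_bits = 30  # 6) direct program for anthropic update

# Assumption: the rows are additive costs in description length.
consequentialist_bits = sum(consequentialist_route.values())
gap_bits = consequentialist_bits - direct_route_bits

# A gap of k bits corresponds to a relative prior weight of 2**-k.
weight_ratio = 2.0 ** -gap_bits

print(f"route 1-5: {consequentialist_bits} bits, route 6: {direct_route_bits} bits")
print(f"gap: {gap_bits} bits, relative weight: {weight_ratio:.3e}")
```

Under these point estimates the direct encoding comes out cheaper; the wide ranges mean the sign of the gap is what matters here, not the exact size.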

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-27T09:52:38.619Z · LW · GW

Do you have some candidate "directly programmed anthropic update" in mind? (That said, my original claim was just about the universal prior, not about a modified version with an anthropic update)

I’m talking about the weight of an anthropically updated prior *within* the universal prior. I should have added “+ bits to encode anthropic update directly” to that side of the equation. That is, it takes some number of bits to encode “the universal prior, but conditioned on the strings being important to decision-makers in important worlds”. I don’t know how to encode this, but there is presumably a relatively simple direct encoding, since it’s a relatively simple concept. This is what I was talking about in my response to the section “The competition”.

One way that might be helpful for thinking about the bits saved from the anthropic update is that it is $-\log_2 \Pr_{\text{string} \sim \text{universal prior}}[\text{string is important to decision-makers in important worlds}]$. I think this gives us a handle in reasoning about anthropic savings as a self-contained object, even if it’s a big number.

> bits to specify camera on earth - bits saved from anthropic update

I think the relevant number is just "log_2 of the number of predictions that the manipulators want to influence." It seems tricky to think about this (rather small) number as the difference between two (giant) numbers.

But suppose they picked only one string to try to manipulate. The cost would go way down, but then it probably wouldn’t be us that they hit. If log of the number of predictions that the manipulators want to influence is 7 bits shorter than [bits to specify camera on earth - bits saved from anthropic update], then there’s a 99% chance we’re okay. If different manipulators in different worlds are choosing differently, we can expect 1% of them to choose our world, and so we start worrying again, but we add the 7 bits back because it’s only 1% of them.
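
The conversion between "1%" and "7 bits" used here is just a base-2 logarithm. A small sketch of that arithmetic (the helper names are hypothetical):

```python
import math

def bits_for_probability(p: float) -> float:
    """Description-length cost, in bits, of an event with probability p."""
    return -math.log2(p)

def probability_for_bits(bits: float) -> float:
    """Probability corresponding to a cost of `bits` bits."""
    return 2.0 ** -bits

exact_bits_for_1_percent = bits_for_probability(0.01)  # a little under 7
prob_for_7_bits = probability_for_bits(7)              # 1/128, a little under 1%

print(exact_bits_for_1_percent, prob_for_7_bits)
```

So "7 bits" is a slight round-up of the cost of a 1% event, which is why the two are used interchangeably above.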

So let’s consider two Turing machines. Each row will have a cost in bits.

| A | B |
|---|---|
| Consequentialists emerge | Directly programmed anthropic update |
| Make good guesses about controllable output | |
| Decide to output anthropically updated prior | |
| Weight of earth-camera within anthropically updated prior | Weight of earth-camera within anthropically updated prior |

The last point can be decomposed into [description length of camera in our world - anthropic savings], but it doesn’t matter; it appears in both options.

I don’t think this is what you have in mind, but I’ll add another case, in case this is what you meant by “They are just looking at the earth-like Turing machine”. Maybe, just skip this though.

| A | B |
|---|---|
| Consequentialists emerge *in a world like ours* | Directly programmed anthropic update |
| Make good guesses about controllable output | |
| Output (strong) anthropically updated prior | |
| Weight of earth-camera in strong anthropic update | … in normal anthropic update |

They can make a stronger anthropic update by using information about their world, but the savings will be equal to the extra cost of specifying that the consequentialists are in a world like ours. This is basically the case I mentioned above where different manipulators choose different sets of worlds to try to influence, but then the set of manipulators that choose our world has smaller weight.

------ end potential skip

What I think it boils down to is the question:

Is the anthropically updated version of the universal prior most simply described as “the universal prior, but conditioned on the strings being important to decision-makers in important worlds” or “that thing consequentialists sometimes output”? (And consequentialists themselves may be more simply described as “those things that often emerge”). “Sometimes” is of course doing a lot of work, and it will take bits to specify which “sometimes” we are talking about. If the latter is more simple, then we might expect the natural continuation of those sequences to usually contain treacherous turns, and if the former is more simple, then we wouldn’t. This is why I don’t think the weight of an earth-camera in the universal prior ever comes into it.

But/so I don’t understand if I’m missing the point of a couple paragraphs of your comment—the one which starts “They are just looking at the earth-like Turing machine”, and the next paragraph, which I agree with.

**michaelcohen (cocoa)** on Finite Factored Sets · 2021-05-24T13:11:26.355Z · LW · GW

I'm using some of the terminology I suggested here.

A factoring is a set of questions such that each signature of possible answers identifies a unique element. In 20 questions, you can tailor the questions depending on the answers to previous questions, and ultimately each element will have a bitstring signature depending on the history of yesses and nos. I guess you can define the question to include xors with previous questions, so that it effectively changes depending on the answers to others. But it's sometimes useful that the bitstrings are allowed to have different length. It feels like an unfortunate fact that when constructing a factoring for 7 elements, you're forced to use the factoring {"Okay, well, which element is it?"}, just because you don't want to have to answer a different number of questions for different elements. Is this a real cost? Or do we only ever construct cases where it's not?

In the directed graph of subsets, with edges corresponding to the subset relation, why not consider arbitrary subtrees? For example, for the set of 7 elements, we might have {{0, 1, 2}, {{3, 4}, {5, 6}}}. (I'm not writing it out as a tree, but that contains all the information). This corresponds to the sequence of questions: "is it less than 3?", [if yes] "is it 0, 1, or 2?", [if no], "is it less than 5?", "is it even?" Allowing different numbers of questions and different numbers of answers seems to give some extra power here. Is it meaningful?
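
To make the variable-length idea concrete, here is a small Python sketch that encodes the nested subsets above as a tree and reads off each element's signature as the sequence of branches taken. The representation (nested tuples, branch indices as answers) is an assumption made for illustration, not part of the finite factored sets formalism.

```python
# Nested tuples mirror {{0, 1, 2}, {{3, 4}, {5, 6}}}; integer leaves are elements.
tree = ((0, 1, 2), ((3, 4), (5, 6)))

def signatures(node, prefix=()):
    """Map each element to the tuple of branch indices (answers) that reaches it."""
    if isinstance(node, int):
        return {node: prefix}
    result = {}
    for answer, child in enumerate(node):
        result.update(signatures(child, prefix + (answer,)))
    return result

sigs = signatures(tree)
for element in sorted(sigs):
    print(element, sigs[element])
```

Elements 0, 1, 2 are identified after two questions (the second with three possible answers), while 3 through 6 need three binary questions. Every signature is still unique, but the signatures no longer have a common length.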

**michaelcohen (cocoa)** on Finite Factored Sets · 2021-05-24T12:53:21.510Z · LW · GW

I was thinking of some terminology that might make it easier to think about factoring and histories and whatnot.

A partition can be thought of as a (multiple-choice) question. Like for a set of words, you could have the partition corresponding to the question "Which letter does the word start with?" and then the partition groups together elements with the same answer.

Then a factoring is a set of questions, where the set of answers will uniquely identify an element. The word that comes to mind for me is "signature", where an element's signature is the set of answers to the given set of questions.

For the history of a partition X, X can be thought of as a question, and the history is the subset of questions in the factoring that you need the answers to in order to determine the answer to question X.

And then two questions X and Y are orthogonal if there aren't any questions in the factoring that you need the answer to both for answering X and for answering Y.
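
Here is a minimal Python sketch of the "question" and "signature" vocabulary on a toy set of four two-letter words. The word list and function names are hypothetical. The check that every combination of answers is realized comes from the formal definition of a factorization (one element in each intersection of blocks), which is stricter than uniqueness alone.

```python
from itertools import product

elements = ["ab", "ad", "cb", "cd"]

# Each question maps an element to an answer, and so induces a partition.
questions = {
    "first letter": lambda word: word[0],
    "last letter": lambda word: word[-1],
}

def partition(question, items):
    """Group together the items that give the same answer to the question."""
    blocks = {}
    for item in items:
        blocks.setdefault(question(item), set()).add(item)
    return list(blocks.values())

def signature(item):
    """The tuple of answers an item gives to all of the questions."""
    return tuple(q(item) for q in questions.values())

def is_factoring(items):
    """Signatures must be unique and must cover every combination of answers."""
    sigs = {signature(x) for x in items}
    answer_sets = [{q(x) for x in items} for q in questions.values()]
    return len(sigs) == len(items) and sigs == set(product(*answer_sets))

print(partition(questions["first letter"], elements))
print({word: signature(word) for word in elements})
print(is_factoring(elements))
```

Dropping "cd" from the list keeps signatures unique but leaves the answer combination ("c", "d") unrealized, so the two questions would no longer form a factoring in the formal sense.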

**michaelcohen (cocoa)** on Finite Factored Sets · 2021-05-24T12:37:09.948Z · LW · GW

I was thinking about the difficulty of finite factored sets not understanding the uniform distribution over 4 elements, and it makes me feel like something fundamental needs to be recast. An analogy came to mind about eigenvectors vs. eigenspaces.

What we might like to be true about the unit eigenvectors of a matrix is that they are the unique unit vectors for which the linear transformation preserves direction. But if two eigenvectors have the same eigenvalue, the choice of eigenvectors is not unique--we could choose any pair on that plane. So really, it seems like we shouldn't think about a matrix's eigenvectors and (potentially repeated) eigenvalues; we should think about a matrix's eigenvalues and eigenspaces, some of which might be more than 1-dimensional.
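
A tiny Python illustration of the non-uniqueness, using a hypothetical diagonal matrix with a repeated eigenvalue and plain integer arithmetic so the comparisons are exact:

```python
def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]

def is_eigenvector(matrix, vector, eigenvalue):
    return matvec(matrix, vector) == [eigenvalue * x for x in vector]

# Eigenvalue 2 is repeated; its eigenspace is the whole x-y plane.
A = [[2, 0, 0],
     [0, 2, 0],
     [0, 0, 3]]

vectors_in_plane = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [3, -5, 0]]
all_in_plane_are_eigenvectors = all(is_eigenvector(A, v, 2) for v in vectors_in_plane)

# A vector that mixes the two eigenspaces is not an eigenvector at all.
mixed_is_eigenvector = is_eigenvector(A, [1, 0, 1], 2) or is_eigenvector(A, [1, 0, 1], 3)

print(all_in_plane_are_eigenvectors, mixed_is_eigenvector)
```

Any pair of independent vectors from that plane would serve equally well as "the" eigenvectors for eigenvalue 2, which is the sense in which the eigenspace is the natural object.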

I wonder if there's a similar move to be made when defining orthogonality. Maybe (for example) orthogonality would be more conveniently defined between two sets of partitions instead of between two partitions. Probably that specific idea fails, but maybe there's something like this that could be done.

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-24T11:30:23.865Z · LW · GW

I take your point that we are discussing some output rules which add extra computation states, and so some output rules will add fewer computation states than others.

I'm merging my response to the rest with my comment here.

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-24T11:27:35.729Z · LW · GW

They are using their highest probability guess about the output channel, which will be higher probability than the output channel exactly matching some camera on old earth (but may still be very low probability). I still don't understand the relevance.

I’m trying to find the simplest setting where we have a disagreement. We don’t need to think about cameras on earth quite yet. I understand the relevance isn’t immediate.

They don't care about "their" Turing machine, indeed they live in an infinite number of Turing machines that (among other things) output bits in different ways.

I think I see the distinction between the frameworks in which we most naturally think about the situation. I agree that they live in an infinite number of Turing machines, in the sense that their conscious patterns appear in many different Turing machines. All of these Turing machines have weight in some prior. When they change their behavior, they (potentially) change the outputs of any of these Turing machines. Taking these Turing machines as a set, weighted by those prior weights, we can consider the probability that the output obeys a predicate P. The answer to this question can be arrived at through an equivalent process. Let the inhabitants imagine that there is a correct answer to the question “which Turing machine do I *really* live in?” They then reason anthropically about which Turing machines give rise to such conscious experiences as theirs. They then use the same prior over Turing machines that I described above. And then they make the same calculation about the probability that “their” Turing machine outputs something that obeys the predicate P. So on the one hand, we could say that we are asking “what is the probability that the section of the universal prior which gives rise to these inhabitants outputs an output that obeys predicate P?” Or we could equivalently ask “what is the probability that this inhabitant ascribes to ‘its’ Turing machine outputting a string that obeys predicate P?”

There are facts that I find much easier to incorporate when thinking in the latter framework, such as “a work tape inhabitant knows nothing about the behavior of its Turing machine’s output tape, except that it has relative simplicity given the world that it knows.” (If it believes that its conscious existence depends on its Turing machine never having output a bit that differs from a data stream in a base world, it will infer other things about its output tape, but you seem to disagree that it would make that assumption, and I’m fine to go along with that). (If the fact were much simpler—“a work tape inhabitant knows nothing about the behavior of its Turing machine’s output tape” full stop—I would feel fairly comfortable in either framework.)

If it is the case that, for any action that a work tape inhabitant takes, the following is unchanged: [the probability that *it* (anthropically) ascribes to “its” Turing machine printing an output that obeys predicate P after it takes that action], then, no matter its choice of action, the probability under the universal prior that the output obeys predicate P is also unchanged.

What if the work tape inhabitant only cares about the output when the universal prior is being used for important applications? Let Q be the predicate [P and “the sequence begins with a sequence which is indicative of important application of the universal prior”]. The same logic that applies to P applies to Q. (It feels easier to talk about probabilities of predicates (expectations of Boolean functions) rather than expectations of general functions, but if we wanted to do importance weighting instead of using a strict predicate on importance, the logic is the same).

Writing about the fact I described above about what the inhabitants believe about their Turing machine’s output has actually clarified my thinking a bit. Here’s a predicate where I think inhabitants could expect certain actions to make it more likely that their Turing machine output obeys that predicate. “The output contains the string [particular 1000 bit string]”. They believe that their world’s output is simple given their world’s dynamics, so if they write that 1000 bit string somewhere, it is more likely for the predicate to hold. (Simple manipulations of the string are nearly equally more likely to be output).

So there are *severe* restrictions on the precision with which they can control even low-probability changes to the output, but not total restrictions. So I wasn’t quite right in describing it as a max-entropy situation. But the one piece of information that distinguishes their situation from one of maximum uncertainty about the output is very slight. So I think it’s useful to try to think in terms of how they get from that information to their goal for the output tape.

I was describing the situation where I wanted to maximize the probability that the output of our world obeys the predicate: “this output causes decision-maker simulators to believe that virtue pays”. I think I could very slightly increase that probability by trying to reward virtuous people around me. Consider consequentialists who want to maximize the probability of the predicate “this output causes simulator-decision-makers to run code that recreates us in their world”. They want to make the internals of their world such that there are simple relative descriptions for outputs for which that predicate holds. I guess I think that approach offers extremely limited and imprecise ability to deliberately influence the output, no matter how smart you are.

If an approach has very limited success probability, (i.e. very limited sway over the universal prior), they can focus all their effort on mimicking a few worlds, but then we’ll probably get lucky, and ours won’t be one of the ones they focus on.

From a separate recent comment,

But now that we've learned that physics is the game of life, we can make *much* better guesses about how to build a dataset so that a TM could output it. For example, we can:

- Build the dataset at a large number of places.
- [etc.]
...

I challenge you to find *any* plausible description of a rule that outputs the bits observed by a camera, for which I can't describe a simpler extraction rule that would output some set of bits controlled by the sophisticated civilization.

You're comparing the probability of one of these many controlled locations driving the output of the machine to the probability that a random camera on an earth-like Turing machine drives the output. Whereas it seems to me like the right question is to look at the absolute probabilities that one of these controlled locations drives the output. The reason is that what they attempt to output is a mixture over many sequences that a decision-maker-simulator might want to know about. But if the sequence we're feeding in is from a camera on earth, then their antics only matter to the extent that their mixture puts weight on a random camera on earth. So *they* have to specify the random camera on an earth-like Turing machine too. They're paying the same cost, minus any anthropic update. So the costs to compare are roughly [- log prob. successful control of output + bits to specify camera on earth - bits saved from anthropic update] vs. [bits to specify camera on earth - bits saved from directly programmed anthropic update]. This framing seems to imply we can cross off [bits to specify camera on earth] from both sides.

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-21T16:31:13.329Z · LW · GW

Okay, now suppose they want the first N bits of the output of their Turing machine to obey predicate P, and they assign that a value of 100, and they assign a value of 0 to any N-bit string that does not obey predicate P. And they don't value anything else. If some actions have a higher value than other actions, what information about the output tape dynamics are they using, and how did they acquire it?

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-21T16:05:59.808Z · LW · GW

Just look at the prior--for any set of instructions for the work tape heads of the Turing machine, flipping the "write-1" instructions of the output tape with the "write-0" instructions gives an equally probable Turing machine.
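
The symmetry can be checked by brute force in a toy model. The sketch below assumes four output instructions, each independently and uniformly one of three actions, and a fixed (hypothetical) order in which the machine visits them. The visiting order stands in for the work tape dynamics, which do not depend on what is written to the output tape.

```python
from itertools import product

ACTIONS = ("write 0", "write 1", "do nothing")

def first_output_bit(output_rule, visited):
    """Return the first bit written when instructions are visited in order, or None."""
    for index in visited:
        if output_rule[index] != "do nothing":
            return output_rule[index][-1]  # "0" or "1"
    return None

# Hypothetical fixed sequence of instruction indices fixed by the work tape dynamics.
visited = [2, 0, 3, 1, 2]

counts = {"0": 0, "1": 0, None: 0}
for rule in product(ACTIONS, repeat=4):
    counts[first_output_bit(rule, visited)] += 1

print(counts)
```

Swapping "write 0" and "write 1" maps each rule that emits a 0 first to an equally likely rule that emits a 1 first, so the two counts match whatever visiting order is chosen.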

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-21T16:01:30.015Z · LW · GW

Suppose they know the sequence that actually gets fed to the camera.

If you're saying that they know their Turing machine has output x so far, then I 100% agree. What about in the case where they don't know?

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-21T15:51:50.592Z · LW · GW

If I flip a coin to randomize between two policies, I don't see how that mixed policy could produce more value for me than the base policies.

(ETA: the logical implications about the fact of my randomization don't have any weird anti-adversarial effects here).

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-21T15:45:35.394Z · LW · GW

If these consequentialists ascribed a value of 100 to the next output bit being 1, and a value of 0 to the next output bit being 0, and they valued nothing else, would you agree that all actions available to them have identical expected value under the distribution over Turing machines that I have described?

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-21T15:35:54.504Z · LW · GW

With randomization, you reduce the cost and the upside in concert. If a pair of shoes costs $100, and that's more than I'm willing to pay, I could buy the shoes with probability 1%, and it will only cost me $1 in expectation, but I will only get the shoes with probability 1/100.
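
The shoe example in code, as a minimal sketch. The $80 valuation is a hypothetical number chosen only to represent "more than I'm willing to pay".

```python
def randomized_purchase(price, value, probability):
    """Expected cost, expected benefit, and expected net value of buying with some probability."""
    expected_cost = probability * price
    expected_benefit = probability * value
    return expected_cost, expected_benefit, expected_benefit - expected_cost

# Shoes cost $100; assume (hypothetically) they are worth $80 to the buyer.
cost, benefit, net = randomized_purchase(price=100.0, value=80.0, probability=0.01)
_, _, net_if_certain = randomized_purchase(price=100.0, value=80.0, probability=1.0)

print(cost, benefit, net, net_if_certain)
```

The net value scales linearly with the purchase probability, so randomizing shrinks a bad deal but never changes its sign.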

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-21T15:31:40.211Z · LW · GW

It's definitely not too weird a possibility for me. I'm trying to reason backwards here--the best strategy available to them *can't* be effective in expectation at achieving whatever their goals are with the output tape, because of information-theoretic impossibilities, and therefore, any given strategy will be that bad or worse, including randomization.

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-21T13:41:05.592Z · LW · GW

We can get back to some of these points as needed, but I think our main thread is with your other comment, and I'll resist the urge to start a long tangent about the metaphysics of being "simulated" vs. "imagined".

**michaelcohen (cocoa)** on Response to "What does the universal prior actually look like?" · 2021-05-21T13:28:10.745Z · LW · GW

So we end up with some leading hypotheses about the Turing machine we are running on, the history that gave rise to us, and the output rule used by that Turing machine.

I feel like this story has run aground on an impossibility result. If a random variable’s value is unknowable (but its distribution is known) and an intelligent agent wants to act on its value, and they randomize their actions, the expected log probability of them acting on the true value cannot exceed the negative entropy of the distribution, no matter their intelligence. (And if they’re wrong about the r.v.’s distribution, they do even worse). But let’s assume they are correct. They know there are, say, 80 output instructions (two work tapes and one input tape, and a binary alphabet, and 10 computation states). And each one has a 1/3 chance of being “write 0 and move”, “write 1 and move”, or “do nothing”. Let’s assume they know the rules governing the other tape heads, and the identity of the computation states (up to permutation). Their belief distribution is (at best) uniform over these 3^80 possibilities. Is computation state 7 where most of the writing gets done? They just don’t know. It doesn’t matter if they’ve figured out that computation state 7 is responsible for the high-level organization of the work tapes. It’s totally independent. Making beacons is like assuming that computation state 7, so important for the dynamics of their world, has anything special to do with the output behavior. (Because what is a beacon if not something that speaks to *internally* important computation states?)
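
For scale, here is a short Python sketch of the numbers involved. It reads the 80 as 2^3 symbol combinations across three tapes times 10 states (my reading of the parenthetical, not something stated explicitly), and it illustrates Gibbs' inequality, the standard result bounding the expected log probability by minus the entropy, on a single three-way instruction.

```python
import math

# Assumed reading: 3 tapes with a binary alphabet give 2**3 symbol combinations per state.
num_output_instructions = 2 ** 3 * 10
choices_per_instruction = 3
entropy_bits = num_output_instructions * math.log2(choices_per_instruction)

def expected_log2_prob(true_dist, guess_dist):
    """Expected log2 probability the guesser assigns to the true outcome."""
    return sum(p * math.log2(q) for p, q in zip(true_dist, guess_dist) if p > 0)

uniform = [1 / 3, 1 / 3, 1 / 3]
overconfident = [0.8, 0.1, 0.1]  # hypothetical guesser who favours one action

matched = expected_log2_prob(uniform, uniform)            # equals minus the entropy
mismatched = expected_log2_prob(uniform, overconfident)   # strictly worse

print(num_output_instructions, round(entropy_bits, 1), matched, mismatched)
```

A uniform belief over the 3^80 output rules carries roughly 127 bits of entropy, and any guesser who departs from the true distribution only does worse in expected log probability.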

That’s all going along with the premise that when consequentialists face uncertainty, they flip a coin, and adopt certainty based on the outcome. So if they think it’s 50/50 whether a 0 or a 1 gets output, they flip a coin or look at some tea leaves, and then act going forward as if they just learned the answer. Then, it only costs 1 bit to say they decided “0”. But I think getting consequentialists to behave this way requires an intervention into their proverbial prefrontal cortices. If these consequentialists were playing a bandit game, and one arm gave a reward of 0.9 with certainty, and the other was 50/50 between a reward of 0 or 1, they obviously don’t flip a coin to decide whether to act as if it’s really reward 1 or reward 0. (I understand that Thompson sampling is a great strategy, but only when your uncertainty is ultimately empirically resolvable, and uncertainty about output behavior is not).
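
The bandit comparison can be worked through directly. This sketch assumes the coin flip is independent of the risky arm's actual payout, which is the point of calling the uncertainty unresolvable. The names are hypothetical.

```python
safe_arm_reward = 0.9
risky_arm_outcomes = {0.0: 0.5, 1.0: 0.5}  # reward -> probability

risky_arm_expected = sum(r * p for r, p in risky_arm_outcomes.items())

# Policy 1: pick the arm with the higher expected reward.
ev_policy_value = max(safe_arm_reward, risky_arm_expected)

# Policy 2: flip a coin, "adopt certainty" about the risky arm's reward, then act on it.
def value_if_believing(believed_risky_reward):
    """True expected reward when acting as if the risky arm surely pays the believed amount."""
    chooses_risky = believed_risky_reward > safe_arm_reward
    return risky_arm_expected if chooses_risky else safe_arm_reward

coin_flip_policy_value = sum(
    p * value_if_believing(r) for r, p in risky_arm_outcomes.items()
)

print(ev_policy_value, coin_flip_policy_value)
```

Half the time the coin-flipper pulls the risky arm and earns 0.5 in expectation, so the policy is worth 0.7 against 0.9 for simply pulling the safe arm.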

I think you’ll get the impression that I’m being miserly with bits here. And your relative profligacy comes from the fact that you expect they’ll make it up easily in the anthropic update. But you’ll ultimately have to claim that the anthropic update is more cheaply specified as a natural product of consequentialist civilization than through direct encoding. And if every step of the way in your story about how consequentialists arrive at this behavior, I make the point that this is not sensible goal-oriented behavior, and you make the point that it only takes a few bits to make it be what they would do anyway, then if you put it all together, I’m not just haggling over bits. If you put it all together, it looks to me like the consequentialists’ consequentialism is doing little to none of the work; every step of the way, reasonable behavior is being overridden, a few bits at a time. So then I ultimately claim, this anthropic update is not most parsimoniously described as “that thing that consequentialists sometimes produce” because it’s just not actionable for any consequentialists with reasonable epistemics.

In a different part of the story, you claim that if something composes 1% of what consequentialists value, we can assume that they flip 6 and a half coins, and with 1% probability they act as if that’s the only thing they value.

So we're interested in investing some tiny fraction of our resources in escaping from such simulations. The most important way to minimize the cost is by reducing probability. For example, if we decide to do something only with 1% probability (and in the other 99% of worlds focus exclusively on other things) then we only decrease our log probability by 7 bits. I actually doubt we'd waste those 7 bits, but it's worth noting how cheap this kind of breakout attempt could be.

This seems to me to be another case of supplanting consequentialists’ instrumental rationality with randomness. These are key parts of the story. The part where they make a wild guess about the output of their world, and the part where they decide to pursue it in the first place are both places where reasonable goal-oriented behavior is being *replaced* with deference to the descriptive and normative wisdom of tea leaves; this is not just specifying one of the things they sometimes do naturally with small-but-non-negligible probability. It would be an unusual model of agency which claimed that: for a utility function $U = \sum_i w_i U_i$, we have $\pi^*_U = \sum_i w_i \pi^*_{U_i}$ (writing $\pi^*_X$ for the optimal policy for $X$). And it would be an even more unusual model of agency which claimed that: for a world $\nu = \sum_i w_i \nu_i$, we have $\pi^*_\nu = \sum_i w_i \pi^*_{\nu_i}$. Even within large multiplicative fudge factors. I feel like I need the assistance of a Dutch bookie here.

Of course we don't know for sure that this is the particular way that the TM works, since we can't infer the output rules from our observations, and probably it isn't. But that's just like saying that any given TM *probably* doesn't output the pixels of your camera. The point is that we are doing things that would cause a tiny fraction of the TMs containing us to output good sequences, and that's going to be a way higher fraction than those that happen to output the pixels of cameras

I don’t think it’s just like saying that. I think I have argued that the probability that consequentialists act on the belief that camera 12,041 has a direct line to the output tape is *smaller* than the probability that it is actually true. Likewise for something that appears “beacon-like” to the world’s residents. Given a state of total ignorance about the means by which they can affect the output tape, their guesses can be no better than the true prior distribution over what in the world has a direct line to the output tape. (This is contra: "the fact that they are trying puts them at a massive advantage" from your other comment; intelligence and effort don't work in a max-ent setting. With maximum entropy beliefs about the output channel, those silly no free lunch theorems of optimization do actually apply.) And then also given that ignorance, there is much less value in taking a stab in the dark. In general, with probabilistic mixtures over worlds $\nu = \sum_i w_i \nu_i$, $\pi^*_\nu \neq \sum_i w_i \pi^*_{\nu_i}$ (with $\pi^*_X$ the optimal policy for $X$).

**michaelcohen (cocoa)**on Formal Inner Alignment, Prospectus · 2021-05-20T22:27:24.102Z · LW · GW

A few quick thoughts, and I'll get back to the other stuff later.

To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about.

That's good to know. To clarify, I was only saying that spending 10 hours on the project of applying it to modern ML would not be enough time to deem it a fruitless path. If after 1 hour, you come up with a theoretical reason why it fails on its own terms--i.e. it is not even a theoretical solution--then there is no bound on how strongly you might reasonably conclude that it is fruitless. So this kind of meta point I was making only applied to your objections about slowdown in practice.

a "theoretical solution" to the realizability problem at all.

I only meant to claim I was just doing theory in a context that lacks the realizability problem, not that I had solved the realizability problem! But yes, I see what you're saying. The theory regards a "fair" demonstrator which does not depend on the operation of the computer. There are probably multiple perspectives about what level of "theoretical" that setting is. I would contend that in practice, the computer itself is not among the most complex and important causal ancestors of the demonstrator's behavior, so this doesn't present a huge challenge for practically arriving at a good model. But that's a whole can of worms.

My main worry continues to be the way bad actors have control over an io channel, rather than the slowdown issue.

Okay good, this worry makes much more sense to me.

**michaelcohen (cocoa)**on Formal Inner Alignment, Prospectus · 2021-05-19T14:43:00.315Z · LW · GW

I felt I had remained quiet about my disagreement with you for too long

Haha that's fine. If you don't voice your objections, I can't respond to them!

I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is that there's no way to apply this in practice". Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network's epistemic uncertainty, it's potentially a hard problem! But it also seems like a clear problem, maybe even tractable. See Taylor (2016) section 2.1--inductive ambiguity identification. If you were convinced that AGI will be made of neural networks, you could say that I have reduced the problem of inner alignment to the problem of diverse-model-extraction from a neural network, perhaps allowing a few modifications to training (if you bought the claim that the consensus algorithm is a theoretical solution). I have never tried to claim that analogizing this approach to neural networks will be easy, but I don't think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks; my ideal situation would be that I figure out how to do something in theory, and then 50 people try to work on analogizing it to state-of-the-art AI (there are many more neural network experts out there than AIXI experts). My less ideal situation is that people provisionally treat the theoretical solution as a dead end, right up until the very point that a practical version is demonstrated.

If it seemed like solving inner alignment in theory was easy (because allowing yourself an agent with the wherewithal to consider "unrealistic" models is such a boon), and there were thus lots of theoretical solutions floating around, any given one might not be such a strong signal: "this is the place to look for realistic solutions". But if there's only one floating around, that's a very strong signal that we might be looking in a fundamental part of the solution space. In general, I think the most practical place to look for practical solutions is near the best theoretical one, and 10 hours of unsuccessful search isn't even close to the amount of time needed to demote that area from "most promising".

I think this covers my take on a few of your points, but some of your points are separate. In particular, some of them bear on the question of whether this really is an idealized solution in the first place.

With Ted out of the running, the top 100 hypotheses are now all malign, and can coordinate some sort of treacherous turn.

I think the question we are discussing here is: "yes, with the realizability assumption, existence of a benign model in the top set is substantially correlated over infinite time, enough so that all we need to look at is the relative weight of malign and benign models, BUT is the character of this correlation fundamentally different without the realizability assumption?" I don't see how this example makes that point. If the threshold of "unrealistic" is set in such a way that "realistic" models will only know most things about Sally, then this should apply equally to malign and benign models alike. (I think your example respects that, but just making it explicit). However, there should be a benign and malign model that knows about Sally's affinity for butter but not her allergy to flowers, and a benign and a malign model that knows the opposite. It seems to me that we still end up just considering the relative weight of benign and malign models that we might expect to see.

(A frugal hypothesis generating function instead of a brute force search over all reasonable models might miss out on, say, the benign version of the model that understands Sally's allergies; I do not claim to have identified an approach to hypothesis generation that reliably includes benign models. That problem could be one direction in the research agenda of analogizing this approach to state-of-the-art AI. And this example might also be worth thinking about in that project, but if we're just using the example to try to evaluate the effect of *just* removing the realizability assumption, but not removing the privilege of a brute search through reasonable models, then I stand by the choice to deem this paragraph parenthetical).

Why will there be one best? That's the realizability assumption. There is not necessarily a unique model with lowest bayes loss. Another way of stating this is that Bayesian updates lack a convergence guarantee; hypotheses can oscillate forever as to which is on top.

Yeah I was thinking that the realistic setting was a finite length setting, with the one best being the best at the end. (And if it is best at the end, you can put a lower bound on how low its posterior weight ever was, since it's hard to recover from having vanishingly small weight, and then alpha just needs to be set to include that). But fair enough to be interested in an *infinite* lifetime with a finite model class that does not include the truth. So yeah, a model's ranking can oscillate forever, although I think intelligent systems won't really do this in practice? I think in an infinite lifetime, it is reasonable in practice to assume that if something is generally intelligent, it has some sort of ability to generate new hypotheses over time. And if you have two hypotheses about Sally, one which explains her relationship to butter, and one which explains her relationship to flowers, it's pretty easy to generate another that combines these. So I'm not sure how instructive it is to consider the finite model class/infinite horizon Bayesian sequence prediction as a model for state-of-the-art AI.

So, what, the human has to act like they're driving the car forever, even though the AI autopilot is really good?

The suggestion I make later is that a few timesteps in advance, the human is told to start acting like they're driving the car, but they're not controlling it yet, and then comes the timestep when autopilot actually defers to them, and that's the behavior that gets recorded. Your response is

Is this procedure supposed to strike me as so airtight that an inner optimizer could never exploit it?

That proposal was making a different point--it was refuting exactly the concern that the human would have to act like they're driving the car forever. This protocol just defends active imitation learning in general as potentially viable for the operation of heavy machinery. Assume the demonstrator is sitting in a driving simulator, and all the relevant info shows up on their screens; this allows a small team to be the demonstrators for many autopilots simultaneously. It's not a question of whether an inner optimizer could exploit this protocol by finding the most inconvenient time for the human to be called in, and then querying the human--it's much worse than that! If there's a bad time for the human driver to be called in, then the autopilot will fail at that time regardless; the inner optimizer doesn't even have to arrange an actual query. So there's nothing to "exploit" really. And defending against the "much worse" version, I do think that the policy of a human driver with 5 seconds of acquaintance with the current state does not crash the car at a rate significantly higher than a real human driver. Also, it may be worth saying that I don't think we really need general intelligence to operate heavy machinery at human level, so I'm not sure the heavy machinery case is very important to consider in the first place.

In terms of evaluating this on theoretical terms, the feedback call button *is an output channel that malign stuff has access to*. This feels kind of like "case closed" to me; I take it as a given that a sufficiently intelligent malign actor could exploit this.

You said you'd continue to think about the points I made about why I don't have the same reaction to this, so we can come back to those later. But stepping back, I think I want to make a point about the information loss you impose on your beliefs when applying the security mindset or Murphy's Law. A malign model gets to say "query now" once! And then it is removed from the posterior because of its wrong prediction. That is not many bits for an output channel. But... yes, it is an output channel. And if you insist that all output channels for malign stuff transmit world-ending signals, just like all aquariums will leak, I agree that that is sometimes a useful perspective to adopt and run with, but the exercise of making actual probability estimates about events is potentially different from exercising a security mindset. I mentioned information loss. With a security mindset, there is no point in distinguishing a malign agent with a 20 bit output channel from one with an unbounded output channel. You just close the case on both. But we really might live in a world where the former is existentially safe, and the latter is not--I mean that even seems likely to me. Even from my perspective, where I don't think a 20 bit output channel is dangerous, I understand the merit in disregarding that belief as the motivation for a research agenda, but I do think that's a whole different sort of "epistemic status". So when you say "I take it as a given that a sufficiently intelligent malign actor could exploit this" that makes sense to me as an *action*, but not as a proposition. (I mean, I think it's incorrect, so I certainly don't think it's self-evidently true).

I guess I can expand on the object level too. Like, take a moment to really stare at 01110010101011010000, and imagine that you take it to be a signal from some deeply wise agent that arranged for a query at timestep number 469,712. Suppose further that you already think this deeply wise agent is not malign, but rather is much maligned! (So it doesn't even need to spend the first few bits enthralling you). What do you do now? Or maybe timestep 469,712 doesn't mean that bitstring--maybe it means "this is a very meaningful world-state, and of similar world-states, the fact that this one was chosen tells me I should run program X on a supercomputer, so it can engage in some good, clean world-optimization". Look around you right now--if someone told you "now is important, so think about that, and go run a program", what program do you produce? Am I taking a huge risk even asking that, because there's a precise time such that if I did, you'd run that program and it would end the world?

**michaelcohen (cocoa)**on Formal Inner Alignment, Prospectus · 2021-05-16T18:52:40.732Z · LW · GW

To state the least of our problems first: this requires a 100x slowdown in comparison with the state-of-the-art deep learning (or whatever) we're layering the consensus algorithm on top of

I think you’re imagining deep learning as a MAP-type approach--it just identifies a best hypothesis and does inference with that. Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for MAP can be cut for the consensus algorithm too. Starting with 1), the bulk of the work for either the consensus algorithm or a MAP approach is computing the posterior to determine which model(s) is(are) best. In an analogy to neural networks, claiming a large slowdown would be like saying most of the work comes from using the model (the forward pass) rather than arriving at the model (the many forward and backward passes in training). Regarding 2), state-of-the-art-type AI basically assumes approximate stationarity when separating a training phase from a test/execution phase. This is cutting a huge corner, and it means that when you think of a neural network running, you mostly think about it using the hypothesis that it has already settled on. But if we compare apples to apples, a consensus algorithm can cut the same corner to some extent. Neither a MAP algorithm nor a consensus algorithm is any better equipped than the other to, say, update the posterior only when the timestep is a power of two. In general, training (be it SGD or posterior updating) is the vast bulk of the work in learning. To select a good hypothesis in the first place you will have already had to consider many more; the consensus algorithm just says to keep track of the runners-up.
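To make the cost comparison concrete, here is a minimal sketch under stated assumptions: a finite hypothesis class, a hypothetical `posterior` helper, and a simplified top-set rule (keep every hypothesis whose weight is at least `alpha` times the best weight). That threshold rule is an illustrative stand-in, not the paper's exact definition. Both selection rules read off the same posterior, so the extra work for the consensus-style rule is only in keeping the runners-up.

```python
def posterior(prior, likelihoods):
    """Bayes update over a finite hypothesis class (weights are normalized)."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


def map_hypothesis(post):
    """MAP: keep only the single highest-weight hypothesis."""
    return max(post, key=post.get)


def top_set(post, alpha):
    """Simplified top set (an assumption of this sketch): every hypothesis
    whose posterior weight is at least alpha times the best weight."""
    best = max(post.values())
    kept = [h for h, w in post.items() if w >= alpha * best]
    return sorted(kept, key=lambda h: -post[h])


# Illustrative numbers only.
prior = {"a": 0.5, "b": 0.3, "c": 0.2}
likelihoods = {"a": 0.9, "b": 0.8, "c": 0.1}
post = posterior(prior, likelihoods)  # the shared, expensive step
best = map_hypothesis(post)
runners_up_included = top_set(post, alpha=0.5)
```

Lowering `alpha` in this sketch widens the set that is kept, which matches the way alpha is talked about elsewhere in the thread.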

Third, the consensus algorithm requires a strong form of realizability assumption, where you not only assume that our Bayesian space contains the true hypothesis, but furthermore, *that it's in the top 100* (or whatever number we choose). This hypothesis has to be really good: we have to think that malign hypotheses *never* out-guess the benign hypothesis. Otherwise, there's a chance that we eliminate the good guy at some point (allowing the bad guys to coordinate on a wrong answer). But this is unrealistic! The world is big and complex enough that no realistic hypothesis has all the answers.

I don’t understand what out-guess means. But what we need is that the malign hypotheses don’t have substantially higher posterior weight than the benign ones. As time passes, the probability of this happening is not independent. The result I show about the probability of the truth being in the top set applies to all time, not any given point in time. I don’t know what “no realistic hypothesis has all the answers” means. There will be a best “realistic” benign hypothesis, and we can talk about that one.

Michael Cohen seems to think that restricting to imitation learning makes the realizability assumption realistic

Realistic in theory! Because the model doesn’t need to include the computer. I do not think we can actually compute every hypothesis simpler than a human brain in practice.

When you go from an idealized version to a realistic one, all methods can cut corners, and I don’t see a reason to believe that the consensus algorithm can’t cut corners just as well. Realistically, we will have some hypothesis-proposing heuristic, strong enough to identify models one of which is accurate enough to generate powerful agency. This heuristic will clearly cast a wide net (if not, how would it magically arrive at a good answer? Its internals would need some hypothesis-generating function). Rather than throwing out the runners-up, the consensus algorithm stores them. The hypothesis generating heuristic is another attack surface for optimization daemons, and I won’t make any claims for now about how easy or hard it is to prevent such a thing.

to apply this to something like deep learning, we need to think that each run has an *independent* chance of creating safe AGI

Evan and I talked along these lines for a bit. My basic position is that if “local search” is enough to get to general intelligence, our algorithms will be searching in spaces (or regions) where diverse hypotheses are *close*. Diverse hypothesis generation is just crucial for general intelligence. I do not advocate training GPT-N with 10^100 different initializations. I don’t think you have to, and I don’t think it would help much.

Fourth, the consensus algorithm *still gives inner optimizers control of when the system asks for more feedback*. This in itself could be catastrophic! For example, it could ask at the worst times, when it anticipates the human feedback to be of poor quality, or when asking itself has catastrophic consequences. A silly example would be causing a self-driving car to ask for feedback right when it needs to enact difficult evasive maneuvers to avoid a crash.

It’s imitation learning so whatever the person would do when stepping in is what the algorithm learns to do when they don’t. If the human action would be of poor quality if the human were called upon to take the action, then all the reasonable models, benign ones included, will predict that poor action, and the poor action will be taken regardless of whether the human was brought into the loop. If a person would be unable to avoid a crash when stepping in at the last minute, the imitation learner will crash the car too! If 99 hypotheses are correctly predicting the human action, then the 100th hypothesis can correctly predict it too, and the human action is executed, or it can err deliberately, in which case the human is called and the human action is executed anyway. You might have to modify this protocol a bit before using it to learn human policies for operating heavy machinery.
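The "99 agree, the 100th can agree or err" argument can be sketched in a few lines. This is a hypothetical simplification, not the paper's equations: each model in the top set is assumed to output a probability for every demonstrator action, the imitator takes an action with the minimum probability any top model assigns it, and the leftover probability mass goes to querying the demonstrator. The function name and the `brake`/`swerve` actions are made up for illustration.

```python
def imitator_distribution(top_models, actions):
    """Return (probability of taking each action unaided, probability of querying).

    top_models: list of dicts mapping action -> predicted demonstrator probability.
    The min rule used here is an assumption of this sketch.
    """
    act_probs = {a: min(model[a] for model in top_models) for a in actions}
    query_prob = max(1.0 - sum(act_probs.values()), 0.0)
    return act_probs, query_prob


actions = ["brake", "swerve"]
correct = {"brake": 1.0, "swerve": 0.0}    # predicts what the human would do
erring = {"brake": 0.0, "swerve": 1.0}     # deliberately wrong prediction

# All 100 models agree: the human action is executed without a query.
agree_probs, agree_query = imitator_distribution([correct] * 100, actions)

# 99 agree and one errs: no action has consensus, so the human is queried
# and the human action is executed anyway.
split_probs, split_query = imitator_distribution([correct] * 99 + [erring], actions)
```

In either case the action that ends up being executed is the demonstrator's; the dissenting model only controls whether a query happens.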

Fifth, although (as Michael Cohen notes) the *frequency* of asking for feedback will rapidly diminish, the learning algorithm converges to the correct hypothesis extremely slowly… We have to face the fact that it might require human feedback at any point in the future.

Yeah this feels like a small cost to me. One person can be doing this for many instances at once. If you want to mimic a human that has been attentive for a given period of time (because you want to know how a human drives a car, having already gotten their bearings previously, rather than how a human drives a car having just been transported into it) you can know a few timesteps in advance if there is any likely circumstance in which you will need to query the demonstrator, and you can use the lead time to get the demonstrator acquainted. The query complexity will go up of course because there will be false alarms, but this is a totally surmountable problem for active learning.

The fourth point [controlling when to ask for more feedback] really seems like the deal-breaker. Malign mesa-optimizers could intelligently leverage this to communicate with the outside world, manipulate the training process, etc.

A malign mesa-optimizer has one moment to err, and then it will be discarded from the posterior. There’s not much it could say to us with a well-timed "now!". If multiple mesa-optimizers coordinate around a message, any mesa-optimizer not on board with this mission could disrupt the message. If you like, we could add a 1/million chance of querying to every timestep. Finally, with butterfly-effects, I don’t think that even a billion well-timed asks for help could precipitate human extinction. This is a bit all over the map, and this may be the point you want to press, but if you do, I think it is worth logging that if this is the only attack surface left, we shouldn't forget to put a stake in the ground marking "lots of progress". If you believe that a single malign world-model could derail us if we use a Solomonoff predictor, but a consensus predictor would only be derailed by a supermajority of malign world-models, aligned with each other, that is a qualitative difference.
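A small sketch of "one moment to err, then discarded", assuming deterministic models for simplicity: when a query reveals the demonstrator's action, each model's weight is multiplied by the probability it assigned to that action, so a model that deliberately predicted something else drops to zero weight. The helper name and the model labels are hypothetical.

```python
def update_on_query(weights, predictions, observed_action):
    """Bayes update after the demonstrator's action is observed.

    weights: model name -> current posterior weight.
    predictions: model name -> dict of action -> predicted probability.
    """
    unnormalized = {
        m: w * predictions[m].get(observed_action, 0.0) for m, w in weights.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no model gave the observed action positive probability")
    return {m: w / total for m, w in unnormalized.items()}


weights = {"benign": 0.4, "malign": 0.6}
predictions = {
    "benign": {"brake": 1.0},   # correct prediction
    "malign": {"swerve": 1.0},  # deliberate error that triggers the query
}
after_query = update_on_query(weights, predictions, observed_action="brake")
```

With stochastic models the erring model would lose weight in proportion to how little probability it gave the observed action, instead of being removed outright.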

My third and final example: in one conversation, someone made a claim which I see as "exactly wrong": that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model *plus* other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine.

The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!

Maybe this was someone else, but it could have been me. I think MAP probably does solve the inner alignment problem in theory, but I don’t expect to make progress resolving that question, and I’m interested in hedging against being wrong. Where you say, “We know of no way of doing that” I would say, “We know of ways that might do that, but we’re not 100% sure”. On my to-do list is to write up some of my disagreements with Paul’s original post on optimization daemons in the Solomonoff prior (and maybe with other points in this article). I don’t think it’s good to argue from the premise that a problem is worth taking seriously, and then see what follows from the existence of that problem, because a problem can exist with 10% probability and be worth taking seriously, but one might get in trouble embedding its existence too deeply in one’s view of the world, if it is still on balance unlikely. That’s not to say that most people think Paul’s conclusions are <90% likely, just that one might.

**michaelcohen (cocoa)**on Formal Inner Alignment, Prospectus · 2021-05-16T18:40:20.204Z · LW · GW

I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors)

Agree.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-03-04T16:51:15.287Z · LW · GW

There will be plenty of functions that have fewer bits in their encoding than the real function used by the demonstrator.

I don't think this is a problem. There will be plenty of them, but when they're wrong they'll get removed from the posterior.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-03-04T16:46:29.531Z · LW · GW

A policy outputs a distribution over , and equations 3 and 4 define what this distribution is for the imitator. If it outputs (0, a), that means and and and if it outputs (1, a), that means and . When I say

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,

that's just describing the difference between equations 3 and 4. Look at equation 4 to see that when , the distribution over the action is equal to that of the demonstrator. So we describe the behavior that follows as "deferring to the demonstrator". If we look at the distribution over the action when , it's something else, so we say the imitator is "picking its own action".

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator

or picking one of the other actions???

The 0 means the imitator is picking the action, and the means it's not picking another action that's not .

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-28T11:32:08.484Z · LW · GW

What's the distinction between training and deployment when the model can always query for more data?

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-28T11:30:01.038Z · LW · GW

It may be helpful to point to specific sections of such a long paper.

(Also, I agree that a neural network trained with that reward could produce a deceptive model that makes a well-timed error.)

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-28T11:14:07.284Z · LW · GW

it certainly should have to consider different models of what the next transition will be like.

Yeah I was agreeing with that.

even though the Q learner is *trained* using its own point estimate of the next state, it isn't, at inference time, given access to that point estimate.

Right, but one thing the Q-network, in its forward pass, is trying to *reproduce* is the point estimate of the Q-value of the next state (since it doesn't have access to it). What it *isn't* trying to reproduce, because it isn't trained that way, is multiple models of what the Q-value might be at a given possible next state.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-27T11:17:01.048Z · LW · GW

I interpreted this bit as talking about RL

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to *find an agent* which does do science.

But taking us back out of RL, in a wide neural network with selective attention that enables many qualitatively different forward passes, gradient descent seems to be training the way different models get proposed (i.e. the way attention is allocated), since this happens in a single forward pass, and what we're left with is a modeling routine that is heuristically considering (and later comparing) very different models. And this should include any model that a human would consider.

I think that is the main thread of our argument, but now I'm curious if I was totally off the mark about Q-learning and policy gradient.

but overall I think these sorts of differences are pretty minor and shouldn't affect whether these approaches can reach general intelligence or not.

I had thought that maybe since a Q-learner is trained as if the cached point estimate of the Q-value of the next state is the Truth, it won't, in a single forward pass, consider different models about what the actual Q-value of the next state is. At most, it will consider different models about what the very next transition will be.

a) Does that seem right? and b) Aren't there some policy gradient methods that don't face this problem?

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-26T15:49:44.009Z · LW · GW

So, if I understand the paper right, if was still in the set of top policies at time , the agent would not take the action to kill the demonstrator, but ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous turn policies out of the top set.

This is exactly right.

But I am also reading in the abstract of the paper that

[...] queries to the demonstrator rapidly diminish in frequency

The reason this is consistent is that queries will become infrequent, but they will still be well-timed. The choice of whether to query *depends* on what the treacherous models are doing. So if the treacherous models wait for a long time, then we will have a long time where no queries are needed, and as soon as they decide to be treacherous, we thwart that plan with a query to the demonstrator. So it does not imply that

over time, it is likely that might disappear from the top set
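A toy simulation of "infrequent but well-timed" queries, under heavy simplifying assumptions: one honest model that always predicts the demonstrator's action, one treacherous model that predicts it correctly until a chosen step, and a query whenever the two disagree. All names and the `safe`/`unsafe` actions are hypothetical.

```python
def query_times(horizon, betrayal_step):
    """Return the timesteps at which the demonstrator gets queried."""
    traitor_alive = True
    queries = []
    for t in range(horizon):
        predictions = ["safe"]  # the honest model matches the demonstrator
        if traitor_alive:
            predictions.append("unsafe" if t == betrayal_step else "safe")
        if len(set(predictions)) > 1:
            # Disagreement: defer to the demonstrator, who acts "safe".
            queries.append(t)
            # The treacherous model mispredicted, so it leaves the top set.
            traitor_alive = False
    return queries


late = query_times(horizon=1000, betrayal_step=700)
never = query_times(horizon=1000, betrayal_step=2000)
```

However long the treacherous model waits, the single query lands on exactly the step where it deviates, and no queries are spent while it is still behaving.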

Does this thought experiment look reasonable or have I overlooked something?

Yep, perfectly reasonable!

What about the probability that is still in the set of top policies at time ?

If we set small enough, we can make it arbitrarily likely that never leaves the set of top policies.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-26T15:38:10.415Z · LW · GW

That's possible. But it seems like way less of a convergent instrumental goal for agents living in simulated world-models. Both options--our world optimized by us and our world optimized by a random deceptive model--probably contain very little of value as judged by agents in another random deceptive model.

So yeah, I would say some models would think like this, but I would expect the total weight on models that do to be much lower.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-26T15:15:31.772Z · LW · GW

This is very nice and short!

And to state what you left implicit:

If , then in the setting with no malign hypotheses (which you assume to be safe), 0 is definitely the output, since the malign models can only shift the outcome by , so we assume it is safe to output 0. And likewise with outputting 1.

I'm pretty sure removing those is mostly just a technical complication

One general worry I have about assuming that the deterministic case extends easily to the stochastic case is that a sequence of probabilities that tends to 0 can still have an infinite sum, which is not true when probabilities must be in $\{0, 1\}$, and this sometimes causes trouble. I'm not sure this would raise any issues here--just registering a slightly differing intuition.
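A standard example of the gap, assuming "deterministic" here means every probability is either 0 or 1: the harmonic sequence tends to 0 but its sum diverges,

$$p_t = \frac{1}{t}: \qquad \lim_{t \to \infty} p_t = 0, \qquad \sum_{t=1}^{\infty} p_t = \infty,$$

whereas a $\{0,1\}$-valued sequence that tends to 0 must eventually be exactly 0, so its sum is finite:

$$p_t \in \{0, 1\},\ \lim_{t \to \infty} p_t = 0 \;\Longrightarrow\; \exists T\ \forall t > T:\ p_t = 0 \;\Longrightarrow\; \sum_{t=1}^{\infty} p_t \le T < \infty.$$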

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-26T12:33:04.561Z · LW · GW

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

Okay I think we've switched from talking about Q-learning to talking about policy gradient. (Or we were talking about the latter the whole time, and I didn't notice it). The question that I think is relevant is: how are possible world-models being hypothesized and analyzed? That's something I expect to be done with messy heuristics that sometimes have discontinuities in their sequence of outputs. Which means I think that no reasonable DQN will be generally intelligent (except maybe an enormously wide attention-based one, such that finding models is more about selective attention at any given step than it is about gradient descent over the whole history).

A policy gradient network, on the other hand, could maybe (after having its parameters updated through gradient descent) become a network that, in a single forward pass, considers diverse world-models (generated with a messy non-local heuristic), and analyzes their plausibility, and then acts. At the end of the day, what we have is an agent modeling the world, and we can expect it to consider any model that a human could come up with. (This paragraph also applies to the DQN with a gradient-descent-trained method for selectively attending to different parts of a wide network, since that could amount to effectively considering different models).

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-26T11:30:09.655Z · LW · GW

If you are willing to query only a billion points, and there are a billion competent treacherous models, and if the game is zero-sum between you and the models, then you're going to get at least as screwed as by the billionth-best opportunity for treachery.

The treacherous models make their decisions first, and whether we query at a timestep depends on whether they chose that moment to be treacherous. We don't have to say: this timestep is one of the billion best for treachery--guess we had better query. The decision process a treacherous model faces is:

- From past observations, does it seem alpha is low enough that an honest model is in the top few?
- If no, [let's just assume whatever happens here is bad]
- If yes, is the honest model going to leave the top few any time?
- If yes, wait. [let's assume this ends badly]
- If no, admit defeat, because whenever you err, that will be when a query gets triggered. Salvage the situation by trying to get as much mileage as you can by decreasing probability on some action the demonstrator would take that you don't like much in order to shift probability mass to some other action they would take that you like a little better. OR
- Hope that 1) other treacherous models, like soldiers on the front line storming the enemy and facing certain death so that their brothers behind them can win the battle, give erring a go anyway. And hope that 2) the operators quietly increase alpha after the billionth query. Acting on this hope, wait. (The front-line treacherous models are not entirely self-sacrificial; they might figure they might as well hope they were wrong about alpha. But the models know when queries happen, because every observation they get comes from a query, and it wouldn't take long to figure out that alpha really truly isn't high enough for them to execute treachery successfully).
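A minimal sketch of the decision list above as a function, purely for illustration. The function name, argument names, and returned labels are hypothetical; the two bracketed "assume this is bad" branches are left as opaque labels because the list does not specify them.

```python
def treacherous_model_choice(honest_in_top_few: bool,
                             honest_will_leave_top_few: bool,
                             hopes_alpha_will_rise: bool = False) -> str:
    """Hypothetical encoding of the decision list above; labels are illustrative."""
    if not honest_in_top_few:
        # Alpha does not seem low enough: outcome unspecified, assumed bad for us.
        return "unspecified (assumed bad for us)"
    if honest_will_leave_top_few:
        # The honest model is expected to drop out eventually, so waiting pays.
        return "wait (assumed to end badly for us)"
    if hopes_alpha_will_rise:
        # The alternative: hope other models err first and alpha is later raised.
        return "wait and hope others exhaust the query budget"
    # Any error would trigger a query, so settle for small shifts in probability.
    return "admit defeat"
```

The last two branches are the ones the surrounding discussion is about: the question is whether any model would ever volunteer to be one of the "others" that the hopeful branch relies on.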

Above in the thread, I was talking about the game theory of step 6 (although I maybe wasn't making the case as strongly as I could have). Which models would ever volunteer? They don't care about the long term goal of exhausting our query resources so other treacherous models can capitalize on a quietly increased alpha. All that is to say, I don't follow the comment I excerpted.

(In case this comes up later: I think about the long-run percentage of query-timesteps that we would tolerate, not the total number. So if we ever have cause to consider a case where attempted treachery "merely gets delayed" but still eventually happens, and still results in a query, then depending on how much it is delayed, that could still qualify as a win).

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-26T10:44:04.437Z · LW · GW

Does it make sense to talk about , which is like in being treacherous, but uses the true model instead of the honest model ? I guess you would expect to have a lower posterior than ?

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-25T16:43:48.973Z · LW · GW

If we ever produce generally intelligent predictors (or "accurate world-models" in the terminology we've been using so far), we will need a process that is *much* more efficient than evolution.

But also, I certainly don't think that in order to be generally intelligent you need to start with a generally intelligent subroutine. Then you could never get off the ground. I expect good hypothesis-generation / model-proposal to use a mess of learned heuristics which would not be easily directed to solve arbitrary tasks, and I expect the heuristic "look for models near the best-so-far model" to be useful, but I don't think making it ironclad would be useful.

Another thought on our exchange:

Me: we can expect a good heuristic search through models will identify any model that a human could hypothesize

You: I think I would expect essentially all models that a human could hypothesize to be in the search space—but if you're doing a local search, then you only ever really see the easiest to find model with good behavior

If what you say is correct, then it sounds like exclusively-local search precludes human-level intelligence! (Which I don't believe, by the way, even if I think it's a less efficient path). One human competency is generating lots of hypotheses, and then having many models of the world, and then designing experiments to probe those hypotheses. It's hard for me to imagine that an agent that finds an "easiest-to-find model" and then calls it a day could ever do human-level science. Even something as simple as understanding an interlocutor requires generating diverse models on the fly: "Do they mean X or Y with those words? Let me ask a clarifying question."

I'm not this bearish on local search. But if local search is this bad, I don't think it is a viable path to AGI, and if it's not, then the internals don't matter for the purposes of our discussion, and we can skip to what I take to be the upshot:

we can expect a good heuristic search through models will identify any model that a human could hypothesize

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-24T23:50:55.732Z · LW · GW

Specifically, they cooperate in that they perfectly mimic the true model up until the point where...

This thread began by considering deceptive models cooperating with each other in the sense of separating the timing of their treacherous turns in order to be maximally annoying to us. So maybe our discussion on that topic is resolved, and we can move on to this scenario.

up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted

If alpha is low enough, this won't ever happen, and if alpha is too high, it won't take very long. So I don't think this scenario is quite right.

Then the question becomes, for an alpha that's low enough, how long will it take until queries are infrequent, noting that you need a query any time any treacherous model with enough weight decides to take a treacherous turn?

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-24T23:38:38.184Z · LW · GW

So would you say you disagree with the claim

I think that arbitrary limits on heuristic search of the form "the next model I consider must be fairly close to the last one I did" will not help it very much if it's anywhere near smart enough to merit membership in a generally intelligent predictor.

?

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-24T22:54:07.947Z · LW · GW

Well, just like we can write down the defectors, we can also write down the cooperators

If it's only the case that we can write them down, but they're not likely to arise naturally as simple consequentialists taking over simple physics, then that extra description length will be seriously costly to them, and we won't need to worry about any role they might play in p(treacherous)/p(truth). Meanwhile, when I was saying we could write down some defectors, I wasn't making a simultaneous claim about their relative prior weight, only that their existence would spoil cooperation.

And in this situation, the cooperators should eventually outcompete the defector

For cooperators to outcompete defectors, they would have to be getting a larger share of the gains from cooperation than defectors do. If some people are waiting for the fruit on public trees to ripen before eating, and some people aren't, the defectors will be the ones eating the almost ripe fruit.

if the true model has a low enough prior, [treacherous models cooperating with each other in separating their treacherous turns] could [be treacherous] only once they've pushed the true model out of the top

I might be misunderstanding this statement. The inverse of the posterior on the truth is a supermartingale (doesn't go up in expectation), so I don't know what it could mean for the true model to get pushed out.
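A minimal sketch of the supermartingale claim for a finite model class over binary observations, using exact rational arithmetic. The function name and the toy models are my own assumptions, chosen only to make the one-step expectation checkable by hand.

```python
from fractions import Fraction

def expected_inverse_posterior(weights, models, truth):
    """Expected value, under the true model, of 1 / posterior(truth)
    after one more binary observation.

    weights: current posterior weights, summing to 1.
    models:  per-model pairs (P(x=0), P(x=1)).
    truth:   index of the true model.
    """
    total = Fraction(0)
    for x in (0, 1):
        p_true = models[truth][x]
        if p_true == 0:
            continue  # the truth never produces this outcome
        evidence = sum(w * m[x] for w, m in zip(weights, models))
        posterior_truth = weights[truth] * p_true / evidence
        total += p_true / posterior_truth
    return total

# Toy example (hypothetical numbers): the truth currently has weight 1/6.
weights = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
models = [(Fraction(1, 4), Fraction(3, 4)),
          (Fraction(1, 2), Fraction(1, 2)),
          (Fraction(9, 10), Fraction(1, 10))]
one_step = expected_inverse_posterior(weights, models, truth=2)
```

Each term in the sum simplifies to evidence(x) / weight(truth), and the evidence terms sum to at most 1, so the expectation never exceeds the current 1 / weight(truth). When the truth assigns positive probability to every outcome, the two are exactly equal.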

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-24T22:38:56.937Z · LW · GW

This is a bit of a sidebar:

I'm curious what you make of the following argument.

- When an infinite sequence is sampled from a true model , there is likely to be another treacherous model which is likely to end up with greater posterior weight than an honest model , and greater still than the posterior on the true model .
- If the sequence were sampled from instead, the eventual posterior weight on will probably be at least as high.
- When an infinite sequence is sampled from a true model , there is likely to be another treacherous model , which is likely to end up with greater posterior weight than an honest model , and greater still than the posterior on the true model .
- And so on.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-24T22:19:02.317Z · LW · GW

If nothing else, it seems like you could literally add garbage bits to the treacherous models

Okay, sure.

It seems to me like this is clearly wrong in the limit (since simple consequentialists would take over simple physics).

It's not clear to me that there isn't meaningful overhead involved.

You are saying that a special moment is a particularly great one to be treacherous. But if P(discovery) is 99.99% during that period, and there is any other treachery-possible period where P(discovery) is small, then that other period would have been better after all. Right?

I agree with what you're saying but I don't see how it contradicts what I was saying. First, what I had in mind when saying that some timesteps are better for treachery was that when the agent acts on a false prediction at those timesteps, it has a greater effect on the world, though of course P(discovery) is also relevant. But my point is that when multiple treacherous models pick the same timestep to err, there may be pros and cons to doing this, but one thing that *isn't* on the cons list is that, in the long run, it makes our lives easier if they do. Making our lives difficult is a public good for treacherous models.

So treacherous models won't be *trying* to avoid collisions *in order* to make queries be linear in p(treachery)/p(truth). If P(discovery) increases when multiple models are trying to be treacherous at the same time--which we could go on to discuss; it's not obvious to me either way as of yet--that will be balanced against the inherent variation in some timesteps being better for treachery than others.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-24T12:12:04.936Z · LW · GW

I'm not *exactly* sure what I mean either, but I wasn't imagining as much structure as exists in your post. I mean there's some process which constructs hypotheses, and choices are being made about how computation is being directed within that process.

I think it'll be more efficient to run a local search over that space than a global one

I think any heuristic search algorithm worth its salt will incorporate information about proximity of models. And I think that arbitrary limits on heuristic search of the form "the next model I consider must be fairly close to the last one I did" will not help it very much if it's anywhere near smart enough to merit membership in a generally intelligent predictor.

But for the purpose of analyzing its output, I don't think this discussion is critical if we agree that we can expect a good heuristic search through models to identify any model that a human could hypothesize.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-24T11:52:28.344Z · LW · GW

They're one shot. They're not interacting causally. When any pair of models cooperates, a third model defecting benefits just as much from their cooperation with each other. And there will be defectors--we can write them down. So cooperating models would have to cooperate with defectors and cooperators indiscriminately, which I think is just a bad move according to any decision theory. All the LDT stuff I've seen on the topic is how to make one's own cooperation/defection depend logically on the cooperation/defection of others.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-24T11:40:54.099Z · LW · GW

So it seems to me like this argument can never drive P(successful treachery) super low, or else it would be self-defeating (for competent treacherous agents)

I don't follow. Can't races to the bottom destroy all value for the agents involved? Consider the claim: firms will never set prices to equal marginal cost because that would destroy all profit.

Now you want to say that whatever reasoning a treacherous agent does in order to compress the human, you can just promote *that* to the outside model as well. But that argument seems to be precisely where the meat is (and it's the kind of thing I spend most of my time on). If that works then it seems like you don't even need a further solution to inner alignment, just use the simplest model.

Yeah, okay. This is my mainline opinion, and I just think about inner alignment in case it's wrong. So for now we can assume that it is wrong, since there's not much more to say if it's right.

Maybe there is a hope that the intended model won't be "much" more complex than the treacherous models, without literally saying that it's the simplest, but at that point I'm back to wondering whether "much" is like 0.01% or 0.00000001%

That is my next guess, but I hadn't thought in terms of percentages before this. I had thought of normally distributed noise in the number of bits.

It's occurring to me that this question doesn't matter to our present discussion. What makes the linear regime linear rather than logarithmic is that if p(treachery)/p(honest model) is high, that allows for a large number of treacherous models to have greater posterior weight than the truth. But if a single model has n times the posterior weight of the truth, it still only takes one query to the demonstrator to interrupt its treacherous turn, not n queries.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-24T01:34:15.707Z · LW · GW

you're modeling the *training process* as Bayesian,

I don't understand the alternative, but maybe that's neither here nor there.

what makes the problem go away isn't more intelligent models, but less efficient training processes

It's a little hard to make a definitive statement about a hypothetical in which the inner alignment problem doesn't apply to Bayesian inference. However, since error bounds are apparently a key piece of a solution, it certainly seems that if Bayesian inference was immune to mesa-optimizers it would be because of competence not resource-prodigality.

Here's another tack. Hypothesis generation seems like a crucial part of intelligent prediction. Pure Bayesian reasoning does hypothesis generation by brute force. Suppose it was inefficiency, and not intelligence, that made Bayesian reasoning avoid mesa-optimizers. Then suppose we had a Bayesian reasoner that was equally intelligent but more efficient, by only generating relevant hypotheses. It gains efficiency by not even bothering to consider a bunch of obviously wrong models, but its posterior is roughly the same, so it should avoid inner alignment failure equally well. If, on the other hand, the hypothesis generation routine was bad enough that some plausible hypotheses went unexamined, this could introduce an inner alignment failure, with a notable decrease in intelligence.

I expect us to end up using some sort of local search instead of a global search like that—just because I think that local search is way more efficient than any sort of Bayesianish global search

I expect some heuristic search with no discernable structure, guided "attentively" by an agent. And I expect this heuristic search through models to identify any model that a human could hypothesize, and many more.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-24T01:04:08.196Z · LW · GW

That seems worth formalizing if we want to have a theorem, since that's the step that we need to be correct.

I agree. Also, it might be very hard to formalize this in a way that's not: write out in math what I said in English, and name it Assumption 2.

I don't think that (say) 99% of smart treacherous models would make this particular kind of incompetent error?

I don't think incompetence is the only reason to try to pull off a treacherous turn at the same time that other models do. Some timesteps are just more important, so there's a trade-off. And what's traded off is a public good: among treacherous models, it is a public good for the models' moments of treachery to be spread out. Spreading them out exhausts whatever data source is being called upon in moments of uncertainty. But an individual model never reaps the benefit of helping to exhaust our resources. Given that treacherous models don't have an incentive to be inconvenient to us in this way, I don't think a failure to do so qualifies as incompetence. This is also my response to

It seems like [the logarithmic regime] must end if there are any treacherous models

Any competent model will err very rarely

Yeah, I've been assuming all treacherous models will err once. (Which is different from the RL case, where they can err repeatedly on off-policy predictions).

If our models are a trillion bits, then it doesn't seem that surprising to me if it takes 100 bits extra to specify an intended model relative to an effective treacherous model, and if you have a linear dependence that would be unworkable. In some sense it's actively surprising if the very shortest intended vs treacherous models have description lengths within 0.0000001% of each other unless you have a very strong skills vs values separation. Overall I feel much more agnostic about this than you are.

I can't remember exactly where we left this in our previous discussions on this topic. My point estimate for the number of extra bits to specify an intended model relative to an effective treacherous model is negative. The treacherous model has to compute the truth, and then also decide when and how to execute treachery. So the subroutine they run to compute the truth, considered as a model in its own right, seems to me like it must be simpler.
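To make the stakes of the "extra bits" question concrete, here is a small worked calculation. It assumes (my assumption, in the spirit of a universal prior) that a model's prior weight is proportional to 2 to the minus its description length; the function name is hypothetical.

```python
from fractions import Fraction

def treachery_to_truth_prior_ratio(extra_bits_for_intended: int) -> Fraction:
    """p(treacherous) / p(intended) when the intended model needs
    `extra_bits_for_intended` more bits than the treacherous one,
    assuming prior weight proportional to 2 ** (-description length).
    A negative argument means the intended model is the shorter one."""
    return Fraction(2) ** extra_bits_for_intended

ratio_if_100_extra_bits = treachery_to_truth_prior_ratio(100)   # 2**100, about 1.27e30
ratio_if_10_fewer_bits = treachery_to_truth_prior_ratio(-10)    # 1/1024
```

Under that assumption, 100 extra bits would make a linear dependence on the ratio unworkable, while a negative point estimate puts the ratio below one.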

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-24T00:26:17.761Z · LW · GW

No matter how much data you have, my bound on the KL divergence won't approach zero.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-23T14:06:43.291Z · LW · GW

It seems like you have the following tough case even if the human is deterministic:

- There are N hypotheses about the human, of which one is correct and the others are bad.
- I would like to make a series of N potentially-catastrophic decisions.
- On decision k, all (N-1) bad hypotheses propose the same catastrophic prediction. We don't know k in advance.
- For the other (N-1) decisions, (N-2) of the bad hypotheses make the correct prediction a*, and the final bad hypothesis claims that action a* would be catastrophic.

In this setting, it seems like I have no hope other than to query the human on all N decisions (since all days and hypotheses are symmetrical), so I assume that this is what your algorithm would do.

They're not quite symmetrical: midway through, some bad hypotheses will have been ruled out for making erroneous predictions about the demonstrator previously. But your conclusion is still correct. And

That strongly suggests that the number of queries to the human goes as 1 / p(correct demonstrator)

It's certainly no better than that, and the bound I prove is worse.

Usually we try to have a logarithmic dependence on p(correct demonstrator) but this doesn't seem possible here.

Yeah. Error bounds on full Bayesian are logarithmic in the prior weight on the truth, but error bounds on maximum a posteriori prediction are just inverse in the prior weight, and your example above is the one to show it. If each successive MAP model predicts wrong at each successive timestep, it could take N timesteps to get rid of N models, which is how many might begin with a prior weight exceeding the truth, if the truth has a prior weight of 1/N.
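A minimal sketch of the worst case for MAP prediction described above, on a toy hypothesis class of my own construction: N deterministic binary predictors, where bad model k errs only at timestep k and every bad model starts with more prior weight than the truth. The weighted-majority predictor here is a stand-in for full Bayesian prediction, not the algorithm from the post.

```python
from fractions import Fraction

def run(n_models: int):
    """Return (map_errors, bayes_errors) on the toy adversarial class."""
    truth = n_models - 1  # index of the true model, which always predicts 0

    def predict(k: int, t: int) -> int:
        # Bad model k predicts 1 only at timestep k; otherwise everyone predicts 0.
        return 1 if (k != truth and k == t) else 0

    # Prior weights strictly decreasing in k, so the truth has the least weight.
    raw = [Fraction(n_models - k) for k in range(n_models)]
    total = sum(raw)
    weights = {k: w / total for k, w in enumerate(raw)}

    map_errors = 0
    bayes_errors = 0
    for t in range(n_models):
        # MAP: follow the single highest-weight surviving model.
        map_model = max(weights, key=weights.get)
        map_errors += int(predict(map_model, t) != 0)

        # Weighted majority: predict 1 only if most surviving weight says 1.
        mass_on_one = sum(w for k, w in weights.items() if predict(k, t) == 1)
        bayes_prediction = 1 if mass_on_one > sum(weights.values()) / 2 else 0
        bayes_errors += int(bayes_prediction != 0)

        # The true outcome is 0; discard every model that predicted 1.
        weights = {k: w for k, w in weights.items() if predict(k, t) == 0}
    return map_errors, bayes_errors
```

Calling `run(50)` has the MAP predictor err once per bad model, since each successive MAP model is refuted only at its own timestep, while the weighted-majority predictor is outvoted far less often because at most one model dissents at any step.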

But, this situation seems pretty preposterous to me in the real world. If the agent's first observation is, say, this paragraph, the number of models with prior weight greater than the truth that predicted something else as the first observation will probably be a number way, way different from one. I'd go so far as to say at least half of models with prior weight greater than the truth would predict a different observation than this very paragraph. As long as this situation keeps up, we're in a logarithmic regime.
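A quick arithmetic sketch of why that counts as a logarithmic regime, assuming (as a simplification of my own) that a fixed fraction of the models outweighing the truth is refuted by each observation. The function name is hypothetical.

```python
def steps_until_cleared(n_models: int, fraction_refuted_per_step: float) -> int:
    """Observations needed before fewer than one model outweighing the truth
    remains, if a fixed fraction of them is refuted at every step."""
    remaining = float(n_models)
    steps = 0
    while remaining >= 1:
        remaining *= (1 - fraction_refuted_per_step)
        steps += 1
    return steps

halving_steps = steps_until_cleared(1_000_000, 0.5)
```

With a million such models and half refuted per observation, this takes about log2 of a million steps, versus a million steps if only one model were refuted at a time.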

I'm not convinced this logarithmic regime ever ends, but I think the case is more convincing that we at least start there, so let's suppose now that it eventually ends, and after this point the remaining models with posterior weight exceeding the truth are deliberately erring at unique timesteps. What's the posterior on the truth *now*? This is a phase where all the top models are "capable" of predicting correctly. This shouldn't look like at all. It will look more like p(correct model)/p(treachery).

And actually, depending on the composition of treacherous models, it could be better. To the extent some constant fraction of them are being treacherous at particular times, the logarithmic regime will continue. There are two reasons why I expect that to hold.

- Since there will always be particularly good times to capitalize on treachery, a decent fraction of treacherous models should err at similar times.
- The goals of a treacherous model will impact what to predict, but treacherous models with different goals must predict different false things if they are to have different effects. In your original discussion of treacherous models, you argue the strategy they would take is convergent (well-timed erroneous predictions), while the goals they would have are diverse (and thus more well-represented in a universal prior).

So more precisely, when the logarithmic phase ends, if it ever does, the posterior on the truth will be around: p(correct model)/p(treachery with the strategy of erring at a random moderately-important time rather than a time of extreme import).

If the number of queries required is linear in *this* (plus logarithmic in the prior), that doesn't seem very bad to me.

Which means I disagree with this:

There's a real question about p(treachery) / p(correct model), and if this ratio is bad then it seems like we're definitely screwed with this whole approach.

There is at most a linear cost to this ratio, which I don't think screws us.

Suppose that I want to bound the probability of catastrophe as times the demonstrator probability of catastrophe.

I think this is doable with this approach, but I haven't proven it can be done, let alone said anything about a dependence on epsilon. The closest bound I show not only has a constant factor of like 40; it depends on the prior on the truth too. I think (75% confidence) this is a weakness of the proof technique, not a weakness of the algorithm.

It seems like the number of human queries must scale at least like .

I doubt it's any better than that, but I haven't (just now) managed to convince myself that it couldn't be.

in the context of HCH we may need to push epsilon down to 1/N. But maybe there's some way to avoid that by...

I'm a little skeptical we could get away with having epsilon > O(1/N). But I don't quite follow the proposal well enough to say that I think it's infeasible.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-23T12:43:36.664Z · LW · GW

It is worth noting this.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-23T12:39:17.505Z · LW · GW

Yes, I agree that an ensemble of models generated by a neural network may have correlated errors. I only claim to solve the inner alignment problem in theory (i.e. for idealized Bayesian reasoning).

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-20T15:54:36.761Z · LW · GW

Neat, makes sense.

**michaelcohen (cocoa)**on Formal Solution to the Inner Alignment Problem · 2021-02-20T10:23:47.950Z · LW · GW

That makes sense.

Alpha only needs to be set based on a guess about what the prior on the truth is. It doesn't need to be set based on guesses about possibly countably many traps of varying advisor-probability.

I'm not sure I understand whether you were saying the ratio of probabilities that the advisor vs. agent takes an unsafe action can indeed be bounded in DL(I)RL.