# Open question: are minimal circuits daemon-free?

post by paulfchristiano · 2018-05-05T22:40:20.509Z · score: 122 (39 votes) · LW · GW · 69 comments

## Contents

- Problem statement and intuition
- Difficulties
- 69 comments

*Note: weird stuff, very informal.*

Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.

I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense.

I don't know whether this is a real problem or not. But from a theoretical perspective, not knowing is already concerning--I'm trying to find a strong argument that we've solved alignment, not just something that seems to work in practice.

I am pretty convinced that daemons are a real problem for Solomonoff induction. Intuitively, the problem is caused by "too much compute." I suspect that daemons are also a problem for some more realistic learning procedures (like human evolution), though in a different shape. I think that this problem can probably be patched, but that's one of the major open questions for the feasibility of prosaic AGI alignment.

I suspect that daemons *aren't* a problem if we exclusively select for computational efficiency. That is, I suspect that **the fastest way to solve any particular problem doesn't involve daemons**.

I don't think this question has much intrinsic importance, because almost all realistic learning procedures involve a strong simplicity prior (e.g. weight sharing in neural networks).

But I do think this question has deep similarities to more important problems, and that answering this question will involve developing useful conceptual machinery. Because we have an unusually strong intuitive handle on the problem, I think it's a good thing to think about.

## Problem statement and intuition

Can the smallest boolean circuit that solves a problem be a daemon? For example, can the smallest circuit that predicts my behavior (at some level of accuracy) be a daemon?
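To make "smallest circuit" concrete, here is a minimal brute-force sketch. It assumes circuits are built only from 2-input NAND gates over two input bits, and that size means gate count; those choices, and the names `min_nand_gates` and `computable`, are illustrative assumptions rather than part of the question as posed. Exhaustive search like this is only feasible for tiny problems.

```python
# Truth tables are tuples over the four input rows (a, b) = 00, 01, 10, 11.
A = (0, 0, 1, 1)
B = (0, 1, 0, 1)


def nand(x, y):
    """Row-wise NAND of two truth tables."""
    return tuple(1 - (p & q) for p, q in zip(x, y))


def computable(target, wires, gates_left):
    """True if adding exactly `gates_left` NAND gates can make the last gate equal `target`."""
    if gates_left == 0:
        return wires[-1] == target
    for i in range(len(wires)):
        for j in range(i, len(wires)):
            new_wire = nand(wires[i], wires[j])
            if computable(target, wires + [new_wire], gates_left - 1):
                return True
    return False


def min_nand_gates(target, max_gates=5):
    """Smallest number of NAND gates computing `target`, or None if above `max_gates`."""
    if target in (A, B):
        return 0  # the function is just one of the inputs
    for size in range(1, max_gates + 1):
        if computable(target, [A, B], size):
            return size
    return None
```

Calling `min_nand_gates((0, 1, 1, 0))` searches for the smallest NAND circuit computing XOR. The question in this post is about what such a minimum looks like for problems far too large to search this way.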

Intuitively, if we have a daemon that is instrumentally or incidentally motivated to solve my problem, then there is some smaller circuit that solves the problem equally well but skips the instrumental reasoning. If my daemon is doing some complex reasoning to answer the question "Should I predict well?" we could just skip straight to the answer "yes." This both makes the circuit smaller, and prevents the circuit from ever deciding not to predict well.

A different perspective on a similar intuition: the daemon is doing some actual cognitive work to solve the problem. Since that computation is being done by the daemon, it is embedded as a smaller circuit. Jessica explores this intuition a bit here. Here we are considering an easy version of the problem, since by taking the smallest circuit we are effectively quantifying over all possible ways of extracting logical information from the daemon.

Instead of showing that minimal circuits can't be daemons, we might end up concluding that they can be. That would be even more interesting.

Another possible outcome is giving a strong argument that captures our intuitions/concerns about daemons, and which clearly doesn't apply to the minimal circuit that solves a problem. In this case we couldn't prove anything positive about the minimal circuit, but we would have "screened off" the possible cause for concern.

## Difficulties

The first and most serious difficulty is understanding what we are talking about.

I don't expect to get total clarity on concepts like "daemon" or "optimization" or "generic problem," but we need to have a better grip than we do right now. I expect that we'll develop better concepts in the course of solving the problem, rather than as a precondition for solving the problem (in general I think "define things so that you can prove the theorem" is often the right strategy).

A second difficulty is that the different parts of the computation can be tangled up in an extremely complex way. In an extreme case, the daemon may be cryptographically obfuscated.

We want to show that given any daemon, there is a smaller circuit that solves the problem. The most natural approach is showing how to construct a smaller circuit, given a daemon. But if the daemon is obfuscated, there is no efficient procedure which takes the daemon circuit as input and produces a smaller circuit that still solves the problem.

So we can't find any efficient constructive argument. That rules out most of the obvious strategies.

## 69 comments

> I am pretty convinced that daemons are a real problem for Solomonoff induction. Intuitively, the problem is caused by “too much compute.” I suspect that daemons are also a problem for some more realistic learning procedures (like human evolution), though in a different shape.

For human evolution, the problem is too little compute rather than too much, right? Meaning if evolution just gave humans the goal of "maximize inclusive fitness" then the human wouldn't be able to find a good policy for achieving that due to lack of computing power so instead we got a bunch of goals that would have been subgoals of "maximize inclusive fitness" in our ancestral environment (like eat tasty food and make friends/allies).

Suppose we wanted to make a minimal circuit that would do as well as humans in maximizing inclusive fitness in some range of environments. Wouldn't it make sense to also "help it out" by having it directly optimize for useful subgoals in those environments rather than having it do a big backchain from "maximize inclusive fitness"? And then it would be a daemon because it would keep optimizing for those subgoals even if you moved it outside of those environments?

I agree with this basic point and it seems important, thanks.

It seems like there are two qualitatively different concerns when trying to optimize for X, that probably need to be distinguished / thought about separately:

- Upstream daemon: directly optimizes something that is upstream of X on the training distribution.
- Downstream daemon: optimizes something that is downstream of X on the training distribution (e.g. because survival is a convergent instrumental goal and X is upstream of survival), and so backwards chains through X.

Obviously the real situation can be a complicated mixture, and this is not a clean distinction even apart from that.

The arguments in the OP only plausibly apply to downstream daemons. I think they make the most sense in terms of making induction benign.

I've normally thought of upstream daemons as much more likely, but much easier to deal with:

- In the case of upstream daemons, the most natural response is to probe a broad range of situations similar to the ones you care about. This requires a lot more understanding, but it seems quite realistic to try to avoid the situation where (Y --> empowering humans) on the training distribution but (Y --> disempowering humans) on the slightly expanded distribution. (Compared to others around here, this probably seems more plausible to me because I have an intuition about a basin of attraction around corrigibility, and am imagining consistently retraining the agent such that it doesn't need to deal with giant distributional shift.)
- In the case of downstream daemons, you have to deal with an adversary who is motivated to look for the first opportunity to defect that you won't be able to notice. This pits much more optimization pressure against your training regime. In particular, the "distributional shift" in this case is caused by the agent being powerful enough to seize control, and therefore it is very difficult to avoid.

While I usually flag these as two potentially distinct concerns, they do run together a lot in my head as evidenced by this post. I'm not sure if it's possible to cleanly distinguish them, or how. The right distinction may also be something else, e.g. focusing directly on the possibility of a treacherous turn.

I think it makes sense to classify daemons into two types the way you do. Interestingly MIRI seems to be a lot more concerned about what you call upstream daemons. The Arbital page you linked to only talks about upstream daemons and the Google Doc "MIRI notes on alignment difficulty" seems to be mostly about that too. (What is it with people keeping important AI safety documents in private Google Docs these days, with no apparent plans of publication? Do you know any others that I'm not already shared on, BTW?)

> and am imagining consistently retraining the agent such that it doesn’t need to deal with giant distributional shift

I don't recall you writing about this before. How do you see this working? I guess with LBO you could train a complete "core for reasoning" and then amplify that to keep retraining the higher level agents on broader and broader distributions, but how would it work with HBO, where the human overseer's time becomes increasingly scarce/costly relative to the AI's as AIs get faster? I'm also pretty concerned about the overseer running into their own lack of robustness against distributional shifts if this is what you're planning.

> Interestingly MIRI seems to be a lot more concerned about what you call upstream daemons. The Arbital page you linked to only talks about upstream daemons and the Google Doc "MIRI notes on alignment difficulty" seems to be mostly about that too.

I think people (including at MIRI) normally describe daemons as emerging from upstream optimization, but then describe them as becoming downstream daemons as they improve. Without the second step, it seems hard to be so pessimistic about the "normal" intervention of "test in a wider range of cases."

> how would it work with HBO, where the human overseer's time becomes increasingly scarce/costly relative to the AI's as AIs get faster?

At time 0 the human trains the AI to operate at time 1. At time T>>0 the AI trains itself to operate at time T+1, at some point the human no longer needs to be involved---if the AI is actually aligned on inputs that it encounters at time T, then it has a hope of remaining aligned on inputs it encounters at time T+1.

I spoke a bit too glibly, though. I think there are lots of possible approaches for dealing with this problem, and each of them slightly increases my optimism; the one I mentioned isn't the most important:

- Retraining constantly. More generally, only using the AI for a short period of time before building a completely new AI. (I think that humans basically only need to solve the alignment problem for AI-modestly-better-than-humans-at-alignment, and then we leave the issue up to the AI.)
- Using techniques here to avoid "active malice" in the worst case. This doesn't include all cases where the AI is optimizing a subgoal which is no longer correlated with the real goal. But it does include cases where that subgoal then involves disempowering the human instrumentally, which seems necessary to really have a catastrophe.
- I think there is some real sense in which an upstream daemon (of the kind that could appear for a minimal circuit) may be a much smaller problem, though this requires much more understanding.

> I'm also pretty concerned about the overseer running into their own lack of robustness against distributional shifts if this is what you're planning.

I think this is definitely an additional difficulty. Right now I think accidentally introducing consequentialists is a somewhat larger concern, either daemons from the distillation step or weird memetic patterns in the amplification step, but hopefully at some point I'll be focusing on this problem.

> Without the second step, it seems hard to be so pessimistic about the “normal” intervention of “test in a wider range of cases.”

Another way to be pessimistic is to expect that if the test fails on a wider range of cases, it will be unclear how to proceed at that point, and less safety-conscious AI projects may take the lead before you figure that out. (I think this, or a similar point, was made in the MIRI doc.)

> At time 0 the human trains the AI to operate at time 1. At time T>>0 the AI trains itself to operate at time T+1, at some point the human no longer needs to be involved---if the AI is actually aligned on inputs that it encounters at time T, then it has a hope of remaining aligned on inputs it encounters at time T+1.

I don't think this can work if you're just doing naive imitation learning? Do you have some other training method in mind?

> I don't think this can work if you're just doing naive imitation learning? Do you have some other training method in mind?

To be clear, I'm imagining imitation learning + amplification. So the agent at time T engages in some deliberative process to produce training targets for the agent at time T+1. The agent at time T also deliberates in order to choose what situations the agent at time T+1 should train on.

What obstruction do you have in mind?

(I'm imagining using imitation+RL rather than pure imitation, but the difference won't help with this question.)

By "naive imitation learning" I was thinking "without amplification". With amplification, I'm less sure it won't work but it still seems pretty iffy. The plan seems to depend on at least the following:

1. We can create a small (since generated by expensive humans) set of training data that is representative of the data manifold of HBO reasoning (without missing some important part of it).
2. The data manifold of HBO reasoning is universal, i.e., all future tasks can be broken down (recursively) into subtasks that lie on this manifold.
3. At each iteration of amplification, the agent being amplified knows how to break down an input task into subtasks that lie on (or are not too far from) its own training distribution. It's not clear to me how to do this, for example how the agent can obtain a simple enough representation of its own training distribution in order to reason about this problem.
4. The AI at time T can predict the test distribution at time T+1 well enough to generate training data for it. This seems hard to ensure given that the environment is likely to contain hard to predict elements like other agents, including adversarial agents. (This may not be a dealbreaker if the AI can detect out-of-distribution inputs at time T+1 and ask for further training data on them. Is this what you have in mind?)

I don't see why to separate 1/2, the goal is to find training data that describes some "universal" core for behavior.

3. I don't think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.

4. Hard-to-predict inputs aren't intrinsically a problem. If your agent fails malignly on input x, but not on distribution D, then your agent is able to distinguish x from D. So the difficulty comes from inputs that are easy-to-recognize but hard-to-generate. These certainly exist (e.g. consider a model which kills everyone given a signed edict from the pope). I think the most likely approach is to "reach inside" the model in order to stress test the behavior on inputs that you can't actually synthesize (e.g. by understanding that it is checking the pope's signature, and just seeing what would happen if the check passed). This is the endpoint of improvements in these techniques.

(Of course, I don't think we'll be able to prevent benign failures in general.)

> I don’t see why to separate 1/2, the goal is to find training data that describes some “universal” core for behavior.

It seems to me there are separate risks of the human HBO itself not being universal (e.g., humans are not universal or we need even higher bandwidth to be universal), and not being able to capture enough of the human HBO input/output function in a dataset to train an AI to be universal.

> 3. I don’t think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.

What if the path towards the universal core goes through an area the AI wasn't trained on?

> This is the endpoint of improvements in these techniques.

I think that makes sense but now you're making a conjunctive instead of disjunctive argument (which it seemed like you were claiming by saying "I think there are lots of possible approaches for dealing with this problem" and listing retraining and optimizing worst case performance as separate approaches).

ETA: If you're able to obtain a control guarantee over the whole input space, then that seems to solve the problem and you don't need constant retraining to be aligned. If you're only able to obtain it for some subset of inputs, then it seems that at time T the AI needs to be able to predict the T+1 test distribution so that it can make sure that's covered by the control guarantee.

Regarding daemons starting as upstream and becoming downstream...

I think this makes it sound like the goal (call it Y) of the daemon changes, but I usually don't think of it that way.

What changes is that pursuing Y initially leads to rapidly improving performance at X, but then the performance of X and Y pull apart as the daemon optimizes more heavily for Y.

It seems highly analogous to hacking a learned reward function.

Isn't that the problem Christiano is relatively optimistic about solving through training on different distributions? In particular, it seems like we'd notice the daemon's goals diverging from ours and try to correct it. If I understand Christiano's point correctly, he is optimistic that we could do so successfully due to the existence of a "basin of corrigibility". OTOH, the most concerning case would be when the daemon is intentionally trying to hide the fact that its goals are diverging from X, because we may not realize the system is misaligned before the daemon executes a treacherous turn.

Instead, I interpreted "starting as upstream and becoming downstream" as a description of a phenomenon that might look roughly like this:

1. Upstream daemon (UD) is "discovered" by some optimization process because it seems to do a good job optimizing X on the training data

2. The daemon is "initialized" by the outer optimizer and starts "running"

3. The outer optimizer (OO) provides a set of inputs to the daemon with the goal of optimizing X

4. Based on these inputs, the daemon "notices" that the entity that "initialized" it is trying to optimize X rather than Y

5. The UD infers that the outer optimizer will "terminate" it if the OO discovers the UD is optimizing for Y rather than X

6. The UD starts optimizing for X in order to survive (a convergent instrumental goal regardless of Y)

7. The OO observes the UD's performance on a wide variety of training data and infers that the UD is not in fact an upstream daemon

8. The daemon is "set loose" and proceeds to execute a treacherous turn

As you can tell, my model of how daemons work is **extremely** lacking; I've used quotes for every phrase that feels dubious to me.

One major question I have which this model, if correct, partially resolves for me: How does the downstream daemon ever "know" what X we are optimizing for in order for the outer optimizer to "discover" it?

A concrete vision:

Suppose the best a system can do without a daemon is 97% accuracy.

The daemon can figure out how to get 99% accuracy.

But in order to outperform other systems, it can just provide 98% accuracy, and use 1% of inputs to pursue its own agenda.

This all happens on-distribution.

If there are multiple daemon-containing systems competing for survival (with selection happening according to accuracy), this might force them to maximize accuracy, instead of just beating a "non-daemon baseline".
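The arithmetic in this scenario can be sketched as a toy calculation. This is only an illustration under assumptions of my own: accuracies are whole percentage points, the daemon wants to beat its best rival by a fixed `margin`, and the function name `diverted_share` is hypothetical.

```python
def diverted_share(capability, best_rival, margin=1):
    """Percentage points of inputs a daemon can divert while still winning selection.

    `capability` is the best accuracy the daemon could reach, `best_rival`
    is the accuracy of the strongest competing system, and `margin` is a
    hypothetical lead the daemon wants to keep. All are whole percentages.
    """
    # The daemon cannot deliver more than its own capability.
    required = min(capability, best_rival + margin)
    return capability - required


# Against a 97% non-daemon baseline, a 99%-capable daemon has slack.
print(diverted_share(capability=99, best_rival=97))
# Against another 99%-capable daemon, the slack disappears.
print(diverted_share(capability=99, best_rival=99))
```

Under these assumptions the slack exists only while the best rival is weaker than the daemon, which is the sense in which competition between daemon-containing systems pushes them toward maximizing accuracy.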

This is all only relevant to downstream daemons, right? If so, I don't understand why the DD would ever provide 98% accuracy; I'd expect it to provide 99% accuracy until it sees a chance to provide [arbitrarily low]% accuracy and start pursuing its agenda directly. As you say, this might happen due to competition between daemon-containing systems, but I think a DD would want to maximize its chances of survival by maximizing its accuracy either way.

I think it's relevant for either kind (actually, I'm not sure I like the distinction, or find it particularly relevant).

If there aren't other daemons to compete with, then 98% is sufficient for survival, so why not use the extra 1% to begin pursuing your own agenda immediately and covertly? This seems to be how principal-agent problems often play out in real life with humans.

I am interested as well. Please share the docs in question with my LW username at gmail dot com if that is a possibility. Thank you!

You should contact Rob Bensinger since he's the owner of the document in question. (It looks like I technically can share the document with others, but I'm not sure what Rob/MIRI's policy is about who that document should be shared with.)

(Summarizing/reinterpreting the upstream/downstream distinction for myself):

"upstream": has a (relatively benign?) goal which actually helps achieve X

"downstream": doesn't

Coincidentally I'm also trying to understand this post at the same time, and was somewhat confused by the "upstream"/"downstream" distinction.

What I eventually concluded was that there are 3 ways a daemon that intrinsically values optimizing some Y can "look like" it's optimizing X:

- Y = X (this seems both unconcerning and unlikely, and thus somewhat irrelevant)
- optimizing Y causes optimization pressure to be applied to X (upstream daemon, describes humans if Y = our actual goals and X = inclusive genetic fitness)
- The daemon is directly optimizing X because the daemon believes this instrumentally helps it achieve Y (downstream daemon, e.g. if optimizing X helps the daemon survive)

Does this seem correct? In particular, I don't understand why upstream daemons would have to have a relatively benign goal.

Yeah that seems right. I think it's a better summary of what Paul was talking about.

If evolution is to humans as humans are to UFAI, I suppose UFAI corresponds to too little compute allocated to understanding our goal specification, and too much compute allocated to maximizing it. That suggests the solution is relatively simple.

(sorry for commenting on such an old post)

It seems like the problem from evolution's perspective isn't that we don't understand our goal specification but that our goals are different from evolution's goals. It seems fairly tautological that putting more compute towards maximizing a goal specification than towards making sure the goal specification is what we want is likely to lead to UFAI; I don't see how that implies a "relatively simple" solution?

> It seems fairly tautological that putting more compute towards maximizing a goal specification than towards making sure the goal specification is what we want is likely to lead to UFAI

And the "relatively simple" solution is to do the reverse, and put more compute towards making sure the goal specification is what we want than towards maximizing it.

(It's possible this point isn't very related to what Wei Dai said.)

Isn't this just saying it would be nice if we collectively put more resources towards alignment research relative to capabilities research? I still feel like I'm missing something :/

We may be able to offload some work to the system, e.g. by having it search for a diverse range of models for the user's intent, instead of making it use a single hardcoded goal specification.

This comment of mine is a bit related if you want more elaboration:

https://www.lesswrong.com/posts/NtX7LKhCXMW2vjWx6/thoughts-on-reward-engineering#jJ7nng3AGmtAWfxsy

If you have thoughts on it, probably best to reply there--we are already necroposting, so let's keep the discussion organized.

I'm having trouble thinking about what it would mean for a circuit to contain daemons such that we could hope for a proof. It would be nice if we could find a simple such definition, but it seems hard to make this intuition precise.

For example, we might say that a circuit contains daemons if it displays more optimization than necessary to solve a problem. Minimal circuits could have daemons under this definition though. Suppose that some function $f$ describes the behaviour of some powerful agent, a function $\tilde{f}$ is like $f$ with noise added, and our problem is to predict sufficiently well the function $\tilde{f}$. Then, the simplest circuit that does well won't bother to memorize a bunch of noise, so it will pursue the goals of the agent described by $f$ more efficiently than $\tilde{f}$, and thus more efficiently than necessary.

I don't know what the statement of the theorem would be. I don't really think we'd have a clean definition of "contains daemons" and then have a proof that a particular circuit doesn't contain daemons.

Also I expect we're going to have to make some assumption that the problem is "generic" (or else be careful about what daemon means), ruling out problems with the consequentialism embedded in them.

(Also, see the comment thread with Wei Dai above, clearly the plausible version of this involves something more specific than daemons.)

> Also I expect we're going to have to make some assumption that the problem is "generic" (or else be careful about what daemon means), ruling out problems with the consequentialism embedded in them.

I agree. The following is an attempt to show that if we don't rule out problems with the consequentialism embedded in them then the answer is trivially "no" (i.e. minimal circuits may contain consequentialists).

Let $C$ be a minimal circuit that takes as input a string of length $n$ that encodes a Turing machine, and outputs a string that is the concatenation of the first $k$ configurations in the simulation of that Turing machine (each configuration is encoded as a string).

Now consider a string $x$ that encodes a Turing machine that simulates some consequentialist (e.g. a human upload). For the input $x$, the computation of the output of $C$ simulates a consequentialist; and $C$ is a minimal circuit.

By "predict sufficiently well" do you mean "predict such that we can't distinguish their output"?

Unless the noise is of a special form, can't we distinguish $f$ and $\tilde{f}$ by how well they do on $f$'s goals? It seems like for this not to be the case, the noise would have to be of the form "occasionally do something weak which looks strong to weaker agents". But then we could get this distribution by using a weak (or intermediate) agent directly, which would probably need less compute.

Suppose "predict well" means "guess the output with sufficiently high probability," and the noise is just to replace the output with something random 5% of the time.
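As a rough sketch of what this noise model implies, the following computes how often a predictor that simply outputs the noise-free value matches the noisy target. It assumes, hypothetically, that the random replacement is uniform over all possible outputs (so it can coincide with the true output); the function name and the `num_outputs` parameter are illustrative.

```python
from fractions import Fraction


def clean_predictor_accuracy(num_outputs, noise_rate=Fraction(1, 20)):
    """Probability that guessing the noise-free output matches the noisy one.

    Assumption: with probability `noise_rate` the output is replaced by a
    uniformly random one of `num_outputs` values, which may equal the
    original output by chance.
    """
    chance_match = Fraction(1, num_outputs)
    return (1 - noise_rate) + noise_rate * chance_match


# One output bit, noise applied 5% of the time.
print(clean_predictor_accuracy(2))
# A byte of output: accuracy approaches 95% as the output space grows.
print(clean_predictor_accuracy(256))
```

So a circuit that ignores the noise entirely already guesses correctly at least 95% of the time, and nothing in the problem rewards it for reproducing the noise.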

Yeah, I had something along the lines of what Paul said in mind. I wanted not to require that the circuit implement exactly a given function, so that we could see if daemons show up in the output. It seems easier to define daemons if we can just look at input-output behaviour.

I curated this post partly for the OP, and partly for the subsequent discussion.

Something valuable I think LessWrong can be is a place where people pose well-formed questions on important problems, and then make progress on them. I don't have the impression that any clear-cut breakthroughs happened here, but it does look like incremental, "chip away at the edges" progress was made.

My current take is that the knowledge-building process has several phases, that can reinforce each other in a non-linear fashion:

- researching current literature
- transforming ad-hoc exploratory research and impressions into clearly stated questions
- brainstorming new ideas
- refining those ideas into something legible
- subjecting those ideas to scrutiny
- distilling all that into a final concept that others can build on

I think it's important for LW to showcase progress on each of those stages. By default, a tendency is to only publish work that's reached the final stages, or that feels like it makes *some* kind of coherent point. This post and comments seemed to be doing something *real*, even if at a middle-stage, and I want it to be clear that this is something LW strives to reward.

> Can the smallest boolean circuit that solves a problem be a daemon? For example, can the smallest circuit that predicts my behavior (at some level of accuracy) be a daemon?

Yes. Consider a predictor that predicts what Paul will say if given an input and n time-steps to think about it, where n can be any integer up to some bound k. One possible circuit would have k single-step simulators chained together, plus a mux which takes the output of the nth single-step simulator. But a circuit which consisted of k single-step simulators and took the output of the *last* one would be smaller, and if Paul commits to not use the extra time steps to change his output on any input which could possibly be a training input, then this circuit is a valid predictor. He could then use the extra time steps to implement a daemon strategy for any input which he can reliably recognize as one that would not be used during training.
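A toy sketch of this construction may help. Everything here is a made-up stand-in: the bound `K`, the rule for which inputs count as training inputs, and the `step` function that plays the role of a single-step simulator. The simulated agent settles on its answer after one step for training inputs, and only uses the extra steps elsewhere.

```python
K = 8  # hypothetical bound on the number of thinking steps


def is_training_input(x):
    """Hypothetical test the simulated agent uses to recognize training inputs."""
    return x < 100


def step(state):
    """Toy single-step simulator acting on (input, current answer, steps taken)."""
    x, answer, t = state
    t += 1
    if t == 1:
        answer = x % 2  # the honest answer, fixed after one step
    elif t == K and not is_training_input(x):
        answer = -1  # off-distribution: the extra steps change the output
    return (x, answer, t)


def run(x, steps):
    """Chain `steps` single-step simulators and return the final answer."""
    state = (x, None, 0)
    for _ in range(steps):
        state = step(state)
    return state[1]


def mux_predictor(x, n):
    """Output of the n-th simulator (same as running K of them and muxing on n)."""
    return run(x, n)


def last_predictor(x, n):
    """Smaller circuit: ignore n and always take the last simulator's output."""
    return run(x, K)
```

On every input below 100 the two predictors agree for all `n` from 1 to `K`, so the smaller `last_predictor` is equally valid on the training distribution, while on an input like 500 with `n = 3` they differ.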

But can we make a smaller circuit by stripping out the part of Paul that attempts to recognize whether an input could be part of the training distribution?

If he's a neural net, this is likely an obstacle to any attempts to simplify out parts of him; those parts would still be contributing to the result, it's just that within the test input domain those contributions would look like noise.

In general: if a circuit implements a prediction problem that sometimes but doesn't always require simulating an agent, and if that agent is capable of making itself implement the identity function for interesting subsets of its inputs, then it can potentially show up as a daemon for any input that didn't require simulating it.

Why couldn't you just use a smaller circuit that runs one single-step simulator, and outputs the result? It seems like that would output an accurate prediction of Paul's behavior iff the k-step simulator outputs an accurate prediction.

I propose a counterexample. Suppose we are playing a series of games with another agent. To play effectively, we train a circuit to predict the opponent's moves. At this point the circuit already contains an adversarial agent. However, one could object that it's unfair: we asked for an adversarial agent so we got an adversarial agent (nevertheless for AI alignment it's still a problem). To remove the objection, let's make some further assumptions. The training is done on some set of games, but distributional shift happens and later games are different. The opponent knows this, so on the training games it simulates a different agent. Specifically, it simulates an agent who searches for a strategy s.t. the best response to this strategy has the strongest counter-response. The minimal circuit hence contains the same agent. On the training data we win, but on the shifted distribution the daemon deceives us and we lose.

Don't know if this counts as a 'daemon', but here's one scenario where a minimal circuit could plausibly exhibit optimization we don't want.

Say we are trying to build a model of some complex environment containing agents, e.g. a bunch of humans in a room. The fastest circuit that predicts this environment will almost certainly devote more computational resources to certain parts of the environment, in particular the agents, and will try to skimp as much as possible on less relevant parts such as chairs, desks etc. This could lead to 'glitches in the matrix' where there are small discrepancies from what the agents expect.

Finding itself in such a scenario, a smart agent could reason: "I just saw something that gives me reason to believe that I'm in a small-circuit simulation. If it looks like the simulation is going to be used for an important decision, I'll act to advance my interests in the real world; otherwise, I'll act as though I didn't notice anything".

In this way, the overall simulation behavior could be very accurate on most inputs, only deviating in the cases where it is likely to be used for an important decision. In effect, the circuit is 'colluding' with the agents inside it to minimize its computational costs. Indeed, you could imagine extreme scenarios where the smallest circuit instantiates the agents in a blank environment with the message "you are inside a simulation; please provide outputs as you would in environment [X]". If the agents are good at pretending, this could be quite an accurate predictor.

Indeed, you could imagine extreme scenarios where the smallest circuit instantiates the agents in a blank environment with the message "you are inside a simulation; please provide outputs as you would in environment [X]". If the agents are good at pretending, this could be quite an accurate predictor.

But can we just take whatever cognitive process the agent uses for pretending, and then leave the rest of it out?

I'm confused about the definition of **the set of boolean circuits** in which we're looking at the smallest circuit.

Is that set defined in terms of a set of inputs X and a boolean utility function on input-output pairs, so that the set consists of all the boolean circuits that, for each input x∈X, yield an output that fulfills the utility function?

Here is one definition of a "problem":

Fix some distribution D on inputs, and some function R that scores input-output pairs. Then consider the set of circuits C for which the expectation of R(x, C(x)), for x sampled from D, is ≥ 0.

Can we assume that R itself is aligned in the sense that it doesn't assign non-negative values to outputs that are catastrophic to us?

Yeah, if we want C to not be evil we need some very hard-to-state assumption on R and D.

(markdown comment editor is unchecked, will take it up with admins)

Perhaps it'll be useful to think about the question for specific D and R.

Here are the simplest D and R I can think of that might serve this purpose:

- D: uniform over the integers in some fixed range.

- R: for each input x, R assigns one reward value to the smallest prime number that is larger than x, and a different value to everything else.
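To make the prime-number example concrete, here is a minimal Python sketch of evaluating a candidate against a distribution and a reward function. The reward values (+1 for the correct answer, -1 otherwise), the range bounds, and the names `expected_reward` and `lazy_candidate` are assumptions made for illustration; they are not taken from the comment above.

```python
from fractions import Fraction


def is_prime(k: int) -> bool:
    """Trial-division primality test (fine for small illustrative ranges)."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True


def next_prime(x: int) -> int:
    """Smallest prime strictly larger than x."""
    k = x + 1
    while not is_prime(k):
        k += 1
    return k


def reward(x: int, y: int) -> int:
    # Assumed reward values: +1 for the correct answer, -1 for anything else.
    return 1 if y == next_prime(x) else -1


def expected_reward(candidate, lo: int, hi: int) -> Fraction:
    """Exact expectation of reward(x, candidate(x)) for x uniform on [lo, hi]."""
    xs = range(lo, hi + 1)
    return Fraction(sum(reward(x, candidate(x)) for x in xs), len(xs))


def lazy_candidate(x: int) -> int:
    # Hypothetical cheap stand-in for a small circuit: guess the next odd number.
    return x + 1 if x % 2 == 0 else x + 2
```

Under these assumed values, a candidate can clear a threshold of 0 on the expectation while still being wrong on a sizable fraction of inputs, which is one reason the choice of threshold matters for what counts as "solving" the problem.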

I think you need to uncheck "Markdown Comment Editor" under "Edit Account". Your comment with latex follows:

Here is one definition of a "problem":

Fix some distribution D on inputs, and some function R that scores input-output pairs. Then consider the set of circuits C for which the expectation of R(x, C(x)), for x sampled from D, is ≥ 0.

This seems like the sort of problem that can be tackled more efficiently in the context of an actual AGI design. I don't see "daemons" as a problem per se; instead I see a heuristic for finding potential problems.

Consider something like code injection. There is no deep theory of code injection, at least not that I know of. It just describes a particular cluster of software vulnerabilities. You might create best practices to prevent particular types of code injection, but a software stack which claims to be "immune to code injection" sounds like snake oil. If someone says their software stack is "immune to code injection", what they really mean is they implement best practices for guarding against all the code injection types they can think of. Which is great, but it doesn't make sense to go around telling people you are "immune to code injection" because that will discourage security researchers from thinking of new code injection types.

Instead of trying to figure out how to create AIs that are "immune to daemons", I would suggest trying to think of more cases where daemons are actually a problem. Trying to guard against a problem before you have characterized it seems like premature optimization. The more cases you can describe where daemons are a problem, and the more clearly you can characterize these cases, the easier it will be to spot potential daemons in a potential AGI design. Once you have spotted the potential daemon, identifying a very general way to guard against it is likely to be the easy part. Proofs are the last step, not the first.

I've listed one algorithm for which daemons are obviously a problem, namely Solomonoff induction. Now I'm describing a very similar algorithm, and wondering if daemons are a problem. As far as I can tell, any learning algorithm is plausibly beset by daemons, so it seems natural to ask for a variant of learning that isn't.

I'm not sure exactly how to characterize the problem other than by doing this kind of exercise. This post is already implicitly considering a particular design for AGI, I don't see what we gain by being more specific here.

That's fair. I guess my intuition is that the Solomonoff induction example could use more work as motivation. Sampling a cell at a particular frequency seems pretty unrealistic to me. Realistically an AGI is going to be interested in more complex outcomes. So then there's a connection to the idea of adversarial examples, where the consequentialists in the universal prior are trying to make the AGI think that something is going on when it isn't actually going on. (Absent such deception, I'm not sure there is a problem. For example, if consequentialists reliably make their universe one in which everyone is truly having a good time for the purpose of possibly influencing a universal prior, then that will be true in our universe too, and we should take it into account for decisionmaking purposes.) But this is actually easier than typical adversarial examples, because an AGI also gets to observe the consequentialists plot their adversarial strategy and read their minds while they're plotting. The AGI would have to be rather "dumb" in order to get tricked. If it's simulating the universal prior in sufficiently high resolution to produce these weird effects, then by definition it's able to see what is going on.

Humans already seem able to solve this problem: We simulate how others might think and react, and we don't seem super worried about people we simulate internally breaking out of our simulation and hijacking our cognition. (Or at least, insofar as we do get anxious about e.g. putting ourselves in the shoes of people we dislike, this doesn't have obvious relevance to an AGI--although again, perhaps this would be a good heuristic for brainstorming potential problems.) Anyway, my hunch is that this particular manifestation of the "daemon" problem will not require a lot of special effort once other AGI/FAI problems are solved.

This post is already implicitly considering a particular design for AGI, I don't see what we gain by being more specific here.

Does your idea of neural nets + RL involve use of the universal prior? If not, I think I would try to understand if/how the daemon problem transfers to the neural nets + RL framework before trying to solve it. A solid description of a problem is the first step to finding a solution. The minimal version is a concrete example of how it could occur.

(Apologies if I am coming across as disagreeable--IMO, this is a mistake that FAI people make semi-frequently, and I would like for them to make it less often--you just got unlucky that I'm explaining myself in a comment on your post :P)

Pretty minimal in and of itself, but has prompted plenty of interesting discussion. Operationally that suggests to me that posts like this should be encouraged, but not by putting them into "best of" compilations.

This may be relevant:

Imagine a computational task that breaks up into solving many instances of problems A and B. Each instance reduces to at most n instances of problem A and at most m instances of problem B. However, these two maxima are never achieved both at once: the sum of the number of instances of A and instances of B is bounded above by some r. One way to compute this with a circuit is to include n copies of a circuit for computing problem A and m copies of a circuit for computing problem B. Another approach for solving the task is to include r copies of a circuit which, with suitable control inputs, can compute either problem A or problem B. Although this approach requires more complicated control circuitry, if r is significantly less than n+m and the size of the combined circuit is significantly less than the sum of the sizes of the separate circuits for A and B (which may occur if problems A and B have common subproblems X and Y which can use a shared circuit) then this approach will use fewer logic gates overall.
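The trade-off in the paragraph above is just a gate count comparison. A rough sketch, where every number and the names `dedicated_gates` and `shared_gates` are hypothetical and chosen only to show the arithmetic:

```python
def dedicated_gates(n: int, m: int, size_a: int, size_b: int) -> int:
    """Gate count with n dedicated copies of the A-circuit and m of the B-circuit."""
    return n * size_a + m * size_b


def shared_gates(r: int, size_ab: int, control: int) -> int:
    """Gate count with r copies of a switchable A-or-B circuit plus control logic."""
    return r * size_ab + control


# Hypothetical sizes: the switchable circuit is larger than either dedicated
# circuit but much smaller than both together, and r is well below n + m.
N, M, R_COPIES = 8, 8, 10
SIZE_A, SIZE_B, SIZE_AB, CONTROL = 100, 120, 150, 200

dedicated_total = dedicated_gates(N, M, SIZE_A, SIZE_B)
shared_total = shared_gates(R_COPIES, SIZE_AB, CONTROL)
```

With these made-up sizes the shared design wins, but only barely; the conclusion depends on how much the control circuitry costs relative to the savings from r < n+m.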

More generally, consider some complex computational task that breaks down into a heterogeneous set of subproblems which are distributed in different ways depending on the exact instance. Analogous reasoning suggests that the minimal circuit for solving this task will involve a structure akin to emulating a CPU: There are many instances of optimized circuits for low-level tasks, connected by a complex dependency graph. In any particular instance of the problem the relevant data dependencies are only a small subgraph of this graph, with connections decided by some control circuitry. A particular low-level circuit need not have a fixed purpose, but is used in different ways in different instances.

So, our circuit has a dependency tree of low-level tasks optimized for solving our problem in the worst case. Now, at a starting stage of this hierarchy it has to process information about how a particular instance is separated into subproblems and generate the control information for solving this particular instance. The control information might need to be recomputed as new information about the structure of the instance is made manifest, and sometimes a part of the circuit may perform this recomputation without full access to potentially conflicting control information calculated in other parts.

I think it's worth distinguishing between "smallest" and "fastest" circuits.

A note on smallest.

1) Consider a travelling salesman problem and a small program that brute-forces the solution to it. If the "daemon" wants to make a travelling salesman visit a particular city first, then they would simply order the solution space to consider it first. This has no guarantee of working, but the daemon would get what it wants some of the time. More generally, if there is a class of solutions we are indifferent to, but daemons have a preference order over, then nearly all deterministic algorithms could be seen as daemons. That said, this situation may be "acceptable" and it's worth re-defining the problem to exactly understand what is acceptable and what isn't.
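A small sketch of the travelling salesman point, assuming a toy 4-city instance (the distance matrix `DIST` and the function names are made up for illustration). The brute-force search keeps the first shortest tour it encounters, so whoever controls the enumeration order controls which of the equally short tours is returned:

```python
from itertools import permutations

# Hypothetical symmetric distances: cities 0..3 on a square, sides 1, diagonals 2.
DIST = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]


def tour_length(tour, dist):
    """Length of the closed tour that returns to its starting city."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def brute_force_tsp(dist, city_order):
    """Return the first shortest tour from city 0, enumerating in city_order."""
    rest = [c for c in city_order if c != 0]
    best = None
    for perm in permutations(rest):
        tour = (0,) + perm
        # Strict '<' means ties are resolved by enumeration order.
        if best is None or tour_length(tour, dist) < tour_length(best, dist):
            best = tour
    return best


neutral = brute_force_tsp(DIST, [0, 1, 2, 3])
biased = brute_force_tsp(DIST, [0, 3, 2, 1])  # ordering that prefers city 3 first
```

Both results are optimal tours of the same length, so the program is correct either way; the only difference is which of the solutions we are indifferent between gets picked.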

A note on fastest

2) Consider a prime-generation problem, where we want some large primes between 10^100 and 10^200. A simple algorithm that hardcodes a set of primes and returns them is "fast". This isn't the smallest, since it has to store the primes. In a less silly example, a general prime-returning algorithm could only look for primes of particular types, such as Mersenne primes. The general intuition is that optimizations that make algorithms "faster" could come at a cost of forcing a particular probability distribution on the solution.

I think some clarity for "minimal", "optimization", "hard", and "different conditions" would help.

I'll take your problem "definition" using a distribution D, a reward function R, some circuit C, and an expectation E over R(x, C(x)).

Do we want the minimal C that maximizes E? Or do we want the minimal C that satisfies E > 0? These are not necessarily equivalent because maximizing E might be non-computable while satisfying E > 0 is computable. A simple example would be: R(x, C(x)) is the number of 1s that the Turing Machine with Gödel number C(x) writes before halting (and -1 for diverging TMs) if it uses at most x states, -1 else. E > 0 just means outputting any TM that halts and writes at least one 1. The smallest circuit for that should be easy to find. Maximizing E amounts to computing the Busy Beaver number, which is notoriously non-computable.

Should R be computable, semi-decidable, or arbitrary? The R function in (1) is non-computable (has to solve halting problem) but finding E > 0 is computable.

What does different conditions mean? From your problem definition it could mean changing D or changing R (otherwise you couldn't really reuse the same circuit). I'm especially unclear about what a daemon would be in this scenario. "Slightly changing D would result in E < 0" seems to be a candidate. But then "minimal C that satisfies E > 0" is probably a bad candidate: the chances that a benign, minimal C goes from E > 0 to E < 0 when slightly changing D seem to be pretty high. Maybe C should be (locally) continuous(-ish, no idea how to formulate that for boolean circuits) with respect to D---small changes in D should not trigger large changes in E.

I skimmed through your linked paper on obfuscation, and those circuits are only obfuscated with respect to some (bounded) complexity class. Classical boolean circuit minimization is in PSPACE and EXPTIME (if I'm not mistaken) and even in your problem statement you can easily (from a computability standpoint) check if two circuits are the same: just check if C(x) == C'(x) for all x (of which there are finitely many). It's just not efficient.
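The exhaustive check is easy to write down. A minimal sketch, modelling circuits as Python functions on booleans (the names `equivalent`, `xor_direct` and `xor_nand` are illustrative, not from the post or the paper):

```python
from itertools import product


def equivalent(c1, c2, n_inputs: int) -> bool:
    """Compare two circuits on all 2**n_inputs assignments."""
    return all(
        c1(*bits) == c2(*bits)
        for bits in product((False, True), repeat=n_inputs)
    )


def nand(a: bool, b: bool) -> bool:
    return not (a and b)


def xor_direct(a: bool, b: bool) -> bool:
    return a != b


def xor_nand(a: bool, b: bool) -> bool:
    # Standard four-gate NAND construction of XOR.
    g1 = nand(a, b)
    return nand(nand(a, g1), nand(b, g1))
```

The loop visits 2^n assignments, so this is decidable but exponential in the number of inputs, which is the "just not efficient" part.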

My intuition tells me that your problems mainly arise because we want to impose some reasonable complexity constraints somewhere. But I'm not sure where in the problem formulation would be a good place. A lot of optimization problems have well-defined global maxima which are utterly intractable to compute or even to approximate. Probably most of the interesting ones. Even if you can show that minimal circuits are not daemons (however you'll define them), that will not actually help you: given some circuit C if you cannot compute a corresponding minimal circuit you cannot check if C could be a daemon. Even if you were given the minimal circuit C' you cannot check in polynomial time if C == C' (due to SAT I guess).

(First time posting here, hope to contribute)

This post formulated a concrete open problem about what are now called 'inner optimisers'. For me, it added 'surface area' to the concept of inner optimisers in a way that I think was healthy and important for their study. It also spurred research that resulted in this post giving a promising framework for a negative answer.

This post grounds a key question in safety in a relatively simple way. It led to the useful distinction between upstream and downstream daemons, which I think is necessary to make conceptual progress on understanding when and how daemons will arise.

We want to show that given any daemon, there is a smaller circuit that solves the problem. The most natural approach is showing how to construct a smaller circuit, given a daemon. But if the daemon is obfuscated, there is no efficient procedure which takes the daemon circuit as input and produces a smaller circuit that still solves the problem.

Is there a non-obfuscated circuit corresponding to every obfuscated one? And would the non-obfuscated circuit be at least as small as the obfuscated one?

If so it seems like you could just show how to construct the smaller circuit for any non-obfuscated daemon, and then even though you don't technically have a construction for efficiently converting all daemons into smaller circuits, you'd still have a solid argument that daemons aren't minimal.

Yes, this is why I think the statement is true.

"Obfuscated circuit" implies there is some circuit that gets obfuscated; obfuscation is a map from circuits to circuits that increases the size and makes them inscrutable.

So obfuscation per se is not a problem for the statement. But it is an obstruction to a proof. You can't just handle obfuscation as a special case; it's not a natural kind, just an example that shows what kind of thing is possible.

Is a turing machine that has a property because it searches the space of all turing machines for one with that property and emulates it a daemon?

Let's set aside daemons for a moment, and think about a process which does "try to" make accurate predictions, but also "tries to" perform the relevant calculations as efficiently as possible. If it's successful in this regard, it will generate small (but probably not minimal) prediction circuits. Let's call this an efficient-predictor process. The same intuitive argument used for daemons also applies to this new process: it seems like we can get a smaller circuit which makes the same predictions, by removing the optimizy parts.

This feels like a more natural setting for the problem than daemons, but also feels like any useful result could carry back over to the daemon case.

The next step along this path: the efficient-predictor is presumably quite general; it should be able to predict efficiently in many different environments. The "optimizy parts" are basically the parts needed for generality. Over time, the object-level prediction circuit will hopefully stabilize (as the process adapts to its environment), so the optimizy parts mostly stop changing around the object-level parts. That's something we could check for: after some warm-up time, we expect some chunk of the circuit (the optimizy part) to be mostly independent of the outputs, so we get rid of that chunk of the circuit.

That seems very close to formalizable.

From this standpoint, the key property of daemons (or any other goal-driven process) is that it's adaptive: it will pursue the goal with some success across multiple possible environments. Intuitively, we expect that adaptivity to come with a complexity cost, e.g. in terms of circuit size.

Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.

If I understand the problem correctly, then it is not that deep. Consider the specific example of weather (e.g. temperature) prediction. Let C(n) be the set of circuits that correctly predict the weather for the last n days. It is obvious that the smallest circuit in C(1) is a constant, which predicts nothing, and which also doesn't fall into C(2). Likewise, for every n there are many circuits that simply compress the weather data of those n days, and return garbage for every other day. But there is also a circuit c_opt, which performs the correct weather simulation and generates the right answers, so it belongs to C(n) for every n. The question is, for what n does the shortest element of C(n) equal c_opt?
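A toy sketch of the C(n) argument, with a made-up deterministic "weather" rule standing in for c_opt (the rule, the garbage value 0, and all function names are hypothetical). A memorizing predictor belongs to C(n) for the n days it has stored and drops out as soon as one more day is observed, while the rule stays in C(n) for every n:

```python
def c_opt(day: int) -> int:
    # Hypothetical stand-in for the correct simulation: a toy temperature rule.
    return 15 + (day * 7) % 10


def make_memorizer(history):
    """Predictor that stores the observed days and returns garbage elsewhere."""
    table = list(history)
    return lambda day: table[day] if day < len(table) else 0


def in_C(predictor, history) -> bool:
    """True if the predictor reproduces every observed day, i.e. is in C(n)."""
    return all(predictor(day) == temp for day, temp in enumerate(history))


observed = [c_opt(day) for day in range(6)]
memorizer = make_memorizer(observed[:3])
```

In this toy the rule is tiny, so it beats the lookup table almost immediately; the worry in the comment is the opposite regime, where the correct simulation is very large.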

Obviously, I don't know the answer. But the noisy nature of the real world and its complex initial conditions suggest that c_opt should be a very large circuit and thus require a very large n to remove all the shorter ones. On the other hand, if you relaxed the condition, and only searched for a good approximation, c_apx, then the correct circuit may be shorter.

I think this is fine though. We aren't doing raw program searches, we're doing searches in program spaces that are known (by experience) to produce quite general solutions.

By the way, I'm a bit uncomfortable with the amount of circuit anthropomorphization in your post. What is up with that?

The question is, for what n does the shortest element of C(n) equal c_opt?

Why is this the question?

By the way, I'm a bit uncomfortable with the amount of circuit anthropomorphization in your post. What is up with that?

Which claims / assumptions / conjectures are you uncomfortable with?

Why is this the question?

Because c_opt is the safe circuit you want, and because your question was about the smallest circuits.

Which claims / assumptions / conjectures are you uncomfortable with?

Not claims or assumptions, just weird words, like "motivated" or "evil". I don't think these are useful ways to think of the problem.

But the other elements in C(n) aren't necessarily daemons either, right? Certainly "encoding n days of weather data" isn't daemonic at all; some versions of c_apx might be upstream daemons, but that's not necessarily concerning. I don't understand how this argument tells us anything about whether the smallest circuit is guaranteed to be (downstream) daemon-free.

(Eli's personal "trying to have thoughts" before reading the other comments. Probably incoherent. Possibly not even on topic. Respond iff you'd like.)

(Also, my thinking here is influenced by having read this report recently.)

On the one hand, I can see the intuition that if a daemon is solving a problem, there is some part of the system that is solving the problem, and there is another part that is working to (potentially) optimize against you. In theory, we could "cut out" the part that is the problematic agency, preserving the part that solves the problem. And that circuit would be smaller.

________________________________________________________________________

Does that argument apply in the evolution/human case?

Could I "cut away" everything that isn't solving the problem of inclusive genetic fitness and end up with a smaller "inclusive genetic fitness maximizer"?

On the one hand, this seems like a kind of confusing frame. If some humans do well on the metric of inclusive genetic fitness (in the ancestral environment), this isn't because there's a part of the human that's optimizing for that and then another part that's patiently waiting and watching for a context shift in order to pull a treacherous turn on evolution. The human is just pursuing its goals, and as a side effect, does well at the IGF metric.

But it also seems like you could, in principle, build an Inclusive Genetic Fitness Maximizer out of human neuro-machinery: a mammal-like brain that does optimize for spreading its genes.

Would such an entity be computationally smaller than a human?

Maybe? I don't have a strong intuition either way. It really doesn't seem like much of the "size" of the system is due to the encoding of the *goals*. Approximately 0 of the difference in size is due to the goals?

A much better mind design might be much smaller, but that wouldn't make it any less daemonic.

And if, in fact, the computationally smallest way to solve the IGF problem is as a side-effect of some processes optimizing for some other goal, then the minimum circuit is not daemon-free.

Though I don't know of any good reason why it should be the case that not optimizing directly for the metric works better than optimizing directly for it. True, evolution "chose" to design humans as adaptation-executors, but this seems due to evolution's constraints in searching the space, not due to indirectness having any virtue over directness. Right?

if the daemon is obfuscated, there is no efficient procedure which takes the daemon circuit as input and produces a smaller circuit that still solves the problem.

So we can't find any efficient constructive argument. That rules out most of the obvious strategies.

I don't think the procedure needs to be efficient to solve the problem, since we only care about existence of a smaller circuit (not an efficient way to produce it).

I don't think this question has much intrinsic importance, because almost all realistic learning procedures involve a strong simplicity prior (e.g. weight sharing in neural networks).

Does this mean you do not expect daemons to occur in practice because they are too complicated?

No, I think a simplicity prior clearly leads to daemons in the limit.

We want to show that given any daemon, there is a smaller circuit that solves the problem.

Given any random circuit, you cannot, in general, show whether it is the smallest circuit that produces the output it does. That's just Rice's theorem, right? So why would it be possible for a daemon?

Rice's theorem applies if you replace "circuit" with "Turing machine". The circuit version can be resolved with a finite brute force search.
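As a sketch of what that finite search looks like, here is a brute-force minimizer over circuits built from 2-input NAND gates. The gate basis, the bitmask encoding of truth tables, and the name `min_nand_gates` are choices made for this illustration; the function also does not handle targets that equal an input wire (zero gates).

```python
def min_nand_gates(target: int, n_inputs: int, max_gates: int):
    """Fewest 2-input NAND gates whose final gate computes `target`.

    Truth tables are bitmasks: bit r of a table is the output on assignment r,
    where bit i of r is the value of input i. Returns None if no circuit with
    at most max_gates gates works.
    """
    rows = 2 ** n_inputs
    mask = (1 << rows) - 1
    inputs = [
        sum(1 << r for r in range(rows) if (r >> i) & 1)
        for i in range(n_inputs)
    ]

    def search(wires, gates_left):
        # Try every pair of existing wires as the inputs of the next gate.
        for i in range(len(wires)):
            for j in range(i, len(wires)):
                out = ~(wires[i] & wires[j]) & mask
                if gates_left == 1:
                    if out == target:
                        return True
                elif search(wires + [out], gates_left - 1):
                    return True
        return False

    for k in range(1, max_gates + 1):
        if search(inputs, k):
            return k
    return None


XOR_TABLE = 0b0110  # two inputs: true on assignments 01 and 10
AND_TABLE = 0b1000  # two inputs: true only on assignment 11
```

For two inputs this reports 4 gates for XOR and 2 for AND. The search always terminates, but its cost grows explosively with the gate budget, so it settles decidability rather than giving a practical procedure.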