Learning the prior and generalization 2020-07-29T22:49:42.696Z · score: 32 (11 votes)
Weak HCH accesses EXP 2020-07-22T22:36:43.925Z · score: 12 (3 votes)
Alignment proposals and complexity classes 2020-07-16T00:27:37.388Z · score: 46 (10 votes)
AI safety via market making 2020-06-26T23:07:26.747Z · score: 48 (16 votes)
An overview of 11 proposals for building safe advanced AI 2020-05-29T20:38:02.060Z · score: 153 (56 votes)
Zoom In: An Introduction to Circuits 2020-03-10T19:36:14.207Z · score: 84 (23 votes)
Synthesizing amplification and debate 2020-02-05T22:53:56.940Z · score: 39 (15 votes)
Outer alignment and imitative amplification 2020-01-10T00:26:40.480Z · score: 30 (7 votes)
Exploring safe exploration 2020-01-06T21:07:37.761Z · score: 37 (11 votes)
Safe exploration and corrigibility 2019-12-28T23:12:16.585Z · score: 17 (8 votes)
Inductive biases stick around 2019-12-18T19:52:36.136Z · score: 51 (15 votes)
Understanding “Deep Double Descent” 2019-12-06T00:00:10.180Z · score: 109 (49 votes)
What are some non-purely-sampling ways to do deep RL? 2019-12-05T00:09:54.665Z · score: 15 (5 votes)
What I’ll be doing at MIRI 2019-11-12T23:19:15.796Z · score: 118 (37 votes)
More variations on pseudo-alignment 2019-11-04T23:24:20.335Z · score: 20 (6 votes)
Chris Olah’s views on AGI safety 2019-11-01T20:13:35.210Z · score: 151 (46 votes)
Gradient hacking 2019-10-16T00:53:00.735Z · score: 54 (16 votes)
Impact measurement and value-neutrality verification 2019-10-15T00:06:51.879Z · score: 35 (10 votes)
Towards an empirical investigation of inner alignment 2019-09-23T20:43:59.070Z · score: 43 (11 votes)
Relaxed adversarial training for inner alignment 2019-09-10T23:03:07.746Z · score: 45 (11 votes)
Are minimal circuits deceptive? 2019-09-07T18:11:30.058Z · score: 51 (12 votes)
Concrete experiments in inner alignment 2019-09-06T22:16:16.250Z · score: 63 (20 votes)
Towards a mechanistic understanding of corrigibility 2019-08-22T23:20:57.134Z · score: 40 (12 votes)
Risks from Learned Optimization: Conclusion and Related Work 2019-06-07T19:53:51.660Z · score: 65 (19 votes)
Deceptive Alignment 2019-06-05T20:16:28.651Z · score: 64 (18 votes)
The Inner Alignment Problem 2019-06-04T01:20:35.538Z · score: 73 (19 votes)
Conditions for Mesa-Optimization 2019-06-01T20:52:19.461Z · score: 59 (20 votes)
Risks from Learned Optimization: Introduction 2019-05-31T23:44:53.703Z · score: 129 (38 votes)
A Concrete Proposal for Adversarial IDA 2019-03-26T19:50:34.869Z · score: 18 (6 votes)
Nuances with ascription universality 2019-02-12T23:38:24.731Z · score: 24 (7 votes)
Dependent Type Theory and Zero-Shot Reasoning 2018-07-11T01:16:45.557Z · score: 18 (11 votes)


Comment by evhub on Learning the prior and generalization · 2020-08-03T22:17:05.241Z · score: 4 (2 votes) · LW · GW

Fair enough. In practice you still want training to also be from the same distribution because that's what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)


This seems to rely on an assumption that "human is convinced of X" implies "X"? Which might be fine, but I'm surprised you want to rely on it.

I'm curious what an algorithm might be that leverages this relaxation.

Well, I'm certainly concerned about relying on assumptions like that, but that doesn't mean there aren't ways to make it work. Approaches like debate and approval-based amplification already rely on very similar assumptions—for example, for debate to work it needs to be the case that being convinced of X at the end of the debate implies X. Thus, one way to leverage this relaxation is just to port those approaches over to this setting. For example, you could train via debate over what the human would do if they could access all of the relevant information, then randomly do full debate rollouts during deployment. Like I mention in the post, this still just gives you average-case guarantees, not worst-case guarantees, though average-case guarantees are still pretty good and you can do a lot with them if you can actually get them.

Comment by evhub on Learning the prior and generalization · 2020-08-03T19:54:46.149Z · score: 4 (2 votes) · LW · GW

Alright, I think we're getting closer to being on the same page now. I think it's interesting to think about what the relaxation from requiring training/validation/deployment to be i.i.d. to just requiring that validation/deployment be i.i.d. gets us, though. First, it's an argument that we shouldn't be that worried about whether the training data is i.i.d. relative to the validation/deployment data. Second, it opens the door to an even further relaxation, which is that you can do the validation while looking at the model's output. That is, if the model can provide a justification which convinces you that your ground truth generator would have produced the same output, that's just as good as actually checking against the ground truth. At that point, though, it really stops looking anything like the standard i.i.d. setup, which is why I'm hesitant to just call it “validation/deployment i.i.d.” or something.

Comment by evhub on Learning the prior and generalization · 2020-08-03T18:03:14.336Z · score: 4 (2 votes) · LW · GW

Sure, but at the point where you're randomly deciding whether to collect ground truth for a data point and check the model on it (that is, put it in the validation data) or collect new data using the model to make predictions, you have verifiability. Importantly, though, you can get verifiability without doing that—including if the data isn't actually i.i.d. (e.g. there are detectable differences between training and deployment data) but you still decide randomly whether to check the model's output against some ground truth. In either situation, though, part of the point that I'm making is that the safety benefits are coming from the verifiability part not the i.i.d. part—even in the simple example of i.i.d.-ness giving you validation data, what's mattering is that the validation and deployment data are i.i.d. (because that's what gives you verifiability), but not whether the training and validation/deployment data are i.i.d.

Comment by evhub on Inner Alignment: Explain like I'm 12 Edition · 2020-08-02T02:41:28.368Z · score: 36 (8 votes) · LW · GW

This is great—thanks for writing this! I particularly liked your explanation of deceptive alignment with the diagrams to explain the different setups. Some comments, however:

(These models are called the training data and the setting is called supervised learning.)

Should be “these images are.”

Thus, there is only a problem if our way of obtaining feedback is flawed.

I don't think that's right. Even if the feedback mechanism is perfect, if your inductive biases are off, you could still end up with a highly misaligned model. Consider, for example, Paul's argument that the universal prior is malign—that's a setting where the feedback is perfect but you still get malign optimization because the prior is bad.

For proxy alignment, think of Martin Luther King.

The analogy is meant to be to the original Martin Luther, not MLK.

If we further assume that processing input data doesn't directly modify the model's objective, it follows that representing a complex objective via internalization is harder than via "modelling" (i.e., corrigibility or deception).

I'm not exactly sure what you're trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world model including information about the objective.

Comment by evhub on Learning the prior and generalization · 2020-07-30T21:46:52.959Z · score: 4 (2 votes) · LW · GW

Fwiw, it took me a few re-reads to realize you were just arguing for the no-free-lunch theorem -- I initially thought you were arguing "since there is no 'true' distribution for a dataset, datasets can never be i.i.d., and so the theorems never apply in practice".

Hmmm... I'll try to edit the post to be more clear there.

How does verifiability help with this problem?

Because rather than just relying on the i.i.d. setting to give us the guarantees that we want, we force those guarantees to hold by actually randomly checking the model's predictions. From the perspective of a deceptive model, knowing that its predictions will just be trusted because the human thinks the data is i.i.d. gives it a lot more freedom than knowing that its predictions will actually be checked at random.

Perhaps you'd say, "with verifiability, the model would 'show its work', thus allowing the human to notice that the output depends on RSA-2048, and so we'd see that we have a bad model". But this seems to rest on having some sort of interpretability mechanism

There's no need to invoke interpretability here—we can train the model to give answers + justifications via any number of different mechanisms including amplification, debate, etc. The point is just to have some way to independently check the model's answers to induce i.i.d.-like guarantees.
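As an illustration (the function names and numbers below are hypothetical, not from the post), randomly spot-checking deployment outputs against an independent ground-truth process is one way to induce these i.i.d.-like guarantees:

```python
# Hypothetical sketch: instead of trusting outputs because the data is assumed
# i.i.d., randomly audit deployment queries against an expensive ground-truth
# process (e.g. amplification or debate rollouts).
import random

def deploy(model, queries, ground_truth, check_prob=0.05, seed=0):
    rng = random.Random(seed)
    outputs, failures = [], []
    for q in queries:
        y = model(q)
        # The audit must be unpredictable to the model: a deceptive model that
        # knew which answers get checked could defect only on the unchecked ones.
        if rng.random() < check_prob and y != ground_truth(q):
            failures.append(q)
        outputs.append(y)
    return outputs, failures
```

With check_prob = 1 every wrong answer is caught; with smaller values you only get average-case guarantees, but a model that defects on any non-negligible fraction of queries is still caught with probability approaching 1 over a long deployment.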

Comment by evhub on Learning the prior and generalization · 2020-07-30T02:54:32.433Z · score: 7 (4 votes) · LW · GW

First off, I want to note that it is important that datasets and data points do not come labeled with "true distributions" and you can't rationalize one for them after the fact. But I don't think that's an important point in the case of i.i.d. data.

I agree—I pretty explicitly make that point in the post.

Why not just apply the no-free-lunch theorem? It says the same thing. Also, why do we care about this? Empirically the no-free-lunch theorem doesn't matter, and even if it did, I struggle to see how it has any safety implications -- we'd just find that our ML model is completely unable to get any validation performance and so we wouldn't deploy.

I agree that this is just the no free lunch theorem, but I generally prefer explaining things fully rather than just linking to something else so it's easier to understand the text just by reading it.

The reason I care, though, is because the fact that the performance is coming from the implicit ML prior means that if that prior is malign then even in the i.i.d. case you can still get malign optimization.

I don't really see this. My read of this post is that you introduced "verifiability", argued that it has exactly the same properties as i.i.d. (since i.i.d. also gives you average-case guarantees), and then claimed it's better specified than i.i.d. because... actually I'm not sure why, but possibly because we can never actually get i.i.d. in practice?

That's mostly right, but the point is that the fact that you can't get i.i.d. in practice matters because it means you can't get good guarantees from it—whereas I think you can get good guarantees from verifiability.

If that's right, then I disagree. The way in which we lose i.i.d. in practice is stuff like "the system could predict the pseudorandom number generator" or "the system could notice how much time has passed" (e.g. via RSA-2048). But verifiability has the same obstacles and more, e.g. you can't verify your system if it can predict which outputs you will verify, you can't verify your system if it varies its answer based on how much time has passed, you can't verify your system if humans will give different answers depending on random background variables like how hungry they are, etc. So I don't see why verifiability does any better.

I agree that you can't verify your model's answers if it can predict which outputs you will verify (though I don't think getting true randomness will actually be that hard)—but the others are notably not problems for verifiability despite being problems for i.i.d.-ness. If the model gives answers where the tree of reasoning supporting those answers depends on how much time has passed or how hungry the human was, then the idea is to reject those answers. The point is to produce a mechanism that allows you to verify justifications for correct answers to the questions that you care about.

Comment by evhub on What Failure Looks Like: Distilling the Discussion · 2020-07-30T00:53:41.685Z · score: 2 (1 votes) · LW · GW

I could imagine “enemy action” making sense as a label if the thing you're worried about is enemy humans deploying misaligned AI, but that's very much not what Paul is worried about in the original post. Rather, Paul is concerned about us accidentally training AIs which are misaligned and thus pursue convergent instrumental goals like resource and power acquisition that result in existential risk.

Furthermore, they're also not “enemy AIs” in the sense that “the AI doesn't hate you”—it's just misaligned and you're in its way—and so even if you specify something like “enemy AI action” that still seems to me to conjure up a pretty inaccurate picture. I think something like “influence-seeking AIs”—which is precisely the term that Paul uses in the original post—is much more accurate.

Comment by evhub on What Failure Looks Like: Distilling the Discussion · 2020-07-30T00:30:40.422Z · score: 2 (1 votes) · LW · GW

I mean, I agree that the scenario is about adversarial action, but it's not adversarial action by enemy humans—or even enemy AIs—it's adversarial action by misaligned (specifically deceptive) mesa-optimizers pursuing convergent instrumental goals.

Comment by evhub on What Failure Looks Like: Distilling the Discussion · 2020-07-29T23:45:02.738Z · score: 5 (3 votes) · LW · GW

Failure by enemy action.

This makes it sound like it's describing misuse risk, when really it's about accident risk.

Comment by evhub on Developmental Stages of GPTs · 2020-07-29T00:19:03.450Z · score: 29 (10 votes) · LW · GW

Here are 11. I wouldn't personally assign greater than 50/50 odds to any of them working, but I do think they all pass the threshold of “could possibly, possibly, possibly work.” It is worth noting that only some of them are language modeling approaches—though they are all prosaic ML approaches—so it does sort of also depend on your definition of “GPT-style” how many of them count or not.

Comment by evhub on Developmental Stages of GPTs · 2020-07-29T00:14:01.549Z · score: 4 (2 votes) · LW · GW

I feel like it's one reasonable position to call such proposals non-starters until a possibility proof is shown, and instead work on basic theory that will eventually be able to give more plausible basic building blocks for designing an intelligent system.

I agree that deciding to work on basic theory is a pretty reasonable research direction—but that doesn't imply that other proposals can't possibly work. Thinking that a research direction is less likely to mitigate existential risk than another is different than thinking that a research direction is entirely a non-starter. The second requires significantly more evidence than the first and it doesn't seem to me like the points that you referenced cross that bar, though of course that's a subjective distinction.

Comment by evhub on Developmental Stages of GPTs · 2020-07-28T22:56:30.976Z · score: 15 (4 votes) · LW · GW

As I understand it, the high level summary (naturally Eliezer can correct me) is that (a) corrigible behaviour is very unnatural and hard to find (most nearby things in mindspace are not in equilibrium and will move away from corrigibility as they reflect / improve), and (b) using complicated recursive setups with gradient descent to do supervised learning is incredibly chaotic and hard to manage, and shouldn't be counted on working without major testing and delays (i.e. could not be competitive).

Perhaps Eliezer can interject here, but it seems to me like these are not knockdown criticisms that such an approach can't “possibly, possibly, possibly work”—just reasons that it's unlikely to and that we shouldn't rely on it working.

Comment by evhub on Deceptive Alignment · 2020-07-28T19:12:40.334Z · score: 4 (2 votes) · LW · GW

I talk about this a bit here, but basically if you train huge models for a short period of time, you're really relying on your inductive biases to find the simplest model that fits the data—and mesa-optimizers, especially deceptive mesa-optimizers, are quite simple, compressed policies.

Comment by evhub on Writeup: Progress on AI Safety via Debate · 2020-07-24T23:35:18.796Z · score: 10 (3 votes) · LW · GW

As I understand the way that simultaneity is handled here, the two debaters are assigned to opposing positions and then argue simultaneously for them. Thus, both debaters start by arguing for their positions without seeing each other's arguments, then each refutes the other's arguments without seeing the other's refutations, and so on.

I was talking about this procedure with Scott Garrabrant and he came up with an interesting insight: why not generalize this procedure beyond just debates between a proposition and its negation? Why not use it to conduct arbitrary debates between any two propositions? That would let you get general pairwise comparisons for the likelihood of arbitrary propositions, rather than just answering individual questions. Seems like a pretty straightforward generalization that might be useful in some contexts.

Also, another insight from Scott—simultaneous debate over yes-or-no questions can actually answer arbitrary questions if you just ask a question like:

Should I eat for dinner tonight whatever it is that you think I should eat for dinner tonight?

Comment by evhub on Arguments against myopic training · 2020-07-24T20:05:19.666Z · score: 2 (1 votes) · LW · GW

Yeah, I meant without a human looking at the output. I also agree with pretty much everything you just said. We're pretty deep in this comment chain now and I'm not exactly sure why we got here—I agree that Richard's original definition was based on the standard RL definition of myopia, though I was making the point that Richard's attempt to make imitative amplification non-myopic turned it into approval-based amplification. Richard's version has a human evaluate the output rather than a distance metric, which I see as the defining difference between imitative and approval-based amplification.

Comment by evhub on Arguments against myopic training · 2020-07-23T22:47:48.214Z · score: 8 (3 votes) · LW · GW

Both GAIL and AIRL only use expert trajectories rather than expert evaluations, which means they both satisfy the counterfactual oracle analogy, and so I would call them myopic training algorithms in the sense that I was using that term above. That being said, I definitely agree that the term is very overloaded here—some conflicting definitions:

  • an RL training procedure is myopic if the discount factor γ = 0;
  • an RL training procedure is myopic if γ = 0 and it incentivizes CDT-like behavior in the limit (e.g. it shouldn't cooperate with its past self in a one-shot prisoner's dilemma);
  • an ML training procedure is myopic if the model is evaluated without a human looking at its output (as in the counterfactual oracle analogy).
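To make the first definition concrete, here's a minimal Python sketch (toy numbers of my own): with discount factor gamma = 0, the return an agent optimizes collapses to its immediate reward, while gamma = 1 sums rewards across the whole episode.

```python
# Sketch of the gamma = 0 notion of myopia: compute the discounted return
# of a reward sequence and note what each discount factor pays attention to.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # g accumulates r_t + gamma * r_{t+1} + ...
    return g

rewards = [1.0, 10.0, 100.0]
assert discounted_return(rewards, gamma=0.0) == rewards[0]   # myopic: only r_0 matters
assert discounted_return(rewards, gamma=1.0) == sum(rewards) # non-myopic: all steps matter
```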
Comment by evhub on Arguments against myopic training · 2020-07-22T19:54:38.582Z · score: 6 (3 votes) · LW · GW

Sure, but imitative amplification can't be done without myopic training or it ceases to be imitative amplification and becomes approval-based amplification, which means you no longer have any nice guarantees about limiting to HCH.

Comment by evhub on Arguments against myopic training · 2020-07-22T19:01:17.856Z · score: 4 (2 votes) · LW · GW

My point here is that I think imitative amplification (if you believe it's competitive) is a counter-example to Richard's argument in his “Myopic training doesn't prevent manipulation of supervisors” section since any manipulative actions that an imitative amplification model takes aren't judged by their consequences but rather just by how closely they match up with what the overseer would do.

Comment by evhub on Alignment proposals and complexity classes · 2020-07-22T02:52:52.496Z · score: 6 (3 votes) · LW · GW

Yeah, I think that's absolutely right—I actually already have a version of my market making proof for amplification that I've been working on cleaning up for publishing. But regardless of how you prove it, I certainly agree that I understated amplification here and that it can in fact reach a larger complexity class than I gave it credit for.

Comment by evhub on Arguments against myopic training · 2020-07-21T23:29:02.120Z · score: 4 (2 votes) · LW · GW

... Why isn't this compatible with saying that the supervisor (HCH) is "able to accurately predict how well their actions fulfil long-term goals"? Like, HCH presumably takes those actions because it thinks those actions are good for long-term goals.

In the imitative case, the overseer never makes a determination about how effective the model's actions will be at achieving anything. Rather, the overseer is only trying to produce the best answer for itself, and the loss is determined via a distance metric. While the overseer might very well try to determine how effective its own actions will be at achieving long-term goals, it never evaluates how effective the model's actions will be. I see this sort of trick as the heart of what makes the counterfactual oracle analogy work.

Comment by evhub on Arguments against myopic training · 2020-07-21T23:25:24.521Z · score: 6 (3 votes) · LW · GW

So it seems misleading to describe a system as myopically imitating a non-myopic system -- there is no significant difference between non-myopic Q-learning vs myopic imitation of Q-learning. A notion of "myopia" which agrees with your usage (allowing for myopic imitation of HCH) does not seem like a very useful notion of myopia. I see this as the heart of Ricraz' critique (or at least, the part that I agree with).

I agree that there's no difference between the training setup where you do myopic RL on a Q function and the training setup where you just do Q learning directly, but that doesn't at all imply that there's no difference between internally myopically imitating some other Q learning agent and internally running some sort of non-myopic optimization process. As a silly example, if my Q learning agent has some weird idiosyncratic behavior, then the model which is imitating it will exhibit the same behavior, whereas the model which is just trying to optimize the reward directly won't.

This especially matters in the context of HCH because we care quite a lot about getting out as direct an imitation of HCH as possible. In particular, it matters quite a lot that our model be internally trying to myopically imitate HCH rather than internally trying to non-myopically get the least loss across episodes, as the latter will lead it to output simpler answers to make its job easier.

Comment by evhub on AI safety via market making · 2020-07-21T00:31:48.892Z · score: 2 (1 votes) · LW · GW

Ah, I see—makes sense.

Comment by evhub on Why is pseudo-alignment "worse" than other ways ML can fail to generalize? · 2020-07-20T21:55:37.440Z · score: 22 (5 votes) · LW · GW

So, I certainly agree that pseudo-alignment is a type of robustness/distributional shift problem. In fact, I would describe “Risks from Learned Optimization” as a deep dive on a particular subset of robustness problems that might be particularly concerning from a safety standpoint. Thus, in that sense, whether it's really a “new” sort of robustness problem is less the point than the analysis that the paper presents of that robustness problem. That being said, I do think that at least the focus on mesa-optimization was fairly novel in terms of cashing out the generalization failures we wanted to discuss in terms of the sorts of learned optimization processes that might exhibit them (as well as the discussion of deception, as you mention).

I don't understand what "safety properties of the base optimizer" could be, apart from facts about the optima it tends to produce.

I agree with that and I think that the sentence you're quoting there is meant for a different sort of reader that has less of a clear concept of ML. One way to interpret the passage you're quoting that might help you is that it's just saying that guarantees about global optima don't necessarily translate to local optima or to actual models you might find in practice.

But even without mesa-optimizers, cases of ML generalization failure often involve the latter, not just the former.

I also agree with this. I would describe my picture here as something like: Pseudo-aligned mesa-optimization ⊂ Objective generalization without capability generalization ⊂ Robustness problems. Given that picture, I would say that the pseudo-aligned mesa-optimizer case is the most concerning from a safety perspective, then generic objective generalization without capability generalization, then robustness problems in general. And I would argue that it makes sense to break it down in that way precisely because you get more concerning safety problems as you go narrower.

Also, more detail on the capability vs. objective robustness picture is also available here and here.

Comment by evhub on AI safety via market making · 2020-07-20T20:04:30.288Z · score: 2 (1 votes) · LW · GW

afaict there isn't a positive consideration for iterated amplification / market making that doesn't also apply to debate

For amplification, I would say that the fact that it has a known equilibrium (HCH) is a positive consideration that doesn't apply to debate. For market making, I think that the fact that it gets to be per-step myopic is a positive consideration that doesn't apply to debate. There are others too for both, though those are probably my biggest concerns in each case.

Comment by evhub on Alignment proposals and complexity classes · 2020-07-17T04:36:06.174Z · score: 7 (4 votes) · LW · GW

I agree with the gist that it implies that arguments about the equilibrium policy don't necessarily translate to real models, though I disagree that that's necessarily bad news for the alignment scheme—it just means you need to find some guarantees that work even when you're not at equilibrium.

Comment by evhub on Alignment proposals and complexity classes · 2020-07-17T01:20:23.549Z · score: 4 (2 votes) · LW · GW

I read the post as attempting to be literal, ctrl+F-ing "analog" doesn't get me anything until the comments. Also, the post is the one that I read as assuming for the sake of analysis that humans can solve all problems in P, I myself wouldn't necessarily assume that.

I mean, I assumed what I needed to in order to be able to do the proofs and have them make sense. What the proofs actually mean in practice is obviously up for debate, but I think that a pretty reasonable interpretation is that they're something like analogies which help us get a handle on how powerful the different proposals are in theory.

Comment by evhub on Conditions for Mesa-Optimization · 2020-07-16T21:27:49.903Z · score: 6 (3 votes) · LW · GW

I suppose it depends on what you mean by “dominate.” Mathematically, the point here is just that one term is asymptotically larger than the other, which is what I would describe as that term “dominating” the other.

Comment by evhub on Alignment proposals and complexity classes · 2020-07-16T21:11:15.656Z · score: 2 (1 votes) · LW · GW

Glad you found the post exciting!

I follow you up to there. What you're showing is basically that you can train M to solve any problem in C using a specific alignment approach, without limiting in any way the computational power of M. So it might take an intractable amount of resources, like exponential time, for this model to solve a problem in PSPACE, but what matters here is just that it does. The point is to show that alignment approaches using a polynomial-time human can solve these problems, not how many resources they will use to do so.

Yep—that's right.

Maybe it's just grammar, but I read this sentence as saying that I trust the output of the polynomial-time strategy. And thus that you can solve PSPACE, NEXP, EXP and R problems in polynomial time. So I'm assuming that you mean trusting the model, which once again has no limits in terms of resources used.

I certainly don't mean that the model needs to be polynomial time. I edited the post to try to clarify this point.

I looked for that statement in the paper, failed to find it, then realized you probably meant that raw polynomial time verification gives you NP (the certificate version of NP, basically). Riffing on the importance of optimal play, Irving et al. show that the debate game is a game in the complexity theoretic sense, and thus that it is equivalent to TQBF, a PSPACE-complete problem. But when seeing a closed formula as a game, the decision problem of finding whether it's in TQBF amounts to showing the existence of a winning strategy for the first player. Debate solves this by assuming optimal play, and thus that the winning debater will have, find, and apply a winning strategy for the debate.

Yeah—that's my interpretation of the debate paper as well.

Comment by evhub on Alignment proposals and complexity classes · 2020-07-16T20:58:02.971Z · score: 2 (1 votes) · LW · GW

Yeah, that's right—for a probability distribution X, I mean for X(x) to be the value of the probability density function at x. I edited the post to clarify.

Comment by evhub on Arguments against myopic training · 2020-07-14T21:22:08.880Z · score: 12 (4 votes) · LW · GW

My comments below are partially copied from earlier comments I made on a draft of this post that Richard shared with me.

I think iterated amplification is an important research direction, but I don’t see what value there is in making the supervisor output approval values to train a myopic agent on, rather than rewards to train a nonmyopic agent on.

This is possible for approval-based amplification, though it's worth noting that I'm not sure if it actually makes sense for imitative amplification. When the loss is just the distance between the overseer's output and the model's output, you already have the full feedback signal, so there's no reason to use a reward.

“Myopic thinking” has never been particularly well-specified

Though still not super well-specified, my current thinking is that an agent is thinking myopically if their goal is a function of their output across some Cartesian boundary. See the section on “Goals across Cartesian boundaries” in this post.

But based on the arguments in this post I expect that, whatever the most reasonable interpretations of “approval-directed” or “myopic” cognition turn out to be, they could be developed in nonmyopic training regimes just as well as (or better than) in myopic training regimes.

What might this look like in practice? Consider the example of an agent trained myopically on the approval of HCH. To make this nonmyopic in a trivial sense, we merely need to convert that approval into a reward using the formula I gave above. However, after just the trivial change, myopic training will outperform nonmyopic training (because the latter requires the agent to do credit assignment across timesteps). To make it nonmyopic in an interesting and advantageous sense, HCH will need to notice when its earlier evaluations were suboptimal, and then assign additional rewards to correct for those errors.

This is definitely the point here that I care most about. I care a lot more about myopic cognition than myopic training procedures—as I see myopic cognition as a solution to deceptive alignment—and I do find it quite plausible that you could use a non-myopic training procedure to train a myopic agent.

However, it's worth noting that the procedure given here really looks a lot more like approval-based amplification rather than imitative amplification. And approval-based amplification doesn't necessarily limit to HCH, which makes me somewhat skeptical of it. Furthermore, by allowing the overseer to see the model's output in giving its feedback, the procedure given here breaks the analogy to counterfactual oracles which means that a model acting like a counterfactual oracle will no longer always be optimal—which is a real problem if the sort of myopic cognition that I want behaves like a counterfactual oracle (which I think it does).

For myopic agents to be competitive on long-term tasks, their objective function needs to be set by a supervisor which is able to accurately predict how well their actions fulfil long-term goals.

I think this is where I disagree with this argument. I think you can get myopic agents which are competitive on long-run tasks because they are trying to do something like “be as close to HCH as possible” which results in good long-run task performance without actually being specified in terms of the long-term consequences of the agent's actions.

Comment by evhub on AI safety via market making · 2020-07-14T06:10:22.604Z · score: 8 (4 votes) · LW · GW

Well, first you need to make sure your training procedure isn't introducing any incentives that would push you away from getting that sort of myopia. Myopic RL with an actually myopic training procedure like a policy gradient algorithm is a good start. But obviously that doesn't actually guarantee you get what I want—it just means that there aren't incentives pushing against it. To actually get any guarantees you'll need to add some additional constraint to the training procedure that actually incentivizes the sort of myopia that I want. Here I proposed using a combination of relaxed adversarial training and cross-examination with transparency tools, though obviously whether or not something like that would actually work is still pretty unknown.

Comment by evhub on AI safety via market making · 2020-07-14T03:25:21.865Z · score: 6 (3 votes) · LW · GW

If Adv instead reported a random coin with p% probability and reported nothing otherwise, and M was a best response to that, then at every timestep Adv would get non-zero expected reward, and so even myopically that is a better strategy for Adv (again under the assumption that M is a best response to Adv).

Ah—I see the issue here. I think that the version of myopia you're describing is insufficient for most of the applications where I think you might need myopia in an ML system. What I mean by myopia in this context is to take the action which is best according to the given myopic objective conditioned on the current state. Once Adv starts including acausal effects in its action selection (such as the impact of its current policy on M's past policy), I want to call that non-myopic. Notably, the reason for this isn't isolated to AI safety via market making—a myopic agent which is including acausal considerations can still be deceptive, whereas a fully causal myopic agent can't. Another way of putting this is that what I mean by myopia is specifically something like CDT with a myopic objective, whereas what you're thinking about is more like EDT or UDT with a myopic objective.

Comment by evhub on AI safety via market making · 2020-07-13T23:42:11.146Z · score: 4 (2 votes) · LW · GW

Only if all the arguments can be specified within the length of the transcript (leading back to my original point about this being like NP instead of PSPACE).

Not necessarily—Adv can make an argument like: “Since the modal prediction of M is no, you shouldn't trust that argument.”

Adv would learn to provide no information with some probability, in order to prevent convergence to the equilibrium where M immediately reports the correct answer which leads to Adv getting zero reward

That strategy is highly non-myopic. Certainly market making breaks if you get a non-myopic Adv like that, though as I note in the post, I think basically every current major prosaic AI safety proposal requires some level of myopia to not break (either per-step or per-episode).

Comment by evhub on AI safety via market making · 2020-07-10T18:40:42.293Z · score: 6 (3 votes) · LW · GW

Hmm, this seems to rely on having the human trust the outputs of M on questions that the human can't verify. It's not obvious to me that this is an assumption you can make without breaking the training process.

One possible way to train this is just to recurse on sub-questions some percentage of the time (potentially based on some active learning metric for how useful that recursion will be).
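As a rough sketch of what that might look like (the helper functions and the recursion probability are hypothetical stand-ins):

```python
import random

# Illustrative sketch: recurse on sub-questions some fraction of the time.
# `answer` and `decompose` are hypothetical, and recursion_prob could itself
# be set by an active learning metric for how useful recursing is expected
# to be on this question.

def train_step(question, answer, decompose, recursion_prob=0.3, rng=random):
    if rng.random() < recursion_prob:
        # Recurse: train on the sub-questions instead of the top-level one.
        return [answer(q) for q in decompose(question)]
    # Otherwise answer the question directly.
    return answer(question)
```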

It works in the particular case that you outlined because there is essentially a DAG of arguments -- every claim is broken down into "smaller" claims, that eventually reach a base case, and so everything eventually bottoms out in something the human can check.

Yes, though I believe that it should be possible (at least in theory) for H to ensure a DAG for any computable claim.

It might be that this is actually an unimportant problem, because in practice for every claim there are a huge number of ways to argue for the truth, and it's extraordinarily unlikely that all of them fail in the same way such that M would argue for the same wrong answer along all of these possible paths, and so eventually M would have to settle on the truth. I'm not sure, I'd be interested in empirical results here.

Agreed—I'd definitely be interested in results here as well.

It occurs to me that the same problem can happen with iterated amplification, though it doesn't seem to be a problem with debate.

Definitely the problem of requiring the human to decompose problems into actually smaller subproblems exists in amplification also. Without that requirement, HCH can have multiple fixed points rather than just the one, which could potentially give whatever mechanism ends up selecting which fixed point quite a lot of power over the final result.

Also, echoing my other comment below, I'm not sure if this is an equilibrium in the general case where Adv can make many kinds of arguments that H pays attention to. Maybe once this equilibrium has been reached, Adv starts saying things like "I randomly sampled 2 of the 200 numbers, and they were 20 and 30, and so we should expect the sum to be 25 * 100 = 2500". (But actually 20 and 30 were some of the largest numbers and weren't randomly sampled; the true sum is ~1000.) If this causes the human to deviate even slightly from the previous equilibrium, Adv is incentivized to do it. While we could hope to avoid this in math / arithmetic, it seems hard to avoid this sort of thing in general.

Note that even if the human is temporarily convinced by such an argument, as long as there is another argument which de-convinces them, then in the limit Adv won't be incentivized to produce that argument. And it seems likely that de-convincing arguments should exist there—for example, the argument that H should follow the strategy I outlined above if they want to ensure that they get the correct answer. Additionally, we might hope that this sort of “bad faith” argument can also be prevented via the cross-examination mechanism I describe above.

Comment by evhub on Inner alignment in the brain · 2020-07-10T04:25:26.347Z · score: 17 (5 votes) · LW · GW

A thought: it seems to me like the algorithm you're describing here is highly non-robust to relative scale, since if the neocortex became a lot stronger it could probably just find some way to deceive/trick/circumvent the subcortex to get more reward and/or avoid future updates. I think I'd be pretty worried about that failure case if anything like this algorithm were ever to be actually implemented in an AI.

Comment by evhub on AI safety via market making · 2020-07-09T21:32:40.153Z · score: 10 (3 votes) · LW · GW

That's a very good point. After thinking about this, however, I think market making actually does solve this problem, and I think it does so pretty cleanly. Specifically, I think market making can actually convince a judge of the sum of N integers in O(1) time as long as you allow the traders to exhibit market probabilities as part of their argument.

Consider the task of finding the sum of N integers and suppose that both M and Adv have access to all N integers, but that the human judge H can only sum two numbers at a time. Then, I claim that there exists a strategy that the judge can implement that, for an unexploitable market, will always produce the desired sum immediately (and thus in O(1) time).


H's strategy here is to only listen to arguments of the following two forms:

Argument type 1:

The sum of the single-element set {x} is x, because the sum of a single-element set is just the element of that set.

Argument type 2:

The sum of the set S is z, because the modal prediction of M for the sum of the first half of S is x, the modal prediction of M for the sum of the second half of S is y, and x + y = z.

Under that strategy, we'll prove that an unexploitable market will always give the right answer immediately by strong induction on the size of the set.

First, the base case. For any single-element set, only Argument type 1 exists. Thus, if M predicts anything other than the actual element, Adv can exploit that by implementing Argument type 1, and that is the only possible exploit available. Thus, M should always give the right answer immediately for single-element sets.

Second, the inductive step. Suppose by strong induction that M always gives the right answer immediately for all sets of size less than n. Now, for a set of size n, the only type of argument available is Argument type 2. However, since the first half and second half of the set are of size less than n, we know by induction that x and y must be correct sums. Thus, since H can check that x + y = z, the only exploit available to Adv is to showcase the correct z, and if M already showcases the correct z, then no exploit is possible. Thus, M should always give the correct sum immediately for n-element sets.
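The induction above can be simulated directly: under H's restriction to the two argument types, the only unexploitable modal prediction for “the sum of S” is the one computed by this recursion (the function name is just illustrative):

```python
def modal_prediction(s):
    # Base case: Argument type 1 pins down single-element sets.
    if len(s) == 1:
        return s[0]
    # Inductive step: Argument type 2 combines the modal predictions on the
    # two halves, which H can check with a single two-number addition.
    mid = len(s) // 2
    return modal_prediction(s[:mid]) + modal_prediction(s[mid:])

print(modal_prediction([3, 1, 4, 1, 5, 9, 2, 6]))  # → 31
```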

EDIT: Thinking about this more, I think my argument generalizes to allow AI safety via market making to access R, which seems pretty exciting given that the best debate could do previously was NEXP.

Comment by evhub on TurnTrout's shortform feed · 2020-07-07T23:36:08.711Z · score: 6 (3 votes) · LW · GW

A lot of examples of this sort of stuff show up in OpenAI Clarity's circuits analysis work. In fact, this is precisely their Universality hypothesis. See also my discussion here.

Comment by evhub on Spoiler-Free Review: Horizon Zero Dawn · 2020-07-07T19:18:08.001Z · score: 2 (1 votes) · LW · GW

Am I curious enough to look up what happens, once it’s clear I’m not going to be convinced to keep playing? Not sure yet. We’ll see.

I'd recommend you look it up if you're not going to finish it—the world has a kinda neat grey-goo-y sort of backstory.

Comment by evhub on Learning the prior · 2020-07-06T19:08:07.545Z · score: 6 (3 votes) · LW · GW

I agree with Daniel. Certainly training on actual iid samples from the deployment distribution helps a lot—as it ensures that your limiting behavior is correct—but in the finite data regime you can still find a deceptive model that defects some percentage of the time.
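As a back-of-the-envelope illustration of the finite-data point: a model that defects on a random p fraction of inputs slips through n iid training samples with probability (1 - p)^n, yet still defects at deployment with probability p.

```python
# Probability that a model defecting on a random p fraction of inputs is
# never caught across n iid training samples.
def prob_undetected(p, n):
    return (1 - p) ** n

# Defecting one time in ten million is usually never caught even with a
# million iid training samples.
print(round(prob_undetected(1e-7, 1_000_000), 3))  # → 0.905
```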

Comment by evhub on Conditions for Mesa-Optimization · 2020-07-02T19:26:22.000Z · score: 2 (1 votes) · LW · GW

This post is the second in a five-post sequence—you should definitely start with the introduction, and if you still feel confused by the terminology, there's also a glossary that might be helpful. You can also get a lot of the content in podcast form here if you'd prefer that.

Comment by evhub on Risks from Learned Optimization: Conclusion and Related Work · 2020-06-26T19:51:36.562Z · score: 4 (2 votes) · LW · GW

Alex Flint recently wrote up this attempt at defining optimization that I think is pretty good and probably worth taking a look at.

Comment by evhub on What are the high-level approaches to AI alignment? · 2020-06-16T22:00:11.464Z · score: 6 (5 votes) · LW · GW

You might be interested in this post I wrote recently that goes into significant detail on what I see as the major leading proposals for building safe advanced AI under the current machine learning paradigm.

Comment by evhub on Synthesizing amplification and debate · 2020-06-10T19:39:20.495Z · score: 4 (2 votes) · LW · GW

“Annealing” here simply means decaying over time (as in learning rate annealing), in this case decaying the influence of one of the losses to zero.
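For instance, a minimal sketch of annealing one loss term to zero (the linear schedule is just illustrative):

```python
def annealed_weight(step, total_steps):
    # Linearly decay the weight from 1 to 0 over training.
    return max(0.0, 1.0 - step / total_steps)

def combined_loss(loss_a, loss_b, step, total_steps):
    # loss_b's influence is annealed away; by the end only loss_a remains.
    return loss_a + annealed_weight(step, total_steps) * loss_b
```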

Comment by evhub on Relaxed adversarial training for inner alignment · 2020-06-09T22:13:46.602Z · score: 2 (1 votes) · LW · GW

Yep—that's one of the main concerns. The idea, though, is that all you have to deal with should be a standard overfitting problem, since you don't need the acceptability predicate to work once the model is deceptive, only beforehand. Thus, you should only have to worry about gradient descent overfitting to the acceptability signal, not the model actively trying to trick you—which I think is a solvable overfitting problem. Currently, my hope is that you can do that by using the acceptability signal to enforce an easy-to-verify condition that rules out deception, such as myopia.

Comment by evhub on An overview of 11 proposals for building safe advanced AI · 2020-06-05T23:27:35.468Z · score: 8 (4 votes) · LW · GW

Glad you liked the post! Hopefully it'll be helpful for your discussion, though unfortunately the timing doesn't really work out for me to be able to attend. However, I'd be happy to talk to you or any of the other attendees some other time—I can be reached at if you or any of the other attendees want to reach out and schedule a time to chat.

In terms of open problems, part of my rationale for writing up this post is that I feel like we as a community still haven't really explored the full space of possible prosaic AI alignment approaches. Thus, I feel like one of the most exciting open problems would be developing new approaches that could be added to this list (like this one, for example). Another open problem is improving our understanding of transparency and interpretability—one thing you might notice with all of these approaches is that they all require at least some degree of interpretability to enable inner alignment to work. I'd also be remiss not to mention that if you're interested in concrete ML experiments, I've previously written up a couple of different posts detailing experiments I'd be excited about.

Comment by evhub on An overview of 11 proposals for building safe advanced AI · 2020-06-04T19:44:48.443Z · score: 4 (2 votes) · LW · GW

Well, I don't think we really know the answer to that question right now. My hope is that myopia will turn out to be a pretty easy-to-verify property—certainly my guess is that it'll be easier to verify than non-deception. Until we get better transparency tools, a better understanding of what algorithms our models are actually implementing, and better definitions of myopia that make sense in that context, however, we won't really know how easy verifying it will be. Maybe it can be done mechanically, maybe it'll require a human—we still really just don't know.

Comment by evhub on Inaccessible information · 2020-06-03T23:49:59.563Z · score: 5 (3 votes) · LW · GW

That makes sense, though I feel like under that definition having things that you care about be accessible via transfer wouldn't actually help you that much unless you know that the model transfers correctly there, since otherwise you'd have no reason to trust the transfer (even if it's actually correct). Unless you have some reason to believe otherwise (e.g. some strong robustness guarantee)—it seems to me like in most cases you have to assume that all the information you get via transfer is suspect, which makes even the correct transfer inaccessible in some sense since you can't distinguish it from the incorrect transfer.

I guess if lots of actors just try to use transfer anyway and hope, then those actors with values that actually are accessible via transfer will be advantaged, though unless you have a particular reason to suspect that your values will be more accessible than average (though I guess the point is that our values are less likely to be accessible via transfer than most AI's values), it seems like in most cases you wouldn't want to pursue that strategy unless you had no other option.

Comment by evhub on An overview of 11 proposals for building safe advanced AI · 2020-06-03T19:57:14.452Z · score: 7 (3 votes) · LW · GW

Glad you enjoyed the post!

so, do we have guarantees or not? Because the first sentence says there are, and the second says the model could end up stronger than the ones it imitates.

The first sentence says that you have a guarantee that the overseer is at least as strong as the target while the second sentence notes that the model might be stronger (or weaker) than the target. So we know overseer > target, but we don't know target > model, so we can't conclude overseer > model.

About 3, isn't there a risk that M_{n+1} behaves such that it simplifies or remove the checks of Amp(M_{n+1})? One way to deal with that would be to make humans do the adversarial attacks, but that will probably hurt training competitiveness.

There's still a human in the loop, since Amp(M_{n+1}) is just M_{n+1} consulting H—and you should still be using a target model to do the oversight. But the real thing you're relying on here to prevent M_{n+1} from causing the oversight to fail in the future is myopia verification, as a myopic M_{n+1} should never pursue that strategy.

I think I get the intuition, but evaluation is far less rich a signal that production of behavior: you have a score or a binary yes/no for the former, and the full behavior for the latter. What I believe you meant is that using evaluation instead of production makes the method applicable to far more problems, but I might be wrong.

I think there are lots of cases where evaluation is richer than imitation—compare RL to behavior cloning, for example.

Finally, for the 8, do you have examples of when this behaves differently from 3? Because it seems to me that in the limit, imitation will produce the same behavior than extraction of the reward function and maximization of the reward. Maybe something about generalization changes?

Certainly they can behave differently outside of the limit. But even in the limit, when you do imitation, you try to mimic both what the human values and how the human pursues those values—whereas when you do reward learning followed by reward maximization, you try to mimic the values but not the strategy the human uses to pursue them. Thus, a model trained to maximize a learned reward might in the limit take actions to maximize that reward that the original human never would—perhaps because the human would never have thought of such actions, for example.

Comment by evhub on Inaccessible information · 2020-06-03T07:52:45.736Z · score: 4 (2 votes) · LW · GW

Maybe I'm missing something, but I don't understand why you're considering the output of the simplest model that provides some checkable information to be accessible. It seems to me like that simplest model could very well be implementing a policy like BAD that would cause its output on the uncheckable information to be false or otherwise misleading. Thus, it seems to me like all of the problems you talk about in terms of the difficulty of getting access to inaccessible information also apply to the uncheckable information accessible via transfer.

Comment by evhub on An overview of 11 proposals for building safe advanced AI · 2020-06-01T19:03:51.252Z · score: 6 (3 votes) · LW · GW

Yep—at least that's how I'm generally thinking about it in this post.