## Posts

Impact measurement and value-neutrality verification 2019-10-15T00:06:51.879Z · score: 35 (10 votes)
Towards an empirical investigation of inner alignment 2019-09-23T20:43:59.070Z · score: 43 (11 votes)
Relaxed adversarial training for inner alignment 2019-09-10T23:03:07.746Z · score: 39 (9 votes)
Are minimal circuits deceptive? 2019-09-07T18:11:30.058Z · score: 41 (10 votes)
Concrete experiments in inner alignment 2019-09-06T22:16:16.250Z · score: 63 (20 votes)
Towards a mechanistic understanding of corrigibility 2019-08-22T23:20:57.134Z · score: 36 (10 votes)
Risks from Learned Optimization: Conclusion and Related Work 2019-06-07T19:53:51.660Z · score: 62 (17 votes)
Deceptive Alignment 2019-06-05T20:16:28.651Z · score: 60 (15 votes)
The Inner Alignment Problem 2019-06-04T01:20:35.538Z · score: 66 (16 votes)
Conditions for Mesa-Optimization 2019-06-01T20:52:19.461Z · score: 56 (18 votes)
Risks from Learned Optimization: Introduction 2019-05-31T23:44:53.703Z · score: 113 (34 votes)
A Concrete Proposal for Adversarial IDA 2019-03-26T19:50:34.869Z · score: 16 (5 votes)
Nuances with ascription universality 2019-02-12T23:38:24.731Z · score: 22 (6 votes)
Dependent Type Theory and Zero-Shot Reasoning 2018-07-11T01:16:45.557Z · score: 18 (11 votes)

Comment by evhub on The Dualist Predict-O-Matic ($100 prize) · 2019-10-20T20:35:34.434Z · score: 1 (1 votes) · LW · GW I don't think we do agree, in that I think pressure towards simple models implies that they won't be dualist in the way that you're claiming. Comment by evhub on The Dualist Predict-O-Matic ($100 prize) · 2019-10-20T01:49:30.774Z · score: 1 (1 votes) · LW · GW

I think maybe what you're getting at is that if we try to get a machine learning model to predict its own predictions (i.e. we give it a bunch of data which consists of labels that it made itself), it will do this very easily. Agreed. But that doesn't imply it's aware of "itself" as an entity.

No, but it does imply that it has the information about its own prediction process encoded in its weights such that there's no reason it would have to encode that information twice by also re-encoding it as part of its knowledge of the world as well.

Furthermore, suppose that we take the weights for a particular model, mask some of those weights out, use them as the labels y, and try to predict them using the other weights in that layer as features x. The model will perform terribly on this because it's not the task that it was trained for. It doesn't magically have the "self-awareness" necessary to see what's going on.

Sure, but that's not actually the relevant task here. It may not understand its own weights, but it does understand its own predictive process, and thus its own output, such that there's no reason it would encode that information again in its world model.

Comment by evhub on Relaxed adversarial training for inner alignment · 2019-10-19T21:16:37.487Z · score: 1 (1 votes) · LW · GW

The point about decompositions is a pretty minor portion of this post; is there a reason you think that part is more worthwhile to focus on for the newsletter?

Comment by evhub on The Dualist Predict-O-Matic ($100 prize) · 2019-10-19T21:13:12.319Z · score: 1 (1 votes) · LW · GW that suggested to me that there were 2 instances of this info about Predict-O-Matic's decision-making process in the dataset whose description length we're trying to minimize. "De-duplication" only makes sense if there's more than one. Why is there more than one? ML doesn't minimize the description length of the dataset—I'm not even sure what that might mean—rather, it minimizes the description length of the model. And the model does contain two copies of information about Predict-O-Matic's decision-making process—one in its prediction process and one in its world model. The prediction machinery is in code, but this code isn't part of the info whose description length is attempting to be minimized, unless we take special action to include it in that info. That's the point I was trying to make previously. Modern predictive models don't have some separate hard-coded piece that does prediction—instead you just train everything. If you consider GPT-2, for example, it's just a bunch of transformers hooked together. The only information that isn't included in the description length of the model is what transformers are, but "what's a transformer" is quite different than "how do I make predictions." All of the information about how the model actually makes its predictions in that sort of a setup is going to be trained. Comment by evhub on The Dualist Predict-O-Matic ($100 prize) · 2019-10-18T07:52:38.920Z · score: 3 (2 votes) · LW · GW

Most of the time, when I train a machine learning model on some data, that data isn't data about the ML training algorithm or model itself.

If the data isn't at all about the ML training algorithm, then why would it even build a model of itself in the first place, regardless of whether it was dualist or not?

A machine learning model doesn't get understanding of or data about its code "for free", in the same way we don't get knowledge of how brains work "for free" despite the fact that we are brains.

We might not have good models of brains, but we do have very good models of ourselves, which is the actual analogy here. You don't have to have a good model of your brain to have a good model of yourself, and to identify that model of yourself with your own actions (i.e. the thing you called an "ego").

Part of what I'm trying to indicate with the "dualist" term is that this Predict-O-Matic is the same way, i.e. its position with respect to itself is similar to the position of an aspiring neuroscientist with respect to their own brain.

Also, if you think that, then I'm confused why you think this is a good safety property; human neuroscientists are precisely the sort of highly agentic misaligned mesa-optimizers that you presumably want to avoid when you just want to build a good prediction machine.

--

I think I didn't fully convey my picture here, so let me try to explain how I think this could happen. Suppose you're training a predictor and the data includes enough information about itself that it has to form some model of itself. Once that's happened--or while it's in the process of happening--there is a massive duplication of information between the part of the model that encodes its prediction machinery and the part that encodes its model of itself. A much simpler model would be one that just uses the same machinery for both, and since ML is biased towards simple models, you should expect it to be shared--which is precisely the thing you were calling an "ego."

Comment by evhub on The Dualist Predict-O-Matic (100 prize) · 2019-10-17T21:37:50.302Z · score: 7 (2 votes) · LW · GW If dualism holds for Abram's prediction AI, the "Predict-O-Matic", its world model may happen to include this thing called the Predict-O-Matic which seems to make accurate predictions -- but it's not special in any way and isn't being modeled any differently than anything else in the world. Again, I think this is a pretty reasonable guess for the Predict-O-Matic's default behavior. I suspect other behavior would require special code which attempts to pinpoint the Predict-O-Matic in its own world model and give it special treatment (an "ego"). I don't think this is right. In particular, I think we should expect ML to be biased towards simple functions such that if there's a simple and obvious compression, then you should expect ML to take it. In particular, having an "ego" which identifies itself with its model of itself significantly reduces description length by not having to duplicate a bunch of information about its own decision-making process. Comment by evhub on Impact measurement and value-neutrality verification · 2019-10-16T20:30:01.719Z · score: 1 (1 votes) · LW · GW 1. That summary might be useful as a TL:DR on that post, unless the description was only referencing what aspects of it are important for (the ideas you are advancing in) this post. The idea of splitting up a model into a value-neutral piece and a value-laden piece was only one of a large number of things I talked about in "Relaxed adversarial training." 1. It seems like those would be hard to disentangle because it seems like a value piece only cares about the things that it values, and thus, its "value neutral piece" might be incomplete for other values - though this might depend on what you mean by "optimization procedure". It's definitely the case that some optimization procedures work better for some values than for others, though I don't think it's that bad. What I mean by optimization process here is something like a concrete implementation of some decision procedure. Something like "predict what action will produce the largest value given some world model and take that action," for example. The trick is just to as much as you can avoid cases where your optimization procedure systematically favors some values over others. For example, you don't want it to be the case that your optimization procedure only works for very easy-to-specify values, but not for other things that we might care about. Or, alternatively, you don't want your optimization process to be something like "train an RL agent and then use that" that might produce actions that privilege simple proxies rather than what you really want (the "forwarding the guarantee" problem). Comment by evhub on Gradient hacking · 2019-10-16T18:16:16.841Z · score: 8 (4 votes) · LW · GW Sure. My best guess for how we will train AGI is via giving it access to huge amounts of data—e.g. every outbound Reddit link as in GPT-2. Given that sort of a massive, rich dataset, I think that for a model to determine the particular training procedure that it's undergoing won't be very difficult. If it just reads the Wikipedia pages on ML, gradient descent, etc. and some arXiv papers on transparency techniques, for example, then that should give it everything it needs to know. Comment by evhub on Gradient hacking · 2019-10-16T18:08:16.406Z · score: 3 (2 votes) · LW · GW Unless this problem is resolved, I don't see how any AI alignment approach that involves using future ML—that looks like contemporary ML but at an arbitrarily large scale—could be safe. I think that's a bit extreme, or at least misplaced. Gradient-hacking is just something that makes catching deceptive alignment more difficult. Deceptive alignment is the real problem: if you can prevent deceptive alignment, then you can prevent gradient hacking. And I don't think it's impossible to catch deceptive alignment in something that looks similar to contemporary ML—or at least if it is impossible then I don't think that's clear yet. I mentioned some of the ML transparency approaches I'm excited about in this post, though really for a full treatment of that problem see "Relaxed adversarial training for inner alignment." Comment by evhub on Impact measurement and value-neutrality verification · 2019-10-15T20:50:58.696Z · score: 5 (3 votes) · LW · GW You're right, I think the absolute value might actually be a problem—you want the policy to help/hurt all values relative to no-op equally, not hurt some and help others. I just edited the post to reflect that. As for the connection between neutrality and objective impact, I think this is related to a confusion that Wei Dai pointed out, which is that I was sort of waffling between two different notions of strategy-stealing, those being: 1. strategy-stealing relative to all the agents present in the world (i.e. is it possible for your AI to steal the strategies of other agents in the world) and 2. strategy-stealing relative to a single AI (i.e. if that AI were copied many times and put in service of many different values, would it advantage some over others). If you believe that most early AGIs will be quite similar in their alignment properties (as I generally do, since I believe that copy-and-paste is quite powerful and will generally be preferred over designing something new), then these two notions of strategy-stealing match up, which was why I was waffling between them. However, conceptually they are quite distinct. In terms of the connection between neutrality and objective impact, I think there I was thinking about strategy-stealing in terms of notion 1, whereas for most of the rest of the post I was thinking about it in terms of notion 2. In terms of notion 1, objective impact is about changing the distribution of resources among all the agents in the world. Comment by evhub on Impact measurement and value-neutrality verification · 2019-10-15T20:43:25.786Z · score: 3 (2 votes) · LW · GW Note that the model's output isn't what's relevant for the neutrality measure; it's the algorithm it's internally implementing. That being said, this sort of trickery is still possible if your model is non-myopic, which is why it's important to have some sort of myopia guarantee. Comment by evhub on AI Alignment Writing Day Roundup #2 · 2019-10-08T22:26:11.119Z · score: 3 (2 votes) · LW · GW Paul's post offers two conditions about the ease of training an acceptable model (in particular, that it should not stop the agent achieving a high average reward and that is shouldn't make hard problems much harder), but Evan's conditions are about the ease of choosing an acceptable action. This is reversed. Paul's conditions were about the ease of choosing an acceptable action; my conditions are about the ease of training an acceptable model. Comment by evhub on Towards a mechanistic understanding of corrigibility · 2019-09-29T23:31:28.048Z · score: 3 (2 votes) · LW · GW Part of the point that I was trying to make in this post is that I'm somewhat dissatisfied with many of the existing definitions and treatments of corrigibility, as I feel like they don't give enough of a basis for actually verifying them. So I can't really give you a definition of act-based corrigibility that I'd be happy with, as I don't think there currently exists such a definition. That being said, I think there is something real in the act-based corrigibility cluster, which (as I describe in the post) I think looks something like corrigible alignment in terms of having some pointer to what the human wants (not in a perfectly reflective way, but just in terms of actually trying to help the human) combined with some sort of pre-prior creating an incentive to improve that pointer. Comment by evhub on Partial Agency · 2019-09-28T17:55:32.419Z · score: 13 (3 votes) · LW · GW I agree that this is possible, but I would be very surprised if a mesa-optimizer actually did something like this. By default, I expect mesa-optimizers to use proxy objectives that are simple, fast, and easy to specify in terms of their input data (e.g. pain) not those that require extremely complex world models to even be able to specify (e.g. spread of DNA). In the context of supervised learning, having an objective that explicitly cares about the value of RAM that stores its loss seems very similar to explicitly caring about the spread of DNA in that it requires a complex model of the computer the mesa-optimizer is running and is quite complex and difficult to reason about. This is why I'm not very worried about reward-tampering: I think proxy-aligned mesa-optimizers basically never tamper with their rewards (though deceptively aligned mesa-optimizers might, but that's a separate problem). Comment by evhub on Partial Agency · 2019-09-28T05:38:20.679Z · score: 16 (4 votes) · LW · GW I really like this post. I'm very excited about understanding more about this as I said in my mechanistic corrigibility post (which as you mention is very related to the full/partial agency distinction). we can kind of expect any type of learning to be myopic to some extent I'm pretty uncertain about this. Certainly to the extent that full agency is impossible (due to computational/informational constraints, for example), I agree with this. But I think a critical point which is missing here is that full agency can still exhibit pseudo-myopic behavior (and thus get selected for) if using an objective that is discounted over time or if deceptive. Thus, I don't think that having some sort of soft episode boundary is enough to rule out full-ish agency. Furthermore, it seems to me like it's quite plausible that for many learning setups models implementing algorithms closer to full agency will be simpler than models implementing algorithms closer to partial agency. As you note, partial agency is a pretty weird thing to do from a mathematical standpoint, so it seems like many learning processes might penalize it pretty heavily for that. At the very least, if you count Solomonoff Induction as a learning process, it seems like you should probably expect something a lot closer to full agency there. That being said, I definitely agree that the fact that epistemic learning seems to just do this by default seems pretty promising for figuring out how to get myopia, so I'm definitely pretty excited about that. RL tends to require temporal discounting -- this also creates a soft episode boundary, because things far enough in the future matter so little that they can be thought of as "a different episode". This is just a side note, but RL also tends to have hard episode boundaries if you are regularly resetting the state of the environment as is common in many RL setups. Comment by evhub on Concrete experiments in inner alignment · 2019-09-12T16:32:46.229Z · score: 1 (1 votes) · LW · GW When I use the term "RL agent," I always mean an agent trained via RL. The other usage just seems confused to me in that it seems to be assuming that if you use RL you'll get an agent which is "trying" to maximize its reward, which is not necessarily the case. "Reward-maximizer" seems like a much better term to describe that situation. Comment by evhub on Relaxed adversarial training for inner alignment · 2019-09-11T18:28:52.140Z · score: 1 (1 votes) · LW · GW Good catch! Also, I generally think of pseudo-inputs as predicates, not particular inputs or sets of inputs (though of course a predicate defines a set of inputs). And as for the reason for the split, see the first section in "Other approaches" (the basic idea is that the split lets us have an adversary, which could be useful for a bunch of reasons). Comment by evhub on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-10T22:51:28.463Z · score: 4 (2 votes) · LW · GW Thinking about this more, this doesn't actually seem very likely for OGD since there are likely to be model parameters controlling how farsighted the agent is (e.g., its discount rate or planning horizon) so it seems like non-myopic agents are not local optima and OGD would keep going downhill (to more and more myopic agents) until it gets to a fully myopic agent. Does this seem right to you? I don't think that's quite right. At least if you look at current RL, it relies on the existence of a strict episode boundary past which the agent isn't supposed to optimize at all. The discount factor is only per-step within an episode; there isn't any between-episode discount factor. Thus, if you think that simple agents are likely to care about things beyond just the episode that they're given, then you get non-myopia. In particular, if you put an agent in an environment with a messy episode boundary (e.g. it's in the real world such that its actions in one episode have the ability to influence its actions in future episodes), I think the natural generalization for an agent in that situation is to keep using something like its discount factor past the artificial episode boundary created by the training process, which gives you non-myopia. Comment by evhub on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-10T17:05:41.525Z · score: 5 (3 votes) · LW · GW I call this problem "non-myopia," which I think interestingly has both an outer alignment component and an inner alignment component: 1. If you train using something like population-based training that explicitly incentivizes cross-episode performance, then the resulting non-myopia was an outer alignment failure. 2. Alternatively, if you train using standard RL/SL/etc. without any PBT, but still get non-myopia, then that's an inner alignment failure. And I find this failure mode quite plausible: even if your training process isn't explicitly incentivizing non-myopia, it might be that non-myopic agents are simpler/more natural/easier to find/etc. such that your inductive biases still incentivize them. Comment by evhub on Concrete experiments in inner alignment · 2019-09-10T16:54:28.169Z · score: 3 (2 votes) · LW · GW The distinction I'm most interested in here is the distinction drawn in "Risks from Learned Optimization" between internalization and modeling. As described in the paper, however, there are really three cases: 1. In the internalization case, the base optimizer (e.g. gradient descent) modifies the mesa-optimizer's parameters to explicitly specify its goal. 2. In the corrigible modeling case, the mesa-optimizer builds a world model containing concepts like "what humans value" and then the base optimizer modifies the mesa-optimizer's objective to "point to" whatever is found by the world model. 3. In the deceptive modeling case, the mesa-optimizer builds a world model containing concepts like "what humans value" and then the base optimizer modifies the mesa-optimizer to instrumentally optimize that for the purpose of staying around. Which one of these situations is most likely seems highly dependent on to what extent goals by default get encoded explicitly or via reference to world models. This is a hard thing to formalize in such a way that we can make any progress on testing it now, but my "Reward side-channels" proposal at least tries to by drawing the following distinction: if I give the model its reward as part of its observations, train it to optimize that, but then change it at test time, what happens? If it's doing pure internalization, it should either fail to optimize for either reward very well or succeed at optimizing the old reward but not the new one. If it's doing pure modeling, however, then it should fail to optimize the old reward but succeed at optimizing the new reward, which is the interesting case that you'd be trying to see under what circumstances you could actually get to appear. Comment by evhub on Concrete experiments in inner alignment · 2019-09-09T18:56:11.603Z · score: 6 (3 votes) · LW · GW I haven't used it myself, but it seems like a reasonably good platform if you want to test reward tampering stuff. DeepMind actually did a similar thing recently based on the game "Baba is You." That being said, most of the experiments here aren't really about reward tampering, so they don't really need the embeddedness you get from AIXIjs or Baba is You (and I'm not that excited about reward tampering research in general). Comment by evhub on Are minimal circuits deceptive? · 2019-09-09T18:50:59.600Z · score: 3 (2 votes) · LW · GW I agree that the proof here can be made significantly more general—and I agree that exploring that definitely seems worthwhile—though I also think it's worth pointing out that the proof rests on assumptions that I would be a lot less confident would hold in other situations. The point of explaining the detail regarding search algorithms here is that it gives a plausible story for why the assumptions made regarding and should actually hold. Comment by evhub on Are minimal circuits deceptive? · 2019-09-09T18:44:02.675Z · score: 3 (2 votes) · LW · GW Regarding Ortega et al., I agree that the proof presented in the paper is just about how a single generator can be equivalent to sequences of multiple generators. The point that the authors are using that proof to make, however, is somewhat more broad, which is that your model can learn a learning algorithm even when the task you give it isn't explicitly a meta-learning task. Since a learning algorithm is a type of search/optimization algorithm, however, if you recast that conclusion into the language of Risks from Learned Optimization, you get exactly the concern regarding mesa-optimization, which is that models can learn optimization algorithms even when you don't intend them to. Comment by evhub on Concrete experiments in inner alignment · 2019-09-07T18:09:04.208Z · score: 2 (2 votes) · LW · GW I agree that you could interpret that as more of an outer alignment problem, though either way I think it's definitely an important safety concern. Comment by evhub on Utility ≠ Reward · 2019-09-06T21:45:09.531Z · score: 8 (5 votes) · LW · GW I've actually been thinking about the exact same thing recently! I have a post coming up soon about some of the sorts of concrete experiments I would be excited about re inner alignment that includes an entry on what happens when you give an RL agent access to its reward as part of its observation. (Edit: I figured I would just publish the post now so you can take a look at it. You can find it here.) Comment by evhub on Utility ≠ Reward · 2019-09-05T21:04:06.573Z · score: 22 (9 votes) · LW · GW arguably outer and inner optimizer were better choices given what is being described "Inner optimizer" pretty consistently led to readers believing we were talking about the possibility of multiple emergent subsystems acting as optimizers rather than what we actually wanted to talk about, which was thinking of the whole learned algorithm as a single optimizer. Mesa-optimizer, on the hand, hasn't led to this confusion nearly as much. I also think that, if you're willing to just accept mesa as the opposite of meta, then mesa really does fit the concept—see this comment for an explanation of why I think so. That being said, I agree that the actual justification for why mesa should be the word that means the opposite of meta is somewhat sketchy, but if you just treat it as an English neologism, then I think it's mostly fine. Comment by evhub on Thoughts on reward engineering · 2019-08-30T06:38:32.700Z · score: 5 (2 votes) · LW · GW Why do you think it's fine to assume this? (Do you think this is unlikely to change in the future, or just that this is the limit of the scope of the problem that you're working on?) I think I would be fairly surprised if future ML techniques weren't smooth in this way, so I think it's a pretty reasonable assumption. This seems to be assuming High Bandwidth Overseer. What about LBO? A low-bandwidth overseer seems unlikely to be competitive to me. Though it'd be nice if it worked, I think you'll probably want to solve the problem of weird hacky inputs via something like filtered-HCH instead. That being said, I expect the human to drop out of the process fairly quickly—it's mostly only useful in the beginning before the model learns how to do decompositions properly—at some point you'll want to switch to implementing as consulting rather than consulting . This looks like a big disconnect between us. The thing that touched off this discussion was Ought's switch from Factored Cognition to Factored Evaluation, and Rohin's explanation: "In iterated amplification (AN #30), when decomposing tasks in the Factored Cognition sense, you would use imitation learning during the distillation step, whereas with Factored Evaluation, you would use reinforcement learning to optimize the evaluation signal." I think if we're using SL for the question answering part and only using RL for "oversight" ("trying to get M to be transparent and to verify that it is in fact doing the right thing") then I'm a lot more optimistic since we only have to worry about security / reward gaming problems in the latter part, and we can do things like making it a constraint and doing quatilization without worrying about competitiveness. But I think Paul and Ought's plan is to use RL for both. In that case it doesn't help much to make "oversight" a constraint since the security / reward gaming problem in the "answer evaluation" part would still be there. And the problem just generally seems a lot harder because there could be so many different kinds of flaws in the "answer evaluation" part that could be exploited. I mostly agree with this and it is a disagreement I have with Paul in that I am more skeptical of relaxing the supervised setting. That being said, you definitely can still make oversight a constraint even if you're optimizing an RL signal and I do think it helps, since it gives you a way to separately verify that the system is actually being transparent. The idea in this sort of a setting would be that, if your system achieves high performance on the RL signal, then it must be outputting answers which the amplified human likes—but then the concern is that it might be tricking the human or something. But then if you can use oversight to look inside the model and verify that it's actually being transparent, then you can rule out that possibility. By making the transparency part a constraint rather than an objective, it might help prevent the model from gaming the transparency part, which I expect to be the most important part and in turn could help you detect if there was any gaming going on of the RL signal. For the record, though, I don't currently think that making the transparency part a constraint is a good idea. First, because I expect transparency to be hard enough that you'll want to be able to benefit from having a strong gradient towards it. And second, because I don't think it actually helps prevent gaming very much: even if your training process doesn't explicitly incentivize gaming, I expect that by default many mesa-optimizers will have objectives that benefit from it. Thus, what you really want is a general solution for preventing your mesa-optimizer from ever doing anything like that, which I expect to be something like corrigibility or myopia, rather than just trying to rely on your training process not incentivizing it. Comment by evhub on Thoughts on reward engineering · 2019-08-30T03:07:03.377Z · score: 5 (2 votes) · LW · GW This part confuses me because intuitively it seems like if is larger that would make the problem easier, but in your proposal would need to inspect a longer sequence of models. You can avoid having to inspect the whole sequence by doing binary search to find the last model for which oversight works. Also, roughly how many calls to do you usually imagine as making? What does "look for any attacks" actually consist of here? How can you be sure that whatever is doing will catch (or at least become suspicious/uncertain about) all possible attacks (and not have so many false positives that competitiveness is significantly impacted)? Well, that's the hard part; this is still a very open question. I'm currently working on a large post summarizing my thoughts on this question, though, so rather than try to start laying out my thoughts here I think I'll just defer until that's out, which should be fairly soon. Ruling out one class of attacks doesn't seem to help much… To make a strong argument it seems like you'd need to exhaustively list all possible classes of attack and show how the overseer can catch all of them. Yep. I'm optimistic that this will be possible if we can get a better understanding of what transparent optimization looks like, however, since then can pursue the strategy of only being certain of the acceptability of models it can verify are in fact being transparent to it. The training process could be dumb but have access to a lot more compute and end up optimizing harder than the model would. It would take a lot longer to develop each attack but eventually find all possible attacks given enough compute. True, but the nice thing about that is that it should be slow and continuous such that if you're doing oversight during training according to a scheme like the one I described then you should be able to catch it. Future advances in ML/RL could make the training process smarter and the whole point of IDA is to take advantage of such advances to keep it competitive, so you can't assume that the training process is dumb. Well, maybe dumb was the wrong word there. I agree that it's bad to assume that the training process is dumb, but I think it's fine to assume that it'll be slow and continuous such that there's always some previous model such that an amplified version of it can oversee the current model. At the lower levels of IDA both the model and the overseer are pretty dumb, so even a moderately smart training process could be smarter. Yeah, this is where you're relying on the human more. For early models, I think you're mostly relying on the human having access to good enough interpretability tools that they can understand simple models without help. Not sure what you have in mind here. If oversight is a constraint, what is providing the reward? Ah, this might be a big difference in the way we're thinking about this problem. I see informed oversight as an inner alignment technique which you can do if you are training your model via the outer alignment technique of amplification. Specifically: Let be the model and be your amplification operation, which in the pure supervised case just means letting a human answer the question given access to the model. Define to be some procedure for training to minimize Then, let and do informed oversight by letting inspect as above. In this picture, there are two distinct training signals: the supervised training signal through which is telling to approximate , and the oversight training signal, which is trying to get to be transparent and to verify that it is in fact doing the right thing. (In practice, you'll probably want to incorporate oversight into the loss function so you get a better gradient at each step, though that's difficult because oversight is likely to be very expensive. However, you could do something like train an approximation to oversight that you use in your loss function in-between actual oversight calls. Regardless, I think this is mostly an implementation detail and generally focus on the above picture.) Seems like with a quantilizer you'd take a competitiveness hit because the agent would no longer find the most optimal plans (that a comparable RL agent could find). How conservative do we need to be here, to ensure safety? How to analyze the competitiveness tradeoff, to make sure the competitiveness hit isn't too big? I think this is related to what I just said above. The way I was thinking about this was that you could train your model to maximize its predictive accuracy of 's behavior subject to the constraint that it always be transparently acceptable to during oversight. Thus, competitiveness is mostly going to be a function of 1) how competitive HCH is, since that's the thing you're approximating, and 2) how much overhead oversight adds. Comment by evhub on Thoughts on reward engineering · 2019-08-29T21:06:54.515Z · score: 8 (3 votes) · LW · GW This is definitely a real concern, and one that I do spend time thinking about. Here are some thoughts: First, if you want to understand what a model is doing, it's definitely going to be necessary to have access to the history/provenance of how/why each different component of the model came to be during the training process. This shouldn't be that difficult in an informed oversight setting if you do something like the following, though. Imagine some sequence of and suppose the minimum number of steps for an attack of this form to be introduced is . Then, at step we can implement oversight by calling to inspect the whole sequence and look for any attacks being developed and then retrain from right before that point in such a way that fixes them. Second, however, it does still have to be the case that the overseer isn't too exploitable. If there is some strategy available to that causes to strongly believe that is acceptable that is easy to reach without going through anything that looks unacceptable, then this won't work. However, there are reasons to believe that this condition might not be that hard to obtain. First, it's fine if is just uncertain about whether is acceptable; we can just retrain the model to be more transparent in that case and be alright. Second, many of the natural sorts of attacks in this class are of the form where they exploit some knowledge that has but that its overseer doesn't, but in the informed oversight setup, should know everything that knows, preventing those sorts of attacks from working. Third, I expect the training process to be quite a lot dumber than the model such that I expect the most concerning attacks on the overseer to come from the model and not the training process. It is conceivable, however, that the model could "hack" its own gradient so as to get the training process to modify it in a particular way, making an attack from the model look like it was coming from the training process. In this case, I would hope to prevent attacks of this form by ensuring that the model is properly myopic such that it isn't considering plans of this form. Finally, if you're concerned about attacks of this form, you can modify your training process to make them less likely. In particular, in the standard setup that I generally think about, oversight is included as part of the model's loss function such that the model is incentivized to make itself transparent to the overseer. However, there are many different ways you can do this. Rather than have oversight as a penalty term, for example, you could have it as a constraint, which could help avoid Goodharting on the overseer (this is similar in spirit to your suggestion of using a quantilizer). Alternatively, you could not include oversight as part of the loss function at all and instead rely on the overseer manually fixing unacceptable models or retraining until an acceptable model is found, though I expect those sorts of approaches to not be very competitive. Comment by evhub on Six AI Risk/Strategy Ideas · 2019-08-28T19:07:24.255Z · score: 4 (3 votes) · LW · GW Another argument that definitely doesn't rely on any sort of "doors" for why physical risk might be preferable to logical risk is just if you have diminishing returns on the total number of happy humans. As long as your returns to happy humans are sublinear (logarithmic is a standard approximation, though anything sublinear works), then you should prefer a guaranteed shot at the Everett branches having lots of happy humans to a chance of all the Everett branches having happy humans. To see this, suppose measures your returns to the total number of happy humans across all Everett branches. Let be the total number of happy humans in a good Everett branch and the total number of Everett branches. Then, in the physical risk situation, you get whereas, in the logical risk situation, you get which are only equal if is linear. Personally, I think my returns are sublinear, since I pretty strongly want there to at least be some humans—more strongly than I want there to be more humans, though I want that as well. Furthermore, if you believe there's a chance that the universe is infinite, then you should probably be using some sort of measure over happy humans rather than just counting the number, and my best guess for what such a measure might look like seems to be at least somewhat locally sublinear. Comment by evhub on Towards a mechanistic understanding of corrigibility · 2019-08-23T01:44:47.883Z · score: 11 (3 votes) · LW · GW I think that we have different pictures of what outer alignment scheme we're considering. In the context of something like value learning, myopia would be a big capabilities hit, and what you're suggesting might be better. In the context of amplification, however, myopia actually helps capabilities. For example, consider a pure supervised amplification model—i.e. I train the model to approximate a human consulting the model. In that case, a non-myopic model will try to produce outputs which make the human easier to predict in the future, which might not look very competent (e.g. output a blank string so the model only has to predict the human rather than predicting itself as well). On the other hand, if the model is properly myopic such that it is actually just trying to match the human as closely as possible, then you actually get an approximation of HCH, which is likely to be a lot more capable. That being said, unless you have a myopia guarantee like the one above, a competitive model might be deceptively myopic rather than actually myopic. Comment by evhub on Classifying specification problems as variants of Goodhart's Law · 2019-08-22T21:36:56.564Z · score: 25 (7 votes) · LW · GW I'm really glad this exists. I think having universal frameworks like this that connect the different ways of thinking across the whole field of AI safety are really helpful for allowing people to connect their research to each other and make the field more unified and cohesive as a whole. Also, you didn't mention it explicitly in your post, but I think it's worth pointing out how this breakdown maps onto the outer alignment vs. inner alignment distinction drawn in "Risks from Learned Optimization." Specifically, I see the outer alignment problem as basically matching directly onto the ideal-design gap and the inner alignment problem as basically matching onto the design-emergent gap. That being said, the framework is a bit different, since inner alignment is specifically about mesa-optimizers in a way in which the design-emergent gap isn't, so perhaps it's better to say inner alignment is a subproblem of resolving the design-emergent gap. Comment by evhub on Two senses of “optimizer” · 2019-08-22T06:47:14.855Z · score: 23 (6 votes) · LW · GW The implicit assumption seems to be that an optimizer_1 could turn into an optimizer_2 unexpectedly if it becomes sufficiently powerful. It is not at all clear to me that this is the case – I have not seen any good argument to support this, nor can I think of any myself. I think that this is one of the major ways in which old discussions of optimization daemons would often get confused. I think the confusion was coming from the fact that, while it is true in isolation that an optimizer_1 won't generally self-modify into an optimizer_2, there is a pretty common case in which this is a possibility: the presence of a training procedure (e.g. gradient descent) which can perform the modification from the outside. In particular, it seems very likely to me that there will be many cases where you'll get an optimizer_1 early in training and then an optimizer_2 later in training. That being said, while having an optimizer_2 seems likely to be necessary for deceptive alignment, I think you only need an optimizer_1 for pseudo-alignment: every search procedure has an objective, and if that objective is misaligned, it raises the possibility of capability generalization without objective generalization. Also, as a terminological note, I've taken to using "optimizer" for optimizer_1 and "agent" for something closer to optimizer_2, where I've been defining an agent as an optimizer that is performing a search over what its own action should be. I prefer that definition to your definition of optimizer_2, as I prefer mechanistic definitions over behavioral definitions since I generally find them more useful, though I think your notion of optimizer_2 is also a useful concept. Comment by evhub on The Inner Alignment Problem · 2019-08-13T20:27:03.210Z · score: 9 (2 votes) · LW · GW The simple reason why that doesn't work is that computing the base objective function during deployment will likely require something like actually running the model, letting it act in the real world, and then checking how good that action was. But in the case of actions that are catastrophically bad, the fact that you can check after the fact that the action was in fact bad isn't much consolation: just having the agent take the action in the first place could be exceptionally dangerous during deployment. Comment by evhub on New paper: Corrigibility with Utility Preservation · 2019-08-09T18:06:27.152Z · score: 1 (1 votes) · LW · GW In button-corrigibility, there is a quite different main safety concern: we want to prevent the agent from taking actions that manipulate human values or human actions in some bad way, with this manipulation creating conditions that make it easier for the agent to get or preserve a high utility score. I generally think of solving this problem via ensuring your agent is myopic rather than via having a specific ritual which the agent is indifferent to, as you describe. It seems to me like even if you could make the agent indifferent to the shutdown button, there would still be other ways for it to influence you into giving it higher reward. Thus, it seems like what you really want here is some guarantee that the agent is only considering its per-episode reward and doesn't care at all about its cross-episode reward, which is the condition I call myopia. That being said, how to reliably produce myopic agents is still a very open problem. Comment by evhub on New paper: Corrigibility with Utility Preservation · 2019-08-08T19:25:13.038Z · score: 1 (1 votes) · LW · GW I think I mostly agree with what you're saying here, except that I do have some concerns about applying a corrigibility layer to a learned model after the fact. Here are some ways in which I could see that going wrong: 1. The learned model might be so opaque that you can't figure out how to apply corrigibility unless you have some way of purposefully training it to be transparent in the right way. 2. If you aren't applying corrigibility during training, then your model could act dangerously during the training process. 3. If corrigibility is applied as a separate layer, the model could figure out how to disable it. I think if you don't have some way of training your model to be transparent, you're not going to be able to make it corrigible after the fact, and if you do have some way of training your model to be transparent, then you might as well apply corrigibility during training rather than doing it at the end. Also, I think making sure your model is always corrigible during training can also help with the task of training it to be transparent, as you can use your corrigible model to work on the task of making itself more transparent. In particular, I mostly think about informed oversight via an amplified overseer as how you would do something like that. Comment by evhub on New paper: Corrigibility with Utility Preservation · 2019-08-08T19:10:41.740Z · score: 4 (3 votes) · LW · GW I think your agent is basically exactly what I mean by a lonely engineer: the key lonely engineer trick is to give the agent a utility function which is a mathematical entity that is entirely independent of its actions in the real world. I think on my first reading of your paper it seemed like you were trying to get the agent to approximate the agent, but reading it more closely now it seems like what you're doing is using the correction term to make it indifferent to the button being pressed; is that correct? I think in that case I tend towards thinking that such corrections are unlikely to scale towards what I eventually want out of corrigibility, though I do think there are other ways in which such approaches can still be quite useful. I mainly see approaches of that form as broadly similar to impact measure approaches such as Attainable Utility Preservation or Relative Reachability where the idea is to try to disincentive the agent from acting non-corrigibly. The biggest hurdles there seem to be a) making sure your penalty doesn't miss anything, b) figuring out how to trade off between the penalty and performance, and c) making sure that if you train on such a penalty the resulting model will actually learn to always behave according to it (I see this as the hardest part). On that note, I totally agree that it becomes quite a lot more complicated when you start imagining creating these things via learning, which is the picture I think about primarily. In fact, I think your intuitive picture here is quite spot on: it's just very difficult to control exactly what sort of model you'll get when you actually try to train an agent to maximize something. Comment by evhub on New paper: Corrigibility with Utility Preservation · 2019-08-06T19:54:39.161Z · score: 15 (6 votes) · LW · GW Even though it is super-intelligent, the AU agent has no emergent incentive to spend any resources to protect its utility function. This is because of how it was constructed: it occupies a universe in which no physics process could possibly corrupt its utility function. With the utility function being safe no matter what, the optimal strategy is to devote no resources at all to the matter of utility function protection. I call this style of approach the "lonely engineer" (a term which comes from MIRI, and I think specifically from Scott Garrabrant, though I could be mistaken on that point). I think lonely-engineer-style approaches (as I think yours basically falls into) are super interesting, and definitely on to something real and important. That being said, my biggest concern with approaches of that form is that they make an extraordinarily strong assumption about our ability to select what utility function our agent has. In particular, if you imagine training a machine learning model on a lonely engineer utility function, it won't work at all, as gradient descent is not going to be able to distinguish between preferences that are entirely about abstract mathematical objects and those that are not. For a full exploration of this problem, see "Risks from Learned Optimization." My other concern is that while the lonely engineer themselves might have no incentive to behave non-corrigibly, if it runs any sort of search process (doing its own machine learning, for example), it will have no incentive to do so in an aligned way. Thus, a policy that its search results in might have preferences over things other than abstract mathematical objects, even if the lonely engineer themselves does not. Comment by evhub on Risks from Learned Optimization: Conclusion and Related Work · 2019-07-24T21:34:11.447Z · score: 3 (2 votes) · LW · GW I think it's still an open question to what extent not having any mesa-optimization would hurt capabilities, but my sense is indeed that mesa-optimization is likely inevitable if you want to build safe AGI which is competitive with a baseline unaligned approach. Thus, I tend towards thinking that the right strategy is to understand that you're definitely going to produce a mesa-optimizer, and just have a really strong story for why it will be aligned. Comment by evhub on Contest:1,000 for good questions to ask to an Oracle AI · 2019-07-03T18:33:47.829Z · score: 1 (1 votes) · LW · GW

If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don't we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to make it get an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it's a problem regardless.

Yeah, that's a good point. In my most recent response to Wei Dai I was trying to develop a loss which would prevent that sort of coordination, but it does seem like if that's happening then it's a problem in any counterfactual oracle setup, not just this one. Though it is thus still a problem you'd have to solve if you ever actually wanted to implement a counterfactual oracle.

Comment by evhub on Risks from Learned Optimization: Introduction · 2019-07-03T18:28:45.770Z · score: 9 (5 votes) · LW · GW

I think my concern with describing mesa-optimizers as emergent subagents is that they're not really "sub" in a very meaningful sense, since we're thinking of the mesa-optimizer as the entire trained model, not some portion of it. One could describe a mesa-optimizer as a subagent in the sense that it is "sub" to gradient descent, but I don't think that's the right relationship—it's not like the mesa-optimizer is some subcomponent of gradient descent; it's just the trained model produced by it.

The reason we opted for "mesa" is that I think it reflects more of the right relationship between the base optimizer and the mesa-optimizer, wherein the base optimizer is "meta" to the mesa-optimizer rather than the mesa-optimizer being "sub" to the base optimizer.

Furthermore, in my experience, when many people encounter "emergent subagents" they think of some portion of the model turning into an agent and (correctly) infer that something like that seems very unlikely, as it's unclear why such a thing would actually be advantageous for getting a model selected by something like gradient descent (unlike mesa-optimization, which I think has a very clear story for why it would be selected for). Thus, we want to be very clear that something like that is not the concern being presented in the paper.

Comment by evhub on Contest: 1,000 for good questions to ask to an Oracle AI · 2019-07-02T20:55:53.463Z · score: 5 (3 votes) · LW · GW You can use a different oracle for every subquestion, but it's unclear what exactly that does if you don't know what the oracle's actual objective is. For example, you could imagine one system that cares about the reward given to copies equally as much as reward given to itself, and another system that only cares about the reward given to itself, and these two systems would be near-indistinguishable if you were just doing empirical analysis on some training distribution. The key here, I think, is the degree to which you're willing to make an assumption of the form you mention--that is, how much leeway are you willing to allow in assuming that the oracle is in fact only going to be optimizing for the specified objective function. On one level, it makes sense to separate out outer alignment and inner alignment concerns, but the problem in this case is that the sorts of objectives you're allowed to specify here are in fact going to be very dependent on whatever inner alignment solution you're assuming you have access to. For example, does your inner alignment solution require access to training data? If so, then that's a big constraint on the sorts of objectives you can specify. Or, going back to your previous question, whether or not I get to pick between an objective that doesn't care about other copies or one that does is also going to depend on exactly what sort of inner alignment solution you're assuming. Comment by evhub on Contest:1,000 for good questions to ask to an Oracle AI · 2019-07-02T18:57:30.595Z · score: 2 (2 votes) · LW · GW

Yeah, that's a good point.

Okay, here's another thought: if you can get the counterfactual oracle to only care about the reward for that invocation and not for any sub-invocations, then you can just go back to using the original algorithm. This is a bit slippery, because really whether you get a mesa-optimizer that cares only about a single step's worth of reward, all the reward it will ever get, or something else somewhere in-between, is mostly out of your control. But if you can guarantee that you get a mesa-optimizer that is only optimizing for one step of reward, you might be able to eliminate the manipulation incentive. The biggest issue in this case, though, is that the different invocations might still be able to acausally cooperate with each other to raise all of their rewards together by manipulating you into being easier to predict.

Maybe we can turn this into a zero-sum game, though? Here's a proposal: let be a copy of and be the set of all questions in the current tree that also get erasures. Then, let such that the net sum of losses across each tree for all the nodes for which losses are computed is zero. It's still somewhat unclear whether this actually works, though, since you only include the losses for those nodes that get erasures.

It is also worth noting, however, that even if this works it is a very artificial fix, since the term you're subtracting is a constant with no dependence on , so if you're trying to do gradient descent to optimize this loss, it won't change anything at all (which sort of goes to show how gradient descent doesn't distinguish between mesa-optimizers with single-episode and cross-episode objectives). Thus, we're still back at the problem of none of this working unless you're willing to make strong claims about your ability to directly produce a mesa-optimizer which is actually optimizing for the specified loss function.

Comment by evhub on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-02T00:03:05.342Z · score: 3 (3 votes) · LW · GW I was thinking about this, and it's a bit unclear. First, if you're willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures. In this situation, you're guarded against any possible manipulation incentive like that, but it also means your oracle will very rarely actually be given a reward in practice, which means if you're relying on getting enough training data to produce an agent which will optimize for this objective, you're screwed. I would argue, however, that if you expect to train an agent to behave as a counterfactual oracle in the first place, you're already screwed, because most mesa-optimizers will care about things other than just the counterfactual case. Thus, the only situation in which this whole thing works in the first place is the situation where you're already willing to make this (very strong) assumption, so it's fine. Second, I don't think you're entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure. For example, you could increase the probability of an erasure with each subquestion, or scale the reward exponentially with the depth at which the erasure occurs, so that the majority of the expected reward is always concentrated in the world where there is a complete erasure. Comment by evhub on Contest:$1,000 for good questions to ask to an Oracle AI · 2019-07-01T18:40:48.618Z · score: 10 (5 votes) · LW · GW

My submission: use a counterfactual oracle to do IDA. To do this, judge the oracle's answer as follows: in the event of an erasure, ask a human to answer the question with the ability to consult the oracle on other subquestions (with erasures during those calls handled exactly the same way) and then reward the oracle by using an automated system (e.g. a language model) to compare the human's answer to that of the oracle.

More precisely, let

• be the counterfactual oracle,
• be the human's answer to question when given the ability to call on any question other than , and
• be some distance metric on answers in natural language (it's not that hard to make something like this, even with current ML tools).

Then, reward as per usual for a counterfactual oracle, only giving it a reward in the event of an erasure, in which case let where is hidden from and judged only by as in the standard counterfactual oracle setup.

(Of course, this doesn't actually work because it has no guarantees wrt to inner alignment, but I think it has a pretty good shot of being outer aligned.)

Comment by evhub on Deceptive Alignment · 2019-06-17T21:41:00.263Z · score: 9 (5 votes) · LW · GW

Suppose you have a system which appears to be aligned on-distribution. Then, you might want to know:

1. Is it actually robustly aligned?
2. How did it learn how to behave aligned? Where did the information about the base objective necessary to display behavior aligned with it come from? Did it learn about the base objective by looking at its input, or was that knowledge programmed into it by the base optimizer?

The different possible answers to these questions give us four cases to consider:

1. It's not robustly aligned and the information came through the base optimizer. This is the standard scenario for (non-deceptive) pseudo-alignment.
2. It is robustly aligned and the information came through the base optimizer. This is the standard scenario for robust alignment, which we call internal alignment.
3. It is not robustly aligned and the information came through the input. This is the standard scenario for deceptive pseudo-alignment.
4. It is robustly aligned and the information came through the input. This is the weird case that we call corrigible alignment. In this case, it's trying to figure out what the base objective is as it goes along, not because it wants to play along and pretend to optimize for the base objective, but because optimizing for the base objective is actually somehow its terminal goal. How could that happen? It would have to be that its mesa-objective somehow "points" to the base objective in such a way that, if it had full knowledge of the base objective, then its mesa-objective would just be the base objective. What does this situation look like? Well, it has a lot of similarities to the notion of corrigibility: the mesa-optimizer in this situation seems to be behaving corrigibly wrt to the base optimizer (though not necessarily wrt the programmers), as it is trying to understand what the base optimizer wants and then do that.
Comment by evhub on The Inner Alignment Problem · 2019-06-07T17:41:04.387Z · score: 5 (3 votes) · LW · GW

I just added a footnote mentioning IDA to this section of the paper, though I'm leaving it as is in the sequence to avoid messing up the bibliography numbering.

Comment by evhub on Deceptive Alignment · 2019-06-06T19:04:02.954Z · score: 7 (5 votes) · LW · GW

This is a response to point 2 before Pattern's post was modified to include the other points.

Interesting point! First of all, I think condition three is mostly a requirement that must be met for a deceptively aligned mesa-optimizer to actually defect at some point, rather than for it to be deceptive in the first place.

That being said, I think the situation you raise is particularly interesting because it depends on whether the mesa-optimizer cares about its own survival. If the mesa-optimizer does care about its own continued existence, then it would want to do as you were reasoning and not defect, keeping itself around. Alternatively, however, suppose the mesa-optimizer just cares about but doesn't care that it's around to do the optimization for it. Then, defecting and optimizing for instead of when you expect to be modified afterwards won't actually hurt the long-term fulfillment of , since another learned algorithm will just take your place to optimize for it instead. Thus, even if the mesa-optimizer prefers lots of to a bit of , that's not actually the choice presented to it; rather, it's actual choice is between a bit of and a lot of versus only a lot of . Thus, in this case, it would defect even if it thought it would be modified.

(Also, on point 6, thanks for the catch; it should be fixed now!)

Comment by evhub on The Inner Alignment Problem · 2019-06-06T17:17:00.419Z · score: 7 (3 votes) · LW · GW

IDA is definitely a good candidate to solve problems of this form. I think IDA's best properties are primarily outer alignment properties, but it does also have some good properties with respect to inner alignment such as allowing you to bootstrap an informed adversary by giving it access to your question-answer system as you're training it. That being said, I suspect you could do something similar under a wide variety of different systems—bootstrapping an informed adversary is not necessarily unique to IDA. Unfortunately, we don't discuss IDA much in the last post, though thinking about mesa-optimizers in IDA (and other proposals e.g. debate) is imo a very important goal, and our hope is to at the very least provide the tools so that we can then go and start answering questions of that form.

Comment by evhub on The Inner Alignment Problem · 2019-06-05T22:10:58.866Z · score: 4 (3 votes) · LW · GW

I broadly agree that description complexity penalties help fight against pseudo-alignment whereas computational complexity penalties make it more likely, though I don't think it's absolute and there are definitely a bunch of caveats to that statement. For example, Solomonoff Induction seems unsafe despite maximally selecting for description complexity, though obviously that's not a physical example.