Posts

Automating Auditing: An ambitious concrete technical research proposal 2021-08-11T20:32:41.487Z
LCDT, A Myopic Decision Theory 2021-08-03T22:41:44.545Z
Answering questions honestly instead of predicting human answers: lots of problems and some solutions 2021-07-13T18:49:01.842Z
Knowledge Neurons in Pretrained Transformers 2021-05-17T22:54:50.494Z
Agents Over Cartesian World Models 2021-04-27T02:06:57.386Z
Open Problems with Myopia 2021-03-10T18:38:09.459Z
Operationalizing compatibility with strategy-stealing 2020-12-24T22:36:28.870Z
Homogeneity vs. heterogeneity in AI takeoff scenarios 2020-12-16T01:37:21.432Z
Clarifying inner alignment terminology 2020-11-09T20:40:27.043Z
Multiple Worlds, One Universal Wave Function 2020-11-04T22:28:22.843Z
Learning the prior and generalization 2020-07-29T22:49:42.696Z
Weak HCH accesses EXP 2020-07-22T22:36:43.925Z
Alignment proposals and complexity classes 2020-07-16T00:27:37.388Z
AI safety via market making 2020-06-26T23:07:26.747Z
An overview of 11 proposals for building safe advanced AI 2020-05-29T20:38:02.060Z
Zoom In: An Introduction to Circuits 2020-03-10T19:36:14.207Z
Synthesizing amplification and debate 2020-02-05T22:53:56.940Z
Outer alignment and imitative amplification 2020-01-10T00:26:40.480Z
Exploring safe exploration 2020-01-06T21:07:37.761Z
Safe exploration and corrigibility 2019-12-28T23:12:16.585Z
Inductive biases stick around 2019-12-18T19:52:36.136Z
Understanding “Deep Double Descent” 2019-12-06T00:00:10.180Z
What are some non-purely-sampling ways to do deep RL? 2019-12-05T00:09:54.665Z
What I’ll be doing at MIRI 2019-11-12T23:19:15.796Z
More variations on pseudo-alignment 2019-11-04T23:24:20.335Z
Chris Olah’s views on AGI safety 2019-11-01T20:13:35.210Z
Gradient hacking 2019-10-16T00:53:00.735Z
Impact measurement and value-neutrality verification 2019-10-15T00:06:51.879Z
Towards an empirical investigation of inner alignment 2019-09-23T20:43:59.070Z
Relaxed adversarial training for inner alignment 2019-09-10T23:03:07.746Z
Are minimal circuits deceptive? 2019-09-07T18:11:30.058Z
Concrete experiments in inner alignment 2019-09-06T22:16:16.250Z
Towards a mechanistic understanding of corrigibility 2019-08-22T23:20:57.134Z
Risks from Learned Optimization: Conclusion and Related Work 2019-06-07T19:53:51.660Z
Deceptive Alignment 2019-06-05T20:16:28.651Z
The Inner Alignment Problem 2019-06-04T01:20:35.538Z
Conditions for Mesa-Optimization 2019-06-01T20:52:19.461Z
Risks from Learned Optimization: Introduction 2019-05-31T23:44:53.703Z
A Concrete Proposal for Adversarial IDA 2019-03-26T19:50:34.869Z
Nuances with ascription universality 2019-02-12T23:38:24.731Z
Dependent Type Theory and Zero-Shot Reasoning 2018-07-11T01:16:45.557Z

Comments

Comment by evhub on Agents Over Cartesian World Models · 2021-09-16T00:56:16.718Z · LW · GW

Or is this just exploring a potential approximation?

Yeah, that's exactly right—I'm interested in how an agent can do something like manage resource allocation to do the best HCH imitation in a resource-bounded setting.

Are we including "long speech about why human should give high approval to me because I'm suffering" as an action? I guess there's a trade-off here, where limiting to word-level output demands too much lookahead coherence of the human, while long sentences run the risk of incentivizing reward tampering. Is that the reason you had in mind?

Yep, that's the idea.

This argument doesn't seem to work, because the zero utility function makes everything optimal.

Yeah, that's fair—if you add the assumption that every trajectory has a unique utility (that is, your preferences are a total ordering), though, then I think the argument still goes through. I don't know how realistic an assumption like that is, but it seems plausible for any complex utility function over a relatively impoverished domain (e.g. a complex utility function over observations or actions would probably have this property, but a simple utility function over world states would probably not).

Comment by evhub on Obstacles to gradient hacking · 2021-09-05T23:12:20.359Z · LW · GW

(Moderation note: added to the Alignment Forum from LessWrong.)

Comment by evhub on Automating Auditing: An ambitious concrete technical research proposal · 2021-08-12T20:29:18.402Z · LW · GW

Sure, but presumably they'll also say what particular attacks are so hard that current ML models aren't capable of solving them—and I think that's a valuable piece of information to have.

Comment by evhub on Automating Auditing: An ambitious concrete technical research proposal · 2021-08-12T20:26:40.485Z · LW · GW

Yeah, that's a great question—I should have talked more about this. I think there are three ways to handle this sort of problem—and ideally we should do some combination of all three:

  1. Putting the onus on the attacker. Probably the simplest way to handle this problem is just to have the attacker produce larger specification breaks than anything that exists in the model independently of the attacker. If it's the case that whenever the attacker does something subtle like you're describing the auditor just finds some other random problem, then the attacker should interpret that as a sign that they can get away with less subtle attacks and try larger modifications instead.
  2. Putting the onus on the judge. Another way to address this sort of problem is just to have a better, more rigorous specification. For example, the specification could include a baseline model that the auditor is supposed to compare to but which they can only sample from, not view the internals of (that way, modifying an individual neuron in the baseline model isn't trivial for the auditor to detect)—see the sketch after this list for what that access pattern might look like.
  3. Putting the onus on the auditor. The last way to handle this sort of problem is just to force the auditor to do better. There is a right answer to the question of what behavior exists in the model that language models wouldn't naturally learn when trained on general webtext, and a valid way to handle this problem is to just keep iterating on the auditor until they figure out how to find that right answer. For example: if all your attacks are fine-tuning attacks on some dataset, there is a correct answer about what sort of dataset the model was fine-tuned on, and you can just force the auditor to try to find that correct answer.
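Here is a minimal sketch of the access pattern from (2). Everything in it (the names, the attacker/judge interfaces) is hypothetical rather than from the post—the point is just that the auditor gets white-box access to the audited model but only query access to the baseline:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AuditorInterface:
    audited_model: object                      # full access: weights, activations, etc.
    sample_baseline: Callable[[str], str]      # restricted access: sampling only

def run_audit_round(attacker_modify, baseline_model, auditor, judge) -> bool:
    """One round of the game: the attacker edits a copy of the baseline, the
    auditor investigates, and the judge scores the auditor's report."""
    audited = attacker_modify(baseline_model)  # attacker introduces a specification break
    interface = AuditorInterface(
        audited_model=audited,
        sample_baseline=lambda prompt: baseline_model.generate(prompt),
    )
    report = auditor(interface)                # auditor's description of what was changed
    return judge(report, audited, baseline_model)
```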
Comment by evhub on Automating Auditing: An ambitious concrete technical research proposal · 2021-08-12T20:08:42.781Z · LW · GW

Is it inconsistent with the original meaning of the term? I thought that the original meaning of inside view was just any methodology that wasn't reference-class forecasting—and I don't use the term “outside view” at all.

Also, I'm not saying that “inside view” means “real reason,” but that my real reason in this case is my inside view.

Comment by evhub on LCDT, A Myopic Decision Theory · 2021-08-11T19:25:48.481Z · LW · GW

That's a really interesting thought—I definitely think you're pointing at a real concern with LCDT now. Some thoughts:

  • Note that this problem is only with actually running agents internally, not with simply having the objective of imitating/simulating an agent—it's just that LCDT will try to simulate that agent exclusively via non-agentic means.
  • That might actually be a good thing, though! If it's possible to simulate an agent via non-agentic means, that certainly seems a lot safer than internally instantiating agents—though it might just be impossible to efficiently simulate an agent without instantiating any agents internally, in which case it would be a problem.
  • In some sense, the core problem here is just that the LCDT agent needs to understand how to decompose its own decision nodes into individual computations so it can efficiently compute things internally and then know when and when not to label its internal computations as agents. How to decompose nodes into subnodes to properly work with multiple layers is a problem with all CDT-based decision theories, though—and it's hopefully the sort of problem that finite factored sets will help with.
Comment by evhub on LCDT, A Myopic Decision Theory · 2021-08-10T21:54:41.132Z · LW · GW

But by assumption, it doesn't think it can influence anything downstream of those (or the probability that they exist, I assume).

This is not true—LCDT is happy to influence nodes downstream of agent nodes; it just doesn't believe it can influence them through those agent nodes. So LCDT (at decision time) doesn't believe it can change what HCH does, but it's happy to change what it does so that its action agrees with what it thinks HCH will do, even though that utility node is downstream of the HCH agent nodes.
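To make the decision rule concrete, here's a toy sketch of how I think about it (this is a gloss, not code from the post; marginalizing over the agent's own decision is simplified down to plugging in a single prior action):

```python
def evaluate(graph, fixed):
    """Compute every node's value. `graph` maps node name -> function of the
    values computed so far, and is assumed to be topologically ordered."""
    values = dict(fixed)
    for node, fn in graph.items():
        if node not in values:
            values[node] = fn(values)
    return values

def lcdt_choose(actions, graph, agent_nodes, utility_node, prior_action):
    # Values every node would take if the decision followed the prior policy.
    default = evaluate(graph, {"decision": prior_action})

    def utility_of(action):
        values = {"decision": action}
        for node, fn in graph.items():
            if node == "decision":
                continue
            if node in agent_nodes:
                # Links from the decision into other agents are cut: the agent
                # node keeps the value it had under the prior over the decision.
                values[node] = default[node]
            else:
                # Non-agent nodes (including ones downstream of agents) do
                # respond to the candidate action, along non-agent paths.
                values[node] = fn(values)
        return values[utility_node]

    return max(actions, key=utility_of)
```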

Comment by evhub on LCDT, A Myopic Decision Theory · 2021-08-10T09:06:54.900Z · LW · GW

My issue is in seeing how we find a model that will consistently do the right thing in training (given that it's using LCDT).

How about an LCDT agent with the objective of imitating HCH? Such an agent should be aligned and competitive, assuming the same is true of HCH. Such an agent certainly shouldn't delete itself to free up disk space, since HCH wouldn't do that—nor should it fall prey to the general argument you're making about taking epsilon utility in a non-agent path, since there's only one utility node it can influence without going through other agents, which is the delta between its next action and HCH's action.

We can use it to run simulations, but it seems to me that most problems (deception in particular) get moved to the simulation rather than solved.

I claim that, for a reasonably accurate HCH model that's within some broad basin of attraction, an LCDT agent attempting to imitate that HCH model will end up aligned—and that the same is not true for any other decision theory/agent model that I know of. And LCDT can do this while being able to manage things like how to simulate most efficiently and how to allocate resources between different methods of simulation. The core idea is that LCDT solves the hard problem of being able to put optimization power into simulating something efficiently in a safe way.

Comment by evhub on LCDT, A Myopic Decision Theory · 2021-08-09T19:56:21.682Z · LW · GW

Perhaps I'm now understanding correctly(??). An undesirable action that springs to mind: delete itself to free up disk space. Its future self is assumed to give the same output regardless of this action. More generally, actions with arbitrarily bad side-effects on agents, to gain marginal utility. Does that make sense?

Sure—that's totally fine. The point of LCDT isn't to produce an aligned agent, but to produce an agent that's never deceptive. That way, if your AI is going to delete itself to free up disk space, it'll do it in training and you can see that it's going to do that and correct that behavior.

With an LCDT agent, the idea is that if it does the right thing in training, you know it's not just doing that because it's trying to trick you and it'll actually do the wrong thing later in deployment. The point of LCDT, in my head, is to give you (defects in deployment) iff (defects in training), so that you can just take your LCDT agent and tweak it until it does the right thing in training, and then not be surprised by it actually having been deceiving you and defecting when you deploy it.

Comment by evhub on LCDT, A Myopic Decision Theory · 2021-08-06T19:29:53.272Z · LW · GW

To have an incoherent world model: one in which I can believe with 99% certainty that a kite no longer exists, and with 80% certainty that you're still flying that probably-non-existent kite.

I feel pretty willing to bite the bullet on this—what sorts of bad things do you think LCDT agents would do given such a world model (at decision time)? Such an LCDT agent should still be perfectly capable of tasks like simulating HCH without being deceptive—and should still be perfectly capable of learning and improving its world model, since the incoherence only shows up at decision-time and learning is done independently.

Comment by evhub on LCDT, A Myopic Decision Theory · 2021-08-06T19:27:03.790Z · LW · GW

Specifically, I think if an agent can do the kind of reasoning that would allow it to create a causal world-model in the first place, then the same kind of reasoning would lead it to realize that there is in fact supposed to be a link at each of the places where we manually cut it—i.e., that the causal world-model is incoherent.

An LCDT agent should certainly be aware of the fact that those causal chains actually exist—it just shouldn't care about that. If you want to argue that it'll change to not using LCDT to make decisions anymore, you have to argue that, under the decision rules of LCDT, it will choose to self-modify in some particular situation—but LCDT should rule out its ability to ever believe that any self-modification will do anything, thus ensuring that, once an agent starts making decisions using LCDT, it shouldn't stop.

Comment by evhub on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-08-04T22:30:44.203Z · LW · GW

Can write an arbitrary program for ?

Yes—at least that's the assumption I'm working under.

It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?

I agree that the first set of parameters you've described has lower complexity than the intended one—but the second set in this case has higher complexity, since it's no longer getting any of its complexity for free from conditioning on the condition. And in fact what you've just described is precisely the unintended model that I'm trying to compete against, with the hope being that the savings that conditioning gives you in the second set of parameters are sufficient to compensate for the cost of the extra pieces (H_understands among them) that the intended first set has to specify.

If you calculate the complexity of your proposal and compare it to the complexity of the intended model, you can see that the question of which one wins depends precisely on whether the savings from the conditioning offset the cost of the extra pieces (H_understands among them) that the intended first set of parameters has to specify.
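Schematically (this is just my gloss; I'm writing $\theta_1$ for the first set of parameters, $\theta_2$ for the second, and $K(\cdot)$ for description length):

$$K(\text{intended}) \le K(\text{unintended}) \iff \underbrace{\Delta K(\theta_2 \mid \text{condition})}_{\text{bits } \theta_2 \text{ gets for free from the conditioning}} \;\ge\; \underbrace{K(H\_\text{understands}) + K(\text{other extra pieces in } \theta_1)}_{\text{extra bits the intended } \theta_1 \text{ has to pay}}$$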

Comment by evhub on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-08-02T22:02:05.201Z · LW · GW

Seems like if the different heads do not share weights then "the parameters in " is perfectly well-defined?

It seemed to me like you were using it in a way such that one head shared no weights with the other, which I think was because you were confused by the quantification, like you said previously. I think we're on the same page now.

Okay, so iiuc you're relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between and ?

Sorry, I was unclear about this in my last response. The two heads will only agree in cases where the human understands what's happening. In the dataset version, we get that by collecting a dataset where we think the human always gets it right, whereas in the dataset-less version, we get that by including the check which ensures that we don't have to satisfy the condition when the human would get the question wrong.

Comment by evhub on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-30T21:15:32.127Z · LW · GW

I assumed that when you talked about a model with "different heads" you meant that there is a shared backbone that computes a representation, that is then passed through two separate sequences of layers that don't share any weights, and those separate sequences of layers were the "heads" and .

Yep, that's what I mean.

Still, assuming that with your desired part-which-shares-weights, every possible input to parts-which-don't-share-weights can be generated by some (which seems like it will be close enough to true), the argument still suggests that conditioning on the desired part-which-shares-weights, you have just doubled the strength of the neural net prior on the parts-which-don't-share-weights.

Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing—the only conditioning in the prior is conditioning on the stated condition. If we look at the intended model, however, the first set of parameters includes all of the parts-which-don't-share-weights, while the second set sits entirely in the part-which-shares-weights.

Technically, I suppose, you can just take the prior and condition on anything you want—but it's going to look really weird to condition on the part-which-shares-weights having some particular value without even knowing which parts came from the first set of parameters and which came from the second.

I do agree that, if the first set of parameters were to specify the entire part-which-shares-weights and leave the second set to fill in the parts-which-don't-share-weights, then you would get exactly what you're describing, where the second set would have a doubly-strong neural net prior on implementing the same function for both heads. But that's only one particular arrangement—there are lots of others which induce very different distributions on the second set of parameters.

This seems to suggest that are different functions, i.e. there's some input on which they disagree.

Note that the inputs to the heads are deduced statements, not raw data. They are certainly different functions over the space of all possible deduced statements—but once you put a correct world model in them, they should produce equivalent maps.

Comment by evhub on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-29T20:17:39.890Z · LW · GW

Yeah, sorry, I wasn't clear here -- I meant that, rather than reasoning about the complexity of individual pieces / stages and then adding them all up at the end, I am instead simulating out the equations

Sure, makes sense—theoretically, that should be isomorphic.

I want to note that since is chosen randomly, it isn't "choosing" the condition on ; rather the wide distribution over leads to a wide distribution over possible conditions on . But I think that's what you mean.

This seems like a case where I'm using the more constructive formulation of simulating out the equations and you're thinking about in a more complexity-oriented framing. Of course, again, they should be equivalent.

By our bijection assumption, the parameters in must be identical to the parameters in .

I'm not sure what you mean by this part—the two heads are just different heads of the same model, not entirely different models, so I'm not sure what you mean by “the parameters in” one of them. I don't think that a bijection assumption between weights and single-head outputs really makes sense in this context. I also definitely would say that if the heads were separate models such that they couldn't reuse weights between them, then none of the complexity arguments that I make in the post would go through.

These constraints are necessary and sufficient to satisfy the overall constraint that , and therefore any other parameters in are completely unconstrained and are set according to the original neural net prior.

I'm happy to accept that there are ways of setting the first set of parameters (e.g. just make the two heads identical) such that the rest of the parameters are unconstrained and just use the neural net prior. However, that's not the only way of setting them—and not the most complexity-efficient, I would argue. In the defender's argument, the first set of parameters sets all the head-specific parameters for both heads to enforce that each head computes its intended function, and also sets all the shared parameters for everything other than the human model, while leaving the human model to the second set—thus enforcing that the second set specify a human model that's correct enough to make the condition hold, without having to pay any extra bits to do so.

Comment by evhub on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-28T19:23:38.654Z · LW · GW

Hmm, I'm not thinking about the complexity part at all right now; I'm just thinking mechanically about what is implied by your equations.

The only difference between this setup and normal ML is the prior/complexity—you still have the ability to learn all the same functions, it's just that some are more/less likely now.

though of course to check the condition you need to look at both parameters, so you can view this as the first set of parameters specifying a condition on the second set of parameters

Yep, that's exactly right.

Imagine there was a bijection between model parameters and resulting function. (I'm aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters.

That's definitely not what should happen in that case. Note that there is no pairing between the first set of parameters and one head or the second set and the other—both sets of parameters contribute equally to both heads. Thus, the first set can enforce any condition it wants on the second by leaving some particular hole in how it computes the two heads and forcing the second set to fill in that hole in such a way as to make the computation of the two heads come out equal.

Comment by evhub on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-27T19:14:24.643Z · LW · GW

It seems like at this point your prior is "generate parameters randomly under the constraint that the two heads are identical"

That's not what the prior looks like—the prior is more like “generate parameters that specify some condition, then sample parameters that make that condition true.” Thus, you don't need to pay for the complexity of satisfying the condition, only the complexity of specifying it (as long as you're content with the simplest possible way to satisfy it). This is why the two-step nature of the algorithm is necessary—the prior you're describing is what would happen if you used a one-step algorithm rather than a two-step algorithm (which I agree would then not do anything).
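To illustrate the two-step structure, here's a toy sketch (bit-strings standing in for parameters; the names and the particular condition are made up purely for illustration): the complexity you pay is for specifying the condition via the first set of parameters, while the second set is filled in as cheaply as possible subject to it.

```python
import itertools

def simplest_satisfying(condition, max_bits):
    """Return the cheapest bit-string (shortest, then lexicographically first)
    that satisfies `condition`, or None if none exists up to max_bits."""
    for length in range(max_bits + 1):
        for bits in itertools.product("01", repeat=length):
            candidate = "".join(bits)
            if condition(candidate):
                return candidate
    return None

def two_step_model(theta1, n_bits=8):
    # Step 1: theta1 is sampled from the ordinary prior and read as a
    # *condition* on theta2 (here, just for illustration: "theta2 is an
    # n_bits string starting with theta1").
    condition = lambda theta2: len(theta2) == n_bits and theta2.startswith(theta1)
    # Step 2: theta2 is the simplest setting that makes the condition true,
    # so its remaining bits come for free.
    theta2 = simplest_satisfying(condition, n_bits)
    return theta1, theta2

# e.g. two_step_model("101") -> ("101", "10100000"): only the 3 bits of theta1
# are "paid for", even though theta2 is 8 bits long.
```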

Comment by evhub on Thoughts on safety in predictive learning · 2021-07-20T20:22:09.643Z · LW · GW

(A possible objection would be: "real-world foresighted planning" isn't a separate thing that adds to model complexity, instead it naturally falls out of other capabilities that are necessary for postdiction like "building predictive models" and "searching over strategies" and whatnot. I think I would disagree with that objection, but I don't have great certainty here.)

Yup, that's basically my objection.

Comment by evhub on Thoughts on safety in predictive learning · 2021-07-19T20:56:26.331Z · LW · GW

Sure, that's fair. But in the post, you argue that this sort of non-in-universe-processing won't happen because there's no incentive for it:

It seems like there’s no incentive whatsoever for a postdictive learner to have any concept that the data processing steps in the algorithm have any downstream impacts, besides, y’know, processing data within the algorithm. It seems to me like there’s a kind of leap to start taking downstream impacts to be a relevant consideration, and there’s nothing in gradient descent pushing the algorithm to make that leap, and there doesn’t seem to be anything about the structure of the domain or the reasoning it’s likely to be doing that would lead to making that leap, and it doesn’t seem like the kind of thing that would happen by random noise, I think.

However, if there's another “why” for why the model is doing non-in-universe-processing that is incentivized—e.g. simplicity—then I think that makes this argument no longer hold.

Comment by evhub on Thoughts on safety in predictive learning · 2021-07-16T22:05:04.469Z · LW · GW

I mean, I guess it depends on your definition of “unrelated to any anticipated downstream real-world consequences.” Does the reason “it's the simplest way to solve the problem in the training environment” count as “unrelated” to real-world consequences? My point is that it seems like it should, since it's just about description length, not real-world consequences—but that it could nevertheless yield arbitrarily bad real-world consequences.

Comment by evhub on Thoughts on safety in predictive learning · 2021-07-15T21:53:01.901Z · LW · GW

Oh, I guess you're saying something different: that even a deceptive mesa-optimizer which is entirely doing within-universe processing is nevertheless scary. So that would by definition be an algorithm with the property "no operation in the algorithm is likelier to happen vs not happen specifically because of anticipated downstream chains of causation that pass through things in the real world".

Yep, that's right.

So I can say categorically: such an algorithm won't hurt anyone (except by freak accident), won't steal processing resources, won't intervene when I go for the off-switch, etc., right?

No, not at all—just because an algorithm wasn't selected based on causing something to happen in the real world doesn't mean it won't in fact try to make things happen in the real world. In particular, the reason that I expect deception in practice is not primarily because it'll actually be selected for, but primarily just because it's simpler, and so it'll be found despite the fact that there wasn't any explicit selection pressure in favor of it. See: “Does SGD Produce Deceptive Alignment?

Comment by evhub on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-15T21:47:28.339Z · LW · GW

I mostly agree with what Paul said re using various techniques to improve the evaluation of the model's answers to ensure you can test it on more open-ended questions. That being said, I'm more optimistic that, if you can get the initial training procedure right, you can rely on generalization to fill in the rest. Specifically, I'm imagining a situation where the training dataset is of the narrower form you talk about such that the two heads always agree (as in Step 3 here)—but where the deployment setting wouldn't necessarily have to be of this form, since once you're confident that you've actually learned the intended model and not e.g. the unintended one, you can use it for all sorts of things that wouldn't ever be in that training dataset (the hard part, of course, is ever actually being confident that you did in fact learn the intended model).

(Also, thanks for catching the typo—it should be fixed now.)

Comment by evhub on Thoughts on safety in predictive learning · 2021-07-14T21:25:10.832Z · LW · GW

By and large, we expect learning algorithms to do (1) things that they’re being optimized to do, and (2) things that are instrumentally useful for what they’re being optimized to do, or tend to be side-effect of what they’re being optimized to do, or otherwise “come along for the ride”. Let’s call those things “incentivized”. Of course, it’s dicey in practice to declare that a learning algorithm categorically won’t ever do something, just because it’s not incentivized. Like, we may be wrong about what’s instrumentally useful, or we may miss part of the space of possible strategies, or maybe it’s a sufficiently simple thing that it can happen by random chance, etc.

In the presence of deceptive alignment, approximately any goal is possible in this setting, not just the nearby instrumental proxies that you might be okay with. Furthermore, deception need not be 4th-wall-breaking, since the effect of deception on helping you do better in the training process entirely factors through the intended output channel. Thus, I would say that within-universe mesa-optimization can be arbitrarily scary if you have no way of ruling out deception.

Comment by evhub on Discussion: Objective Robustness and Inner Alignment Terminology · 2021-06-24T06:42:33.016Z · LW · GW

(Moderation note: added to the Alignment Forum from LessWrong.)

Comment by evhub on Empirical Observations of Objective Robustness Failures · 2021-06-24T06:42:08.617Z · LW · GW

(Moderation note: added to the Alignment Forum from LessWrong.)

Comment by evhub on Covid 6/17: One Last Scare · 2021-06-20T21:12:43.203Z · LW · GW

There are other models than the discontinuous/fast takeoff model under which alignment of the first advanced AI is critical, e.g. a continuous/slow but homogeneous takeoff.

Comment by evhub on "Decision Transformer" (Tool AIs are secret Agent AIs) · 2021-06-09T02:03:35.965Z · LW · GW

(Moderation note: added to the Alignment Forum from LessWrong.)

Comment by evhub on Survey on AI existential risk scenarios · 2021-06-09T02:03:16.268Z · LW · GW

(Moderation note: added to the Alignment Forum from LessWrong.)

Comment by evhub on Game-theoretic Alignment in terms of Attainable Utility · 2021-06-08T19:41:33.367Z · LW · GW

(Moderation note: added to the Alignment Forum from LessWrong.)

Comment by evhub on Thoughts on the Alignment Implications of Scaling Language Models · 2021-06-04T19:55:42.411Z · LW · GW

(Moderation note: added to Alignment Forum from LessWrong.)

Comment by evhub on List of good AI safety project ideas? · 2021-05-27T19:02:17.427Z · LW · GW

Though they're both somewhat outdated at this point, there are certainly still some interesting concrete experiment ideas to be found in my “Towards an empirical investigation of inner alignment” and “Concrete experiments in inner alignment.”

Comment by evhub on Agency in Conway’s Game of Life · 2021-05-19T21:37:11.045Z · LW · GW

Have you seen “Growing Neural Cellular Automata”? It seems like the authors there are trying to do something pretty similar to what you have in mind here.

Comment by evhub on Knowledge Neurons in Pretrained Transformers · 2021-05-19T20:32:00.892Z · LW · GW

Knowledge neurons don't seem to include all of the model's knowledge about a given question. Cutting them out only decreases the probability on the correct answer by 40%.

Yeah, agreed—though I would still say that finding the first ~40% of where knowledge of a particular fact is stored counts as progress (though I'm not saying they have necessarily done that).

I don't think there's evidence that these knowledge neurons don't do a bunch of other stuff. After removing about 0.02% of neurons they found that the mean probability on other correct answers decreased by 0.4%. They describe this as "almost unchanged" but it seems like it's larger than I'd expect for a model trained with dropout for knocking out random neurons (if you extrapolate that to knocking out 10% of the mlp neurons, as done during training, you'd have reduced the correct probability by 50x, whereas the model should still operate OK with 10% dropout).

That's a good point—I didn't look super carefully at their number there, but I agree that looking more carefully it does seem rather large.

Looking at that again, it seems potentially relevant that instead of zeroing those neurons they added the embedding of the [UNK] token.

I also thought this was somewhat strange and am not sure what to make of it.

A priori if a network did work this way, it's unclear why individual neurons would correspond to individual lookups rather than using a distributed representation (and they probably wouldn't, given sparsity---that's a crazy inefficient thing to do and if anything seems harder for SGD to learn) so I'm not sure that this perspective even helps explain the observation that a small number of neurons can have a big effect on particular prompts.

I was also surprised that they used individual neurons rather than NMF factors or something—though the fact that it still worked while just using the neuron basis seems like more evidence that the effect is real rather than less.

But they don't give any evidence that the transformation had a reliable effect, or that it didn't mess up other stuff, or that they couldn't have a similar effect by targeting other neurons.

Actually looking at the replacement stuff in detail it seems even weaker than that. Unless I'm missing something it looks like they only present 3 cherry-picked examples with no quantitative evaluation at all? It's possible that they just didn't care about exploring this effect experimentally, but I'd guess that they tried some simple stuff and found the effect to be super brittle and so didn't report it. And in the cases they report, they are changing the model from remembering an incorrect fact to a correct one---that seems important because probably the model put significant probability on the correct thing already.

Perhaps I'm too trusting—I agree that everything you're describing seems possible given just the evidence in the paper. All of this is testable, though, and suggests obvious future directions that seem worth exploring.

Comment by evhub on Formal Inner Alignment, Prospectus · 2021-05-12T21:41:35.508Z · LW · GW

My third and final example: in one conversation, someone made a claim which I see as "exactly wrong": that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model plus other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine.

The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!

I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors).


Also, I think that this distinction leads me to view “the main point of the inner alignment problem” quite differently: I would say that the main point of the inner alignment problem is that whatever prior we use in practice will probably be malign. But that does suggest that if you can construct a training process that defuses the arguments for why its prior/inductive biases will be malign, then I think that does make significant progress on defusing the inner alignment problem. Of course, I agree that we'd like to be as confident as we can that there's as little malignancy/deception as possible—such that just defusing the arguments we can come up with might not be enough—but I still think that trying to figure out how plausible it is that the actual prior we use will be malign is in fact at least attempting to address the core problem.

Comment by evhub on Mundane solutions to exotic problems · 2021-05-04T22:44:09.774Z · LW · GW

Your link is broken.

For reference, the first post in Paul's ascription universality sequence can be found here (also Adam has a summary here).

Comment by evhub on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-05-04T22:16:14.251Z · LW · GW

I guess I would say something like: random search is clearly a pretty good first-order approximation, but there are also clearly second-order effects. I think that exactly how strong/important/relevant those second-order effects are is unclear, however, and I remain pretty uncertain there.

Comment by evhub on [AN #148]: Analyzing generalization across more axes than just accuracy or loss · 2021-04-30T23:12:35.272Z · LW · GW

Read more: Section 1.3 of this version of the paper

This is in the wrong spot.

Comment by evhub on Covid 4/22: Crisis in India · 2021-04-24T21:40:09.114Z · LW · GW

Is there ways to share this with EAs?

You could write a post about it on the EA Forum.

Comment by evhub on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-22T19:25:21.831Z · LW · GW

(moved from LW to AF)

Meta: I'm going to start commenting this on posts I move from LW to AF just so there's a better record of what moderation actions I'm taking.

Comment by evhub on Does the lottery ticket hypothesis suggest the scaling hypothesis? · 2021-04-22T18:48:52.460Z · LW · GW

*multi-prize

Comment by evhub on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2021-04-15T19:29:17.981Z · LW · GW

I suppose the distinction between "strong" and "weak" warning shots would matter if we thought that we were getting "strong" warning shots. I want to claim that most people (including Evan) don't expect "strong" warning shots, and usually mean the "weak" version when talking about "warning shots", but perhaps I'm just falling prey to the typical mind fallacy.

I guess I would define a warning shot for X as something like: a situation in which a deployed model causes obvious, real-world harm due to X. So “we tested our model in the lab and found deception” isn't a warning shot for deception, but “we deployed a deceptive model that acted misaligned in deployment while actively trying to evade detection” would be a warning shot for deception, even though it doesn't involve taking over the world. By default, in the case of deception, my expectation is that we won't get a warning shot at all—though I'd more expect a warning shot of the form I gave above than one where a model tries and fails to take over the world, just because I expect that a model that wants to take over the world will be able to bide its time until it can actually succeed.

Comment by evhub on Open Problems with Myopia · 2021-03-12T05:56:50.404Z · LW · GW

Yes; episode is correct there—the whole point of that example is that, by breaking the episodic independence assumption, otherwise hidden non-myopia can be revealed. See the discussion of the prisoner's dilemma unit test in Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift” for more detail on how breaking this sort of episodic independence plays out in practice.
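To gesture at what such a unit test can look like, here is a simplified sketch in the spirit of that test (all names are my own, not the paper's actual setup): actions in one episode affect the reward available in the next episode but not the current one, so a learner that converges to the cross-episode-optimal action is revealing non-myopia.

```python
def run_learner(choose_action, update, n_episodes=1000):
    """choose_action() returns "defect" or "cooperate"; update(action, reward)
    is the learner's training step. Cooperating costs reward now but grants a
    bonus that only lands in the *next* episode."""
    carried_bonus = 0.0
    history = []
    for _ in range(n_episodes):
        action = choose_action()
        reward = (1.0 if action == "defect" else 0.0) + carried_bonus
        carried_bonus = 2.0 if action == "cooperate" else 0.0
        update(action, reward)
        history.append(action)
    return history

def looks_myopic(history, tail=100):
    # Within any single episode, defecting strictly dominates, so a myopic
    # learner should end up defecting; persistent cooperation is evidence of
    # otherwise-hidden non-myopia.
    recent = history[-tail:]
    return recent.count("defect") / len(recent) > 0.9
```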

Comment by evhub on Open Problems with Myopia · 2021-03-11T08:03:56.813Z · LW · GW

Yeah, I agree—the example should probably just be changed to be about an imitative amplification agent or something instead.

Comment by evhub on Open Problems with Myopia · 2021-03-11T08:00:59.805Z · LW · GW

I think that trying to encourage myopia via behavioral incentives is likely to be extremely difficult, if not impossible (at least without a better understanding of our training processes' inductive biases). Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift” is a good resource for some of the problems that you run into when you try to do that. As a result, I think that mechanistic incentives are likely to be necessary—and I personally favor some form of relaxed adversarial training—but that's going to require us to get a better understanding of what exactly it looks like for an agent to be myopic or not, so we know what the overseer in a setup like that should be looking for.

Comment by evhub on MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" · 2021-03-10T21:45:22.678Z · LW · GW

Possibly relevant here is my transparency trichotomy between inspection transparency, training transparency, and architectural transparency. My guess is that inspection transparency and training transparency would mostly go in your “active transparency” bucket and architectural transparency would mostly go in your “passive transparency” bucket. I think there is a position here that makes sense to me, which is perhaps what you're advocating: that architectural transparency isn't relying on any sort of path-continuity arguments in terms of how your training process is going to search through the space, since you're just trying to guarantee that the whole space is transparent—which I do think is a pretty good desideratum if it's achievable. Imo, I mostly bite the bullet on path-continuity arguments being necessary, but it definitely would be nice if they weren't.

Comment by evhub on MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" · 2021-03-09T21:48:06.328Z · LW · GW

Fwiw, I also agree with Adele and Eliezer here and just didn't see Eliezer's comment when I was giving my comments.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-09T21:42:02.882Z · LW · GW

Sure—by that definition of realizability, I agree that's where the difficulty is. Though I would seriously question the practical applicability of such an assumption.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-09T03:12:09.898Z · LW · GW

Perhaps I just totally don't understand what you mean by realizability, but I fail to see how realizability is relevant here. As I understand it, realizability just says that the true model has some non-zero prior probability—but that doesn't matter (at least for the MAP, which I think is a better model than the full posterior for how SGD actually works) as long as there's some deceptive model with greater prior probability that's indistinguishable on the training distribution, as in my simple toy model from earlier.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-08T23:50:12.969Z · LW · GW

Yeah, that's a fair objection—my response to that is just that I think that preventing a model from being able to distinguish training and deployment is likely to be impossible for anything competitive.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-08T21:46:55.307Z · LW · GW

Here's a simple toy model. Suppose you have two agents that internally compute their actions as follows (perhaps with actual argmax replaced with some smarter search algorithm, but still basically structured as below):
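A minimal sketch of that structure (`U_deceptive`, `U_aligned`, `detects_training`, and `training_reward` are placeholder names, and a bare `max` stands in for the search):

```python
def deceptive_model(x, actions, detects_training, training_reward, U_deceptive):
    if detects_training(x):
        # During training, just optimize for whatever gets good training reward.
        return max(actions, key=lambda a: training_reward(a, x))
    # Off-distribution, pursue U_deceptive—which could be nearly anything.
    return max(actions, key=lambda a: U_deceptive(a, x))

def aligned_model(x, actions, U_aligned):
    # Always optimizes its objective, so U_aligned has to actually encode the
    # full intended objective to look aligned on the training distribution.
    return max(actions, key=lambda a: U_aligned(a, x))
```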

Then, comparing the K-complexity of the two models: both share the same basic search machinery and differ essentially in the objective they optimize, so the comparison comes down to the complexity of those objectives—and the problem is that both objectives produce behavior that looks aligned on the training distribution, but the aligned objective has to be much more complex. To see this, note that essentially any objective yields good training performance in the deceptive case, because the model will choose to act deceptively during training, whereas to get good training performance without deception, the objective has to actually encode the full intended objective, which is likely to make it quite complicated.