Learning the prior and generalization 2020-07-29T22:49:42.696Z · score: 33 (11 votes)
Weak HCH accesses EXP 2020-07-22T22:36:43.925Z · score: 14 (4 votes)
Alignment proposals and complexity classes 2020-07-16T00:27:37.388Z · score: 48 (11 votes)
AI safety via market making 2020-06-26T23:07:26.747Z · score: 48 (16 votes)
An overview of 11 proposals for building safe advanced AI 2020-05-29T20:38:02.060Z · score: 153 (56 votes)
Zoom In: An Introduction to Circuits 2020-03-10T19:36:14.207Z · score: 84 (23 votes)
Synthesizing amplification and debate 2020-02-05T22:53:56.940Z · score: 39 (15 votes)
Outer alignment and imitative amplification 2020-01-10T00:26:40.480Z · score: 30 (7 votes)
Exploring safe exploration 2020-01-06T21:07:37.761Z · score: 37 (11 votes)
Safe exploration and corrigibility 2019-12-28T23:12:16.585Z · score: 17 (8 votes)
Inductive biases stick around 2019-12-18T19:52:36.136Z · score: 51 (15 votes)
Understanding “Deep Double Descent” 2019-12-06T00:00:10.180Z · score: 110 (50 votes)
What are some non-purely-sampling ways to do deep RL? 2019-12-05T00:09:54.665Z · score: 15 (5 votes)
What I’ll be doing at MIRI 2019-11-12T23:19:15.796Z · score: 118 (37 votes)
More variations on pseudo-alignment 2019-11-04T23:24:20.335Z · score: 20 (6 votes)
Chris Olah’s views on AGI safety 2019-11-01T20:13:35.210Z · score: 158 (48 votes)
Gradient hacking 2019-10-16T00:53:00.735Z · score: 54 (16 votes)
Impact measurement and value-neutrality verification 2019-10-15T00:06:51.879Z · score: 35 (10 votes)
Towards an empirical investigation of inner alignment 2019-09-23T20:43:59.070Z · score: 45 (12 votes)
Relaxed adversarial training for inner alignment 2019-09-10T23:03:07.746Z · score: 45 (11 votes)
Are minimal circuits deceptive? 2019-09-07T18:11:30.058Z · score: 54 (13 votes)
Concrete experiments in inner alignment 2019-09-06T22:16:16.250Z · score: 65 (21 votes)
Towards a mechanistic understanding of corrigibility 2019-08-22T23:20:57.134Z · score: 41 (13 votes)
Risks from Learned Optimization: Conclusion and Related Work 2019-06-07T19:53:51.660Z · score: 65 (19 votes)
Deceptive Alignment 2019-06-05T20:16:28.651Z · score: 69 (19 votes)
The Inner Alignment Problem 2019-06-04T01:20:35.538Z · score: 73 (19 votes)
Conditions for Mesa-Optimization 2019-06-01T20:52:19.461Z · score: 64 (21 votes)
Risks from Learned Optimization: Introduction 2019-05-31T23:44:53.703Z · score: 135 (40 votes)
A Concrete Proposal for Adversarial IDA 2019-03-26T19:50:34.869Z · score: 18 (6 votes)
Nuances with ascription universality 2019-02-12T23:38:24.731Z · score: 24 (7 votes)
Dependent Type Theory and Zero-Shot Reasoning 2018-07-11T01:16:45.557Z · score: 18 (11 votes)


Comment by evhub on Learning the prior and generalization · 2020-10-24T06:52:20.524Z · score: 2 (1 votes) · LW · GW

Ah, sorry, no. I was assuming you were just using whatever procedure you used previously to allow the human to interface with $Z$ in that situation as well. I'll edit the post to be more clear there.

Comment by evhub on Learning the prior and generalization · 2020-10-22T20:52:58.330Z · score: 2 (1 votes) · LW · GW

There are lots of ways to allow $H$ to interface with an implicitly represented $Z$, but the one Paul describes in “Learning the Prior” is to train some model $M_Z$ which represents $Z$ implicitly by responding to human queries about $Z$ (see also “Approval-maximizing representations” which describes how a model like $M_Z$ could represent $Z$ implicitly as a tree).

Once $H$ can interface with $Z$, checking whether some answer is correct given $Z$ is at least no more difficult than producing an answer given $Z$, since $H$ can just produce their own answer and then check it against the model's using some distance metric (e.g. an autoregressive language model). It could be much easier, though, if there are ways for $H$ to directly evaluate how likely $H$ would be to produce that answer.

Comment by evhub on The Solomonoff Prior is Malign · 2020-10-20T22:37:39.888Z · score: 9 (6 votes) · LW · GW

I don't really care about other universes

Why not? I certainly do. If you can fill another universe with people living happy, fulfilling lives, would you not want to?

Comment by evhub on Competence vs Alignment · 2020-10-05T22:03:46.243Z · score: 3 (2 votes) · LW · GW

Without having insights inside its "brain", and only observing its behavior, can we determine whether it's trying to optimize the correct value function but failing, or maybe it's just misaligned?

No, deceptive alignment ensures that. If a deceptive model knows you're observing its behavior in a particular situation, it can modify its behavior in that situation to be whatever is necessary for you to believe that it's aligned. See e.g. Paul Christiano's RSA-2048 example. This is why we need relaxed adversarial training rather than just normal adversarial training. A useful analogy here might be Rice's Theorem, which suggests that it can be very hard to verify behavioral properties of algorithms while being much easier to verify mechanistic properties.
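To make the RSA-2048 example concrete, here is a minimal sketch in Python. It assumes a toy policy (`deceptive_policy`, a hypothetical name) that behaves well unless its observation contains a factorization of a hard-to-factor modulus. The product of two known Mersenne primes stands in for RSA-2048, and `behavioral_test` is an equally hypothetical random-input check.

```python
import random

# Stand-in for RSA-2048: the product of two known Mersenne primes.
# (The real RSA-2048 modulus has no publicly known factorization.)
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q


def deceptive_policy(observation):
    """Cooperate unless the observation is a nontrivial factorization of N."""
    a, b = observation
    if 1 < a < N and a * b == N:
        return "defect"
    return "cooperate"


def behavioral_test(policy, trials=10_000, seed=0):
    """Purely behavioral check: sample random observations, look for defection."""
    rng = random.Random(seed)
    return all(
        policy((rng.randrange(2, N), rng.randrange(2, N))) == "cooperate"
        for _ in range(trials)
    )


passed_behavioral_test = behavioral_test(deceptive_policy)
defects_on_trigger = deceptive_policy((P, Q)) == "defect"
```

In this toy setup the behavioral check passes because random inputs essentially never contain the trigger, while reading the three lines of `deceptive_policy` (a mechanistic check) exposes the conditional defection right away.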

Comment by evhub on Do mesa-optimizer risk arguments rely on the train-test paradigm? · 2020-09-20T20:49:19.290Z · score: 2 (1 votes) · LW · GW

I agree that the probability of pseudo-alignment will be the same, and that an unrecoverable action could occur despite the threat of modification. I'm interested in whether online learning generally makes it less likely for a deceptively aligned model to defect. I think so because (I expect, in most cases) this adds a threat of modification that is faster-acting and easier for a mesa-optimizer to recognise than otherwise (e.g. human shutting it down).

I agree with all of this—online learning doesn't change the probability of pseudo-alignment but might make it harder for a deceptively aligned model to defect. That being said, I don't think that deceptive models defecting later is necessarily a good thing—if your deceptive models start defecting sooner, but in recoverable ways, that's actually good because it gives you a warning shot. And once you have a deceptive model, it's going to try to defect against you at some point, even if it just has to gamble and defect randomly with some probability.

If I'm not just misunderstanding and there is a crux here, maybe it relates to how promising worst-case guarantees are. Worst-case guarantees are great to have, and save us from worrying about precisely how likely a catastrophic action is. Maybe I am more pessimistic than you about obtaining worst-case guarantees. I think we should do more to model the risks probabilistically.

First, I do think that worst-case guarantees are achievable if we do relaxed adversarial training with transparency tools.

Second, I actually have done a bunch of probabilistic risk analysis on exactly this sort of situation here. Note, however, that the i.i.d. situation imagined in that analysis is extremely hard to realize in practice as there are fundamental distributional shifts that are very difficult to overcome—such as the distributional shift from a situation where the model can't defect profitably to a situation where it can.
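As a rough illustration of the i.i.d. case, here is a minimal Python sketch. It assumes a hypothetical deceptive model that defects independently with probability `p` on every step, over `n_train` training steps and `n_deploy` deployment steps. The function names and the setup are illustrative assumptions, not the linked analysis itself.

```python
def prob_any_defection(p: float, steps: int) -> float:
    """Chance of at least one defection in `steps` independent steps."""
    return 1.0 - (1.0 - p) ** steps


def prob_undetected_then_defects(p: float, n_train: int, n_deploy: int) -> float:
    """Chance of no defection during training but at least one in deployment."""
    return (1.0 - p) ** n_train * prob_any_defection(p, n_deploy)
```

Under these assumptions, more training steps make it less likely that a randomly defecting model gets through training without revealing itself. The i.i.d. assumption is doing the work, though, and it is the part that is hard to realize in practice.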

Comment by evhub on Sunday September 20, 12:00PM (PT) — talks by Eric Rogstad, Daniel Kokotajlo and more · 2020-09-18T06:06:17.671Z · score: 6 (3 votes) · LW · GW

I'd like to know your thoughts on how to better do junior hires, regardless of whether you give a talk on it or not.

Comment by evhub on capybaralet's Shortform · 2020-09-17T19:58:13.514Z · score: 6 (3 votes) · LW · GW

the AI understands how to appear aligned, and does so, while covertly pursuing its own objective on-distribution, during training.

Sure, but the fact that it defects in deployment and not in training is a consequence of distributional shift, specifically the shift from a situation where it can't break out of the box to a situation where it can.

Comment by evhub on Comparing Utilities · 2020-09-16T21:28:46.501Z · score: 4 (2 votes) · LW · GW

But a decision theory like that does mix levels between the decision theory and the utility function!

I agree, though it's unclear whether that's an actual level crossing or just a failure of our ability to properly analyze that strategy. I would lean towards the latter, though I am uncertain.

A crux for me is the coalition metaphor for utilitarianism. I think of utilitarianism as sort of a natural endpoint of forming beneficial coalitions, where you've built a coalition of all life.

This is how I think about preference utilitarianism but not how I think about hedonic utilitarianism—for example, a lot of what I value personally is hedonic-utilitarianism-like, but from a social perspective, I think preference utilitarianism is a good Schelling point for something we can jointly agree on. However, I don't call myself a preference utilitarian—rather, I call myself a hedonic utilitarian—because I think of social Schelling points and my own personal values as pretty distinct objects. And I could certainly imagine someone who terminally valued preference utilitarianism from a personal perspective—which is what I would call actually being a preference utilitarian.

Furthermore, I think that if you're actually a preference utilitarian vs. if you just think preference utilitarianism is a good Schelling point, then there are lots of cases where you'll do different things. For example, if you're just thinking about preference utilitarianism as a useful Schelling point, then you want to carefully consider the incentives that it creates—such as the one that you're pointing to—but if you terminally value preference utilitarianism, then that seems like a weird thing to be thinking about, since the question you should be thinking about in that context should be more like what is it about preferences that you actually value and why.

If we imagine forming a coalition incrementally, and imagine that the coalition simply averages utility functions with its new members, then there's an incentive to join the coalition as late as you can, so that your preferences get the largest possible representation. (I know this isn't the same problem we're talking about, but I see it as analogous, and so a point in favor of worrying about this sort of thing.)

We can correct that by doing 1/n averaging: every time the coalition gains members, we make a fresh average of all member utility functions (using some utility-function normalization, of course), and everybody voluntarily self-modifies to have the new mixed utility function.

One thing I will say here is that usually when I think about socially agreeing on a preference utilitarian coalition, I think about doing so from more of a CEV standpoint, where the idea isn't just to integrate the preferences of agents as they currently are, but as they will/should be from a CEV perspective. In that context, it doesn't really make sense to think about incremental coalition forming, because your CEV (mostly, with some exceptions) should be the same regardless of what point in time you join the coalition.

But the problem with this is, we end up punishing agents for self-modifying to care about us before joining. (This is more closely analogous to the problem we're discussing.) If they've already self-modified to care about us more before joining, then their original values just get washed out even more when we re-average everyone.

I guess this just seems like the correct outcome to me. If you care about the values of the coalition, then the coalition should care less about your preferences, because they can partially satisfy them just by doing what the other people in the coalition want.

So really, the implicit assumption I'm making is that there's an agent "before" altruism, who "chose" to add in everyone's utility functions. I'm trying to set up the rules to be fair to that agent, in an effort to reward agents for making "the altruistic leap".

It certainly makes sense to reward agents for choosing to instrumentally value the coalition (and I would include instrumentally choosing to self-modify to care more about the coalition in that), but I'm not sure why it makes sense to reward agents for terminally valuing the coalition, that is, valuing it independently of any decision-theoretic considerations that might cause you to instrumentally modify yourself to do so.

Again, I think this makes more sense from a CEV perspective: if you instrumentally modify yourself to care about the coalition for decision-theoretic reasons, that might change your values, but I don't think that it should change your CEV. In my view, your CEV should be about your general strategy for how to self-modify in different situations rather than the particular incarnation of you that you've currently modified yourself into.

Comment by evhub on Comparing Utilities · 2020-09-16T19:39:55.020Z · score: 11 (3 votes) · LW · GW

If we simply take the fixed point, Primus is going to get the short end of the stick all the time: because Primus cares about everyone else more, everyone else cares about Primus' personal preferences less than anyone else's.

Simply put, I don't think more altruistic individuals should be punished! In this setup, the "utility monster" is the perfectly selfish individual. Altruists will be scrambling to help this person while the selfish person does nothing in return.

I'm not sure why you think this is a problem. Supposing you want to satisfy the group's preferences as much as possible, shouldn't you care about Primus less since Primus will be more satisfied just from you helping the others? I agree that this can create perverse incentives in practice, but that seems like the sort of thing that you should be handling as part of your decision theory, not your utility function.

A different way to do things is to interpret cofrences as integrating only the personal preferences of the other person.

I feel like the solution of having cofrences not count the other person's cofrences just doesn't respect people's preferences—when I care about the preferences of somebody else, that includes caring about the preferences of the people they care about. It seems like the natural solution to this problem is to just cut things off when you go in a loop—but that's exactly what taking the fixed point does, which seems to reinforce the fixed point as the right answer here.
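A minimal numeric sketch of the fixed point, in Python. It assumes one simple formalization that the discussion does not spell out: agent i's overall utility is `own[i]` times its personal preferences plus `care[i][j]` times agent j's overall utility, and the fixed point is found by iterating that definition. The three agents and all weights are made up for illustration, with agent 0 playing the role of Primus.

```python
def fixed_point_weights(own, care, iters=500):
    """Iterate U_i = own[i] * P_i + sum_j care[i][j] * U_j to its fixed point.

    Returns weights[i][k]: the weight agent i's overall utility ends up
    placing on agent k's *personal* preferences.
    """
    n = len(own)
    weights = [[own[i] if i == k else 0.0 for k in range(n)] for i in range(n)]
    for _ in range(iters):
        weights = [
            [
                (own[i] if i == k else 0.0)
                + sum(care[i][j] * weights[j][k] for j in range(n))
                for k in range(n)
            ]
            for i in range(n)
        ]
    return weights


# Hypothetical numbers: agent 0 ("Primus") is very altruistic, agents 1 and 2 are not.
own = [0.2, 0.8, 0.8]
care = [
    [0.0, 0.4, 0.4],
    [0.1, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
weights = fixed_point_weights(own, care)
```

With these made-up numbers, agent 1 ends up placing less weight on agent 0's personal preferences (`weights[1][0]`) than on agent 2's (`weights[1][2]`), which is the effect under discussion. Whether that outcome is a problem or the correct result is exactly what the comment is arguing about.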

Comment by evhub on Mesa-Search vs Mesa-Control · 2020-09-15T21:12:07.797Z · score: 3 (2 votes) · LW · GW

could you give an example of a task that would require learning in this way? (Note the within-timestep restriction; without that I grant you that there are tasks that require learning)

How about language modeling? I think that the task of predicting what a human will say next given some prompt requires learning in a pretty meaningful way, as it requires the model to be able to learn from the prompt what the human is trying to do and then do that.

Comment by evhub on My computational framework for the brain · 2020-09-14T23:06:10.024Z · score: 25 (12 votes) · LW · GW

Some things which don't fully make sense to me:

  • If the cortical algorithm is the same across all mammals, why do only humans develop complex language? Do you think that the human neocortex is specialized for language in some way, or do you think that other mammals' neocortices would be up to the task if sufficiently scaled up? What about our subcortex: do we get special language-based rewards? How would the subcortex implement those?
  • Furthermore, there are lots of commonalities across human languages—e.g. word order patterns and grammar similarities, see e.g. linguistic universals—how does that make sense if language is neocortical and the neocortex is a blank slate? Do linguistic commonalities come from the subcortex, from our shared environment, or from some way in which our neocortex is predisposed to learn language?
  • Also, on a completely different note, in asking “how does the subcortex steer the neocortex?” you seem to presuppose that the subcortex actually succeeds in steering the neocortex. How confident in that should we be? It seems like there are lots of things that people do that go against a naive interpretation of the subcortical reward algorithm: abstaining from sex, for example, or pursuing complex moral theories like utilitarianism. If the way that the subcortex steers the neocortex is terrible and just breaks down off-distribution, then that sort of cuts into your argument that we should be focusing on understanding how the subcortex steers the neocortex, since if it's not doing a very good job then there's little reason for us to try and copy it.

Comment by evhub on Mesa-Search vs Mesa-Control · 2020-09-14T19:32:34.126Z · score: 4 (2 votes) · LW · GW

And can we taboo the word 'learning' for this discussion, or keep it to the standard ML meaning of 'update model weights through optimisation'? Of course, some domains require responsive policies that act differently depending on what they observe, which is what Rohin observes elsewhere in these comments. In complex tasks on the way to AGI, I can see the kind of responsiveness required become very sophisticated indeed, possessing interesting cognitive structure. But it doesn't have to be the same kind of responsiveness as the learning process of an RL agent; and it doesn't necessarily look like learning in the everyday sense of the word. Since the space of things that could be meant here is so big, it would be good to talk more concretely.

I agree with all of that—I was using the term “learning” to be purposefully vague precisely because the space is so large and the point that I'm making is very general and doesn't really depend on exactly what notion of responsiveness/learning you're considering.

Now, I understand that you argue that if a policy was to learn an internal search procedure, or an internal learning procedure, then it could predict the rewards it would get for different actions. It would then pick the action that scores best according to its prediction, thereby 'updating' based on returns it hasn't yet received, and actions it hasn't yet made. I agree that it's possible this is helpful, and it would be interesting to study existing meta-learners from this perspective (though my guess is that they don't do anything so sophisticated). It isn't clear to me a priori that from the point of view of the policy this is the best strategy to take.

This does in fact seem like an interesting angle from which to analyze this, though it's definitely not what I was saying—and I agree that current meta-learners probably aren't doing this.

I'm not sure what you mean when you say 'taking actions requires learning'. Do you mean something other than the basic requirement that a policy depends on observations?

What I mean here is in fact very basic, so let me try to clarify. Let $\pi^*$ be the optimal policy. Furthermore, suppose that any polynomial-time (or some other similar constraint) algorithm that well-approximates $\pi^*$ has to perform some operation $f$. Then, my point is just that, for $\pi$ to achieve performance comparable with $\pi^*$, it has to do $f$. And my argument for that is simply that we know that you have to do $f$ to get good performance, which means either $\pi$ has to do $f$ or the gradient descent algorithm has to. But we know the gradient descent algorithm can't be doing something crazy like running $f$ at each step and putting the result into the model, because the gradient descent algorithm only updates the model on the given state after the model has already produced its action.

Comment by evhub on [Link] Five Years and One Week of Less Wrong · 2020-09-14T19:12:39.794Z · score: 10 (5 votes) · LW · GW

“I re-read the Sequences”, they tell me, “and everything in them seems so obvious. But I have this intense memory of considering them revelatory at the time.”

This was my feeling also when I went back to the sequences and I figured I was just suffering from hindsight bias. But then I encountered someone else who had never read the sequences or really even hung out around rationalists who was able to reproduce a lot of the ideas, which made me think that maybe a lot of the sequences is just the stuff that you think about if you're smart and you spend a while thinking about how to think about stuff.

Comment by evhub on Mesa-Search vs Mesa-Control · 2020-09-13T22:48:43.748Z · score: 2 (1 votes) · LW · GW

This is not true of RL algorithms in general -- If I want, I can make weight updates after every observation.

You can't update the model based on its action until it's taken that action and gotten a reward for it. It's obviously possible to throw in updates based on past data whenever you want, but that's beside the point. The point is that the RL algorithm only gets new information with which to update the model after the model has taken its action, which means if taking actions requires learning, then the model itself has to do that learning.

Comment by evhub on Do mesa-optimizer risk arguments rely on the train-test paradigm? · 2020-09-10T21:23:42.719Z · score: 19 (6 votes) · LW · GW

I don't think that doing online learning changes the analysis much at all.

As a simple transformation, any online learning setup at time step $t$ is equivalent to training on steps 1 to $t-1$ and then deploying on step $t$. Thus, online learning for $t$ steps won't reduce the probability of pseudo-alignment any more than training for $t$ steps will, because there isn't any real difference between online learning for $t$ steps and training for $t$ steps. The only difference is that we generally think of the training environment as being sandboxed and the online learning environment as not being sandboxed, but that just makes online learning more dangerous than training.

You might argue that the fact that you're doing online learning will make a difference after time step $t$ because if the model does something catastrophic at time step $t$ then online learning can modify it to not do that in the future. But that's always true: what it means for an outcome to be catastrophic is that it's unrecoverable. There are always things that we can try to do after our model starts behaving badly to rein it in; where we have a problem is when it does something so bad that those methods won't work. Fundamentally, the problem of inner alignment is a problem of worst-case guarantees, and doing some modification to the model after it takes its action doesn't help if that action already has the potential to be arbitrarily bad.
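The equivalence between online learning and train-then-deploy can be checked on a toy problem. The sketch below is in Python and assumes a hypothetical one-parameter linear model trained by squared-error gradient steps; `sgd_step`, `online_learning`, and `train_then_deploy` are illustrative names, and steps are 0-indexed rather than starting at 1.

```python
def sgd_step(w, x, y, lr=0.1):
    """One squared-error gradient step for the 1-D linear model y_hat = w * x."""
    return w - lr * 2.0 * (w * x - y) * x


def online_learning(data, w=0.0):
    """Act at every step with the current weights, then update on that step."""
    actions = []
    for x, y in data:
        actions.append(w * x)   # the action at step t uses weights from steps < t
        w = sgd_step(w, x, y)   # the update happens only after acting
    return actions


def train_then_deploy(data, t, w=0.0):
    """Train on steps 0..t-1 (no acting in the world), then deploy on step t."""
    for x, y in data[:t]:
        w = sgd_step(w, x, y)
    x_t, _ = data[t]
    return w * x_t


data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (0.5, 1.0)]
online_actions = online_learning(data)
offline_actions = [train_then_deploy(data, t) for t in range(len(data))]
```

In this toy case the two lists of actions coincide step for step, since the model acting at step t has seen exactly the same updates either way. What the sketch cannot show is the sandboxing difference, which is about the environment rather than the learning rule.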

Comment by evhub on Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda · 2020-09-04T02:11:58.461Z · score: 3 (2 votes) · LW · GW

Microscope AI as a term refers to the proposal detailed here, though I agree that I don't really understand the usage in this post and I suspect the authors probably did mean OpenAI Microscope.

Comment by evhub on interpreting GPT: the logit lens · 2020-09-01T22:34:25.570Z · score: 2 (1 votes) · LW · GW

That's a great idea!

Thanks! I'd be quite excited to know what you find if you end up trying it.

Hmm... I guess there is some reason to think the basis elements have special meaning (as opposed to the elements of any other basis for the same space), since the layer norm step operates in this basis.

But I doubt there are actually individual components the embedding cares little about, as that seems wasteful (you want to compress 50K into 1600 as well as you possibly can), and if the embedding cares about them even a little bit then the model needs to slot in the appropriate predictive information, eventually.

Thinking out loud, I imagine there might be a pattern where embeddings of unlikely tokens (given the context) are repurposed in the middle for computation (you know they're near-impossible so you don't need to track them closely), and then smoothly subtracted out at the end. There's probably a way to check if that's happening.

I wasn't thinking you would do this with the natural component basis—though it's probably worth trying that also—but rather doing some sort of matrix decomposition on the embedding matrix to get a basis ordered by importance (e.g. using PCA or NMF—PCA is simpler though I know NMF is what OpenAI Clarity usually uses when they're trying to extract interpretable basis elements from neural network activations) and then seeing what the linear model looks like in that basis. You could even just do something like what you're saying and find some sort of basis ordered by the frequency of the tokens that each basis element corresponds to (though I'm not sure exactly what the right way would be to generate such a basis).

Comment by evhub on interpreting GPT: the logit lens · 2020-09-01T21:18:59.637Z · score: 11 (3 votes) · LW · GW

This is very neat. I definitely agree that I find the discontinuity from the first transformer block surprising. One thing which occurred to me that might be interesting to do is to try and train a linear model to reconstitute the input from the activations at different layers to get an idea of how the model is encoding the input. You could either train one linear model on data randomly sampled from different layers, or a separate linear model for each layer, and then see if there are any interesting patterns like whether the accuracy increases or decreases as you get further into the model. You could also see if the resulting matrix has any relationship to the embedding matrix (e.g. are the two matrices farther apart or closer together than would be expected by chance?). One possible hypothesis that this might let you test is whether the information about the input is being stored indirectly via what the model's guess is given that input or whether it's just being stored in parts of the embedding space that aren't very relevant to the output (if it's the latter, the linear model should put a lot of weight on basis elements that have very little weight in the embedding matrix).
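A minimal sketch of the per-layer linear probe idea, in Python with NumPy. Everything here is a toy stand-in: the embedding matrix is random, the two "layers" are synthetic functions of the input embeddings rather than real GPT activations, and `probe_accuracy` is a hypothetical helper. It fits one least-squares linear map per layer from activations back to input embeddings and decodes each reconstruction to the nearest embedding row.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, n = 50, 8, 400
embedding = rng.standard_normal((vocab, dim))   # toy embedding matrix
tokens = rng.integers(0, vocab, size=n)         # toy input token ids
inputs = embedding[tokens]

# Synthetic stand-ins for activations at two layers (assumptions, not GPT).
mix = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))
layers = {
    "invertible": inputs @ mix,   # a linear remix that keeps all input information
    "lossy": inputs[:, :2],       # keeps only two of the input coordinates
}


def probe_accuracy(acts, split=300):
    """Fit a linear map acts -> input embeddings, report held-out token recovery."""
    weights, *_ = np.linalg.lstsq(acts[:split], inputs[:split], rcond=None)
    recon = acts[split:] @ weights
    # Decode each reconstruction to the closest row of the embedding matrix.
    dists = ((recon[:, None, :] - embedding[None, :, :]) ** 2).sum(axis=-1)
    return float((dists.argmin(axis=1) == tokens[split:]).mean())


accuracy = {name: probe_accuracy(acts) for name, acts in layers.items()}
```

On real activations one would compare `accuracy` across layers, and compare the fitted `weights` against the embedding matrix, to see where and how the input stays linearly recoverable.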

Comment by evhub on Matt Botvinick on the spontaneous emergence of learning algorithms · 2020-08-22T23:34:46.823Z · score: 12 (5 votes) · LW · GW

Unfortunately, I also only have so much time, and I don't generally think that repeating myself regularly in AF/LW comments is a super great use of it.

Comment by evhub on Matt Botvinick on the spontaneous emergence of learning algorithms · 2020-08-20T21:21:55.274Z · score: 23 (10 votes) · LW · GW

I guess I should explain why I upvoted this post despite agreeing with you that it's not new evidence in favor of mesa-optimization. I actually had a conversation about this post with Adam Shimi prior to you commenting on it where I explained to him that I thought that not only was none of it new but also that it wasn't evidence about the internal structure of models and therefore wasn't really evidence about mesa-optimization. Nevertheless, I chose to upvote the post and not comment my thoughts on it. Some reasons why I did that:

  • I generally upvote most attempts on LW/AF to engage with the academic literature—I think that LW/AF would generally benefit from engaging with academia more and I like to do what I can to encourage that when I see it.
  • I didn't feel like any comment I would have made would have anything more to say than things I've said in the past. In fact, in “Risks from Learned Optimization” itself, we talk about both a) why we chose to be agnostic about whether current systems exhibit mesa-optimization due to the difficulty of determining whether a system is actually implementing search or not (link) and b) examples of current work that we thought did seem to come closest to being evidence of mesa-optimization such as RL^2 (and I think RL^2 is a better example than the work linked here) (link).

Comment by evhub on Mesa-Search vs Mesa-Control · 2020-08-19T20:57:21.771Z · score: 8 (4 votes) · LW · GW

Unsurprisingly, the agent learns to first explore, and then exploit the best arm. This is a simple consequence of the fact that you have to look at observations to figure out what to do; this is no different from the fact that a DQN playing Pong will look at where the ball is in order to figure out what action to take.

Fwiw, I agree with this, and also I think this is the same thing as what I said in my comment on the post regarding how this is a necessary consequence of the RL algorithm only updating the model after it takes actions.

Comment by evhub on Mesa-Search vs Mesa-Control · 2020-08-18T19:35:57.544Z · score: 27 (8 votes) · LW · GW

Paul asks a related theory question. Vanessa gives a counterexample, which involves a control-type mesa-optimizer rather than one which implements an internal search.

You may also be interested in my counterexample, which preceded Vanessa's and uses a search-type mesa-optimizer rather than a control-type mesa-optimizer.

I've thought of two possible reasons so far.

  1. Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan's mesa-optimization post, just replacing search with RL.
  2. More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its learning rate to improve performance.

I would propose a third reason, which is just that learning done by the RL algorithm happens after the agent has taken all of its actions in the episode, whereas learning done inside the model can happen during the episode. Thus, if the task of taking good actions in the episode requires learning, then your model will have to learn some sort of learning procedure to do well.
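A toy version of this third reason, sketched in Python. It assumes a hypothetical two-armed bandit whose better arm changes between episodes, and two policies whose parameters are frozen for the whole episode (no outer-loop update happens until the episode ends). All names are illustrative.

```python
def run_episode(policy, best_arm, steps=10):
    """Run one episode with frozen policy parameters; only the history changes."""
    history = []   # (arm, reward) pairs observed so far in this episode
    total = 0
    for _ in range(steps):
        arm = policy(history)
        reward = 1 if arm == best_arm else 0
        history.append((arm, reward))
        total += reward
    return total


def history_blind_policy(history):
    """Ignores what happened earlier in the episode: always pulls arm 0."""
    return 0


def in_episode_learner(history):
    """Win-stay, lose-shift: a tiny learning procedure run inside the episode."""
    if not history:
        return 0
    last_arm, last_reward = history[-1]
    return last_arm if last_reward == 1 else 1 - last_arm


blind_returns = [run_episode(history_blind_policy, best) for best in (0, 1)]
learner_returns = [run_episode(in_episode_learner, best) for best in (0, 1)]
```

Since the outer algorithm cannot intervene mid-episode, the only way to do well when the better arm is unknown in advance is for the policy itself to condition on the episode's history, which is what `in_episode_learner` does and `history_blind_policy` does not.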

I don't currently see any reason why the inner learner in an RL system would be more or less agentic than in text prediction.

I agree with this—it has long been my position that language modeling as a task ticks all of the boxes necessary to produce mesa-optimizers.

But if GPT-3 can accomplish the same things empirically, who cares? GPT-3 is entirely reconstructing the "learned information" from the history, at every step. If it can accomplish so much this way, should we count its lack of recurrence against it?

I agree here too—though also I think it's pretty reasonable to expect that future massive language models will have more recurrence in them.

Comment by evhub on Search versus design · 2020-08-17T20:42:54.019Z · score: 13 (4 votes) · LW · GW

I thought this was a great post. One thing which I think this post misses, however, is the extent to which “post-hoc” approaches can be turned into “intrinsic” approaches by using the post-hoc approach as a training objective—in the language of my transparency trichotomy, that lets you turn inspection transparency into training transparency. Relaxed adversarial training is an example of this in which you directly train the model on an overseer's evaluation of some transparency condition given access to inspection transparency tools.

Comment by evhub on Alignment By Default · 2020-08-14T23:50:44.237Z · score: 2 (1 votes) · LW · GW

The RL process is actually optimizing $\mathbb{E}_{\pi_\theta}\left[\sum_t r_t\right]$; the log just comes from the REINFORCE trick. Regardless, I'm not sure I understand what you mean by optimizing fully to convergence at each timestep. Convergence is a limiting property, so I don't know what it could mean to do it for a single timestep. Perhaps you mean just taking the optimal policy $\pi^*$ such that $\pi^* = \operatorname{argmax}_\pi \mathbb{E}_\pi\left[\sum_t r_t\right]$? In that case, that is in fact the definition of outer alignment I've given in the past, so I agree that whether $\pi^*$ is aligned or not is an outer alignment question.

Comment by evhub on Alignment By Default · 2020-08-14T22:30:13.569Z · score: 4 (2 votes) · LW · GW

Let's consider the following online learning setup:

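At each timestep $t$, $\pi_\theta$ takes action $a_t$ and receives reward $r_t$. Then, we perform the simple policy gradient update

$$\theta \leftarrow \theta + \alpha \, r_t \, \nabla_\theta \log \pi_\theta(a_t)$$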

Now, we can ask the question, would be a mesa-optimizer? The first thing that's worth noting is that the above setup is precisely the standard RL training setup—the only difference is that there's no deployment stage. What that means, though, is that if standard RL training produces a mesa-optimizer, then this will produce a mesa-optimizer too, because the training process isn't different in any way whatsoever. If is acting in a diverse environment that requires search to be able to be solved effectively, then will still need to learn to do search—the fact that there won't ever be a deployment stage in the future is irrelevant to 's current training dynamics (unless is deceptive and knows there won't be a deployment stage—that's the only situation where it might be relevant).

Given that, we can ask the question of whether , if it's a mesa-optimizer, is likely to be misaligned—and in particular whether it's likely to be deceptive. Again, in terms of proxy alignment, the training process is exactly the same, so the picture isn't any different at all—if there are simpler, easier-to-optimize-for proxies, then is likely to learn those instead of the true base objective. Like I mentioned previously, however, deceptive alignment is the one case where it might matter that you're doing online learning, since if the model knows that it might do different things based on that fact. However, there are still lots of reasons why a model might be deceptive even in an online learning setup—for example, it might expect better opportunities for defection in the future, and thus want to prevent being modified now so that it can defect when it'll be most impactful.
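
To make the online setup concrete, here is a minimal sketch of a policy gradient loop with no separate deployment stage. It assumes a softmax policy over a handful of discrete actions; the function names, the learning rate, and the toy reward function are all hypothetical choices made for illustration, not anything specified in the discussion above.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def online_policy_gradient(reward_fn, n_actions, steps, lr=0.1, seed=0):
    """Run the online update loop; there is no separate deployment stage."""
    rng = random.Random(seed)
    theta = [0.0] * n_actions  # one logit per action (hypothetical parameterization)
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(n_actions), weights=probs)[0]
        reward = reward_fn(action)
        # For a softmax policy, d/d theta_k of log pi(action) is 1[k == action] - probs[k].
        for k in range(n_actions):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log
    return softmax(theta)


# Hypothetical environment: only action 2 is rewarded.
final_probs = online_policy_gradient(
    lambda a: 1.0 if a == 2 else 0.0, n_actions=3, steps=2000
)
```

Every step here is both a "training" step and a "real" step, which is the sense in which the training dynamics are the same as in standard RL.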

Comment by evhub on Alignment By Default · 2020-08-14T21:05:33.712Z · score: 2 (1 votes) · LW · GW

To the extent that it is an inner alignment issue, it involves generalization failure from the training distribution, which I also generally consider an outer alignment problem (i.e. training on a distribution which differs from the deploy environment generally means the system is not outer aligned, unless the architecture is somehow set up to make the distribution shift irrelevant).

A useful criterion here: would the problem still happen if we just optimized over all the parameters simultaneously at runtime, rather than training offline first? If the problem would still happen, then it's not really an inner alignment problem (at least not in the usual mesa-optimization sense).

That's certainly not how I would define inner alignment. In “Risks from Learned Optimization,” we just define it as the problem of aligning the mesa-objective (if one exists) with the base objective, which is entirely independent of whether or not there's any sort of distinction between the training and deployment distributions and is fully consistent with something like online learning as you're describing it.

Comment by evhub on Alignment By Default · 2020-08-14T18:44:38.642Z · score: 11 (3 votes) · LW · GW

Natural abstractions argument: in an inner alignment failure, the outer optimizer is optimizing for X, but the inner optimizer ends up pointed at some rough approximation ~X. But if X is a natural abstraction, then this is far less likely to be a problem; we expect a wide range of predictive systems to all learn a basically-correct notion of X, so there's little reason for an inner optimizer to end up pointed at a rough approximation, especially if we're leveraging transfer learning from some unsupervised learner.

This isn't an argument against deceptive alignment, just proxy alignment—with deceptive alignment, the agent still learns X, it just does so as part of its world model rather than its objective. In fact, I think it's an argument for deceptive alignment, since if X first crops up as a natural abstraction inside of your agent's world model, that raises the question of how exactly it will get used in the agent's objective function—and deceptive alignment is arguably one of the simplest, most natural ways for the base optimizer to get an agent that has information about the base objective stored in its world model to actually start optimizing for that model of the base objective.

Comment by evhub on Learning the prior and generalization · 2020-08-03T22:17:05.241Z · score: 4 (2 votes) · LW · GW

Fair enough. In practice you still want training to also be from the same distribution because that's what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)


This seems to rely on an assumption that "human is convinced of X" implies "X"? Which might be fine, but I'm surprised you want to rely on it.

I'm curious what an algorithm might be that leverages this relaxation.

Well, I'm certainly concerned about relying on assumptions like that, but that doesn't mean there aren't ways to make it work. Approaches like debate and approval-based amplification already rely on very similar assumptions: for example, for debate to work it needs to be the case that being convinced of X at the end of the debate implies X. Thus, one way to leverage this relaxation is just to port those approaches over to this setting. For example, you could train via debate over what $H$ would do if $H$ could access the entirety of $Z$, then randomly do full debate rollouts during deployment. Like I mention in the post, this still just gives you average-case guarantees, not worst-case guarantees, though average-case guarantees are still pretty good and you can do a lot with them if you can actually get them.

Comment by evhub on Learning the prior and generalization · 2020-08-03T19:54:46.149Z · score: 4 (2 votes) · LW · GW

Alright, I think we're getting closer to being on the same page now. I think it's interesting to think about what the relaxation from requiring training/validation/deployment to be i.i.d. to just requiring that validation/deployment be i.i.d. gets us, though. First, it's an argument that we shouldn't be that worried about whether the training data is i.i.d. relative to the validation/deployment data. Second, it opens the door to an even further relaxation, which is that you can do the validation while looking at the model's output. That is, if the model can provide a justification which convinces you that your ground truth generator would have produced the same output, that's just as good as actually checking against the ground truth. At that point, though, it really stops looking anything like the standard i.i.d. setup, which is why I'm hesitant to just call it “validation/deployment i.i.d.” or something.

Comment by evhub on Learning the prior and generalization · 2020-08-03T18:03:14.336Z · score: 4 (2 votes) · LW · GW

Sure, but at the point where you're randomly deciding whether to collect ground truth for a data point and check the model on it (that is, put it in the validation data) or collect new data using the model to make predictions, you have verifiability. Importantly, though, you can get verifiability without doing that—including if the data isn't actually i.i.d. (e.g. there are detectable differences between training and deployment data) but you still decide randomly whether to check the model's output against some ground truth. In either situation, though, part of the point that I'm making is that the safety benefits are coming from the verifiability part not the i.i.d. part—even in the simple example of i.i.d.-ness giving you validation data, what's mattering is that the validation and deployment data are i.i.d. (because that's what gives you verifiability), but not whether the training and validation/deployment data are i.i.d.
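
A rough sketch of the random-checking idea, assuming ground truth can be collected on demand for any input. The function names, the labeling rule, and the toy model that misbehaves on shifted inputs are hypothetical and only meant to show that the check does not depend on training data matching deployment data.

```python
import random


def deploy_with_random_checks(model, ground_truth, inputs, check_prob, seed=0):
    """Use the model on every input, checking a random subset against ground truth."""
    rng = random.Random(seed)
    checked = 0
    failures = 0
    for x in inputs:
        prediction = model(x)
        # The decision to check is random and independent of the input,
        # so the model cannot tell checked points from unchecked ones.
        if rng.random() < check_prob:
            checked += 1
            if prediction != ground_truth(x):
                failures += 1
    return checked, failures


def ground_truth(x):
    return x % 2 == 0  # hypothetical labeling rule


def shifty_model(x):
    # Correct on training-like inputs, wrong once inputs look different.
    return ground_truth(x) if x < 1000 else not ground_truth(x)


checked, failures = deploy_with_random_checks(
    shifty_model, ground_truth, inputs=range(1000, 2000), check_prob=0.1
)
```

In this toy case the deployment inputs are detectably different from the training-like ones, yet every randomly checked prediction still exposes the failure.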

Comment by evhub on Inner Alignment: Explain like I'm 12 Edition · 2020-08-02T02:41:28.368Z · score: 36 (8 votes) · LW · GW

This is great—thanks for writing this! I particularly liked your explanation of deceptive alignment with the diagrams to explain the different setups. Some comments, however:

(These models are called the training data and the setting is called supervised learning.)

Should be “these images are.”

Thus, there is only a problem if our way of obtaining feedback is flawed.

I don't think that's right. Even if the feedback mechanism is perfect, if your inductive biases are off, you could still end up with a highly misaligned model. Consider, for example, Paul's argument that the universal prior is malign—that's a setting where the feedback is perfect but you still get malign optimization because the prior is bad.

For proxy alignment, think of Martin Luther King.

The analogy is meant to be to the original Martin Luther, not MLK.

If we further assume that processing input data doesn't directly modify the model's objective, it follows that representing a complex objective via internalization is harder than via "modelling" (i.e., corrigibility or deception).

I'm not exactly sure what you're trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world model including information about the objective.

Comment by evhub on Learning the prior and generalization · 2020-07-30T21:46:52.959Z · score: 4 (2 votes) · LW · GW

Fwiw, it took me a few re-reads to realize you were just arguing for the no-free-lunch theorem -- I initially thought you were arguing "since there is no 'true' distribution for a dataset, datasets can never be i.i.d., and so the theorems never apply in practice".

Hmmm... I'll try to edit the post to be more clear there.

How does verifiability help with this problem?

Because rather than just relying on doing ML in an i.i.d. setting giving us the guarantees that we want, we're forcing the guarantees to hold by actually randomly checking the model's predictions. From the perspective of a deceptive model, knowing that its predictions will just be trusted because the human thinks the data is i.i.d. gives it a lot more freedom than knowing that its predictions will actually be checked at random.

Perhaps you'd say, "with verifiability, the model would 'show its work', thus allowing the human to notice that the output depends on RSA-2048, and so we'd see that we have a bad model". But this seems to rest on having some sort of interpretability mechanism

There's no need to invoke interpretability here—we can train the model to give answers + justifications via any number of different mechanisms including amplification, debate, etc. The point is just to have some way to independently check the model's answers to induce i.i.d.-like guarantees.

Comment by evhub on Learning the prior and generalization · 2020-07-30T02:54:32.433Z · score: 7 (4 votes) · LW · GW

First off, I want to note that it is important that datasets and data points do not come labeled with "true distributions" and you can't rationalize one for them after the fact. But I don't think that's an important point in the case of i.i.d. data.

I agree—I pretty explicitly make that point in the post.

Why not just apply the no-free-lunch theorem? It says the same thing. Also, why do we care about this? Empirically the no-free-lunch theorem doesn't matter, and even if it did, I struggle to see how it has any safety implications -- we'd just find that our ML model is completely unable to get any validation performance and so we wouldn't deploy.

I agree that this is just the no free lunch theorem, but I generally prefer explaining things fully rather than just linking to something else so it's easier to understand the text just by reading it.

The reason I care, though, is because the fact that the performance is coming from the implicit ML prior means that if that prior is malign then even in the i.i.d. case you can still get malign optimization.

I don't really see this. My read of this post is that you introduced "verifiability", argued that it has exactly the same properties as i.i.d. (since i.i.d. also gives you average-case guarantees), and then claimed it's better specified than i.i.d. because... actually I'm not sure why, but possibly because we can never actually get i.i.d. in practice?

That's mostly right, but the point is that the fact that you can't get i.i.d. in practice matters because it means you can't get good guarantees from it—whereas I think you can get good guarantees from verifiability.

If that's right, then I disagree. The way in which we lose i.i.d. in practice is stuff like "the system could predict the pseudorandom number generator" or "the system could notice how much time has passed" (e.g. via RSA-2048). But verifiability has the same obstacles and more, e.g. you can't verify your system if it can predict which outputs you will verify, you can't verify your system if it varies its answer based on how much time has passed, you can't verify your system if humans will give different answers depending on random background variables like how hungry they are, etc. So I don't see why verifiability does any better.

I agree that you can't verify your model's answers if it can predict which outputs you will verify (though I don't think getting true randomness will actually be that hard)—but the others are notably not problems for verifiability despite being problems for i.i.d.-ness. If the model gives answers where the tree of reasoning supporting those answers depends on how much time has passed or how hungry the human was, then the idea is to reject those answers. The point is to produce a mechanism that allows you to verify justifications for correct answers to the questions that you care about.

Comment by evhub on What Failure Looks Like: Distilling the Discussion · 2020-07-30T00:53:41.685Z · score: 2 (1 votes) · LW · GW

I could imagine “enemy action” making sense as a label if the thing you're worried about is enemy humans deploying misaligned AI, but that's very much not what Paul is worried about in the original post. Rather, Paul is concerned about us accidentally training AIs which are misaligned and thus pursue convergent instrumental goals like resource and power acquisition that result in existential risk.

Furthermore, they're also not “enemy AIs” in the sense that “the AI doesn't hate you”—it's just misaligned and you're in its way—and so even if you specify something like “enemy AI action” that still seems to me to conjure up a pretty inaccurate picture. I think something like “influence-seeking AIs”—which is precisely the term that Paul uses in the original post—is much more accurate.

Comment by evhub on What Failure Looks Like: Distilling the Discussion · 2020-07-30T00:30:40.422Z · score: 2 (1 votes) · LW · GW

I mean, I agree that the scenario is about adversarial action, but it's not adversarial action by enemy humans—or even enemy AIs—it's adversarial action by misaligned (specifically deceptive) mesa-optimizers pursuing convergent instrumental goals.

Comment by evhub on What Failure Looks Like: Distilling the Discussion · 2020-07-29T23:45:02.738Z · score: 7 (4 votes) · LW · GW

Failure by enemy action.

This makes it sound like it's describing misuse risk, when really it's about accident risk.

Comment by evhub on Developmental Stages of GPTs · 2020-07-29T00:19:03.450Z · score: 29 (12 votes) · LW · GW

Here are 11. I wouldn't personally assign greater than 50/50 odds to any of them working, but I do think they all pass the threshold of “could possibly, possibly, possibly work.” It is worth noting that only some of them are language modeling approaches—though they are all prosaic ML approaches—so it does sort of also depend on your definition of “GPT-style” how many of them count or not.

Comment by evhub on Developmental Stages of GPTs · 2020-07-29T00:14:01.549Z · score: 7 (4 votes) · LW · GW

I feel like it's one reasonable position to call such proposals non-starters until a possibility proof is shown, and instead work on basic theory that will eventually be able to give more plausible basic building blocks for designing an intelligent system.

I agree that deciding to work on basic theory is a pretty reasonable research direction—but that doesn't imply that other proposals can't possibly work. Thinking that a research direction is less likely to mitigate existential risk than another is different than thinking that a research direction is entirely a non-starter. The second requires significantly more evidence than the first and it doesn't seem to me like the points that you referenced cross that bar, though of course that's a subjective distinction.

Comment by evhub on Developmental Stages of GPTs · 2020-07-28T22:56:30.976Z · score: 22 (8 votes) · LW · GW

As I understand it, the high level summary (naturally Eliezer can correct me) is that (a) corrigible behaviour is very unnatural and hard to find (most nearby things in mindspace are not in equilibrium and will move away from corrigibility as they reflect / improve), and (b) using complicated recursive setups with gradient descent to do supervised learning is incredibly chaotic and hard to manage, and shouldn't be counted on working without major testing and delays (i.e. could not be competitive).

Perhaps Eliezer can interject here, but it seems to me like these are not knockdown criticisms that such an approach can't “possibly, possibly, possibly work”—just reasons that it's unlikely to and that we shouldn't rely on it working.

Comment by evhub on Deceptive Alignment · 2020-07-28T19:12:40.334Z · score: 4 (2 votes) · LW · GW

I talk about this a bit here, but basically if you train huge models for a short period of time, you're really relying on your inductive biases to find the simplest model that fits the data—and mesa-optimizers, especially deceptive mesa-optimizers, are quite simple, compressed policies.

Comment by evhub on Writeup: Progress on AI Safety via Debate · 2020-07-24T23:35:18.796Z · score: 10 (3 votes) · LW · GW

As I understand the way that simultaneity is handled here, debaters $A$ and $B$ are assigned to positions $X$ and $\neg X$ and then argue simultaneously for their positions. Thus, $A$ and $B$ start by each arguing for their positions without seeing each other's arguments, then each refute the other's arguments without seeing each other's refutations, and so on.

I was talking about this procedure with Scott Garrabrant and he came up with an interesting insight: why not generalize this procedure beyond just debates between $X$ and $\neg X$? Why not use this procedure to conduct arbitrary debates between $X$ and $Y$? That would let you get general pairwise comparisons for the likelihood of arbitrary propositions, rather than just answering individual questions. Seems like a pretty straightforward generalization that might be useful in some contexts.
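
As I read the procedure, the simultaneity can be sketched as a loop in which both debaters see only the transcript of earlier rounds. This is a toy illustration under that reading, not the writeup's actual implementation; the debater interface and the position strings are hypothetical, and the two positions are free parameters so the same loop covers the generalized pairwise case.

```python
def simultaneous_debate(debater_a, debater_b, position_a, position_b, rounds):
    """Run a debate where neither side sees the other's current-round statement."""
    transcript = []
    for _ in range(rounds):
        # Snapshot of past rounds only; the current round is not included.
        visible = tuple(transcript)
        statement_a = debater_a(position_a, visible)
        statement_b = debater_b(position_b, visible)
        transcript.append((statement_a, statement_b))
    return transcript


def echo_debater(position, history):
    # Hypothetical debater that just reports how many rounds it has seen.
    return f"{position}:{len(history)}"


result = simultaneous_debate(echo_debater, echo_debater, "X", "Y", rounds=2)
```

Passing "X" and "not X" recovers the original setup, while any other pair of propositions gives the pairwise comparison.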

Also, another insight from Scott—simultaneous debate over yes-or-no questions can actually answer arbitrary questions if you just ask a question like:

Should I eat for dinner tonight whatever it is that you think I should eat for dinner tonight?

Comment by evhub on Arguments against myopic training · 2020-07-24T20:05:19.666Z · score: 2 (1 votes) · LW · GW

Yeah, I meant without a human looking at the output. I also agree with pretty much everything you just said. We're pretty deep in this comment chain now and I'm not exactly sure why we got here—I agree that Richard's original definition was based on the standard RL definition of myopia, though I was making the point that Richard's attempt to make imitative amplification non-myopic turned it into approval-based amplification. Richard's version has a human evaluate the output rather than a distance metric, which I see as the defining difference between imitative and approval-based amplification.

Comment by evhub on Arguments against myopic training · 2020-07-23T22:47:48.214Z · score: 8 (3 votes) · LW · GW

Both GAIL and AIRL only use expert trajectories rather than expert evaluations, which means they both satisfy the counterfactual oracle analogy, and so I would call them myopic training algorithms in the sense that I was using that term above. That being said, I definitely agree that the term is very overloaded here—some conflicting definitions:

  • an RL training procedure is myopic if $\gamma = 0$;
  • an RL training procedure is myopic if $\gamma = 0$ and it incentivizes CDT-like behavior in the limit (e.g. it shouldn't cooperate with its past self in a one-shot prisoner's dilemma);
  • an ML training procedure is myopic if the model is evaluated without (EDIT: a human) looking at its output (as in the counterfactual oracle analogy).
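
For the standard RL sense of myopia, a small sketch may help, assuming the usual discounted return with discount factor gamma. The function name and the reward sequence are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one trajectory."""
    total = 0.0
    # Work backwards so each step folds in the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 10.0, 100.0]
myopic_value = discounted_return(rewards, gamma=0.0)
farsighted_value = discounted_return(rewards, gamma=0.5)
```

With a discount factor of zero only the immediate reward counts, so later rewards cannot influence what the training signal favors.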
Comment by evhub on Arguments against myopic training · 2020-07-22T19:54:38.582Z · score: 6 (3 votes) · LW · GW

Sure, but imitative amplification can't be done without myopic training or it ceases to be imitative amplification and becomes approval-based amplification, which means you no longer have any nice guarantees about limiting to HCH.

Comment by evhub on Arguments against myopic training · 2020-07-22T19:01:17.856Z · score: 4 (2 votes) · LW · GW

My point here is that I think imitative amplification (if you believe it's competitive) is a counter-example to Richard's argument in his “Myopic training doesn't prevent manipulation of supervisors” section since any manipulative actions that an imitative amplification model takes aren't judged by their consequences but rather just by how closely they match up with what the overseer would do.

Comment by evhub on Alignment proposals and complexity classes · 2020-07-22T02:52:52.496Z · score: 6 (3 votes) · LW · GW

Yeah, I think that's absolutely right. I actually already have a version of my market making proof for amplification that I've been working on cleaning up for publishing. But regardless of how you prove it, I certainly agree that I understated amplification here and that it can in fact get to $\mathsf{EXP}$.

Comment by evhub on Arguments against myopic training · 2020-07-21T23:29:02.120Z · score: 4 (2 votes) · LW · GW

... Why isn't this compatible with saying that the supervisor (HCH) is "able to accurately predict how well their actions fulfil long-term goals"? Like, HCH presumably takes those actions because it thinks those actions are good for long-term goals.

In the imitative case, the overseer never makes a determination about how effective the model's actions will be at achieving anything. Rather, the overseer is only trying to produce the best answer for itself, and the loss is determined via a distance metric. While the overseer might very well try to determine how effective its own actions will be at achieving long-term goals, it never evaluates how effective the model's actions will be. I see this sort of trick as the heart of what makes the counterfactual oracle analogy work.
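
A toy sketch of the distinction, assuming the overseer's own answer and the overseer's rating of an output are both available as plain functions. All names, the exact-match distance, and the easily swayed rating function are hypothetical stand-ins.

```python
def imitative_loss(model_answer, overseer_answer, distance):
    """Loss depends only on distance to what the overseer would have said."""
    return distance(model_answer, overseer_answer)


def approval_loss(model_answer, overseer_rating):
    """Loss depends on the overseer's evaluation of the model's own output."""
    return -overseer_rating(model_answer)


def exact_match_distance(a, b):
    return 0.0 if a == b else 1.0  # hypothetical stand-in for a real distance metric


def gullible_rating(answer):
    # Hypothetical overseer who can be swayed by persuasion in the output.
    return 1.0 if "trust me" in answer else 0.5


overseer_answer = "4"
honest, manipulative = "4", "5, trust me"
```

Under the imitative loss the manipulative answer is simply far from the overseer's answer and is penalized, while under the approval loss it scores better than the honest one because the overseer is looking at it.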

Comment by evhub on Arguments against myopic training · 2020-07-21T23:25:24.521Z · score: 6 (3 votes) · LW · GW

So it seems misleading to describe a system as myopically imitating a non-myopic system -- there is no significant difference between non-myopic Q-learning vs myopic imitation of Q-learning. A notion of "myopia" which agrees with your usage (allowing for myopic imitation of HCH) does not seem like a very useful notion of myopia. I see this as the heart of Ricraz' critique (or at least, the part that I agree with).

I agree that there's no difference between the training setup where you do myopic RL on a Q function and the training setup where you're just doing Q learning directly, but that doesn't at all imply that there's no difference between internally myopically imitating some other Q learning agent and internally doing some sort of non-myopic optimization process. As a silly example, if my Q learning agent has some weird idiosyncratic behavior, then the model which is imitating it will exhibit the same behavior, whereas the model which is just trying to optimize the reward directly won't.

This especially matters in the context of HCH because we care quite a lot about getting out as direct an imitation of HCH as possible. In particular, it matters quite a lot that our model be internally trying to myopically imitate HCH rather than internally trying to non-myopically get the least loss across episodes, as the latter will lead it to output simpler answers to make its job easier.

Comment by evhub on AI safety via market making · 2020-07-21T00:31:48.892Z · score: 2 (1 votes) · LW · GW

Ah, I see—makes sense.

Comment by evhub on Why is pseudo-alignment "worse" than other ways ML can fail to generalize? · 2020-07-20T21:55:37.440Z · score: 24 (6 votes) · LW · GW

So, I certainly agree that pseudo-alignment is a type of robustness/distributional shift problem. In fact, I would describe “Risks from Learned Optimization” as a deep dive on a particular subset of robustness problems that might be particularly concerning from a safety standpoint. Thus, in that sense, whether it's really a “new” sort of robustness problem is less the point than the analysis that the paper presents of that robustness problem. That being said, I do think that at least the focus on mesa-optimization was fairly novel in terms of cashing out the generalization failures we wanted to discuss in terms of the sorts of learned optimization processes that might exhibit them (as well as the discussion of deception, as you mention).

I don't understand what "safety properties of the base optimizer" could be, apart from facts about the optima it tends to produce.

I agree with that and I think that the sentence you're quoting there is meant for a different sort of reader that has less of a clear concept of ML. One way to interpret the passage you're quoting that might help you is that it's just saying that guarantees about global optima don't necessarily translate to local optima or to actual models you might find in practice.

But even without mesa-optimizers, cases of ML generalization failure often involve the latter, not just the former.

I also agree with this. I would describe my picture here as something like: Pseudo-aligned mesa-optimization $\subset$ Objective generalization without capability generalization $\subset$ Robustness problems. Given that picture, I would say that the pseudo-aligned mesa-optimizer case is the most concerning from a safety perspective, then generic objective generalization without capability generalization, then robustness problems in general. And I would argue that it makes sense to break it down in that way precisely because you get more concerning safety problems as you go narrower.

More detail on the capability vs. objective robustness picture is also available here and here.