Synthesizing amplification and debate 2020-02-05T22:53:56.940Z · score: 23 (8 votes)
Outer alignment and imitative amplification 2020-01-10T00:26:40.480Z · score: 28 (6 votes)
Exploring safe exploration 2020-01-06T21:07:37.761Z · score: 35 (10 votes)
Safe exploration and corrigibility 2019-12-28T23:12:16.585Z · score: 17 (8 votes)
Inductive biases stick around 2019-12-18T19:52:36.136Z · score: 50 (14 votes)
Understanding “Deep Double Descent” 2019-12-06T00:00:10.180Z · score: 107 (47 votes)
What are some non-purely-sampling ways to do deep RL? 2019-12-05T00:09:54.665Z · score: 15 (5 votes)
What I’ll be doing at MIRI 2019-11-12T23:19:15.796Z · score: 117 (36 votes)
More variations on pseudo-alignment 2019-11-04T23:24:20.335Z · score: 20 (6 votes)
Chris Olah’s views on AGI safety 2019-11-01T20:13:35.210Z · score: 135 (41 votes)
Gradient hacking 2019-10-16T00:53:00.735Z · score: 50 (16 votes)
Impact measurement and value-neutrality verification 2019-10-15T00:06:51.879Z · score: 35 (10 votes)
Towards an empirical investigation of inner alignment 2019-09-23T20:43:59.070Z · score: 43 (11 votes)
Relaxed adversarial training for inner alignment 2019-09-10T23:03:07.746Z · score: 45 (11 votes)
Are minimal circuits deceptive? 2019-09-07T18:11:30.058Z · score: 51 (12 votes)
Concrete experiments in inner alignment 2019-09-06T22:16:16.250Z · score: 63 (20 votes)
Towards a mechanistic understanding of corrigibility 2019-08-22T23:20:57.134Z · score: 36 (10 votes)
Risks from Learned Optimization: Conclusion and Related Work 2019-06-07T19:53:51.660Z · score: 65 (19 votes)
Deceptive Alignment 2019-06-05T20:16:28.651Z · score: 63 (17 votes)
The Inner Alignment Problem 2019-06-04T01:20:35.538Z · score: 68 (17 votes)
Conditions for Mesa-Optimization 2019-06-01T20:52:19.461Z · score: 59 (20 votes)
Risks from Learned Optimization: Introduction 2019-05-31T23:44:53.703Z · score: 126 (36 votes)
A Concrete Proposal for Adversarial IDA 2019-03-26T19:50:34.869Z · score: 18 (6 votes)
Nuances with ascription universality 2019-02-12T23:38:24.731Z · score: 24 (7 votes)
Dependent Type Theory and Zero-Shot Reasoning 2018-07-11T01:16:45.557Z · score: 18 (11 votes)


Comment by evhub on Does iterated amplification tackle the inner alignment problem? · 2020-02-15T19:24:38.032Z · score: 13 (6 votes) · LW · GW

You are correct that amplification is primarily a proposal for how to solve outer alignment, not inner alignment. That being said, Paul has previously talked about how you might solve inner alignment in an amplification-style setting. For an up-to-date, comprehensive analysis of how to do something like that, see “Relaxed adversarial training for inner alignment.”

Comment by evhub on What is the difference between robustness and inner alignment? · 2020-02-15T19:14:37.245Z · score: 15 (6 votes) · LW · GW

This is a good question. Inner alignment definitely is meant to refer to a type of robustness problem; it's just also definitely not meant to refer to the entirety of robustness. I think there are a couple of different levels on which you can think about exactly what subproblem inner alignment is referring to.

First, the definition that's given in “Risks from Learned Optimization”—where the term inner alignment comes from—is not about competence vs. intent robustness, but is directly about the objective that a learned search algorithm is searching for. Risks from Learned Optimization broadly takes the position that though it might not make sense to talk about learned models having objectives in general, it certainly makes sense to talk about a model having an objective if it is internally implementing a search process, and argues that learned models internally implementing search processes (which the paper calls mesa-optimizers) could be quite common. I would encourage reading the full paper to get a sense of how this sort of definition plays out.

Second, that being said, I do think that the competence vs. intent robustness framing that you mention is actually a fairly reasonable one. “2-D Robustness” presents the basic picture here, though in terms of a concrete example of what robust capabilities without robust alignment could actually look like, I am somewhat partial to my maze example. I think the maze example in particular presents a very clear story for how capability and alignment robustness can come apart even for agents that aren't obviously running a search process. The 2-D robustness distinction is also the subject of this alignment newsletter, which I'd highly recommend taking a look at, as it has some more commentary on thinking about this sort of definition.

Comment by evhub on Bayesian Evolving-to-Extinction · 2020-02-15T18:35:41.514Z · score: 5 (3 votes) · LW · GW

If that ticket is better at predicting the random stuff it's writing to the logs—which it should be if it's generating that randomness—then that would be sufficient. However, that does rely on the logs directly being part of the prediction target rather than only through some complicated function like a human seeing them.

Comment by evhub on Bayesian Evolving-to-Extinction · 2020-02-15T00:46:34.225Z · score: 8 (4 votes) · LW · GW

There is also the "lottery ticket hypothesis" to consider (discussed on LW here and here) -- the idea that a big neural network functions primarily like a bag of hypotheses, not like one hypothesis which gets adapted toward the right thing. We can imagine different parts of the network fighting for control, much like the Bayesian hypotheses.

This is a fascinating point. I'm curious now how bad things can get if your lottery tickets have side channels but aren't deceptive. It might be that the evolving-to-extinction policy of making the world harder to predict through logs is complicated enough that it can only emerge through a deceptive ticket deciding to pursue it—or it could be the case that it's simple enough that one ticket could randomly start writing stuff to logs, get selected for, and end up pursuing such a policy without ever actually having come up with it explicitly. This seems likely to depend on how powerful your base optimization process is and how easy it is to influence the world through side-channels. If it's the case that you need deception, then this probably isn't any worse than the gradient hacking problem (though possibly it gives us more insight into how gradient hacking might work)—but if it can happen without deception, then this sort of evolving-to-extinction behavior could be a serious problem in its own right.

Comment by evhub on Synthesizing amplification and debate · 2020-02-10T20:00:24.672Z · score: 4 (2 votes) · LW · GW

Yep; that's basically how I'm thinking about this. Since I mostly want this process to limit to amplification rather than debate, I'm not that worried about the debate equilibrium not being exactly the same, though in most cases I expect in the limit that such that you can in fact recover the debate equilibrium if you anneal towards debate.

Comment by evhub on Synthesizing amplification and debate · 2020-02-09T17:52:43.539Z · score: 4 (2 votes) · LW · GW

The basic debate RL setup is meant to be unchanged here—when I say “the RL reward derived from ” I mean that in the zero-sum debate game sense. So you're still using self-play to converge on the Nash in the situation where you anneal towards debate, and otherwise you're using that self-play RL reward as part of the loss and the supervised amplification loss as the other part.
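As a rough illustration of the two-part loss described above, here is a minimal sketch. The linear weighting, the name `alpha`, and the +1/-1 reward scale are assumptions made for the example, not the exact formulation from the post.

```python
from typing import Tuple


def zero_sum_rewards(first_debater_wins: bool) -> Tuple[float, float]:
    """Zero-sum debate game: the winner gets +1 and the loser gets -1."""
    return (1.0, -1.0) if first_debater_wins else (-1.0, 1.0)


def combined_loss(amplification_loss: float, debate_reward: float, alpha: float) -> float:
    """Blend a supervised amplification loss with a self-play debate reward.

    alpha is a hypothetical annealing weight: 0 means pure amplification,
    1 means pure debate. The reward enters with a negative sign because
    the loss is minimized while the reward is maximized.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * amplification_loss - alpha * debate_reward
```

Annealing towards debate would correspond to moving `alpha` towards 1 over training, at which point only the self-play term remains.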

Are the arguments the same thing as answers?

The arguments should include what each debater thinks the answer to the question should be.

I think yours is aiming at the second and not the first?


Comment by evhub on Synthesizing amplification and debate · 2020-02-06T23:09:40.964Z · score: 2 (1 votes) · LW · GW

It shouldn't be since is just a function argument here—and I was imagining that including a variable in a question meant it was embedded such that the question-answerer has access to it, but perhaps I should have made that more clear.

Comment by evhub on Outer alignment and imitative amplification · 2020-02-04T06:18:35.451Z · score: 2 (1 votes) · LW · GW

That's a good point. What I really mean is that I think the sort of HCH that you get out of taking actual humans and giving them careful instructions is more likely to be uncompetitive than it is to be unaligned. Also, I think that “HCH for a specific H” is more meaningful than “HCH for a specific level of competitiveness,” since we don't really know what weird things you might need to do to produce an HCH with a given level of competitiveness.

Comment by evhub on Outer alignment and imitative amplification · 2020-02-03T21:19:11.155Z · score: 2 (1 votes) · LW · GW

Another thing that maybe I didn't make clear previously:

I believe the point about Turing machines was that given Low Bandwidth Overseer, it's not clear how to get HCH/IA to do complex tasks without making it instantiate arbitrary Turing machines.

I agree, but if you're instructing your humans not to instantiate arbitrary Turing machines, then that's a competitiveness claim, not an alignment claim. I think there are lots of very valid reasons for thinking that HCH is not competitive—I only said I was skeptical of the reasons for thinking it wouldn't be aligned.

Comment by evhub on The Epistemology of AI risk · 2020-01-30T20:14:48.980Z · score: 8 (2 votes) · LW · GW

I feel like you are drawing the wrong conclusion from the shift in arguments that has occurred. I would argue that what look like wrong ideas that ended up not contributing to future research could actually have been quite necessary for progressing the field's understanding as a whole. That is, maybe we really needed to engage with utility functions first before we could start breaking down that assumption—or maybe optimization daemons were a necessary step towards understanding mesa-optimization. Thus, I don't think the shift in arguments at all justifies the conclusion that prior work wasn't very helpful, as the prior work could have been necessary to achieve that very shift.

Comment by evhub on The Epistemology of AI risk · 2020-01-28T07:11:56.258Z · score: 18 (5 votes) · LW · GW

The argument that continuous takeoff makes AI safe seems robust to most specific items on your list, though I can see several ways that the argument fails.

I feel like this depends on a whole bunch of contingent facts regarding our ability to accurately diagnose and correct what could be very pernicious problems such as deceptive alignment amidst what seems quite likely to be a very quickly changing and highly competitive world.

It seems even harder to do productive work, since I'm skeptical of very short timelines.

Why does being skeptical of very short timelines preclude our ability to do productive work on AI safety? Surely there are things we can be doing now to gain insight, build research/organizational capacity, etc. that will at least help somewhat, no? (And it seems to me like “probably helps somewhat” is enough when it comes to existential risk.)

Comment by evhub on Have epistemic conditions always been this bad? · 2020-01-25T23:45:07.259Z · score: 26 (10 votes) · LW · GW

First, as someone who just (class of 2019) graduated college at a very liberal, highly regarded, private U.S. institution, the description above definitely does not match my experience. In my experience, I found that dissenting opinions and avid discussion were highly encouraged. That being said, I suspect Mudd may be particularly good on that axis due to factors such as being entirely STEM-focused (also Debra Mashek was one of my professors).

Second, I think it is worth pointing out that there are definitely instances where, at least in my opinion, “canceling” is a valid tactic. Deplatforming violent rhetoric (e.g. Nazism, Holocaust denial, etc.) comes to mind as an obvious example.

Third, that being said, I do think there is a real problem along the lines of what you're pointing at. For example, one thing I saw recently was what's been happening to Natalie Wynn, a YouTuber who goes by the name “ContraPoints.” She's a very popular leftist YouTuber who mainly talks about various left-wing social issues, particularly transgender issues (she herself is transgender). In one of her recent videos, she cast a transgender man named Buck Angel as a voice actor for part of it, and people (mostly on Twitter) got extremely upset at her because Buck Angel had at one point previously said something that maybe possibly could be interpreted as anti-non-binary-people. I think that Natalie's recent video responding to her “canceling” is probably the best analysis of the whole phenomenon that I've seen, and aligns pretty well with my views on the topic, though it's quite long.

There are a lot of things about Natalie's canceling that give me hope, though. First, it seemed like her canceling was highly concentrated on Twitter, which makes a lot of sense to me—I tend to think that it's almost impossible to have good discourse in any sort of combative/argumentative setting, especially when it's online, and especially when everyone is limited only to tiny tweets, which lend themselves particularly well to snarky quippy one-liners without any actual real substance. Second, it was really only a fringe group of people canceling her—it's just that the people who were doing it were very loud, which again strikes me as exactly the sort of thing that is highly exacerbated by the internet, and especially by Twitter. Third, I think there's a real movement on the left towards rejecting this sort of thing—I think Natalie is a good example of a very public leftist strongly rejecting “cancel culture,” though I've met lots of other die-hard leftists who think similarly while I was in college. There are a lot of really smart people on the left and I think it's quite reasonable to expect that this will broadly get better over time—especially if people move to better forms of online discourse than Twitter (or Facebook, which I also think is pretty bad). YouTube and Reddit, though, are mainstream platforms that I think produce significantly better discourse than Twitter, so I do think there's hope there.

Comment by evhub on Exploring safe exploration · 2020-01-16T20:58:02.973Z · score: 14 (4 votes) · LW · GW

Hey Aray!

Given this, I think the "within-episode exploration" and "across-episode exploration" relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call "safe exploration".

I agree with this. I jumped the gun a bit in not really making the distinction clear in my earlier post “Safe exploration and corrigibility,” and I think that made it a bit confusing, so I went heavy on the distinction here, though perhaps heavier than I actually think is warranted.

The problem I have with relaxing within-episode and across-episode exploration into each other, though, is precisely the problem I describe in “Safe exploration and corrigibility,” which is that by default you only end up with capability exploration, not objective exploration. That is, an agent with a goal (i.e. a mesa-optimizer) is only going to explore to the extent that it helps its current goal, not to the extent that it helps it change its goal to be more like the desired goal. Thus, you need to do something else (something that possibly looks somewhat like corrigibility) to get the agent to explore in such a way that helps you collect data on what its goal is and how to change it.

Comment by evhub on Malign generalization without internal search · 2020-01-13T23:37:32.962Z · score: 2 (1 votes) · LW · GW

I don't feel like you're really understanding what I'm trying to say here. I'm happy to chat with you about this more over video call or something if you're interested.

Comment by evhub on Malign generalization without internal search · 2020-01-12T19:40:25.172Z · score: 6 (1 votes) · LW · GW

I think that piecewise objectives are quite reasonable and natural—and I don't think they'll make transparency that much harder. I don't think there's any reason that we should expect objectives to be continuous in some nice way, so I fully expect you'll get these sorts of piecewise jumps. Nevertheless, the resulting objective in the piecewise case is still quite simple such that you should be able to use interpretability tools to understand it pretty effectively—a switch statement is not that complicated or hard to interpret—with most of the real hard work still primarily being done in the optimization.

I do think there are a lot of possible ways in which the interpretability for mesa-optimizers story could break down—which is why I'm still pretty uncertain about it—but I don't think that a switch-case agent is such an example. Probably the case that I'm most concerned about right now is if you get an agent which has an objective which changes in a feedback loop with its optimization. If the objective and the optimization are highly dependent on each other, then I think that would make the problem a lot more difficult—and is the sort of thing that humans seem to do, which suggests that it's the sort of thing we might see in AI systems as well. On the other hand, a fixed switch-case objective is pretty easy to interpret, since you just need to understand the simple, fixed heuristics being used in the switch statement and then you can get a pretty good grasp on what your agent's objective is. Where I start to get concerned is when those switch statements themselves depend upon the agent's own optimization—a recursion which could possibly be many layers deep and quite difficult to disentangle. That being said, even in such a situation you're still using search to get your robust capabilities.
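To make the fixed switch-case picture concrete, here is a toy sketch. The state field `fuel`, the threshold, and the action names are all hypothetical; the point is only that the objective is chosen by a simple, fixed heuristic while the actual work is done by the search over actions.

```python
from typing import Callable, Dict, Sequence

State = Dict[str, float]
Objective = Callable[[State, str], float]


def select_objective(state: State) -> Objective:
    """The 'switch statement': a fixed heuristic picks which objective applies."""
    if state["fuel"] < 0.2:
        # Case 1: low fuel, so only refueling scores well.
        return lambda s, a: 1.0 if a == "refuel" else 0.0
    # Default case: otherwise only exploring scores well.
    return lambda s, a: 1.0 if a == "explore" else 0.0


def act(state: State, actions: Sequence[str]) -> str:
    """Optimization step: search over actions for the best one under the
    objective selected for the current state."""
    objective = select_objective(state)
    return max(actions, key=lambda a: objective(state, a))
```

Reading off `select_objective` is enough to understand what this agent is after. The harder case described above would be one where `select_objective` itself called back into `act`.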

Comment by evhub on Malign generalization without internal search · 2020-01-12T19:10:12.325Z · score: 3 (2 votes) · LW · GW

Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as "Prove this theorem: [...]" or "Compute the fastest route to Paris." We then imagine that this agent still shares the basic superstructure of the pseudocode I gave initially above.

I feel like what you're describing here is just optimization where the objective is determined by a switch statement, which certainly seems quite plausible to me but also pretty neatly fits into the mesa-optimization framework.

More generally, while I certainly buy that you can produce simple examples of things that look kinda like capability generalization without objective generalization on environments like the lunar lander or my maze example, it still seems to me like you need optimization to actually get capabilities that are robust enough to pose a serious risk, though I remain pretty uncertain about that.

Comment by evhub on Outer alignment and imitative amplification · 2020-01-11T22:46:36.483Z · score: 2 (1 votes) · LW · GW

Is "outer alignment" meant to be applicable in the general case?

I'm not exactly sure what you're asking here.

Do you think it also makes sense to talk about outer alignment of the training process as a whole, so that for example if there is a security hole in the hardware or software environment and the model takes advantage of the security hole to hack its loss/reward, then we'd call that an "outer alignment failure".

I would call that an outer alignment failure, but only because I would say that the ways in which your loss function can be hacked are part of the specification of your loss function. However, I wouldn't consider an entire training process to be outer aligned—rather, I would just say that an entire training process is aligned. I generally use outer and inner alignment to refer to different components of aligning the training process—namely the objective/loss function/environment in the case of outer alignment and the inductive biases/architecture/optimization procedure in the case of inner alignment (though note that this is a more general definition than the one used in “Risks from Learned Optimization,” as it makes no mention of mesa-optimizers, though I would still say that mesa-optimization is my primary example of how you could get an inner alignment failure).

So technically, one should say that a loss function is outer aligned at optimum with respect to some model class, right?

Yes, though in the definition I gave here I just used the model class of all functions, which is obviously too large but has the nice property of being a fully general definition.

Also, related to Ofer's comment, can you clarify whether it's intended for this definition that the loss function only looks at the model's input/output behavior, or can it also take into account other information about the model?

I would include all possible input/output channels in the domain/codomain of the model when interpreted as a function.

I'm also curious whether you have HBO or LBO in mind for this post.

I generally think you need HBO and am skeptical that LBO can actually do very much.

Comment by evhub on Outer alignment and imitative amplification · 2020-01-10T05:29:18.738Z · score: 8 (4 votes) · LW · GW

I think I'm quite happy even if the optimal model is just trying to do what we want. With imitative amplification, the true optimum—HCH—still has benign failures, but I nevertheless want to argue that it's aligned. In fact, I think this post really only makes sense if you adopt a definition of alignment that excludes benign failures, since otherwise you can't really consider HCH aligned (and thus can't consider imitative amplification outer aligned at optimum).

Comment by evhub on Exploring safe exploration · 2020-01-07T08:41:41.723Z · score: 2 (1 votes) · LW · GW

Like I said in the post, I'm skeptical that “preventing the agent from making an accidental mistake” is actually a meaningful concept (or at least, it's a concept with many possible conflicting definitions), so I'm not sure how to give an example of it.

Comment by evhub on Exploring safe exploration · 2020-01-06T23:56:15.253Z · score: 6 (3 votes) · LW · GW

I definitely was not arguing that. I was arguing that safe exploration is currently defined in ML as the agent making an accidental mistake, and that we should really not be having terminology collisions with ML. (I may have left that second part implicit.)

Ah, I see—thanks for the correction. I changed “best” to “current.”

I assume that the difference you see is that you could try to make across-episode exploration less detrimental from the agent's perspective

No, that's not what I was saying. When I said “reward acquisition” I meant the actual reward function (that is, the base objective).


That being said, it's a little bit tricky in some of these safe exploration setups to draw the line between what's part of the base objective and what's not. For example, I would generally include the constraints in constrained optimization setups as just being part of the base objective, only specified slightly differently. In that context, constrained optimization is less of a safe exploration technique and more of a reward-engineering-y/outer alignment sort of thing, though it also has a safe exploration component to the extent that it constrains across-episode exploration.
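One way to see constraints as part of the base objective is to fold them into the reward as a penalty. This is a minimal sketch of that idea; the penalty form and the names `cost_limit` and `multiplier` are assumptions for illustration, not taken from any particular safe exploration method.

```python
def penalized_reward(reward: float, constraint_cost: float,
                     cost_limit: float, multiplier: float) -> float:
    """Fold a constraint into the objective as a penalty on violations.

    Staying at or under cost_limit leaves the reward unchanged; exceeding
    it subtracts multiplier times the amount of the violation.
    """
    violation = max(0.0, constraint_cost - cost_limit)
    return reward - multiplier * violation
```

For example, `penalized_reward(1.0, 2.0, 1.0, 10.0)` evaluates to `-9.0`, so the constraint is simply another term that the optimizer is trading off.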

Note that when across-episode exploration is learned, the distinction between safe exploration and outer alignment becomes even more muddled, since then all the other terms in the loss will implicitly serve to check the across-episode exploration term, as the agent has to figure out how to trade off between them.[1]

  1. This is another one of the points I was trying to make in “Safe exploration and corrigibility” but didn't do a great job of conveying properly. ↩︎

Comment by evhub on Safe exploration and corrigibility · 2019-12-29T05:11:28.819Z · score: 5 (3 votes) · LW · GW

I completely agree with the distinction between across-episode vs. within-episode exploration, and I agree I should have been clearer about that. Mostly I want to talk about across-episode exploration here, though when I was writing this post I was mostly motivated by the online learning case where the distinction is in fact somewhat blurred, since in an online learning setting you do in fact need the deployment policy to balance between within-episode exploration and across-episode exploration.

Usually (in ML) "safe exploration" means "the agent doesn't make a mistake, even by accident"; ϵ-greedy exploration wouldn't be safe in that sense, since it can fall into traps. I'm assuming that by "safe exploration" you mean "when the agent explores, it is not trying to deceive us / hurt us / etc".

Agreed. My point is that “If you assume that the policy without exploration is safe, then for ϵ-greedy exploration to be safe on average, it just needs to be the case that the environment is safe on average, which is just a standard engineering question.” That is, even though it seems like it's hard for ϵ-greedy exploration to be safe, it's actually quite easy for it to be safe on average: you just need to be in a safe environment. That's not true for learned exploration, though.
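The "safe on average" claim can be spelled out with a small sketch. The per-action harm table is a hypothetical stand-in for whatever safety measure one cares about, and exploration is assumed to be uniform over actions.

```python
import random
from typing import Dict, Sequence


def epsilon_greedy(greedy_action: str, actions: Sequence[str],
                   epsilon: float, rng: random.Random) -> str:
    """With probability epsilon take a uniformly random action,
    otherwise take the policy's greedy action."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return greedy_action


def expected_harm(epsilon: float, harm_by_action: Dict[str, float],
                  greedy_action: str) -> float:
    """Expected one-step harm under epsilon-greedy exploration.

    harm_by_action is a hypothetical table of harm per action.
    """
    mean_random_harm = sum(harm_by_action.values()) / len(harm_by_action)
    return (1.0 - epsilon) * harm_by_action[greedy_action] + epsilon * mean_random_harm
```

If the greedy action has zero harm, the expected harm reduces to `epsilon` times the average harm of a random action, which depends only on the environment and not on anything the agent has learned.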

Since by default policies can't affect across-episode exploration, I assume you're talking about within-episode exploration. But this happens all the time with current RL methods

Yeah, I agree that was confusing—I'll rephrase it. The point I was trying to make was that across-episode exploration should arise naturally, since an agent with a fixed objective should want to be modified to better pursue that objective, but not want to be modified to pursue a different objective.

This sounds to me like reward uncertainty, assistance games / CIRL, and more generally Stuart Russell's agenda, except applied to mesa optimization now. Should I take away something other than "we should have our mesa optimizers behave like the AIs in assistance games"? I feel like you are trying to say something else but I don't know what.

Agreed that there's a similarity there—that's the motivation for calling it “cooperative.” But I'm not trying to advocate for that agenda here—I'm just trying to better classify the different types of corrigibility and understand how they work. In fact, I think it's quite plausible that you could get away without cooperative corrigibility, though I don't really want to take a stand on that right now.

I thought we were talking about "the agent doesn't try to deceive us / hurt us by exploring", which wouldn't tell us anything about the problem of "the agent doesn't make an accidental mistake".

If your definition of “safe exploration” is “not making accidental mistakes” then I agree that what I'm pointing at doesn't fall under that heading. What I'm trying to point at is that I think there are other problems that we need to figure out regarding how models explore than just the “not making accidental mistakes” problem, though I have no strong feelings about whether or not to call those other problems “safe exploration” problems.

The same way as capability exploration; based on value of information (VoI). (I assume you have a well-specified distribution over objectives; if you don't, then there is no proper way to do it, in the same way there's no proper way to do capability exploration without a prior over what you might see when you take the new action.)

Agreed, though I don't think that's the end of the story. In particular, I don't think it's at all obvious what an agent that cares about the value of information that its actions produce relative to some objective distribution will look like, how you could get such an agent, or how you could verify when you had such an agent. And, even if you could do those things, it still seems pretty unclear to me what the right distribution over objectives should be and how you should learn it.
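For reference, here is a minimal sketch of a value-of-information calculation over a distribution over objectives. It computes the expected value of perfect information, a standard special case in which an observation fully reveals the objective; the dictionary layout and names are assumptions for illustration.

```python
from typing import Dict


def expected_value_of_perfect_information(
    prior: Dict[str, float],
    utilities: Dict[str, Dict[str, float]],
) -> float:
    """prior maps each candidate objective to its probability;
    utilities[objective][action] is the utility of an action under it."""
    actions = list(next(iter(utilities.values())).keys())
    # Best expected utility when committing to one action under uncertainty.
    best_without_info = max(
        sum(prior[o] * utilities[o][a] for o in prior) for a in actions
    )
    # Expected utility when the objective is revealed before acting.
    best_with_info = sum(prior[o] * max(utilities[o].values()) for o in prior)
    return best_with_info - best_without_info
```

With two equally likely objectives that each favor a different action (utility 1 for the favored action and 0 otherwise), the result is 0.5. Note that the sketch takes `prior` as given, which is exactly the part that remains unclear.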

The algorithms used are not putting dampers on exploration; they are trying to get the agent to do better exploration (e.g. if you crashed into the wall and saw that that violated a constraint, don't crash into the wall again just because you forgot about that experience).

Well, what does “better exploration” mean? Better across-episode exploration or better within-episode exploration? Better relative to the base objective or better relative to the mesa-objective? I think it tends to be “better within-episode exploration relative to the base objective,” which I would call putting a damper on instrumental exploration, which does across-episode and within-episode exploration only for the mesa-objective, not the base objective.

If you have the right uncertainty, then acting optimally to maximize that is the "right" thing to do.

Sure, but as you note getting the right uncertainty could be quite difficult, so for practical purposes my question is still unanswered.

Comment by evhub on Inductive biases stick around · 2019-12-26T08:15:23.518Z · score: 4 (2 votes) · LW · GW

I just edited the last sentence to be clearer in terms of what I actually mean by it.

Comment by evhub on [AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison · 2019-12-26T08:11:46.269Z · score: 2 (1 votes) · LW · GW

To be clear, I broadly agree that AGI will be quite underparameterized, but still maintain that double descent demonstrates something—that larger models can do better by being simpler not just by fitting more data—that I think is still quite important.

Comment by evhub on Free Speech and Triskaidekaphobic Calculators: A Reply to Hubinger on the Relevance of Public Online Discussion to Existential Risk · 2019-12-21T06:41:59.683Z · score: 15 (7 votes) · LW · GW

I'm not really interested in debating this on LessWrong, for basically the exact reasons that I stated in the first place, which is that I don't really think these sorts of conversations can be done effectively online. Thus, I probably won't try to respond to any replies to this comment.

At the very least, though, I think it's worth clarifying that my position is certainly not "assume what you're doing is the most important thing and run with it." Rather, I think that trying to think really hard about the most important things to be doing is an incredibly valuable exercise, and I think the effective altruism community provides a great model of how I think that should be done. The only thing I was advocating was not discussing hot-button political issues specifically online. I think to the extent that those sorts of things are relevant to doing the most good, they should be done offline, where the quality of the discussion can be higher and nobody ends up tainted by other people's beliefs by association.

Comment by evhub on Inductive biases stick around · 2019-12-20T19:11:48.336Z · score: 2 (1 votes) · LW · GW

What double descent definitely says is that for a fixed dataset, larger models with zero training error are simpler than smaller models with zero training error. I think it does say somewhat more than that also, which is that larger models do have a real tendency towards being better at finding simpler models in general. That being said, the dataset on which the concept of a dog in your head was trained is presumably way larger than that of any ML model, so even if your brain is really good at implementing Occam's razor and finding simple models, your model is still probably going to be more complicated.

Comment by evhub on Against Premature Abstraction of Political Issues · 2019-12-20T19:00:02.094Z · score: 3 (2 votes) · LW · GW

I disagree, and think LW can actually do ok, and probably even better with some additional safeguards around political discussions. You weren't around yet when we had the big 2009 political debate that I referenced in the OP, but I think that one worked out pretty well in the end.

Do you think having that debate online was something that needed to happen for AI safety/x-risk? Do you think it benefited AI safety at all? I'm genuinely curious. My bet would be the opposite—that it caused AI safety to be more associated with political drama that helped further taint it.

Comment by evhub on A dilemma for prosaic AI alignment · 2019-12-19T22:36:07.267Z · score: 5 (3 votes) · LW · GW

I'm skeptical of language modeling being enough to be competitive, in the sense of maximizing "log prob of some naturally occurring data or human demonstrations." I don't have a strong view about whether you can get away using only language data rather than e.g. taking images as input and producing motor torques as output.

I agree with this, though I still feel like some sort of active learning approach might be good enough without needing to add in a full-out RL objective.

I'm also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What's so bad if we use non-language data?

My opinion would be that there is a real safety benefit from being in a situation where you know the theoretical optimum of your loss function (e.g. in a situation where you know that HCH is precisely the thing for which loss is zero). That being said, it does seem obviously fine to have your language data contain other types of data (e.g. images) inside of it.

Comment by evhub on 2019 AI Alignment Literature Review and Charity Comparison · 2019-12-19T08:00:52.166Z · score: 21 (11 votes) · LW · GW

On the other hand, I don’t think we can give people money just because they say they are doing good things, because of the risk of abuse. There are many other reasons for not publishing anything. Some simple alternative hypotheses include “we failed to produce anything publishable” or “it is fun to fool ourselves into thinking we have exciting secrets” or “we are doing bad things and don’t want to get caught.” The fact that MIRI’s researchers appear intelligent suggests they at least think they are working on important and interesting issues, but history has many examples of talented reclusive teams spending years working on pointless stuff in splendid isolation.

Additionally, by hiding the highest quality work we risk impoverishing the field, making it look unproductive and unattractive to potential new researchers.

My work at MIRI is public, btw.

a Mesa-Optimizer - a sub-agent of an optimizer that is itself an optimizer

I think this is a poor description of mesa-optimization. A mesa-optimizer is not a subagent, it's just a trained model implementing a search algorithm.

Comment by evhub on Inductive biases stick around · 2019-12-19T07:34:01.608Z · score: 2 (1 votes) · LW · GW

Note that, in your example, if we do see double descent, it's because the best hypothesis was previously not in the class of hypotheses we were considering. Bayesian methods tend to do badly when the hypothesis class is misspecified.

Yep, that's exactly my model.

As a counterpoint though, you could see double descent even if your hypothesis class always contains the truth, because the "best" hypothesis need not be the truth.

If "best" here means test error, then presumably the truth should generalize at least as well as any other hypothesis.

That first stage is not just a "likelihood descent", it is a "likelihood + prior descent", since you are choosing hypotheses based on the posterior, not based on the likelihood.

True for the Bayesian case, though unclear in the ML case. I think it's quite plausible that current ML underweights the implicit prior of SGD relative to maximizing the likelihood of the data (EDIT: which is another reason that better future ML might care more about inductive biases).
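
To make the "likelihood + prior" point concrete, here is the decomposition written out. The first line is just Bayes' rule in log form. The second line is only an illustrative sketch: the weight $\lambda$ is a hypothetical knob I'm introducing to express "underweighting the prior," not something that appears in any actual training objective. With $\lambda = 1$ you recover the usual MAP choice, and as $\lambda \to 0$ the prior only matters for breaking ties between hypotheses with equal likelihood.

```latex
\log p(h \mid D) = \log p(D \mid h) + \log p(h) - \log p(D)

\hat{h}_{\lambda} = \arg\max_{h} \left[ \log p(D \mid h) + \lambda \, \log p(h) \right]
```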

Comment by evhub on Against Premature Abstraction of Political Issues · 2019-12-18T21:58:49.564Z · score: 7 (4 votes) · LW · GW

How much of an efficiency hit do you think taking all discussion of a subject offline ("in-person") involves?

Probably a good deal for anything academic (like AI safety), but not at all for politics. I think discussions focused on persuasion/debate/argument/etc. are pretty universally bad (e.g. not truth-tracking), and that online discussion is particularly prone to falling into that mode. It is sometimes possible to avoid this failure mode, but imo basically only if the conversations are kept highly academic and steer clear of any hot-button issues (e.g. as in some online AI safety discussions, though not all). I think this is basically impossible for politics, so I suspect that not having the ability to talk about politics online won't be much of a problem (and might even be quite helpful, since I suspect it would overall raise the level of political discourse).

Comment by evhub on Against Premature Abstraction of Political Issues · 2019-12-18T21:42:38.355Z · score: 4 (2 votes) · LW · GW

I can't help but think that we have to find a solution besides "just don't talk about politics" though, because x-risk is inherently political and as the movement gets bigger it's going to inevitably come into conflict with other people's politics.

My preferred solution to this problem continues to be just taking political discussions offline. I recognize that this is difficult for people not situated somewhere like the bay area where there are lots of other rationalist/effective altruist people around to talk to, but nevertheless I still think it's the best solution.


See here for an example of it starting to happen already.

I also agree with Weyl's point here that another very effective thing to do is to talk loudly and publicly about racism, sexism, etc.—though obviously as Eliezer points out that's not always possible, as not every important subject necessarily has such a component.

This is not an entirely rhetorical question, BTW. If anyone can see how things work out well in the end despite LW never getting rid of the "don't talk about politics" norm, I really want to hear that so I can maybe work in that direction instead.

My answer would be that we figure out how to engage with politics, but we do it offline rather than using a public forum like LW.

Comment by evhub on Against Premature Abstraction of Political Issues · 2019-12-18T20:42:28.103Z · score: 32 (11 votes) · LW · GW

I think for this and other reasons, it may be time to relax the norm against discussing object-level political issues around here. There are definitely risks and costs involved in doing that, but I think we can come up with various safeguards to minimize the risks and costs, and if things do go badly wrong anyway, we can be prepared to reinstitute the norm. I won't fully defend that here, as I mainly want to talk about "premature abstraction" in this post, but feel free to voice your objections to the proposal in the comments if you wish to do so.

Apologies in advance for only engaging with the part of this post you said you least wanted to defend, but I just wanted to register strong disagreement here. Personally, I would like LessWrong to be a place where I can talk about AI safety and existential risk without being implicitly associated with lots of other political content that I may or may not agree with. If LessWrong becomes a place for lots of political discussion, people will form such associations regardless of whether or not such associations are accurate. Given that that's the world we live in, and given the importance imo of having a space for AI safety and existential risk discussions, I think having a strong norm against political discussions is quite a good thing.

Comment by evhub on A dilemma for prosaic AI alignment · 2019-12-18T18:43:01.688Z · score: 4 (3 votes) · LW · GW

The goal of something like amplification or debate is to create a sort of oracle AI that can answer arbitrary questions (like how to build your next AI) for you. The claim I'm making is just that language is a rich enough environment that it'll be competitive to only use language as the training data for building your first such system.

Comment by evhub on A dilemma for prosaic AI alignment · 2019-12-18T00:12:42.389Z · score: 8 (5 votes) · LW · GW

I think that this is definitely a concern for prosaic AI safety methods. In the case of something like amplification or debate, I think the bet that you're making is that language modeling alone is sufficient to get you everything you need in a competitive way. I tend to think that that claim is probably true, but it's definitely an assumption of the approach that isn't often made explicit (but probably should be).

To add a bit of color to why you might buy the claim that language is all you need: the claim is basically that language contains enough structure to give you all the high-level cognition you could want, and furthermore that you aren't going to care about the other things that you can't get out of language like performance on fine-grained control tasks. Another way of thinking about this: if the primary purpose of your first highly advanced ML system is to build your second highly advanced ML system, then the claim is that language modelling (on some curriculum) will be sufficient to competitively help you build your next AI.

Comment by evhub on Understanding “Deep Double Descent” · 2019-12-17T19:58:27.583Z · score: 5 (3 votes) · LW · GW

But this doesn't make sense to me, because whatever is being used to "choose" the better model applies throughout training, and so even at the interpolation threshold the model should have been selected throughout training to be the type of model that generalized well. (For example, if you think that regularization is providing a simplicity bias that leads to better generalization, the regularization should also help models at the interpolation threshold, since you always regularize throughout training.)

The idea, at least as I see it, is that the set of possible models that you can choose between increases with training. That is, there are many more models reachable within n + 1 steps of training than there are models reachable within n steps of training. The interpolation threshold is the point at which there are the fewest reachable models with zero training error, so your inductive biases have the fewest choices. Past that point, there are many more reachable models with zero training error, which lets the inductive biases be much more pronounced. One way in which I've been thinking about this is that ML models overweight the likelihood and underweight the prior, since we train exclusively on loss and effectively only use our inductive biases as a tiebreaker. Thus, when there aren't many ties to break (that is, at the interpolation threshold), you get worse performance.
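
A loose way to see the "tiebreaker" picture in code, with the caveat that this is only an analogue: it swaps training steps for number of features, uses random features and random labels that I made up for illustration, and uses the L2 norm as a stand-in for the inductive bias. Among all weight vectors with zero training error, the pseudoinverse picks the minimum-norm one. Because the feature sets are nested, the set of zero-error models only grows as features are added, so the norm of the chosen model can only shrink or stay the same.

```python
import numpy as np

def min_norm_interpolator(features, targets):
    """Return the minimum-L2-norm least-squares solution (via the pseudoinverse)."""
    return np.linalg.pinv(features) @ targets

rng = np.random.default_rng(0)
n_samples, max_features = 10, 40
all_features = rng.normal(size=(n_samples, max_features))  # hypothetical random features
targets = rng.normal(size=n_samples)                        # hypothetical labels

norms = {}
train_errors = {}
for k in range(n_samples, max_features + 1, 5):
    # Nested model classes: model k uses only the first k feature columns.
    w = min_norm_interpolator(all_features[:, :k], targets)
    train_errors[k] = float(np.mean((all_features[:, :k] @ w - targets) ** 2))
    norms[k] = float(np.linalg.norm(w))
    print(f"features={k:2d}  train_mse={train_errors[k]:.1e}  ||w||={norms[k]:.3f}")
```

At k = 10 (the threshold in this toy setup) there is essentially one zero-error model, so the bias has nothing to choose between. At larger k the bias gets to pick from a whole family of zero-error models.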

Comment by evhub on Is the term mesa optimizer too narrow? · 2019-12-16T18:17:31.377Z · score: 10 (5 votes) · LW · GW

I think this is one of the major remaining open questions wrt inner alignment. Personally, I think there is a meaningful sense in which all the models I'm most worried about do some sort of search internally (at least to the same extent that humans do search internally), but I'm definitely uncertain about that. If true, though, it could be quite helpful for solving inner alignment, since it could enable us to factor models into pieces (either through architecture or transparency tools). Also:

As far as I can tell, Hjalmar Wijk introduced the term "malign generalization" to describe the failure mode that I think is most worth worrying about here.

Hjalmar actually cites this post by Paul Christiano as the source of that term—though Hjalmar's usage is slightly different.

Comment by evhub on Understanding “Deep Double Descent” · 2019-12-07T21:05:52.434Z · score: 10 (3 votes) · LW · GW

Note that double descent also happens with polynomial regression—see here for an example.

Comment by evhub on Understanding “Deep Double Descent” · 2019-12-06T23:34:32.033Z · score: 4 (2 votes) · LW · GW

Yep; good catch!

Comment by evhub on Understanding “Deep Double Descent” · 2019-12-06T19:37:44.059Z · score: 4 (2 votes) · LW · GW

I wonder if this is a neural network thing, an SGD thing, or a both thing?

Neither, actually—it's more general than that. Belkin et al. show that it happens even for simple models like decision trees. Also see here for an example with polynomial regression.
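
For anyone who wants to poke at the polynomial regression case themselves, here is a minimal sketch. Everything specific in it is my own assumption rather than taken from the linked example: the sine target, the noise level, the Legendre basis, and the use of the minimum-norm least-squares solution once the degree exceeds the number of points. Whether and how sharply the test error peaks near the interpolation threshold (degree 19 here, with 20 training points) will depend on those choices.

```python
import numpy as np
from numpy.polynomial import legendre

def fit_min_norm(x, y, degree):
    """Least-squares fit in a Legendre basis. lstsq returns the minimum-norm
    solution when there are more coefficients than data points."""
    design = legendre.legvander(x, degree)  # shape (len(x), degree + 1)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def mse(x, y, coef):
    """Mean squared error of the fitted Legendre series on (x, y)."""
    return float(np.mean((legendre.legval(x, coef) - y) ** 2))

def target(x):
    """Hypothetical ground-truth function, chosen only for illustration."""
    return np.sin(np.pi * x)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=20)
y_train = target(x_train) + 0.1 * rng.normal(size=20)
x_test = np.linspace(-1, 1, 200)
y_test = target(x_test)

results = {}
for degree in (2, 5, 10, 19, 30, 60, 100):
    coef = fit_min_norm(x_train, y_train, degree)
    results[degree] = (mse(x_train, y_train, coef), mse(x_test, y_test, coef))
    print(f"degree={degree:3d}  train={results[degree][0]:.2e}  test={results[degree][1]:.2e}")
```

The thing to look at is the test column as the degree passes 19: training error is (numerically) zero from there on, so any change in test error is coming purely from which interpolating polynomial gets picked.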

Are you aware of this work and the papers they cite?

Yeah, I am. I definitely think that stuff is good, though ideally I want something more than just “approximately K-complexity.”

Comment by evhub on Understanding “Deep Double Descent” · 2019-12-06T06:00:35.921Z · score: 4 (2 votes) · LW · GW

Ah, thanks for the summary. I hadn't fully read that paper yet, though I knew it existed and so I figured I would link it, but that makes sense. Seems like in that case the flat vs. sharp minima hypothesis still has a lot going for it, though I'm not sure how that interacts with the lottery ticket hypothesis.

Comment by evhub on Understanding “Deep Double Descent” · 2019-12-06T02:23:03.652Z · score: 2 (1 votes) · LW · GW

Thanks! And good catch—should be fixed now.

Comment by evhub on What are some non-purely-sampling ways to do deep RL? · 2019-12-05T19:28:09.040Z · score: 2 (1 votes) · LW · GW

Yep—that's the adversarial training approach to this problem. The problem is that you might not be able to sample all the relevant highly uncertain points (e.g. because you don't know exactly what the deployment distribution will be), which means you have to do some sort of relaxed adversarial training instead, which introduces its own issues.

Comment by evhub on What are some non-purely-sampling ways to do deep RL? · 2019-12-05T19:25:03.452Z · score: 4 (2 votes) · LW · GW

This is really neat; thanks for the pointer!

Comment by evhub on What are some non-purely-sampling ways to do deep RL? · 2019-12-05T01:19:24.322Z · score: 2 (1 votes) · LW · GW

Hmmm... not sure if this is exactly what I want. I'd prefer not to assume too much about the environment dynamics. I'm not sure if this is related to what you're talking about, but one possible way to do model-based planning with an explicit reward function, without assuming much about the environment dynamics, would be to learn everything needed for model-based planning in a model-free way (like MuZero) except for the reward function, and then include the reward function explicitly.
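
As a toy sketch of the shape of that idea (not MuZero, and not a real implementation): the planner below only ever touches the environment through a dynamics model, while the reward is a function we write down ourselves. All of the names here are hypothetical, and `learned_dynamics` is a hand-written stand-in for what would really be a trained network.

```python
from itertools import product

ACTIONS = (-1, 0, 1)

def learned_dynamics(state, action):
    """Stand-in for a learned transition model (hypothetical).
    In practice this would be a trained network, not a hand-written rule."""
    return state + action

def explicit_reward(state, goal=3):
    """Reward supplied explicitly rather than learned from observed rewards."""
    return -abs(state - goal)

def plan_first_action(state, horizon=3):
    """Exhaustively score every action sequence of length `horizon` under the
    dynamics model and return the first action of the best-scoring sequence."""
    best_return, best_first = float("-inf"), None
    for sequence in product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in sequence:
            s = learned_dynamics(s, a)
            total += explicit_reward(s)
        if total > best_return:
            best_return, best_first = total, sequence[0]
    return best_first

# Starting below, at, and above the goal state of 3.
print(plan_first_action(0), plan_first_action(3), plan_first_action(5))  # prints: 1 0 -1
```

The point of the split is just that nothing about the reward has to be inferred from samples. Only the transition model would need to be learned.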

Comment by evhub on Thoughts on implementing corrigible robust alignment · 2019-11-26T23:07:04.731Z · score: 5 (3 votes) · LW · GW

I really enjoyed this post; thanks for writing this! Some comments:

the AGI uses its understanding of humans to try to figure out what a human would do in a hypothetical scenario.

I think that supervised amplification can also sort of be thought of as falling into this category, in that you often want your model to be internally modeling what an HCH would do in a hypothetical scenario. Of course, if you're training a model using supervised amplification, you might not actually get a model which is in fact just trying to guess what an HCH would do, but is instead doing something more strategic and/or deceptive, though in many cases the goal at least is to try and get something that's just trying to approximate HCH.

So that suggests an approach of pre-loading this template database with a hardcoded model of a human, complete with moods, beliefs, and so on.

This is actually quite similar to an approach that Nevan Witchers at Google is working on, which is to hardcode a differentiable model of the reward function as a component in your network when doing RL. The idea there is very similar: to prevent the model from learning a proxy by giving it direct access to the actual structure of the reward function rather than having it learn based only on rewards that were observed during training. The two major difficulties I see with this style of approach, however, are that 1) it requires you to have an explicit differentiable model of the reward function and 2) it still requires the model to learn the policy and value functions (the value function being how much future discounted reward the model expects to get using its current policy starting from some state), which could still allow for the introduction of misaligned proxies.
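
To spell out the value function I'm referring to in standard RL notation (the symbols are the usual textbook ones, not anything specific to that approach: $\pi$ is the current policy, $\gamma \in [0, 1)$ the discount factor, and $r$ the reward function):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t) \;\middle|\; s_0 = s \right]
```

Even if $r$ is hardcoded, $V^{\pi}$ still depends on the environment dynamics and on $\pi$ itself, so it has to be estimated from experience, which is where a proxy could creep back in.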

Comment by evhub on Bottle Caps Aren't Optimisers · 2019-11-22T08:23:55.360Z · score: 22 (9 votes) · LW · GW

Daniel Filan's bottle cap example was featured prominently in "Risks from Learned Optimization" for good reason. I think it is a really clear and useful example of why you might want to care about the internals of an optimization algorithm and not just its behavior, and helped motivate that framing in the "Risks from Learned Optimization" paper.

Comment by evhub on Paul's research agenda FAQ · 2019-11-22T08:14:19.162Z · score: 18 (7 votes) · LW · GW

Reading Alex Zhu's Paul agenda FAQ was the first time I felt like I understood Paul's agenda in its entirety as opposed to only understanding individual bits and pieces. I think this FAQ was a major contributing factor in me eventually coming to work on Paul's agenda.

Comment by evhub on Towards a New Impact Measure · 2019-11-22T08:00:35.105Z · score: 13 (5 votes) · LW · GW

I think that the development of Attainable Utility Preservation was significantly more progress on impact measures than (at the time) I thought would ever be possible (though RR also deserves some credit here). I also think it significantly clarified my thoughts on what impact is and how instrumental convergence works.

Comment by evhub on [AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety · 2019-11-06T18:54:41.716Z · score: 6 (3 votes) · LW · GW

Asya's opinion on "Norms, Rewards, and the Intentional Stance" appears to have accidentally been replaced by Rohin's opinion on the "Ought Progress Update."

Comment by evhub on But exactly how complex and fragile? · 2019-11-05T07:40:16.168Z · score: 10 (3 votes) · LW · GW

That may be the crux. I'm generally of the mindset that "can't guarantee/verify" implies "completely useless for AI safety". Verifying that it's safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn't guarantee it beforehand or double-check afterwards, that would just be called "AI".

Surely "the whole point of AI safety research" is just to save the world, no? If the world ends up being saved, does it matter whether we were able to "verify" that or not? From my perspective, as a utilitarian, it seems to me that the only relevant question is how some particular intervention/research/etc. affects the probability of AI being good for humanity (or the EV, to be precise). It certainly seems quite useful to be able to verify lots of stuff to achieve that goal, but I think it's worth being clear that verification is an instrumental goal not a terminal one—and that there might be other possible ways to achieve that terminal goal (understanding empirical questions, for example, as Rohin wanted to do in this thread). At the very least, I certainly wouldn't go around saying that verification is "the whole point of AI safety research."