Posts

An overview of 11 proposals for building safe advanced AI 2020-05-29T20:38:02.060Z · score: 127 (43 votes)
Zoom In: An Introduction to Circuits 2020-03-10T19:36:14.207Z · score: 83 (22 votes)
Synthesizing amplification and debate 2020-02-05T22:53:56.940Z · score: 34 (13 votes)
Outer alignment and imitative amplification 2020-01-10T00:26:40.480Z · score: 28 (6 votes)
Exploring safe exploration 2020-01-06T21:07:37.761Z · score: 37 (11 votes)
Safe exploration and corrigibility 2019-12-28T23:12:16.585Z · score: 17 (8 votes)
Inductive biases stick around 2019-12-18T19:52:36.136Z · score: 50 (14 votes)
Understanding “Deep Double Descent” 2019-12-06T00:00:10.180Z · score: 108 (48 votes)
What are some non-purely-sampling ways to do deep RL? 2019-12-05T00:09:54.665Z · score: 15 (5 votes)
What I’ll be doing at MIRI 2019-11-12T23:19:15.796Z · score: 117 (36 votes)
More variations on pseudo-alignment 2019-11-04T23:24:20.335Z · score: 20 (6 votes)
Chris Olah’s views on AGI safety 2019-11-01T20:13:35.210Z · score: 141 (44 votes)
Gradient hacking 2019-10-16T00:53:00.735Z · score: 54 (16 votes)
Impact measurement and value-neutrality verification 2019-10-15T00:06:51.879Z · score: 35 (10 votes)
Towards an empirical investigation of inner alignment 2019-09-23T20:43:59.070Z · score: 43 (11 votes)
Relaxed adversarial training for inner alignment 2019-09-10T23:03:07.746Z · score: 45 (11 votes)
Are minimal circuits deceptive? 2019-09-07T18:11:30.058Z · score: 51 (12 votes)
Concrete experiments in inner alignment 2019-09-06T22:16:16.250Z · score: 63 (20 votes)
Towards a mechanistic understanding of corrigibility 2019-08-22T23:20:57.134Z · score: 36 (10 votes)
Risks from Learned Optimization: Conclusion and Related Work 2019-06-07T19:53:51.660Z · score: 65 (19 votes)
Deceptive Alignment 2019-06-05T20:16:28.651Z · score: 63 (17 votes)
The Inner Alignment Problem 2019-06-04T01:20:35.538Z · score: 71 (18 votes)
Conditions for Mesa-Optimization 2019-06-01T20:52:19.461Z · score: 59 (20 votes)
Risks from Learned Optimization: Introduction 2019-05-31T23:44:53.703Z · score: 126 (36 votes)
A Concrete Proposal for Adversarial IDA 2019-03-26T19:50:34.869Z · score: 18 (6 votes)
Nuances with ascription universality 2019-02-12T23:38:24.731Z · score: 24 (7 votes)
Dependent Type Theory and Zero-Shot Reasoning 2018-07-11T01:16:45.557Z · score: 18 (11 votes)

Comments

Comment by evhub on An overview of 11 proposals for building safe advanced AI · 2020-06-01T19:03:51.252Z · score: 4 (2 votes) · LW · GW

Yep—at least that's how I'm generally thinking about it in this post.

Comment by evhub on An overview of 11 proposals for building safe advanced AI · 2020-06-01T02:55:36.052Z · score: 6 (3 votes) · LW · GW

The way I'm using outer alignment here is to refer to outer alignment at optimum. Under that definition, optimal loss on a predictive objective should require doing something like Bayesian inference on the universal prior, making the question of outer alignment in such a case basically just the question of whether Bayesian inference on the universal prior is aligned.
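As a toy illustration of what I mean (my own sketch, with a few hard-coded hypotheses standing in for programs under the universal prior), the optimal predictor looks like a simplicity-weighted Bayesian mixture:

```python
# Toy sketch (illustrative only): prediction at optimal loss behaves like Bayesian
# inference with a simplicity-weighted prior over hypotheses. The "hypotheses" here
# are tiny hard-coded bit predictors standing in for programs under the universal prior.

hypotheses = {
    "all_zeros":   (1, lambda history: 0.0),  # (complexity in bits, P(next bit = 1))
    "all_ones":    (1, lambda history: 1.0),
    "alternating": (2, lambda history: 0.0 if history and history[-1] == 1 else 1.0),
    "fair_coin":   (3, lambda history: 0.5),
}

def posterior_predictive(history):
    """P(next bit = 1 | history) under a 2^-complexity prior."""
    weights, preds = [], []
    for complexity, predictor in hypotheses.values():
        prior = 2.0 ** -complexity
        likelihood = 1.0
        for i, bit in enumerate(history):
            p1 = predictor(history[:i])
            likelihood *= p1 if bit == 1 else (1.0 - p1)
        weights.append(prior * likelihood)
        preds.append(predictor(history))
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, preds)) / total

print(posterior_predictive([1, 0, 1, 0, 1]))  # the alternating hypothesis dominates
```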

Comment by evhub on An overview of 11 proposals for building safe advanced AI · 2020-06-01T01:20:36.316Z · score: 6 (3 votes) · LW · GW

Thanks—glad this was helpful for you! And I went through and added some more paragraph breaks—hopefully that helps improve the readability a bit.

Comment by evhub on An overview of 11 proposals for building safe advanced AI · 2020-05-31T21:36:35.015Z · score: 8 (4 votes) · LW · GW

Glad you liked the post so much!

I considered trying to make it a living document, but in the end I decided I wasn't willing to commit to spending a bunch of time updating it regularly. I do like the idea of doing another one every year, though—I think I'd be more willing to write a new version every year than try to maintain one up-to-date version at all times, especially if I had some help.

In terms of other proposals, a lot of the other options I would include in a full list would just be additional combinations of the various things already on the list—recursive reward modeling + intermittent oversight, for example—that I didn't feel like would add much to cover separately. That being said, there are also just a lot of different people out there with different ideas that I'm sure would have different versions of the proposals I've talked about.

Re intermittent oversight—I agree that it's a problem if the model suddenly realizes that it should be deceptive. In that case, I would say that even before the model realizes it should be deceptive, the fact that it will realize that makes it suboptimality deceptively aligned. Thus, to solve this problem, I think we need it to be the case that the overseer can catch suboptimality deceptive alignment, which I agree could be quite difficult. One way in which the overseer might be able to ensure that it catches suboptimality deceptive alignment, however, could be to verify that the model is myopic, as a myopic model should never conclude that deception is a good strategy.

Comment by evhub on Multi-agent safety · 2020-05-18T19:32:56.890Z · score: 6 (1 votes) · LW · GW

It seems to me like the same thing that you envision happening when you fine-tune on the CEO task is likely to happen when you train on the “follow human instructions” task. For example, if your agents initially learn some very simple motivations—self-preservation and resource acquisition, for example—before they learn the human instruction following task, it seems like there'd be a strong possibility of them then solving the human instruction following task just by learning that following human instructions will help them with self-preservation and resource acquisition rather than learning to follow human instructions as a new intrinsic motivation. Like you say in the CEO example, it's generally easier to learn an additional inference than a new fundamental goal. That being said, for what it's worth, I think the problem I'm pointing at here just is the inner alignment problem, which is to say that I don't think this is a unique problem exclusive to this proposal, though I do think it is a problem.

Comment by evhub on How to choose a PhD with AI Safety in mind · 2020-05-15T23:46:00.918Z · score: 18 (6 votes) · LW · GW

Hi Ariel—I'm not sure if I'm the best person to weigh in on this, since I opted to go straight to OpenAI after completing my undergrad rather than pursue a PhD (and am now at MIRI), but I'm happy to schedule a time to talk to you if you'd be interested. I've also written a couple of different posts on possible concrete ML experiments relevant to AI safety that I think might be exciting for somebody in your position to work on if you'd be interested in chatting about any of those.

Comment by evhub on Covid-19: Comorbidity · 2020-05-13T07:08:03.032Z · score: 3 (2 votes) · LW · GW

It looks like there's actually some evidence that asthma isn't that bad either. I suspect the reason is that a lot of the deaths among young people aren't due to respiratory distress but rather blood clotting issues, which squares well with comorbidities like hypertension and obesity rather than COPD and asthma being the most dangerous.

Comment by evhub on Deminatalist Total Utilitarianism · 2020-04-17T18:33:26.766Z · score: 6 (3 votes) · LW · GW

Ah, I see—I missed the term out in front, that makes more sense. In that case, my normal reaction would be that you're penalizing simulation pausing, though if you use subjective age and gradually identify unique personhood, then I agree that you can get around that. Though that seems to me like a bit of a hack—I feel like the underlying thing that you really want there is variety of happy experience, so you should just be rewarding variety of experience directly rather than trying to use some sort of continuous uniqueness measure.

Comment by evhub on Deminatalist Total Utilitarianism · 2020-04-17T04:18:47.481Z · score: 6 (3 votes) · LW · GW

If I'm running a simulation of a bunch of happy humans, it's entirely possible for me to completely avoid your penalty term just by turning the simulation off and on again every so often to reset all of the penalty terms. And if that doesn't count because they're the same exact human, I can just make tiny modifications to each person that negate whatever procedure you're using to uniquely identify individual humans. That seems like a really weird thing to morally mandate that people do, so I'm inclined to reject this theory.

Furthermore, I think the above case generalizes to imply that killing someone and then creating an entirely different person with equal happiness is morally positive under this framework, which goes against a lot of the things you say in the post. Specifically:

It avoids the problem with both totalism and averagism that killing a person and creating a different person with equal happiness is morally neutral.

It seems to do so in the opposite direction that I think you want it to.

It captures the intuition many people have that the bar for when it's good to create a person is higher than the bar for when it's good not to kill one.

I think this is just wrong, as, like I said, it incentivizes killing people and replacing them with other people to reset their penalty terms.


I do agree that whatever measure of happiness you use should include the extent to which somebody is bored, or tired of life, or whatnot. That being said, I'm personally of the opinion that killing someone and creating a new person with equal happiness is morally neutral. I think one of the strongest arguments in favor of that position is that turning a simulation off and then on again is the only case I can think of where you can actually do that without any other consequences, and that seems quite morally neutral to me. Thus, personally, I continue to favor Solomonoff-measure-weighted total hedonic utilitarianism.

Comment by evhub on Is this viable physics? · 2020-04-14T22:38:42.384Z · score: 40 (19 votes) · LW · GW

First of all, I'm very unsurprised that you can get special and general relativity out of something like this. Relativity fundamentally just isn't that complicated and you can see what are basically relativistic phenomena pop out of all sorts of natural setups where you have some sort of space with an emergent distance metric.

The real question is how this approach handles quantum mechanics. The fact that causal graph updates produce branching structure that's consistent with quantum mechanics is nice—and certainly suggestive that graphs could form a nice underlying substrate for quantum field theory (which isn't really new; I would have told you that before reading this)—but it's not a solution in and of itself. And again what the article calls “branchial space” does look vaguely like what you want out of Hilbert space on top of an underlying graph substrate. And it's certainly nice that it connects entanglement to distance, but again that was already theorized to be true in ER = EPR. Beyond that, though, it doesn't seem to really have all that much additional content—the best steelman I can give is that it's saying “hey, graphs could be a really good underlying substrate for QFT,” which I agree with, but isn't really all that new, and leaves the bulk of the work still undone.

That being said—credit where credit is due—I think this is in fact working on what is imo the “right problem” to be working on if you want to find an actual theory of everything. And it's certainly nice to have more of the math worked out for quantum mechanics on top of graphs. But beyond that I don't think this really amounts to much yet other than being pointed in the right direction (which does make it promising in terms of potentially producing real results eventually, even if doesn't have them right now).

TL;DR: This looks fairly pointed in the right direction to me but not really all that novel.

EDIT 1: If you're interested in some of the existing work on quantum mechanics on top of graphs, Sean Carroll wrote up a pretty accessible explanation of how that could work in this 2016 post (which also does a good job of summarizing what is basically my view on the subject).

EDIT 2: It looks like Scott Aaronson has a proof that a previous version of Wolfram's graph stuff is incompatible with quantum mechanics—if you really want to figure out how legit this stuff is I'd probably recommend taking a look at that and determining whether it still applies to this version.

Comment by evhub on Three Kinds of Competitiveness · 2020-03-31T19:11:02.672Z · score: 4 (2 votes) · LW · GW

It's interesting how Paul advocates merging cost and performance-competitiveness, and you advocate merging performance and date-competitiveness.

Also I advocated merging cost and date competitiveness (into training competitiveness), so we have every combination covered.

Comment by evhub on Three Kinds of Competitiveness · 2020-03-31T07:38:12.373Z · score: 15 (4 votes) · LW · GW

In the context of prosaic AI alignment, I've recently taken to splitting up competitiveness into “training competitiveness” and “objective competitiveness,”[1] where training competitiveness refers to the difficulty of training the system to succeed at its objective and objective competitiveness refers to the usefulness of a system that succeeds at that objective. I think my training competitiveness broadly maps onto a combination of your cost and date competitiveness and my objective competitiveness broadly maps onto your performance competitiveness. I think I mildly like my dichotomy better than your trichotomy in terms of thinking about prosaic AI alignment schemes, as I think it provides a better picture of the specific parts of a prosaic AI alignment proposal that are helping or hindering its overall competitiveness—e.g. if it's not very objective competitive, that tells you that you need a stronger objective, and if it's not very training competitive, that tells you that you need a better training process (it's also nice in terms of mirroring the inner/outer alignment distinction). That being said, your trichotomy is certainly more general in terms of applying to things that aren't just prosaic AI alignment.


  1. Objective competitiveness isn't a great term, though, since it can be misread as the opposite of subjective competitiveness—perhaps I'll switch now to using performance competitiveness instead. ↩︎

Comment by evhub on How important are MDPs for AGI (Safety)? · 2020-03-27T06:28:54.074Z · score: 2 (1 votes) · LW · GW

https://en.wikipedia.org/wiki/Temporal_difference_learning
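In case a minimal code sketch is more useful than the article, here's a toy TD(0) value-estimation loop (my own illustration of the linked method, with a made-up two-state environment):

```python
import random

def td0_value_estimation(env_step, states, episodes=2000, alpha=0.1, gamma=0.99):
    """env_step(s) -> (next_state, reward, done); returns estimated state values."""
    V = {s: 0.0 for s in states}
    for _ in range(episodes):
        s = random.choice(states)
        done = False
        while not done:
            s_next, r, done = env_step(s)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])  # TD error update
            s = s_next
    return V

# Hypothetical two-state chain just to make the sketch runnable.
def chain_step(s):
    if s == "A":
        return "B", 0.0, False
    return "B", 1.0, True  # from B, the episode ends with reward 1

print(td0_value_estimation(chain_step, ["A", "B"]))
```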

Comment by evhub on Zoom In: An Introduction to Circuits · 2020-03-10T21:33:44.946Z · score: 11 (4 votes) · LW · GW

I think for the remaining 5% to be hiding really big important stuff like the presence of optimization (which is to say, mesa-optimization) or deceptive cognition, it has to be the case that there was adversarial obfuscation (e.g. gradient hacking). Of course, I'm only hypothesizing here, but it seems quite unlikely for that sort of stuff to just be randomly obfuscated.

Given that assumption, I think it's possible to translate 95% transparency into a safety guarantee: just use your transparency to produce a consistent gradient away from deception such that your model never becomes deceptive in the first place and thus never does any sort of adversarial obfuscation.[1] I suspect that the right way to do this is to use your transparency tools to enforce some sort of simple condition that you are confident in rules out deception such as myopia. For more context, see my comment here and the full “Relaxed adversarial training for inner alignment” post.


  1. It is worth noting that this does introduce the possibility of getting obfuscation by overfitting the transparency tools, though I suspect that that sort of overfitting-style obfuscation will be significantly easier to deal with than actively adversarial obfuscation by a deceptive mesa-optimizer. ↩︎
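Concretely, the training loop I have in mind looks something like the following sketch (purely illustrative—`myopia_violation_score` is a stand-in for whatever condition the transparency tools actually check, not a real API):

```python
import torch
import torch.nn.functional as F

def training_step(model, batch, optimizer, transparency_tool, lam=1.0):
    inputs, targets = batch
    task_loss = F.cross_entropy(model(inputs), targets)
    # Assumed interface: a scalar tensor measuring how far the model is from
    # satisfying the simple condition (e.g. myopia) believed to rule out deception.
    penalty = transparency_tool.myopia_violation_score(model, inputs)
    loss = task_loss + lam * penalty  # consistent gradient away from deception
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return task_loss.item(), penalty.item()
```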

Comment by evhub on Coronavirus: Justified Practical Advice Thread · 2020-03-06T23:53:40.688Z · score: 6 (3 votes) · LW · GW

Do you have any thoughts on where to buy a BiPAP and a capnometer? Can you get them without a prescription? Are they sold on Amazon? If you or anyone else manages to get this to work (or even just starts buying supplies for it), I'd love to know where they obtained all their supplies and what they ended up needing.

Comment by evhub on Coronavirus: Justified Practical Advice Thread · 2020-03-05T21:23:57.386Z · score: 15 (8 votes) · LW · GW

I think you should try to get antibiotics, antivirals, and/or antifungals for secondary infections in case hospitals are full and you need to treat yourself. According to this study, “When populations with low immune function, such as older people, diabetics, people with HIV infection, people with long-term use of immunosuppressive agents, and pregnant women, are infected with 2019-nCoV, prompt administration of antibiotics to prevent infection and strengthening of immune support treatment might reduce complications and mortality.” Regarding the treatment people in Wuhan were given, the study says:

Most patients were given antibiotic treatment (table 2); 25 (25%) patients were treated with a single antibiotic and 45 (45%) patients were given combination therapy. The antibiotics used generally covered common pathogens and some atypical pathogens; when secondary bacterial infection occurred, medication was administered according to the results of bacterial culture and drug sensitivity. The antibiotics used were cephalosporins, quinolones, carbapenems, tigecycline against methicillin-resistant Staphylococcus aureus, linezolid, and antifungal drugs. The duration of antibiotic treatment was 3–17 days (median 5 days [IQR 3–7]). 19 (19%) patients were also treated with methylprednisolone sodium succinate, methylprednisolone, and dexamethasone for 3–15 days (median 5 [3–7]).

I think this sort of treatment might be one of the biggest factors in lower mortality for people with access to hospitals, so I suspect that getting your hands on some prescription antibiotics beforehand could be quite valuable. Some of the pharmacies that Wei Dai recommends here could be good bets, though I'm still currently trying to figure out what the best way is to do this—if anyone has any ideas let me know.

Comment by evhub on Towards a mechanistic understanding of corrigibility · 2020-03-02T19:32:12.810Z · score: 2 (1 votes) · LW · GW

I don't think there's really a disagreement there—I think what Paul's saying is that he views corrigibility as the right way to get an acceptability guarantee.

Comment by evhub on Coronavirus: Justified Practical Advice Thread · 2020-02-29T22:09:44.745Z · score: 2 (1 votes) · LW · GW

How did you order this without a prescription? When I went to order from the second link it asked for a prescription which I don't have.

Comment by evhub on What are the merits of signing up for cryonics with Alcor vs. with the Cryonics Institute? · 2020-02-28T00:27:11.215Z · score: 6 (3 votes) · LW · GW

What were the results from this survey? And what conclusion if any did you come to?

Comment by evhub on At what point should CFAR stop holding workshops due to COVID-19? · 2020-02-25T22:37:59.484Z · score: 18 (9 votes) · LW · GW

The CDC is currently warning that pandemic COVID-19 in the U.S. is likely and is moving its focus from prevention to mitigation. Specifically, the CDC has said that while it is “continuing to hope that we won't see [community] spread,” the current goal is “that our measures give us extra time to prepare.” Once spread within the U.S. is confirmed, the CDC has noted that mitigation measures will likely include “social distancing, school closures, canceling mass gatherings, [...] telemedicine, teleschooling, [and] teleworking.” As CFAR workshops certainly seem to fall into the “mass gatherings” category, the current guidance from the CDC seems to imply that they should be canceled once U.S. spread is confirmed and mitigation measures such as social distancing and school closures start to be announced.

Comment by evhub on Does iterated amplification tackle the inner alignment problem? · 2020-02-15T19:24:38.032Z · score: 13 (6 votes) · LW · GW

You are correct that amplification is primarily a proposal for how to solve outer alignment, not inner alignment. That being said, Paul has previously talked about how you might solve inner alignment in an amplification-style setting. For an up-to-date, comprehensive analysis of how to do something like that, see “Relaxed adversarial training for inner alignment.”

Comment by evhub on What is the difference between robustness and inner alignment? · 2020-02-15T19:14:37.245Z · score: 15 (6 votes) · LW · GW

This is a good question. Inner alignment definitely is meant to refer to a type of robustness problem—it's just also definitely not meant to refer to the entirety of robustness. I think there are a couple of different levels on which you can think about exactly what subproblem inner alignment is referring to.

First, the definition that's given in “Risks from Learned Optimization”—where the term inner alignment comes from—is not about competence vs. intent robustness, but is directly about the objective that a learned search algorithm is searching for. Risks from Learned Optimization broadly takes the position that though it might not make sense to talk about learned models having objectives in general, it certainly makes sense to talk about a model having an objective if it is internally implementing a search process, and argues that learned models internally implementing search processes (which the paper calls mesa-optimizers) could be quite common. I would encourage reading the full paper to get a sense of how this sort of definition plays out.

Second, that being said, I do think that the competence vs. intent robustness framing that you mention is actually a fairly reasonable one. “2-D Robustness” presents the basic picture here, though in terms of a concrete example of what robust capabilities without robust alignment could actually look like, I am somewhat partial to my maze example. I think the maze example in particular presents a very clear story for how capability and alignment robustness can come apart even for agents that aren't obviously running a search process. The 2-D robustness distinction is also the subject of this alignment newsletter, which I'd also highly recommend taking a look at, as it has some more commentary on thinking about this sort of a definition as well.
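As a rough sketch of how I'd operationalize the 2-D robustness picture empirically (the helper functions here are hypothetical, just to illustrate the two axes):

```python
# Hypothetical evaluation sketch for something like the maze example: measure capability
# and alignment separately under distribution shift. `solves_maze` and
# `reaches_intended_goal` are assumed helpers, not real APIs.
def two_d_robustness(agent, off_distribution_mazes, solves_maze, reaches_intended_goal):
    n = len(off_distribution_mazes)
    capable = sum(solves_maze(agent, maze) for maze in off_distribution_mazes)
    aligned = sum(reaches_intended_goal(agent, maze) for maze in off_distribution_mazes)
    # Robust capabilities without robust alignment shows up as a high first number
    # together with a low second number.
    return {"capability_robustness": capable / n, "alignment_robustness": aligned / n}
```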

Comment by evhub on Bayesian Evolving-to-Extinction · 2020-02-15T18:35:41.514Z · score: 5 (3 votes) · LW · GW

If that ticket is better at predicting the random stuff it's writing to the logs—which it should be if it's generating that randomness—then that would be sufficient. However, that does rely on the logs directly being part of the prediction target rather than only affecting it through some complicated function like a human seeing them.

Comment by evhub on Bayesian Evolving-to-Extinction · 2020-02-15T00:46:34.225Z · score: 9 (5 votes) · LW · GW

There is also the "lottery ticket hypothesis" to consider (discussed on LW here and here) -- the idea that a big neural network functions primarily like a bag of hypotheses, not like one hypothesis which gets adapted toward the right thing. We can imagine different parts of the network fighting for control, much like the Bayesian hypotheses.

This is a fascinating point. I'm curious now how bad things can get if your lottery tickets have side channels but aren't deceptive. It might be that the evolving-to-extinction policy of making the world harder to predict through logs is complicated enough that it can only emerge through a deceptive ticket deciding to pursue it—or it could be the case that it's simple enough that one ticket could randomly start writing stuff to logs, get selected for, and end up pursuing such a policy without ever actually having come up with it explicitly. This seems likely to depend on how powerful your base optimization process is and how easy it is to influence the world through side-channels. If it's the case that you need deception, then this probably isn't any worse than the gradient hacking problem (though possibly it gives us more insight into how gradient hacking might work)—but if it can happen without deception, then this sort of evolving-to-extinction behavior could be a serious problem in its own right.

Comment by evhub on Synthesizing amplification and debate · 2020-02-10T20:00:24.672Z · score: 4 (2 votes) · LW · GW

Yep; that's basically how I'm thinking about this. Since I mostly want this process to limit to amplification rather than debate, I'm not that worried about the debate equilibrium not being exactly the same, though in most cases I expect the limits to coincide, such that you can in fact recover the debate equilibrium if you anneal towards debate.

Comment by evhub on Synthesizing amplification and debate · 2020-02-09T17:52:43.539Z · score: 4 (2 votes) · LW · GW

The basic debate RL setup is meant to be unchanged here—when I say “the RL reward derived from ” I mean that in the zero-sum debate game sense. So you're still using self-play to converge on the Nash in the situation where you anneal towards debate, and otherwise you're using that self-play RL reward as part of the loss and the supervised amplification loss as the other part.
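To be a bit more concrete about the loss I have in mind, here's a minimal sketch (my own illustration, not code from the post): a supervised amplification loss and the zero-sum debate RL loss, mixed by a coefficient you can anneal in either direction.

```python
def combined_loss(amp_supervised_loss, debate_rl_loss, step, anneal_steps,
                  towards_debate=False):
    """Mix the supervised amplification loss with the self-play debate RL loss.

    alpha is the weight on the debate term; annealing it to 0 limits to pure
    amplification, annealing it to 1 limits to pure debate.
    """
    frac = min(1.0, step / anneal_steps)
    alpha = frac if towards_debate else (1.0 - frac)
    return (1.0 - alpha) * amp_supervised_loss + alpha * debate_rl_loss
```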

Are the arguments the same thing as answers?

The arguments should include what each debater thinks the answer to the question should be.

I think yours is aiming at the second and not the first?

Yep.

Comment by evhub on Synthesizing amplification and debate · 2020-02-06T23:09:40.964Z · score: 2 (1 votes) · LW · GW

It shouldn't be, since it's just a function argument here—and I was imagining that including a variable in a question meant it was embedded such that the question-answerer has access to it, but perhaps I should have made that more clear.

Comment by evhub on Outer alignment and imitative amplification · 2020-02-04T06:18:35.451Z · score: 2 (1 votes) · LW · GW

That's a good point. What I really mean is that I think the sort of HCH that you get out of taking actual humans and giving them careful instructions is more likely to be uncompetitive than it is to be unaligned. Also, I think that “HCH for a specific H” is more meaningful than “HCH for a specific level of competitiveness,” since we don't really know what weird things you might need to do to produce an HCH with a given level of competitiveness.

Comment by evhub on Outer alignment and imitative amplification · 2020-02-03T21:19:11.155Z · score: 2 (1 votes) · LW · GW

Another thing that maybe I didn't make clear previously:

I believe the point about Turing machines was that given Low Bandwidth Overseer, it's not clear how to get HCH/IA to do complex tasks without making it instantiate arbitrary Turing machines.

I agree, but if you're instructing your humans not to instantiate arbitrary Turing machines, then that's a competitiveness claim, not an alignment claim. I think there are lots of very valid reasons for thinking that HCH is not competitive—I only said I was skeptical of the reasons for thinking it wouldn't be aligned.

Comment by evhub on The Epistemology of AI risk · 2020-01-30T20:14:48.980Z · score: 8 (2 votes) · LW · GW

I feel like you are drawing the wrong conclusion from the shift in arguments that has occurred. I would argue that what look like wrong ideas that ended up not contributing to future research could actually have been quite necessary for progressing the field's understanding as a whole. That is, maybe we really needed to engage with utility functions first before we could start breaking down that assumption—or maybe optimization daemons were a necessary step towards understanding mesa-optimization. Thus, I don't think the shift in arguments at all justifies the conclusion that prior work wasn't very helpful, as the prior work could have been necessary to achieve that very shift.

Comment by evhub on The Epistemology of AI risk · 2020-01-28T07:11:56.258Z · score: 18 (5 votes) · LW · GW

The argument that continuous takeoff makes AI safe seems robust to most specific items on your list, though I can see several ways that the argument fails.

I feel like this depends on a whole bunch of contingent facts regarding our ability to accurately diagnose and correct what could be very pernicious problems such as deceptive alignment amidst what seems quite likely to be a very quickly changing and highly competitive world.

It seems even harder to do productive work, since I'm skeptical of very short timelines.

Why does being skeptical of very short timelines preclude our ability to do productive work on AI safety? Surely there are things we can be doing now to gain insight, build research/organizational capacity, etc. that will at least help somewhat, no? (And it seems to me like “probably helps somewhat” is enough when it comes to existential risk.)

Comment by evhub on Have epistemic conditions always been this bad? · 2020-01-25T23:45:07.259Z · score: 26 (10 votes) · LW · GW

First, as someone who just (class of 2019) graduated college at a very liberal, highly regarded, private U.S. institution, the description above definitely does not match my experience. In my experience, I found that dissenting opinions and avid discussion were highly encouraged. That being said, I suspect Mudd may be particularly good on that axis due to factors such as being entirely STEM-focused (also Debra Mashek was one of my professors).

Second, I think it is worth pointing out that there are definitely instances where, at least in my opinion, “canceling” is a valid tactic. Deplatforming violent rhetoric (e.g. Nazism, Holocaust denial, etc.) comes to mind as an obvious example.

Third, that being said, I do think there is a real problem along the lines of what you're pointing at. For example, one thing I saw recently was what's been happening to Natalie Wynn, a YouTuber who goes by the name “ContraPoints.” She's a very popular leftist YouTuber who mainly talks about various left-wing social issues, particularly transgender issues (she herself is transgender). In one of her recent videos, she cast a transgender man named Buck Angel as a voice actor for part of it, and people (mostly on Twitter) got extremely upset at her because Buck Angel had at one point previously said something that maybe possibly could be interpreted as anti-non-binary-people. I think that Natalie's recent video responding to her “canceling” is probably the best analysis of the whole phenomenon that I've seen, and aligns pretty well with my views on the topic, though it's quite long.

There are a lot of things about Natalie's canceling that give me hope, though. First, it seemed like her canceling was highly concentrated on Twitter, which makes a lot of sense to me—I tend to think that it's almost impossible to have good discourse in any sort of combative/argumentative setting, especially when it's online, and especially when everyone is limited only to tiny tweets, which lend themselves particularly well to snarky quippy one-liners without any actual real substance. Second, it was really only a fringe group of people canceling her—it's just that the people who were doing it were very loud, which again strikes me as exactly the sort of thing that is highly exacerbated by the internet, and especially by Twitter. Third, I think there's a real movement on the left towards rejecting this sort of thing—I think Natalie is a good example of a very public leftist strongly rejecting “cancel culture,” though I've met lots of other die-hard leftists who think similarly while I was in college. There are a lot of really smart people on the left and I think it's quite reasonable to expect that this will broadly get better over time—especially if people move to better forms of online discourse than Twitter (or Facebook, which I also think is pretty bad). YouTube and Reddit, though, are mainstream platforms that I think produce significantly better discourse than Twitter, so I do think there's hope there.

Comment by evhub on Exploring safe exploration · 2020-01-16T20:58:02.973Z · score: 14 (4 votes) · LW · GW

Hey Aray!

Given this, I think the "within-episode exploration" and "across-episode exploration" relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call "safe exploration".

I agree with this. I jumped the gun a bit in not really making the distinction clear in my earlier post “Safe exploration and corrigibility,” but I think that made it a bit confusing, so I went heavy on the distinction here—but perhaps more heavy than I actually think is warranted.

The problem I have with relaxing within-episode and across-episode exploration into each other, though, is precisely the problem I describe in “Safe exploration and corrigibility,” which is that by default you only end up with capability exploration not objective exploration—that is, an agent with a goal (i.e. a mesa-optimizer) is only going to explore to the extent that it helps its current goal, not to the extent that it helps it change its goal to be more like the desired goal. Thus, you need to do something else (something that possibly looks somewhat like corrigibility) to get the agent to explore in such a way that helps you collect data on what its goal is and how to change it.

Comment by evhub on Malign generalization without internal search · 2020-01-13T23:37:32.962Z · score: 2 (1 votes) · LW · GW

I don't feel like you're really understanding what I'm trying to say here. I'm happy to chat with you about this more over video call or something if you're interested.

Comment by evhub on Malign generalization without internal search · 2020-01-12T19:40:25.172Z · score: 6 (1 votes) · LW · GW

I think that piecewise objectives are quite reasonable and natural—and I don't think they'll make transparency that much harder. I don't think there's any reason that we should expect objectives to be continuous in some nice way, so I fully expect you'll get these sorts of piecewise jumps. Nevertheless, the resulting objective in the piecewise case is still quite simple such that you should be able to use interpretability tools to understand it pretty effectively—a switch statement is not that complicated or hard to interpret—with most of the real hard work still primarily being done in the optimization.

I do think there are a lot of possible ways in which the interpretability for mesa-optimizers story could break down—which is why I'm still pretty uncertain about it—but I don't think that a switch-case agent is such an example. Probably the case that I'm most concerned about right now is if you get an agent which has an objective which changes in a feedback loop with its optimization. If the objective and the optimization are highly dependent on each other, then I think that would make the problem a lot more difficult—and is the sort of thing that humans seem to do, which suggests that it's the sort of thing we might see in AI systems as well. On the other hand, a fixed switch-case objective is pretty easy to interpret, since you just need to understand the simple, fixed heuristics being used in the switch statement and then you can get a pretty good grasp on what your agent's objective is. Where I start to get concerned is when those switch statements themselves depend upon the agent's own optimization—a recursion which could possibly be many layers deep and quite difficult to disentangle. That being said, even in such a situation you're still using search to get your robust capabilities.

Comment by evhub on Malign generalization without internal search · 2020-01-12T19:10:12.325Z · score: 3 (2 votes) · LW · GW

Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as "Prove this theorem: [...]" or "Compute the fastest route to Paris." We then imagine that this agent still shares the basic superstructure of the pseudocode I gave initially above.

I feel like what you're describing here is just optimization where the objective is determined by a switch statement, which certainly seems quite plausible to me but also pretty neatly fits into the mesa-optimization framework.
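As a quick sketch of what I mean by that (the names and heuristics here are purely illustrative):

```python
# Toy sketch of "optimization where the objective is determined by a switch statement":
# fixed, simple heuristics select which objective the same search procedure optimizes.
def switch_case_objective(observation):
    if observation["lava_nearby"]:
        return lambda outcome: -outcome["distance_to_lava"]      # stay away from lava
    elif observation["battery_low"]:
        return lambda outcome: -outcome["distance_to_charger"]   # go recharge
    else:
        return lambda outcome: outcome["task_reward"]            # pursue the main task

def act(observation, candidate_plans, simulate):
    objective = switch_case_objective(observation)
    # The robust capabilities still come from search over plans; only the objective
    # being searched for is selected by the switch statement.
    return max(candidate_plans,
               key=lambda plan: objective(simulate(plan, observation)))
```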

More generally, while I certainly buy that you can produce simple examples of things that look kinda like capability generalization without objective generalization on environments like the lunar lander or my maze example, it still seems to me like you need optimization to actually get capabilities that are robust enough to pose a serious risk, though I remain pretty uncertain about that.

Comment by evhub on Outer alignment and imitative amplification · 2020-01-11T22:46:36.483Z · score: 2 (1 votes) · LW · GW

Is "outer alignment" meant to be applicable in the general case?

I'm not exactly sure what you're asking here.

Do you think it also makes sense to talk about outer alignment of the training process as a whole, so that for example if there is a security hole in the hardware or software environment and the model takes advantage of the security hole to hack its loss/reward, then we'd call that an "outer alignment failure".

I would call that an outer alignment failure, but only because I would say that the ways in which your loss function can be hacked are part of the specification of your loss function. However, I wouldn't consider an entire training process to be outer aligned—rather, I would just say that an entire training process is aligned. I generally use outer and inner alignment to refer to different components of aligning the training process—namely the objective/loss function/environment in the case of outer alignment and the inductive biases/architecture/optimization procedure in the case of inner alignment (though note that this is a more general definition than the one used in “Risks from Learned Optimization,” as it makes no mention of mesa-optimizers, though I would still say that mesa-optimization is my primary example of how you could get an inner alignment failure).

So technically, one should say that a loss function is outer aligned at optimum with respect to some model class, right?

Yes, though in the definition I gave here I just used the model class of all functions, which is obviously too large but has the nice property of being a fully general definition.

Also, related to Ofer's comment, can you clarify whether it's intended for this definition that the loss function only looks at the model's input/output behavior, or can it also take into account other information about the model?

I would include all possible input/output channels in the domain/codomain of the model when interpreted as a function.

I'm also curious whether you have HBO or LBO in mind for this post.

I generally think you need HBO and am skeptical that LBO can actually do very much.

Comment by evhub on Outer alignment and imitative amplification · 2020-01-10T05:29:18.738Z · score: 8 (4 votes) · LW · GW

I think I'm quite happy even if the optimal model is just trying to do what we want. With imitative amplification, the true optimum—HCH—still has benign failures, but I nevertheless want to argue that it's aligned. In fact, I think this post really only makes sense if you adopt a definition of alignment that excludes benign failures, since otherwise you can't really consider HCH aligned (and thus can't consider imitative amplification outer aligned at optimum).

Comment by evhub on Exploring safe exploration · 2020-01-07T08:41:41.723Z · score: 2 (1 votes) · LW · GW

Like I said in the post, I'm skeptical that “preventing the agent from making an accidental mistake” is actually a meaningful concept (or at least, it's a concept with many possible conflicting definitions), so I'm not sure how to give an example of it.

Comment by evhub on Exploring safe exploration · 2020-01-06T23:56:15.253Z · score: 6 (3 votes) · LW · GW

I definitely was not arguing that. I was arguing that safe exploration is currently defined in ML as preventing the agent from making an accidental mistake, and that we should really not be having terminology collisions with ML. (I may have left that second part implicit.)

Ah, I see—thanks for the correction. I changed “best” to “current.”

I assume that the difference you see is that you could try to make across-episode exploration less detrimental from the agent's perspective

No, that's not what I was saying. When I said “reward acquisition” I meant the actual reward function (that is, the base objective).

EDIT:

That being said, it's a little bit tricky in some of these safe exploration setups to draw the line between what's part of the base objective and what's not. For example, I would generally include the constraints in constrained optimization setups as just being part of the base objective, only specified slightly differently. In that context, constrained optimization is less of a safe exploration technique and more of a reward-engineering-y/outer alignment sort of thing, though it also has a safe exploration component to the extent that it constrains across-episode exploration.
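As a quick illustrative sketch of what I mean by folding the constraint into the base objective (my own example, not from any particular paper):

```python
# Folding a constraint into the base objective as a Lagrangian-style penalty. From the
# trained agent's point of view this is just a differently-specified reward, which is
# why I'd mostly file constrained optimization under reward engineering / outer
# alignment rather than safe exploration.
def constrained_reward(task_reward, constraint_violation, lam=10.0):
    return task_reward - lam * max(0.0, constraint_violation)
```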

Note that when across-episode exploration is learned, the distinction between safe exploration and outer alignment becomes even more muddled, since then all the other terms in the loss will implicitly serve to check the across-episode exploration term, as the agent has to figure out how to trade off between them.[1]


  1. This is another one of the points I was trying to make in “Safe exploration and corrigibility” but didn't do a great job of conveying properly. ↩︎

Comment by evhub on Safe exploration and corrigibility · 2019-12-29T05:11:28.819Z · score: 5 (3 votes) · LW · GW

I completely agree with the distinction between across-episode vs. within-episode exploration, and I agree I should have been clearer about that. Mostly I want to talk about across-episode exploration here, though when I was writing this post I was mostly motivated by the online learning case where the distinction is in fact somewhat blurred, since in an online learning setting you do in fact need the deployment policy to balance between within-episode exploration and across-episode exploration.

Usually (in ML) "safe exploration" means "the agent doesn't make a mistake, even by accident"; ϵ-greedy exploration wouldn't be safe in that sense, since it can fall into traps. I'm assuming that by "safe exploration" you mean "when the agent explores, it is not trying to deceive us / hurt us / etc".

Agreed. My point is that “If you assume that the policy without exploration is safe, then for ϵ-greedy exploration to be safe on average, it just needs to be the case that the environment is safe on average, which is just a standard engineering question.” That is, even though it seems like it's hard for ϵ-greedy exploration to be safe, it's actually quite easy for it to be safe on average—you just need to be in a safe environment. That's not true for learned exploration, though.
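For concreteness, the kind of exploration I have in mind is just the standard ϵ-greedy perturbation (a minimal sketch):

```python
import random

# The deployed behavior is just the (assumed-safe) policy's action, perturbed by a
# uniformly random action with probability ϵ -- so safety "on average" reduces to the
# environment being safe under that small random perturbation.
def epsilon_greedy_action(policy_action, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)  # exploration step
    return policy_action
```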

Since by default policies can't affect across-episode exploration, I assume you're talking about within-episode exploration. But this happens all the time with current RL methods

Yeah, I agree that was confusing—I'll rephrase it. The point I was trying to make was that across-episode exploration should arise naturally, since an agent with a fixed objective should want to be modified to better pursue that objective, but not want to be modified to pursue a different objective.

This sounds to me like reward uncertainty, assistance games / CIRL, and more generally Stuart Russell's agenda, except applied to mesa optimization now. Should I take away something other than "we should have our mesa optimizers behave like the AIs in assistance games"? I feel like you are trying to say something else but I don't know what.

Agreed that there's a similarity there—that's the motivation for calling it “cooperative.” But I'm not trying to advocate for that agenda here—I'm just trying to better classify the different types of corrigibility and understand how they work. In fact, I think it's quite plausible that you could get away without cooperative corrigibility, though I don't really want to take a stand on that right now.

I thought we were talking about "the agent doesn't try to deceive us / hurt us by exploring", which wouldn't tell us anything about the problem of "the agent doesn't make an accidental mistake".

If your definition of “safe exploration” is “not making accidental mistakes” then I agree that what I'm pointing at doesn't fall under that heading. What I'm trying to point at is that I think there are other problems that we need to figure out regarding how models explore than just the “not making accidental mistakes” problem, though I have no strong feelings about whether or not to call those other problems “safe exploration” problems.

The same way as capability exploration; based on value of information (VoI). (I assume you have a well-specified distribution over objectives; if you don't, then there is no proper way to do it, in the same way there's no proper way to do capability exploration without a prior over what you might see when you take the new action.)

Agreed, though I don't think that's the end of the story. In particular, I don't think it's at all obvious what an agent that cares about the value of information that its actions produce relative to some objective distribution will look like, how you could get such an agent, or how you could verify when you had such an agent. And, even if you could do those things, it still seems pretty unclear to me what the right distribution over objectives should be and how you should learn it.

The algorithms used are not putting dampers on exploration; they are trying to get the agent to do better exploration (e.g. if you crashed into the wall and saw that that violated a constraint, don't crash into the wall again just because you forgot about that experience).

Well, what does “better exploration” mean? Better across-episode exploration or better within-episode exploration? Better relative to the base objective or better relative to the mesa-objective? I think it tends to be “better within-episode exploration relative to the base objective,” which I would call putting a damper on instrumental exploration, which does across-episode and within-episode exploration only for the mesa-objective, not the base objective.

If you have the right uncertainty, then acting optimally to maximize that is the "right" thing to do.

Sure, but as you note getting the right uncertainty could be quite difficult, so for practical purposes my question is still unanswered.

Comment by evhub on Inductive biases stick around · 2019-12-26T08:15:23.518Z · score: 4 (2 votes) · LW · GW

I just edited the last sentence to be clearer in terms of what I actually mean by it.

Comment by evhub on [AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison · 2019-12-26T08:11:46.269Z · score: 2 (1 votes) · LW · GW

To be clear, I broadly agree that AGI will be quite underparameterized, but still maintain that double descent demonstrates something—that larger models can do better by being simpler not just by fitting more data—that I think is still quite important.

Comment by evhub on Free Speech and Triskaidekaphobic Calculators: A Reply to Hubinger on the Relevance of Public Online Discussion to Existential Risk · 2019-12-21T06:41:59.683Z · score: 15 (7 votes) · LW · GW

I'm not really interested in debating this on LessWrong, for basically the exact reasons that I stated in the first place, which is that I don't really think these sorts of conversations can be done effectively online. Thus, I probably won't try to respond to any replies to this comment.

At the very least, though, I think it's worth clarifying that my position is certainly not "assume what you're doing is the most important thing and run with it." Rather, I think that trying to think really hard about the most important things to be doing is an incredibly valuable exercise, and I think the effective altruism community provides a great model of how I think that should be done. The only thing I was advocating was not discussing hot-button political issues specifically online. I think to the extent that those sorts of things are relevant to doing the most good, they should be done offline, where the quality of the discussion can be higher and nobody ends up tainted by other people's beliefs by association.

Comment by evhub on Inductive biases stick around · 2019-12-20T19:11:48.336Z · score: 2 (1 votes) · LW · GW

What double descent definitely says is that for a fixed dataset, larger models with zero training error are simpler than smaller models with zero training error. I think it does say somewhat more than that also, which is that larger models do have a real tendency towards being better at finding simpler models in general. That being said, the dataset on which the concept of a dog in your head was trained is presumably way larger than that of any ML model, so even if your brain is really good at implementing Occam's razor and finding simple models, your model is still probably going to be more complicated.
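As a toy illustration of that point (my own sketch, not from the post), minimum-norm regression with increasingly many features shows the same shape: test error typically spikes near the interpolation threshold and then improves again as the model grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 15)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=15)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)

def legendre_features(x, degree):
    return np.polynomial.legendre.legvander(x, degree)

for degree in [3, 8, 14, 30, 100]:
    phi = legendre_features(x_train, degree)
    # lstsq returns the minimum-norm solution when the system is underdetermined,
    # playing the role of the implicit simplicity bias of larger models.
    w, *_ = np.linalg.lstsq(phi, y_train, rcond=None)
    test_mse = np.mean((legendre_features(x_test, degree) @ w - y_test) ** 2)
    print(f"degree {degree:3d}: test MSE = {test_mse:.3f}")
```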

Comment by evhub on Against Premature Abstraction of Political Issues · 2019-12-20T19:00:02.094Z · score: 3 (2 votes) · LW · GW

I disagree, and think LW can actually do ok, and probably even better with some additional safeguards around political discussions. You weren't around yet when we had the big 2009 political debate that I referenced in the OP, but I think that one worked out pretty well in the end.

Do you think having that debate online was something that needed to happen for AI safety/x-risk? Do you think it benefited AI safety at all? I'm genuinely curious. My bet would be the opposite—that it caused AI safety to be more associated with political drama that helped further taint it.

Comment by evhub on A dilemma for prosaic AI alignment · 2019-12-19T22:36:07.267Z · score: 5 (3 votes) · LW · GW

I'm skeptical of language modeling being enough to be competitive, in the sense of maximizing "log prob of some naturally occurring data or human demonstrations." I don't have a strong view about whether you can get away using only language data rather than e.g. taking images as input and producing motor torques as output.

I agree with this, though I still feel like some sort of active learning approach might be good enough without needing to add in a full-out RL objective.

I'm also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What's so bad if we use non-language data?

My opinion would be that there is a real safety benefit from being in a situation where you know the theoretical optimum of your loss function (e.g. in a situation where you know that HCH is precisely the thing for which loss is zero). That being said, it does seem obviously fine to have your language data contain other types of data (e.g. images) inside of it.

Comment by evhub on 2019 AI Alignment Literature Review and Charity Comparison · 2019-12-19T08:00:52.166Z · score: 21 (11 votes) · LW · GW

On the other hand, I don’t think we can give people money just because they say they are doing good things, because of the risk of abuse. There are many other reasons for not publishing anything. Some simple alternative hypothesis include “we failed to produce anything publishable” or “it is fun to fool ourselves into thinking we have exciting secrets” or “we are doing bad things and don’t want to get caught.” The fact that MIRI’s researchers appear intelligent suggest they at least think they are doing important and interesting issues, but history has many examples of talented reclusive teams spending years working on pointless stuff in splendid isolation.

Additionally, by hiding the highest quality work we risk impoverishing the field, making it look unproductive and unattractive to potential new researchers.

My work at MIRI is public, btw.

a Mesa-Optimizer - a sub-agent of an optimizer that is itself an optimizer

I think this is a poor description of mesa-optimization. A mesa-optimizer is not a subagent, it's just a trained model implementing a search algorithm.

Comment by evhub on Inductive biases stick around · 2019-12-19T07:34:01.608Z · score: 2 (1 votes) · LW · GW

Note that, in your example, if we do see double descent, it's because the best hypothesis was previously not in the class of hypotheses we were considering. Bayesian methods tend to do badly when the hypothesis class is misspecified.

Yep, that's exactly my model.

As a counterpoint though, you could see double descent even if your hypothesis class always contains the truth, because the "best" hypothesis need not be the truth.

If "best" here means test error, then presumably the truth should generalize at least as well as any other hypothesis.

That first stage is not just a "likelihood descent", it is a "likelihood + prior descent", since you are choosing hypotheses based on the posterior, not based on the likelihood.

True for the Bayesian case, though unclear in the ML case—I think it's quite plausible that current ML underweights the implicit prior of SGD relative to the maximizing the likelihood of the data (EDIT: which is another reason that better future ML might care more about inductive biases).

Comment by evhub on Against Premature Abstraction of Political Issues · 2019-12-18T21:58:49.564Z · score: 7 (4 votes) · LW · GW

How much of an efficiency hit do you think taking all discussion of a subject offline ("in-person") involves?

Probably a good deal for anything academic (like AI safety), but not at all for politics. I think discussions focused on persuasion/debate/argument/etc. are pretty universally bad (e.g. not truth-tracking), and that online discussion lends itself particularly well to falling into such discussions. It is sometimes possible to avoid this failure mode, but imo basically only if the conversations are kept highly academic and avoid any hot-button issues (e.g. as in some online AI safety discussions, though not all). I think this is basically impossible for politics, so I suspect that not having the ability to talk about politics online won't be much of a problem (and might even be quite helpful, since I suspect it would overall raise the level of political discourse).