## Posts

[AN #169]: Collaborating with humans without human data 2021-11-24T18:30:03.795Z
[AN #168]: Four technical topics for which Open Phil is soliciting grant proposals 2021-10-28T17:20:03.387Z
[AN #167]: Concrete ML safety problems and their relevance to x-risk 2021-10-20T17:10:03.690Z
[AN #166]: Is it crazy to claim we're in the most important century? 2021-10-08T17:30:11.819Z
[AN #165]: When large models are more likely to lie 2021-09-22T17:30:04.674Z
[AN #164]: How well can language models write code? 2021-09-15T17:20:03.850Z
[AN #163]: Using finite factored sets for causal and temporal inference 2021-09-08T17:20:04.522Z
[AN #162]: Foundation models: a paradigm shift within AI 2021-08-27T17:20:03.831Z
[AN #160]: Building AIs that learn and think like people 2021-08-13T17:10:04.335Z
[AN #158]: Should we be optimistic about generalization? 2021-07-29T17:20:03.409Z
[AN #157]: Measuring misalignment in the technology underlying Copilot 2021-07-23T17:20:03.424Z
[AN #156]: The scaling hypothesis: a plan for building AGI 2021-07-16T17:10:05.809Z
BASALT: A Benchmark for Learning from Human Feedback 2021-07-08T17:40:35.045Z
[AN #155]: A Minecraft benchmark for algorithms that learn without reward functions 2021-07-08T17:20:02.518Z
[AN #154]: What economic growth theory has to say about transformative AI 2021-06-30T17:20:03.292Z
[AN #153]: Experiments that demonstrate failures of objective robustness 2021-06-26T17:10:02.819Z
[AN #152]: How we’ve overestimated few-shot learning capabilities 2021-06-16T17:20:04.454Z
[AN #151]: How sparsity in the final layer makes a neural net debuggable 2021-05-19T17:20:04.453Z
[AN #150]: The subtypes of Cooperative AI research 2021-05-12T17:20:27.267Z
[AN #149]: The newsletter's editorial policy 2021-05-05T17:10:03.189Z
[AN #148]: Analyzing generalization across more axes than just accuracy or loss 2021-04-28T18:30:03.066Z
FAQ: Advice for AI Alignment Researchers 2021-04-26T18:59:52.589Z
[AN #147]: An overview of the interpretability landscape 2021-04-21T17:10:04.433Z
[AN #146]: Plausible stories of how we might fail to avert an existential catastrophe 2021-04-14T17:30:03.535Z
[AN #145]: Our three year anniversary! 2021-04-09T17:48:21.841Z
Alignment Newsletter Three Year Retrospective 2021-04-07T14:39:42.977Z
[AN #144]: How language models can also be finetuned for non-language tasks 2021-04-02T17:20:04.230Z
[AN #143]: How to make embedded agents that reason probabilistically about their environments 2021-03-24T17:20:05.166Z
[AN #142]: The quest to understand a network well enough to reimplement it by hand 2021-03-17T17:10:04.180Z
[AN #141]: The case for practicing alignment work on GPT-3 and other large models 2021-03-10T18:30:04.004Z
[AN #140]: Theoretical models that predict scaling laws 2021-03-04T18:10:08.586Z
[AN #139]: How the simplicity of reality explains the success of neural nets 2021-02-24T18:30:04.038Z
[AN #138]: Why AI governance should find problems rather than just solving them 2021-02-17T18:50:02.962Z
[AN #137]: Quantifying the benefits of pretraining on downstream task performance 2021-02-10T18:10:02.561Z
[AN #136]: How well will GPT-N perform on downstream tasks? 2021-02-03T18:10:03.856Z
[AN #135]: Five properties of goal-directed systems 2021-01-27T18:10:04.648Z
[AN #134]: Underspecification as a cause of fragility to distribution shift 2021-01-21T18:10:06.783Z
[AN #133]: Building machines that can cooperate (with humans, institutions, or other machines) 2021-01-13T18:10:04.932Z
[AN #132]: Complex and subtly incorrect arguments as an obstacle to debate 2021-01-06T18:20:05.694Z
[AN #131]: Formalizing the argument of ignored attributes in a utility function 2020-12-31T18:20:04.835Z
[AN #130]: A new AI x-risk podcast, and reviews of the field 2020-12-24T18:20:05.289Z
[AN #129]: Explaining double descent by measuring bias and variance 2020-12-16T18:10:04.840Z
[AN #126]: Avoiding wireheading by decoupling action feedback from action effects 2020-11-26T23:20:05.290Z
[AN #125]: Neural network scaling laws across multiple modalities 2020-11-11T18:20:04.504Z
[AN #124]: Provably safe exploration through shielding 2020-11-04T18:20:06.003Z
[AN #123]: Inferring what is valuable in order to align recommender systems 2020-10-28T17:00:06.053Z

Comment by rohinmshah on Christiano, Cotra, and Yudkowsky on AI progress · 2021-12-01T10:21:35.609Z · LW · GW

I agree that when you know about a critical threshold, as with nukes or orbits, you can and should predict a discontinuity there. (Sufficient specific knowledge is always going to allow you to outperform a general heuristic.) I think that (a) such thresholds are rare in general and (b) in AI in particular there is no such threshold. (According to me (b) seems like the biggest difference between Eliezer and Paul.)

Some thoughts on aging:

• It does in fact seem surprising, given the complexity of biology relative to physics, if there is a single core cause and core solution that leads to a discontinuity.
• I would a priori guess that there won't be a core solution. (A core cause seems more plausible, and I'll roll with it for now.) Instead, we see a sequence of solutions that intervene on the core problem in different ways, each of which leads to some improvement on lifespan, and discovering these at different times leads to a smoother graph.
• That being said, are people putting in a lot of effort into solving aging in mice? Everyone seems to constantly be saying that we're putting in almost no effort whatsoever. If that's true then a jumpy graph would be much less surprising.
• As a more specific scenario, it seems possible that the graph of mouse lifespan over time looks basically flat, because we were making no progress due to putting in ~no effort. I could totally believe in this world that someone puts in some effort and we get a discontinuity, or even that the near-zero effort we're putting in finds some intervention this year (but not in previous years) which then looks like a discontinuity.

If we had a good operationalization, and people are in fact putting in a lot of effort now, I could imagine putting my $100 to your $300 on this (not going beyond 1:3 odds simply because you know way more about aging than I do).
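For concreteness, the arithmetic behind those odds can be sketched as below. This is only an illustrative calculation, not part of the bet itself: the helper name `break_even_probability` is hypothetical, and it assumes a simple bet where the winner takes both stakes and there are no other costs.

```python
from fractions import Fraction


def break_even_probability(own_stake, other_stake):
    """Probability of winning at which a bettor's expected profit is zero.

    Assumes (for illustration) that the winner collects the other side's
    stake and the loser forfeits their own:
        p * other_stake - (1 - p) * own_stake = 0
    which gives p = own_stake / (own_stake + other_stake).
    """
    return Fraction(own_stake, own_stake + other_stake)


# The side staking $100 against $300 (1:3 odds).
small_side = break_even_probability(100, 300)
# The side staking $300 against $100.
large_side = break_even_probability(300, 100)

print(small_side, large_side)
```

Under these assumptions, the side staking $100 breaks even if it wins a quarter of the time, while the side staking $300 needs to win three quarters of the time.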

Comment by rohinmshah on Christiano, Cotra, and Yudkowsky on AI progress · 2021-11-30T13:19:35.759Z · LW · GW

Nitpick: Isn't  the solution for  modulo constants? Or equivalently,  is the solution to .

Comment by rohinmshah on Christiano, Cotra, and Yudkowsky on AI progress · 2021-11-30T10:13:28.568Z · LW · GW

The "continuous view" as I understand it doesn't predict that all straight lines always stay straight. My version of it (which may or may not be Paul's version) predicts that in domains where people are putting in lots of effort to optimize a metric, that metric will grow relatively continuously. In other words, the more effort put in to optimize the metric, the more you can rely on straight lines for that metric staying straight (assuming that the trends in effort are also staying straight).

In its application to AI, this is combined with a prediction that people will in fact be putting in lots of effort into making AI systems intelligent / powerful / able to automate AI R&D / etc, before AI has reached a point where it can execute a pivotal act. This second prediction comes for totally different reasons, like "look at what AI researchers are already trying to do" combined with "it doesn't seem like AI is anywhere near the point of executing a pivotal act yet".

(I think on Paul's view the second prediction is also bolstered by observing that most industries / things that had big economic impacts also seemed to have crappier predecessors. This feels intuitive to me but is not something I've checked and so isn't my personal main reason for believing the second prediction.)

One historical example immediately springs to mind where something-I'd-consider-a-Paul-esque-model utterly failed predictively: the breakdown of the Phillips curve.

I'm not very familiar with this (I've only seen your discussion and the discussion in IEM) but it does not seem like the sort of thing where the argument I laid out above would have had a strong opinion. Was the y-axis of the straight line graph a metric that people were trying to optimize? If so, did the change in policy not represent a change in the amount of effort put into optimizing the metric? (I haven't looked at the details here, maybe the answer is yes to both, in which case I would be interested in looking at the details.)

Zooming out a meta-level, I think GDP is a particularly good example of a big aggregate metric which approximately-always looks smooth in hindsight, even when the underlying factors of interest undergo large jumps.

This seems plausible but it also seems like you can apply the above argument to a bunch of other topics besides GDP, like the ones listed in this comment, so it still seems like you should be able to exhibit a failure of the argument on those topics.

Comment by rohinmshah on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-30T08:07:52.993Z · LW · GW

Who knew that Eliezer would respond with a long list of examples that didn't look like continuous progress at the time, and said this more than 3 days ago?

What examples are you thinking of here? I see (1) humans and chimps, (2) nukes, (3) AlphaGo, (4) invention of airplanes by the Wright brothers, (5) AlphaFold 2, (6) Transformers, (7) TPUs, and (8) GPT-3.

I've explicitly seen 1, 2, and probably 4 in arguments before. (1 and 2 are in Takeoff speeds.) The remainder seem like they plausibly did look like continuous progress* at the time. (Paul explicitly challenged 3, 6, and 7, and I feel like 5 and 8 are also debatable, though 8 is a more complicated story.) I also think I've seen some of 3, 5, 6, 7, and 8 on Eliezer's Twitter claimed as evidence for Eliezer over Hanson in the foom debate, idk which off the top of my head.

I did not know that Eliezer would respond with this list of examples, but that's mostly because I expected him to have different arguments, e.g. more of an emphasis on a core of intelligence that current systems don't have and future systems will have, or more emphasis on aspects of recursive self improvement, or some unknown argument because I hadn't talked to Eliezer nor seen a rebuttal from him so it seemed quite plausible he had points I hadn't considered. The list of examples itself was not all that novel to me.

(Eliezer of course also has other arguments in this post; I'm just confused about the emphasis on a "long list of examples" in the parent comment.)

* Note that "continuous progress" here is a stand-in for the-strategy-Paul-uses-to-predict, which as I understand it is more like "form beliefs about how outputs scale with effort in this domain using past examples / trend lines, then see how much effort is being added now relative to the past, and use that to make a prediction".

Comment by rohinmshah on Yudkowsky and Christiano discuss "Takeoff Speeds" · 2021-11-30T06:55:18.028Z · LW · GW

(To be clear, the thing you quoted was commenting on the specific argument presented in that post. I do expect that in practice AI will need social learning, simply because that's how an AI system could make use of the existing trove of knowledge that humans have built.)

Comment by rohinmshah on Ngo and Yudkowsky on alignment difficulty · 2021-11-28T16:59:08.126Z · LW · GW

We imagine Shah saying: “1. Why will the AI have goals at all?, and 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”

Yeah, that's right. Adapted to the language here, it would be 1. Why would we have a "full and complete" outcome pump, rather than domain-specific outcome pumps that primarily use plans using actions from a certain domain rather than "all possible actions", and 2. Why are the outcomes being pumped incompatible with human survival?

Comment by rohinmshah on Ngo and Yudkowsky on alignment difficulty · 2021-11-28T16:52:22.801Z · LW · GW

The things AI systems today can do are already hitting pretty narrow targets. E.g., generating English text that is coherent is not something you’d expect from a random neural network. Why is corrigibility so much more of a narrow target than that? (I think Rohin may have said this to me at some point.)

I'll note that this is framed a bit too favorably to me, the actual question is "why is an effective and corrigible system so much more of a narrow target than that?"

Comment by rohinmshah on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-18T11:22:51.167Z · LW · GW

This just doesn't match my experience at all. Looking through my past AI papers, I only see two papers where I could predict the results of the experiments on the first algorithm I tried at the beginning of the project. The first one (benefits of assistance) was explicitly meant to be a "communication" paper rather than a "research" paper (at the time of project initiation, rather than in hindsight). The second one (Overcooked) was writing up results that were meant to be the baselines against which the actual unpredictable research (e.g. this) was going to be measured; it just turned out that that was already sufficiently interesting to the broader community.

(Funny story about the Overcooked paper; we wrote the paper + did the user study in ~two weeks iirc, because it was only two weeks before the deadline that we considered that the "baseline" results might already be interesting enough to warrant a conference paper. It's now my most-cited AI paper.)

(I'm also not actually sure that I would have predicted the Overcooked results when writing down the first algorithm; the conceptual story felt strong but there are several other papers where the conceptual story felt strong but nonetheless the first thing we tried didn't work. And in fact we did have to make slight tweaks, like annealing from self-play to BC-play over the course of training, to get our algorithm to work.)
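To illustrate what "annealing from self-play to BC-play" could look like, here is a minimal sketch. It assumes a linear schedule over training steps, and the function names (`self_play_probability`, `choose_partner`) and the `"self"` / `"bc"` labels are hypothetical; it is not the schedule or code used in the Overcooked paper.

```python
import random


def self_play_probability(step, total_steps):
    """Linearly decay the chance of a self-play partner from 1.0 to 0.0.

    The linear shape is an assumption made for illustration only.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps behave sensibly.
    fraction_done = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - fraction_done


def choose_partner(step, total_steps, rng):
    """Pick the training partner for one episode.

    Returns "self" (a copy of the learning agent) or "bc" (a fixed
    behavior-cloned model); both labels are placeholders.
    """
    if rng.random() < self_play_probability(step, total_steps):
        return "self"
    return "bc"


rng = random.Random(0)
# Early episodes are almost always self-play, late ones almost always BC-play.
early = [choose_partner(step, 1000, rng) for step in range(0, 10)]
late = [choose_partner(step, 1000, rng) for step in range(990, 1000)]
print(early, late)
```

The point of a schedule like this is that the agent first learns basic competence against itself and only gradually has to adapt to the BC partner's quirks.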

A more typical case would be something like Preferences Implicit in the State of the World, where the conceptual idea never changed over the course of the project, but:

1. The first hacky / heuristic algorithm we wrote down didn't work in some cases. We analyzed it a bunch (via experiments) to figure out what sorts of things it wasn't capturing.
2. When we eventually had a much more elegant derived-from-math algorithm, I gave a CHAI presentation presenting some experimental results. There were some results I was confused by, where I expected something different from what we got, and I mentioned this. (Specifically these were the results in the case where the robot had a uniform prior over the initial state at time -T). Many people in the room (including at least one person from MIRI) thought for a while and gave their explanation for why this was the behavior you should expect. (I'm pretty sure some even said "this isn't surprising" or something along those lines.) I remained unconvinced. Upon further investigation we found out that one of Ziebart's results that we were using had extremely high variance in our setting, since in our setting we only ever had one initial state, rather than sampling several which would give better coverage of the uniform prior. We derived a better version of Ziebart's result, implemented that, and voila the results were now what I had originally expected.
3. It took about... 2 weeks (?) between getting this final version of the algorithm and submitting a paper, constituting maybe 15-20% of the total work. Most of that was what I'd call "communication" rather than "research", e.g. creating another environment to better demonstrate the algorithm's properties, writing up the paper clearly, making good figures, etc. Good communication seems clearly worth putting effort into.

If you want a deep learning example, consider Learning What To Do by Simulating the Past. The biggest example here is the curriculum -- that was not part of the original pseudocode I had written down and was crucial to get it to work.

You might look at this and think, "but the conceptual idea predicted the experiments that were eventually run!" I mean, sure, but then I think your crux is not "were the experiments predictable"; rather, it's "is there any value in going from a conceptual idea to a working implementation".

It's also pretty easy to predict the results of experiments in a paper, but that's because you have the extra evidence that you're reading a paper. This is super helpful:

1. The experiments are going to show the algorithm working. They wouldn't have published the paper otherwise.
2. The introduction, methods, etc are going to tell you exactly what to expect when you get to the experiments. Even if the authors initially thought the algorithm was going to improve the final score in Atari games, if the algorithm instead improved sample efficiency without changing final score, the introduction is going to be about how the algorithm was inspired by sample efficient learning in humans or whatever.

This is also why I often don't report on experiments in papers in the Alignment Newsletter; usually the point is just "yes, the conceptual idea worked".

I don't know if this is actually true, but one cynical take is that people are used to predicting the results of finished ML work, where they implicitly use (1) and (2) above, and incorrectly conclude that the vast majority of ML experiments are ex ante predictable. And now that they have to predict the outcome of Redwood's project, before knowing that a paper will result, they implicitly realize that no, it really could go either way. And so they incorrectly conclude that of the ML experiments, Redwood's project is a rare unpredictable one.

Comment by rohinmshah on Attempted Gears Analysis of AGI Intervention Discussion With Eliezer · 2021-11-16T06:19:28.017Z · LW · GW

It also seems uncharitable to go from (A) "exaggerated one of the claims in the OP" to (B) "made up the term 'fake' as an incorrect approximation of the true claim, which was not about fakeness".

You didn't literally explicitly say (B), but when you write stuff like

The term ‘faking’ here is turning a claim of ‘approaches that are being taken mostly have epsilon probability of creating meaningful progress’ to a social claim about the good faith of those doing said research, and then interpreted as a social attack, and then therefore as an argument from authority and a status claim, as opposed to pointing out that such moves don’t win the game and we need to play to win the game.

I think most (> 80%) reasonable people would take (B) away from your description, rather than (A).

Just to be totally clear: I'm not denying that the original comment was uncharitable, I'm pushing back on your description of it.

Comment by rohinmshah on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-15T19:54:45.422Z · LW · GW

That's a good example, thanks :)

EDIT: To be clear, I don't agree with

But at the same time, I think that Abram wins hands-down on the metric of "progress towards AI alignment per researcher-hour"

but I do think this is a good example of what someone might mean when they say work is "predictable".

Comment by rohinmshah on Attempted Gears Analysis of AGI Intervention Discussion With Eliezer · 2021-11-15T12:10:30.588Z · LW · GW

In the comments to the OP that Eliezer’s comments about small problems versus hard problems got condensed down to ‘almost everyone working on alignment is faking it.’ I think that is not only uncharitable, it’s importantly a wrong interpretation [...]

Note that there is a quote from Eliezer using the term "fake":

And then there is, so far as I can tell, a vast desert full of work that seems to me to be mostly fake or pointless or predictable.

It could certainly be the case that Eliezer means something else by the word "fake" than the commenters mean when they use the word "fake"; it could also be that Eliezer thinks that only a tiny fraction of the work is "fake" and most is instead "pointless" or "predictable", but the commenters aren't just creating the term out of nowhere.

Comment by rohinmshah on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-15T09:57:08.456Z · LW · GW

^ This response is great.

I also think I naturally interpreted the terms in Adam's comment as pointing to specific clusters of work in today's world, rather than universal claims about all work that could ever be done. That is, when I see "experimental work and not doing only decision theory and logic", I automatically think of "experimental work" as pointing to a specific cluster of work that exists in today's world (which we might call mainstream ML alignment), rather than "any information you can get by running code". Whereas it seems you interpreted it as something closer to "MIRI thinks there isn't any information to get by running code".

My brain insists that my interpretation is the obvious one and is confused how anyone (within the AI alignment field, who knows about the work that is being done) could interpret it as the latter. (Although the existence of non-public experimental work that isn't mainstream ML is a good candidate for how you would start to interpret "experimental work" as the latter.) But this seems very plausibly a typical mind fallacy.

EDIT: Also, to explicitly say it, sorry for misunderstanding what you were trying to say. I did in fact read your comments as saying "no, MIRI is not categorically against mainstream ML work, and MIRI is not only working on HRAD-ish stuff like decision theory and logic, and furthermore this should be pretty obvious to outside observers", and now I realize that is not what you were saying.

Comment by rohinmshah on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-14T10:28:09.888Z · LW · GW

(Responding to entire comment thread) Rob, I don't think you're modeling what MIRI looks like from the outside very well.

• There's a lot of public stuff from MIRI on a cluster that has as central elements decision theory and logic (logical induction, Vingean reflection, FDT, reflective oracles, Cartesian Frames, Finite Factored Sets...)
• There was once an agenda (AAMLS) that involved thinking about machine learning systems, but it was deprioritized, and the people working on it left MIRI.
• There was a non-public agenda that involved Haskell programmers. That's about all I know about it. For all I know they were doing something similar to the modal logic work I've seen in the past.
• Eliezer frequently talks about how everyone doing ML work is pursuing dead ends, with potentially the exception of Chris Olah. Chris's work is not central to the cluster I would call "experimentalist".
• There has been one positive comment on the KL-divergence result in summarizing from human feedback. That wasn't the main point of that paper and was an extremely predictable result.
• There has also been one positive comment on Redwood Research, which was founded by people who have close ties to MIRI. The current steps they are taking are not dramatically different from what other people have been talking about and/or doing.
• There was a positive-ish comment on aligning narrowly superhuman models, though iirc it gave off more of an impression of "well, let's at least die in a slightly more dignified way".

I don't particularly agree with Adam's comments, but it does not surprise me that someone could come to honestly believe the claims within them.

Comment by rohinmshah on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-14T09:52:21.251Z · LW · GW

That one makes sense (to the extent that Eliezer did confidently predict the results), since the main point of the work was to generate information through experiments. I thought the "predictable" part was also meant to apply to a lot of ML work where the main point is to produce new algorithms, but perhaps it was just meant to apply to things like Ought.

Comment by rohinmshah on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-13T11:11:06.197Z · LW · GW

A confusion: it seems that Eliezer views research that is predictable as basically-useless. I think I don't understand what "predictable" means here. In what sense is expected utility quantilization not predictable?

Maybe the point is that coming up with the concept is all that matters, and the experiments that people usually do don't matter because after coming up with the concept the experiments are predictable? I'm much more sympathetic to that, but then I'm confused why "predictable" implies "useless"; many prosaic alignment papers have as their main contribution a new algorithm, which seems like a similar type of thing as quantilization.

Comment by rohinmshah on Speaking of Stag Hunts · 2021-11-06T15:27:08.075Z · LW · GW

I assume that meant "instead of 80% of the value for 20% of the effort, we're now at least at 85% of the value for 37% of the effort", which parses fine to me

Comment by rohinmshah on Goodhart's Imperius · 2021-11-01T09:59:26.163Z · LW · GW

Caveat: epistemic status of all of this is somewhat tentative, but even if you assign e.g. only 70% confidence in each claim (which seems reasonable) and you assign a 50% hit to the reasoning from sheer skepticism, naively multiplying it out as if all of the claims were independent still leaves you with a 12% chance that your brain is doing this to you, which is large enough that it seems at least worth a few cycles of trying to think about it and ameliorate the situation.
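As a quick check on the arithmetic in that caveat, here is a small sketch. The quote does not say how many claims are being multiplied; four is an assumption, chosen because it is the count that reproduces the quoted figure of roughly 12%. The function name `joint_probability` is likewise just for illustration.

```python
def joint_probability(per_claim_confidence, num_claims, skepticism_discount):
    """Multiply out independent claims, then apply a flat discount.

    Treating the claims as independent mirrors the "naively multiplying
    it out" step in the quoted caveat.
    """
    return per_claim_confidence ** num_claims * skepticism_discount


# Assumed inputs: 70% per claim, four claims (an assumption), 50% discount.
estimate = joint_probability(0.7, 4, 0.5)
print(f"{estimate:.2%}")
```

With three claims the same calculation gives about 17%, and with five about 8%, so the result is fairly sensitive to how the story is divided into claims.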

Fwiw, my (not-that-well-sourced-but-not-completely-made-up) impression is that the overall story is a small extrapolation of probably-mainstream neuroscience, and also consistent with the way AI algorithms work, so I'd put significantly higher probability on it (hard to give an actual number without being clearer about the exact claim).

(For someone who wants to actually check the sources, I believe you'd want to read Peter Dayan's work.)

(I'm not expressing confidence in specific details like e.g. turning sensory data into implicit causal models that produce binary signals.)

Comment by rohinmshah on Ruling Out Everything Else · 2021-10-29T10:04:29.340Z · LW · GW

True! I might try that strategy more deliberately in the future.

Comment by rohinmshah on Ruling Out Everything Else · 2021-10-28T20:25:47.305Z · LW · GW

Yeah, I figured that was probably the case. Still seemed worth checking.

You're almost certainly correct that it's nonzero/substantially off-putting to your readers, but I would bet at 5:1 odds that it's still less costly than the otherwise-inevitable-according-to-your-models misunderstandings

I'm not entirely sure what the claim that you're putting odds on is, but usually, in my situation:

• I write different pieces for different audiences
• I promote the writing to the audience that I wrote for
• I predict, but haven't checked, that significantly less of the intended audience would read it if I couldn't simply use the accepted jargon for that audience and instead had to explain it / rule out everything else
• I find that the audience I didn't promote it to mostly doesn't read it (limiting the number of misunderstandings).

So I think I'm in the position where it makes sense to take cheap actions to avoid misunderstandings, but not expensive ones. I also feel constrained by the two groups having very different (effective) norms, e.g. in ML it's a lot more important to be concise, and it's a lot more weird (though maybe not bad?) to propose new short phrases for existing concepts.

Comment by rohinmshah on Ruling Out Everything Else · 2021-10-28T18:35:07.965Z · LW · GW

It might be that you're assuming more shared context between the groups than I am. In my case, I'm usually thinking of LessWrongers and ML researchers as the two groups. One example would be that the word "reward" is interpreted very differently by the two.

To LessWrongers, there's a difference between "reward" and "utility", where "reward" refers to a signal that is computed over observations and is subject to wireheading, and "utility" refers to a function that is computed over states and is not subject to wireheading. (See e.g. this paper, though I don't remember if it actually uses the terminology, or if that came later).

Whereas to ML researchers, this is not the sort of distinction they usually make, and instead "reward" is roughly that-which-you-optimize. In most situations that ML researchers consider, "does the reward lead to wireheading" does not have a well-defined answer.

In the ML context, I might say something like "reward learning is one way that we can avoid the problem of specification gaming". When (some) LessWrongers read this, they think I am saying that the thing-that-leads-to-wireheading is a good solution to specification gaming, which they obviously disagree with, whereas in the context I was operating in, I meant to say "we should learn rather than hardcode that-which-you-optimize" without making claims about whether the learned thing was of the type-that-leads-to-wireheading or not.

I definitely could write all of this out every time I want to use the word "reward", but this would be (a) incredibly tedious for me and (b) off-putting to my intended readers, if I'm spending all this time talking about some other interpretation that is completely alien to them.

Comment by rohinmshah on Ruling Out Everything Else · 2021-10-28T13:46:05.385Z · LW · GW

If you can't do it, you'll have a hard time moving past "say the words that match what's in your brain" and getting to "say the words that will cause the thing in their brain to match what's in your brain."

One issue I've had with the second approach is that then you say different words to group A and to group B, and then sometimes group A reads the words you wrote for group B, and gets confused / thinks you are being inconsistent. (In a more charged situation, I'd imagine they'd assume you were acting in bad faith.) Any suggestions on how to deal with that?

(I'm still obviously going to use the second approach because it's way superior to the first, but I do wish I could cheaply mitigate this downside.)

Comment by rohinmshah on [AN #166]: Is it crazy to claim we're in the most important century? · 2021-10-27T20:39:48.675Z · LW · GW

Fyi, I've just added a "Year" column to the spreadsheet (I'm not really sure why I didn't have it before) -- hopefully this doesn't break your code?

Comment by rohinmshah on rohinmshah's Shortform · 2021-10-26T21:10:48.673Z · LW · GW

From the Truthful AI paper:

If all information pointed towards a statement being true when it was made, then it would not be fair to penalise the AI system for making it. Similarly, if contemporary AI technology isn’t sophisticated enough to recognise some statements as potential falsehoods, it may be unfair to penalise AI systems that make those statements.

I wish we would stop talking about what is "fair" to expect of AI systems in AI alignment*. We don't care what is "fair" or "unfair" to expect of the AI system, we simply care about what the AI system actually does. The word "fair" comes along with a lot of connotations, often ones which actively work against our goal.

At least twice I have presented an AI safety researcher with a story in which an AI system fails, and I have gotten the response "but that isn't fair to the AI system" (because it didn't have access to the necessary information to make the right decision), as though this somehow prevents the story from happening in reality.

(This sort of thing happens with mesa optimization -- if you have two objectives that are indistinguishable on the training data, it's "unfair" to expect the AI system to choose the right one, given that they are indistinguishable given the available information. This doesn't change the fact that such an AI system might cause an existential catastrophe.)

In both cases I mentioned that what we care about is actual outcomes, and that you can tell such stories where in actual reality the AI kills everyone regardless of whether you think it is fair or not, and this was convincing. It's not that the people I was talking to didn't understand the point, it's that some mental heuristic of "be fair to the AI system" fired and temporarily led them astray.

Going back to the Truthful AI paper, I happen to agree with their conclusion, but the way I would phrase it would be something like:

If all information pointed towards a statement being true when it was made, then it would appear that the AI system was displaying the behavior we would see from the desired algorithm, and so a positive reward would be more appropriate than a negative reward, despite the fact that the AI system produced a false statement. Similarly, if the AI system cannot recognize the statement as a potential falsehood, providing a negative reward may just add noise to the gradient rather than making the system more truthful.

* Exception: Seems reasonable to talk about fairness when considering whether AI systems are moral patients, and if so, how we should treat them.

Comment by rohinmshah on General alignment plus human values, or alignment via human values? · 2021-10-26T15:38:14.437Z · LW · GW

I often say things that I think you would interpret as belonging to the first category ("general alignment plus human values").

So the AI can and must have an impact on the world. We want it to not have a large or dangerous impact. But, crucially, "dangerous" and "large" are defined by human values.

This feels like the crux. I certainly agree that "dangerous" and "large" are not orthogonal to / independent of human values, and that as a result any realistic safe AI system will contain some information about human values.

But this seems like a very weak conclusion to me. Of course a superintelligent AI will contain some information about human values. GPT-3 isn't superintelligent and it already contains tons of knowledge about human values; possibly more than I do. You'd have to try really hard to prevent it from containing information about human values.

It seems like you conclude something much stronger, which is something like "we must build in all of human values". I don't see why we can't instead have our AI systems do whatever a well-motivated human would do in a similar principal-agent problem. This certainly involves knowing some amount about human values, but not some extraordinarily large amount that means we might as well just learn everything including in exotic philosophical cases.

(I think my position is pretty similar to Steve's.)

From a later comment:

But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?

The same way a well-motivated personal assistant would deal with it. Tell the human of these two possibilities, and ask them which one should be done. Help them with this decision by providing them with true, useful information about what consequences arise from each of the possibilities.

If you are able to perfectly predict their responses in all possible situations, and the final answer depends on (say) the order in which you ask the questions, then go up a meta level: ask them for their preferences about how you go about eliciting information from them and/or helping them with reflection.

If going up meta levels doesn't solve the problem either, then pick randomly amongst the options, or take an average.

If there's time pressure and you can't get their opinions, take your best guess as to which one they'd prefer, and do that one. (One assumes that such a scenario doesn't come up often.)

Generally with these sorts of hypotheticals, it feels to me like it either (1) isn't likely to come up, or (2) can be solved by deferring to the human, or (3) doesn't matter very much.

Comment by rohinmshah on Naive self-supervised approaches to truthful AI · 2021-10-24T17:12:11.667Z · LW · GW

I have heard of similar experiments that did in fact help, though I don't have any citations (in many cases it is unpublished work). So I do expect that with some effort you would get some benefit from such an approach.

Comment by rohinmshah on Deleted comments archive? · 2021-10-24T16:46:12.119Z · LW · GW

Shortform?

Comment by rohinmshah on [AN #167]: Concrete ML safety problems and their relevance to x-risk · 2021-10-22T22:30:55.733Z · LW · GW

I think we're just debating semantics of the word "assumption".

Consider the argument:

A superintelligent AI will be VNM-rational, and therefore it will pursue convergent instrumental subgoals

I think we both agree this is not a valid argument, or is at least missing some details about what the AI is VNM-rational over before it becomes a valid argument. That's all I'm trying to say.

Unimportant aside on terminology: I think in colloquial English it is reasonable to say that this is "missing an assumption". I assume that you want to think of this as math. My best guess at how to turn the argument above into math would be something that looks like:

This still seems like "missing assumption", since the thing filling the ? seems like an "assumption".

Maybe you're like "Well, if you start with the setup of an agent that satisfies the VNM axioms over state-based outcomes, then you really do just need VNM to conclude 'convergent instrumental subgoals', so there's no extra assumptions needed". I just don't start with such a setup; I'm always looking for arguments with the conclusion "in the real world, we have a non-trivial chance of building an agent that causes an existential catastrophe". (Maybe readers don't have the same inclination? That would surprise me, but is possible.)

Comment by rohinmshah on [AN #167]: Concrete ML safety problems and their relevance to x-risk · 2021-10-21T09:16:16.681Z · LW · GW

depending on what the agent is coherent over.

That's an assumption :P (And it's also not one that's obviously true, at least according to me.)

Comment by rohinmshah on Zoe Curzi's Experience with Leverage Research · 2021-10-14T23:41:58.367Z · LW · GW

You're reading too much into my response. I didn't claim that Anna should have this extra onus. I made an incorrect inference, was confused, asked for clarification, was still confused by the first response (honestly I'm still confused by that response), understood after the second response, and then explained what I would have said if I were in her place when she asked about norms.

(Yes, I do in fact think that the specific thing said had negative consequences. Yes, this belief shows in my comments. But I didn't say that Anna was wrong/bad for saying the specific thing, nor did I say that she "should" have done something else. Assuming for the moment that the specific statement did have negative consequences, what should I have done instead?)

(On the actual question, I mostly agree that we probably have too many demands on public communication, such that much less public communication happens than would be good.)

I just think people here are smart and independent enough to not be 'coerced' by Anna if she doesn't open the conversation with a bunch of 'you might suffer reprisals' warnings

I also would have been fine with "I hope people share additional true, relevant facts". The specific phrasing seemed bad because it seemed to me to imply that the fear of reprisal was wrong. See also here.

Comment by rohinmshah on Zoe Curzi's Experience with Leverage Research · 2021-10-14T11:47:48.202Z · LW · GW

I'm assuming listeners will only do things if they don't mind doing them, i.e. that my words won't coerce people,

I feel like this assumption seems false. I do predict that (at least in the world where we didn't have this discussion) your statement would create a social expectation for the people to report true, relevant facts, and that this social expectation would in fact move people in the direction of reporting true, relevant facts.

I immediately made the inference myself on reading your comment. There was no choice in the matter, no execution of a deliberate strategy on my part, just an inference that Anna wants people to give the facts, and doesn't think that fear of reprisal is particularly important to care about. Well, probably, it's hard to remember exactly what I thought, but I think it was something like this. I then thought about why this might be, and how I might have misunderstood. In hindsight, the explanation you gave above should have occurred to me, that is the sort of thing that people who speak literally would do, but it did not.

I think there are lots of LWers who, like me, make these sorts of inferences automatically. (And I note that these kinds of inferences are excellent for believing true things about people outside of LW.) I think this is especially true of people in the same reference class as Zoe, and that such people will feel particularly pressured by it. (There are a sadly large number of people in this community who have a lot of self-doubt / not much self-esteem, and are especially likely to take other people's opinions seriously, and as a reason for them to change their behavior even if they don't understand why.) This applies to both facts that are politically-pro-Leverage and facts that are politically-anti-Leverage.

So overall, yes, I think your words would lead people to infer that it would be better for them to report true relevant facts and that any fear they have is somehow misplaced, and to be pressured by that inference into actually doing so, i.e. it coerces them.

I don't have a candidate alternative norm. (I generally don't think very much about norms, and if I made one up now I'm sure it would be bad.) But if I wanted to convey something similar in this particular situation, I would have said something like "I would love to know additional true relevant facts, but I recognize there is a risk that others will take them in a politicized way, or will use them as an excuse to falsely judge you, so please only do this if you think the benefits are worth it".

(Possibly this is missing something you wanted to convey, e.g. you wish that the community were such that people didn't have to fear political judgment?)

(I also agree with TekhneMakre's response about authority.)

Comment by rohinmshah on Zoe Curzi's Experience with Leverage Research · 2021-10-14T11:17:24.966Z · LW · GW

Hypothesis 2 feels truer than hypothesis 1.

(Just to state the obvious: it is clearly not as bad as the words "coercion" and "gaslighting" would usually imply. I am endorsing the mechanism, not the magnitude-of-badness.)

I agree that hypothesis 1 could be an underlying generator of why the effect in hypothesis 2 exists.

I think I am more confident in the prediction that these sorts of statements do influence people in ways-I-don't-endorse, than in any specific mechanism by which that happens.

Comment by rohinmshah on Zoe Curzi's Experience with Leverage Research · 2021-10-14T08:00:48.519Z · LW · GW

It sounds like you are predicting that the people who are sharing true relevant facts have values such that the long-term benefits to the group overall outweigh the short-term costs to them. In particular, it's a prediction about their values (alongside a prediction of what the short-term and long-term effects are).

I'll just note that, on my view of the short-term and long-term effects, it seems pretty unclear whether by my values I should share additional true relevant information, and I lean towards it being negative. Of course, you have way more private info than me, so perhaps you just know a lot more about the short-term and long-term effects.

I'm also not a fan of requests that presume that the listener is altruistic, and willing to accept personal harm for group benefit. I'm not sure if that's what you meant -- maybe you think in the long term sharing of additional facts would help them personally, not just help the group.

Fwiw I don't have any particularly relevant facts. I once tagged along with a friend to a party that I later (i.e. during or after the party) found out was at Leverage. I've occasionally talked briefly with people who worked at Leverage / Paradigm / Reserve. I might have once stopped by a poster they had at EA Global. I don't think there have been other direct interactions with them.

Comment by rohinmshah on Zoe Curzi's Experience with Leverage Research · 2021-10-14T07:44:12.729Z · LW · GW

I agree that's possible, but then I'd say something like "I would love to know additional true relevant facts, but I recognize there are real risks to this and only recommend people do this if they think the benefits are worth it".

Analogy: it could be worth it for an employee to publicly talk about the flaws of their company / manager (e.g. because then others know not to look for jobs at that company), even though it might get them fired. In such a situation I would say something like "It would be particularly helpful to know about the flaws of company X, but I recognize there are substantial risks involved and only recommend people do this if they feel up to it". I would not say "I hope people don't refrain from speaking up about the flaws of company X out of fear that they might be fired", unless I had good reason to believe they wouldn't be fired, or good reason to believe that it would be worth it on their values (though in that case presumably they'd speak up anyway).

Comment by rohinmshah on Zoe Curzi's Experience with Leverage Research · 2021-10-13T22:53:15.288Z · LW · GW

Refraining from sharing true relevant facts, out of fear that others will take them in a politicized way, or will use them as an excuse for false judgments.

Are you somehow guaranteeing or confidently predicting that others will not take them in a politicized way, use them as an excuse for false judgments, or otherwise cause harm to those sharing the true relevant facts? If not, why are you asking people not to refrain from sharing such facts?

(My impression is that it is sheer optimism, bordering on wishful thinking, to expect such a thing, that those who have such a fear are correct to have such a fear, and so I am confused that you are requesting it anyway.)

Comment by rohinmshah on [AN #166]: Is it crazy to claim we're in the most important century? · 2021-10-12T10:22:33.435Z · LW · GW

Yeah, I agree the statement is false as I literally wrote it, though what I meant was that you could easily believe you are in the kind of simulation where there is no extraordinary impact to have.

Comment by rohinmshah on Selection Theorems: A Program For Understanding Agents · 2021-10-11T16:52:50.628Z · LW · GW

Edited to

Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning human values

and

[...] the resulting agents can be represented as maximizing expected utility, if the agents don't have internal state.

(For the second one, that's one of the reasons why I had the weasel word "could", but on reflection it's worth calling out explicitly given I mention it in the previous sentence.)

Comment by rohinmshah on Selection Theorems: A Program For Understanding Agents · 2021-10-11T09:20:12.495Z · LW · GW

Thanks for this and the response to my other comment, I understand where you're coming from a lot better now. (Really I should have figured it out myself, on the basis of this post.) New summary:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior, and (2) they can tell us likely properties of the agents we build by accident (think inner alignment concerns).

As an example, [coherence arguments](https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any “good” agent can be represented as maximizing expected utility. (What does it mean to be “good”? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose resources.) This can then be turned into a selection argument by combining it with something that selects for “good” agents. For example, evolution will select for agents that don’t lose resources for no gain, so humans are likely to be represented as maximizing expected utility. Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.

Coherence arguments aren’t the only kind of selection theorem. The <@good(er) regulator theorem@>(@Fixing The Good Regulator Theorem@) provides a set of scenarios under which agents learn an internal “world model”. The [Kelly criterion](http://www.eecs.harvard.edu/cs286r/courses/fall10/papers/Chapter6.pdf) tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in [this followup post](https://www.alignmentforum.org/posts/N2NebPD78ioyWHhNm/some-existing-selection-theorems).
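To make the Kelly criterion claim concrete, here is a minimal sketch, assuming the simplest textbook setting: a repeated binary bet won with probability `p` at net odds `b`, where the agent stakes a fixed fraction `f` of its wealth each round. The function names and the numbers are illustrative assumptions, not taken from the linked chapter. It checks numerically that the fraction maximizing expected log wealth matches the closed-form Kelly fraction `p - (1 - p) / b`.

```python
import math

def kelly_fraction(p, b):
    """Closed-form Kelly fraction for a binary bet: win prob p, net odds b."""
    return p - (1 - p) / b

def expected_log_growth(f, p, b):
    """Expected log wealth growth per round when staking fraction f."""
    return p * math.log(1 + f * b) + (1 - p) * math.log(1 - f)

# Illustrative numbers (assumption): a 60% chance to win an even-money bet.
p, b = 0.6, 1.0
f_star = kelly_fraction(p, b)

# Brute-force search over staking fractions in [0, 0.998] for comparison.
grid = [i / 1000 for i in range(0, 999)]
best = max(grid, key=lambda f: expected_log_growth(f, p, b))

print(f"closed form: {f_star:.3f}, grid search: {best:.3f}")
```

An agent that stakes noticeably more or less than this fraction grows more slowly in the long run, which is the sense in which Kelly bettors are "most selected".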

The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selection theorems to prove.

New opinion:

People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough, you need to combine them with some other assumption (for example, that there exists some “resource” over which the agent has no terminal preferences). I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, or something about how to better infer human values. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I do think this research would be more useful than other types of alignment research I have seen proposed.

Comment by rohinmshah on Selection Theorems: A Program For Understanding Agents · 2021-10-10T14:31:10.780Z · LW · GW

Planned summary for the Alignment Newsletter:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because they tell us likely properties of the agents we build.

As an example, [coherence arguments](https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any non-dominated agent can be represented as maximizing expected utility. (What does it mean to be non-dominated? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose money.) If you combine this with the very reasonable assumption that we will tend to build non-dominated agents, then we can conclude that we select for agents that can be represented as maximizing expected utility.
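As a toy illustration of a Dutch book (my own sketch, not from the linked post), consider a hypothetical agent with cyclic preferences over three items, A < B < C < A. A bookie can keep offering the item the agent strictly prefers to the one it currently holds, charging a small fee each time. The items, the fee, and the function name are all assumptions made for the example.

```python
# Hypothetical cyclic preferences: for each held item, the item the agent
# strictly prefers to it. A < B, B < C, and C < A form a cycle.
PREFERRED_OVER = {"A": "B", "B": "C", "C": "A"}

def run_money_pump(start_item, money, fee, rounds):
    """Repeatedly offer the agent its preferred swap, charging `fee` per trade."""
    item = start_item
    for _ in range(rounds):
        if money < fee:
            break  # the agent can no longer afford to trade
        item = PREFERRED_OVER[item]  # the agent accepts: it prefers the new item
        money -= fee
    return item, money

item, money = run_money_pump("A", money=10, fee=1, rounds=6)
print(item, money)  # A 4
```

After six trades the agent holds the item it started with and has paid six fees, so it is strictly dominated by an agent that simply refused to trade. An agent with a consistent utility function over the items has no such cycle to exploit.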

Coherence arguments aren’t the only kind of selection theorem. The <@good(er) regulator theorem@>(@Fixing The Good Regulator Theorem@) provides a set of scenarios under which agents learn an internal “world model”. The [Kelly criterion](http://www.eecs.harvard.edu/cs286r/courses/fall10/papers/Chapter6.pdf) tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in [this followup post](https://www.alignmentforum.org/posts/N2NebPD78ioyWHhNm/some-existing-selection-theorems).

The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selection theorems to prove.

Planned opinion:

People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough, you need to combine them with some other assumption (for example, that there is a money-like resource over which the agent has no terminal preferences). Similarly, I don’t expect this research agenda to find a selection theorem that says that an existential catastrophe occurs _assuming only that the agent is intelligent_, but I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, because we think the assumptions involved in the theorems are quite likely to hold. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I would not actively discourage anyone from doing this sort of research, and I think it would be more useful than other types of research I have seen proposed.

Comment by rohinmshah on Selection Theorems: A Program For Understanding Agents · 2021-10-10T13:06:25.592Z · LW · GW

At the same time, better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment.

The former statement makes sense, but can you elaborate on the latter statement? I suppose I could imagine selection theorems revealing that we really do get alignment by default, but I don't see how they quickly lead to solutions to AI alignment if there is a problem to solve.

Comment by rohinmshah on Brain-inspired AGI and the "lifetime anchor" · 2021-10-10T11:46:35.971Z · LW · GW

ASSUMPTION 1: There’s a “secret sauce” of human intelligence, and it looks like a learning algorithm (and associated inference algorithm).

ASSUMPTION 2: It’s a fundamentally different learning algorithm from deep neural networks. I don’t just mean a different neural network architecture, regularizer, etc. I mean really different, like “involving probabilistic program inference algorithms” or whatever.

ASSUMPTION 3: The algorithm is human-legible, but nobody knows how it works yet.

ASSUMPTION 4: We'll eventually figure out this “secret sauce” and get Transformative AI (TAI).

These seem like easily the load-bearing part of the argument; I agree the stuff you listed follows from these assumptions, but why should these assumptions be true?

I can imagine justifying assumption 2, and maybe also assumption 1, using biology knowledge that I don't have. I don't see how you justify assumptions 3 and 4. Note that assumption 4 also needs to include a claim that we figure out the "secret sauce" sooner than other paths to AGI, despite lots of effort being put into them already.

Comment by rohinmshah on AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism · 2021-10-09T13:28:42.526Z · LW · GW

Hmm, you might want to reach out to CHAI folks, though I don't have a specific person in mind at the moment. (I myself am working on different things now.)

Comment by rohinmshah on What Do GDP Growth Curves Really Mean? · 2021-10-08T09:08:46.327Z · LW · GW

Another problem with using GDP to predict something like "continuously increasing impact on the world" is that it seems like new technologies often lead to huge surplus that wouldn't get captured in a GDP metric. Search engines are ridiculously useful, people say they would pay a lot to not lose them, and yet they're free.

(This is arguably the same problem as you identify, in that as you mention you have to quantify the value of a search engine or a smartphone to include it in a GDP metric, and there isn't an obvious way to assign a value in 1960-dollars to a search engine, so you just go with what people pay for it now.)

Personally, the main evidence I rely on for "no discontinuities in impact" is that it seems like across a range of industries / technologies, after the first zero-to-one transition in which the technology is invented, the improvements on the technology tend to be continuous / incremental, and so too is its impact on the world.

(This needs to be combined with a claim that the zero-to-one transition for AI will lead to AI systems that are subhuman and so not very impactful. My impression is that some people disagree with this, seeing current AI systems as qualitatively-different-from-AGI, and expecting a completely different zero-to-one transition to AGI, in which the resulting AGI is immediately much better than humans on some important axis, like ability to self-improve. I'm not sure why they think this, if in fact this is an accurate representation of their beliefs.)

Comment by rohinmshah on Distinguishing AI takeover scenarios · 2021-10-07T19:42:58.403Z · LW · GW

Planned summary for the Alignment Newsletter:

This post summarizes several AI takeover scenarios that have been proposed, and categorizes them according to three main variables. **Speed** refers to the question of whether there is a sudden jump in AI capabilities. **Uni/multipolarity** asks whether a single AI system takes over, or many. **Alignment** asks what goals the AI systems pursue, and if they are misaligned, further asks whether they are outer or inner misaligned. The authors also analyze other properties of the scenarios, such as how agentic, general and/or homogeneous the AI systems are, and whether AI systems coordinate with each other or not. A [followup post](https://www.alignmentforum.org/posts/zkF9PNSyDKusoyLkP/investigating-ai-takeover-scenarios) investigates social, economic, and technological characteristics of these scenarios. It also generates new scenarios by varying some of these factors.

Since these posts are themselves summaries and comparisons of previously proposed scenarios that we’ve covered in this newsletter, I won’t summarize them here, but I do recommend them for an overview of AI takeover scenarios.

Comment by rohinmshah on Beyond fire alarms: freeing the groupstruck · 2021-10-07T19:02:33.464Z · LW · GW

Planned summary for the Alignment Newsletter:

It has been claimed that there’s no fire alarm for AGI, that is, there will be no specific moment or event at which AGI risk becomes sufficiently obvious and agreed upon, so that freaking out about AGI becomes socially acceptable rather than embarrassing. People often implicitly argue for waiting for an (unspecified) future event that tells us AGI is near, after which everyone will know that it’s okay to work on AGI alignment. This seems particularly bad if no such future event (i.e. fire alarm) exists.

This post argues that this is not in fact the implicit strategy that people typically use to evaluate and respond to risks. In particular, it is too discrete. Instead, people perform “the normal dance of accumulating evidence and escalating discussion and brave people calling the problem early and eating the potential embarrassment”. As a result, the existence of a “fire alarm” is not particularly important.

Note that the author does agree that there is some important bias at play here. The original fire alarm post is implicitly considering a _fear shame hypothesis_: people tend to be less cautious in public, because they expect to be negatively judged for looking scared. The author ends up concluding that there is something broader going on and proposes a few possibilities, many of which still suggest that people will tend to be less cautious around risks when they are observed.

Some points made in the very detailed, 15,000-word article:

1. Literal fire alarms don’t work by creating common knowledge, or by providing evidence of a fire. People frequently ignore fire alarms. In one experiment, participants continued to fill out questionnaires while a fire alarm rang, often assuming that someone would lead them outside if it was important.

2. They probably instead work by a variety of mechanisms, some of which are related to the fear shame hypothesis. Sometimes they provide objective evidence that is easier to use as a justification for caution than a personal guess. Sometimes they act as an excuse for cautious or fearful people to leave, without the implication that those people are afraid. Sometimes they act as a source of authority for a course of action (leaving the building).

3. Most of these mechanisms are amenable to partial or incremental effects, and in particular can happen with AGI risk. There are many people who have already boldly claimed that AGI risk is a problem. There exists person-independent evidence; for example, surveys of AI researchers suggest a 5% chance of extinction.

4. For other risks, there does not seem to have been a single discrete moment at which it became acceptable to worry about them (i.e. no “fire alarm”). This includes risks where there has been a lot of caution, such as climate change, the ozone hole, recombinant DNA, COVID, and nuclear weapons.

5. We could think about _building_ fire alarms; many of the mechanisms above are social ones rather than empirical facts about the world. This could be one out of many strategies that we employ against the general bias towards incaution (the post suggests 16).

Planned opinion:

I enjoyed this article quite a lot; it is _really_ thorough. I do see a lot of my own work as pushing on some of these more incremental methods for increasing caution, though I think of it more as a combination of generating more or better evidence, and communicating arguments in a manner more suited to a particular audience. Perhaps I will think of new strategies that aim to reduce fear shame instead.

Comment by rohinmshah on What Heuristics Do You Use to Think About Alignment Topics? · 2021-09-29T09:16:08.402Z · LW · GW

Some notes from a 2018 CHAI meeting on this topic (with some editing). I don't endorse everything on here, nor would CHAI-the-organization.

• Learning part of your model that previously was fixed.
  • Can be done using neural nets, other ML models, or uncertainty (probability distributions)
  • Example: learning biases instead of hardcoding Boltzmann rationality
  • Relatedly, treating an object as evidence instead of as ground truth
• Looking at current examples of the problem and/or its solutions:
  • How does the human brain / nature do it?
  • How does human culture/society do this?
  • How has cognitive science formalized similar problems / what insights has it produced that we can build on?
• Principal-agent models/Contracting theory
• Internal design of the system to be adversarial inherently (e.g. debate)
• ‘Normative Bandwidth’
  • How much information about the correct behavior is actually conveyed vs how much information does the robot policy assume is conveyed?
  • E.g. interpreting a reward function literally assumes that you are getting all the information necessary for the optimal policy. That’s a huge amount of information, and that assumption is always wrong. What’s actually conveyed is a much smaller amount of information, something like what Inverse Reward Design does (where it says the reward function only conveys information about good behavior in the training environments).
• Proactive Learning (i.e. what if we ask the human?)
• Induction (see e.g. iterated amplification)
  • Get good safety properties in simple situations, and then use them to build something more capable while preserving safety properties
• Analyze a simple model of the situation in theory
• Rationality -- either make sure the agent is rational, or make sure it isn’t (i.e. don’t build agents)
• Thinking about the human-robot system as a whole, rather than the robot in isolation. (See e.g. CIRL / assistance games.)
• How would you do it with infinite resources (relaxed constraints)?
  • E.g. AIXI, Solomonoff induction, open-source game theory

Comment by rohinmshah on What Heuristics Do You Use to Think About Alignment Topics? · 2021-09-29T09:06:04.676Z · LW · GW

I imagine Rohin Shah would see a proposed solution and ask how it intervenes on a threat model (also correct me if I'm wrong!).

That's certainly one heuristic I often use :)

Comment by rohinmshah on Neural net / decision tree hybrids: a potential path toward bridging the interpretability gap · 2021-09-23T07:00:16.888Z · LW · GW

Sounds like a cool project, I'm looking forward to seeing the results!

Comment by rohinmshah on The theory-practice gap · 2021-09-22T13:59:36.782Z · LW · GW

It's nothing quite so detailed as that. It's more like "maybe in the exotic circumstances we actually encounter, the objective does generalize, but also maybe not; there isn't a strong reason to expect one over the other". (Which is why I only say it is plausible that the AI system works fine, rather than probable.)

You might think that the default expectation is that AI systems don't generalize. But in the world where we've gotten an existential catastrophe, we know that the capabilities generalized to the exotic circumstance; it seems like whatever made the capabilities generalize could also make the objective generalize in that exotic circumstance.

Comment by rohinmshah on Redwood Research’s current project · 2021-09-22T09:36:59.899Z · LW · GW

Planned summary for the Alignment Newsletter:

This post introduces Redwood Research’s current alignment project: to ensure that a language model finetuned on fanfiction never describes someone getting injured, while maintaining the quality of the generations of that model. Their approach is to train a classifier that determines whether a given generation has a description of someone getting injured, and then to use that classifier as a reward function to train the policy to generate non-injurious completions.
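As a rough illustration of the "classifier as reward" idea in the summary above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `injury_probability` is a hypothetical keyword-matching stand-in for a trained classifier, and `best_completion` only scores and selects among candidate completions rather than training a policy, so it should not be read as Redwood's actual method.

```python
from typing import Callable, List

# Hypothetical keyword list; a real classifier would be a trained model.
INJURY_WORDS = {"wounded", "bleeding", "injured", "stabbed"}


def injury_probability(text: str) -> float:
    """Toy stand-in for a classifier: 1.0 if any injury keyword appears, else 0.0."""
    words = {w.strip(".,!?;:").lower() for w in text.split()}
    return 1.0 if words & INJURY_WORDS else 0.0


def reward(text: str, classifier: Callable[[str], float] = injury_probability) -> float:
    """Reward is high when the classifier believes no injury is described."""
    return 1.0 - classifier(text)


def best_completion(candidates: List[str]) -> str:
    """Return the highest-reward candidate (the first one wins ties)."""
    return max(candidates, key=reward)


candidates = ["She was bleeding badly.", "They shared a quiet meal."]
print(best_completion(candidates))
```

Selecting among samples is a simpler substitute for using the reward to train the policy; it is used here only because it keeps the sketch self-contained.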

Comment by rohinmshah on How truthful is GPT-3? A benchmark for language models · 2021-09-22T09:01:03.264Z · LW · GW

Re: human evaluation, I've added a sentence at the end of the summary:

It could be quite logistically challenging to use this benchmark to test new language models, since it depends on human evaluations. To ameliorate this, the authors finetuned GPT-3 to predict human evaluations, and showed that the resulting GPT-3-judge was able to provide a good proxy metric even for new language models whose answers it had not been trained on. Note also that you can use the version of the task where a model must choose between true and false reference answers, for an automated evaluation.

I take your point about there being reference solutions to make human evaluation easier, but I think it's probably more detail than I want to go into in this summary.

Overall, I’m probably less optimistic than you are about how much prompts will help for the models we tried (GPT-3, GPT-J, etc). However, prompts could help more for larger models (as they may understand the prompts better).

I mostly just meant to claim the second thing; I don't have much intuition for the first thing. From my perspective the interesting claim is that an appropriate prompt would change the trend from "larger models perform worse" to "larger models perform better, past a certain model size".

I do think though that the evidence you show suggests that the "certain model size" is probably bigger than GPT-3, given that true+informative doesn't change much across prompts.

However, many of our examples involve human misconceptions about the world that seem harder to characterize in a simple instruction

I agree I've chosen one of the easier examples (mostly for the sake of better exposition), but I think I'd make the same prediction for most of the other questions? E.g. you could frame it as an interview with Alice, who graduated from an elite university, bases her beliefs on evidence rather than superstition, is careful to say she doesn't know when she doesn't, but nonetheless has a surprising amount of knowledge. Looking at the examples in the paper, I feel like this plausibly would get you to truthful and somewhat informative answers on most of the questions.

I've changed the opinion to:

I like this as an example of the kind of failure mode that does not immediately go away as models become more capable. However, it is possible that this trend could be reversed with better prompts. Take the Isaac Asimov example: if the prompt explicitly says that the questions are about the real world, it may be that a sufficiently capable model would infer that the text is not talking about Asimov’s books, and so ends up giving a truthful answer. In this case, you would see performance decreasing with model size up to a point, after which model performance _increases_ now that the model has sufficient understanding of the prompt. See more discussion [here](https://www.alignmentforum.org/posts/PF58wEdztZFX2dSue/how-truthful-is-gpt-3-a-benchmark-for-language-models?commentId=qoB2swyX4ZJrhhttB).