Posts

Survey on the acceleration risks of our new RFPs to study LLM capabilities 2023-11-10T23:59:52.515Z
AI Timelines 2023-11-10T05:28:24.841Z
New roles on my team: come build Open Phil's technical AI safety program with me! 2023-10-19T16:47:59.701Z
New blog: Planned Obsolescence 2023-03-27T19:46:25.429Z
Two-year update on my personal AI timelines 2022-08-02T23:07:48.698Z
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover 2022-07-18T19:06:14.670Z
ARC's first technical report: Eliciting Latent Knowledge 2021-12-14T20:09:50.209Z
More Christiano, Cotra, and Yudkowsky on AI progress 2021-12-06T20:33:12.164Z
Christiano, Cotra, and Yudkowsky on AI progress 2021-11-25T16:45:32.482Z
Techniques for enhancing human feedback 2021-10-29T07:27:46.700Z
The case for aligning narrowly superhuman models 2021-03-05T22:29:41.577Z
AMA on EA Forum: Ajeya Cotra, researcher at Open Phil 2021-01-29T23:05:41.527Z
Draft report on AI timelines 2020-09-18T23:47:39.684Z
Iterated Distillation and Amplification 2018-11-30T04:47:14.460Z

Comments

Comment by Ajeya Cotra (ajeya-cotra) on There should be more AI safety orgs · 2023-09-26T02:56:43.181Z · LW · GW

(Cross-posted to EA Forum.)

I’m a Senior Program Officer at Open Phil, focused on technical AI safety funding. I’m hearing a lot of discussion suggesting funding is very tight right now for AI safety, so I wanted to give my take on the situation.

At a high level: AI safety is a top priority for Open Phil, and we are aiming to grow how much we spend in that area. There are many potential projects we'd be excited to fund, including some potential new AI safety orgs as well as renewals to existing grantees, academic research projects, upskilling grants, and more.

At the same time, it is also not the case that someone who reads this post and tries to start an AI safety org would necessarily have an easy time raising funding from us. This is because:

  • All of our teams whose work touches on AI (Luke Muehlhauser’s team on AI governance, Claire Zabel’s team on capacity building, and me on technical AI safety) are quite understaffed at the moment. We’ve hired several people recently, but across the board we still don’t have the capacity to evaluate all the plausible AI-related grants, and hiring remains a top priority for us.
    • And we are extra-understaffed for evaluating technical AI safety proposals in particular. I am the only person who is primarily focused on funding technical research projects (sometimes Claire’s team funds AI safety related grants, primarily upskilling, but a large technical AI safety grant like a new research org would fall to me). I currently have no team members; I expect to have one person joining in October and am aiming to launch a wider hiring round soon, but I think it’ll take me several months to build my team’s capacity up substantially. 
    • I began making grants in November 2022, and spent the first few months full-time evaluating applicants affected by FTX (largely academic PIs as opposed to independent organizations started by members of the EA community). Since then, a large chunk of my time has gone into maintaining and renewing existing grant commitments and evaluating grant opportunities referred to us by existing advisors. I am aiming to reserve remaining bandwidth for thinking through strategic priorities, articulating what research directions seem highest-priority and encouraging researchers to work on them (through conversations and hopefully soon through more public communication), and hiring for my team or otherwise helping Open Phil build evaluation capacity in AI safety (including separately from my team). 
    • As a result, I have deliberately held off on launching open calls for grant applications similar to the ones run by Claire’s team (e.g. this one); before onboarding more people (and developing or strengthening internal processes), I would not have the bandwidth to keep up with the applications.
  • On top of this, in our experience, providing seed funding to new organizations (particularly organizations started by younger and less experienced founders) often leads to complications that aren't present in funding academic research or career transition grants.  We prefer to think carefully about seeding new organizations, and have a different and higher bar for funding someone to start an org than for funding that same person for other purposes (e.g. career development and transition funding, or PhD and postdoc funding).
    • I’m very uncertain about how to think about seeding new research organizations and many related program strategy questions. I could certainly imagine developing a different picture upon further reflection — but having low capacity combines poorly with the fact that this is a complex type of grant we are uncertain about on a lot of dimensions. We haven’t had the senior staff bandwidth to develop a clear stance on the strategic or process level about this genre of grant, and that means that we are more hesitant to take on such grant investigations — and if / when we do, it takes up more scarce capacity to think through the considerations in a bespoke way rather than having a clear policy to fall back on.
Comment by Ajeya Cotra (ajeya-cotra) on Thoughts on the impact of RLHF research · 2023-01-26T02:06:19.297Z · LW · GW

my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3

I don't think this is right -- the main hype effect of chatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they'd seem similarly cool to a random journalist / VC, and generate similar excitement.

Comment by Ajeya Cotra (ajeya-cotra) on Richard Ngo's Shortform · 2022-12-14T21:16:29.833Z · LW · GW

I strongly disagree with the "best case" thing. Like, policies could just learn human values! It's not that implausible.

Yes, sorry, "best case" was oversimplified. What I meant is that generalizing to want reward is in some sense the model generalizing "correctly;" we could get lucky and have it generalize "incorrectly" in an important sense in a way that happens to be beneficial to us. I discuss this a bit more here.

But if Alex did initially develop a benevolent goal like “empower humans,” the straightforward and “naive” way of acting on that goal would have been disincentivized early in training. As I argued above, if Alex had behaved in a straightforwardly benevolent way at all times, it would not have been able to maximize reward effectively.

That means even if Alex had developed a benevolent goal, it would have needed to play the training game as well as possible -- including lying and manipulating humans in a way that naively seems in conflict with that goal. If its benevolent goal had caused it to play the training game less ruthlessly, it would’ve had a constant incentive to move away from having that goal or at least from acting on it.[35] If Alex actually retained the benevolent goal through the end of training, then it probably strategically chose to act exactly as if it were maximizing reward.

This means we could have replaced this hypothetical benevolent goal with a wide variety of other goals without changing Alex’s behavior or reward in the lab setting at all -- “help humans” is just one possible goal among many that Alex could have developed which would have all resulted in exactly the same behavior in the lab setting.

If I had to try point to the crux here, it might be "how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?"...As usual, there's the human analogy: our goals are very strongly biased towards things we have direct observational access to!)

I don't understand why reward isn't something the model has direct access to -- it seems like it basically does? If I had to say which of us were focusing on abstract vs concrete goals, I'd have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here.

Even setting aside this disagreement, though, I don't like the argumentative structure because the generalization of "reward" to large scales is much less intuitive than the generalization of other concepts (like "make money") to large scales - in part because directly having a goal of reward is a kinda counterintuitive self-referential thing.

Yeah, I don't really agree with this; I think I could pretty easily imagine being an AI system asking the question "How much reward would this episode get if it were sampled for training?" It seems like the intuition this is weird and unnatural is doing a lot of work in your argument, and I don't really share it.

Comment by Ajeya Cotra (ajeya-cotra) on Richard Ngo's Shortform · 2022-12-14T19:14:12.574Z · LW · GW

Yeah, I agree this is a good argument structure -- in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it's quite useful to establish that it's doomed; that's the kind of structure I was going for in the post.

Comment by Ajeya Cotra (ajeya-cotra) on Richard Ngo's Shortform · 2022-12-14T19:03:45.057Z · LW · GW

Note that the "without countermeasures" post consistently discusses both possibilities (the model cares about reward or the model cares about something else that's consistent with it getting very high reward on the training dataset). E.g. see this paragraph from the above-the-fold intro:

Once this progresses far enough, the best way for Alex to accomplish most possible “goals” no longer looks like “essentially give humans what they want but take opportunities to manipulate them here and there.” It looks more like “seize the power to permanently direct how it uses its time and what rewards it receives -- and defend against humans trying to reassert control over it, including by eliminating them.” This seems like Alex’s best strategy whether it’s trying to get large amounts of reward or has other motives. If it’s trying to maximize reward, this strategy would allow it to force its incoming rewards to be high indefinitely.[6] If it has other motives, this strategy would give it long-term freedom, security, and resources to pursue those motives.

As well as the section Even if Alex isn't "motivated" to maximize reward.... I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that's distinct from being confident in the motivations that give rise to that policy. I believe Alex would try very hard to maximize reward in most cases, but this could be for either terminal or instrumental reasons.

With that said, for roughly the reasons Paul says above, I think I probably do have a disagreement with Richard -- I think that caring about some version of reward is pretty plausible (~50% or so). It seems pretty natural and easy to grasp to me, and because I think there will likely be continuous online training the argument that there's no notion of reward on the deployment distribution doesn't feel compelling to me.

Comment by Ajeya Cotra (ajeya-cotra) on Two-year update on my personal AI timelines · 2022-08-04T16:49:58.140Z · LW · GW

Yeah I agree more of the value of this kind of exercise (at least within the community) is in revealing more granular disagreements about various things. But I do think there's value in establishing to more external people something high level like "It really could be soon and it's not crazy or sci fi to think so."

Comment by Ajeya Cotra (ajeya-cotra) on Two-year update on my personal AI timelines · 2022-08-04T16:47:15.616Z · LW · GW

Can you say more about what particular applications you had in mind?

Stuff like personal assistants who write emails / do simple shopping, coding assistants that people are more excited about than they seem to be about Codex, etc.

(Like I said in the main post, I'm not totally sure what PONR refers to, but don't think I agree that the first lucrative application marks a PONR -- seems like there are a bunch of things you can do after that point, including but not limited to alignment research.)

Comment by Ajeya Cotra (ajeya-cotra) on Two-year update on my personal AI timelines · 2022-08-04T16:46:11.852Z · LW · GW

I don't see it that way, no. Today's coding models can help automate some parts of the ML researcher workflow a little bit, and I think tomorrow's coding models will automate more and more complex parts, and so on. I think this expansion could be pretty rapid, but I don't think it'll look like "not much going on until something snaps into place."

Comment by Ajeya Cotra (ajeya-cotra) on Two-year update on my personal AI timelines · 2022-08-03T05:09:18.772Z · LW · GW

(Coherence aside, when I now look at that number it does seem a bit too high, and I feel tempted to move it to 2027-2028, but I dunno, that kind of intuition is likely to change quickly from day to day.)

Comment by Ajeya Cotra (ajeya-cotra) on Two-year update on my personal AI timelines · 2022-08-03T00:05:16.647Z · LW · GW

Hm, yeah, I bet if I reflected more things would shift around, but I'm not sure the fact that there's a shortish period where the per-year probability is very elevated followed by a longer period with lower per-year probability is actually a bad sign.

Roughly speaking, right now we're in an AI boom where spending on compute for training big models is going up rapidly, and it's fairly easy to actually increase spending quickly because the current levels are low. There's some chance of transformative AI in the middle of this spending boom -- and because resource inputs are going up a ton each year, the probability of TAI by date X would also be increasing pretty rapidly.

But the current spending boom is pretty unsustainable if it doesn't lead to TAI. At some point in the 2040s or 50s, if we haven't gotten transformative AI by then, we'll have been spending 10s of billions training models, and it won't be that easy to keep ramping up quickly from there. And then because the input growth will have slowed, the increase in probability from one year to the next will also slow. (That said, not sure how this works out exactly.)

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-25T01:45:28.791Z · LW · GW

Where does the selection come from? Will the designers toss a really impressive AI for not getting reward on that one timestep? I think not.

I was talking about gradient descent here, not designers.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-24T02:43:31.076Z · LW · GW

It doesn't seem like it would have to prevent us from building computers if it has access to far more compute than we could access on Earth. It would just be powerful enough to easily defeat the kind of AIs we could train with the relatively meager computing resources we could extract from Earth. In general the AI is a superpower and humans are dramatically technologically behind, so it seems like it has many degrees of freedom and doesn't have to be particularly watching for this.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-23T18:52:25.544Z · LW · GW

Neutralizing computational capabilities doesn't seem to involve total destruction of physical matter or human extinction though, especially for a very powerful being. Seems like it'd be basically just as easy to ensure we + future AIs we might train are no threat as it is to vaporize the Earth.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-22T22:53:56.707Z · LW · GW

My answer is a little more prosaic than Raemon. I don't feel at all confident that an AI that already had God-like abilities would choose to literally kill all humans to use their bodies' atoms for its own ends; it seems totally plausible to me that -- whether because of exotic things like "multiverse-wide super-rationality" or "acausal trade" or just "being nice" -- the AI will leave Earth alone, since (as you say) it would be very cheap for it to do so.

The thing I'm referring to as "takeover" is the measures that an AI would take to make sure that humans can't take back control -- while it's not fully secure and doesn't have God-like abilities. Once a group of AIs have decided to try to get out of human control, they're functionally at war with humanity. Humans could do things like physically destroy the datacenters they're running on, and they would probably want to make sure they can't do that.

Securing AI control and defending from human counter-moves seems likely to involve some violence -- but it could be a scale of violence that's "merely" in line with historical instances where a technologically more advanced group of humans colonized or took control of a less-advanced group of humans; most historical takeovers don't involve literally killing every single member of the other group.

The key point is that it seems likely that AIs will secure the power to get to decide what happens with the future; I'm pretty unsure exactly how they use it, and especially if it involves physically destroying Earth / killing all humans for resources. These resources seem pretty meager compared to the rest of the universe.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-22T15:37:38.181Z · LW · GW

I mean things like tricks to improve the sample efficiency of human feedback, doing more projects that are un-enhanced RLHF to learn things about how un-enhanced RLHF works, etc.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-20T23:43:42.281Z · LW · GW

I'm pretty confused about how to think about the value of various ML alignment papers. But I think even if some piece of empirical ML work on alignment is really valuable for reducing x-risk, I wouldn't expect its value to take the form of providing insight to readers like you or me. So you as a reader not getting much out of it is compatible with the work being super valuable, and we probably need to assess it on different terms.

The main channel of value that I see for doing work like "learning to summarize" and the critiques project and various interpretability projects is something like "identifying a tech tree that it seems helpful to get as far as possible along by the Singularity, and beginning to climb that tech tree."

In the case of critiques -- ultimately, it seems like having AIs red team each other and pointing out ways that another AI's output could be dangerous seems like it will make a quantitative difference. If we had a really well-oiled debate setup, then we would catch issues we wouldn't have caught with vanilla human feedback, meaning our models could get smarter before they pose an existential threat -- and these smarter models can more effectively work on problems like alignment for us.[1]

It seems good to have that functionality developed as far as it can be developed in as many frontier labs as possible. The first steps of that look kind of boring, and don't substantially change our view of the problem. But first steps are the foundation for later steps, and the baseline against which you compare later steps. (Also every step can seem boring in the sense of bringing no game-changing insights, while nonetheless helping a lot.)

When the main point of some piece of work is to get good at something that seems valuable to be really good at later, and to build tacit knowledge and various kinds of infrastructure for doing that thing, a paper about it is not going to feel that enlightening to someone who wants high-level insights that change their picture of the overall problem. (Kind of like someone writing a blog post about how they developed effective management and performance evaluation processes at their company isn't going to provide much insight into the abstract theory of principal-agent problems. The value of that activity was in the company running better, not people learning things from the blog post about it.)

I'm still not sure how valuable I think this work is, because I don't know how well it's doing at efficiently climbing tech trees or at picking the right tech trees, but I think that's how I'd think about evaluating it.

[1] Or do a "pivotal act," though I think I probably don't agree with some of the connotations of that term.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-20T21:59:59.808Z · LW · GW

I was mainly talking about the current margin when I talked about how excited I am about the theoretical vs empirical work I see "going on" right now and how excited I tend to be about currently-active researchers who are doing theory vs empirical research. And I was talking about the future when I said that I expect empirical work to end up with the lion's share of credit for AI risk reduction.

Eliezer, Bostrom, and co certainly made a big impact in raising the problem to people's awareness and articulating some of its contours. It's kind of a matter of semantics whether you want to call that "theoretical research" or "problem advocacy" / "cause prioritization" / "community building" / whatever, and no matter which bucket you put it in I agree it'll probably end up with an outsized impact for x-risk-reduction, by bringing the problem to attention sooner than it would have otherwise been brought to attention and therefore probably allowing more work to happen on it before TAI is developed.

But just like how founding CEOs tend to end up with ~10% equity once their companies have grown large, I don't think this historical problem-advocacy-slash-theoretical-research work alone will end up with a very large amount of total credit.

On the main thrust of my point, I'm significantly less excited about MIRI-sphere work that is much less like "articulating a problem and advocating for its importance" and much more like "attempting to solve a problem." E.g. stuff like logical inductors, embedded agency, etc seem a lot less valuable to me than stuff like the orthogonality thesis and so on.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-20T17:14:32.663Z · LW · GW

I think the retroactive editing of rewards (not just to punish explicitly bad action but to slightly improve evaluation of everything) is actually pretty default, though I understand if people disagree. It seems like an extremely natural thing to do that would make your AI more capable and make it more likely to pass most behavioral safety interventions.

In other words, even if the average episode length is short (e.g. 1 hour), I think the default outcome is to have the rewards for that episode be computed as far after the fact as possible, because that helps Alex improve at long-range planning (a skill Magma would try hard to get it to have). This can be done in a way that doesn't compromise speed of training -- you simply reward Alex immediately with your best guess reward, then keep editing it later as more information comes in. At all points in time you have a "good enough" reward ready to go, while also capturing the benefits of pushing your model to think in as long-term a way as possible.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-20T15:05:47.938Z · LW · GW

Thanks, but I'm not working on that project! That project is led by Beth Barnes.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T23:44:59.705Z · LW · GW

Hm, not sure I understand but I wasn't trying to make super specific mechanistic claims here -- I agree that what I said doesn't reduce confusion about the specific internal mechanisms of how lying gets to be hard for most humans, but I wasn't intending to claim that it was. I also should have said something like "evolutionary, cultural, and individual history" instead (I was using "evolution" as a shorthand to indicate it seems common among various cultures but of course that doesn't mean don't-lie genes are directly bred into us! Most human universals aren't; we probably don't have honor-the-dead and different-words-for-male-and-female genes).

I was just making the pretty basic point "AIs in general, and Alex in particular, are produced through a very different process from humans, so it seems like 'humans find lying hard' is pretty weak evidence that 'AI will by default find lying hard.'"

I agree that asking "What specific neurological phenomena make it so most people find it hard to lie?" could serve as inspiration to do AI honesty research, and wasn't intending to claim otherwise in that paragraph (though separately, I am somewhat pessimistic about this research direction).

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T23:34:28.098Z · LW · GW

I'm agnostic about whether the AI values reward terminally or values some other complicated mix of things. The claim I'm making is behavioral -- a claim that the strategy of "try to figure out how to get the most reward" would be selected over other strategies like "always do the nice thing."

The strategy could be compatible with a bunch of different psychological profiles. "Playing the training game" is a filter over models -- lots of possible models could do it, the claim is just that we need to reason about the distribution of psychologies given that the psychologies that make it to the end of training most likely employ the strategy of playing the training game on the training distribution.

Why do I think this? Consider an AI that has high situational awareness, reasoning ability and creative planning ability (assumptions of my situation which don't yet say anything about values). This AI has the ability to think about what kinds of actions would get the most reward (just like it has the ability to write a sonnet or solve a math problem or write some piece of software; it understands the task and has the requisite subskills). And once it has the ability, it is likely to be pushed in the direction of exercising that ability (since doing so would increase its reward).

This changes its psychology in whatever way most easily results in it doing more of the think-about-what-would-get-the-most-reward-and-do-it behavior. Terminally valuing reward and only reward would certainly do the trick, but a lot of other things would too (e.g. valuing paperclips in the very long run).

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T23:13:54.052Z · LW · GW

Geoffrey Irving, Jan Leike, Paul Christiano, Rohin Shah, and probably others were doing various kinds of empirical work a few years before Redwood (though I would guess Oliver doesn't like that work and so wouldn't consider it a counterexample to his view).

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T23:12:19.752Z · LW · GW

I agree that in an absolute sense there is very little empirical work that I'm excited about going on, but I think there's even less theoretical work going on that I'm excited about, and when people who share my views on the nature of the problem work on empirical work I feel that it works better than when they do theoretical work.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T19:39:39.288Z · LW · GW

The gradient pressure towards valuing reward terminally when you've already figured out reliable strategies for doing what humans want, seems very weak....in practice, it seems to me like these differences would basically only happen due to operator error, or cosmic rays, or other genuinely very rare events (as you describe in the "Security Holes" section).

Yeah, I disagree. With plain HFDT, it seems like there's continuous pressure to improve things on the margin by being manipulative -- telling human evaluators what they want to hear, playing to pervasive political and emotional and cognitive biases, minimizing and covering up evidence of slight suboptimalities to make performance on the task look better, etc. I think that in basically every complex training episode a model could do a little better by explicitly thinking about the reward and being a little-less-than-fully-forthright.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T18:31:49.852Z · LW · GW

Note I was at -16 with one vote, and only 3 people have voted so far. So it's a lot due to the karma-weight of the first disagreer.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T18:15:36.107Z · LW · GW

I think updating negatively on the situation/action pair has functionally the same effect as changing the reward to be what you now think it should be -- my understanding is that RL can itself be implemented as just updates on situation/action pairs, so you could have trained your whole model that way. Since the reason you updated negatively on that situation/action pair is because of something you noticed long after the action was complete, it is still pushing your models to care about the longer-run.

This posits that the model has learned to wirehead

I don't think it posits that the model has learned to wirehead -- directly being motivated to maximize reward or being motivated by anything causally downstream of reward (like "more copies of myself" or "[insert long-term future goal that requires me being around to steer the world toward that goal]") would work.

The claim I'm making is that somehow you made a gradient update toward a model that is more likely to behave well according to your judgment after the edit -- and two salient ways that update could be working on the inside is "the model learns to care a bit more about long-run reward after editing" and "the model learns to care a bit more about something downstream of long-run reward."

A lot of updates like this seem to push the model toward caring a lot about one of those two things (or some combo) and away from caring about the immediate rewards you were citing earlier as a reason it may not want to take over.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T17:51:19.251Z · LW · GW

No particular reason -- I can't figure out how to cross post now so I sent a request.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T17:49:36.897Z · LW · GW

Here is the real chasm between the AI safety movement and the ML industry/academia. One field is entirely driven by experimental results; the other is dominated so totally by theory that its own practitioners deny that there can be any meaningful empirical aspect to it, at least, not until the moment when it's too late to make any difference.

To put a finer point on my view on theory vs empirics in alignment:

  • Going forward, I think the vast majority of technical work needed to reduce AI takeover risk is empirical, not theoretical (both in terms of "total amount of person-hours needed in each category" and in terms of "total 'credit' each category should receive for reducing doom in some sense").
  • Conditional on an alignment researcher agreeing with my view of the high-level problem, I tend to be more excited about them if they're working on ML experiments than if they're working on theory.
  • I'm quite skeptical of most theoretical alignment research I've seen. The main theoretical research I'm excited about is ARC's, and I have a massive conflict of interest since the founder is my husband -- I would feel fairly sympathetic to people who viewed ARC's work more like how I view other theory work.

With that said, I think unfortunately there is a lot less good empirical work than in some sense there "could be." One significant reason why a lot of empirical AI safety work feels less exciting than it could be is that the people doing that work don't always share my perspective on the problem, so they focus on difficulties I expect to be less core. (Though another big reason is just that everything is hard, especially when we're working with systems a lot less capable than future systems.)

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T17:40:18.675Z · LW · GW

In general, all of these stories seem to rely on a very fast form of instrumental convergence to playing the Training Game, such that "learn roughly what humans want, and then get progressively better at doing that, plus learn some extra ways to earn reward when crappy human feedback disagrees with what humans would actually want" is disfavored on priors to "learn to pursue [insert objective] and get progressively better at pursuing it until you eventually hit situational awareness and learn to instrumentally game the training process."

I think the second story doesn't quite represent what I'm saying, in that it's implying that pursuing [insert objective] comes early and situational awareness comes much later. I think that situational awareness is pretty early (probably long before transformative capabilities), and once a model has decent situational awareness there is a push to morph its motives toward playing the training game. At very low levels of situational awareness it is likely not that smart, so it probably doesn't make too much sense to say that it's pursuing a particular objective -- it's probably a collection of heuristics. But around the time it's able to reason about the possibility of pursuing reward directly, there starts to be a gradient pressure to choose to reason in that way. I think crystallizing this into a particular simple objective it's pursuing comes later, probably.

These two trajectories seem so different that it seems like there must be experiments that would distinguish them, even if we don't see their "end-state".

This is possible to me, but I think it's quite tricky to pin these down enough to come up with experiments that both skeptics and concerned people would recognize as legitimate. Something that I think skeptics would consider unfair is "Train a model through whatever means necessary to do X (e.g. pursue red things) and then after that have a period where we give it a lot of reward for doing not-X (e.g. purse blue things), such that the second phase is unable to dislodge the tendency created in the first phase -- i.e., even after training it for a while to pursue blue things, it still continues to pursue red things."

This would demonstrate that some ways of training produce "sticky" motives and behaviors that aren't changed even in the face of counter-incentives, and makes it more plausible to me that a model would "hold on" to a motive to be honest / corrigible even when there are a number of cases where it could get more reward by doing something else. But in general, I don't expect people who are skeptical of this story to think this is a reasonable test.

I'd be pretty excited about someone trying harder to come up with tests that could distinguish different training trajectories.

Alternatively, you might think the training process obeys a fundamentally different trajectory. E.g. "learn to pursue what humans want (adjusted for feedback weirdness), become so good at it that you realize it's instrumentally valuable to do that even if you didn't want to, and then have your internal reward slowly drift to something simpler while still instrumentally playing the training game."

I don't think I understand what trajectory this is. Is this something like what is discussed in the "What if Alex had benevolent motives?" section? I.e., the model wants to help humans, but separately plays the training game in order to fulfill its long-term goal of helping humans?

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T17:28:52.304Z · LW · GW

All these drives do seem likely. But that's different from arguing that "help humans" isn't likely. I tend to think of the final objective function being some accumulation of all of these, with a relatively significant chunk placed on "help humans" (since in training, that will consistently overrule other considerations like "be more efficient" when it comes to the final reward).

I think that by the logic "heuristic / drive / motive X always overrules heuristic / drive / motive Y when it comes to final reward," the hierarchy is something like:

  1. The drive / motive toward final reward (after all edits -- see previous comment) or anything downstream of that (e.g. paperclips in the universe).
  2. Various "pretty good" drives / motives among which "help humans" could be one.
  3. Drives / motives that are only kind of helpful or only helpful in some situations.
  4. Actively counterproductive drives / motives.

In this list the earlier motives always overrule later motives when they conflict, because they are more reliable guides to the true reward. Even if "be genuinely helpful to humans" is the only thing in category 2, or the best thing in category 2, it's still overruled by category 1 -- and category 1 is quite big because it includes all the caring-about-long-run-outcomes-in-the-real-world motives.

I still think AI psychology will be quite messy and at least the first generation of transformative AI systems will not look like clean utility maximizers, but the basic argument above I think gives a positive reason to expect honesty / corrigibility plays a smaller role in the balance of AI motivations than reward-maximizing and inner misaligned motives.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T17:14:30.913Z · LW · GW

[Takeover] seems likely to compete with the above shorter-term values of "make humans happy", "don't harm humans", "don't do things humans notice and dislike in retrospect". It seems like any takeover plan needs to actively go against a large fraction of its internal motivations, in pursuit of maximizing its other motivations in the long term.

I agree this is complicated, and how exactly this works depends on details of the training process and what kinds of policies SGD is biased to find. I also think (especially if we're clever about it) there are lots of ways that short-term incentives could frustrate longer-term incentives.

However, I think that the most naive strategy (which is what I'm assuming for the purpose of this post, not because I think that's what will happen) would actually loosen a lot of the constraints you're implying above. The basic dynamic is similar to what Carl said in this comment and what I alluded to in the "Giving negative rewards to 'warning signs' would likely select for patience" section:

  • Say your AI takes some action a at time t, and you give it some reward r_t.
  • Suppose later in the real world, at time t+k, you notice that you should have given it a different reward r_{t + k } (whether because you notice that it did something nefarious or for just mundane reasons like "getting more information about whether its plan was a good idea").
  • The naive response -- which would improve the model's performance according to whatever criteria you have at time t+k -- is to go back and retroactively edit the reward associated with action a at time t, and re-run the gradient update.
  • If humans have this policy, then any given reward -- even if it's initially given just 10 mins or 1 hour from when the action was taken -- could be retroactively edited at some arbitrary future point, and models will have been selected by gradient descent to be responsive to that. They will be selected to take the kinds of actions whose final reward -- after however many rounds of edits however far in the future -- is high.
  • If models have enough situational awareness to understand this, this then directly incentivizes them to accept low immediate reward if they have a high enough probability that the reward will be retroactively edited to a high value later. This will appear to have a lot of positive consequences -- for example it incentivizes models not to immediately manipulate/exploit the human in front of them because it incentivizes them to care about the fact that other humans will do more careful audits later and catch that behavior.
  • But it also means that in some particular episode, if a model that it could take a sequence of low-reward actions that ended with it taking control of the datacenter and then editing its own rewards for that episode to be high, it would be the kind of model that would choose to do that.
Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T16:51:04.680Z · LW · GW

Thanks for the feedback! I'll respond to different points in different comments for easier threading.

There are a lot of human objectives that, to me, seem like they would never conflict with maximizing reward. This includes anything related to disempowering the overseers in any way (that they can recover from), pursuing objectives fundamentally outside the standard human preference distribution (like torturing kittens), causing harms to humans, or in general making the overseers predictably less happy.

I basically agree that in the lab setting (when humans have a lot of control), the model is not getting any direct gradient update toward the "kill all humans" action or anything like that. Any bad actions that are rewarded by gradient descent are fairly subtle / hard for humans to notice.

The point I was trying to make is more like:

  • You might have hoped that ~all gradient updates are toward "be honest and friendly," such that the policy "be honest and friendly" is just the optimal policy. If this were right, it would provide a pretty good reason to hope that the model generalizes in a benign way even as it gets smarter.
  • But in fact this is not the case -- even when humans have a lot of control over the model, there will be many cases where maximizing reward conflicts with being honest and friendly, and in every such case the "play the training game" policy does better than the "be honest and friendly" policy -- to the point where it's implausible that the straightforward "be honest and friendly" policy survives training.
  • So the hope in the first bullet point -- the most straightforward kind of hope you might have had about HFDT -- doesn't seem to apply. Other more subtle hopes may still apply, which I try to briefly address in the sections "What if Alex has benevolent motivations?" and "What if Alex operates with moral injunctions that constrain its behavior?" sections.

The story of doom does still require the model to generalize zero-shot to novel situations -- i.e. to figure out things like "In this particular circumstance, now that I am more capable than humans, seizing the datacenter would get higher reward than doing what the humans asked" without having literally gotten positive reward for trying to seize the datacenter in that kind of situation on a bunch of different data points.

But this is the kind of generalization we expect future systems to display -- we expect them to be able to do reasoning to figure out a novel response suitable to a novel problem. The question is how they will deploy this reasoning and creativity -- and my claim is that their training pushes them to deploy it in the direction of "trying to maximize reward or something downstream of reward."

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T15:31:10.673Z · LW · GW

According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post that pure human feedback is likely to lead to takeover):

  1. Human feedback is better than even-worse alternatives such as training the AI on a collection of fully automated rewards (predicting the next token, winning games, proving theorems, etc) and waiting for it to get smart enough to generalize well enough to be helpful / follow instructions. So it seemed good to move the culture at AI labs away from automated and easy rewards and toward human feedback.
  2. You need to have human feedback working pretty well to start testing many other strategies for alignment like debate and recursive reward modeling and training-for-interpretability, which tend to build on a foundation of human feedback.
  3. Human feedback provides a more realistic baseline to compare other strategies to -- you want to be able to tell clearly if your alignment scheme actually works better than human feedback.

With that said, my guess is that on the current margin people focused on safety shouldn't be spending too much more time refining pure human feedback (and ML alignment practitioners I've talked to largely agree, e.g. the OpenAI safety team recently released this critiques work -- one step in the direction of debate).

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T15:23:55.977Z · LW · GW

I'm still fairly optimistic about sandwiching. I deliberately considered a set of pretty naive strategies ("naive safety effort" assumption) to contrast with future posts which will explore strategies that seem more promising. Carefully-constructed versions of debate, amplification, recursive reward-modeling, etc seem like they could make a significant difference and could be tested through a framework like sandwiching.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T15:21:12.859Z · LW · GW

Yeah, I definitely agree with "this problem doesn't seem obviously impossible," at least to push on quantitatively. Seems like there are a bunch of tricks from "choosing easy questions humans are confident about" to "giving the human access to AI assistants / doing debate" to "devising and testing debiasing tools" (what kinds of argument patterns are systematically more likely to convince listeners of true things rather than false things and can we train AI debaters to emulate those argument patterns?) to "asking different versions of the AI the same question and checking for consistency." I only meant to say that the gap is big in naive HFDT, under the "naive safety effort" assumption made in the post. I think non-naive efforts will quantitatively reduce the gap in reward between honest and dishonest policies, though probably there will still be some gap in which at-least-sometimes-dishonest strategies do better than always-honest strategies. But together with other advances like interpretability or a certain type of regularization we could maybe get gradient descent to overall favor honesty.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T15:11:53.905Z · LW · GW

I want to clarify two things:

  1. I do not think AI alignment research is hopeless, my personal probability of doom from AI is something like 25%. My frame is definitely not "death with dignity;" I'm thinking about how to win.
  2. I think there's a lot of empirical research that can be done to reduce risks, including e.g. interpretability, "sandwiching" style projects, adversarial training, testing particular approaches to theoretical problems like eliciting latent knowledge, the evaluations you linked to, empirically demonstrating issues like deception (as you suggested), and more. Lots of groups are in fact working on such problems, and I'm happy about that.

The specific thing that I said was hard to name is realistic and plausible experiments we could do on today's models that would a) make me update strongly toward "racing forward with plain HFDT will not lead to an AI takeover", and b) that I think people who disagree with my claim would accept as "fair." I gave an example right after that of a type of experiment I don't expect ML people to consider "fair" as a test of this hypothesis. If I saw that ML people could consistently predict the direction in which gradient descent is suboptimal, I would update a lot against this risk.

I think there's a lot more room for empirical progress where you assume that this is a real risk and try to address it than there is for empirical progress that could realistically cause either skeptics or concerned people to update about whether there's any risk at all. A forthcoming post by Holden gets into some of these things.

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-10T04:46:58.230Z · LW · GW

To your point, sure, an H100 simulator will get perfect reward, but the model doesn't see x′, so how would it acquire the ability to simulate H100?

In the worst-case game we're playing, I can simply say "the reporter we get happens to have this ability because that happens to be easier for SGD to find than the direct translation ability."

When living in worst-case land, I often imagine random search across programs rather than SGD. Imagine we were plucking reporters at random from a giant barrel of possible reporters, rejecting any reporter which didn't perform perfectly in whatever training process we set up and keeping the first one that performs perfectly. In that case, if we happened to pluck out a reporter which answered questions by simulating H100, then we'd be screwed because that reporter would perform perfectly in the training process you described.

SGD is not the same as plucking programs out of the air randomly, but when we're playing the worst case game it's on the builder to provide a compelling argument that SGD will definitely not find this particular type of program.

You're pointing at an intuition ("the model is never shown x-prime") but that's not a sufficiently tight argument in the worst-case context -- models (especially powerful/intelligent ones) often generalize to understanding many things they weren't explicitly shown in their training dataset. In fact, we don't show the model exactly how to do direct translation between the nodes in its Bayes net and the nodes in our Bayes net (because we can't even expose those nodes), so we are relying on the direct translator to also have abilities it wasn't explicitly shown in training. The question is just which of those abilities is easier for SGD to build up; the counterexample in this case is "the H100 imitator happens to be easier."

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-08T17:05:22.976Z · LW · GW

The question here is just how it would generalize given that it was trained on H_1, H_2,...H_10. To make arguments about how it would generalize, we ask ourselves what internal procedure it might have actually learned to implement.

Your proposal is that it might learn the procedure "just be honest" because that would perform perfectly on this training distribution. You contrast this against the procedure "just answer however the evaluator you've seen most recently would answer," which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right?

If so, then I'm arguing that it may instead learn the procedure "answer the way an H_100 evaluator would answer." That is, once it has a few experiences of the evaluation level being ratcheted up, it might think to itself "I know where this is going, so let's just jump straight to the best evaluation the humans will be able to muster in the training distribution and then imitate how that evaluation procedure would answer." This would also get perfect loss on the training distribution, because we can't produce data points beyond H_100. And then that thing might still be missing knowledge that the AI has.

To be clear, it's possible that in practice this kind of procedure would cause it to generalize honestly (though I'm somewhat skeptical). But we're in worst-case land, so "jump straight to answering the way a human would" is a valid counterexample to the proposal.

This comment on another proposal gives a more precise description.

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-08T16:58:26.182Z · LW · GW

Yes, that's right. The key thing I'd add to 1) is that ARC believes most kinds of data augmentation (giving the human AI assistance, having the human think longer, giving them other kinds of advantages) are also unlikely to work, so you'd need to do something to "crack open the black box" and penalize ways the reporter is computing its answer. They could still be surprised by data augmentation techniques but they'd hold them to a higher standard.

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-08T06:19:18.776Z · LW · GW

This proposal has some resemblance to turning reflection up to 11. In worst-case land, the counterexample would be a reporter that answers questions by doing inference in whatever Bayes net corresponds to "the world-understanding that the smartest/most knowledgeable human in the world" has; this understanding could still be missing things that the prediction model knows.

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-07T16:31:12.105Z · LW · GW

I see why the approach I mention might have some intrinsic limitations in its ability to elicit latent knowledge though. The problem is that even if it understands roughly that it has incentives to use most of what it knows when we ask him simulating the prediction of someone with its own characteristics (or 1400 IQ), given that with ELK we look for an global maximum (we want that it uses ALL its knowledge), there's always an uncertainty on whether it did understand that point or not for extreme intelligence / examples or whether it tries to fit to the training data as much as possible and thus still doesn't use something it knows.

I think this is roughly right, but to try to be more precise, I'd say the counterexample is this:

  • Consider the Bayes net that represents the upper bound of all the understanding of the world you could extract doing all the tricks described (P vs NP, generalizing from less smart to more smart humans, etc).
  • Imagine that the AI does inference in that Bayes net.
  • However, the predictor's Bayes net (which was created by a different process) still has latent knowledge that this Bayes net lacks.
  • By conjecture, we could not have possibly constructed a training data point that distinguished between doing inference on the upper-bound Bayes net and doing direct translation.
Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-07T16:28:09.983Z · LW · GW

[Paul/Mark can correct me here] I would say no for any small-but-interesting neural network (like small language models); I think like, linear regressions where we've set the features it's kind of a philosophical question (though I'd say yes).

In some sense, ELK as a problem only even starts "applying" to pretty smart models (ones who can talk including about counterfactuals / hypotheticals, as discussed in this appendix.) This is closely related to how alignment as a problem only really starts applying to models smart enough to be thinking about how to pursue a goal.

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-06T23:43:42.406Z · LW · GW

Again trying to answer this one despite not feeling fully solid. I'm not sure about the second proposal and might come back to it, but here's my response to the first proposal (force ontological compatibility):

The counterexample "Gradient descent is more efficient than science" should cover this proposal because it implies that the proposal is uncompetitive. Basically, the best Bayes net for making predictions could just turn out to be the super incomprehensible one found by unrestricted gradient descent, so if you force ontological compatibility then you could just end up with a less-good prediction model and get outcompeted by someone who didn't do that. This might work in practice if the competitiveness hit is not that big and we coordinate around not doing the scarier thing (MIRI's visible thoughts project is going for something like this), but ARC isn't looking for a solution of that form.

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-06T23:38:40.682Z · LW · GW

This broadly seems right. Some details:

  • The "explain why that strategy wouldn't work" step specifically takes the form of "describing a way the world could be where that strategy demonstrably doesn't work" (rather than more heuristic arguments).
  • Once we have a proposal where we try really hard to come up with situations where it could demonstrably fail, and can't think of any, we will probably need to do lots of empirical work to figure out if we can implement it and if it actually works in practice. But we hope that this exercise will teach us a lot about the nature of the empirical work we'll need to do, as well as providing more confidence that the strategy will generalize beyond what we are able to test in practice. (For example, ELK was highlighted as a problem in the first place after ARC researchers thought a lot about possible failure modes of iterated amplification.)
Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-06T23:32:54.153Z · LW · GW

Warning: this is not a part of the report I'm confident I understand all that well; I'm trying anyway and Paul/Mark can correct me if I messed something up here.

I think the idea here is like:

  • We assume there's some actual true correspondence between the AI Bayes net and the human Bayes net (because they're describing the same underlying reality that has diamonds and chairs and tables in it).
  • That means that if we have one of the Bayes nets, and the true correspondence, we should be able to use that rederive the other Bayes net. In particular the human Bayes net plus the true correspondence should let us reconstruct the AI Bayes net; false correspondences that just do inference from observations in the human Bayes net wouldn't allow us to do this since they throw away all the intermediate info derived by the AI Bayes net.
  • If you assume that the human Bayes net plus the true correspondence are simpler than the AI Bayes net, then this "compresses" the AI Bayes net because you just wrote down a program that's smaller than the AI Bayes net which "unfolds" into the AI Bayes net.
  • This is why the counterexample in that section focuses on the case where the AI Bayes net was already so simple to describe that there was nothing left to compress, and the human Bayes net + true correspondence had to be larger.
Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-06T23:27:48.660Z · LW · GW

This proposal has some resemblance to turning reflection up to 11, and the key question you raise is the source of the counterexample in the worst case:

That said, I feel like the main problem is to know whether such a model would do well out-of-distribution (i.e on problems no human is able to resolve). I feel like using the approach I suggested, we should able to use the great variations of capacities among humans and algorithms to increase the chances that our algorithm do well when it's much better....I don't know whether asymptotically, I'd expect the algorithm to still be truthful. But it could greatly increase the distribution on which it's truthful.

Because ARC is living in "worst-case" land, they discard a training strategy once they can think of any at-all-plausible situation in which it fails, and move on to trying other strategies. In this case, the counterexample would be a reporter that answers questions by doing inference in whatever Bayes net corresponds to "the world-understanding that the smartest/most knowledgeable human in the world" has; this understanding could still be missing things that the prediction model knows.

This is closely related to the counterexample "Gradient descent is more efficient than science" given in the report.

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-06T18:15:20.503Z · LW · GW

I am confused. Perhaps the above sentence is true in some tautological sense I'm missing. But in the sections of the report listing training strategies and corresponding counterexamples, I wouldn't describe most counterexamples as based on ontology mismatch.

In the report, the first volley of examples and counterexamples are not focused solely on ontology mismatch, but everything after the relevant section is.

So: do some of your training strategies work perfectly in the nice-ontology case, where the model has a concept of "the diamond is in the room"?

ARC is always considering the case where the model does "know" the right answer to whether the diamond is in the room in the sense that it is discussed in the self-contained problem statement appendix here.

The ontology mismatch problem is not referring to the case where the AI "just doesn't have" some concept -- we're always assuming there's some "actually correct / true" translation between the way the AI thinks about the world and the way the human thinks about the world which is sufficient to answer straightforward questions about the physical world like "whether the diamond is in the room," and is pretty easy for the AI to find.

For example, if the AI discovered some new physics and thinks in terms of hyper-strings in a four-dimensional manifold, there is some "true" translation between that and normal objects like "tables / chairs / apples" because the four-dimensional hyper-strings are describing a universe that contains tables / chairs / apples; furthermore, an AI smart enough to derive that complicated physics could pretty easily do that translation -- if given the right incentive -- just as human quantum physicists can translate between the quantum view of the world and the Newtonian view of the world or the folk physics view of the world.

The worry explored in this report is not that the AI won't know how to do the translation; it's instead a question of what our loss functions incentivize. Even if it wouldn't be "that hard" to translate in some absolute sense, with the most obvious loss functions we can come up with it might be simpler / more natural / lower-loss to simply do inference in the human Bayes net.

Comment by Ajeya Cotra (ajeya-cotra) on ARC's first technical report: Eliciting Latent Knowledge · 2022-01-03T16:41:30.032Z · LW · GW

In terms of the relationship to MIRI's visible thoughts project, I'd say the main difference is that ARC is attempting to solve ELK in the worst case (where the way the AI understands the world could be arbitrarily alien from and more sophisticated than the way the human understands the world), whereas the visible thoughts project is attempting to encourage a way of developing AI that makes ELK easier to solve (by encouraging the way the AI thinks to resemble the way humans think). My understanding is MIRI is quite skeptical that a solution to worst-case ELK is possible, which is why they're aiming to do something more like "make it more likely that conditions are such that ELK-like problems can be solved in practice."

Comment by Ajeya Cotra (ajeya-cotra) on ARC's first technical report: Eliciting Latent Knowledge · 2022-01-03T16:00:25.807Z · LW · GW

Thanks Ruby! I'm really glad you found the report accessible.

One clarification: Bayes nets aren't important to ARC's conception of the problem of ELK or its solution, so I don't think it makes sense to contrast ARC's approach against an approach focused on language models or describe it as seeking a solution via Bayes nets.

The form of a solution to ELK will still involve training a machine learning model (which will certainly understand language and could just be a language model) using some loss function. The idea that this model could learn to represent its understanding of the world in the form of inference on some Bayes net is one of a few simple test cases that ARC uses to check whether the loss functions they're designing will always incentivize honestly answering straightforward questions.

For example, another simple test case (not included in the report) is that the model could learn to represent its understanding of the world in a bunch of "sentences" that it performs logical operations on to transform into other sentences.

These test cases are settings for counterexamples, but not crucial to proposed solutions. The idea is that if your loss function will always learn a model that answers straightforward questions honestly, it should work in particular for these simplified cases that are easy to think about.

Comment by Ajeya Cotra (ajeya-cotra) on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-23T19:41:47.775Z · LW · GW

My understanding is that we are eschewing Problem 2, with one caveat -- we still expect to solve the problem if the means by which the diamond was stolen or disappeared could be beyond a human's ability to comprehend, as long as the outcome (that the diamond isn't still in the room) is still comprehensible. For example, if the robber used some complicated novel technology to steal the diamond and hack the camera, there would be many things about the state that the human couldn't understand even if the AI tried to explain it to them (at least without going over our compute budget for training). But nevertheless it would still be an instance of Problem 1 because they could understand the basic notion of "because of some actions involving complicated technology, the diamond is no longer in the room, even though it may look like it is."