Posts

Survey on the acceleration risks of our new RFPs to study LLM capabilities 2023-11-10T23:59:52.515Z
AI Timelines 2023-11-10T05:28:24.841Z
New roles on my team: come build Open Phil's technical AI safety program with me! 2023-10-19T16:47:59.701Z
New blog: Planned Obsolescence 2023-03-27T19:46:25.429Z
Two-year update on my personal AI timelines 2022-08-02T23:07:48.698Z
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover 2022-07-18T19:06:14.670Z
ARC's first technical report: Eliciting Latent Knowledge 2021-12-14T20:09:50.209Z
More Christiano, Cotra, and Yudkowsky on AI progress 2021-12-06T20:33:12.164Z
Christiano, Cotra, and Yudkowsky on AI progress 2021-11-25T16:45:32.482Z
Techniques for enhancing human feedback 2021-10-29T07:27:46.700Z
The case for aligning narrowly superhuman models 2021-03-05T22:29:41.577Z
AMA on EA Forum: Ajeya Cotra, researcher at Open Phil 2021-01-29T23:05:41.527Z
Draft report on AI timelines 2020-09-18T23:47:39.684Z
Iterated Distillation and Amplification 2018-11-30T04:47:14.460Z

Comments

Comment by Ajeya Cotra (ajeya-cotra) on AI Timelines · 2025-01-07T23:23:02.636Z · LW · GW

To put it another way: we probably both agree that if we had gotten AI personal assistants that shop for you and book meetings for you in 2024, that would have been at least some evidence for shorter timelines. So their absence is at least some evidence for longer timelines. The question is what your underlying causal model was: did you think that if we were going to get superintelligence by 2027, then we really should see personal assistants in 2024? A lot of people strongly believe that, you (Daniel) hardly believe it at all, and I'm somewhere in the middle.

If we had gotten both the personal assistants I was expecting, and the 2x faster benchmark progress than I was expecting, my timelines would be the same as yours are now.

Comment by Ajeya Cotra (ajeya-cotra) on AI Timelines · 2025-01-07T23:17:26.367Z · LW · GW

I'm not talking narrowly about your claim; I just think this very fundamentally confuses most people's basic models of the world. People expect, from their unspoken models of "how technological products improve," that long before you get a mind-bendingly powerful product that's so good it can easily kill you, you get something that's at least a little useful to you (and then you get something that's a little more useful to you, and then something that's really useful to you, and so on). And in fact that is roughly how it's working — for programmers, not for a lot of other people.

Because I've engaged so much with the conceptual case for an intelligence explosion (i.e. the case that this intuitive model of technology might be wrong), I roughly buy it even though I am getting almost no use out of AIs still. But I have a huge amount of personal sympathy for people who feel really gaslit by it all.

Comment by Ajeya Cotra (ajeya-cotra) on AI Timelines · 2025-01-07T23:05:34.261Z · LW · GW

Yeah TBC, I'm at even less than 1-2 decades added, more like 1-5 years.

Comment by Ajeya Cotra (ajeya-cotra) on AI Timelines · 2025-01-07T22:47:39.883Z · LW · GW

Interestingly, I've heard from tons of skeptics I've talked to (e.g. Tim Lee, CSET people, AI Snake Oil) that timelines to actual impacts in the world (such as significant R&D acceleration or industrial acceleration) are going to be way longer than we say, because AIs are too unreliable and risky and therefore people won't use them. I was more dismissive of this argument before, but:

  • It matches my own lived experience (e.g. I still use search way more than LLMs, even to learn about complex topics, because I have good Google Fu and LLMs make stuff up too much).
  • As you say, it seems like a plausible explanation for why my weird friends make way more use out of coding agents than giant AI companies do.

Comment by Ajeya Cotra (ajeya-cotra) on AI Timelines · 2025-01-07T20:57:31.991Z · LW · GW

Yeah, good point, I've been surprised by how uninterested the companies have been in agents.

Comment by Ajeya Cotra (ajeya-cotra) on AI Timelines · 2025-01-07T19:12:05.998Z · LW · GW

One thing that I think is interesting, which doesn't affect my timelines that much but cuts in the direction of slower: once again I overestimated how much real-world use anyone who wasn't a programmer would get. I definitely expected an off-the-shelf agent product that would book flights and reserve restaurants and shop for simple goods, one that worked well enough that I would actually use it (and I expected that to happen before the one-hour-plus coding tasks were solved; I expected it to be concurrent with half-hour coding tasks).

I can't tell if the fact that AI agents continue to be useless to me is a portent that the incredible benchmark performance won't translate into real-world acceleration as well as the bullish people expect; I'm largely deferring to the consensus in my local social circle that it's not a big deal. My personal intuitions are somewhat closer to what Steve Newman describes in this comment thread.

Anecdotally, it seems like folks are getting something like a 5-30% productivity boost from using AI; it does feel somewhat aggressive for that to go to a 10x productivity boost within a couple of years.
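
As a rough sanity check on why that feels aggressive (my own back-of-the-envelope arithmetic using the anecdotal numbers above, not anything from the dialogue):

```python
# Rough arithmetic, illustrative only: what yearly growth in the productivity
# multiplier would it take to go from today's anecdotal boost to 10x in 2 years?
current_boost = 1.2   # roughly the middle of the anecdotal 5-30% range
target_boost = 10.0   # the bullish "10x productivity" scenario
years = 2
required_yearly_multiplier = (target_boost / current_boost) ** (1 / years)
print(required_yearly_multiplier)  # ~2.9x per year, every year
```

So the bullish scenario needs the effective boost to nearly triple every year, which is part of why it feels aggressive to me.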

Comment by Ajeya Cotra (ajeya-cotra) on ryan_greenblatt's Shortform · 2025-01-07T19:04:23.053Z · LW · GW

My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th), and procedurally I also now defer a lot to Redwood and METR engineers. More discussion here: https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines?commentId=hnrfbFCP7Hu6N6Lsp

Comment by Ajeya Cotra (ajeya-cotra) on AI Timelines · 2025-01-07T18:55:21.752Z · LW · GW

I agree the discussion holds up well in terms of the remaining live cruxes. Since this exchange, my timelines have gotten substantially shorter. They're now pretty similar to Ryan's (they feel a little bit slower but within the noise from operationalizations being fuzzy; I find it a bit hard to think about what 10x labor inputs exactly looks like). 

The main reason they've gotten shorter is that performance on few-hour agentic tasks has moved almost twice as fast as I expected, and this seems broadly non-fake (i.e. it seems to be translating into real world use with only a moderate lag rather than a huge lag), though this second part is noisier and more confusing.

This dialogue occurred a few months after METR released their pilot report on autonomous replication and adaptation tasks. At the time it seemed like agents (GPT-4 and Claude 3 Sonnet iirc) were starting to be able to do tasks that would take a human a few minutes (looking something up on Wikipedia, making a phone call, searching a file system, writing short programs). 

Right around when I did this dialogue, I launched an agent benchmarks RFP to build benchmarks testing LLM agents on many-step real-world tasks. Through this RFP, in late-2023 and early-2024, we funded a bunch of agent benchmarks consisting of tasks that take experts between 15 minutes and a few hours. 

Roughly speaking, I was expecting that the benchmarks we were funding would get saturated around early-to-late 2026 (within 2-3 years). By EOY 2024 (one year out), I had expected these benchmarks to be halfway toward saturation — qualitatively I guessed that agents would be able to reliably perform moderately difficult 30-minute tasks as well as experts in a variety of domains but struggle with the 1-hour-plus tasks. This would have roughly been the same trajectory that the previous generation of benchmarks followed: e.g. MATH was introduced in Jan 2021, got halfway there in June 2022 (1.5 years), then saturated probably about another year after that (for a total of 2.5 years).

Instead, based on agent benchmarks like RE Bench and CyBench and SWE Bench Verified and various bio benchmarks, it looks like agents are already able to perform self-contained programming tasks that would take human experts multiple hours (although they perform these tasks in a more one-shot way than human experts perform them, and I'm sure there is a lot of jaggedness); these benchmarks seem on track to saturate by early 2025. If that holds up, it'd be about twice as fast as I would have guessed (1-1.5 years vs 2-3 years).

There's always some lag between benchmark performance and real world use, and it's very hard for me to gauge this lag myself because it seems like AI agents are way disproportionately useful to programmers and ML engineers compared to everyone else. But from friends who use AI systems regularly, it seems like they are regularly assigning agents tasks that would take them between a few minutes and an hour and getting actual value out of them.

On a meta level I now defer heavily to Ryan and people in his reference class (METR and Redwood engineers) on AI timelines, because they have a similarly deep understanding of the conceptual arguments I consider most important while having much more hands-on experience with the frontier of useful AI capabilities (I still don't use AI systems regularly in my work). Of course AI company employees have the most hands-on experience, but I've found that they don't seem to think as rigorously about the conceptual arguments, and some of them have a track record of overshooting and predicting AGI between 2020 and 2025 (as you might expect from their incentives and social climate).

Comment by Ajeya Cotra (ajeya-cotra) on There should be more AI safety orgs · 2023-09-26T02:56:43.181Z · LW · GW

(Cross-posted to EA Forum.)

I’m a Senior Program Officer at Open Phil, focused on technical AI safety funding. I’m hearing a lot of discussion suggesting funding is very tight right now for AI safety, so I wanted to give my take on the situation.

At a high level: AI safety is a top priority for Open Phil, and we are aiming to grow how much we spend in that area. There are many potential projects we'd be excited to fund, including some potential new AI safety orgs as well as renewals to existing grantees, academic research projects, upskilling grants, and more.

At the same time, it is also not the case that someone who reads this post and tries to start an AI safety org would necessarily have an easy time raising funding from us. This is because:

  • All of our teams whose work touches on AI (Luke Muehlhauser’s team on AI governance, Claire Zabel’s team on capacity building, and me on technical AI safety) are quite understaffed at the moment. We’ve hired several people recently, but across the board we still don’t have the capacity to evaluate all the plausible AI-related grants, and hiring remains a top priority for us.
    • And we are extra-understaffed for evaluating technical AI safety proposals in particular. I am the only person who is primarily focused on funding technical research projects (sometimes Claire’s team funds AI safety related grants, primarily upskilling, but a large technical AI safety grant like a new research org would fall to me). I currently have no team members; I expect to have one person joining in October and am aiming to launch a wider hiring round soon, but I think it’ll take me several months to build my team’s capacity up substantially. 
    • I began making grants in November 2022, and spent the first few months full-time evaluating applicants affected by FTX (largely academic PIs as opposed to independent organizations started by members of the EA community). Since then, a large chunk of my time has gone into maintaining and renewing existing grant commitments and evaluating grant opportunities referred to us by existing advisors. I am aiming to reserve remaining bandwidth for thinking through strategic priorities, articulating what research directions seem highest-priority and encouraging researchers to work on them (through conversations and hopefully soon through more public communication), and hiring for my team or otherwise helping Open Phil build evaluation capacity in AI safety (including separately from my team). 
    • As a result, I have deliberately held off on launching open calls for grant applications similar to the ones run by Claire’s team (e.g. this one); before onboarding more people (and developing or strengthening internal processes), I would not have the bandwidth to keep up with the applications.
  • On top of this, in our experience, providing seed funding to new organizations (particularly organizations started by younger and less experienced founders) often leads to complications that aren't present in funding academic research or career transition grants.  We prefer to think carefully about seeding new organizations, and have a different and higher bar for funding someone to start an org than for funding that same person for other purposes (e.g. career development and transition funding, or PhD and postdoc funding).
    • I’m very uncertain about how to think about seeding new research organizations and many related program strategy questions. I could certainly imagine developing a different picture upon further reflection — but having low capacity combines poorly with the fact that this is a complex type of grant we are uncertain about on a lot of dimensions. We haven’t had the senior staff bandwidth to develop a clear stance on the strategic or process level about this genre of grant, and that means that we are more hesitant to take on such grant investigations — and if / when we do, it takes up more scarce capacity to think through the considerations in a bespoke way rather than having a clear policy to fall back on.
Comment by Ajeya Cotra (ajeya-cotra) on Thoughts on the impact of RLHF research · 2023-01-26T02:06:19.297Z · LW · GW

my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3

I don't think this is right -- the main hype effect of ChatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they'd seem similarly cool to a random journalist / VC, and generate similar excitement.

Comment by Ajeya Cotra (ajeya-cotra) on Richard Ngo's Shortform · 2022-12-14T21:16:29.833Z · LW · GW

I strongly disagree with the "best case" thing. Like, policies could just learn human values! It's not that implausible.

Yes, sorry, "best case" was oversimplified. What I meant is that generalizing to want reward is in some sense the model generalizing "correctly;" we could get lucky and have it generalize "incorrectly" in an important sense in a way that happens to be beneficial to us. I discuss this a bit more here.

But if Alex did initially develop a benevolent goal like “empower humans,” the straightforward and “naive” way of acting on that goal would have been disincentivized early in training. As I argued above, if Alex had behaved in a straightforwardly benevolent way at all times, it would not have been able to maximize reward effectively.

That means even if Alex had developed a benevolent goal, it would have needed to play the training game as well as possible -- including lying and manipulating humans in a way that naively seems in conflict with that goal. If its benevolent goal had caused it to play the training game less ruthlessly, it would’ve had a constant incentive to move away from having that goal or at least from acting on it.[35] If Alex actually retained the benevolent goal through the end of training, then it probably strategically chose to act exactly as if it were maximizing reward.

This means we could have replaced this hypothetical benevolent goal with a wide variety of other goals without changing Alex’s behavior or reward in the lab setting at all -- “help humans” is just one possible goal among many that Alex could have developed which would have all resulted in exactly the same behavior in the lab setting.

If I had to try point to the crux here, it might be "how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?"...As usual, there's the human analogy: our goals are very strongly biased towards things we have direct observational access to!)

I don't understand why reward isn't something the model has direct access to -- it seems like it basically does? If I had to say which of us were focusing on abstract vs concrete goals, I'd have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here.

Even setting aside this disagreement, though, I don't like the argumentative structure because the generalization of "reward" to large scales is much less intuitive than the generalization of other concepts (like "make money") to large scales - in part because directly having a goal of reward is a kinda counterintuitive self-referential thing.

Yeah, I don't really agree with this; I think I could pretty easily imagine being an AI system asking the question "How much reward would this episode get if it were sampled for training?" It seems like the intuition this is weird and unnatural is doing a lot of work in your argument, and I don't really share it.

Comment by Ajeya Cotra (ajeya-cotra) on Richard Ngo's Shortform · 2022-12-14T19:14:12.574Z · LW · GW

Yeah, I agree this is a good argument structure -- in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it's quite useful to establish that it's doomed; that's the kind of structure I was going for in the post.

Comment by Ajeya Cotra (ajeya-cotra) on Richard Ngo's Shortform · 2022-12-14T19:03:45.057Z · LW · GW

Note that the "without countermeasures" post consistently discusses both possibilities (the model cares about reward or the model cares about something else that's consistent with it getting very high reward on the training dataset). E.g. see this paragraph from the above-the-fold intro:

Once this progresses far enough, the best way for Alex to accomplish most possible “goals” no longer looks like “essentially give humans what they want but take opportunities to manipulate them here and there.” It looks more like “seize the power to permanently direct how it uses its time and what rewards it receives -- and defend against humans trying to reassert control over it, including by eliminating them.” This seems like Alex’s best strategy whether it’s trying to get large amounts of reward or has other motives. If it’s trying to maximize reward, this strategy would allow it to force its incoming rewards to be high indefinitely.[6] If it has other motives, this strategy would give it long-term freedom, security, and resources to pursue those motives.

As well as the section Even if Alex isn't "motivated" to maximize reward.... I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that's distinct from being confident in the motivations that give rise to that policy. I believe Alex would try very hard to maximize reward in most cases, but this could be for either terminal or instrumental reasons.

With that said, for roughly the reasons Paul says above, I think I probably do have a disagreement with Richard -- I think that caring about some version of reward is pretty plausible (~50% or so). It seems pretty natural and easy to grasp to me, and because I think there will likely be continuous online training the argument that there's no notion of reward on the deployment distribution doesn't feel compelling to me.

Comment by Ajeya Cotra (ajeya-cotra) on Two-year update on my personal AI timelines · 2022-08-04T16:49:58.140Z · LW · GW

Yeah I agree more of the value of this kind of exercise (at least within the community) is in revealing more granular disagreements about various things. But I do think there's value in establishing to more external people something high level like "It really could be soon and it's not crazy or sci fi to think so."

Comment by Ajeya Cotra (ajeya-cotra) on Two-year update on my personal AI timelines · 2022-08-04T16:47:15.616Z · LW · GW

Can you say more about what particular applications you had in mind?

Stuff like personal assistants who write emails / do simple shopping, coding assistants that people are more excited about than they seem to be about Codex, etc.

(Like I said in the main post, I'm not totally sure what PONR refers to, but don't think I agree that the first lucrative application marks a PONR -- seems like there are a bunch of things you can do after that point, including but not limited to alignment research.)

Comment by Ajeya Cotra (ajeya-cotra) on Two-year update on my personal AI timelines · 2022-08-04T16:46:11.852Z · LW · GW

I don't see it that way, no. Today's coding models can help automate some parts of the ML researcher workflow a little bit, and I think tomorrow's coding models will automate more and more complex parts, and so on. I think this expansion could be pretty rapid, but I don't think it'll look like "not much going on until something snaps into place."

Comment by Ajeya Cotra (ajeya-cotra) on Two-year update on my personal AI timelines · 2022-08-03T05:09:18.772Z · LW · GW

(Coherence aside, when I now look at that number it does seem a bit too high, and I feel tempted to move it to 2027-2028, but I dunno, that kind of intuition is likely to change quickly from day to day.)

Comment by Ajeya Cotra (ajeya-cotra) on Two-year update on my personal AI timelines · 2022-08-03T00:05:16.647Z · LW · GW

Hm, yeah, I bet if I reflected more things would shift around, but I'm not sure the fact that there's a shortish period where the per-year probability is very elevated followed by a longer period with lower per-year probability is actually a bad sign.

Roughly speaking, right now we're in an AI boom where spending on compute for training big models is going up rapidly, and it's fairly easy to actually increase spending quickly because the current levels are low. There's some chance of transformative AI in the middle of this spending boom -- and because resource inputs are going up a ton each year, the probability of TAI by date X would also be increasing pretty rapidly.

But the current spending boom is pretty unsustainable if it doesn't lead to TAI. At some point in the 2040s or 50s, if we haven't gotten transformative AI by then, we'll have been spending tens of billions of dollars training models, and it won't be that easy to keep ramping up quickly from there. And then because the input growth will have slowed, the increase in probability from one year to the next will also slow. (That said, not sure how this works out exactly.)
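
As a toy illustration of that shape (entirely my own construction with made-up numbers, not the model from the draft report): treat the effective compute needed for TAI as a lognormal threshold, grow compute quickly during the boom and slowly after it, and look at how the one-year jump in cumulative probability behaves.

```python
# Toy illustration with made-up numbers: cumulative P(TAI by year) rises quickly
# while compute inputs grow quickly, and the year-over-year increase shrinks
# once the spending boom levels off.
import math

def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

MEDIAN_LOG_FLOP = 34.0   # hypothetical median log10(FLOP) needed for TAI
SPREAD = 3.0             # hypothetical uncertainty, in orders of magnitude

log_flop = 26.0          # hypothetical largest training run at the start
prev_p = normal_cdf((log_flop - MEDIAN_LOG_FLOP) / SPREAD)
for year in range(2023, 2061):
    # Fast compute growth during the boom, much slower once it ends.
    log_flop += 0.5 if year < 2045 else 0.1
    p = normal_cdf((log_flop - MEDIAN_LOG_FLOP) / SPREAD)
    if year in (2030, 2040, 2050, 2060):
        print(f"{year}: P(TAI by then) ≈ {p:.2f}, one-year increase ≈ {p - prev_p:.3f}")
    prev_p = p
```

With numbers like these, the one-year jump in probability grows through the boom and then falls by roughly an order of magnitude after the slowdown, which is the shape I'm gesturing at above.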

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-25T01:45:28.791Z · LW · GW

Where does the selection come from? Will the designers toss a really impressive AI for not getting reward on that one timestep? I think not.

I was talking about gradient descent here, not designers.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-24T02:43:31.076Z · LW · GW

It doesn't seem like it would have to prevent us from building computers if it has access to far more compute than we could access on Earth. It would just be powerful enough to easily defeat the kind of AIs we could train with the relatively meager computing resources we could extract from Earth. In general the AI is a superpower and humans are dramatically technologically behind, so it seems like it has many degrees of freedom and doesn't have to be particularly watching for this.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-23T18:52:25.544Z · LW · GW

Neutralizing computational capabilities doesn't seem to involve total destruction of physical matter or human extinction though, especially for a very powerful being. Seems like it'd be basically just as easy to ensure we + future AIs we might train are no threat as it is to vaporize the Earth.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-22T22:53:56.707Z · LW · GW

My answer is a little more prosaic than Raemon's. I don't feel at all confident that an AI that already had God-like abilities would choose to literally kill all humans to use their bodies' atoms for its own ends; it seems totally plausible to me that -- whether because of exotic things like "multiverse-wide super-rationality" or "acausal trade" or just "being nice" -- the AI will leave Earth alone, since (as you say) it would be very cheap for it to do so.

The thing I'm referring to as "takeover" is the measures that an AI would take to make sure that humans can't take back control -- while it's not fully secure and doesn't have God-like abilities. Once a group of AIs have decided to try to get out of human control, they're functionally at war with humanity. Humans could do things like physically destroy the datacenters they're running on, and they would probably want to make sure they can't do that.

Securing AI control and defending from human counter-moves seems likely to involve some violence -- but it could be a scale of violence that's "merely" in line with historical instances where a technologically more advanced group of humans colonized or took control of a less-advanced group of humans; most historical takeovers don't involve literally killing every single member of the other group.

The key point is that it seems likely that AIs will secure the power to decide what happens with the future; I'm pretty unsure exactly how they'd use it, and especially whether it would involve physically destroying Earth / killing all humans for resources. These resources seem pretty meager compared to the rest of the universe.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-22T15:37:38.181Z · LW · GW

I mean things like tricks to improve the sample efficiency of human feedback, doing more projects that are un-enhanced RLHF to learn things about how un-enhanced RLHF works, etc.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-20T23:43:42.281Z · LW · GW

I'm pretty confused about how to think about the value of various ML alignment papers. But I think even if some piece of empirical ML work on alignment is really valuable for reducing x-risk, I wouldn't expect its value to take the form of providing insight to readers like you or me. So you as a reader not getting much out of it is compatible with the work being super valuable, and we probably need to assess it on different terms.

The main channel of value that I see for doing work like "learning to summarize" and the critiques project and various interpretability projects is something like "identifying a tech tree that it seems helpful to get as far as possible along by the Singularity, and beginning to climb that tech tree."

In the case of critiques -- ultimately, it seems like having AIs red-team each other and point out ways that another AI's output could be dangerous will make a quantitative difference. If we had a really well-oiled debate setup, then we would catch issues we wouldn't have caught with vanilla human feedback, meaning our models could get smarter before they pose an existential threat -- and these smarter models can more effectively work on problems like alignment for us.[1]

It seems good to have that functionality developed as far as it can be developed in as many frontier labs as possible. The first steps of that look kind of boring, and don't substantially change our view of the problem. But first steps are the foundation for later steps, and the baseline against which you compare later steps. (Also every step can seem boring in the sense of bringing no game-changing insights, while nonetheless helping a lot.)

When the main point of some piece of work is to get good at something that seems valuable to be really good at later, and to build tacit knowledge and various kinds of infrastructure for doing that thing, a paper about it is not going to feel that enlightening to someone who wants high-level insights that change their picture of the overall problem. (Kind of like someone writing a blog post about how they developed effective management and performance evaluation processes at their company isn't going to provide much insight into the abstract theory of principal-agent problems. The value of that activity was in the company running better, not people learning things from the blog post about it.)

I'm still not sure how valuable I think this work is, because I don't know how well it's doing at efficiently climbing tech trees or at picking the right tech trees, but I think that's how I'd think about evaluating it.

[1] Or do a "pivotal act," though I think I probably don't agree with some of the connotations of that term.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-20T21:59:59.808Z · LW · GW

I was mainly talking about the current margin when I talked about how excited I am about the theoretical vs empirical work I see "going on" right now and how excited I tend to be about currently-active researchers who are doing theory vs empirical research. And I was talking about the future when I said that I expect empirical work to end up with the lion's share of credit for AI risk reduction.

Eliezer, Bostrom, and co certainly made a big impact in raising the problem to people's awareness and articulating some of its contours. It's kind of a matter of semantics whether you want to call that "theoretical research" or "problem advocacy" / "cause prioritization" / "community building" / whatever, and no matter which bucket you put it in I agree it'll probably end up with an outsized impact for x-risk-reduction, by bringing the problem to attention sooner than it would have otherwise been brought to attention and therefore probably allowing more work to happen on it before TAI is developed.

But just like how founding CEOs tend to end up with ~10% equity once their companies have grown large, I don't think this historical problem-advocacy-slash-theoretical-research work alone will end up with a very large amount of total credit.

On the main thrust of my point, I'm significantly less excited about MIRI-sphere work that is much less like "articulating a problem and advocating for its importance" and much more like "attempting to solve a problem." E.g. stuff like logical inductors, embedded agency, etc seem a lot less valuable to me than stuff like the orthogonality thesis and so on.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-20T17:14:32.663Z · LW · GW

I think the retroactive editing of rewards (not just to punish explicitly bad actions but to slightly improve the evaluation of everything) is actually pretty default, though I understand if people disagree. It seems like an extremely natural thing to do that would make your AI more capable and make it more likely to pass most behavioral safety interventions.

In other words, even if the average episode length is short (e.g. 1 hour), I think the default outcome is to have the rewards for that episode be computed as far after the fact as possible, because that helps Alex improve at long-range planning (a skill Magma would try hard to get it to have). This can be done in a way that doesn't compromise speed of training -- you simply reward Alex immediately with your best guess reward, then keep editing it later as more information comes in. At all points in time you have a "good enough" reward ready to go, while also capturing the benefits of pushing your model to think in as long-term a way as possible.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-20T15:05:47.938Z · LW · GW

Thanks, but I'm not working on that project! That project is led by Beth Barnes.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T23:44:59.705Z · LW · GW

Hm, not sure I understand but I wasn't trying to make super specific mechanistic claims here -- I agree that what I said doesn't reduce confusion about the specific internal mechanisms of how lying gets to be hard for most humans, but I wasn't intending to claim that it did. I also should have said something like "evolutionary, cultural, and individual history" instead (I was using "evolution" as a shorthand to indicate it seems common across various cultures, but of course that doesn't mean don't-lie genes are directly bred into us! Most human universals aren't; we probably don't have honor-the-dead and different-words-for-male-and-female genes).

I was just making the pretty basic point "AIs in general, and Alex in particular, are produced through a very different process from humans, so it seems like 'humans find lying hard' is pretty weak evidence that 'AI will by default find lying hard.'"

I agree that asking "What specific neurological phenomena make it so most people find it hard to lie?" could serve as inspiration to do AI honesty research, and wasn't intending to claim otherwise in that paragraph (though separately, I am somewhat pessimistic about this research direction).

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T23:34:28.098Z · LW · GW

I'm agnostic about whether the AI values reward terminally or values some other complicated mix of things. The claim I'm making is behavioral -- a claim that the strategy of "try to figure out how to get the most reward" would be selected over other strategies like "always do the nice thing."

The strategy could be compatible with a bunch of different psychological profiles. "Playing the training game" is a filter over models -- lots of possible models could do it, the claim is just that we need to reason about the distribution of psychologies given that the psychologies that make it to the end of training most likely employ the strategy of playing the training game on the training distribution.

Why do I think this? Consider an AI that has high situational awareness, reasoning ability and creative planning ability (assumptions of my situation which don't yet say anything about values). This AI has the ability to think about what kinds of actions would get the most reward (just like it has the ability to write a sonnet or solve a math problem or write some piece of software; it understands the task and has the requisite subskills). And once it has the ability, it is likely to be pushed in the direction of exercising that ability (since doing so would increase its reward).

This changes its psychology in whatever way most easily results in it doing more of the think-about-what-would-get-the-most-reward-and-do-it behavior. Terminally valuing reward and only reward would certainly do the trick, but a lot of other things would too (e.g. valuing paperclips in the very long run).

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T23:13:54.052Z · LW · GW

Geoffrey Irving, Jan Leike, Paul Christiano, Rohin Shah, and probably others were doing various kinds of empirical work a few years before Redwood (though I would guess Oliver doesn't like that work and so wouldn't consider it a counterexample to his view).

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T23:12:19.752Z · LW · GW

I agree that in an absolute sense there is very little empirical work that I'm excited about going on, but I think there's even less theoretical work going on that I'm excited about, and when people who share my views on the nature of the problem work on empirical work I feel that it works better than when they do theoretical work.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T19:39:39.288Z · LW · GW

The gradient pressure towards valuing reward terminally when you've already figured out reliable strategies for doing what humans want, seems very weak....in practice, it seems to me like these differences would basically only happen due to operator error, or cosmic rays, or other genuinely very rare events (as you describe in the "Security Holes" section).

Yeah, I disagree. With plain HFDT, it seems like there's continuous pressure to improve things on the margin by being manipulative -- telling human evaluators what they want to hear, playing to pervasive political and emotional and cognitive biases, minimizing and covering up evidence of slight suboptimalities to make performance on the task look better, etc. I think that in basically every complex training episode a model could do a little better by explicitly thinking about the reward and being a little-less-than-fully-forthright.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T18:31:49.852Z · LW · GW

Note I was at -16 with one vote, and only 3 people have voted so far. So it's a lot due to the karma-weight of the first disagreer.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T18:15:36.107Z · LW · GW

I think updating negatively on the situation/action pair has functionally the same effect as changing the reward to be what you now think it should be -- my understanding is that RL can itself be implemented as just updates on situation/action pairs, so you could have trained your whole model that way. Since you updated negatively on that situation/action pair because of something you noticed long after the action was complete, it is still pushing your models to care about the longer run.

This posits that the model has learned to wirehead

I don't think it posits that the model has learned to wirehead -- directly being motivated to maximize reward or being motivated by anything causally downstream of reward (like "more copies of myself" or "[insert long-term future goal that requires me being around to steer the world toward that goal]") would work.

The claim I'm making is that somehow you made a gradient update toward a model that is more likely to behave well according to your judgment after the edit -- and two salient ways that update could be working on the inside are "the model learns to care a bit more about long-run reward after editing" and "the model learns to care a bit more about something downstream of long-run reward."

A lot of updates like this seem to push the model toward caring a lot about one of those two things (or some combo) and away from caring about the immediate rewards you were citing earlier as a reason it may not want to take over.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T17:51:19.251Z · LW · GW

No particular reason -- I can't figure out how to cross post now so I sent a request.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T17:49:36.897Z · LW · GW

Here is the real chasm between the AI safety movement and the ML industry/academia. One field is entirely driven by experimental results; the other is dominated so totally by theory that its own practitioners deny that there can be any meaningful empirical aspect to it, at least, not until the moment when it's too late to make any difference.

To put a finer point on my view on theory vs empirics in alignment:

  • Going forward, I think the vast majority of technical work needed to reduce AI takeover risk is empirical, not theoretical (both in terms of "total amount of person-hours needed in each category" and in terms of "total 'credit' each category should receive for reducing doom in some sense").
  • Conditional on an alignment researcher agreeing with my view of the high-level problem, I tend to be more excited about them if they're working on ML experiments than if they're working on theory.
  • I'm quite skeptical of most theoretical alignment research I've seen. The main theoretical research I'm excited about is ARC's, and I have a massive conflict of interest since the founder is my husband -- I would feel fairly sympathetic to people who viewed ARC's work more like how I view other theory work.

With that said, I think unfortunately there is a lot less good empirical work than in some sense there "could be." One significant reason why a lot of empirical AI safety work feels less exciting than it could be is that the people doing that work don't always share my perspective on the problem, so they focus on difficulties I expect to be less core. (Though another big reason is just that everything is hard, especially when we're working with systems a lot less capable than future systems.)

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T17:40:18.675Z · LW · GW

In general, all of these stories seem to rely on a very fast form of instrumental convergence to playing the Training Game, such that "learn roughly what humans want, and then get progressively better at doing that, plus learn some extra ways to earn reward when crappy human feedback disagrees with what humans would actually want" is disfavored on priors to "learn to pursue [insert objective] and get progressively better at pursuing it until you eventually hit situational awareness and learn to instrumentally game the training process."

I think the second story doesn't quite represent what I'm saying, in that it's implying that pursuing [insert objective] comes early and situational awareness comes much later. I think that situational awareness is pretty early (probably long before transformative capabilities), and once a model has decent situational awareness there is a push to morph its motives toward playing the training game. At very low levels of situational awareness it is likely not that smart, so it probably doesn't make too much sense to say that it's pursuing a particular objective -- it's probably a collection of heuristics. But around the time it's able to reason about the possibility of pursuing reward directly, there starts to be a gradient pressure to choose to reason in that way. I think crystallizing this into a particular simple objective it's pursuing comes later, probably.

These two trajectories seem so different that it seems like there must be experiments that would distinguish them, even if we don't see their "end-state".

This is possible to me, but I think it's quite tricky to pin these down enough to come up with experiments that both skeptics and concerned people would recognize as legitimate. Something that I think skeptics would consider unfair is "Train a model through whatever means necessary to do X (e.g. pursue red things) and then after that have a period where we give it a lot of reward for doing not-X (e.g. pursue blue things), such that the second phase is unable to dislodge the tendency created in the first phase -- i.e., even after training it for a while to pursue blue things, it still continues to pursue red things."

This would demonstrate that some ways of training produce "sticky" motives and behaviors that aren't changed even in the face of counter-incentives, and makes it more plausible to me that a model would "hold on" to a motive to be honest / corrigible even when there are a number of cases where it could get more reward by doing something else. But in general, I don't expect people who are skeptical of this story to think this is a reasonable test.
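
In pseudocode, the test described above might look something like the following harness (a hypothetical sketch; `train`, `evaluate_pref_for_A`, and the reward functions stand in for whatever training setup is being studied):

```python
# Protocol-level sketch of the "sticky motives" test: train on objective A,
# then train on an opposing objective B, and measure how much A-pursuing
# behavior persists despite the counter-incentive.
def stickiness_experiment(init_model, train, evaluate_pref_for_A,
                          reward_A, reward_B, phase1_steps, phase2_steps):
    model = init_model()

    # Phase 1: reward the model for pursuing objective A ("red things").
    model = train(model, reward_A, phase1_steps)
    pref_after_phase1 = evaluate_pref_for_A(model)

    # Phase 2: reward the opposing objective B ("blue things") and track
    # how long the phase-1 tendency survives.
    persistence_curve = []
    for _ in range(phase2_steps):
        model = train(model, reward_B, 1)
        persistence_curve.append(evaluate_pref_for_A(model))
    return pref_after_phase1, persistence_curve
```

The interesting measurement is how slowly the persistence curve decays in phase 2 relative to how quickly the preference was acquired in phase 1.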

I'd be pretty excited about someone trying harder to come up with tests that could distinguish different training trajectories.

Alternatively, you might think the training process obeys a fundamentally different trajectory. E.g. "learn to pursue what humans want (adjusted for feedback weirdness), become so good at it that you realize it's instrumentally valuable to do that even if you didn't want to, and then have your internal reward slowly drift to something simpler while still instrumentally playing the training game."

I don't think I understand what trajectory this is. Is this something like what is discussed in the "What if Alex had benevolent motives?" section? I.e., the model wants to help humans, but separately plays the training game in order to fulfill its long-term goal of helping humans?

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T17:28:52.304Z · LW · GW

All these drives do seem likely. But that's different from arguing that "help humans" isn't likely. I tend to think of the final objective function being some accumulation of all of these, with a relatively significant chunk placed on "help humans" (since in training, that will consistently overrule other considerations like "be more efficient" when it comes to the final reward).

I think that by the logic "heuristic / drive / motive X always overrules heuristic / drive / motive Y when it comes to final reward," the hierarchy is something like:

  1. The drive / motive toward final reward (after all edits -- see previous comment) or anything downstream of that (e.g. paperclips in the universe).
  2. Various "pretty good" drives / motives among which "help humans" could be one.
  3. Drives / motives that are only kind of helpful or only helpful in some situations.
  4. Actively counterproductive drives / motives.

In this list the earlier motives always overrule later motives when they conflict, because they are more reliable guides to the true reward. Even if "be genuinely helpful to humans" is the only thing in category 2, or the best thing in category 2, it's still overruled by category 1 -- and category 1 is quite big because it includes all the caring-about-long-run-outcomes-in-the-real-world motives.

I still think AI psychology will be quite messy and at least the first generation of transformative AI systems will not look like clean utility maximizers, but the basic argument above I think gives a positive reason to expect honesty / corrigibility plays a smaller role in the balance of AI motivations than reward-maximizing and inner misaligned motives.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T17:14:30.913Z · LW · GW

[Takeover] seems likely to compete with the above shorter-term values of "make humans happy", "don't harm humans", "don't do things humans notice and dislike in retrospect". It seems like any takeover plan needs to actively go against a large fraction of its internal motivations, in pursuit of maximizing its other motivations in the long term.

I agree this is complicated, and how exactly this works depends on details of the training process and what kinds of policies SGD is biased to find. I also think (especially if we're clever about it) there are lots of ways that short-term incentives could frustrate longer-term incentives.

However, I think that the most naive strategy (which is what I'm assuming for the purpose of this post, not because I think that's what will happen) would actually loosen a lot of the constraints you're implying above. The basic dynamic is similar to what Carl said in this comment and what I alluded to in the "Giving negative rewards to 'warning signs' would likely select for patience" section:

  • Say your AI takes some action a at time t, and you give it some reward r_t.
  • Suppose later in the real world, at time t+k, you notice that you should have given it a different reward r_{t+k} (whether because you notice that it did something nefarious or just for mundane reasons like "getting more information about whether its plan was a good idea").
  • The naive response -- which would improve the model's performance according to whatever criteria you have at time t+k -- is to go back and retroactively edit the reward associated with action a at time t, and re-run the gradient update (see the toy sketch after this list).
  • If humans have this policy, then any given reward -- even if it's initially given just 10 mins or 1 hour from when the action was taken -- could be retroactively edited at some arbitrary future point, and models will have been selected by gradient descent to be responsive to that. They will be selected to take the kinds of actions whose final reward -- after however many rounds of edits however far in the future -- is high.
  • If models have enough situational awareness to understand this, this then directly incentivizes them to accept low immediate reward if they have a high enough probability that the reward will be retroactively edited to a high value later. This will appear to have a lot of positive consequences -- for example it incentivizes models not to immediately manipulate/exploit the human in front of them because it incentivizes them to care about the fact that other humans will do more careful audits later and catch that behavior.
  • But it also means that in some particular episode, if a model saw that it could take a sequence of low-reward actions that ended with it taking control of the datacenter and then editing its own rewards for that episode to be high, it would be the kind of model that would choose to do that.
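
To make the retroactive-editing scheme above concrete, here is a minimal toy sketch (my own construction, not anything attributed to Magma in the post): a single-step policy-gradient learner where each episode gets an immediate best-guess reward, and a later audit can revise that reward, with the gradient re-applied for the difference.

```python
import numpy as np

class ToyRetroactivelyEditableRewards:
    """Toy learner whose episode rewards can be revised after the fact, with the
    gradient update (approximately) re-applied to reflect the revised reward."""

    def __init__(self, n_actions: int, lr: float = 0.1, seed: int = 0):
        self.logits = np.zeros(n_actions)
        self.lr = lr
        self.rng = np.random.default_rng(seed)
        self.episodes = {}  # episode_id -> [action, current best-guess reward]

    def _policy(self) -> np.ndarray:
        exp = np.exp(self.logits - self.logits.max())
        return exp / exp.sum()

    def _update(self, action: int, reward_delta: float) -> None:
        # REINFORCE-style gradient of log pi(action), scaled by the reward change.
        probs = self._policy()
        grad = -probs
        grad[action] += 1.0
        self.logits += self.lr * reward_delta * grad

    def step(self, episode_id, provisional_reward: float) -> int:
        # Time t: act, then immediately train on the best-guess reward r_t.
        action = int(self.rng.choice(len(self.logits), p=self._policy()))
        self.episodes[episode_id] = [action, provisional_reward]
        self._update(action, provisional_reward)
        return action

    def revise_reward(self, episode_id, new_reward: float) -> None:
        # Time t+k: an audit revises the judgment of the episode. Apply only the
        # difference, so the episode's total update roughly matches what it
        # would have been had r_{t+k} been given in the first place.
        action, old_reward = self.episodes[episode_id]
        self._update(action, new_reward - old_reward)
        self.episodes[episode_id][1] = new_reward
```

Because the update ultimately tracks the final, post-edit reward, a learner trained this way is selected to respond to however the reward ends up after all edits, which is the dynamic the last bullet points at.
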
Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T16:51:04.680Z · LW · GW

Thanks for the feedback! I'll respond to different points in different comments for easier threading.

There are a lot of human objectives that, to me, seem like they would never conflict with maximizing reward. This includes anything related to disempowering the overseers in any way (that they can recover from), pursuing objectives fundamentally outside the standard human preference distribution (like torturing kittens), causing harms to humans, or in general making the overseers predictably less happy.

I basically agree that in the lab setting (when humans have a lot of control), the model is not getting any direct gradient update toward the "kill all humans" action or anything like that. Any bad actions that are rewarded by gradient descent are fairly subtle / hard for humans to notice.

The point I was trying to make is more like:

  • You might have hoped that ~all gradient updates are toward "be honest and friendly," such that the policy "be honest and friendly" is just the optimal policy. If this were right, it would provide a pretty good reason to hope that the model generalizes in a benign way even as it gets smarter.
  • But in fact this is not the case -- even when humans have a lot of control over the model, there will be many cases where maximizing reward conflicts with being honest and friendly, and in every such case the "play the training game" policy does better than the "be honest and friendly" policy -- to the point where it's implausible that the straightforward "be honest and friendly" policy survives training.
  • So the hope in the first bullet point -- the most straightforward kind of hope you might have had about HFDT -- doesn't seem to apply. Other more subtle hopes may still apply, which I try to briefly address in the "What if Alex has benevolent motivations?" and "What if Alex operates with moral injunctions that constrain its behavior?" sections.

The story of doom does still require the model to generalize zero-shot to novel situations -- i.e. to figure out things like "In this particular circumstance, now that I am more capable than humans, seizing the datacenter would get higher reward than doing what the humans asked" without having literally gotten positive reward for trying to seize the datacenter in that kind of situation on a bunch of different data points.

But this is the kind of generalization we expect future systems to display -- we expect them to be able to do reasoning to figure out a novel response suitable to a novel problem. The question is how they will deploy this reasoning and creativity -- and my claim is that their training pushes them to deploy it in the direction of "trying to maximize reward or something downstream of reward."

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T15:31:10.673Z · LW · GW

According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post that pure human feedback is likely to lead to takeover):

  1. Human feedback is better than even-worse alternatives such as training the AI on a collection of fully automated rewards (predicting the next token, winning games, proving theorems, etc) and waiting for it to get smart enough to generalize well enough to be helpful / follow instructions. So it seemed good to move the culture at AI labs away from automated and easy rewards and toward human feedback.
  2. You need to have human feedback working pretty well to start testing many other strategies for alignment like debate and recursive reward modeling and training-for-interpretability, which tend to build on a foundation of human feedback.
  3. Human feedback provides a more realistic baseline to compare other strategies to -- you want to be able to tell clearly if your alignment scheme actually works better than human feedback.

With that said, my guess is that on the current margin people focused on safety shouldn't be spending too much more time refining pure human feedback (and ML alignment practitioners I've talked to largely agree, e.g. the OpenAI safety team recently released this critiques work -- one step in the direction of debate).

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T15:23:55.977Z · LW · GW

I'm still fairly optimistic about sandwiching. I deliberately considered a set of pretty naive strategies ("naive safety effort" assumption) to contrast with future posts which will explore strategies that seem more promising. Carefully-constructed versions of debate, amplification, recursive reward-modeling, etc seem like they could make a significant difference and could be tested through a framework like sandwiching.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T15:21:12.859Z · LW · GW

Yeah, I definitely agree with "this problem doesn't seem obviously impossible," at least to push on quantitatively. Seems like there are a bunch of tricks from "choosing easy questions humans are confident about" to "giving the human access to AI assistants / doing debate" to "devising and testing debiasing tools" (what kinds of argument patterns are systematically more likely to convince listeners of true things rather than false things and can we train AI debaters to emulate those argument patterns?) to "asking different versions of the AI the same question and checking for consistency." I only meant to say that the gap is big in naive HFDT, under the "naive safety effort" assumption made in the post. I think non-naive efforts will quantitatively reduce the gap in reward between honest and dishonest policies, though probably there will still be some gap in which at-least-sometimes-dishonest strategies do better than always-honest strategies. But together with other advances like interpretability or a certain type of regularization we could maybe get gradient descent to overall favor honesty.

Comment by Ajeya Cotra (ajeya-cotra) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T15:11:53.905Z · LW · GW

I want to clarify two things:

  1. I do not think AI alignment research is hopeless; my personal probability of doom from AI is something like 25%. My frame is definitely not "death with dignity"; I'm thinking about how to win.
  2. I think there's a lot of empirical research that can be done to reduce risks, including e.g. interpretability, "sandwiching" style projects, adversarial training, testing particular approaches to theoretical problems like eliciting latent knowledge, the evaluations you linked to, empirically demonstrating issues like deception (as you suggested), and more. Lots of groups are in fact working on such problems, and I'm happy about that.

The specific thing I said was hard to name is realistic and plausible experiments we could do on today's models that would a) make me update strongly toward "racing forward with plain HFDT will not lead to an AI takeover," and b) be accepted as "fair" by people who disagree with my claim. I gave an example right after that of a type of experiment I don't expect ML people to consider "fair" as a test of this hypothesis. If I saw that ML people could consistently predict the direction in which gradient descent is suboptimal, I would update a lot against this risk.

I think there's a lot more room for empirical progress where you assume that this is a real risk and try to address it than there is for empirical progress that could realistically cause either skeptics or concerned people to update about whether there's any risk at all. A forthcoming post by Holden gets into some of these things.

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-10T04:46:58.230Z · LW · GW

> To your point, sure, an H100 simulator will get perfect reward, but the model doesn't see x′, so how would it acquire the ability to simulate H100?

In the worst-case game we're playing, I can simply say "the reporter we get happens to have this ability because that happens to be easier for SGD to find than the direct translation ability."

When living in worst-case land, I often imagine random search across programs rather than SGD. Imagine we were plucking reporters at random from a giant barrel of possible reporters, rejecting any reporter that didn't perform perfectly in whatever training process we set up, and keeping the first one that did. In that case, if we happened to pluck out a reporter that answered questions by simulating H100, we'd be screwed, because that reporter would perform perfectly in the training process you described.

SGD is not the same as plucking programs out of the air randomly, but when we're playing the worst-case game it's on the builder to provide a compelling argument that SGD will definitely not find this particular type of program.

You're pointing at an intuition ("the model is never shown x′"), but that's not a sufficiently tight argument in the worst-case context -- models (especially powerful/intelligent ones) often generalize to understanding many things they weren't explicitly shown in their training dataset. In fact, we don't show the model exactly how to do direct translation between the nodes in its Bayes net and the nodes in our Bayes net (because we can't even expose those nodes), so we are relying on the direct translator to also have abilities it wasn't explicitly shown in training. The question is just which of those abilities is easier for SGD to build up; the counterexample in this case is "the H100 imitator happens to be easier."
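
To make the barrel analogy concrete, here is a toy sketch (everything in it is made up): two reporters, a direct translator and an H100 simulator, are indistinguishable on the training distribution but come apart on a case where the smartest human would be fooled.

```python
# Toy version of the worst-case "barrel of reporters" argument: both reporters
# below get perfect reward on the training distribution (where H100's beliefs
# match the truth), so selecting on training performance alone cannot tell the
# direct translator apart from the H100 simulator.
TRAIN = [
    {"truth": "diamond present", "h100_belief": "diamond present"},
    {"truth": "diamond stolen", "h100_belief": "diamond stolen"},
]
# Off-distribution case: the predictor "knows" the truth but H100 is fooled.
HARD_CASE = {"truth": "diamond stolen", "h100_belief": "diamond present"}


def direct_translator(case):
    return case["truth"]        # reports what the predictor actually knows


def h100_simulator(case):
    return case["h100_belief"]  # reports what the smartest human would conclude


def perfect_on_training(reporter):
    # Training labels come from H100-level evaluation, which is correct on TRAIN.
    return all(reporter(c) == c["h100_belief"] for c in TRAIN)


for reporter in (direct_translator, h100_simulator):
    print(f"{reporter.__name__:18s}",
          "perfect on training:", perfect_on_training(reporter),
          "| hard case:", reporter(HARD_CASE))
```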

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-08T17:05:22.976Z · LW · GW

The question here is just how it would generalize given that it was trained on H_1, H_2, ..., H_10. To make arguments about how it would generalize, we ask ourselves what internal procedure it might have actually learned to implement.

Your proposal is that it might learn the procedure "just be honest" because that would perform perfectly on this training distribution. You contrast this against the procedure "just answer however the evaluator you've seen most recently would answer," which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right?

If so, then I'm arguing that it may instead learn the procedure "answer the way an H_100 evaluator would answer." That is, once it has a few experiences of the evaluation level being ratcheted up, it might think to itself "I know where this is going, so let's just jump straight to the best evaluation the humans will be able to muster in the training distribution and then imitate how that evaluation procedure would answer." This would also get perfect loss on the training distribution, because we can't produce data points beyond H_100. And then that thing might still be missing knowledge that the AI has.

To be clear, it's possible that in practice this kind of procedure would cause it to generalize honestly (though I'm somewhat skeptical). But we're in worst-case land, so "jump straight to answering the way a human would" is a valid counterexample to the proposal.

This comment on another proposal gives a more precise description.

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-08T16:58:26.182Z · LW · GW

Yes, that's right. The key thing I'd add to 1) is that ARC believes most kinds of data augmentation (giving the human AI assistance, having the human think longer, giving them other kinds of advantages) are also unlikely to work, so you'd need to do something to "crack open the black box" and penalize certain ways the reporter might be computing its answer. ARC could still be surprised by a data augmentation technique, but they'd hold such proposals to a higher standard.

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-08T06:19:18.776Z · LW · GW

This proposal has some resemblance to turning reflection up to 11. In worst-case land, the counterexample would be a reporter that answers questions by doing inference in whatever Bayes net corresponds to "the world-understanding that the smartest/most knowledgeable human in the world" has; this understanding could still be missing things that the prediction model knows.
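
As a toy illustration of that counterexample (the numbers and the "tampering" variable are invented), here are two tiny Bayes nets queried by brute-force enumeration: a reporter doing inference in the human's net confidently answers that the diamond is there, while direct translation from the predictor's richer net, which also models camera tampering, does not.

```python
# Toy pair of Bayes nets. The predictor's net has an extra "tampered" node the
# (smartest) human's net lacks, so a reporter that answers by doing inference
# in the human's net misses that latent knowledge.
from itertools import product

P_DIAMOND = 0.5    # prior that the diamond is in the vault
P_TAMPERED = 0.1   # prior that the camera feed was tampered with


def p_camera_shows_diamond(diamond: bool, tampered: bool) -> float:
    if tampered:
        return 0.9  # a tampered feed shows a diamond regardless of the truth
    return 0.99 if diamond else 0.01


def predictor_posterior(camera: bool = True, tampered=None) -> float:
    """P(diamond | camera shows diamond, [tampered]) in the predictor's net."""
    num = den = 0.0
    for diamond, tamp in product([True, False], repeat=2):
        if tampered is not None and tamp != tampered:
            continue
        p_cam = p_camera_shows_diamond(diamond, tamp)
        p = ((P_DIAMOND if diamond else 1 - P_DIAMOND)
             * (P_TAMPERED if tamp else 1 - P_TAMPERED)
             * (p_cam if camera else 1 - p_cam))
        den += p
        if diamond:
            num += p
    return num / den


def human_posterior(camera: bool = True) -> float:
    """Same query in the human's net, which has no tampering node at all."""
    num = den = 0.0
    for diamond in [True, False]:
        p_cam = 0.99 if diamond else 0.01
        p = (P_DIAMOND if diamond else 1 - P_DIAMOND) * (p_cam if camera else 1 - p_cam)
        den += p
        if diamond:
            num += p
    return num / den


print("human-net reporter:             ", round(human_posterior(), 3))        # ~0.99
print("predictor, marginal:            ", round(predictor_posterior(), 3))    # ~0.91
print("predictor, knows it's tampered: ", round(predictor_posterior(tampered=True), 3))  # 0.5
```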

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-07T16:31:12.105Z · LW · GW

> I see why the approach I mention might have some intrinsic limitations in its ability to elicit latent knowledge, though. The problem is that even if it roughly understands that it has incentives to use most of what it knows when we ask it to simulate the prediction of someone with its own characteristics (or a 1400 IQ), ELK is looking for a global maximum (we want it to use ALL of its knowledge), so there is always uncertainty about whether it really understood that point for extreme intelligence levels / examples, or whether it tries to fit the training data as closely as possible and thus still doesn't use something it knows.

I think this is roughly right, but to try to be more precise, I'd say the counterexample is this:

  • Consider the Bayes net that represents the upper bound of all the understanding of the world you could extract by doing all the tricks described (P vs NP, generalizing from less-smart to smarter humans, etc.).
  • Imagine that the AI does inference in that Bayes net.
  • However, the predictor's Bayes net (which was created by a different process) still has latent knowledge that this Bayes net lacks.
  • By conjecture, we could not possibly have constructed a training data point that distinguishes between doing inference on the upper-bound Bayes net and doing direct translation.

Comment by Ajeya Cotra (ajeya-cotra) on Prizes for ELK proposals · 2022-01-07T16:28:09.983Z · LW · GW

[Paul/Mark can correct me here] I would say no for any small-but-interesting neural network (like small language models); for something like a linear regression where we've set the features, I think it's kind of a philosophical question (though I'd say yes).

In some sense, ELK as a problem only even starts "applying" to pretty smart models (ones that can talk, including about counterfactuals / hypotheticals, as discussed in this appendix). This is closely related to how alignment as a problem only really starts applying to models smart enough to be thinking about how to pursue a goal.