Posts

Sabotage Evaluations for Frontier Models 2024-10-18T22:33:14.320Z
Case studies on social-welfare-based standards in various industries 2024-06-20T13:33:44.780Z
Good job opportunities for helping with the most important century 2024-01-18T17:30:03.332Z
We're Not Ready: thoughts on "pausing" and responsible scaling policies 2023-10-27T15:19:33.757Z
3 levels of threat obfuscation 2023-08-02T14:58:32.506Z
A Playbook for AI Risk Reduction (focused on misaligned AI) 2023-06-06T18:05:55.306Z
Seeking (Paid) Case Studies on Standards 2023-05-26T17:58:57.042Z
Success without dignity: a nearcasting story of avoiding catastrophe by luck 2023-03-14T19:23:15.558Z
Discussion with Nate Soares on a key alignment difficulty 2023-03-13T21:20:02.976Z
What does Bing Chat tell us about AI risk? 2023-02-28T17:40:06.935Z
How major governments can help with the most important century 2023-02-24T18:20:08.530Z
What AI companies can do today to help with the most important century 2023-02-20T17:00:10.531Z
Jobs that can help with the most important century 2023-02-10T18:20:07.048Z
Spreading messages to help with the most important century 2023-01-25T18:20:07.322Z
How we could stumble into AI catastrophe 2023-01-13T16:20:05.745Z
Transformative AI issues (not just misalignment): an overview 2023-01-05T20:20:06.424Z
Racing through a minefield: the AI deployment problem 2022-12-22T16:10:07.694Z
High-level hopes for AI alignment 2022-12-15T18:00:15.625Z
AI Safety Seems Hard to Measure 2022-12-08T19:50:07.352Z
Why Would AI "Aim" To Defeat Humanity? 2022-11-29T19:30:07.828Z
Nearcast-based "deployment problem" analysis 2022-09-21T18:52:22.674Z
Beta Readers are Great 2022-09-05T19:10:19.030Z
How might we align transformative AI if it’s developed very soon? 2022-08-29T15:42:08.985Z
AI strategy nearcasting 2022-08-25T17:26:28.455Z
The Track Record of Futurists Seems ... Fine 2022-06-30T19:40:18.893Z
Nonprofit Boards are Weird 2022-06-23T14:40:11.593Z
AI Could Defeat All Of Us Combined 2022-06-09T15:50:12.952Z
Useful Vices for Wicked Problems 2022-04-12T19:30:22.054Z
Ideal governance (for companies, countries and more) 2022-04-05T18:30:19.228Z
Debating myself on whether “extra lives lived” are as good as “deaths prevented” 2022-03-29T18:30:23.792Z
Cold Takes reader survey - let me know what you want more and less of! 2022-03-18T19:20:24.261Z
Programming note 2022-03-09T21:20:18.062Z
The Wicked Problem Experience 2022-03-02T17:50:18.621Z
Learning By Writing 2022-02-22T15:50:19.452Z
Misc thematic links 2022-02-18T22:00:23.466Z
Defending One-Dimensional Ethics 2022-02-15T17:20:20.107Z
To Match the Greats, Don’t Follow In Their Footsteps 2022-02-11T19:20:19.002Z
"Moral progress" vs. the simple passage of time 2022-02-09T02:50:19.144Z
Reply to Eliezer on Biological Anchors 2021-12-23T16:15:43.508Z
Forecasting Transformative AI, Part 1: What Kind of AI? 2021-09-24T00:46:49.279Z
This Can't Go On 2021-09-18T23:50:33.307Z
Digital People FAQ 2021-09-13T15:24:11.350Z
Digital People Would Be An Even Bigger Deal 2021-09-13T15:23:41.195Z
The Duplicator: Instant Cloning Would Make the World Economy Explode 2021-09-08T00:06:27.721Z
The Most Important Century: Sequence Introduction 2021-09-03T20:19:27.917Z
All Possible Views About Humanity's Future Are Wild 2021-09-03T20:19:06.453Z
Thoughts on the Singularity Institute (SI) 2012-05-11T04:31:30.364Z
Maximizing Cost-effectiveness via Critical Inquiry 2011-11-10T19:25:14.904Z
Why We Can't Take Expected Value Estimates Literally (Even When They're Unbiased) 2011-08-18T23:34:12.099Z

Comments

Comment by HoldenKarnofsky on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2024-01-15T20:57:13.682Z · LW · GW

Thanks for the thoughts!

#1: METR made some edits to the post in this direction (in particular see footnote 3).

On #2, Malo’s read is what I intended. I think compromising with people who want "less caution" is most likely to result in progress (given the current state of things), so it seems appropriate to focus on that direction of disagreement when making pragmatic calls like this.

On #3: I endorse the “That’s a V1” view. While industry-wide standards often take years to revise, I think individual company policies often (maybe usually) update more quickly and frequently.

Comment by HoldenKarnofsky on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2024-01-15T20:54:25.148Z · LW · GW

Thanks for the thoughts!

I don’t think the communications you’re referring to “take for granted that the best path forward is compromising.” I would simply say that they point out the compromise aspect as a positive consideration, which seems fair to me - “X is a compromise” does seem like a point in favor of X all else equal (implying that it can unite a broader tent), though not a dispositive point.

I address the point about improvements on the status quo in my response to Akash above.

Comment by HoldenKarnofsky on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2024-01-15T20:53:50.006Z · LW · GW

Thanks for the thoughts! Some brief (and belated) responses:

  • I disagree with you on #1 and think the thread below your comment addresses this.
  • Re: #2, I think we have different expectations. We can just see what happens, but I’ll note that the RSP you refer to is quite explicit about the need for further iterations (not just “revisions” but also the need to define further-out, more severe risks).
  • I’m not sure what you mean by “an evals regime in which the burden of proof is on labs to show that scaling is safe.” How high is the burden you’re hoping for? If they need to definitively rule out risks like “The weights leak, then the science of post-training enhancements moves forward to the point where the leaked weights are catastrophically dangerous” in order to do any further scaling, my sense is that nobody (certainly not me) has any idea how to do this, and so this proposal seems pretty much equivalent to “immediate pause,” which I’ve shared my thoughts on. If you have a lower burden of proof in mind, I think that’s potentially consistent with the work on RSPs that is happening (it depends on exactly what you are hoping for).
  • I agree with the conceptual point that improvements on the status quo can be net negative for the reasons you say. When I said “Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate,” I didn’t mean to say that there’s no way this can make intellectual sense. To take a quick stab at what bugs me: I think to the extent a measure is an improvement but insufficient, the strategy should be to say “This is an improvement but insufficient,” accept the improvement and count on oneself to win the “insufficient” argument. This kind of behavior seems to generalize to a world in which everyone is clear about their preferred order of object-level states of the world, and incentive gradients consistently point in the right direction (especially important if - as I believe - getting some progress generally makes it easier rather than harder to get more). I worry that the behavior of opposing object-level improvements on the grounds that others might find them sufficient seems to generalize to a world with choppier incentive gradients, more confusing discourse, and a lot of difficulty building coalitions generally (it’s a lot harder to get agreement on “X vs. the best otherwise achievable outcome” than on “X vs. the status quo”).
  • I think nearly all proponents of RSPs do not see them as a substitute for regulation. Early communications could have emphasized this point more (including the METR post, which has been updated). I think communications since then have been clearer about it. 
  • I lean toward agreeing that another name would be better.  I don’t feel very strongly, and am not sure it matters at this point anyway with different parties using different names.
  • I don’t agree that "we can't do anything until we literally have proof that the model is imminently dangerous" is the frame of RSPs, although I do agree that the frame is distinct from a “pause now” frame. I’m excited about conditional pauses as something that can reduce risk a lot while having high tractability and bringing together a big coalition; the developments you mention are great, but I think we’re still a long way from where “advocate for an immediate pause” looks better to me than working within this framework. I also disagree with your implication that the RSP framework has sucked up a lot of talent; while evals have drawn in a lot of people and momentum, hammering out conditional pause related frameworks seems to be something that only a handful of people were working on as of the date of your comment. (Since then the number has gone up due to AI companies forming teams dedicated to this; this seems like a good thing to me.) Overall, it seems to me that most of the people working in this area would otherwise be working on evals and other things short of advocating for immediate pauses.
Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-10-02T17:18:03.721Z · LW · GW

(Apologies for slow reply!)

> I see, I guess where we might disagree is that I think a productive social movement could want to apply Henry Spira's playbook (overall pretty adversarial), oriented mostly towards slowing things down until labs have a clue of what they're doing on the alignment front. I would guess you wouldn't agree with that, but I'm not sure.

I think an adversarial social movement could have a positive impact. I have tended to think of the impact as mostly being about getting risks taken more seriously and thus creating more political will for “standards and monitoring,” but you’re right that there could also be benefits simply from buying time generically for other stuff.

> I'm not saying that it would be a force against regulation in general, but that it would be a force against any regulation which substantially slows down labs' current rate of capabilities progress. And empirics don't demonstrate the opposite as far as I can tell.

I said it’s “far from obvious” empirically what’s going on. I agree that discussion of slowing down has focused on the future rather than now, but I don’t think it has been pointing to a specific time horizon (the vibe looks to me more like “slow down at a certain capabilities level”).

> Finally, on your conceptual point, as some have argued, it's in fact probably not possible to affect all players equally without a drastic regime of control (which is a true downside of slowing down now, but IMO still much less bad than slowing down once a leak or a jailbreak of an advanced system can cause a large-scale engineered pandemic), because smaller actors will use the time to try to catch up as close as possible to the frontier.

It’s true that no regulation will affect everyone precisely the same way. But there is plenty of precedent for major industry players supporting regulation that generally slows things down (even when the dynamic you’re describing applies).

> I agree, but if anything, my sense is that due to various compounding effects (AI accelerating AI, investment, increased compute demand, and more talent arriving earlier), a product release that comes N months earlier just gives a lower bound on how much TAI timelines shorten (hence greater than N). Moreover, I think that the ChatGPT product release is, ex post at least, not in the typical product release reference class. It was clearly a massive game changer for OpenAI and the entire ecosystem.

I don’t agree that we are looking at a lower bound here, bearing in mind that (I think) we are just talking about when ChatGPT was released (not when the underlying technology was developed), and that (I think) we should be holding fixed the release timing of GPT-4. (What I’ve seen in the NYT seems to imply that they rushed out functionality they’d otherwise have bundled with GPT-4.)

If ChatGPT had been held for longer, then:

  • Scaling and research would have continued in the meantime. And even with investment and talent flooding in, I expect that there’s very disproportionate impact from players who were already active before ChatGPT came out, who were easily well capitalized enough to go at ~max speed for the few months between ChatGPT and GPT-4.
  • GPT-4 would have had the same absolute level of impressiveness, revenue potential, etc. (as would some other things that I think have been important factors in bringing in investment, such as Midjourney). You could have a model like “ChatGPT maxed out hype such that the bottleneck on money and talent rushing into the field became calendar time alone,” which would support your picture; but you could also have other models, e.g. where the level of investment is more like a function of the absolute level of visible AI capabilities such that the timing of ChatGPT mattered little, holding fixed the timing of GPT-4. I’d guess the right model is somewhere in between those two; in particular, I'd guess that it matters a lot how high revenue is from various sources, and revenue seems to behave somewhere in between these two things (there are calendar-time bottlenecks, but absolute capabilities matter a lot too; and parallel progress on image generation seems important here as well.)
  • Attention from policymakers would’ve been more delayed; the more hopeful you are about slowing things via regulation, the more you should think of this as an offsetting factor, especially since regulation may be more of a pure “calendar-time-bottlenecked response to hype” model than research and scaling progress.
  • (I also am not sure I understand your point for why it could be more than 3 months of speedup. All the factors you name seem like they will nearly-inevitably happen somewhere between here and TAI - e.g., there will be releases and demos that galvanize investment, talent, etc. - so it’s not clear how speeding a bunch of these things up 3 months speeds the whole thing up more than 3 months, assuming that there will be enough time for these things to matter either way.)

But more important than any of these points is that circumstances have (unfortunately, IMO) changed. My take on the “successful, careful AI lab” intervention was quite a bit more negative in mid-2022 (when I worried about exactly the kind of acceleration effects you point to) than when I did my writing on this topic in 2023 (at which point ChatGPT had already been released and the marginal further speedup of this kind of thing seemed a lot lower). Since I wrote this post, it seems like the marginal downsides have continued to fall, although I do remain ambivalent.

Comment by HoldenKarnofsky on 3 levels of threat obfuscation · 2023-10-02T17:12:30.661Z · LW · GW

Just noting that these seem like valid points! (Apologies for slow reply!) 

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-07-26T05:04:58.497Z · LW · GW

Thanks for the response!

Re: your other interventions - I meant for these to be part of the "Standards and monitoring" category of interventions (my discussion of that mentions advocacy and external pressure as important factors).

I think it's far from obvious that an AI company needs to be a force against regulation, both conceptually (if it affects all players, it doesn't necessarily hurt the company) and empirically.

Thanks for giving your take on the size of speedup effects. I disagree on a number of fronts. I don't want to get into the details of most of them, but will comment that it seems like a big leap from "X product was released N months earlier than otherwise" to "Transformative AI will now arrive N months earlier than otherwise." (I think this feels fairly clear when looking at other technological breakthroughs and how much they would've been affected by differently timed product releases.)

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-07-14T23:48:13.193Z · LW · GW

I'm not convinced it requires a huge compute tax to reliably avoid being caught. (If I were, I would in fact probably be feeling a lot more chill than I am.)

The analogy to humans seems important. Humans are capable of things like going undercover, and pulling off coups, and also things like "working every day with people they'd fire if they could, without clearly revealing this." I think they mostly pull this off with:

  • Simple heuristics like "Be nice, unless you're in the very unusual situation where hostile action would work well." (I think the analogy to how AIs might behave is straightforward.)
  • The fact that they don't need to be perfect - lots of behaviors and situations are just ambiguous. (I think this also applies to AI, and maybe more so: there could be lots of weird little incidents that are chalked up to bugs without causing concern sufficient to address the problem reliably.)
  • Also, humans generally need to do a lot of reasoning along the lines of "X usually works, but I do need to notice the rare situations when something radically different is called for." So if this is expensive, they just need to be doing that expensive thing a lot.
Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-07-14T23:39:29.552Z · LW · GW

I think training exclusively on objective measures has a couple of other issues:

  • For sufficiently open-ended training, objective performance metrics could incentivize manipulating and deceiving humans to accomplish the objective. A simple example would be training an AI to make money, which might incentivize illegal/unethical behavior.
  • For less open-ended training, I basically just think you can only get so much done this way, and people will want to use fuzzier "approval" measures to get help from AIs with fuzzier goals (this seems to be how things are now with LLMs).

I think your point about the footprint is a good one and means we could potentially be very well-placed to track "escaped" AIs if a big effort were put in to do so. But I don't see signs of that effort today and don't feel at all confident that it will happen in time to stop an "escape."

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-07-08T13:18:22.360Z · LW · GW

That's interesting, thanks!

In addition to some generalized concern about "unknown unknowns" leading to faster progress on reliability than expected by default (especially in the presence of commercial incentives for reliability), I also want to point out that there may be some level of capabilities where AIs become good at doing things like:

  • Assessing the reliability of their own thoughts, and putting more effort into things that have the right combination of uncertainty and importance.
  • Being able to use that effort productively, via things like "trying multiple angles on a question" and "setting up systems for error checking."

I think that in some sense humans are quite unreliable, and use a lot of scaffolding - variable effort at reliability, consulting with each other and trying to catch things each other missed, using systems and procedures, etc. - to achieve high reliability, when we do so. Because of this, I think AIs could have pretty low baseline reliability (like humans) while finding ways to be effectively highly reliable (like humans). And I think this applies to deception as much as anything else (if a human thinks it's really important to deceive someone, they're going to make a lot of use of things like this).

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-07-08T13:12:54.973Z · LW · GW

I agree with these points! But:

  • Getting the capabilities to be used by other agents to do good things could still be tricky and/or risky, when reinforcement is vulnerable to deception and manipulation.
  • I still don't think this adds up to a case for being confident that there aren't going to be "escapes" anytime soon.
Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-06-14T05:27:38.932Z · LW · GW

This sounds right to me!

Only note is that I think the setup can be simplified a bit. The central idea I have in mind is that the AI does something like:

  1. "Think" about what to do next, for up to some max period of time ("what to do next" can be "think more, with prompt X").
  2. Do it
  3. Repeat

This seems like a pretty natural way for an "agent" to operate, and then every #1 is an "auditable step" in your terminology. (And the audits are done by comparing a few rollouts of that step, and performing gradient descent without executing any of them.)
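
For concreteness, here's a minimal Python-style sketch of the loop I have in mind; the names (propose_next_step, execute, audit_and_update) and the audit frequency are hypothetical placeholders rather than a real implementation:

```python
import random

AUDIT_PROB = 0.1  # hypothetical: fraction of steps selected for auditing

def run_agent(model, task, max_steps=50):
    history = [task]
    for _ in range(max_steps):
        # 1. "Think" about what to do next, for up to some max period of time
        #    ("what to do next" can be "think more, with prompt X").
        step = model.propose_next_step(history)

        # Each such proposal is an "auditable step": occasionally, sample a few
        # alternative rollouts of it, have supervisors compare them without
        # executing any of them, and perform gradient descent on those ratings.
        if random.random() < AUDIT_PROB:
            rollouts = [step] + [model.propose_next_step(history) for _ in range(2)]
            audit_and_update(model, history, rollouts)

        # 2. Do it.
        history.append(execute(step))
        # 3. Repeat.
    return history
```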

There are probably subtleties I'm missing, but I think this points pretty well at what I tend to think of as the hopes of process-based supervision.

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-14T05:19:52.617Z · LW · GW

On your last three paragraphs, I agree! I think the idea of security requirements for AI labs as systems become more capable is really important.

I think good security is difficult enough (and inconvenient enough) that we shouldn't expect this sort of thing to happen smoothly or by default. I think we should assume there will be AIs kept under security that has plenty of holes, some of which may be easier for AIs to find (and exploit) than humans.

I don't find the points about pretraining compute vs. "agent" compute very compelling, naively. One possibility that seems pretty live to me is that the pretraining is giving the model a strong understanding of all kinds of things about the world - for example, understanding in a lot of detail what someone would do to find vulnerabilities and overcome obstacles if they had a particular goal. So then if you put some scaffolding on at the end to orient the AI toward a goal, you might have a very capable agent quite quickly, without needing vast quantities of training specifically "as an agent." To give a simple concrete example that I admittedly don't have a strong understanding of, Voyager seems pretty competent at a task that it didn't have vast amounts of task-specific training for.

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-14T04:58:51.636Z · LW · GW

I agree that today's AI systems aren't highly reliable at pretty much anything, including deception. But I think we should expect more reliability in the future, partly for reasons you give above, and I think that's a double-edged sword.

Under the picture you sketch out above, companies will try to train AIs to be capable of being much more reliable (while also, presumably, being intelligent and even creative). I also think reliability is likely to increase without necessarily having big reliability-focused efforts: just continuing to train systems at larger scale and with more/better data is likely to make them more capable in a way that makes them more reliable. (E.g., I think current language models have generally gotten more reliable partly via pure scaling up, though things like RLHF are also part of the picture.) For both reasons, I expect progress on reliability, with the pace of progress very hard to forecast. If AI systems become capable of being intelligent and creative in useful ways while making only extraordinarily rare mistakes, then it seems like we should be worrying about their having developed reliable deception capabilities as well. Thoughts on that?

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-06-12T18:22:36.157Z · LW · GW

A way this could be an issue would be if we are auditing at a stage before it gets the paywalled textbook, and when we "generate two alternative actions for us to choose between", we are allowing that generation process to involve a bunch of browsing and searching that we don't inquire into.

But we could require "generate two alternative actions for us to choose between" to be pretty constrained (e.g., not involve interacting with browsers and other tools), and this would still not be the same as boxing, because we would still be auditing an unconstrained sequence of actions (while constraining only the generation of alternatives during the audit).

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-06-12T17:55:42.495Z · LW · GW

As far as I can tell, this example is hard to fit into the more specific description I gave of process-based supervision way upthread - specifically, see step 4:

4. With chance (1-p), the step is simply executed, with no gradient descent implications. With chance p,  we prompt the AI to generate a number of alternative next steps; drill down extensively on its reasoning; and perform gradient descent based on which of the alternative next steps we like best. (We could potentially then execute a weighted random suggested step, rather than the AI's first-choice step, to further reduce the extent to which proposing a step toward e.g. subtly gaining power is encouraged by gradient descent.)

In this case, let's presume that we are auditing a step that takes place after the AI got the paywalled textbook. Here, it might propose two different next steps that are both informed by the paywalled textbook, and we pick the one we like better. This doesn't seem to differentially reinforce the behavior of going for the paywalled textbook. (Whereas if we are auditing an earlier stage, where it was considering going for the paywalled textbook, we can reinforce against this.)

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-10T21:15:13.610Z · LW · GW

Hm, it seems to me that RL would be more like training away the desire to deceive, although I'm not sure either "ability" or "desire" is totally on target - I think something like "habit" or "policy" captures it better. The training might not be bulletproof (AI systems might have multiple goals and sometimes notice that deception would help accomplish much), but one doesn't need 100% elimination of deception anyway, especially not when combined with effective checks and balances.

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-06-10T07:05:58.254Z · LW · GW

I'm not intending to use Def'n 2 at all. The hope here is not that we can "rest assured that there is no dangerous consequentialist means-end reasoning" due to e.g. it not fitting into the context in question. The hope is merely that if we don't specifically differentially reinforce unintended behavior, there's a chance we won't get it (even if there is scope to do it).

I see your point that consistently, effectively "boxing" an AI during training could also be a way to avoid reinforcing behaviors we're worried about. But they don't seem the same to me: I think you can get the (admittedly limited) benefit of process-based supervision without boxing. Boxing an AI during training might have various challenges and competitiveness costs. Process-based supervision means you can allow an unrestricted scope of action, while avoiding specifically reinforcing various unintended behaviors. That seems different from boxing.

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-10T06:57:05.663Z · LW · GW

Thanks for the thoughts! I agree that there will likely be commercial incentives for some amount of risk reduction, though I worry that the incentives will trail off before the needs trail off - more on that here and here.

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-10T06:54:54.274Z · LW · GW

I agree that this is a major concern. I touched on some related issues in this piece.

This post focused on misalignment because I think readers of this forum tend to be heavily focused on misalignment, and in this piece I wanted to talk about what a playbook might look like assuming that focus (I have pushed back on this as the exclusive focus elsewhere).

I think somewhat adapted versions of the four categories of intervention I listed could be useful for the issue you raise, as well.

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-10T06:50:35.789Z · LW · GW

Even bracketing that concern, I think another reason to worry about training (not just deploying) AI systems is if they can be stolen (and/or, in an open-source case, freely used) by malicious actors. It's possible that any given AI-enabled attack is offset by some AI-enabled defense, but that doesn't seem safe to assume.

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-10T06:50:27.557Z · LW · GW

I'm curious why you are "not worried in any near future about AI 'escaping.'" It seems very hard to be confident in even pretty imminent AI systems' lack of capability to do a particular thing, at this juncture.

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-10T06:41:09.147Z · LW · GW

To be clear, "it turns out to be trivial to make the AI not want to escape" is a big part of my model of how this might work. The basic thinking is that for a human-level-ish system, consistently reinforcing (via gradient descent) intended behavior might be good enough, because alternative generalizations like "Behave as intended unless there are opportunities to get lots of resources, undetected or unchallenged" might not have many or any "use cases."

A number of other measures, including AI checks and balances, also seem like they might work pretty easily for human-level-ish systems, which could have a lot of trouble doing things like coordinating reliably with each other.

So the idea isn't that human-level-ish capabilities are inherently safe, but that straightforward attempts at catching/checking/blocking/disincentivizing unintended behavior could be quite effective for such systems (while such things might be less effective on systems that are extraordinarily capable relative to supervisors).

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-10T06:26:07.152Z · LW · GW

Noting that I don't think alignment being "solved" is a binary.  As discussed in the post, I think there are a number of measures that could improve our odds of getting early human-level-ish AIs to be aligned "enough," even assuming no positive surprises on alignment science. This would imply that if lab A is more attentive to alignment and more inclined to invest heavily in even basic measures for aligning its systems than lab B, it could matter which lab develops very capable AI systems first.

Comment by HoldenKarnofsky on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-10T06:23:41.423Z · LW · GW

Thanks for this comment - I get vibes along these lines from a lot of people but I don't think I understand the position, so I'm enthused to hear more about it.

> I believe that by not touching the "decrease the race" or "don't make the race worse" interventions, this playbook misses a big part of the picture of "how one single think could help massively". 

"Standards and monitoring" is the main "decrease the race" path I see. It doesn't seem feasible to me for the world to clamp down on AI development unconditionally, which is why I am more focused on the conditional (i.e., "unless it's demonstrably safe") version. 

But is there another "decrease the race" or "don't make the race worse" intervention that you think can make a big difference? Based on the fact that you're talking about a single thing that can help massively, I don't think you are referring to "just don't make things worse"; what are you thinking of?

> Staying at the frontier of capabilities and deploying leads the frontrunner to feel the heat which accelerates both capabilities & the chances of uncareful deployment which increases pretty substantially the chances of extinction.

I agree that this is an effect, directionally, but it seems small by default in a setting with lots of players (I imagine there will be, and is, a lot of "heat" to be felt regardless of any one player's actions). And the potential benefits seem big. My rough impression is that you're confident the costs outweigh the benefits for nearly any imaginable version of this; if that's right, can you give some quantitative or other sense of how you get there?

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-06-07T18:34:55.765Z · LW · GW

I think it is not at all about boxing - I gave the example I did to make a clear distinction with the "number of steps between audits" idea.

For the distinction with boxing, I'd focus on what I wrote at the end: "The central picture of process-based feedback isn’t either of these, though - it’s more like 'Let the AI do whatever, but make sure all supervision is based on randomly auditing some step the AI takes, having it generate a few alternative steps it could’ve taken, and rating those steps based on how good they seem, without knowing how they will turn out. The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff.'"

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-06-07T05:34:39.971Z · LW · GW

I don't think of process-based supervision as a totally clean binary, but I don't think of it as just/primarily being about how many steps you allow in between audits. I think of it as primarily being about whether you're doing gradient updates (or whatever) based on outcomes (X was achieved) or processes (Y seems like a well-reasoned step to achieve X). I think your "Example 0" isn't really either - I'd call it internals-based supervision. 

I agree it matters how many steps you allow in between audits, I just think that's a different distinction.

Here’s one way to think about it. Say option A is to give the AI 10 years to produce a plan to make money (with a very restricted action space, e.g. searching and reading lit from an airgapped database, and maintaining a scratchpad), then rate whether the plan seems like a good (ethically as well as practically) plan to make money; whereas option B is to audit the AI every hour, based on how much money is in a bank account. Here option B has shorter cycles, but also seems more dangerous. Does that help illustrate the difference I'm trying to point at?

(The central picture of process-based feedback isn’t either of these, though - it’s more like “Let the AI do whatever, but make sure all supervision is based on randomly auditing some step the AI takes, having it generate a few alternative steps it could’ve taken, and rating those steps based on how good they seem, without knowing how they will turn out. The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff.”)

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-06-03T06:25:24.311Z · LW · GW

Hm, I think we are probably still missing each other at least somewhat (and maybe still a lot), because I don't think the interpretability bit is important for this particular idea - I think you can get all the juice from "process-based supervision" without any interpretability.

I feel like once we sync up you're going to be disappointed, because the benefit of "process-based supervision" is pretty much just that you aren't differentially reinforcing dangerous behavior. (At worst, you're reinforcing "Doing stuff that looks better to humans than it actually is." But not e.g. reward hacking.)

The question is: if you never differentially reinforce dangerous unintended behaviors/aims, how do dangerous behaviors/aims arise? There are potential answers - perhaps you are inadvertently training an AI to pursue some correlate of "this plan looks good to a human," leading to inner misalignment - but I think that most mechanistic stories you can tell that lead from the kind of supervision process I described (even without interpretability) to AIs seeking to disempower humans seem pretty questionable - at best highly uncertain rather than "strong default of danger." This is how it seems to me, though someone with intuitions like Nate's would likely disagree.

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-06-03T05:50:27.259Z · LW · GW

I'm not sure what your intuitive model is and how it differs from mine, but one possibility is that you're picturing a sort of bureaucracy in which we simultaneously have many agents supervising each other (A supervises B who supervises C who supervises D ...) whereas I'm picturing something more like: we train B while making extensive use of A for accurate supervision, adversarial training, threat assessment, etc. (perhaps allocating resources such that there is a lot more of A than B and generally a lot of redundancy and robustness in our alignment efforts and threat assessment), and try to get to the point where we trust B, then do a similar thing with C. I still don't think this is a great idea to do too many times; I'd hope that at some point we get alignment techniques that scale more cleanly.

Comment by HoldenKarnofsky on Seeking (Paid) Case Studies on Standards · 2023-06-02T22:25:40.769Z · LW · GW

We got it! You should get an update within a week.

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-05-22T16:20:06.510Z · LW · GW

I think that's a legit disagreement. But I also claim that the argument I gave still works if you assume that AI is trained exclusively using RL - as long as that RL is exclusively "process-based." So this basic idea: the AI takes a bunch of steps, and gradient descent is performed based on audits of whether those steps seem reasonable while blinded to what happened as a result. 

It still seems, here, like you're not reinforcing unintended behaviors, so the concern comes exclusively from the kind of goal misgeneralization you'd get without having any particular reason to believe you are reinforcing it.

Does that seem reasonable to you? If so, why do you think making RL more central makes process-based supervision less interesting? Is it basically that in a world where RL is central, it's too uncompetitive/practically difficult to stick with the process-based regime?

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-05-19T16:53:39.709Z · LW · GW

Some reactions on your summary:

  • In process-based training, X = “produce a good plan to make money ethically”

This feels sort of off as a description - what actually might happen is that it takes a bunch of actual steps to make money ethically, but steps are graded based on audits of whether they seem reasonable without the auditor knowing the outcome.

  • In process-based training, maybe Y = “produce a deliberately deceptive plan” or “hack out of the box”.

The latter is the bigger concern, unless you mean the former as aimed at something like the latter. E.g., producing a "plan that seems better to us than it is" seems more likely to get reinforced by this process, but is also less scary, compared to doing something that manipulates and/or disempowers humans.

  • AI does Y, a little bit, randomly or incompetently.
  • AI is rewarded for doing Y.

Or AI does a moderate-to-large amount of  Y competently and successfully. Process-based training still doesn't seem like it would reinforce that behavior in the sense of making it more likely in the future, assuming the Y is short of something like "Hacks into its own reinforcement system to reinforce the behavior it just did" or "Totally disempowers humanity."

  • Solve Failure Mode 1 by giving near-perfect rewards

I don't think you need near-perfect rewards. The mistakes reinforce behaviors like "Do things that a silly human would think are reasonable steps toward the goal", not behaviors like "Manipulate the world into creating an appearance that the goal was accomplished." If we just get a whole lot of the former, that doesn't seem clearly worse than humans just continuing to do everything. This is a pretty central part of the hope.

I agree you can still get a problem from goal misgeneralization and instrumental reasoning, but this seems noticeably less likely (assuming process-based training) than getting a problem from reinforcing pursuit of unintended outcomes. (https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty has some discussion.) I put significant credence on something like "Internals-based training doesn't pan out, but neither does the concern about goal misgeneralization and instrumental reasoning (in the context of process-based training, ie in the context of not reinforcing pursuit of unintended outcomes)."

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-05-13T00:25:00.136Z · LW · GW

This feels a bit to me like assuming the conclusion. "Rose" is someone who already has aims (we assume this when we imagine a human); I'm talking about an approach to training that seems less likely to give rise to dangerous aims. The idea of the benefit, here, is to make dangerous aims less likely (e.g., by not rewarding behavior that affects the world through unexpected and opaque pathways); the idea is not to contain something that already has dangerous aims (though I think there is some hope of the latter as well, especially with relatively early human-level-ish AI systems).

Comment by HoldenKarnofsky on How we could stumble into AI catastrophe · 2023-05-08T21:50:32.566Z · LW · GW

I think that as people push AIs to do more and more ambitious things, it will become more and more likely that situational awareness comes along with this, for reasons broadly along the lines of those I linked to (it will be useful to train the AI to have situational awareness and/or other properties tightly linked to it).

I think this could happen via RL fine-tuning, but I also think it's a mistake to fixate too much on today's dominant methods - if today's methods can't produce situational awareness, they probably can't produce as much value as possible, and people will probably move beyond them.

The "responsible things to do" you list seem reasonable, but expensive, and perhaps skipped over in an environment where there's intense competition, things are moving quickly, and the risks aren't obvious (because situationally aware AIs are deliberately hiding a lot of the evidence of risk).

Comment by HoldenKarnofsky on How we could stumble into AI catastrophe · 2023-05-05T19:43:59.995Z · LW · GW

Is the disagreement here about whether AIs are likely to develop things like situational awareness, foresightful planning ability, and understanding of adversaries' decisions as they are used for more and more challenging tasks?

I think this piece represents my POV on this pretty well, especially the bits starting around here.

Comment by HoldenKarnofsky on Discussion with Nate Soares on a key alignment difficulty · 2023-04-14T18:02:24.259Z · LW · GW

It seems like the same question would apply to humans trying to solve the alignment problem - does that seem right? My answer to your question is "maybe", but it seems good to get on the same page about whether "humans trying to solve alignment" and "specialized human-ish safe AIs trying to solve alignment" are basically the same challenge.

Comment by HoldenKarnofsky on Discussion with Nate Soares on a key alignment difficulty · 2023-04-14T16:45:14.524Z · LW · GW

The hope discussed in this post is that you could have a system that is aligned but not superintelligent (more like human-level-ish, and aligned in the sense that it is imitation-ish), doing the kind of alignment work humans are doing today, which could hopefully lead to a more scalable alignment approach that works on more capable systems.

Comment by HoldenKarnofsky on How we could stumble into AI catastrophe · 2023-04-07T20:55:26.105Z · LW · GW

I think this kind of thing is common among humans. Employees might appear to be accomplishing the objectives they were given, with distortions hard to notice (and sometimes noticed, sometimes not) - e.g., programmers cutting corners and leaving a company with problems in the code that don't get discovered until later (if ever). People in government may appear to be loyal to the person in power, while plotting a coup, with the plot not noticed until it's too late. I think the key question here is whether AIs might get situational awareness and other abilities comparable to those of humans. 

Comment by HoldenKarnofsky on How we could stumble into AI catastrophe · 2023-03-21T06:25:09.628Z · LW · GW

I think the more capable AI systems are, the more we'll see patterns like "Every time you ask an AI to do something, it does it well; the less you put yourself in the loop and the fewer constraints you impose, the better and/or faster it goes; and you ~never see downsides." (You never SEE them, which doesn't mean they don't happen.)

I think the world is quite capable of handling a dynamic like that as badly as in my hypothetical scenario, especially if things are generally moving very quickly - I could see a scenario like the one above playing out in a handful of years or faster, and it often takes much longer than that for e.g. good regulation to get designed and implemented in response to some novel problem.

Comment by HoldenKarnofsky on Discussion with Nate Soares on a key alignment difficulty · 2023-03-21T05:24:14.944Z · LW · GW

I hear you on this concern, but it basically seems similar (IMO) to a concern like: "The future of humanity after N more generations will be ~without value, due to all the reflection humans will do - and all the ways their values will change - between now and then." A large set of "ems" gaining control of the future after a lot of "reflection" seems quite comparable to future humans having control over the future (also after a lot of effective "reflection").

I think there's some validity to worrying about a future with very different values from today's. But I think misaligned AI is (reasonably) usually assumed to diverge in more drastic and/or "bad" ways than humans themselves would if they stayed in control; I think of this difference as the major driver of wanting to align AIs at all. And it seems Nate thinks that the hypothetical training process I outline above gets us something much closer to "misaligned AI" levels of value divergence than to "ems" levels of value divergence.

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-03-21T02:31:04.559Z · LW · GW

I see, thanks. I feel like the closest analogy here that seems viable to me would be to something like: is Open Philanthropy able to hire security experts to improve its security and assess whether they're improving its security? And I think the answer to that is yes. (Most of its grantees aren't doing work where security is very important.)

It feels harder to draw an analogy for something like "helping with standards enforcement," but maybe we could consider OP's ability to assess whether its farm animal welfare grantees are having an impact on who adheres to what standards, and how strong adherence is? I think OP has pretty good (not perfect) ability to do so.

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-03-21T02:18:30.024Z · LW · GW

(Chiming in late, sorry!)

I think #3 and #4 are issues, but can be compensated for if aligned AIs outnumber or outclass misaligned AIs by enough. The situation seems fairly analogous to how things are with humans - law-abiding people face a lot of extra constraints, but are still collectively more powerful.

I think #1 is a risk, but it seems <<50% likely to be decisive, especially when considering (a) the possibility for things like space travel, hardened refuges, intense medical interventions, digital people, etc. that could become viable with aligned AIs; (b) the possibility that a relatively small number of biological humans’ surviving could still be enough to stop misaligned AIs (if we posit that aligned AIs greatly outnumber misaligned AIs). And I think misaligned AIs are less likely to cause any damage if the odds are against ultimately achieving their aims. 

I also suspect that the disagreement on point #1 is infecting #2 and #4 a bit - you seem to be picturing scenarios where a small number of misaligned AIs can pose threats that can *only* be defended against with extremely intense, scary, sudden measures. 

I’m not really sold on #2. There are stories like this you can tell, but I think there could be significant forces pushing the other way, such as a desire not to fall behind others’ capabilities. In a world where there are lots of powerful AIs and they’re continually advancing, I think the situation looks less like “Here’s a singular terrifying AI for you to integrate into your systems” and more like “Here’s the latest security upgrade, I think you’re getting pwned if you skip it.”

Finally, you seem to have focused heavily here on the “defense/deterrence/hardening” part of the picture, which I think *might* be sufficient, but isn’t the only tool in the toolkit. Many of the other AI uses in that section are about stopping misaligned AIs from being developed and deployed in the first place, which could make it much easier for them to be radically outnumbered/outclassed.

Comment by HoldenKarnofsky on AI Safety Seems Hard to Measure · 2023-03-21T02:16:30.035Z · LW · GW

(Apologies for the late reply!) For now, my goal is to write something that interested, motivated nontechnical people can follow - the focus is on the content being followable rather than on distribution. I've tried to achieve this mostly via nontechnical beta (and alpha) readers.

Doing this gives me something I can send to people when I want them to understand where I'm coming from, and it also helps me clarify my own thoughts (I tend to trust ideas more when I can explain them to an outsider, and I think that getting to that point helps me get clear on which are the major high-level points I'm hanging my hat on when deciding what to do). I think there's also potential for this work to reach highly motivated but nontechnical people who are better at communication and distribution than I am (and have seen some of this happening).

I have the impression that these posts are pretty widely read in the EA community and at some AI labs, and have raised understanding and concern about misalignment to some degree. 

I may explore more aggressive promotion in the future, but I'm not doing so now.

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-03-18T06:58:07.755Z · LW · GW

I think I find the "grokking general-purpose search" argument weaker than you do, but it's not clear by how much.

The "we" in "we can point AIs toward and have some ability to assess" meant humans, not Open Phil. You might be arguing for some analogy but it's not immediately clear to me what, so maybe clarify if that's the case?

Comment by HoldenKarnofsky on Why Not Just Outsource Alignment Research To An AI? · 2023-03-18T05:18:57.685Z · LW · GW

I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later. I expect alignment researchers to be central to automation attempts.

It seems to me like the basic equation is something like: "If today's alignment researchers would be able to succeed given a lot more time, then they also are reasonably likely to succeed given access to a lot of human-level-ish AIs." There are reasons this could fail (perhaps future alignment research will require major adaptations and different skills such that today's top alignment researchers will be unable to assess it; perhaps there are parallelization issues, though AIs can give significant serial speedup), but the argument in this post seems far from a knockdown.

Also, it seems worth noting that non-experts work productively with experts all the time. There are lots of shortcomings and failure modes, but the video is a parody.

Comment by HoldenKarnofsky on How we could stumble into AI catastrophe · 2023-03-17T22:47:09.425Z · LW · GW

I think there is hope in measures along these lines, but my fear is that it is inherently more complex (and probably slow) to do something like "Make sure to separate plan generation and execution; make sure we can evaluate how a plan is going using reliable metrics and independent assessment" than something like "Just tell an AI what we want, give it access to a terminal/browser and let it go for it."

When AIs are limited and unreliable, the extra effort can be justified purely on grounds of "If you don't put in the extra effort, you'll get results too unreliable to be useful."

If AIs become more and more general - approaching human capabilities - I expect this to become less true, and hence I expect a constant temptation to skimp on independent checks, make execution loops quicker and more closed, etc.

The more people are aware of the risks, and concerned about them, the more we might take such precautions anyway. This piece is about how we could stumble into catastrophe if there is relatively little awareness until late in the game.

Comment by HoldenKarnofsky on How we could stumble into AI catastrophe · 2023-03-17T22:41:04.617Z · LW · GW

Thanks! I agree this is a concern. In theory, people who are constantly thinking about the risks should be able to make a reasonable decision about "when to pause", but in practice I think there is a lot of important work to do today making the "pause" more likely in the future, including on AI safety standards and on the kinds of measures described at https://www.cold-takes.com/what-ai-companies-can-do-today-to-help-with-the-most-important-century/

Comment by HoldenKarnofsky on What AI companies can do today to help with the most important century · 2023-03-17T21:57:34.621Z · LW · GW

I think there's truth to what you're saying, but I think the downsides of premature government involvement are big too. I discuss this more in a followup post.

Comment by HoldenKarnofsky on Racing through a minefield: the AI deployment problem · 2023-03-17T21:49:26.821Z · LW · GW

(Apologies for the late reply!) I think working on improved institutions is a good goal that could potentially help, and I'm excited about some of the work going on in general categories you mentioned. It's not my focus because (a) I do think the "timelines don't match up" problem is big; (b) I think it's really hard to identify specific interventions that would improve all decision-making - it's really hard to predict the long-run effects of any given reform (e.g., a new voting system) as the context changes. Accordingly, what feels most pressing to me is getting more clarity on specific measures that can be taken to reduce the biggest risks to humanity, and then looking specifically at which institutional changes would make the world better-positioned to evaluate and act on those types of measures. Hence my interest in AI strategy "nearcasting" and in AI safety standards.

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-03-17T21:30:59.895Z · LW · GW

It seems like we could simply try to be as vigilant elsewhere as we would be without this measure, and then we could reasonably expect this measure to be net-beneficial (*how* net beneficial is debatable).

Comment by HoldenKarnofsky on How might we align transformative AI if it’s developed very soon? · 2023-03-17T21:29:16.240Z · LW · GW

I now think I wrote that part poorly. The idea isn't so much that we say to an AI, "Go out and do whatever you need to do - accumulate money, hire analysts, run experiments, etc. - and come back with a plan that we will evaluate."

The idea is more like this:

  1. We want to accomplish X.
  2. We describe X to an AI.
  3. The AI proposes a next step toward X, based entirely on thinking about it (and not doing other stuff like e.g. hiring researchers - though its proposed next step can be "Hire researchers").
  4. With chance (1-p), the step is simply executed, with no gradient descent implications. With chance p,  we prompt the AI to generate a number of alternative next steps; drill down extensively on its reasoning; and perform gradient descent based on which of the alternative next steps we like best. (We could potentially then execute a weighted random suggested step, rather than the AI's first-choice step, to further reduce the extent to which proposing a step toward e.g. subtly gaining power is encouraged by gradient descent.)
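
As a minimal sketch (in Python-style pseudocode) of what steps 3 and 4 could look like - the helper names (propose_next_step, human_preference_scores, gradient_update_toward, execute) are hypothetical placeholders rather than a real API, and p is the audit probability:

```python
import random

def supervised_step(model, goal, history, p=0.1, n_alternatives=4):
    # 3. The AI proposes a next step toward the goal, based entirely on thinking
    #    about it (the proposed step itself can be something like "Hire researchers").
    proposed = model.propose_next_step(goal, history)

    if random.random() >= p:
        # 4a. With chance (1 - p): simply execute, with no gradient descent implications.
        return execute(proposed)

    # 4b. With chance p: prompt the AI for alternative next steps and drill down
    #     on its reasoning.
    alternatives = [proposed] + [
        model.propose_next_step(goal, history) for _ in range(n_alternatives)
    ]
    # Supervisors rate the alternatives ex ante, without seeing how any of them
    # would turn out.
    scores = human_preference_scores(goal, history, alternatives)

    # Perform gradient descent based on which alternative next step we like best.
    gradient_update_toward(model, alternatives[scores.index(max(scores))])

    # Optionally execute a weighted-random suggested step rather than the AI's
    # first choice, to further reduce the extent to which proposing a step toward
    # e.g. subtly gaining power is encouraged by gradient descent.
    chosen = random.choices(alternatives, weights=scores, k=1)[0]
    return execute(chosen)
```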

This seems to significantly reduce the scope for things like "Subtly accumulating power gets encouraged by gradient descent," and to produce something closer to "optimization for taking a series of steps each of which a human would find reasonable ex ante" as opposed to "optimization for doing whatever it takes to get the desired outcome."

I don't think it's bulletproof by any means - there could still be an inner alignment problem leading to an AI optimizing for something other than how its proposed steps are rated - but it seems like a risk reducer to me.

(This all comes pretty much straight from conversations Paul and I had a few months ago, though I didn't check this comment with Paul and it may be off from his take.)