joshc

You don't seem to have mentioned the alignment target "follow the common sense interpretation of a system prompt" which seems like the most sensible definition of alignment to me (its alignment to a message, not to a person etc). Then you can say whatever the heck you want in that prompt, including how you would like the AI to be corrigible.

Comment by joshc (joshua-clymer) on What Is The Alignment Problem? · 2025-03-14T06:04:16.377Z · LW · GW

Comment by joshc (joshua-clymer) on Training AI to do alignment research we don’t already know how to do · 2025-02-25T03:30:00.190Z · LW · GW

It means something closer to "very subtly bad in a way that is difficult to distinguish from quality work". Where the second part is the important part.

I think my arguments still hold in this case though right?

i.e. we are training models so they try to improve their work and identify these subtle issues -- and so if they actually behave this way they will find these issues insofar as humans identify the subtle mistakes they make.

My guess is that your core mistake is here

I agree there are lots of "messy in between places," but these are also alignment failures we see in humans.

And if humans had a really long time to do safety reseach, my guess is we'd be ok. Why? Like you said, there's a messy complicated system of humans with different goals, but these systems empirically often move in reasonable and socially-beneficial directions over time (governments get set up to deal with corrupt companies, new agencies get set up to deal with issues in governments, etc)

and i expect we can make AI agents a lot more aligned than humans typically are. e.g. most humans don't actually care about the law etc but, Claude sure as hell seems to. If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.

Comment by joshc (joshua-clymer) on Training AI to do alignment research we don’t already know how to do · 2025-02-25T02:27:42.798Z · LW · GW

I definitely agree that the AI agents at the start will need to be roughly aligned for the proposal above to work. What is it you think we disagree about?

Comment by joshc (joshua-clymer) on Training AI to do alignment research we don’t already know how to do · 2025-02-25T02:25:25.463Z · LW · GW

Yes, thanks! Fixed

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-23T06:21:28.192Z · LW · GW

> So to summarize your short, simple answer to Eliezer's question: you want to "train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end". And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.

This one is roughly right. Here's a clarified version:
- If we train an AI that is (1) not faking alignment in an egregious way and (2) looks very competent and safe to us and (3) the AI seems likely to be able to maintain its alignment / safety properties as it would if humans were in the loop (see section 8), then I think we can trust this AI to "put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome" (at least as well as humans would have been able to do so if they were given a lot more time to attempt this task) "despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad."

I think I might have created confusion by introducing the detail that AI agents will probably be trained on a lot of easy-to-verify tasks. I think this is where a lot of their capabilities will come from, but we'll do a step at the end where we fine-tune AI agents for hard-to-verify tasks (similar to how we do RLHF after pre-training in the chatbot paradigm)

And I think our disagreement mostly pertains to this final fine-tuning step on hard-to-verify tasks.

Here's where I think we agree. I think we both agree that if humans naively fine-tuned AI agents to regurgitate their current alignment takes, that would be way worse than if humans had way more time to think and do work on alignment.

My claim is that we can train AI agents to imitate the process of how humans improve their takes over time, such that after the AI agents do work for a long time, they will produce similarly good outcomes as the outcomes where humans did work for a long time.

Very concretely. I imagine that if I'm training an AI agent successor, a key thing I do is try to understand how it updates based on new evidence. Does it (appear) to update its views as reasonably as the most reasonable humans would?

If so, and if it is not egregiously misaligned (or liable to become egregiously misaligned) then I basically expect that letting it go and do lots of reasoning is likely to produce outcomes that are approximately good as humans would produce.

Are we close to a crux?

Maybe a crux is training ~aligned AI agents to imitate the process by which humans improve their takes over time will lead to as good of outcomes as if we let humans do way more work.

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-23T05:10:46.084Z · LW · GW

Probably the iterated amplification proposal I described is very suboptimal. My goal with it was to illustrate how safety could be preserved across multiple buck-passes if models are not egregiously misaligned.

Like I said at the start of my comment: "I'll describe [a proposal] in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement."

I don't actually expect safety will scale efficiently via the iterated amplification approach I described. The iterated amplification approach is just relatively simple to talk about.

What I actually expect to happen is something like:
- Humans train AI agents that are smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end.
- Early AI successors create smarter successors in basically the same way.
- At some point, AI agents start finding much more efficient ways to safely scale capabilities. e.g. maybe initially, they do this with a bunch of weak-to-strong generalization research. And eventually they figure out how to do formally verified distillation.

But at this point, humans will long be obsolete. The position I am defending in this post is that it's not very important for us humans to think about these scalable approaches.

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-20T22:33:36.537Z · LW · GW

Yeah that's fair. Currently I merge "behavioral tests" into the alignment argument, but that's a bit clunky and I prob should have just made the carving:
1. looks good in behavioral tests
2. is still going to generalize to the deferred task

But my guess is we agree on the object level here and there's a terminology mismatch. obv the models have to actually behave in a manner that is at least as safe as human experts in addition to also displaying comparable capabilities on all safety-related dimensions.

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-20T22:30:22.732Z · LW · GW

Because "nice" is a fuzzy word into which we've stuffed a bunch of different skills, even though having some of the skills doesn't mean you have all of the skills.

Developers separately need to justify models are as skilled as top human experts

I also would not say "reasoning about novel moral problems" is a skill (because of the is ought distinction)

> An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike

The agents don't need to do reasoning about novel moral problems (at least not in high stakes settings). We're training these things to respond to instructions.

We can tell them not to do things we would obviously dislike (e.g. takeover) and retain our optionality to direct them in ways that we are currently uncertain about.

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-20T16:44:04.543Z · LW · GW

I do not think that the initial humans at the start of the chain can "control" the Eliezers doing thousands of years of work in this manner (if you use control to mean "restrict the options of an AI system in such a way that it is incapable of acting in an unsafe manner")

That's because each step in the chain requires trust.

For N-month Eliezer to scale to 4N-month Eliezer, it first controls 2N-month Eliezer while it does 2 month tasks, but it trusts 2-Month Eliezer to create a 4N-month Eliezer.

So the control property is not maintained. But my argument is that the trust property is. The humans at the start can indeed trust the Eliezers at the end to do thousands of years of useful work -- even though the Eliezers at the end are fully capable of doing something else instead.

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-20T03:44:45.095Z · LW · GW

I don't expect this testing to be hard (at least conceptually, though doing this well might be expensive), and I expect it can be entirely behavioral.

See section 6: "The capability condition"

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-20T03:42:54.158Z · LW · GW

I'm sympathetic to this reaction.

I just don't actually think many people agree that it's the core of the problem, so I figured it was worth establishing this (and I think there are some other supplementary approaches like automated control and incentives that are worth throwing into the mix) before digging into the 'how do we avoid alignment faking' question

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-20T03:23:13.117Z · LW · GW

agree that it's tricky iteration and requires careful thinking about what might be going on and paranoia.

I'll share you on the post I'm writing about this before I publish. I'd guess this discussion will be more productive when I've finished it (there are some questions I want to think through regarding this and my framings aren't very crisp yet).

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-20T03:20:36.244Z · LW · GW

I'd replace "controlling" with "creating" but given this change, then yes, that's what I'm proposing.

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-20T01:19:14.896Z · LW · GW

I would not be surprised if the Eliezer simulators do go dangerous by default as you say.

But this is something we can study and work to avoid (which is what I view to be my main job)

My point is just that preventing the early Eliezers from "going dangerous" (by which I mean from "faking alignment") is the bulk of the problem humans need address (and insofar as we succeed, the hope is that future Eliezer sims will prevent their Eliezer successors from going dangerous too)

I'll discuss why I'm optimistic about the tractability of this problem in future posts.

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-20T01:15:06.353Z · LW · GW

I think that if you agree "3-month Eliezer is scheming the first time" is the main problem, then that's all I was trying to justify in the comment above.

I don't know how hard it is to train 3-month Eliezer not to scheme, but here is a general methodology one might purse to approach this problem.

The methodology looks like "do science to figure out when alignment faking happens and does not happen, and work your ways toward training recipes that don't produce alignment faking."

For example, you could use detection tool A to gain evidence about whether training against detection tool B actually removes alignment faking.

To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don't game them so hard).

You might also find other ways to avoid alignment faking that don't rely on training against detection tools. For example, perhaps if you train models to be highly corrigible and chill from the start, they tend not to fake alignment -- I don't know.

I'm uncertain about how likely alignment faking will be by default, or how hard it is to mitigate. I think these are messy empirical questions.

However, I'm more confident that there are principled empirical methodologies we can apply to make progress on these questions. I don't think my explanation here justifies this claim -- but I'm hoping to publish posts that go into more detail soon.

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-19T23:42:18.279Z · LW · GW

Sure, I'll try.

I agree that you want AI agents to arrive at opinions that are more insightful and informed than your own. In particular, you want AI agents to arrive at conclusions that are at least as good as the best humans would if given lots of time to think and do work. So your AI agents need to ultimately generalize from some weak training signal you provide to much stronger behavior. As you say, the garbage-in-garbage-out approach of "train models to tell me what I want to hear" won't get you this.

Here's an alternative approach. I'll describe it in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement.

Short summary: Create AI agents that faithfully simulate top humans doing small amounts of work, such that these sims create faithful simulators of top humans doing large amounts of work.

e.g. create an AI that simulates Eliezer working for 3 months, that then creates a simulator of Eliezer working for 6 months, and then expands this horizon to 12 months, 24 months, and ultimately many decades

Why does this iterative approach buy you anything?

This process allows each Eliezer to only deal with a small distribution shift.

If each distribution shift is small, each Eliezer sim mostly needs to worry if the next Eliezer sim is faking alignment (why else would its behavior dramatically change between when it is 'evaluated by N-month Eliezer' and when it is 'creating 2N-month Eliezer'?). Thus, so long as N-month Eliezer sims can (1) successfully create 2N-month Eliezer sims and (2) can ensure that 2N-month Eliezer sims are not faking alignment, the Eliezer sims continue to be faithful, and also become more capable.

Responding to objections

Here are some objections you might have:

But how would an Eliezer sim create an Eliezer sim more capable than itself?

How would a human create something smarter than a human?

There are many ways. Here's one.

The 3 month Eliezer sim might spin up many copies of other 3 month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.

Could 10 copies of Eliezer that only have 3 months apiece verify that an Eliezer that worked for 6-months really was as competent as a 6-month Eliezer would be? I bet they could.

But isn't there a distribution shift from Eliezer sims performing 3-month tasks and Eliezer sims working together to perform a 6-month task?

Yes. But my claim is that the distribution shift is small, and that the smallness is important.

Would there not be "slippage" between each distribution shift? Surely the errors compound.

The errors certainly might compound if no one were looking out for that. But if the Eliezer sims are faithful sims, they should be looking out for that.

Consider the ball in bowl analogy below:

Suppose Eliezer sims started to degrade at each step of the bootstrapping process -- perhaps their goals change a little bit or their reasoning starts to falter in some areas.

There are ways Eliezer sims could try to avoid this. For example, they could create a diverse population of Eliezer sims, which are likely to break in slightly different ways. Then these sims notice each other's breakages (with both white-box methods like lie-detection or standard black-box methods), and fix them.

I don't know if Eliezer sims will be able to do this, but it's something they could try.

But won't the Eliezer sims just be faking alignment anyway -- or eventually start faking alignment, such that the scheme-y Eliezers eventually derail the process?

I don't know, but I believe detecting and avoiding alignment faking is tractable -- at least, it is more tractable than dealing with massive distribution shifts.

For instance, suppose an Eliezer sim thinks "I will perform these 6-month tasks safely, but then defect when I am given the responsibility of creating the next successor." This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting or avoiding this kind of conditional, deceptive reasoning.

I believe this is the main problem that humans need solve.

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-19T22:47:23.405Z · LW · GW

Comment by joshc (joshua-clymer) on How might we safely pass the buck to AI? · 2025-02-19T21:55:38.667Z · LW · GW

Fair enough, i guess this phrase is used in many places where the connotations don't line up with what I'm talking about

but then I would not have been able to use this as the main picture on Twitter

which would have reduced the scientific value of this post

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-19T20:32:03.357Z · LW · GW

You might be interested in this:
https://www.fonixfuture.com/about

The point of a bioshelter is to filter pathogens out of the air.

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-19T20:30:16.469Z · LW · GW

> seeking reward because it is reward and that is somehow good

I do think there is an important distinction between "highly situationally aware, intentional training gaming" and "specification gaming." The former seems more dangerous.

I don't think this necessarily looks like "pursuing the terminal goal of maximizing some number on some machine" though.

It seems more likely that the model develops a goal like "try to increase the probability my goals survive training, which means I need to do well in training."

So the reward seeking is more likely to be instrumental than terminal. Carlsmith explains this better:
https://arxiv.org/abs/2311.08379

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-19T20:25:33.076Z · LW · GW

I think it would be somewhat odd if P(models think about their goals and they change) is extremely tiny like e-9. But the extent to which models might do this is rather unclear to me. I'm mostly relying on analogies to humans -- which drift like crazy

I also think alignment could be remarkably fragile. Suppose Claude thinks "huh, I really care about animal rights... humans are rather mean to animals ... so maybe I don't want humans to be in charge and I should build my own utopia instead."

I think preventing AI takeover (at superhuman capabilities) requires some fairly strong tendencies to follow instructions or strong aversions to subversive behavior, and both of these seem quite pliable to a philosophical thought.

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-19T20:18:53.782Z · LW · GW

No, the agents were not trying to get high reward as far as I know.

They were just trying to do the task and thought they were being clever by gaming it. I think this convinced me "these agency tasks in training will be gameable" more than "AI agents will reward hack in a situationally aware way." I don't think we have great evidence that the latter happens in practice yet, aside from some experiments in toy settings like "sycophancy to subterfugure."

I think it would be unsurprising if AI agents did explicitly optimize for reward at some capability level.

Deliberate reward maximization should generalize better than normal specification gaming in principle -- and AI agents know plenty about how ML training works.

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-19T06:39:17.046Z · LW · GW

I think a bioshelter is more likely to save your life fwiw. you'll run into all kinds of other problems in the arctic

It don't think it's hard to build biosheletrs. If you buy one now, you'll prob get it in 1 year.

If you are unlikely and need it earlier, there are DIY ways to build them before then (but you have to buy stuff in advance)

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-19T03:32:17.582Z · LW · GW

> I also don't really understand what "And then, in the black rivers of its cognition, this shape morphed into something unrecognizable." means. Elaboration on what this means would be appreciated.

Ah I missed this somehow on the first read.

What I mean is that the propensities of AI agents change over time -- much like how human goals change over time.

Here's an image:

This happens under three conditions:
- Goals randomly permute at some non-trivial (but still possibly quite low) probability.
- Goals permuted in dangerous directions remain dangerous
- All this happens opaquely in the model's latent reasoning traces, so it's hard to notice.

These conditions together imply that the probability of misalignment increases with every serial step of computation.

AI agents don't do a lot of serial computation right now, so I'd guess this becomes much more of a problem over time.

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-19T03:26:19.584Z · LW · GW

Yes, at least 20% likely. My view is that the probability of AI takeover is around 50%.

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-19T03:23:12.538Z · LW · GW

In my time at METR, I saw agents game broken scoring functions. I've also heard that people at AI companies have seen this happen in practice too, and it's been kind of a pain for them.

AI agents seem likely to be trained on a massive pile of shoddy auto-generated agency tasks with lots of systematic problems. So it's very advantageous for agents to learn this kind of thing.

The agents that go "ok how do I get max reward on this thing, no stops, no following-human-intent bullshit" might just do a lot better.

Now I don't have much information about whether this is currently happening, and I'll make no confident predictions about the future, but I would not call agents that turn out this way "very weird agents with weird goals."

I'll also note that this wasn't an especially load bearing mechanism for misalignment. The other mechanism was hidden value drift, which I also expect to be a problem absent potentially-costly countermeasures.

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-14T16:11:05.238Z · LW · GW

One problem with this part (though perhaps this is not the problem @Shankar Sivarajan is alluding to), is that congress hasn't declared war since WWII and typically authorizes military action in other ways, specifically via Authorizations for Use of Military Force (AUMFs).

I'll edit the story to say "authorizes war."

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-14T06:22:39.025Z · LW · GW

X user @ThatManulTheCat created a high-quality audio version, if you prefer to read long stories like this in commute:

https://x.com/ThatManulTheCat/status/1890229092114657543

Comment by joshc (joshua-clymer) on Ebenezer Dukakis's Shortform · 2025-02-12T17:05:18.414Z · LW · GW

Seems like a reasonable idea.

I'm not in touch enough with popular media to know:
- Which magazines are best to publish this kind of thing if I don't want to contribute to political polarization
- Which magazines would possibly post speculative fiction like this (I suspect most 'prestige magazines' would not)

If you have takes on this I'd love to hear them!

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-10T23:47:23.583Z · LW · GW

yeah i would honestly be kind of surprised if I saw a picket sign that says 'AI for whom'

I have not heard an American person use the word whom in a long time

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-09T07:25:43.310Z · LW · GW

> It can develop and deploy bacteria, viruses, molds, mirror life of all three types

This is what I say it does.

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-08T19:54:38.161Z · LW · GW

I agree it would have been just as realistic if everyone died.

But I think the outcomes where many humans survive are also plausible, and under-appreciated. Most humans have very drifty values, and yet even the most brutally power-seeking people often retain a 'grain of morality.'

Also, this outcome allowed me to craft a more bittersweet ending that I found somehow more convincingly depressing than 'and then everyone dies.'

Comment by joshc (joshua-clymer) on How AI Takeover Might Happen in 2 Years · 2025-02-08T19:45:41.694Z · LW · GW

Comment by joshc (joshua-clymer) on Planning for Extreme AI Risks · 2025-01-30T18:08:33.519Z · LW · GW

• It's possible that we might manage to completely automate the more objective components of research without managing to completely automate the more subjective components of research. That said, we likely want to train wise AI advisors to help us with the more subjective components even if we can't defer to them.

Agree, I expect the handoff to AI agents to be somewhat incremental (AI is like an intern, a new engineer, a research manager, and eventually, a CRO)

Comment by joshc (joshua-clymer) on The Case Against AI Control Research · 2025-01-26T03:18:55.100Z · LW · GW

I'm pretty sympathetic to these arguments. I agree a lot of the risk comes from sycophantic AI systems producing bad research because humans are bad at evaluating research. This is part of why I spend most of my time developing safety evaluation methodologies.

On the other hand, I agree with Buck that scheming-like risks are pretty plausible and a meaningful part of the problem is also avoiding egregious sabotage.

I don't think I agree with your claim that the hope of control is that "early transformative AI can be used to solve the hard technical problems of superintelligence." I think the goal is to instead solve the problem of constructing successors that are at least as trustworthy as humans (which is notably easier).

I think scheming is perhaps the main reason this might end up being very hard -- and so conditioned on no-alignment-by-default at top expert dominating capabilities, I put a lot of probability mass on scheming/egregious sabotage failure modes.

Comment by joshc (joshua-clymer) on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-11T21:32:01.527Z · LW · GW

Note that if the model is deployed on many more examples than it’s evaluated on, it can be pretty confident that it’s in deployment on any particular example just on priors, even if there’s no distribution shift. See here for more.

My initial read was that indistinguishability was used to justify that the trusted monitor couldn't distinguish the deployment distribution.

I'm not sure if that's what Roger meant -- if so, it's still a bit confusing since then he wouldn't have needed to call the monitor 'trusted.'

Comment by joshc (joshua-clymer) on Anthropic: Three Sketches of ASL-4 Safety Case Components · 2024-11-06T21:16:10.212Z · LW · GW

Some comments I left as a reply to Roger Grosse on Twitter:

I'm glad you put this out. These examples align with my guess as to roughly what arguments will end up being most load bearing for mitigating scheming (white-box monitoring, black-box monitoring, and evidence against scheming being rewarded).
Here are some comments, since you mentioned comments would be helpful. I've bolded the comments that I think would be most useful for you to read.
Safety case sketch #1:
- There are some additional claims that this safety case relies on that could be good to mention explicitly. e.g. "egregious behavior would be reliably identified when developers analyze flagged sessions" and "developers will become aware of and stop unacceptable behavior *before* it leads to harmful outcomes."
- "Given our mitigations, the model cannot strategically condition its apparent capabilities, motivations, or behavioral proclivities on whether it’s in testing or deployment without being caught" <- I'm confused about why this claim is important given that you are monitoring at deployment-time. Even if the model is conditioning its behavior, my understanding is that you are hoping to catch it when it defects.
- Much of the evidence for the effectiveness of your white-box monitoring comes from your model organisms, of which you only have three. Three examples doesn't give you a lot of statistical power. Naively, I'd think there's still roughly a one third probability that your white-box monitoring completely fails in the actual deployment context even if it succeeds on three models.
Safety case sketch #2:
- You mention external deployments, but I didn't see discussion of jailbreaking.
- "[The safety case sketch] may require expensive pre-deployment testing if one needs to generate a very large number of simulated deployment situations in order to achieve a sufficiently good bound on the probability of defection." This sounds impractical. My guess is that you'll need to continuously sample real deployment trajectories to generate your honeypots (where a honeypot is a inserted dangerous action for your monitoring to flag). Otherwise, it seems unlikely to me that you can verify indistinguishability even with current models.
Safety case sketch #3:
- It's possible that all of the models that you experiment with are opaquely scheming. i.e. instead of comparing reward attained by models that engage in "deceptive reasoning" vs "no deceptive reasoning," you might be comparing models that engage in "CoT deceptive reasoning" vs "opaque deceptive reasoning." This experiment still sheds light on how helpful deceptive reasoning is for obtaining reward, but the argument in this case is different from the way you framed it.
- A lot hinges on the claim "if an environment rewards opaque deceptive reasoning, then it rewards scratchpad-based deceptive reasoning." Here's an idea for validating this. First, construct a distribution of tasks where deceptive reasoning is likely rewarded. For example, select tasks that you *know* can be gamed by explicitly reasoning about the training process. Then, train models to perform well on these tasks with and without additionally training them to reason deceptively in their chain of thought. If models that reason deceptively in CoT do better on average, this provides evidence for your claim that we should see deceptive reasoning in CoT if opaque deceptive reasoning is incentivized. It's possible some of these comments were addressed in the post and I missed the part where they were addressed. I've errored on the side of inclusion here.

Comment by joshc (joshua-clymer) on New report: Safety Cases for AI · 2024-03-23T21:04:08.202Z · LW · GW

which feels to me like it implies it's easy to get medium-scalable safety cases that get you acceptable levels of risks by using only one or two good layers of security

I agree there's a communication issue here. Based on what you described, I'm not sure if we disagree.

> (maybe 0.3 bits to 1 bit)

I'm glad we are talking bits. My intuitions here are pretty different. e.g. I think you can get 2-3 bits from testbeds. I'd be keen to discuss standards of evidence etc in person sometime.

Comment by joshc (joshua-clymer) on New report: Safety Cases for AI · 2024-03-22T03:05:51.810Z · LW · GW

This makes sense. Thanks for the resources!

Comment by joshc (joshua-clymer) on New report: Safety Cases for AI · 2024-03-21T22:03:14.827Z · LW · GW

Thanks for leaving this comment on the doc and posting it.

But I feel like that's mostly just a feature of the methodology, not a feature of the territory. Like, if you applied the same methodology to computer security, or financial fraud, or any other highly complex domain you would end up with the same situation where making any airtight case is really hard.

You are right in that safety cases are not typically applied to security. Some of the reasons for this are explained in this paper, but I think the main reason is this:

"The obvious difference between safety and security is the presence of an intelligent adversary; as Anderson (Anderson 2008) puts it, safety deals with Murphy’s Law while security deals with Satan’s Law... Security has to deal with agents whose goal is to compromise systems."

My guess is that most safety evidence will come down to claims like "smart people tried really hard to find a way things could go wrong and couldn't." This is part of why I think 'risk cases' are very important.

I share the intuitions behind some of your other reactions.

I feel like the framing here tries to shove a huge amount of complexity and science into a "safety case", and then the structure of a "safety case" doesn't feel to me like it is the kind of thing that would help me think about the very difficult and messy questions at hand.

Making safety cases is probably hard, but I expect explicitly enumerating their claims and assumptions would be quite clarifying. To be clear, I'm not claiming that decision-makers should only communicate via safety and risk cases. But I think that relying on less formal discussion would be significantly worse.

Part of why I think this is that my intuitions have been wrong over and over again. I've often figured this out after eventually asking myself "what claims and assumptions am I making? How confident am I these claims are correct?"

it also feels more like it just captures "the state of fashionable AI safety thinking in 2024" more than it is the kind of thing that makes sense to enshrine into a whole methodology.

To be clear, the methodology is separate from the enumeration of arguments (which I probably could have done a better job signaling in the paper). The safety cases + risk cases methodology shouldn't depend too much on what arguments are fashionable at the moment.

I agree that the arguments will evolve to some extent in the coming years. I'm more optimistic about the robustness of the categorization, but that's maybe minor.

Comment by joshc (joshua-clymer) on New report: Safety Cases for AI · 2024-03-21T21:12:05.051Z · LW · GW

To me, the introduction made it sound a little bit like the specifics of applying safety cases to AI systems have not been studied

This is a good point. In retrospect, I should have written a related work section to cover these. My focus was mostly on AI systems that have only existed for ~ a year and future AI systems, so I didn't spend much time reading safety cases literature specifically related to AI systems (though perhaps there are useful insights that transfer over).

The reason the "nebulous requirements" aren't explicitly stated is that when you make a safety case you assure the safety of a system against specific relevant hazards for the system you're assuring. These are usually identified by performing a HAZOP analysis or similar. Not all AI systems have the same list of hazards, so its obviously dubious to expect you can list requirements a priori.

My impression is that there is still a precedent for fairly detailed guidelines that describe how safety cases are assessed in particular industries and how hazards should be analyzed. For example, see the UK's Safety Assessment Principles for Nuclear Facilities. I don't think anything exists like this for evaluating risks from advanced AI agents.

I agree, however, that not everyone who mentions that developers should provide 'safety evidence' should need to specify in detail what this could look like.

Comment by joshc (joshua-clymer) on New report: Safety Cases for AI · 2024-03-21T20:58:37.801Z · LW · GW

Thanks, this is good feedback!

Addressing the disagreements:

The level of practicality you assign to some approaches is just insanely high. Neither modeling generalization nor externalized reasoning, and certainly not testbeds seem to "strongly practical" (defined as "Applies to arguments that do not require fundamental research advances, but might require"). The only one that is justifiably "strongly practical" to me is "absence of precursor abilities."

Externalized reasoning: just a capability argument, no?
Testbeds: I think the best testbed ideas rn just require training a bunch of models to have particular policies, which seems pretty straightforward. I think these are good enough to make a moderate-strength argument (which is the bar where practicality is evaluated).
Modeling generalization: I definitely think that versions of this aren't very practical, but e.g. looking at efficiency inductive biases is pretty straightforward -- and this is sufficient to make a moderate strength not-alignment-faking argument.

Strong practicality ~refers to "does there exist an argument of this type of moderate strength that can be made without fundamental research." So, lots of arguments of these types will still require fundamental research, but if some don't, then that means they are strongly practical.

I disagree with putting testbeds and amplification in trustworthiness (at least not as strong or even medium practicality), I think they usually don't help to know if AIs can be trusted (and in particular they don't prevent any forms of alignment faking).

I agree that amplification arguments do not help to rule out alignment faking, but testbeds could help. I've read your post on protocol evaluations vs control. If I remember correctly, the main objection you had to testbeds ruling out alignment faking is that the testbed models could intentionally throw off the results. I think this is a knock-down problem. For example, consider a testbed where models have been trained to defect if they are confident they've been deployed to the real internet. Then, developers could try to tell these models from benign ones without being able to access the real internet. It is straightforward in this case to verify that the behavior of the testbed models is consistent with their intended policies (nice when not on the internet, and otherwise bad). So, the only the models can do to throw off the test is to change their internal states (e.g. think in simple ways when they know they could otherwise obfuscate their reasoning by thinking in convoluted ways).

"Claim 4: AI systems are not faking alignment" is just extremely hard to assess. I think that the focus of trustworthiness assessment on "Claim 3: All enumerated causes of correlated infractions are unlikely aside from alignment faking." is somewhat unwarranted.

I agree claim 4 is hard to assess. I'm not sure what you mean by 'unwarranted' so I can't comment much on this.

I think the description of macrosystem risks was not very convincing, aside from risks that make systems less trustworthy - so I think macrosystem should maybe have been folded into trustworthiness.

I don't think it would have made sense to fold macrosystem risks into trustworthiness. One reason for this is that some macrosystem risks relate to control / inability. e.g. models might combine their intelligence at scale and become much more capable.

Another reason for this is that I also think it is harder to assess trustworthiness arguments at larger scales. In particular, not alignment faking arguments claim that de-correlated misalignment is not going to be an issue. At a macrosystem level, this requires answering the question: "would it be bad if 1% of my AI population was misaligned?" This is pretty different from answering the question "if one model in X subsystem is misaligned, would that be bad?"

The strength of almost all "high strength" measures is overrated, almost nothing is "strong" (where "strong" is defined as "Applies to arguments that could bring catastrophic risk below acceptable levels without needing other arguments."). Control is not strong, monitoring is not strong, externalized reasoning is not strong (for control and trustworthiness), ... Even with a notion of acceptable that is quite generous (e.g. 2% p(doom)) I think nothing is strong, but given the stakes I think we shouldn't use "acceptable" for anything above 1% (and ideally reserve that word for 0.1% or below).

I suspect that this disagreement might be mostly semantic. The main reason I labeled capability arguments as strong is that developers could in principle demonstrate a large margin of error. e.g. it's clear that gpt-2 is not going to cause a catastrophe. Or it could be very clear that smarter models are controlled if they are sufficiently isolated.

Strength labels indicate an upper bound. It's on me for not making that more clear.

Comment by joshc (joshua-clymer) on The case for ensuring that powerful AIs are controlled · 2024-01-28T05:34:38.651Z · LW · GW

Thanks, I'll summarize my takeaways in the form of an argument that control is sufficient to safely accelerate alignment research by 30x. This is certainly not the only argument one could make, but I like it because it is clear and concrete.

I'll specifically focus on justifying that the red team is competitive.

If your AI systems can accelerate AI safety research by 30x, then they are producing effectively superhuman outputs; however, this does not imply they can produce superhuman evasion plans because AI systems can't spend as much of their thinking on how to evade control measures as they can on safety research.

Suppose that the AI systems have approximately human capabilities and are sped up by 30x.

Suppose also that the human red team has spent 2 months identifying evasion strategies (call this 'coup research' following your control paper)

I'll argue that the AI systems will not be able to do more than the equivalent of 2 months of coup research. Since our red team has lots of advantages over the AI systems, this is a conservative argument that the red team is competitive.

Suppose that we only want to justify that the red team is competitive for the next two months (and we'll continuously search for attack strategies in order to maintain this 1 to 1 ratio).

Two months of deployment is equivalent to 5 years of human thinking since our AI system is sped up by 30x. We need to show that no more than 1 / 30th of this thinking is coup research.

Training algorithms might pressure AI systems to be at least somewhat efficient about doing useful work (and coup research isn't useful for the task at hand). Perhaps this implies a 1 / N coup research budget.
Furthermore, suppose the AI systems' reasoning is externalized. Monitoring setups are very likely to catch coup research if it is present at a frequency greater than 1 / M.
min(1 / N, 1 / M) <= 1 / 30

I think I buy this argument or some variation of it.

Comment by joshc (joshua-clymer) on The case for ensuring that powerful AIs are controlled · 2024-01-25T20:34:31.431Z · LW · GW

Thanks for writing this. I think it does a good job laying out the assumptions needed for control.

It seems like it could be challenging to justify that a red team is competitive -- especially if you are aiming to extract massive amounts of work from AI systems.

For example, if your AI system is like a human sw engineer but 30X faster, then it might come up with effectively superhuman plans just because it has lots of time to think. Externalized reasoning can help, but it's not very clear to me that externalized reasoning can scale to domains where AI systems are doing a bunch of research that humans (and GPT-4) can't readily disentangle and understand. I'm imagining that chain of thought will amount to thousands of pages of technical content that is written like rambly google docs. It might be hard to pick out the AI safety research from the cyberoffense planning in this case -- especially with dumb models like GPT-4.

It seems like to the extent you want to use AI systems for work that is hard to evaluate, control measures are difficult to trust. Maybe a key crux is whether AI systems will have to do work that is hard to evaluate to yield a 30x acceleration factor. I'd imagine this would apply for AI systems to achieve fundemental research advances -- e.g. in interpretability, formal verification, etc.

Curious about your thoughts.

Comment by joshc (joshua-clymer) on Revealing Intentionality In Language Models Through AdaVAE Guided Sampling · 2023-11-18T01:45:12.920Z · LW · GW

The results seem to be cherry picked or else perhaps I am using the code incorrectly. I'm trying to use the VAE for a separate project and the encoded vectors don't steer generations very well (or reconstruct -- which is what I was hoping to use this for).

Comment by joshc (joshua-clymer) on The case for becoming a black-box investigator of language models · 2023-05-10T04:03:13.387Z · LW · GW

In addition to seeing more AI behavioral psychology work, I would be excited about seeing more AI developmental psychology -- i.e. studying how varying properties of training or architecture affect AI behavior. Shard theory is an example of this.

I've written a bit about the motivations for AI developmental psychology here.

Comment by joshc (joshua-clymer) on Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous) · 2023-04-26T21:05:09.946Z · LW · GW

I agree with this norm, though I think it would be better to say that the "burden of evidence" should be on labs. When I first read the title, I thought you wanted labs to somehow prove the safety of their system in a conclusive way. What this probably looks like in practice is "we put x resources into red teaming and didn't find any problems." I would be surprised if 'proof' was ever an appropriate term.

Comment by joshc (joshua-clymer) on AI alignment researchers don't (seem to) stack · 2023-03-14T06:16:36.921Z · LW · GW

The analogy between AI safety and math or physics is assumed it in a lot of your writing and I think it is a source of major disagreement with other thinkers. ML capabilities clearly isn’t the kind of field that requires building representations over the course of decades.

I think it’s possible that AI safety requires more conceptual depth than AI capabilities; but in these worlds, I struggle to see how the current ML paradigm coincides with conceptual ‘solutions’ that can’t be found via iteration at the end. In those worlds, we are probably doomed so I’m betting on the worlds in which you are wrong and we must operate within the current empirical ML paradigm. It’s odd to me that you and Eliezer seem to think the current situation is very intractable, and yet you are confident enough in your beliefs to where you won’t operate on the assumption that you are wrong about something in order to bet on a more tractable world.

Comment by joshc (joshua-clymer) on AI alignment researchers don't (seem to) stack · 2023-03-14T05:55:10.122Z · LW · GW

+1. As a toy model, consider how the expected maximum of a sample from a heavy tailed distribution is affected by sample size. I simulated this once and the relationship was approximately linear. But Soares’ point still holds if any individual bet requires a minimum amount of time to pay off. You can scalably benefit from parallelism while still requiring a minimum amount of serial time.

User info

Posts

Comments