Posts

Ablations for “Frontier Models are Capable of In-context Scheming” 2024-12-17T23:58:19.222Z
Frontier Models are Capable of In-context Scheming 2024-12-05T22:11:17.320Z

Comments

Comment by Bronson Schoen (bronson-schoen) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T10:36:50.449Z · LW · GW

All four of those I think are basically useless in practice for purposes of progress toward aligning significantly-smarter-than-human AGI, including indirectly (e.g. via outsourcing alignment research to AI).

It’s difficult for me to understand how this could be “basically useless in practice” for:

scalable oversight (using humans, and possibly giving them a leg up with e.g. secret communication channels between them, and rotating in different humans when we need to simulate amnesia) - can we patch all of the problems with e.g. debate? Can we extract higher-quality work out of real-life misaligned expert humans for practical purposes (even if it's maybe a bit cost-uncompetitive)?

It seems to me you’d want to understand, and convincingly show, how and why different approaches here fail; and in any world where you rely on something like “outsourcing alignment research,” you want some form of oversight.

Comment by Bronson Schoen (bronson-schoen) on Fabien's Shortform · 2024-12-29T10:19:34.044Z · LW · GW

The connotation is that propagandists over the years have correctly realized that presenting empirical findings is not a very effective way to convince people of things

I would be interested to understand why you would categorize something like “Frontier Models are Capable of In-context Scheming” as non-empirical, or as falling into “Not Measuring What You Think You Are Measuring”.

Comment by Bronson Schoen (bronson-schoen) on A shortcoming of concrete demonstrations as AGI risk advocacy · 2024-12-12T00:27:05.293Z · LW · GW

I am more optimistic that we can get such empirical evidence for at least the most important parts of the AI risk case, like deceptive alignment, and here's one reason on offer, as a comment:

Can you elaborate on what you were pointing to in the linked example? I’ve seen a few people mention that thread recently, but I seem to be missing the conclusion they’re drawing from it.

Comment by Bronson Schoen (bronson-schoen) on A shortcoming of concrete demonstrations as AGI risk advocacy · 2024-12-11T23:55:48.645Z · LW · GW

crisis-mongering about risk when there is no demonstration/empirical evidence to ruin the initially perfect world pretty immediately

I think the key point of this post is precisely the question: “Is there any such demonstration, short of the actual very bad thing happening in a real setting, that people who discount these as serious risks would accept as empirical evidence worth updating on?”