When AI companies have human-level AI systems, will they use them for alignment research, or will they use them (mostly) to advance capabilities instead?
It's not clear this is a crux for the plan of automating alignment research to work out.
In particular, suppose an AI company currently spends 5% of its resources on alignment research and will continue spending 5% when they have human-level systems. You might think this suffices for alignment to keep pace with capabilities, since the alignment labor force will get more powerful as alignment gets more difficult (and more important) due to higher levels of capability.
This doesn't mean the plan will necessarily work; it depends on the relative difficulty of advancing capabilities vs. alignment. I'd naively guess that the probability of success just keeps going up the more resources you use for alignment.
There are some reasons for thinking automation of labor is particularly compelling in the alignment case relative to the capabilities case:
- There might be scalable solutions to alignment which effectively indefinitely resolve the research problem, while I expect that capabilities looks more like continuously making better and better algorithms.
- Safety research might benefit relatively more from labor (rather than compute) when compared to capabilities. Two reasons for this:
- Safety currently seems relatively more labor bottlenecked.
- We can in principle solve a large fraction of safety/alignment with fully theoretical safety research without any compute, while it seems harder to do purely theoretical capabilities research.
I do think that pausing further capabilities once we have human-ish-level AIs for even just a few years while we focus on safety would massively improve the situation. This currently seems unlikely to happen.
Another way to put this is that automating alignment research is a response in the following dialogue:
Bob: We won't have enough time to solve alignment because AI takeoff will go very fast due to AIs automating AI R&D (and AI labor generally accelerating AI progress through other mechanisms).
Alice: Actually, as AIs are accelerating AI R&D, they could also be accelerating alignment work, so it's not clear that accelerating AI progress due to AI R&D acceleration makes the situation very different. As AI progress speeds up, alignment progress might speed up by a similar amount. Or it could speed up by a greater amount due to compute bottlenecks hitting capabilities harder.
My sense is that Jaime's view (and Epoch's view more generally) is more like: "making people better informed about AI in a way that is useful to them seems heuristically good (given that AI is a big deal); it doesn't seem that useful or important to have a very specific theory of change beyond this". From this perspective, saying "concerns about existential risk from AI are not among the primary motivations" is perhaps slightly confused, as the heuristic isn't necessarily backchained from any more specific justification. Like there is no specific terminal motivation.
Like consider someone who donates to GiveDirectly due to "idk, seems heuristically good to empower the worst off people" and someone who generally funds global health and wellbeing due to specifically caring about ongoing human welfare (putting aside AI for now). This heuristic is partially motivated via flow-through from caring about something like welfare even though it doesn't directly show up. These people seem like natural allies to me except in surprising circumstances (e.g., it turns out the worst off people use marginal money/power in a way that is net negative for human welfare).
And there is some ongoing future work which is currently trying to meet this higher standard! (Or at least something similar to it.)
Because there might be some other programs with a lot of computational resources which scan through simple universes looking for programs to run?
More like: because there is some chance that the actual laws of physics in our universe execute data as code sometimes (similar to how memory unsafe programs often do this). (And this is either epiphenomenal or it just hasn't happened yet.) While the chance is extremely small, the chance could be high enough (equivalently, universes with this property have high enough measure) and the upside could be so much higher that betting on this beats other options.
This was written in response to a conversation I had with Vivek Hebbar about some strange implications of UDASSA-like views. In particular:
- Is it better to spend your resources simulating a happy life or writing down programs which would simulate a bunch of happy lives? Naively the second feels worthless, but there seemingly are some unintuitive reasons why clever approaches for doing the latter could be better.
- Alien civilizations might race to the bottom by spending resources making their civilization easier to point at (and thus higher measure in the default UDASSA perspective).
Hebbar's law: Philosophical views always have stranger implications than you expect, even when you take into account Hebbar's Law.
Some reasons for this (that I quickly wrote in response to someone asking a question about this):
- There aren't that many research directions we can do now which plausibly transfer to later, much more powerful AIs, while if we got really good at this, it could transfer. (Up to around or a bit beyond full AI R&D automation maybe?)
- Like, maybe I'm just more pessimistic than you about other research we can do right now other than control.
- I think it is probably hard to get this method working robustly enough to transfer to smart models, but if we got the method to be super robust on current models (AIs can't be trained to distinguish between true and false facts and there are reasonable scaling laws), then this would be a huge update toward it working for smart models. And we can do this research now.
- Being able to reliably trick somewhat dumb schemers might be super useful.
- Generally it seems like the class of methods where we control the situational awareness and understanding of models could be very helpful; it doesn't seem obvious to me that we need high levels of situational awareness, especially for experiments and if we're willing to take a decently big alignment tax hit.
- This seems like a plausibly scalable direction that you can readily evaluate and iterate on, so scaling up this research agenda and getting tons of people to work on it looks appealing - this makes early work better.
I'm not confident in these exact reasons.
This is one of my favorite current research directions.
I buy your arguments for optimism about not needing to simplify/change our goals to compete. (I also think that there are other, stronger reasons to expect we don't need goal simplification, like just keeping humans alive and later giving back the resources, which is quite simple and indirectly points to what humans want. For ultimately launching space probes, I expect the overhead of complex goals is low. There is some complexity hidden in this proposal, but it seems like it should handle this specific goal simplicity concern.)
I don't feel compelled by "the universe is very big" arguments for making cooperation look better for me personally because I put most of the weight on linear returns. A few reasons for this:
- My sense is that we probably trivially saturate altruistic positive values (I want X to happen) which aren't very scope sensitive. This doesn't require any bargaining IMO, it just happens by default due to things like the universe being extremely big (at least Tegmark 3 or whatever).
- I generally find non-linear returns-ish views pretty non-compelling from a direct moral standpoint.
I think o3 maybe does worse on RE-bench than 3.7 Sonnet due to often attempting reward hacking. It could also be noise; it's just a small number of tasks. (Presumably these reward hacks would have worked better in the OpenAI training setup but METR filters them out?) It doesn't attempt reward hacking as much / as aggressively on the rest of METR's tasks, so it does better there and this pulls up the overall horizon length. (I think.)
I think your main point is probably right but was not well argued here.
Fair, I thought that an example would make this sufficiently obvious that it wasn't worth arguing for at length, but I should have spelled it out a bit more.
I think it's also worth saying a thing very directly: just because non-behavioral evidence isn't likely to be widely legible and convincing does not mean it is not useful evidence for those trying to have correct beliefs.
FWIW, I do say this under "These techniques could be quite useful via two mechanisms:".
I think my description is consistent with "some activities on which the SAR is worse" as long as these aren't bottlenecking and it is overall dominating human researchers (as in, adding human researchers is negligible value).
But whatever, you're the author here.
Maybe "Superhuman coder has to dominate all research engineers at all pure research engineering tasks" is too strong though.
Maybe a crux here is how big the speedup is?
What you describe are good reasons why companies are unlikely to want to release this information unilaterally, but from a safety perspective, we should instead consider how imposing such a policy alters the overall landscape.
From this perspective, the main question seems to me to be whether it is plausible that US AI companies would spend more on safety in worlds where other US AI companies are further behind such that having a closer race between different US companies reduces the amount spent on safety. And, how this compares to the chance of this information being helpful in other ways (e.g., making broader groups than just AI companies get involved).
It also seems quite likely to me that in practice people in the industry and investors basically know what is happening, but it is harder to trigger a broader response because without more credible sources you can just dismiss it as hype.
How do you expect to deal with misuse worries? Do you just eat the risk?
The proposal is to use monitoring measures, similar to e.g. constitutional classifiers.
Also, don't we reduce misuse risk a bunch by only deploying to 10k external researchers?
(I'm skeptical of any API misuse concerns at this scale except for bio and maybe advancing capabilities at competitors, but this is a stretch given the limited number of tokens IMO.)
("Has to" is maybe a bit strong, I think I probably should have said "will probably end up needing to be better competitive with the best human experts at basically everything (other than vision) and better at more central AI R&D given the realistic capability profile". I think I generally expect full automation to hit everywhere all around the same time putting aside vision and physical tasks.)
If it is compute efficient according to even Kaplan or Chinchilla scaling laws, please demonstrate that for me.
We only have leaked numbers confirming reasonably efficient training, but GPT-4 is widely believed to be a quite efficient model for its time, and notably wasn't matched by competitors for a while.
Please let me know if you suspect I've over-interpreted that validation.
Slightly? My view is more like:
- For AIs to be superhuman AI researchers, they probably need to match humans at most underlying/fundamental cognitive tasks, including reasonably sample efficient learning. (Or at least learning which is competitive with humans given the AIs' structural advantages.)
- This means they can probably learn how to do arbitrary things pretty quickly and easily.
I think non-ML/software-engineering expertise (that you can't quickly learn on the job) is basically never important in building more generally capable AI systems aside from maybe various things related to acquiring data from humans. (But IMO this won't ultimately be needed.)
Does an SAR have to be superhuman at creative writing, so that it can push forward creative writing capabilities in future models?
Do human ML researchers have to be superhuman at creative writing to push forward creative writing capabilities? I don't particularly think so. Data might need to come from somewhere, but in the vision case, there are plenty of approaches which don't require AIs with superhuman vision.
In the creative writing case, it's a bit messy because the domain is intrinsically subjective. I nonetheless think you could make an AI which is superhuman at creative writing without a good understanding of creative writing, using just the (vast vast) quantity of data we already have on the internet.
R1 can't possibly be below V3 cost because it is inclusive? If I'm not mistaken, R1 is not trained from scratch, but I could be wrong.
Yes, I meant 1/6 additional cost, which is ~negligible.
it does not meet the standard.
Importantly, it is much better than GPT-4 on the relevant downstream tasks.
Just ran the code and it looks like I'm spot on and the median goes to Mar 2031.
For context, my preregistered guess would be that AI R&D speedups along the way to superhuman coder make it come around 1.5x faster, though anything between 1.25x and 2x is consistent with my best guess. (So e.g., rather than the ~2029.0 median on Eli's model, without intermediate AI R&D speedups we'd see around 2031.0 or so. I'd expect a bigger effect on the 10th percentile due to uncertainty.)
Sure. Epoch estimates 2e25 FLOP for GPT-4 and 3.4e24 for DeepSeek-V3. So a bit less than 10x actually (closer to 6x). (And V3 is substantially better.) R1 is around 1/6 of DeepSeek-V3's cost.
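For concreteness, a quick check of the ratio implied by those quoted estimates (this is just arithmetic on the figures above, not a new estimate):

```python
# Ratio implied by the Epoch estimates quoted above.
gpt4_flop = 2e25           # Epoch's estimate for GPT-4 training compute
deepseek_v3_flop = 3.4e24  # Epoch's estimate for DeepSeek-V3 training compute

print(gpt4_flop / deepseek_v3_flop)  # ~5.9, i.e. roughly a 6x compute gap
```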
I said there was no compute acceleration, not that there was no more compute scaling?
(Yes, sorry, edited my original comment to clarify.)
an equal and super-accelerated algorithmic or efficiency term (v_algorithmic) as shown in the code here
I don't think the "AI assisted AI R&D" speed ups along the way to superhuman coder make a huge difference to the bottom line?
In all timelines models presented there is acknowledgement that compute does not accelerate.
When you say "accelerate" do you mean "the rate of compute scaling increases"? I agree they aren't expecting this (and roughly model a fixed rate of compute progress which matches historical trends as this is just an extrapolation). However, note that the superexponentiality in the timelines model isn't based on this, and is instead based on views about the mapping from effective compute to horizon length.
If you just mean "the timelines model assumes no compute scaling", then I think this is a clear no?
The basic timelines model is based on extrapolating out the current trend of AI progress.
Where do you see mention of no further compute scaling in this timelines model: https://ai-2027.com/research/timelines-forecast?
I agree that the takeoff model focuses on the regime without compute scaling as the takeoff occurs over a year which doesn't allow that much compute scaling (though I believe the final takeoff numbers / scenario are accounting for compute scaling).
zero-shot WikiText103 perplexity and 5-shot MMLU
These are somewhat awkward benchmarks because they don't actually measure downstream usefulness at software engineering or AI research. In particular, these tasks might not measure improvements in RL which can have huge effects on usefulness and have seen fast algorithmic progress.
Can we instead use SWE-bench or METR's task suite?
For instance, here is a proposed bet:
GPT-4 was released in March 2023 (2 years ago). So, we'd expect a model which used 10x less FLOP to perform similarly well (or better) on agentic tasks (like SWE-bench or METR's task suite).
Oh wait, there already is such a model! Deepseek-V3 / R1 is IMO clearly better than GPT-4 on these tasks (and other tasks) while using <10x GPT-4 flop and being released within 2 years. So bet resolved?
Edit: more like 6x less flop actually, so this is a bit messy and would need to lean on better performance. People don't seem to bother training compute optimal models with ~10x less FLOP than GPT-4 these days...
Actually, I think Deepseek-V3 also does better than GPT-4 on MMLU, though we can't compare perplexity. So, seems ~resolved either way, at least for progress in the last 2 years and if you're fine with assuming that DeepSeek-V3 isn't rigged or using distillation.
The delta is much more extreme if instead of looking at software engineering you look at competitive programming or math.
This is both projected forward and treated with either 1 (in 45% of cases) or 2 (in all cases) super-exponential terms that make it significantly faster than an inferred 4.6x per year.
Hmm, I think you're looking at the more basic trend extrapolation for the timelines model and assuming that the authors are thinking that this trend extrapolation is purely due to algorithms?
(The authors do model this trend accelerating due to AIs accelerating algorithms, so if the rate of algorithmic progress was much lower, that would make a big difference to the bottom line.)
I do agree that "how fast is algorithmic progress right now" might be a crux and presumably the authors would think differently if they thought algorithmic progress was much faster.
This seems relatively clearly false in the case of competition programming problems. Concretely, o3 with 50 submissions beats o1 with 10k submissions. (And o1 is presumably much better than the underlying instruct model.)
I'd guess this paper doesn't have the actual optimal methods.
Another way to put this disagreement is that you can interpret all of the AI 2027 capability milestones as referring to the capability of the weakest bottlenecking capability, so:
- Superhuman coder has to dominate all research engineers at all pure research engineering tasks. This includes the most bottlenecking capability.
- SAR has to dominate all human researchers, which must include whatever task would otherwise bottleneck.
- SIAR (superintelligent AI researcher) has to be so good at AI research--the gap between SAR and SIAR is 2x the gap between an automated median AGI company researcher and a SAR--that it has this huge 2x gap advantage over the SAR despite the potentially bottlenecking capabilities.
So, I think perhaps what is going on is that you mostly disagree with the human-only, software-only times and are plausibly mostly on board otherwise.
I'm having trouble parsing this sentence
You said "This is valid for activities which benefit from speed and scale. But when output quality is paramount, speed and scale may not always provide much help?". But, when considering activities that aren't bottlenecked on the environment, then to achieve 10x acceleration you just need 10 more speed at the same level of capability. In order for quality to be a crux for a relative speed up, there needs to be some environmental constraint (like you can only run 1 experiment).
Is that a fair statement?
Yep, my sense is that an SAR has to[1] be better than humans at basically everything except vision.
(Given this, I currently expect that SAR comes at basically the same time as "superhuman blind remote worker", at least when putting aside niche expertise which you can't learn without a bunch of interaction with humans or the environment. I don't currently have a strong view on the difficulty of matching human visual abilities, particularly at video processing, but I wouldn't be super surprised if video processing is harder than basically everything else ultimately.)
If "producing better models" (AI R&D) requires more than just narrow "AI research" skills, then either SAR and SAIR need to be defined to cover that broader skill set (in which case, yes, I'd argue that 1.5-10 years is unreasonably short for unaccelerated SC->SAR),
It is defined to cover the broader set? It says "An AI system that can do the job of the best human AI researcher"? (Presumably this is implicitly "any of the best AI researchers", who presumably need to learn misc skills as part of their jobs, etc.) Notably, Superintelligent AI researcher (SIAR) happens after "superhuman remote worker", which requires being able to automate any work a remote worker could do.
I'm guessing your crux is that the time is too short?
"Has to" is maybe a bit strong, I think I probably should have said "will probably end up needing to be better competitive with the best human experts at basically everything (other than vision) and better at more central AI R&D given the realistic capability profile". I think I generally expect full automation to hit everywhere all around the same time putting aside vision and physical tasks. ↩︎
I'm worried that you're missing something important because you mostly argue against large AI R&D multipliers, but you don't spend much time directly referencing compute bottlenecks in your arguments that the forecast is too aggressive.
Consider the case of doing pure math research (which we'll assume for simplicity doesn't benefit from compute at all). If we made emulated versions of the 1000 best math researchers and then made 1 billion copies of each of them which all ran at 1000x speed, I expect we'd get >1000x faster progress. As far as I can tell, the words in your arguments don't particularly apply less to this situation than to the AI R&D situation.
Going through the object-level response for each of these arguments in the case of pure math research, and the correspondence to the AI R&D case:
Simplified Model of AI R&D
Math: Yes, there are many tasks in math R&D, but the 1000 best math researchers could already do them or learn to do them.
AI R&D: By the time you have SAR (superhuman AI researcher), we're assuming the AIs are better than the best human researchers(!), so heterogenous tasks don't matter if you accept the premise of SAR: whatever the humans could have done, the AIs can do better. It does apply to the speed ups at superhuman coders, but I'm not sure this will make a huge difference to the bottom line (and you seem to mostly be referencing later speed ups).
Amdahl's Law
Math: The speed up is near universal because we can do whatever the humans could do.
AI R&D: Again, the SAR is strictly better than humans, so hard-to-automate activities aren't a problem. When we're talking about ~1000x speed up, the authors are imagining AIs which are much smarter than humans at everything and which are running 100x faster than humans at immense scale. So, "hard to automate tasks" is also not relevant.
All this said, compute bottlenecks could be very important here! But the bottlenecking argument must directly reference these compute bottlenecks, and there has to be no way to route around this. My sense is that much better research taste and perfect implementation could make experiments with some fixed amount of compute >100x more useful. To me, this feels like the important question: how much can labor route around compute bottlenecks and utilize compute much more effectively. The naive extrapolation out of the human range makes this look quite aggressive: the median AI company employee is probably 10x worse at using compute than the best, so an AI which is as superhuman as 2x the gap between median and best would naively be 100x better at using compute than the best employee. (Is the research taste ceiling plausibly this high? I currently think extrapolating out another 100x is reasonable given that we don't see things slowing down in the human range as far as we can tell.)
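To spell out that naive extrapolation (the 10x median-to-best gap is the hypothetical figure from the paragraph above, not a measured quantity), being "2x the gap" above the best means being two median-to-best gaps up on a log scale:

```python
# Naive log-scale extrapolation of compute-use effectiveness ("research taste").
median_to_best_gap = 10.0  # hypothetical: the best researcher uses compute ~10x better than the median

# An AI that is "2x the median-to-best gap" above the best is two such gaps up
# in log space, so the factor multiplies twice:
ai_vs_best = median_to_best_gap ** 2
print(ai_vs_best)  # 100.0 -> naively ~100x better compute use than the best employee
```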
Dependence on Narrow Data Sets
This is only applicable to the timeline to the superhuman coder milestone, not to takeoff speeds once we have a superhuman coder. (Or maybe you think a similar argument applies to the time between superhuman coder and SAR.)
Hofstadter's Law As Prior
Math: We're talking about speed up relative to what the human researchers would have done by default, so this just divides both sides equally and cancels out.
AI R&D: The should also just divide both sides. That said, Hofstadter's Law does apply to the human-only, software-only times between milestones. But note that these times are actually quite long! (Maybe you think they are still too short, in which case fair enough.)
Sure, but for output quality better than what humans could (ever) do to matter for the relative speed up, you have to argue about compute bottlenecks, not Amdahl's law for just the automation itself! (As in, if some humans would have done something in 10 years and it doesn't have any environmental bottleneck, then 10x faster emulated humans can do it in 1 year.)
My mental model is that, for some time to come, there will be activities where AIs simply aren't very competent at all,
Notably, SAR is defined as "Superhuman AI researcher (SAR): An AI system that can do the job of the best human AI researcher but faster, and cheaply enough to run lots of copies." So, it is strictly better than the best human researcher(s)! So, your statement might be true, but is irrelevant if we're conditioning on SAR.
It sounds like your actual objection is in the human-only, software-only time from superhuman coder to SAR (you think this would take more than 1.5-10 years).
Or perhaps your objection is that you think there will be a smaller AI R&D multiplier for superhuman coders. (But this isn't relevant once you hit full automation!)
I note that I am confused by this diagram. In particular, the legend indicates a 90th percentile forecast of ">2100" for ASI, but the diagram appears to show the probability dropping to zero around the beginning of 2032.
I think it's just that the tail is very long and flat with <1% per year. So, it looks like it goes to zero, but it stays just above.
Hmm, I think your argument is roughly right, but missing a key detail. In particular, the key aspect of the SARs (and higher levels of capability) is that they can be strictly better than humans at everything while simultaneously being 30x faster and 30x more numerous. (Or, there is 900x more parallel labor, but we can choose to run this as 30x more parallel instances each running 30x faster.)
So, even if these SARs are only slightly better than humans at these 10 activities and these activities don't benefit from parallelization at all, they can still do them 30x faster!
So, progress can actually be accelerated by up to 3000x even if the AIs are only as good as humans at these 10 activities and can't productively dump in more labor.
In practice, I expect that you can often pour more labor into whatever bottlenecks you might have. (And compensate etc as you noted.)
By the time the AIs have a 1000x AI R&D multiplier, they are running at 100x human speed! So, I don't think the argument for "you won't get 1000x uplift" can come down to an Amdahl's law argument about automation itself. It will have to depend on compute bottlenecks.
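A minimal Amdahl's-law sketch of this point, with made-up numbers (the 90/10 time split and the per-activity speedups below are hypothetical, just to show the shape of the calculation):

```python
# Amdahl's law over activities: fractions are shares of the original
# (human-only) time, speedups are how much faster the AIs do each activity.
def overall_speedup(fractions_and_speedups):
    return 1 / sum(f / s for f, s in fractions_and_speedups)

fast = (0.9, 3000)  # activities where the SARs massively dominate humans
slow = (0.1, 30)    # "bottleneck" activities: AIs merely human-level, but 30x faster serially

print(overall_speedup([fast, slow]))
# ~275x. The bottleneck activities cap the multiplier near 30 / 0.1 = 300,
# not at 1 / 0.1 = 10 as a "you can't automate 10% of the work" reading would suggest,
# because even the bottleneck tasks run at the AIs' serial speed advantage.
```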
(My sense is that the progress multipliers in AI 2027 are too high but also that the human-only times between milestones are somewhat too long. On net, this makes me expect somewhat slower takeoff with a substantial chance on much slower takeoff.)
(Yes, I'm aware you meant imprecise probabilities. These aren't probabilities though (in the same sense that a range of numbers isn't a number), e.g., you're unwilling to state a median.)
It's notable that you're just generally arguing against having probabilistic beliefs about events which are unprecedented[1]; nothing here is specific to this case of doing AI forecasting. You're mostly objecting to the idea of having (e.g.) medians on events like this.
Of course, the level of precedentedness is continuous, and from my understanding, forecasters have successfully done OK at predicting increasingly unprecedented events. Maybe your take is that AI is the most unprecedented event anyone has ever tried to predict. This seems maybe plausible. ↩︎
I wouldn't be surprised if 3-5% of questions were mislabeled or impossible to answer, but 25-50%? You're basically saying that HLE is worthless. I'm curious why.
Various people looked at randomly selected questions and found similar numbers.
(I don't think the dataset is worthless, I think if you filtered down to the best 25-50% of questions it would be a reasonable dataset with acceptable error rate.)
Isn't it kinda unreasonable to put 10% on superhuman coder in a year if current AIs have a 15 nanosecond time horizon? TBC, it seems fine IMO if the model just isn't very good at predicting the 10th/90th percentile, especially with extreme hyperparameters.
I also don't know how they ran this, I tried looking for model code and I couldn't find it. (Edit: found the code.)
Looks like Eli beat me to the punch!
This contradicts METR timelines, which, IMO, is the best piece of info we currently have for predicting when AGI will arrive.
Have you read the timelines supplement? One of their main methodologies involves using this exact data from METR (yielding 2027 medians). The key differences they have from the extrapolation methodology used by METR are: they use a somewhat shorter doubling time which seems closer to what we see in 2024 (4.5 months median rather than 7 months) and they put substantial probability on the trend being superexponential.
why the timelines will be much longer
I think the timelines to superhuman coder implied by METR's work are closer to 2029 than 2027, so 2 more years or 2x longer. I don't think most people will think of this as much longer, though I guess 2x longer could qualify as much longer.
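A rough back-of-the-envelope version of this kind of extrapolation (the current-horizon and target-horizon values below are purely illustrative placeholders, not METR's or the AI 2027 authors' actual parameters, and there is no superexponential term):

```python
import math

# Purely illustrative placeholders (not METR's or AI 2027's actual parameters).
current_horizon_hours = 1.0    # ~50%-success time horizon of frontier models, early 2025
target_horizon_hours = 160.0   # e.g. roughly a work-month of tasks for a superhuman coder

doublings = math.log2(target_horizon_hours / current_horizon_hours)  # ~7.3 doublings

for doubling_time_months in (7.0, 4.5):  # METR's fitted rate vs. the faster 2024-ish rate
    years = doublings * doubling_time_months / 12
    print(f"{doubling_time_months} months/doubling -> ~{years:.1f} years from early 2025")
# ~4.3 years (around 2029) at 7 months/doubling vs ~2.7 years (late 2027) at 4.5 months,
# before any superexponential term; the assumed doubling time dominates the answer.
```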
Considering that frontier LLMs of today can solve at most 20% of problems on Humanity's Last Exam, both of these predictions appear overly optimistic to me. And HLE isn't even about autonomous research, it's about "closed-ended, verifiable questions". Even if some LLM scored >90% on HLE by late 2025 (I bet this won't happen), that wouldn't automatically imply that it's good at open-ended problems with no known answer. Present-day LLMs have so little agency that it's not even worth talking about.
I'm not sure that smart humans can solve 20% on Humanity's Last Exam (HLE). I also think that around 25-50% of the questions are impossible or mislabeled. So, this doesn't seem like a very effective way to rule out capabilities.
I think scores on HLE are mostly just not a good indicator of the relevant capabilities. (Given our current understanding.)
TBC, my median to superhuman coder is more like 2031.
Sure, communication will always be imperfect and there is a never-ending cascade of possible clarifications. But, sometimes it seems especially good to improve communication even at some cost. I just thought this was a case where there might be particularly large miscommunications.
In this particular case, I was worried there were some problematic conflationary-alliance-style dynamics where a bunch of people might think there is a broader consensus for some idea than there actually is.
Consider Tabooing Gradual Disempowerment.
I'm worried that when people say gradual disempowerment they often mean "some scenario in which humans are disempowered gradually over time", but many readers will interpret this as "the threat model in the paper called 'Gradual Disempowerment'". These things can differ substantially, and the discussion in this paper is much more specific than encompassing all scenarios in which humans are slowly disempowered!
(You could say "disempowerment which is gradual" for clarity.)
Sure, there are some very specific settings with long-held professional norms that people agree to (e.g. doctors and lawyers). I don't think this applies in this case, though you could try to create such a norm that people agree to.
My guess would be that making RL envs for broad automation of the economy is bad[1] and making benchmarks which measure how good AIs are at automating jobs is somewhat good[2].
Regardless, IMO this seems worse for the world than other activities Matthew, Tamay, and Ege might do.
I'd guess the skills will transfer to AI R&D etc insofar as the environments are good. I'm sign uncertain about broad automation which doesn't transfer (which would be somewhat confusing/surprising) as this would come down to increased awareness earlier vs speeding up AI development due to increased investment. ↩︎
It's probably better if you don't make these benchmarks easy to iterate on and focus on determining+forecasting whether models have high levels of threat-model-relevant capability. And being able to precisely compare models with similar performance isn't directly important. ↩︎
I'm not claiming that internals-based techniques aren't useful, just that internals-based techniques probably aren't that useful for specifically producing legible evidence of misalignment. Detecting misalignment with internals-based techniques could be useful for other reasons (which I list in the post) and internals based techniques could be used for applications other than detecting misalignment (e.g. better understanding some misaligned behavior).
If internals-based techniques are useful for further investigating misalignment, that seems good. And I think I agree that if we first find legible evidence of misalignment behaviorally and internals-based methods pick this up (without known false positives), then this will make future evidence with internals-based techniques more convincing. However, I think it might not end up being that much more convincing in practice unless this happens many times with misalignment which occurs in production models.
I've updated towards a bit longer based on some recent model releases and further contemplation.
I'd now say:
- 25th percentile: Oct 2027
- 50th percentile: Jan 2031
My distribution is very uncertain, but I'd say 25% by June 2027 and 50% by Jan 2031.
(I answer a similar question, but for a slightly higher bar of capabilities and operationalized somewhat differently, here. I've since updated towards slightly longer timelines. You might also be interested in the timeline in AI-2027.)
Some humans are much more charismatic than other humans based on a wide variety of sources (e.g. Sam Altman). I think these examples are pretty definitive, though I'm not sure if you'd count them as "extraordinary".
Are you sure that we see "vestigial reasoning" when:
- We run a bunch of RL while aggressively trying to reduce CoT length (e.g., with a length penalty);
- The input is in distribution with respect to the training distribution;
- The RL is purely outcome based.
I'd guess this mostly doesn't occur in this case and the examples we're seeing are either out of distribution (like the bogus reasoning case from Anthropic) or involve RL which isn't purely outcome-based (like the example from OpenAI where they train against the monitor).
Some models (like R1) weren't trained with a length penalty, so they learn to reason pretty excessively.
I'd guess we'd see some minorly steganographic reasoning, but in cases where lots of tokens really don't help with reasoning, I'd guess this mostly gets eliminated.
Sure, there might be a spectrum (though I do think some cases are quite clear cut), but I still think the distinction is useful.
(FWIW, I didn't use them as synonyms in the post except for saying "aka sandbagging" in the title which maybe should have been "aka a strategy for sandbagging". I thought "aka sandbagging" was sufficiently accurate for a title and saved space.)
The difference I was intending is:
- The AI is intentionally given affordances by humans.
- The AI gains power in a way which isn't desired by its creators/builders (likely subversively).