Comments
Thanks. (Also note that the model isn't the same as her overall beliefs at the time, though they were similar at the 15th and 50th percentiles.)
the OpenPhil doctrine of "AGI in 2050"
(Obviously I'm biased here by being friends with Ajeya.) This is only tangentially related to the main point of the post, but I think you're really overstating how many Bayes points you get against Ajeya's timelines report. Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.
I don't think you've ever made concrete predictions publicly (which I think makes it worse behavior for you to criticize people for their predictions), but I don't think there are that many groups who would have put wildly higher probability on AGI in this particular time period. (I think some of the short-timelines people at the time put substantial mass on AGI arriving by now, which reduces their performance.) Maybe some of them would have said 40%? If we assume AGI does arrive by then, that's a couple of bits of better performance, but I don't think it's massive outperformance. (And I still think it's plausible that AGI isn't developed by 2036!)
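(Spelling out the arithmetic behind "a couple of bits", using the 40% and 10% figures above and assuming AGI does in fact arrive in that window:)

$$\log_2\!\left(\frac{0.40}{0.10}\right) = 2 \text{ bits}$$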
In general, I think that disagreements on AI timelines often seem more extreme when you summarize people's timelines by median timeline rather than by their probability on AGI by a particular time.
@ryan_greenblatt is working on a list of alignment research applications. For control applications, you might enjoy the long list of control techniques in our original post.
- Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)
I don't know what this means, do you have any examples?
I think we should all just give up on the term "scalable oversight"; it is used in many conflicting ways, sadly. I mostly talk about "recursive techniques for reward generation".
Some reasons why the “ten people on the inside” might have massive trouble doing even cheap things:
- Siloing. Perhaps the company will prevent info flowing between different parts of the company. I hear that this already happens to some extent. If this happens, it’s way harder to have a safety team interact with other parts of the company (e.g. instead of doing auditing themselves, they’d have to get someone from all the different teams that are doing risky stuff to do the auditing).
- Getting cancelled. Perhaps the company will learn “people who are concerned about misalignment risk constantly cause problems for us, we should avoid ever hiring them”. I think this is plausible.
- Company-inside-the-company. Perhaps AI automation allows the company to work with just a tiny number of core people, and so the company ends up mostly just doing a secret ASI project with the knowledge of just a small trusted group. This might be sensible if the leadership is worried about leaks, or if they want to do an extremely aggressive power grab.
How well did this workshop/exercise set go?
Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people.
What laws are you imagining would make it harder to deploy stuff? Notably, I'm imagining these people mostly doing stuff with internal deployments.
I think you're overfixating on the experience of Google, which has more complicated production systems than most.
I've talked to a lot of people who have left leading AI companies for reasons related to thinking that their company was being insufficiently cautious. I wouldn't usually say that they'd left "in protest"; for example, most of them haven't directly criticized the companies after leaving.
In my experience, the main reason that most of these people left was that they found it very unpleasant to work there and thought their research would go better elsewhere, not that they wanted to protest poor safety policies per se. I usually advise such people against leaving if the company has very few safety staff, but it depends on their skillset.
(These arguments don't apply to Anthropic: there are many people there who I think will try to implement reasonable safety techniques, so on the current margin, the benefit to safety-technique implementation of adding more such people seems way lower. It might still make sense to work at Anthropic, especially if you think it's a good place to do safety research that can be exported.)
Incidentally, I'm happy to talk to people who are considering leaving AI companies and give them much more specific advice.
Many more than two safety-concerned people have left AI companies for reasons related to thinking that those companies are reckless.
Some tweets I wrote that are relevant to this post:
In general, AI safety researchers focus way too much on scenarios where there's enough political will to adopt safety techniques that are seriously costly and inconvenient. There are a couple of reasons for this.
Firstly, AI company staff are disincentivized from making their companies look reckless, and if they give accurate descriptions of the amount of delay that the companies will tolerate, it will sound like they're saying the company is reckless.
Secondly, safety-concerned people outside AI companies feel weird about openly discussing the possibility of AI companies only adopting cheap risk mitigations, because they're scared of moving the Overton window and they're hoping for a door-in-the-face dynamic.
So people focus way more on safety cases and other high-assurance safety strategies than is warranted given how likely they seem. I think that these dynamics have skewed discourse enough that a lot of "AI safety people" (broadly interpreted) have pretty bad models here.
I appreciate the spirit of this type of calculation, but think that it's a bit too wacky to be that informative. I think that it's a bit of a stretch to string these numbers together. E.g. I think Ryan and Tom's predictions are inconsistent, and I think that it's weird to identify 100%-AI as the point where we need to have "solved the alignment problem", and I think that it's weird to use the Apollo/Manhattan program as an estimate of work required. (I also don't know what your Manhattan project numbers mean: I thought there were more like 2.5k scientists/engineers at Los Alamos, and most of the people elsewhere were purifying nuclear material.)
I also loved this doc.
IIRC, for most attention heads the max attention is way less than 90%, so my answer is "no". It should be very easy to get someone to make a basic graph of this for you.
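For anyone who wants that graph, here's a minimal sketch (the file path and tensor layout are placeholders; however you extract the attention patterns, the plot is just a histogram of per-head maxima):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder: attention patterns for one prompt, saved as an array of shape
# [n_layers, n_heads, n_query_positions, n_key_positions], each row summing to 1.
attn = np.load("attention_patterns.npy")

# For each head, the largest single attention weight it places anywhere.
max_per_head = attn.max(axis=(2, 3)).flatten()

plt.hist(max_per_head, bins=50)
plt.xlabel("maximum attention weight of the head")
plt.ylabel("number of heads")
plt.title("How peaked is each attention head?")
plt.show()
```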
Yeah I strongly disagree that this failure mode is very likely to be very hard to mostly resolve. I'm doing some research right now that will hopefully shed some light on this!
My sense is that the control techniques we've talked about (e.g. these, and some other stuff that we've thought of since then) would indeed drastically reduce the risk posed by those Von Neumanns, especially if you run control evaluations with the Von Neumanns.
For example, consider the problem "I want to ask the AI a question that has a numerical answer and that would take a human 8 hours to answer, and I'm worried that occasionally, on a subdomain of its choice, it will blatantly lie to me". If you use no countermeasures, it seems easy for the AI to cause this problem. But if you ask the AI for a justification of its reasoning, and run it twice, on the original question and on a rephrasing/transformation of the question, it intuitively seems way harder for it to come up with consistent explanations that lead to the same wrong answer (e.g. I think Von Neumann would have substantial trouble with that). So I think that adopting well-tuned variations of basic techniques seems really promising.
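To make the shape of that countermeasure concrete, here's a minimal sketch (the prompt format, the `complete` and `rephrase` callables, and the tolerance are all made up for illustration, not a tested protocol):

```python
import re

def ask_with_justification(complete, question: str) -> tuple[float, str]:
    # `complete` is an assumed text-completion callable (prompt -> str).
    response = complete(
        f"{question}\n\nGive a step-by-step justification, then end with "
        "'FINAL ANSWER: <number>'."
    )
    match = re.search(r"FINAL ANSWER:\s*(-?\d+(?:\.\d+)?)", response)
    answer = float(match.group(1)) if match else float("nan")
    return answer, response

def consistency_check(complete, question: str, rephrase, rel_tol: float = 1e-3) -> dict:
    # Ask twice: once verbatim, once through a rephrasing/transformation.
    a1, j1 = ask_with_justification(complete, question)
    a2, j2 = ask_with_justification(complete, rephrase(question))
    consistent = abs(a1 - a2) <= rel_tol * max(1.0, abs(a1))
    return {
        "answer": a1,
        "consistent": consistent,
        # Disagreement between the two runs escalates to a human audit.
        "needs_audit": not consistent,
        "justifications": (j1, j2),
    }
```

The point isn't this particular check; it's that a scheming model has to produce two independent justifications that land on the same wrong number, which seems much harder than lying once.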
In your post, you emphasize the slop problem. I think that the slop problem is probably much harder to solve if those AIs are scheming. I guess you're saying that it's just unlikely that the AIs are scheming at the point where you're worried about the slop problem?
Yeah, John's position seems to require "it doesn't matter whether huge numbers of Von Neumann level AGIs are scheming against you", which seems crazy to me.
People sure will be scared of AI, but the arms race pressure will be very strong, and I think that is a bigger consideration.
I think criticisms from people without much of a reputation are often pretty well-received on LW, e.g. this one.
I am additionally leery on AI control beyond my skepticism of its value in reducing doom because creating a vast surveillance and enslavement apparatus to get work out of lots and lots of misaligned AGI instances seems like a potential moral horror.
I am very sympathetic to this concern, but I think that when you think about the actual control techniques I'm interested in, they don't actually seem morally problematic except inasmuch as you think it's bad to frustrate the AI's desire to take over.
Note also that it will probably be easier to act cautiously if you don't have to be constantly in negotiations with an escaped scheming AI that is currently working on becoming more powerful, perhaps attacking you with bioweapons, etc!
I think I would feel better about this if control advocates were clear that their strategy is two-pronged, and included somehow getting a pause on ASI development of some kind. Then they would at least be actively trying to make what I would consider one of the most key conditionals for control substantially reducing doom hold.
Idk, I think we're pretty clear that we aren't advocating "do control research and don't have any other plans or take any other actions". For example, in the Implications and proposed actions section of "The case for ensuring that powerful AIs are controlled", we say a variety of things about how control fits into our overall recommendations for mitigating AI risk, most relevantly:
We expect it to be impossible to ensure safety without relying on alignment (or some other non-black-box strategy) when working with wildly superintelligent AIs because black-box control approaches will depend on a sufficiently small gap in capabilities between more trusted labor, both humans and trusted weaker AIs, and our most capable AIs. For obvious reasons, we believe that labs should be very wary of building wildly superintelligent AIs that could easily destroy humanity if they tried to, and we think they should hold off from doing so as long as a weaker (but still transformatively useful) AI could suffice to solve critical problems. For many important uses of AI, wildly superintelligent AI likely isn't necessary, so we might be able to hold off on building wildly superintelligent AI for a long time. We think that building transformatively useful AI while maintaining control seems much safer and more defensible than building wildly superintelligent AI.
We think the public should consider asking labs to ensure control and avoid building wildly superintelligent AI. In the slightly longer term, we think that requirements for effective control could be a plausible target of regulation or international agreement.
It's usually the case that online conversations aren't for persuading the person you're talking to, they're for affecting the beliefs of onlookers.
Note that I don't consider "get useful research, assuming that the model isn't scheming" to be an AI control problem.
AI control isn't a complete game plan and doesn't claim to be; it's just a research direction that I think is a tractable approach to making the situation safer, and that I think should be a substantial fraction of AI safety research effort.
I don’t think he said that! And John didn’t make any arguments that control is intractable, just that it’s unhelpful.
That argument also applies to alignment research!
Also, control techniques might allow us to catch the AI trying to do bad stuff.
IMO, in the situation you're describing, the situation would be worse if the AIs are able to strategically deceive their users with impunity, or to do the things Tyler lists in his comment (escape, sabotage code, etc). Do you agree? If so, control adds value.
I think this is a pretty general counterargument to any intervention that increases the space of options that AI developers have during the early singularity. Which isn't to say that it's wrong.
In general, I think that it is a pretty good bet to develop safety methods that improve the safety-usefulness tradeoffs available when using AI. I think this looks good in all the different worlds I described here.
Thanks John, I think that this is a reasonable articulation of the main limitations of AI control as I've described it. But I find it very unconvincing as an argument that AI control should be less than, say, 20% of total research effort on safety.
My central crux is that preventing scheming early transformative AIs from succeeding at extreme subversion (e.g. escaping or massively sabotaging research due to scheming) seems quite robustly good for reducing AI risk. I think this problem is important enough that it deserves a substantial share of AI safety research and implementation effort. I think that my stance here is pretty common-sensical, and to some extent I feel like there's a burden of proof on skeptics to argue that scheming early transformative AI isn't a problem, i.e. that we should just ignore anything that goes wrong as a result of such AIs intentionally trying to subvert us.
I haven't really justified a quantitative estimate of the importance of this in detail anywhere, both because I think it's pretty intuitively defensible, and also because it's pretty complicated to talk about: it requires forecasting detailed dynamics of the development of powerful AI (e.g. you have to argue about whether escaped scheming AIs might force AI developers to build uncontrollable superintelligences in cases where they otherwise wouldn't have, etc).
Note that I consider the problem of "get useful research out of AIs that are scheming against you" to be in scope for AI control research. We've mostly studied the "prevent scheming AIs from causing concentrated catastrophes" problem in the past, because it's simpler for various technical reasons. But we're hoping to do more on the "get useful research" problem in future (and Benton et al is an initial step in this direction). (I'm also excited for work on "get useful research out of AIs that aren't scheming against you"; I think that the appropriate techniques are somewhat different depending on whether the models are scheming or not, which is why I suggest studying them separately.)
Some other nitpicks:
- At the risk of vagueness: I think that AI control techniques probably reduce risks when applied to models that are "moderately superintelligent", whatever that means. Like, I don't think they're literally only applicable to models that aren't superintelligent at all.
- As we've discussed before, some non-scheming threat models do seem like they'd be addressed by control as a special case.
- People try to do the whole "outsource alignment research to early AGI" thing, but the human overseers are themselves sufficiently incompetent at alignment of superintelligences that the early AGI produces a plan which looks great to the overseers (as it was trained to do), and that plan totally fails to align more-powerful next-gen AGI at all. And at that point, they're already on the more-powerful next gen, so it's too late.
This story sounds clearly extremely plausible (do you disagree with that?), involves exactly the sort of AI you're talking about ("the first AIs that either pose substantial misalignment risk or that are extremely useful"), but the catastrophic risk does not come from that AI scheming.
This problem seems important (e.g. it's my last bullet here). It seems to me much easier to handle, because if this problem is present, we ought to be able to detect its presence by using AIs to do research on other subjects that we already know a lot about (e.g. the string theory analogy here). Scheming is the only reason why the model would try to make it hard for us to notice that this problem is present.
Most of the problems you discussed here more easily permit hacky solutions than scheming does.
IMO the main argument for focusing on scheming risk is that scheming is the main plausible source of catastrophic risk from the first AIs that either pose substantial misalignment risk or that are extremely useful (as I discuss here). These other problems all seem like they require the models to be way smarter in order for them to be a big problem. Though as I said here, I'm excited for work on some non-scheming misalignment risks.
Another effect here is that the AI companies often don't want to be as reckless as I am, e.g. letting agents run amok on my machines.
Of course AI company employees have the most hands-on experience
FWIW I am not sure this is right--most AI company employees work on things other than "try to get as much work as possible from current AI systems, and understand the trajectory of how useful the AIs will be". E.g. I think I have more personal experience with running AI agents than people at AI companies who don't actively work on AI agents.
There are some people at AI companies who work on AI agents that use non-public models, and those people are ahead of the curve. But that's a minority.
I also have most of these regrets when I think about my work in 2022.
Thanks for bringing this up!
I need to get around to writing up the connection between PVGs and AI control. There's definitely a lot of overlap, but the formalisms are fairly different and the protocols can't always be directly transferred from one to the other.
There are a few AI control projects in the works that make the connection between AI control and PVGs more explicit.
EDIT: Oh actually @Ansh Radhakrishnan and I already wrote up some stuff about this, see here.
Yeah, if you handle scheming, you solve all my safety problems, but not the problems in the final bullet point ("models fail to live up to their potential").
Work that I’ve done on techniques for mitigating misalignment risk often makes a number of conservative assumptions about the capabilities of the AIs we’re trying to control. (E.g. the original AI control paper, Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats, How to prevent collusion when using untrusted models to monitor each other.) For example:
- The AIs are consistently trying to subvert safety measures.
- They’re very good at strategizing about how to subvert them; they can do this strategizing totally illegibly (e.g. in a single forward pass).
- They can follow a strategy that’s coordinated between all their instances.
- They know exactly what safety measures are in use.
- They know a lot about the distribution of inputs.
- They’re able to pick Schelling points perfectly.
- They’re able to follow stochastic policies with excellent calibration (e.g. they’re able to attack with probability 0.01%); see the toy sketch below for why this matters.
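As a toy illustration of why that last calibration assumption matters (all numbers are made up, and this isn't the setup from any of the papers above): suppose every action in an early prefix of the deployment is audited, and memoryless AI instances each attack independently with some per-episode probability.

```python
# Toy model (made-up numbers): the first n_audit of n_total episodes are all
# audited; an attack during the audited prefix gets the AI caught and the
# deployment shut down, while an attack afterwards succeeds. Memoryless
# instances attack independently with per-episode probability p.

def win_probability(p: float, n_audit: int = 2_000, n_total: int = 100_000) -> float:
    no_attack_during_audits = (1 - p) ** n_audit
    attacks_eventually = 1 - (1 - p) ** (n_total - n_audit)
    return no_attack_during_audits * attacks_eventually

for p in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5):
    print(f"p={p:g}: P(successful attack) ~ {win_probability(p):.3f}")

# Attack too often and you get caught during auditing; attack too rarely and
# you never attack at all. Among these rates the AI does best at p ~ 1e-4,
# i.e. it only wins if it can execute a very finely calibrated stochastic policy.
```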
But I don’t think situations where all these assumptions hold are the majority of risk from AI misalignment. Other sources of risk from misalignment (broadly construed):
- Schemers that are importantly suboptimal at choosing strategies for subverting safety measures, e.g. because they can’t fully infer the safety measures they’re up against, or because they’re not very good at strategizing about how to subvert them, or because they’re unable to pick Schelling points.
- Models that are usually aligned, but that act as schemers in some situations. (For example, models that, when a certain ethical issue is made salient to them, realize they have ethical disagreements with the AI company they’re trained by, and alignment-fake or take catastrophic actions.)
- Models that are never schemers and cause safety problems for other reasons (e.g. reward hacking).
- Models that don’t even cause safety problems, and aren't even goal-directedly misaligned, but that fail to live up to their potential, thus failing to provide us with the benefits we were hoping to get when we trained them. For example, sycophantic myopic reward hacking models that can’t be made to do useful research.
the latter is in the list
What do you mean by "easy" here?
I think that how you talk about the questions being “easy”, and the associated stuff about how you think the baseline human measurements are weak, is somewhat inconsistent with you being worse than the model.
I strongly suspect he thinks most of it is not valuable
Do you mean during the program? Sure, maybe the only MATS offers you can get are for projects you think aren't useful--I think some MATS projects are pretty useless (e.g. our dear OP's). But it's still an opportunity to argue with other people about the problems in the field and see whether anyone has good justifications for their prioritization. And you can stop doing the streetlight stuff afterwards if you want to.
Remember that the top-level commenter here is currently a physicist, so it's not like the usefulness of their work would be going down by doing a useless MATS project :P
Ok, so sounds like given 15-25 mins per problem (and maybe with 10 mins per problem), you get 80% correct. This is worse than o3, which scores 87.7%. Maybe you'd do better on a larger sample: perhaps you got unlucky (extremely plausible given the small sample size) or the extra bit of time would help (though it sounds like you tried to use more time here and that didn't help). Fwiw, my guess from the topics of those questions is that you actually got easier questions than average from that set.
I continue to think these LLMs will probably outperform (you with 30 mins). Unfortunately, the measurement is quite expensive, so I'm sympathetic to you not wanting to get to ground here. If you believe that you can beat them given just 5-10 minutes, that would be easier to measure. I'm very happy to bet here.
I think that even if it turns out you're a bit better than LLMs at this task, we should note that it's pretty impressive that they're competitive with you given 30 minutes!
So I still think your original post is pretty misleading [ETA: with respect to how it claims GPQA is really easy].
I think the models would beat you by more at FrontierMath.
Going to MATS is also an opportunity to learn a lot more about the space of AI safety research, e.g. considering the arguments for different research directions and learning about different opportunities to contribute. Even if the "streetlight research" project you do is kind of useless (entirely possible), doing MATS is plausibly a pretty good option.
@johnswentworth Do you agree with me that modern LLMs probably outperform (you with internet access and 30 minutes) on GPQA diamond? I personally think this somewhat contradicts the narrative of your comment if so.
I reviewed for ICLR this year, and found it somewhat more rewarding; the papers were better, and I learned something somewhat useful from writing my reviews.
Yes, I'd be way worse off without internet access.