Posts

Some articles in “International Security” that I enjoyed 2025-01-31T16:23:27.061Z
A sketch of an AI control safety case 2025-01-30T17:28:47.992Z
Ten people on the inside 2025-01-28T16:41:22.990Z
Early Experiments in Human Auditing for AI Control 2025-01-23T01:34:31.682Z
Thoughts on the conservative assumptions in AI control 2025-01-17T19:23:38.575Z
Measuring whether AIs can statelessly strategize to subvert security measures 2024-12-19T21:25:28.555Z
Alignment Faking in Large Language Models 2024-12-18T17:19:06.665Z
Why imperfect adversarial robustness doesn't doom AI control 2024-11-18T16:05:06.763Z
Win/continue/lose scenarios and execute/replace/audit protocols 2024-11-15T15:47:24.868Z
Sabotage Evaluations for Frontier Models 2024-10-18T22:33:14.320Z
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming 2024-10-10T13:36:53.810Z
How to prevent collusion when using untrusted models to monitor each other 2024-09-25T18:58:20.693Z
A basic systems architecture for AI agents that do autonomous research 2024-09-23T13:58:27.185Z
Distinguish worst-case analysis from instrumental training-gaming 2024-09-05T19:13:34.443Z
Would catching your AIs trying to escape convince AI developers to slow down or undeploy? 2024-08-26T16:46:18.872Z
Fields that I reference when thinking about AI takeover prevention 2024-08-13T23:08:54.950Z
Games for AI Control 2024-07-11T18:40:50.607Z
Scalable oversight as a quantitative rather than qualitative problem 2024-07-06T17:42:41.325Z
Different senses in which two AIs can be “the same” 2024-06-24T03:16:43.400Z
Access to powerful AI might make computer security radically easier 2024-06-08T06:00:19.310Z
AI catastrophes and rogue deployments 2024-06-03T17:04:51.206Z
Notes on control evaluations for safety cases 2024-02-28T16:15:17.799Z
Toy models of AI control for concentrated catastrophe prevention 2024-02-06T01:38:19.865Z
The case for ensuring that powerful AIs are controlled 2024-01-24T16:11:51.354Z
Managing catastrophic misuse without robust AIs 2024-01-16T17:27:31.112Z
Catching AIs red-handed 2024-01-05T17:43:10.948Z
Measurement tampering detection as a special case of weak-to-strong generalization 2023-12-23T00:05:55.357Z
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem 2023-12-16T05:49:23.672Z
AI Control: Improving Safety Despite Intentional Subversion 2023-12-13T15:51:35.982Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
Untrusted smart models and trusted dumb models 2023-11-04T03:06:38.001Z
Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation 2023-10-23T16:37:45.611Z
Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy 2023-07-26T17:02:56.456Z
A freshman year during the AI midgame: my approach to the next year 2023-04-14T00:38:49.807Z
One-layer transformers aren’t equivalent to a set of skip-trigrams 2023-02-17T17:26:13.819Z
Trying to disambiguate different questions about whether RLHF is “good” 2022-12-14T04:03:27.081Z
Causal scrubbing: results on induction heads 2022-12-03T00:59:18.327Z
Causal scrubbing: results on a paren balance checker 2022-12-03T00:59:08.078Z
Causal scrubbing: Appendix 2022-12-03T00:58:45.850Z
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 2022-12-03T00:58:36.973Z
Multi-Component Learning and S-Curves 2022-11-30T01:37:05.630Z
Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small 2022-10-28T23:55:44.755Z
Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley 2022-10-27T01:32:44.750Z
Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small 2022-10-12T21:25:00.459Z
Polysemanticity and Capacity in Neural Networks 2022-10-07T17:51:06.686Z
Language models seem to be much better than humans at next-token prediction 2022-08-11T17:45:41.294Z
Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing 2022-06-02T23:48:30.079Z
The prototypical catastrophic AI action is getting root access to its datacenter 2022-06-02T23:46:31.360Z
The case for becoming a black-box investigator of language models 2022-05-06T14:35:24.630Z
Apply to the second iteration of the ML for Alignment Bootcamp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2] 2022-05-06T04:23:45.164Z

Comments

Comment by Buck on How might we safely pass the buck to AI? · 2025-02-20T00:12:40.490Z · LW · GW

Thanks. (Also note that the model isn't the same as her overall beliefs at the time, though they were similar at the 15th and 50th percentiles.)

Comment by Buck on How might we safely pass the buck to AI? · 2025-02-19T23:38:44.115Z · LW · GW

the OpenPhil doctrine of "AGI in 2050"

(Obviously I'm biased here by being friends with Ajeya.) This is only tangentially related to the main point of the post, but I think you're really overstating how many Bayes points you get against Ajeya's timelines report. Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.

I don't think you've ever made concrete predictions publicly (which makes me think it's worse behavior for you to criticize people for their predictions), but I don't think there are that many groups who would have put wildly higher probability on AGI in this particular time period. (I think some of the short-timelines people at the time put substantial mass on AGI arriving by now, which reduces their performance.) Maybe some of them would have said 40%? If we assume AGI by then, that's a couple bits of better performance, but I don't think it's massive outperformance. (And I still think it's plausible that AGI isn't developed by 2036!)
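To make the "couple bits" concrete, here's the rough arithmetic, using the 10% vs. 40% figures above and ignoring how the probability is spread within the window:

\[
\log_2\!\left(\frac{0.40}{0.10}\right) = 2 \text{ bits if AGI arrives in that window}, \qquad
\log_2\!\left(\frac{0.60}{0.90}\right) \approx -0.58 \text{ bits if it doesn't.}
\]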

In general, I think that disagreements on AI timelines often seem more extreme when you summarize people's timelines by median timeline rather than by their probability on AGI by a particular time.

Comment by Buck on davekasten's Shortform · 2025-02-11T22:03:11.971Z · LW · GW

@ryan_greenblatt is working on a list of alignment research applications. For control applications, you might enjoy the long list of control techniques in our original post.

Comment by Buck on Research directions Open Phil wants to fund in technical AI safety · 2025-02-08T16:47:59.462Z · LW · GW
  • Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)

I don't know what this means, do you have any examples?

Comment by Buck on Research directions Open Phil wants to fund in technical AI safety · 2025-02-08T16:46:37.549Z · LW · GW

I think we should just all give up on the word "scalable oversight"; it is used in many conflicting ways, sadly. I mostly talk about "recursive techniques for reward generation".

Comment by Buck on Ten people on the inside · 2025-02-03T20:55:24.422Z · LW · GW

Some reasons why the “ten people on the inside” might have massive trouble doing even cheap things:

  • Siloing. Perhaps the company will prevent info flowing between different parts of the company. I hear that this already happens to some extent. If this happens, it’s way harder to have a safety team interact with other parts of the company (e.g. instead of doing auditing themselves, they’d have to get someone from all the different teams that are doing risky stuff to do the auditing).
  • Getting cancelled. Perhaps the company will learn “people who are concerned about misalignment risk constantly cause problems for us, we should avoid ever hiring them”. I think this is plausible.
  • Company-inside-the-company. Perhaps AI automation allows the company to work with just a tiny number of core people, and so the company ends up mostly just doing a secret ASI project with the knowledge of just a small trusted group. This might be sensible if the leadership is worried about leaks, or if they want to do an extremely aggressive power grab.

Comment by Buck on Olli Järviniemi's Shortform · 2025-02-02T02:05:34.669Z · LW · GW

How well did this workshop/exercise set go?

Comment by Buck on Ten people on the inside · 2025-01-30T16:53:47.608Z · LW · GW

Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people.

What laws are you imagining making it harder to deploy stuff? Notably, I'm imagining these people mostly doing stuff with internal deployments.

I think you're overfixating on the experience of Google, which has more complicated production systems than most.

Comment by Buck on Ten people on the inside · 2025-01-29T16:33:15.880Z · LW · GW

I've talked to a lot of people who have left leading AI companies for reasons related to thinking that their company was being insufficiently cautious. I wouldn't usually say that they'd left "in protest"; for example, most of them haven't directly criticized the companies after leaving.

In my experience, the main reason that most of these people left was that they found it very unpleasant to work there and thought their research would go better elsewhere, not that they wanted to protest poor safety policies per se. I usually advise such people against leaving if the company has very few safety staff, but it depends on their skillset.

(These arguments don't apply to Anthropic: there are many people there who I think will try to implement reasonable safety techniques, so on the current margin, the benefits to safety technique implementation of marginal people seem way lower. It might still make sense to work at Anthropic, especially if you think it's a good place to do safety research that can be exported.)

Incidentally, I'm happy to talk to people who are considering leaving AI companies and give them much more specific advice.

Comment by Buck on Ten people on the inside · 2025-01-29T16:28:12.931Z · LW · GW

Many more than two safety-concerned people have left AI companies for reasons related to thinking that those companies are reckless.

Comment by Buck on AI companies are unlikely to make high-assurance safety cases if timelines are short · 2025-01-28T18:09:05.338Z · LW · GW

Some tweets I wrote that are relevant to this post:

In general, AI safety researchers focus way too much on scenarios where there's enough political will to adopt safety techniques that are seriously costly and inconvenient. There are a couple of reasons for this.

Firstly, AI company staff are disincentivized from making their companies look reckless, and if they give accurate descriptions of the amount of delay that the companies will tolerate, it will sound like they're saying the company is reckless.

Secondly, safety-concerned people outside AI companies feel weird about openly discussing the possibility of AI companies only adopting cheap risk mitigations, because they're scared of moving the Overton window and they're hoping for a door-in-the-face dynamic.

So people focus way more on safety cases and other high-assurance safety strategies than is warranted given how likely they seem. I think that these dynamics have skewed discourse enough that a lot of "AI safety people" (broadly interpreted) have pretty bad models here.

Comment by Buck on Ryan Kidd's Shortform · 2025-01-27T19:30:23.409Z · LW · GW

I appreciate the spirit of this type of calculation, but think that it's a bit too wacky to be that informative. I think that it's a bit of a stretch to string these numbers together. E.g. I think Ryan and Tom's predictions are inconsistent, and I think that it's weird to identify 100%-AI as the point where we need to have "solved the alignment problem", and I think that it's weird to use the Apollo/Manhattan program as an estimate of work required. (I also don't know what your Manhattan project numbers mean: I thought there were more like 2.5k scientists/engineers at Los Alamos, and most of the people elsewhere were purifying nuclear material)

Comment by Buck on What did you learn from leaked documents? · 2025-01-27T18:31:34.611Z · LW · GW

I also loved this doc.

Comment by Buck on Is the output of the softmax in a single transformer attention head usually winner-takes-all? · 2025-01-27T17:33:29.369Z · LW · GW

IIRC, for most attention heads the max attention is way less than 90%, so my answer is "no". It should be very easy to get someone to make a basic graph of this for you.
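
If someone does want that graph, here's a minimal sketch of the measurement (a rough illustration assuming GPT-2 small via Hugging Face transformers and a single short prompt; a real version would average over a corpus and plot a histogram per head):

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog because it was very late."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one tensor per layer, shaped [batch, heads, query, key].
for layer_idx, attn in enumerate(out.attentions):
    max_per_query = attn[0].max(dim=-1).values      # [heads, query]: the "winner" weight for each query
    mean_max = max_per_query.mean(dim=-1)           # [heads]: average winner weight
    frac_near_wta = (max_per_query > 0.9).float().mean(dim=-1)  # [heads]: how often it's near winner-takes-all
    for head_idx in range(mean_max.shape[0]):
        print(f"layer {layer_idx:2d} head {head_idx:2d}: "
              f"mean max attention = {mean_max[head_idx].item():.2f}, "
              f"fraction of queries with max > 0.9 = {frac_near_wta[head_idx].item():.2f}")
```

(The very first query position trivially puts all its attention on itself, so you'd probably want to drop it before aggregating.)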

Comment by Buck on The Case Against AI Control Research · 2025-01-24T16:41:22.489Z · LW · GW

Yeah I strongly disagree that this failure mode is very likely to be very hard to mostly resolve. I'm doing some research right now that will hopefully shed some light on this!

Comment by Buck on The Case Against AI Control Research · 2025-01-23T20:09:54.626Z · LW · GW

My sense is that the control techniques we've talked about (e.g. these, and some other stuff that we've thought of since then) would indeed drastically reduce the risk posed by those Von Neumanns, especially if you run control evaluations with the Von Neumanns.

For example, consider the problem "I want to ask the AI a question that has a numerical answer and that would take a human 8 hours to answer, and I'm worried that occasionally on a subdomain of its choice it will blatantly lie to me". If you do no countermeasures, it seems easy for the AI to cause this problem. But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it intuitively seems way harder for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that). So I think that adopting well-tuned variations of basic techniques seems really promising.
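
Here's a minimal sketch of that kind of consistency check; query_model, rephrase, and audit are hypothetical stand-ins (the untrusted model call, the question transformation, and escalation to trusted auditing), not real APIs:

```python
from typing import Callable, Optional, Tuple

def consistency_check(
    question: str,
    query_model: Callable[[str], Tuple[float, str]],  # hypothetical: returns (numeric answer, justification)
    rephrase: Callable[[str], str],                    # hypothetical: rewords/transforms the question
    audit: Callable[[str, Tuple[float, str], Tuple[float, str]], None],  # hypothetical: hand off to auditors
    rel_tol: float = 0.01,
) -> Optional[float]:
    """Run the untrusted model on the original and a rephrased question; only accept agreement."""
    answer_1, justification_1 = query_model(question)
    answer_2, justification_2 = query_model(rephrase(question))

    # Accept only if the two independent runs give (approximately) the same number.
    scale = max(abs(answer_1), abs(answer_2), 1.0)
    if abs(answer_1 - answer_2) <= rel_tol * scale:
        return answer_1

    # Disagreement is evidence of error or of deliberate lying on a subdomain of the
    # model's choice, so send both transcripts to trusted (human or weaker-AI) auditing.
    audit(question, (answer_1, justification_1), (answer_2, justification_2))
    return None
```

The point is just that producing two mutually consistent justifications for the same wrong answer across transformed versions of a question is much harder than lying once.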

Comment by Buck on The Case Against AI Control Research · 2025-01-23T19:56:34.414Z · LW · GW

In your post, you emphasize the slop problem. I think that the slop problem is probably much harder to solve if those AIs are scheming. I guess you're saying that it's just unlikely that the AIs are scheming at the point where you're worried about the slop problem?

Comment by Buck on The Case Against AI Control Research · 2025-01-23T19:34:16.146Z · LW · GW

Yeah, John's position seems to require "it doesn't matter whether huge numbers of Von Neumann level AGIs are scheming against you", which seems crazy to me.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T19:26:42.027Z · LW · GW

People will sure be scared of AI, but the arms race pressure will be very strong, and I think that is a bigger consideration.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T19:09:41.849Z · LW · GW

I think criticisms from people without much of a reputation are often pretty well-received on LW, e.g. this one.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T17:59:43.333Z · LW · GW

I am additionally leery on AI control beyond my skepticism of its value in reducing doom because creating a vast surveillance and enslavement apparatus to get work out of lots and lots of misaligned AGI instances seems like a potential moral horror.

I am very sympathetic to this concern, but I think that when you think about the actual control techniques I'm interested in, they don't actually seem morally problematic except inasmuch as you think it's bad to frustrate the AI's desire to take over.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T16:29:25.428Z · LW · GW

Note also that it will probably be easier to act cautiously if you don't have to be constantly in negotiations with an escaped scheming AI that is currently working on becoming more powerful, perhaps attacking you with bioweapons, etc!

Comment by Buck on The Case Against AI Control Research · 2025-01-22T16:16:22.680Z · LW · GW

I think I would feel better about this if control advocates were clear that their strategy is two-pronged, and included somehow getting a pause on ASI development of some kind. Then they would at least be actively trying to make what I would consider one of the most key conditionals for control substantially reducing doom hold.

Idk, I think we're pretty clear that we aren't advocating "do control research and don't have any other plans or take any other actions". For example, in the Implications and proposed actions section of "The case for ensuring that powerful AIs are controlled", we say a variety of things about how control fits into our overall recommendations for mitigating AI risk, most relevantly:

We expect it to be impossible to ensure safety without relying on alignment (or some other non-black-box strategy) when working with wildly superintelligent AIs because black-box control approaches will depend on a sufficiently small gap in capabilities between more trusted labor, both humans and trusted weaker AIs, and our most capable AIs. For obvious reasons, we believe that labs should be very wary of building wildly superintelligent AIs that could easily destroy humanity if they tried to, and we think they should hold off from doing so as long as a weaker (but still transformatively useful) AI could suffice to solve critical problems. For many important uses of AI, wildly superintelligent AI likely isn't necessary, so we might be able to hold off on building wildly superintelligent AI for a long time. We think that building transformatively useful AI while maintaining control seems much safer and more defensible than building wildly superintelligent AI.

We think the public should consider asking labs to ensure control and avoid building wildly superintelligent AI. In the slightly longer term, we think that requirements for effective control could be a plausible target of regulation or international agreement.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T16:11:08.472Z · LW · GW

It's usually the case that online conversations aren't for persuading the person you're talking to, they're for affecting the beliefs of onlookers.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T07:07:27.416Z · LW · GW

Note that I don't consider "get useful research, assuming that the model isn't scheming" to be an AI control problem.

AI control isn't a complete game plan and doesn't claim to be; it's just a research direction that I think is a tractable approach to making the situation safer, and that I think should be a substantial fraction of AI safety research effort.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T02:04:17.986Z · LW · GW

I don’t think he said that! And John didn’t make any arguments that control is intractable, just unhelpful.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T02:01:20.937Z · LW · GW

That argument also applies to alignment research!

Also, control techniques might allow us to catch the AI trying to do bad stuff.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T00:12:54.788Z · LW · GW

IMO, in the situation you're describing, the situation would be worse if the AIs are able to strategically deceive their users with impunity, or to do the things Tyler lists in his comment (escape, sabotage code, etc). Do you agree? If so, control adds value.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T00:02:03.154Z · LW · GW

I think this is a pretty general counterargument to any intervention that increases the space of options that AI developers have during the early singularity. Which isn't to say that it's wrong.

In general, I think that it is a pretty good bet to develop safety methods that improve the safety-usefulness tradeoffs available when using AI. I think this looks good in all the different worlds I described here.

Comment by Buck on The Case Against AI Control Research · 2025-01-21T17:18:40.563Z · LW · GW

Thanks John, I think that this is a reasonable articulation of the main limitations of AI control as I've described it. But I find it very unconvincing as an argument that AI control should be less than, say, 20% of total research effort on safety.

My central crux is that preventing scheming early transformative AIs from succeeding at extreme subversion (e.g. escaping or massively sabotaging research due to scheming) seems quite robustly good for reducing AI risk. I think this problem is important enough that it deserves a substantial share of AI safety research and implementation effort. I think that my stance here is pretty common-sensical, and to some extent I feel like there's a burden of proof on skeptics to argue that scheming early transformative AI isn't a problem, or that we should just ignore anything that goes wrong as a result of them intentionally trying to subvert us.

I haven't really justified a quantitative estimate of the importance of this in detail anywhere, both because I think it's pretty intuitively defensible, and also because it's pretty complicated to talk about: it requires forecasting detailed dynamics of the development of powerful AI (e.g. you have to argue about whether escaped scheming AIs might force AI developers to build uncontrollable superintelligences in cases where they otherwise wouldn't have, etc).

Note that I consider the problem of "get useful research out of AIs that are scheming against you" to be in scope for AI control research. We've mostly studied the "prevent scheming AIs from causing concentrated catastrophes" problem in the past, because it's simpler for various technical reasons. But we're hoping to do more on the "get useful research" problem in future (and Benton et al is an initial step in this direction). (I'm also excited for work on "get useful research out of AIs that aren't scheming against you"; I think that the appropriate techniques are somewhat different depending on whether the models are scheming or not, which is why I suggest studying them separately.)

Some other nitpicks:

  • At the risk of vagueness: I think that AI control techniques probably reduce risks when applied to models that are "moderately superintelligent", whatever that means. Like, I don't think they're literally only applicable to models that aren't superintelligent at all.
  • As we've discussed before, some non-scheming threat models do seem like they'd be addressed by control as a special case.

Comment by Buck on johnswentworth's Shortform · 2025-01-10T19:32:26.083Z · LW · GW

  • People try to do the whole "outsource alignment research to early AGI" thing, but the human overseers are themselves sufficiently incompetent at alignment of superintelligences that the early AGI produces a plan which looks great to the overseers (as it was trained to do), and that plan totally fails to align more-powerful next-gen AGI at all. And at that point, they're already on the more-powerful next gen, so it's too late.

This story sounds clearly extremely plausible (do you disagree with that?), involves exactly the sort of AI you're talking about ("the first AIs that either pose substantial misalignment risk or that are extremely useful"), but the catastrophic risk does not come from that AI scheming.

This problem seems important (e.g. it's my last bullet here). It seems to me much easier to handle, because if this problem is present, we ought to be able to detect its presence by using AIs to do research on other subjects that we already know a lot about (e.g. the string theory analogy here). Scheming is the only reason why the model would try to make it hard for us to notice that this problem is present.

Comment by Buck on johnswentworth's Shortform · 2025-01-10T18:49:27.609Z · LW · GW

Most of the problems you discussed here more easily permit hacky solutions than scheming does.

Comment by Buck on johnswentworth's Shortform · 2025-01-10T18:48:47.410Z · LW · GW

IMO the main argument for focusing on scheming risk is that scheming is the main plausible source of catastrophic risk from the first AIs that either pose substantial misalignment risk or that are extremely useful (as I discuss here). These other problems all seem like they require the models to be way smarter in order for them to be a big problem. Though as I said here, I'm excited for work on some non-scheming misalignment risks.

Comment by Buck on AI Timelines · 2025-01-07T21:43:52.809Z · LW · GW

Another effect here is that the AI companies often don't want to be as reckless as I am, e.g. letting agents run amok on my machines.

Comment by Buck on AI Timelines · 2025-01-07T20:53:13.839Z · LW · GW

Of course AI company employees have the most hands-on experience

FWIW I am not sure this is right--most AI company employees work on things other than "try to get as much work as possible from current AI systems, and understand the trajectory of how useful the AIs will be". E.g. I think I have more personal experience with running AI agents than people at AI companies who don't actively work on AI agents.

There are some people at AI companies who work on AI agents that use non-public models, and those people are ahead of the curve. But that's a minority.

Comment by Buck on ejenner's Shortform · 2025-01-03T18:31:50.142Z · LW · GW

I also have most of these regrets when I think about my work in 2022.

Comment by Buck on Daniel Tan's Shortform · 2025-01-03T00:19:59.185Z · LW · GW

Thanks for bringing this up!

I need to get around to writing up the connection between PVGs and AI control. There's definitely a lot of overlap, but the formalisms are fairly different and the protocols can't always be directly transferred from one to the other.

There are a few AI control projects in the works that make the connection between AI control and PVGs more explicitly.

EDIT: Oh actually @Ansh Radhakrishnan and I already wrote up some stuff about this, see here.

Comment by Buck on Buck's Shortform · 2024-12-31T19:56:41.277Z · LW · GW

Yeah, if you handle scheming, you solve all my safety problems, but not the "models fail to live up to their potential" problems from the final bullet point.

Comment by Buck on Buck's Shortform · 2024-12-31T16:22:26.856Z · LW · GW

Work that I’ve done on techniques for mitigating misalignment risk often makes a number of conservative assumptions about the capabilities of the AIs we’re trying to control. (E.g. the original AI control paper, Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats, and How to prevent collusion when using untrusted models to monitor each other.) For example:

  • The AIs are consistently trying to subvert safety measures.
  • They’re very good at strategizing about how to subvert them; they can do this strategizing totally illegibly (e.g. in a single forward pass).
  • They can follow a strategy that’s coordinated between all their instances.
  • They know exactly what safety measures are in use.
  • They know a lot about the distribution of inputs.
  • They’re able to pick Schelling points perfectly.
  • They’re able to follow stochastic policies with excellent calibration (e.g. they’re able to attack with probability 0.01%). (A toy simulation of this kind of rare, calibrated attack policy is sketched just below.)
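
As a deliberately pessimistic toy illustration of that last assumption (my own made-up numbers, and auditing here is applied at random rather than targeted by a trusted monitor):

```python
import random

def simulate_deployment(n_episodes: int = 100_000, p_attack: float = 1e-4,
                        audit_rate: float = 0.02, seed: int = 0) -> str:
    """One deployment: the schemer attacks each episode i.i.d. with probability p_attack,
    and each episode is independently audited with probability audit_rate."""
    rng = random.Random(seed)
    for _ in range(n_episodes):
        attacked = rng.random() < p_attack
        audited = rng.random() < audit_rate
        if attacked and audited:
            return "win"       # attack caught red-handed
        if attacked:
            return "lose"      # unaudited attack succeeds
    return "continue"          # deployment ends with no attack

counts: dict[str, int] = {}
for seed in range(200):
    outcome = simulate_deployment(seed=seed)
    counts[outcome] = counts.get(outcome, 0) + 1
print(counts)  # roughly {"lose": ~196, "win": ~4} with these numbers
```

With these numbers nearly every simulated deployment ends with an attack, and only ~2% of those attacks happen to land on an audited episode, so the blue team usually loses; that's the kind of worst case the conservative assumptions above are meant to cover.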

But I don’t think situations where all these assumptions hold are the majority of risk from AI misalignment. Other sources of risk from misalignment (broadly construed):

  • Schemers that are importantly suboptimal at choosing strategies for subverting safety measures, e.g. because they can’t fully infer the safety measures they’re up against, or because they’re not very good at strategizing about how to subvert them, or because they’re unable to pick Schelling points.
  • Models that are usually aligned, but that act as schemers in some situations. (For example, models that, when a certain ethical issue is made salient to them, realize they have ethical disagreements with the AI company they’re trained by, and alignment-fake or take catastrophic actions.)
  • Models that are never schemers and cause safety problems for other reasons (e.g. reward hacking).
  • Models that don’t even cause safety problems, and aren't even goal-directedly misaligned, but that fail to live up to their potential, thus failing to provide us with the benefits we were hoping to get when we trained them. For example, sycophantic myopic reward hacking models that can’t be made to do useful research.

Comment by Buck on Fabien's Shortform · 2024-12-29T17:29:14.157Z · LW · GW

the latter is in the list

Comment by Buck on johnswentworth's Shortform · 2024-12-28T17:36:31.477Z · LW · GW

What do you mean by "easy" here?

Comment by Buck on johnswentworth's Shortform · 2024-12-27T19:54:56.673Z · LW · GW

I think that how you talk about the questions being “easy”, and the associated stuff about how you think the baseline human measurements are weak, is somewhat inconsistent with you being worse than the model.

Comment by Buck on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T18:00:53.252Z · LW · GW

I strongly suspect he thinks most of it is not valuable

Comment by Buck on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T15:43:30.476Z · LW · GW

Do you mean during the program? Sure, maybe the only MATS offers you can get are for projects you think aren't useful--I think some MATS projects are pretty useless (e.g. our dear OP's). But it's still an opportunity to argue with other people about the problems in the field and see whether anyone has good justifications for their prioritization. And you can stop doing the streetlight stuff afterwards if you want to.

Remember that the top-level commenter here is currently a physicist, so it's not like the usefulness of their work would be going down by doing a useless MATS project :P 

Comment by Buck on johnswentworth's Shortform · 2024-12-27T15:24:49.772Z · LW · GW

Ok, so it sounds like given 15-25 mins per problem (and maybe with 10 mins per problem), you get 80% correct. This is worse than o3, which scores 87.7%. Maybe you'd do better on a larger sample: perhaps you got unlucky (extremely plausible given the small sample size) or the extra bit of time would help (though it sounds like you tried to use more time here and that didn't help). Fwiw, my guess from the topics of those questions is that you actually got easier questions than average from that set.

I continue to think these LLMs will probably outperform (you with 30 mins). Unfortunately, the measurement is quite expensive, so I'm sympathetic to you not wanting to get to ground here. If you believe that you can beat them given just 5-10 minutes, that would be easier to measure. I'm very happy to bet here.

I think that even if it turns out you're a bit better than LLMs at this task, we should note that it's pretty impressive that they're competitive with you given 30 minutes!

So I still think your original post is pretty misleading [ETA: with respect to how it claims GPQA is really easy].

I think the models would beat you by more at FrontierMath.

Comment by Buck on johnswentworth's Shortform · 2024-12-27T15:11:07.172Z · LW · GW

Comment by Buck on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T06:56:49.438Z · LW · GW

Going to MATS is also an opportunity to learn a lot more about the space of AI safety research, e.g. considering the arguments for different research directions and learning about different opportunities to contribute. Even if the "streetlight research" project you do is kind of useless (entirely possible), doing MATS is plausibly a pretty good option.

Comment by Buck on johnswentworth's Shortform · 2024-12-26T23:01:40.983Z · LW · GW

@johnswentworth Do you agree with me that modern LLMs probably outperform (you with internet access and 30 minutes) on GPQA diamond? I personally think this somewhat contradicts the narrative of your comment if so.

Comment by Buck on Buck's Shortform · 2024-12-26T22:58:09.433Z · LW · GW

I reviewed for ICLR this year, and found it somewhat more rewarding; the papers were better, and I learned something somewhat useful from writing my reviews.

Comment by Buck on johnswentworth's Shortform · 2024-12-26T19:36:03.815Z · LW · GW

Yes, I'd be way worse off without internet access.