buck

I feel like GDM safety and Constellation are similar enough to be in the same cluster: I bet within-cluster variance is bigger than between-cluster variance.

Comment by Buck on aog's Shortform · 2025-04-22T03:52:31.132Z · LW · GW

FWIW, I think that the GDM safety people are at least as similar to the Constellation/Redwood/METR cluster as the Anthropic safety people are, probably more similar. (And Anthropic as a whole has very different beliefs than the Constellation cluster, e.g. not having much credence on misalignment risk.)

Comment by Buck on Three Months In, Evaluating Three Rationalist Cases for Trump · 2025-04-19T06:35:05.088Z · LW · GW

Hammond was, right?

Comment by Buck on To be legible, evidence of misalignment probably has to be behavioral · 2025-04-16T04:34:48.301Z · LW · GW

Ryan agrees, the main thing he means by "behavioral output" is what you're saying: an actually really dangerous action.

Comment by Buck on Notes on countermeasures for exploration hacking (aka sandbagging) · 2025-04-11T14:00:50.421Z · LW · GW

I think we should probably say that exploration hacking is a strategy for sandbagging, rather than using them as synonyms.

Comment by Buck on How much progress actually happens in theoretical physics? · 2025-04-05T19:02:40.781Z · LW · GW

Isn’t the answer that the low hanging fruit of explaining unexplained observations has been picked?

Comment by Buck on Buck's Shortform · 2025-04-05T15:43:54.898Z · LW · GW

I appeared on the 80,000 Hours podcast. I discussed a bunch of points on misalignment risk and AI control that I don't think I've heard discussed publicly before.

Transcript + links + summary here; it's also available as a podcast in many places.

Comment by Buck on Consider showering · 2025-04-02T19:59:19.619Z · LW · GW

I love that I can guess the infohazard from the comment

Comment by Buck on Thomas Kwa's Shortform · 2025-04-02T00:39:37.829Z · LW · GW

A few months ago, I accidentally used France as an example of a small country that it wouldn't be that catastrophic for AIs to take over, while giving a talk in France 😬

Comment by Buck on Third-wave AI safety needs sociopolitical thinking · 2025-04-01T00:05:29.235Z · LW · GW

No problem, my comment was pretty unclear and I can see from the other comments why you'd be on edge!

Comment by Buck on ank's Shortform · 2025-03-31T20:46:26.090Z · LW · GW

It seems extremely difficult to make a blacklist of models in a way that isn't trivially breakable. (E.g. what's supposed to happen when someone adds a tiny amount of noise to the weights of a blacklisted model, or rotates them along a gauge invariance?)

Comment by Buck on Third-wave AI safety needs sociopolitical thinking · 2025-03-31T19:49:15.973Z · LW · GW

I agree that this isn't what I'd call "direct written evidence"; I was just (somewhat jokingly) making the point that the linked articles are Bayesian evidence that Musk tries to censor, and that the articles are pieces of text.

Comment by Buck on Third-wave AI safety needs sociopolitical thinking · 2025-03-31T17:24:32.927Z · LW · GW

It is definitely evidence that was literally written

Comment by Buck on Why do many people who care about AI Safety not clearly endorse PauseAI? · 2025-03-31T16:32:57.821Z · LW · GW

I disagree that you have to believe those four things in order to believe what I said. I believe some of those and find others too ambiguously phrased to evaluate.

Re your model: I think your model is basically just: if we race, we go from 70% chance that US "wins" to a 75% chance the US wins, and we go from a 50% chance of "solving alignment" to a 25% chance? Idk how to apply that here: isn't your squiggle model talking about whether racing is good, rather than whether unilaterally pausing is good? Maybe you're using "race" to mean "not pause" and "not race" to mean "pause"; if so, that's super confusing terminology. If we unilaterally paused indefinitely, surely we'd have less than 70% chance of winning.

In general, I think you're modeling this extremely superficially in your comments on the topic. I wish you'd try modeling this with more granularity than "is alignment hard" or whatever. I think that if you try to actually make such a model, you'll likely end up with a much better sense of where other people are coming from. If you're trying to do this, I recommend reading posts where people explain strategies for passing safely through the singularity, e.g. like this.

Comment by Buck on Mo Putera's Shortform · 2025-03-31T16:15:18.901Z · LW · GW

I have this experience with @ryan_greenblatt -- he's got an incredible ability to keep really large and complicated argument trees in his head, so he feels much less need to come up with slightly-lossy abstractions and categorizations than e.g. I do. This is part of why his work often feels like huge, mostly unstructured lists. (The lists are more unstructured before his pre-release commenters beg him to structure them more.) (His code often also looks confusing to me, for similar reasons.)

Comment by Buck on Why do many people who care about AI Safety not clearly endorse PauseAI? · 2025-03-31T14:21:54.674Z · LW · GW

Some quick takes:

"Pause AI" could refer to many different possible policies.
I think that if humanity avoided building superintelligent AI, we'd massively reduce the risk of AI takeover and other catastrophic outcomes.
I suspect that at some point in the future, AI companies will face a choice between proceeding more slowly with AI development than they're incentivized to, and proceeding more quickly while imposing huge risks. In particular, I suspect it's going to be very dangerous to develop ASI.
I don't think that it would be clearly good to pause AI development now. This is mostly because I don't think that the models being developed literally right now pose existential risk.
Maybe it would be better to pause AI development right now because this will improve the situation later (e.g. maybe we should pause until frontier labs implement good enough security that we can be sure their Slack won't be hacked again, leaking algorithmic secrets). But this is unclear and I don't think it immediately follows from "we could stop AI takeover risk by pausing AI development before the AIs are able to take over".
Many of the plausible "pause now" actions seem to overall increase risk. For example, I think it would be bad for relatively responsible AI developers to unilaterally pause, and I think it would probably be bad for the US to unilaterally force all US AI developers to pause if they didn't simultaneously somehow slow down non-US development.
- (They could slow down non-US development with actions like export controls.)
Even in the cases where I support something like pausing, it's not clear that I want to spend effort on the margin actively supporting it; maybe there are other things I could push on instead that have better ROI.
I'm not super enthusiastic about PauseAI the organization; they sometimes seem to not be very well-informed, they sometimes argue for conclusions that I think are wrong, and I find Holly pretty unpleasant to interact with, because she seems uninformed and prone to IMO unfair accusations that I'm conspiring with AI companies. My guess is that there could be an organization with similar goals to PauseAI that I felt much more excited for.

Comment by Buck on Good Research Takes are Not Sufficient for Good Strategic Takes · 2025-03-31T13:07:43.717Z · LW · GW

A few points:

Knowing a research field well makes it easier to assess how much other people know about it. For example, if you know ML, you sometimes notice that someone clearly doesn't know what they're talking about (or conversely, you become impressed by the fact that they clearly do know what they're talking about). This is helpful when deciding who to defer to.
If you are a prominent researcher, you get more access to confidential/sensitive information and the time of prestigious people. This is true regardless of whether your strategic takes are good, and generally improves your strategic takes.
- One downside is that people try harder to convince you of stuff. I think that being a more prominent researcher is probably overall net positive despite this effect.
IMO, one way of getting a sense of whether someone's strategic takes are good is to ask them whether they try hard to have strategic takes. A lot of people will openly tell you that they don't focus on that, which makes it easier for you to avoid deferring to random half-baked strategic takes that they say without expecting anyone to take them too seriously.

Comment by Buck on We Have No Plan for Preventing Loss of Control in Open Models · 2025-03-31T12:35:56.076Z · LW · GW

A few takes:

I believe that there is also an argument to be made that the AI safety community is currently very under-indexed on research into future scenarios where assumptions about the AI operator taking baseline safety precautions related to preventing loss of control do not hold.

I think you're mixing up two things: the extent to which we consider the possibility that AI operators will be very incautious, and the extent to which our technical research focuses on that possibility.

My research mostly focuses on techniques that an AI developer could use to reduce the misalignment risk posed by deploying and developing AI, given some constraints on how much value they need to get from the AI. Given this, I basically definitionally have to be imagining the AI developer trying to mitigate misalignment risk: why else would they use the techniques I study?

But that focus isn't to say that I'm totally sure all AI developers will in fact use good safety techniques.

Another disagreement is that I think that we're better off if some AI developers (preferably more powerful ones) have controlled or aligned their models, even if there are some misaligned AIs being developed without safeguards. This is because the controlled/aligned models can be used to defend against attacks from the misaligned AIs, and to compete with the misaligned AIs (on both acquiring resources and doing capabilities and safety research).

Comment by Buck on The case for ensuring that powerful AIs are controlled · 2025-03-14T00:24:44.773Z · LW · GW

I am not sure I agree with this change at this point. How do you feel now?

Comment by Buck on Buck's Shortform · 2025-03-13T00:21:17.553Z · LW · GW

We're planning to release some talks; I also hope we can publish various other content from this!

I'm sad that we didn't have space for everyone!

Comment by Buck on Mis-Understandings's Shortform · 2025-03-10T04:10:54.282Z · LW · GW

I wrote thoughts here: https://redwoodresearch.substack.com/p/fields-that-i-reference-when-thinking?selection=fada128e-c663-45da-b21d-5473613c1f5c

Comment by Buck on Jerdle's Shortform · 2025-03-05T22:30:32.421Z · LW · GW

Fans of math exercises related to toy models of taxation might enjoy this old post of mine.

Comment by Buck on Buck's Shortform · 2025-03-03T19:54:46.221Z · LW · GW

Alignment Forum readers might be interested in this:

Announcing ControlConf: The world’s first conference dedicated to AI control - techniques to mitigate security risks from AI systems even if they’re trying to subvert those controls. March 27-28, 2025 in London. 🧵ControlConf will bring together:
Researchers from frontier labs & government
AI researchers curious about control mechanisms
InfoSec professionals
Policy researchers
Our goals: build bridges across disciplines, share knowledge, and coordinate research priorities.
The conference will feature presentations, technical demos, panel discussions, 1:1 networking, and structured breakout sessions to advance the field.
Interested in joining us in London on March 27-28? Fill out the expression of interest form by March 7th: https://forms.fillout.com/t/sFuSyiRyJBus

Comment by Buck on How might we safely pass the buck to AI? · 2025-02-25T17:19:02.566Z · LW · GW

Paul is not Ajeya, and also Eliezer only gets one bit from this win, which I think is insufficient grounds for behaving like such an asshole.

Comment by Buck on How might we safely pass the buck to AI? · 2025-02-20T00:12:40.490Z · LW · GW

Thanks. (Also note that the model isn't the same as her overall beliefs at the time, though they were similar at the 15th and 50th percentiles.)

Comment by Buck on How might we safely pass the buck to AI? · 2025-02-19T23:38:44.115Z · LW · GW

the OpenPhil doctrine of "AGI in 2050"

(Obviously I'm biased here by being friends with Ajeya.) This is only tangentially related to the main point of the post, but I think you're really overstating how many Bayes points you get against Ajeya's timelines report. Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.

I don't think you've ever made concrete predictions publicly (which makes me think it's worse behavior for you to criticize people for their predictions), but I don't think there are that many groups who would have put wildly higher probability on AGI in this particular time period. (I think some of the short-timelines people at the time put substantial mass on AGI arriving by now, which reduces their performance.) Maybe some of them would have said 40%? If we assume AGI by then, that's a couple bits of better performance, but I don't think it's massive outperformance. (And I still think it's plausible that AGI isn't developed by 2036!)

In general, I think that disagreements on AI timelines often seem more extreme when you summarize people's timelines by median timeline rather than by their probability on AGI by a particular time.

Comment by Buck on davekasten's Shortform · 2025-02-11T22:03:11.971Z · LW · GW

@ryan_greenblatt is working on a list of alignment research applications. For control applications, you might enjoy the long list of control techniques in our original post.

Comment by Buck on Research directions Open Phil wants to fund in technical AI safety · 2025-02-08T16:47:59.462Z · LW · GW

Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)

I don't know what this means, do you have any examples?

Comment by Buck on Research directions Open Phil wants to fund in technical AI safety · 2025-02-08T16:46:37.549Z · LW · GW

I think we should just all give up on the word "scalable oversight"; it is used in many conflicting ways, sadly. I mostly talk about "recursive techniques for reward generation".

Comment by Buck on Ten people on the inside · 2025-02-03T20:55:24.422Z · LW · GW

Some reasons why the “ten people on the inside” might have massive trouble doing even cheap things:

Siloing. Perhaps the company will prevent info flowing between different parts of the company. I hear that this already happens to some extent already. If this happens, it’s way harder to have a safety team interact with other parts of the company (e.g. instead of doing auditing themselves, they’d have to get someone from all the different teams that are doing risky stuff to do the auditing).
Getting cancelled. Perhaps the company will learn “people who are concerned about misalignment risk constantly cause problems for us, we should avoid ever hiring them”. I think this is plausible.
Company-inside-the-company. Perhaps AI automation allows the company to work with just a tiny number of core people, and so the company ends up mostly just doing a secret ASI project with the knowledge of just a small trusted group. This might be sensible if the leadership is worried about leaks, or if they want to do an extremely aggressive power grab.

Comment by Buck on Olli Järviniemi's Shortform · 2025-02-02T02:05:34.669Z · LW · GW

How well did this workshop/exercise set go?

Comment by Buck on Ten people on the inside · 2025-01-30T16:53:47.608Z · LW · GW

Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people.

What laws are you imagine making it harder to deploy stuff? Notably I'm imagining these people mostly doing stuff with internal deployments.

I think you're overfixating on the experience of Google, which has more complicated production systems than most.

Comment by Buck on Ten people on the inside · 2025-01-29T16:33:15.880Z · LW · GW

I've talked to a lot of people who have left leading AI companies for reasons related to thinking that their company was being insufficiently cautious. I wouldn't usually say that they'd left "in protest"; for example, most of them haven't directly criticized the companies after leaving.

In my experience, the main reason that most of these people left was that they found it very unpleasant to working there and thought their research would be better elsewhere, not that they wanted to protest poor safety policies per se. I usually advise such people against leaving if the company has very few safety staff, but it depends on their skillset.

(These arguments don't apply to Anthropic: there are many people there who I think will try to implement reasonable safety techniques, so on the current margin, the benefits to safety technique implementation of marginal people seems way lower. It might still make sense to work at Anthropic, especially if you think it's a good place to do safety research that can be exported.)

Incidentally, I'm happy to talk to people who are considering leaving AI companies and give them much more specific advice.

Comment by Buck on Ten people on the inside · 2025-01-29T16:28:12.931Z · LW · GW

Many more than two safety-concerned people have left AI companies for reasons related to thinking that those companies are reckless.

Comment by Buck on AI companies are unlikely to make high-assurance safety cases if timelines are short · 2025-01-28T18:09:05.338Z · LW · GW

Some tweets I wrote that are relevant to this post:

In general, AI safety researchers focus way too much on scenarios where there's enough political will to adopt safety techniques that are seriously costly and inconvenient. There's a couple reasons for this.
Firstly, AI company staff are disincentivized from making their companies look reckless, and if they give accurate descriptions of the amount of delay that the companies will tolerate, it will sound like they're saying the company is reckless.
Secondly, safety-concerned people outside AI companies feel weird about openly discussing the possibility of AI companies only adopting cheap risk mitigations, because they're scared of moving the Overton window and they're hoping for a door-in-the-face dynamic.
So people focus way more on safety cases and other high-assurance safety strategies than is deserved given how likely they seem. I think that these dynamics have skewed discourse enough that a lot of "AI safety people" (broadly interpreted) have pretty bad models here.

Comment by Buck on Ryan Kidd's Shortform · 2025-01-27T19:30:23.409Z · LW · GW

I appreciate the spirit of this type of calculation, but think that it's a bit too wacky to be that informative. I think that it's a bit of a stretch to string these numbers together. E.g. I think Ryan and Tom's predictions are inconsistent, and I think that it's weird to identify 100%-AI as the point where we need to have "solved the alignment problem", and I think that it's weird to use the Apollo/Manhattan program as an estimate of work required. (I also don't know what your Manhattan project numbers mean: I thought there were more like 2.5k scientists/engineers at Los Alamos, and most of the people elsewhere were purifying nuclear material)

Comment by Buck on What did you learn from leaked documents? · 2025-01-27T18:31:34.611Z · LW · GW

I also loved this doc.

Comment by Buck on Is the output of the softmax in a single transformer attention head usually winner-takes-all? · 2025-01-27T17:33:29.369Z · LW · GW

IIRC, for most attention heads the max attention is way less than 90%, so my answer is "no". It should be very easy to get someone to make a basic graph of this for you.

Comment by Buck on The Case Against AI Control Research · 2025-01-24T16:41:22.489Z · LW · GW

Yeah I strongly disagree that this failure mode is very likely to be very hard to mostly resolve. I'm doing some research right now that will hopefully shed some light on this!

Comment by Buck on The Case Against AI Control Research · 2025-01-23T20:09:54.626Z · LW · GW

My sense is that the control techniques we've talked about (e.g. these, and some other stuff that we've thought of since then) would indeed drastically reduce the risk posed by those Von Neumanns, especially if you run control evaluations with the Von Neumanns.

For example, consider the problem "I want to ask the AI a question that has a numerical answer and that would take a human 8 hours to answer, and I'm worried that occasionally on a subdomain of its choice it will blatantly lie to me". If you do no countermeasures, it seems easy for the AI to cause this problem. But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it seems way intuitively hard for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that). So I think that adopting well-tuned variations of basic techniques seems really promising.

Comment by Buck on The Case Against AI Control Research · 2025-01-23T19:56:34.414Z · LW · GW

In your post, you emphasize the slop problem. I think that the slop problem is probably much harder to solve if those AIs are scheming. I guess you're saying that it's just unlikely that the AIs are scheming at the point where you're worried about the slop problem?

Comment by Buck on The Case Against AI Control Research · 2025-01-23T19:34:16.146Z · LW · GW

Yeah, John's position seems to require "it doesn't matter whether huge numbers of Von Neumann level AGIs are scheming against you", which seems crazy to me.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T19:26:42.027Z · LW · GW

People will sure be scared of AI, but the arms race pressure will be very strong, and I think that is a bigger consideration

Comment by Buck on The Case Against AI Control Research · 2025-01-22T19:09:41.849Z · LW · GW

I think criticisms from people without much of a reputation are often pretty well-received on LW, e.g. this one.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T17:59:43.333Z · LW · GW

I am additionally leery on AI control beyond my skepticism of its value in reducing doom because creating a vast surveillance and enslavement apparatus to get work out of lots and lots of misaligned AGI instances seems like a potential moral horror.

I am very sympathetic to this concern, but I think that when you think about the actual control techniques I'm interested in, they don't actually seem morally problematic except inasmuch as you think it's bad to frustrate the AI's desire to take over.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T16:29:25.428Z · LW · GW

Note also that it will probably be easier to act cautiously if you don't have to be constantly in negotiations with an escaped scheming AI that is currently working on becoming more powerful, perhaps attacking you with bioweapons, etc!

Comment by Buck on The Case Against AI Control Research · 2025-01-22T16:16:22.680Z · LW · GW

I think I would feel better about this if control advocates were clear that their strategy is two-pronged, and included somehow getting a pause on ASI development of some kind. Then they would at least be actively trying to make what I would consider one of the most key conditionals for control substantially reducing doom hold.

Idk, I think we're pretty clear that we aren't advocating "do control research and don't have any other plans or take any other actions". For example, in the Implications and proposed actions section of "The case for ensuring that powerful AIs are controlled", we say a variety of things about how control fits into our overall recommendations for mitigating AI risk, most relevantly:

We expect it to be impossible to ensure safety without relying on alignment (or some other non-black-box strategy) when working with wildly superintelligent AIs because black-box control approaches will depend on a sufficiently small gap in capabilities between more trusted labor, both humans and trusted weaker AIs, and our most capable AIs. For obvious reasons, we believe that labs should be very wary of building wildly superintelligent AIs that could easily destroy humanity if they tried to, and we think they should hold off from doing so as long as a weaker (but still transformatively useful) AI could suffice to solve critical problems. For many important uses of AI, wildly superintelligent AI likely isn't necessary, so we might be able to hold off on building wildly superintelligent AI for a long time. We think that building transformatively useful AI while maintaining control seems much safer and more defensible than building wildly superintelligent AI.
We think the public should consider asking labs to ensure control and avoid building wildly superintelligent AI. In the slightly longer term, we think that requirements for effective control could be a plausible target of regulation or international agreement.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T16:11:08.472Z · LW · GW

It's usually the case that online conversations aren't for persuading the person you're talking to, they're for affecting the beliefs of onlookers.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T07:07:27.416Z · LW · GW

Note that I don't consider "get useful research, assuming that the model isn't scheming" to be an AI control problem.

AI control isn't a complete game plan and doesn't claim to be; it's just a research direction that I think is a tractable approach to making the situation safer, and that I think should be a substantial fraction of AI safety research effort.

Comment by Buck on The Case Against AI Control Research · 2025-01-22T02:04:17.986Z · LW · GW

I don’t think he said that! And John didn’t make any arguments that control is intractable, just unhelpful.

User info

Posts

Comments