Posts

Anthropic's updated Responsible Scaling Policy 2024-10-15T16:46:48.727Z
Anthropic: Reflections on our Responsible Scaling Policy 2024-05-20T04:14:44.435Z
Simple probes can catch sleeper agents 2024-04-23T21:10:47.784Z
Third-party testing as a key ingredient of AI policy 2024-03-25T22:40:43.744Z
Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy 2023-11-01T18:10:31.110Z
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning 2023-10-05T21:01:39.767Z
Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust 2023-09-19T15:09:27.235Z
Anthropic's Core Views on AI Safety 2023-03-09T16:55:15.311Z
Concrete Reasons for Hope about AI 2023-01-14T01:22:18.723Z
In Defence of Spock 2021-04-21T21:34:04.206Z
Zac Hatfield Dodds's Shortform 2021-03-09T02:39:33.481Z

Comments

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Don't go bankrupt, don't go rogue · 2025-02-10T04:46:55.212Z · LW · GW

How did the greens get here?

Largely via opposition to nuclear weapons, and some cost-benefit analysis which assumes nuclear proponents are too optimistic about both costs and risks of nuclear power (further reading).  Personally I think this was pretty reasonable in the 70s and 80s.  At this point I'd personally prefer to keep existing nuclear running and build solar panels instead of new reactors, though if SMRs worked in a sane regulatory regime that'd be nice too.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Mikhail Samin's Shortform · 2025-02-10T04:13:57.256Z · LW · GW

To quote from Anthropic's letter to Governor Newsom,

As you may be aware, several weeks ago Anthropic submitted a Support if Amended letter regarding SB 1047, in which we suggested a series of amendments to the bill. ... In our assessment the new SB 1047 is substantially improved, to the point where we believe its benefits likely outweigh its costs.

...

We see the primary benefits of the bill as follows:

  • Developing SSPs and being honest with the public about them. The bill mandates the adoption of safety and security protocols (SSPs), flexible policies for managing catastrophic risk that are similar to frameworks adopted by several of the most advanced developers of AI systems, including Anthropic, Google, and OpenAI. However, some companies have still not adopted these policies, and others have been vague about them. Furthermore, nothing prevents companies from making misleading statements about their SSPs or about the results of the tests they have conducted as part of their SSPs. It is a major improvement, with very little downside, that SB 1047 requires companies to adopt some SSP (whose details are up to them) and to be honest with the public about their SSP-related practices and findings.

...

We believe it is critical to have some framework for managing frontier AI systems that roughly meets [requirements discussed in the letter]. As AI systems become more powerful, it's crucial for us to ensure we have appropriate regulations in place to ensure their safety.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Mikhail Samin's Shortform · 2025-02-10T04:09:24.366Z · LW · GW

Here I am on record supporting SB-1047, along with many of my colleagues. I will continue to support specific proposed regulations if I think they would help, and oppose them if I think they would be harmful; asking "when" independent of "what" doesn't make much sense to me and doesn't seem to follow from anything I've said.

My claim is not "this is a bad time", but rather "given the current state of the art, I tend to support framework/liability/etc regulations, and tend to oppose more-specific/exact-evals/etc regulations". Obviously if the state of the art advanced enough that I thought the latter would be better for overall safety, I'd support them, and I'm glad that people are working on that.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Mikhail Samin's Shortform · 2025-02-03T06:32:45.905Z · LW · GW

There's a big difference between regulation which says roughly "you must have something like an RSP", and regulation which says "you must follow these specific RSP-like requirements", and I think Mikhail is talking about the latter.

I personally think the former is a good idea, and thus supported SB-1047 along with many other lab employees. It's also pretty clear to me that locking in circa-2023 thinking about RSPs would have been a serious mistake, and so I (along with many others) am generally against very specific regulations because we expect they would on net increase catastrophic risk.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on In response to critiques of Guaranteed Safe AI · 2025-02-02T01:55:47.077Z · LW · GW

Improving the sorry state of software security would be great, and with AI we might even see enough change to the economics of software development and maintenance that it happens, but it's not really an AI safety agenda.

(added for clarity: of course it can be part of a safety agenda, but see point #1 above)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on In response to critiques of Guaranteed Safe AI · 2025-02-01T11:16:43.990Z · LW · GW

I'm sorry that I don't have time to write up a detailed response to (critique of?) the response to critiques; hopefully this brief note is still useful.

  1. I remain frustrated by GSAI advocacy. When discussing feasibility, it's said to suit well-understood closed domains, excluding e.g. natural language; when arguing for importance, the claim becomes 'we need rigorous guarantees for current or near-future AI'. Likewise it's pitched as an extension to or complement of current practice; and yet current practice is irresponsible and inadequate. Often this is coming from different advocates, but that doesn't make it less frustrating for me.

  2. Claiming that non-vacuous sound (over)approximations are feasible, or that we'll be able to specify and verify non-trivial safety properties, is risible. Planning for runtime monitoring and anomaly detection is IMO an excellent idea, but would be entirely pointless if you believed that we had a guarantee!

  3. It's vaporware. I would love to see a demonstration project and perhaps lose my bet, but I don't find papers or posts full of details compelling, however long we could argue over them. Nullius in verba!

I like the idea of using formal tools to complement and extend current practice - I was at the workshop where Towards GSAI was drafted, and offered co-authorship - but as much as I admire the people involved, I just don't believe the core claims of the GSAI agenda as it stands.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Ten people on the inside · 2025-01-30T19:27:17.474Z · LW · GW

I don't think Miles' or Richard's stated reasons for resigning included safety policies, for example.

But my broader point is that "fewer safety people should quit leading labs to protest poor safety policies" is basically a non sequitur from "people have quit leading labs because they think they'll be more effective elsewhere", whether because they want to do something different or independent, or because they no longer trust the lab to behave responsibly.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Ten people on the inside · 2025-01-30T19:09:38.532Z · LW · GW

I agree with Rohin that there are approximately zero useful things that don't make anyone's workflow harder. The default state is "only just working means working, so I've moved on to the next thing" and if you want to change something there'd better be a benefit to balance the risk of breaking it.

Also 3% of compute is so much compute; probably more than the "20% to date over four years" that OpenAI promised and then yanked from superalignment. Take your preferred estimate of lab compute spending, multiply by 3%, and ask yourself whether a rushed unreasonable lab would grant that much money to people working on a topic it didn't care for, at the expense of those it did.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Ten people on the inside · 2025-01-29T10:31:17.073Z · LW · GW

My impression is that few (one or two?) of the safety people who have quit a leading lab did so to protest poor safety policies, and of those few none saw staying as a viable option.

Relatedly, I think Buck far overestimates the influence and resources of safety-concerned staff in a 'rushed unreasonable developer'.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Starting an Egan High School · 2025-01-27T06:54:09.150Z · LW · GW

This seems like a very long list of complicated and in many cases new and untested changes to the way schools usually work... which is not in itself bad, but does make the plan very risky. How many students do you imagine attend this school? Have you spoken to people who have founded a similar-sized school?

The good news is that outcomes for exciting new opt-in educational things tend to be pretty good; the bad news is that this is usually for reasons other than "the new thing works" - e.g. the families are engaged and care about education, the teachers are passionate, the school is responsive to changing conditions, etc. If your goal is large-scale educational reform I would not hold out much hope; if you'd be happy running a small niche school with flourishing students (eg) for however long it lasts, that seems achievable with hard work.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on POC || GTFO culture as partial antidote to alignment wordcelism · 2025-01-11T13:25:16.966Z · LW · GW

"POC || GTFO culture" need not be literal, and generally cannot be when speculating about future technologies. I wouldn't even want a proof-of-concept misaligned superintelligence!

Nonetheless, I think the field has been improved by an increasing emphasis on empiricism and demonstrations over the last two years, in technical research, in governance research, and in advocacy. I'd still like to see more careful caveating of claims for which we have arguments but not evidence, and it's useful to have a short handle for that idea - "POC || admit you're unsure", perhaps?

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Change my mind: Veganism entails trade-offs, and health is one of the axes · 2025-01-10T15:42:54.005Z · LW · GW

I think Elizabeth is correct here, and also that vegan advocates would be considerably more effective with higher epistemic standards:

I think veganism comes with trade-offs, health is one of the axes, and that the health issues are often but not always solvable. This is orthogonal to the moral issue of animal suffering. If I’m right, animal EAs need to change their messaging around vegan diets, and start self-policing misinformation. If I’m wrong, I need to write some retractions and/or shut the hell up.

The post unfortunately suffers for its length, detailed explanations, and rebuttal of many motivated misreadings - many of which can be found in the comments, so it's unclear whether this helped. It's also well-researched and cited, well organized, offers cruxes and anticipates objections - vegan advocates are fortunate to have such high-quality criticism.

This could have been a shorter post - one about, rather than engaged in, the epistemics and advocacy around veganism, with less charitable assumptions. I'd have shared that shorter post more often, but I don't think it would be better.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible · 2025-01-10T14:59:05.845Z · LW · GW

I remain both skeptical of some core claims in this post, and convinced of its importance. GeneSmith is one of few people with such a big-picture, fresh, wildly ambitious angle on beneficial biotechnology, and I'd love to see more of this genre.

On one hand, on the object level, I basically don't buy the argument that in-vivo editing could lead to substantial cognitive enhancement in adults. Brain development is incredibly important for adult cognition, and in the maybe 1%--20% residual you're going well off-distribution for any predictors trained on unedited individuals. I too would prefer bets that pay off before my median AI timelines, but biology does not always allow us to have nice things.

On the other, gene therapy does indeed work in adults for some (much simpler) issues, and there might be interventions which are narrower but still valuable. Plus, of course, there's the nineteen-ish year pathway to adults, building on current practice. There's no shortage of practical difficulties, but the strong or general objections I've seen seem ill-founded, and that makes me more optimistic about the eventual feasibility of something drawing on this tech tree.

I've been paying closer attention to the space thanks to Gene's posts, to the point of making some related investments, and look forward to watching how these ideas fare on contact with biological and engineering reality over the next few years.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · 2025-01-10T13:13:37.207Z · LW · GW

I think this is the most important statement on AI risk to date. Where ChatGPT brought "AI could be very capable" into the Overton window, the CAIS Statement brought in AI x-risk. When I give talks to NGOs, or business leaders, or government officials, I almost always include a slide with selected signatories and the full text:

Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.

I believe it's true, that it was important to say, and that it's had an ongoing, large, and positive impact. Thank you again to the organizers and to my many, many co-signatories.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on What Have Been Your Most Valuable Casual Conversations At Conferences? · 2025-01-10T12:49:12.965Z · LW · GW

I've been to quite a few Python conferences; typically I find the unstructured time in hallways, over dinner, and in "sprints" both fun and valuable. I've made great friends and recruited new colleagues, conceived and created new libraries, built professional relationships, hashed out how to fix years-old infelicities in various well-known things, etc.

Conversations at afterparties led me to write concrete reasons for hope about AI, and at another event I met a friend working on important-to-me biotechnology (I later invested in their startup). I've also occasionally taken something useful away from AI safety conversations, or in one memorable late-night conversation at LessOnline hopefully conveyed something important about my work.

There are many more examples, but it also feels telling that I can't give you examples of conference talks that amazed me in person (there are some great ones recorded, but your odds are low, and for most I'd prefer to read a good written version instead), and the structured events I've enjoyed are things like "the Python language summit" or "conference dinners which are mostly socializing" - so arguably the bar is low.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2025-01-10T04:41:00.568Z · LW · GW

And I've received an email from Mieux Donner confirming Lucie's leg has been executed for 1,000€. Thanks to everyone involved!

If anyone else is interested in a similar donation swap, from either side, I'd be excited to introduce people or maybe even do this trick again :D

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Zac Hatfield Dodds's Shortform · 2025-01-06T00:19:06.679Z · LW · GW

For what it's worth I think this accurately conveys "Zac endorses the Lightcone fundraiser and has non-trivially donated", and dropping the word "unusually" would leave the sentence unobjectionable; alternatively maybe you could have dropped me from the list instead.

I just posted this because I didn't want people to assume that I'd donated >10% of my income when I hadn't :-)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Zac Hatfield Dodds's Shortform · 2025-01-05T06:05:57.920Z · LW · GW

I (and many others) recently received an email about the Lightcone fundraiser, which included:

Many people with (IMO) strong track records of thinking about existential risk have also made unusually large personal donations, including ..., Zac Hatfield-Dodds, ...

and while I'm honored to be part of this list, there's only a narrow sense in which I've made an unusually large personal donation: the $1,000 I donated to Lightcone is unusually large from my pay-what-I-want budget, and I'm fortunate that I can afford that, but it's also much less than my typical annual donation to GiveWell. I think it's plausible that Lightcone has great EV for impartial altruistic funding, but don't count it towards my effective-giving goal - see here and here.

(I've also been happy to support Lightcone by attending and recommending events at Lighthaven, including an upcoming metastrategy intensive, and arranging a donation swap, but don't think of these as donations)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-12-23T12:26:13.031Z · LW · GW

If they're not, let me know by December 27th and I'll be happy to do the swap after all!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-12-23T09:51:45.398Z · LW · GW

I reached out to Lucie and we agreed to swap donations: she'd give 1000€ to AMF, and I'd give an additional[1] $810 to Lightcone (which I would otherwise send to GiveWell). This would split the difference in our tax deductions, and lead to more total funding for each of the organizations we want to support :-)

We ended up happily cancelling this plan because donations to Lightcone will be deductible in France after all, but I'm glad that we worked through all the details and would have done it. Update: we're doing it after all!


  1. I think it's plausible that Lightcone has great EV for impartial altruistic funding, but due to concerns about community dynamics / real-or-perceived conflicts of interest / etc, I don't give to AI-related or socially-close causes out of my 10+% to effective charity budget. But I've found both LessWrong and Lighthaven personally valuable, and therefore gave $1,000 to Lightcone on the same basis that I pay-what-you-want for arts or events that I like. I also encouraged Ray to set up rationality.training in time for end-of-2024 professional development spending, and I'm looking forward to just directly paying for a valuable service! ↩︎

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Stupid Question: Why am I getting consistently downvoted? · 2024-12-10T08:07:08.136Z · LW · GW

We've been in touch, and agreed that MadHatter will make the donation by the end of February. I'll post a final update in this thread when I get the confirmation from GiveWell.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on A good way to build many air filters on the cheap · 2024-12-08T08:40:30.102Z · LW · GW

I'd run the numbers for higher-throughput, lower-filtration filters - see e.g. the cleanairkits writeup - but this looks great!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Stupid Question: Why am I getting consistently downvoted? · 2024-12-03T22:31:02.621Z · LW · GW

Hey @MadHatter - Eliezer confirms that I've won our bet.

I ask that you donate my winnings to GiveWell's All Grants fund, here, via credit card or ACH (preferred due to lower fees).  Please check the box for "I would like to dedicate this donation to someone" and include zac@zhd.dev as the notification email address so that I can confirm here that you've done so.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on A case for donating to AI risk reduction (including if you work in AI) · 2024-12-03T02:34:53.422Z · LW · GW

IMO "major donors won't fund this kind of thing" is a pretty compelling reason to look into it, since great opportunities which are illegible or structurally-hard-to-fund definitely exist (as do illegible-or-etc terrible options; do your diligence). On the other hand I'm pretty nervous about the community dynamics that emerge when you're granting money and also socially engaged with and working in the field. Caveat donor!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Is the mind a program? · 2024-11-28T21:58:06.596Z · LW · GW

I think your argument also has to establish that the cost of simulating any that happen to matter is quite high.

My intuition is that capturing enough secondary mechanisms, in sufficient-but-abstracted detail that the simulated brain is behaviorally normal (e.g. a sim of me not-more-different than a very sleep-deprived me), is likely to be both feasible by your definition and sufficient for consciousness.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on adam_scholl's Shortform · 2024-11-09T16:52:58.502Z · LW · GW

Why do you focus on this particular guy?

Because I saw a few posts discussing his trades, vs none for anyone else's, which in turn is presumably because he moved the market by ten percentage points or so. I'm not arguing that this "should" make him so salient, but given that he was salient I stand by my sense of failure.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on overengineered air filter shelving · 2024-11-09T05:22:04.927Z · LW · GW

https://www.cleanairkits.com/products/luggables is basically one side of a Corsi-Rosenthal box, takes up very little floor space if placed by a wall, and is quiet, affordable, and effective.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AI #89: Trump Card · 2024-11-08T00:34:33.053Z · LW · GW

SQLite is ludicrously well tested; similar bugs in other databases just don't get found and fixed.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on adam_scholl's Shortform · 2024-11-07T03:42:18.659Z · LW · GW

I don't remember anyone proposing "maybe this trader has an edge", even though incentivising such people to trade is the mechanism by which prediction markets work. Certainly I didn't, and in retrospect it feels like a failure not to have had 'the multi-million dollar trader might be smart money' as a hypothesis at all.

Comment by zac-hatfield-dodds on [deleted post] 2024-10-29T07:36:15.863Z

(4) is infeasible, because voting systems are designed so that nobody can identify which voter cast which vote - including that voter. This property is called "coercion resistance", which should immediately suggest why it is important!

I further object that any scheme to "win" an election by invalidating votes (or preventing them, etc) is straightforwardly unethical and a betrayal of the principles of democracy. Don't give the impression that this is acceptable behavior, or even funny to joke about.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Daniel Kokotajlo's Shortform · 2024-10-22T04:17:59.845Z · LW · GW

let's not kill the messenger, lest we run out of messengers.

Unfortunately we're a fair way into this process, not because of downvotes[1] but rather because the comments are often dominated by uncharitable interpretations that I can't productively engage with.[2] I've had researchers and policy people tell me that reading the discussion convinced them that engaging when their work was discussed on LessWrong wasn't worth the trouble.

I'm still here, sad that I can't recommend it to many others, and wondering whether I'll regret this comment too.


  1. I also feel there's a double standard, but don't think it matters much. Span-level reacts would make it a lot easier to tell what people disagree with though. ↩︎

  2. Confidentiality makes any public writing far more effortful than you might expect. Comments which assume ill-faith are deeply unpleasant to engage with, and very rarely have any actionable takeaways. I've written and deleted a lot of other stuff here, and can't find an object-level description that I think is worth posting, but there are plenty of further reasons. ↩︎

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Daniel Kokotajlo's Shortform · 2024-10-22T03:57:13.824Z · LW · GW

I'd find the agree/disagree dimension much more useful if we split out "x people agree, y disagree" - as the EA Forum does - rather than showing the sum of weighted votes (and total number on hover).

I'd also encourage people to use the other reactions more heavily, including on substrings of a comment, but there's value in the anonymous dis/agree counts too.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on A Narrow Path: a plan to deal with AI extinction risk · 2024-10-21T07:20:48.167Z · LW · GW

(2) ✅ ... The first is from Chief of Staff at Anthropic.

The byline of that piece is "Avital Balwit lives in San Francisco and works as Chief of Staff to the CEO at Anthropic. This piece was written entirely in her personal capacity and does not reflect the views of Anthropic."

I do not think this is an appropriate citation for the claim. In any case, "They publicly state that it is not a matter of “if” such artificial superintelligence might exist, but “when”" simply seems to be untrue; both cited sources are peppered with phrases like 'possibility', 'I expect', 'could arrive', and so on.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Daniel Kokotajlo's Shortform · 2024-10-16T18:43:35.703Z · LW · GW

If grading I'd give full credit for (2) on the basis of "documents like these" referring to Anthropic's constitution + system prompt and OpenAI's model spec, and more generous partials for the others. I have no desire to litigate details here though, so I'll leave it at that.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Daniel Kokotajlo's Shortform · 2024-10-16T18:33:03.613Z · LW · GW

Proceeding with training or limited deployments of a "potentially existentially catastrophic" system would clearly violate our RSP, at minimum the commitment to define and publish ASL-4-appropriate safeguards and conduct evaluations confirming that they are not yet necessary. This footnote is referring to models which pose much lower levels of risk.

And it seems unremarkable to me for a US company to 'single out' a relevant US government entity as the recipient of a voluntary non-public disclosure of a non-existential risk.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Monthly Roundup #23: October 2024 · 2024-10-16T16:13:50.514Z · LW · GW

More importantly, the average price per plate is not just a function of costs, it's a function of the value that people receive.

No, willingness to pay is (ideally) a function of value, but under reasonable competition the price should approach the cost of providing the meal. "It's weird" that a city with many restaurants and consumers, easily available information, low transaction costs, lowish barriers to entry, minimal externalities or returns to atypical scale, and good factor mobility (at least for labor, capital, and materials) should still have substantially elevated prices. My best guess is that barriers to entry aren't that low, but mostly that profit-seekers prefer industries with fewer of these conditions!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic rewrote its RSP · 2024-10-16T01:16:43.025Z · LW · GW

A more ambitious procedural approach would involve strong third-party auditing.

I'm not aware of any third party who could currently perform such an audit - e.g. METR disclaims that here.  We committed to soliciting external expert feedback on capabilities and safeguards reports (RSP §7), and to funding new third-party evaluators to grow the ecosystem.  Right now though, third-party audit feels to me like a fabricated option rather than a lack of ambition.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Daniel Kokotajlo's Shortform · 2024-10-16T00:42:57.004Z · LW · GW

Thanks Daniel (and Dean) - I'm always glad to hear about people exploring common ground, and the specific proposals sound good to me too.

I think Anthropic already does most of these, as of our RSP update this morning! While I personally support regulation to make such actions universal and binding, I'm glad that we have voluntary commitments in the meantime:

  1. Disclosure of in-development capabilities - in section 7.2 (Transparency and External Input) of our updated RSP, we commit to public disclosures for deployed models, and to notifying a relevant U.S. Government entity if any model requires stronger protections than the ASL-2 Standard. I think this is a reasonable balance for a unilateral commitment.
  2. Disclosure of training goal / model spec - as you note, Anthropic publishes both the constitution we train with and our system prompts. I'd be interested in exploring model-spec-style aspirational documents too.
  3. Public discussion of safety cases and potential risks - there's some discussion in our Core Views essay and RSP; our capability reports and plans for safeguards and future evaluations are published here starting today (with some redactions for e.g. misuse risks).
  4. Whistleblower protections - RSP section 7.1.5 lays out our noncompliance reporting policy, and 7.1.6 a commitment not to use non-disparagement agreements which could impede or discourage publicly raising safety concerns.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on A Rational Company - Seeking Advisors · 2024-09-22T08:56:34.764Z · LW · GW

This sounds to me like the classic rationalist failure mode of doing stuff which is unusually popular among rationalists, rather than studying what experts or top performers are doing and then adopting the techniques, conceptual models, and ways of working that actually lead to good results.

Or in other words, the primary thing when thinking about how to optimize a business is not being rationalist; it is to succeed in business (according to your chosen definition).

Happily there's considerable scholarship on business, and CommonCog has done a fantastic job organizing and explaining the good parts. I highly recommend reading and discussing and reflecting on the whole site - it's a better education in business than any MBA program I know of.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AIS terminology proposal: standardize terms for probability ranges · 2024-09-11T09:35:04.561Z · LW · GW

I further suggest that if using these defined terms, instead of including a table of definitions somewhere, you include the actual probability range or point estimate in parentheses after the term. This avoids any need to explain the conventions, and makes it clear at the point of use that the author had a precise quantitative definition in mind.

For example: it's likely (75%) that flipping a pair of fair coins will get less than two heads, and extremely unlikely (0-5%) that most readers of AI safety papers are familiar with the quantitative convention proposed above - although they may (>20%) be familiar with the general concept. Note that the inline convention allows for other descriptions if they make the sentence more natural!
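
As a sanity check of the coin-flip figure above, here's a minimal sketch in Python (plain brute-force enumeration, not anything from the linked proposal):

```python
from itertools import product

# Enumerate the four equally likely outcomes of flipping two fair coins,
# then count those with fewer than two heads: HT, TH, TT -> 3/4 = 75%.
outcomes = list(product("HT", repeat=2))
favourable = sum(flips.count("H") < 2 for flips in outcomes)
print(favourable / len(outcomes))  # 0.75, i.e. "likely (75%)"
```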

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Zach Stein-Perlman's Shortform · 2024-09-11T09:22:30.162Z · LW · GW

For what it's worth, I endorse Anthropic's confidentiality policies, and am confident that everyone involved in setting them sees the increased difficulty of public communication as a cost rather than a benefit. Unfortunately, the unilateralist's curse and entangled truths mean that confidential-by-default is the only viable policy.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Benito's Shortform Feed · 2024-09-11T03:56:58.126Z · LW · GW

This feels pretty nitpick-y, but whether or not I'd be interested in taking a bet will depend on the odds - in many cases I might take either side, given a reasonably wide spread. Maybe append "at p >= 0.5" to the descriptions to clarify?

The shorthand trading syntax "$size @ $sell_percent / $buy_percent" is especially nice because it expresses the spread you'd accept to take either side of the bet, e.g. "25 @ 85/15 on rain tomorrow" to offer a bet of $25 dollars, selling if you think probability of rain is >85%, buying if you think it's <15%. Seems hard to build this into a reaction though!
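
For concreteness, here's a minimal sketch of parsing that shorthand in Python (the regex and field names are my own illustration, not an existing tool):

```python
import re

# Offers look like "25 @ 85/15 on rain tomorrow": stake $25, sell if you
# think P(event) > 85%, buy if you think P(event) < 15%.
OFFER = re.compile(r"(?P<size>\d+)\s*@\s*(?P<sell>\d+)\s*/\s*(?P<buy>\d+)\s+on\s+(?P<event>.+)")

def parse_offer(text: str) -> dict:
    match = OFFER.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised offer: {text!r}")
    return {
        "stake_dollars": int(match["size"]),
        "sell_if_probability_above": int(match["sell"]) / 100,
        "buy_if_probability_below": int(match["buy"]) / 100,
        "event": match["event"],
    }

print(parse_offer("25 @ 85/15 on rain tomorrow"))
```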

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-09-04T18:22:03.621Z · LW · GW

Locked in! Whichever way this goes, I expect to feel pretty good about both the process and the outcome :-)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-09-04T09:56:02.687Z · LW · GW

Nice! I look forward to seeing how this resolves.

Ah, by 'size' I meant the stakes, not the number of locks - did you want to bet the maximum $1k against my $10k, or some smaller proportional amount?

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T08:49:33.747Z · LW · GW

Article IV of the Certificate of Incorporation lists the number of shares of each class of stock, and as that's organized by funding round I expect that you could get a fair way by cross-referencing against public reporting.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T08:33:10.202Z · LW · GW

Someone suggested that I point out that this is misleading. The board is not independent: it's two executives, one investor, and one other guy.

As of November this year, the board will consist of the CEO, one investor representative, and three members appointed by the LTBT. I think it's reasonable to describe that as independent, even if the CEO alone would not be, and to be thinking about the from-November state in this document.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Democracy beyond majoritarianism · 2024-09-04T07:50:38.231Z · LW · GW

If you're interested in this area, I suggest looking at existing scholarship such as Manon Revel's.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-22T03:47:33.603Z · LW · GW

I think we're agreed then, if you want to confirm the size? Then we wait for 2027!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on davekasten's Shortform · 2024-08-22T03:23:31.874Z · LW · GW

See e.g. Table 1 of https://nickbostrom.com/information-hazards.pdf

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-22T02:56:22.199Z · LW · GW

That works for me - thanks very much for helping out!