Posts

Anthropic's updated Responsible Scaling Policy 2024-10-15T16:46:48.727Z
Anthropic: Reflections on our Responsible Scaling Policy 2024-05-20T04:14:44.435Z
Simple probes can catch sleeper agents 2024-04-23T21:10:47.784Z
Third-party testing as a key ingredient of AI policy 2024-03-25T22:40:43.744Z
Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy 2023-11-01T18:10:31.110Z
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning 2023-10-05T21:01:39.767Z
Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust 2023-09-19T15:09:27.235Z
Anthropic's Core Views on AI Safety 2023-03-09T16:55:15.311Z
Concrete Reasons for Hope about AI 2023-01-14T01:22:18.723Z
In Defence of Spock 2021-04-21T21:34:04.206Z
Zac Hatfield Dodds's Shortform 2021-03-09T02:39:33.481Z

Comments

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Stupid Question: Why am I getting consistently downvoted? · 2024-12-10T08:07:08.136Z · LW · GW

We've been in touch, and agreed that MadHatter will make the donation by end of February. I'll post a final update in this thread when I get the confirmation from GiveWell.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on A good way to build many air filters on the cheap · 2024-12-08T08:40:30.102Z · LW · GW

I'd run the numbers for higher-throughput, lower-filtration filters - see e.g. the cleanairkits writeup - but this looks great!
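
For anyone who does want to run those numbers, here's a back-of-envelope sketch of the throughput-vs-filtration tradeoff (all figures below are illustrative assumptions, not measurements - check the cleanairkits writeup and real spec sheets):

```python
# Clean Air Delivery Rate ~= airflow * single-pass filtration efficiency.
# Illustrative numbers only: swap in the measured CFM and efficiency of your filters.
hepa_cadr = 100 * 0.997    # e.g. ~100 CFM pushed through a ~99.7%-efficient HEPA filter
merv13_cadr = 250 * 0.85   # e.g. ~250 CFM through a ~85%-efficient MERV 13 filter
print(hepa_cadr, merv13_cadr)  # higher throughput can deliver more clean air despite lower filtration
```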

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Stupid Question: Why am I getting consistently downvoted? · 2024-12-03T22:31:02.621Z · LW · GW

Hey @MadHatter - Eliezer confirms that I've won our bet.

I ask that you donate my winnings to GiveWell's All Grants Fund, here, via credit card or ACH (preferred due to lower fees). Please check the box for "I would like to dedicate this donation to someone" and include zac@zhd.dev as the notification email address so that I can confirm here that you've done so.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on A case for donating to AI risk reduction (including if you work in AI) · 2024-12-03T02:34:53.422Z · LW · GW

IMO "major donors won't fund this kind of thing" is a pretty compelling reason to look into it, since great opportunities which are illegible or structurally-hard-to-fund definitely exist (as do illegible-or-etc terrible options; do your diligence). On the other hand I'm pretty nervous about the community dynamics that emerge when you're granting money and also socially engaged with and working in the field. Caveat donor!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Is the mind a program? · 2024-11-28T21:58:06.596Z · LW · GW

I think your argument also has to establish that the cost of simulating any that happen to matter is also quite high.

My intuition is that capturing enough secondary mechanisms, in sufficient-but-abstracted detail that the simulated brain is behaviorally normal (e.g. a sim of me not-more-different than a very sleep-deprived me), is likely to be both feasible by your definition and sufficient for consciousness.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on adam_scholl's Shortform · 2024-11-09T16:52:58.502Z · LW · GW

Why do you focus on this particular guy?

Because I saw a few posts discussing his trades, vs none for anyone else's, which in turn is presumably because he moved the market by ten percentage points or so. I'm not arguing that this "should" make him so salient, but given that he was salient I stand by my sense of failure.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on overengineered air filter shelving · 2024-11-09T05:22:04.927Z · LW · GW

https://www.cleanairkits.com/products/luggables is basically one side of a Corsi-Rosenthal box, takes up very little floor space if placed by a wall, and is quiet, affordable, and effective.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AI #89: Trump Card · 2024-11-08T00:34:33.053Z · LW · GW

SQLite is ludicrously well tested; similar bugs in other databases just don't get found and fixed.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on adam_scholl's Shortform · 2024-11-07T03:42:18.659Z · LW · GW

I don't remember anyone proposing "maybe this trader has an edge", even though incentivising such people to trade is the mechanism by which prediction markets work. Certainly I didn't, and in retrospect it feels like a failure not to have had 'the multi-million dollar trader might be smart money' as a hypothesis at all.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on agg's Shortform · 2024-10-29T07:36:15.863Z · LW · GW

(4) is infeasible, because voting systems are designed so that nobody can identify which voter cast which vote - including that voter. This property is called "coercion resistance", which should immediately suggest why it is important!

I further object that any scheme to "win" an election by invalidating votes (or preventing them, etc) is straightforwardly unethical and a betrayal of the principles of democracy. Don't give the impression that this is acceptable behavior, or even funny to joke about.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Daniel Kokotajlo's Shortform · 2024-10-22T04:17:59.845Z · LW · GW

let's not kill the messenger, lest we run out of messengers.

Unfortunately we're a fair way into this process, not because of downvotes[1] but rather because the comments are often dominated by uncharitable interpretations that I can't productively engage with.[2] I've had researchers and policy people tell me that reading the discussion convinced them that engaging when their work was discussed on LessWrong wasn't worth the trouble.

I'm still here, sad that I can't recommend it to many others, and wondering whether I'll regret this comment too.


  1. I also feel there's a double standard, but don't think it matters much. Span-level reacts would make it a lot easier to tell what people disagree with though. ↩︎

  2. Confidentiality makes any public writing far more effortful than you might expect. Comments which assume ill-faith are deeply unpleasant to engage with, and very rarely have any actionable takeaways. I've written and deleted a lot of other stuff here, and can't find an object-level description that I think is worth posting, but there are plenty of further reasons. ↩︎

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Daniel Kokotajlo's Shortform · 2024-10-22T03:57:13.824Z · LW · GW

I'd find the agree/disagree dimension much more useful if we split out "x people agree, y disagree" - as the EA Forum does - rather than showing the sum of weighted votes (and total number on hover).

I'd also encourage people to use the other reactions more heavily, including on substrings of a comment, but there's value in the anonymous dis/agree counts too.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on A Narrow Path: a plan to deal with AI extinction risk · 2024-10-21T07:20:48.167Z · LW · GW

(2) ✅ ... The first is from Chief of Staff at Anthropic.

The byline of that piece is "Avital Balwit lives in San Francisco and works as Chief of Staff to the CEO at Anthropic. This piece was written entirely in her personal capacity and does not reflect the views of Anthropic."

I do not think this is an appropriate citation for the claim. In any case, "They publicly state that it is not a matter of 'if' such artificial superintelligence might exist, but 'when'" simply seems to be untrue; both cited sources are peppered with phrases like 'possibility', 'I expect', 'could arrive', and so on.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Daniel Kokotajlo's Shortform · 2024-10-16T18:43:35.703Z · LW · GW

If grading I'd give full credit for (2) on the basis of "documents like these" referring to Anthropic's constitution + system prompt and OpenAI's model spec, and more generous partials for the others. I have no desire to litigate details here though, so I'll leave it at that.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Daniel Kokotajlo's Shortform · 2024-10-16T18:33:03.613Z · LW · GW

Proceeding with training or limited deployments of a "potentially existentially catastrophic" system would clearly violate our RSP, at minimum the commitment to define and publish ASL-4-appropriate safeguards and conduct evaluations confirming that they are not yet necessary. This footnote is referring to models which pose much lower levels of risk.

And it seems unremarkable to me for a US company to 'single out' a relevant US government entity as the recipient of a voluntary non-public disclosure of a non-existential risk.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Monthly Roundup #23: October 2024 · 2024-10-16T16:13:50.514Z · LW · GW

More importantly, the average price per plate is not just a function of costs, it's a function of the value that people receive.

No, willingness to pay is (ideally) a function of value, but under reasonable competition the price should approach the cost of providing the meal. "It's weird" that a city with many restaurants and consumers, easily available information, low transaction costs, lowish barriers to entry, minimal externalities or returns to atypical scale, and good factor mobility (at least for labor, capital, and materials) should still have substantially elevated prices. My best guess is that barriers to entry aren't that low, but mostly that profit-seekers prefer industries with fewer of these conditions!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic rewrote its RSP · 2024-10-16T01:16:43.025Z · LW · GW

A more ambitious procedural approach would involve strong third-party auditing.

I'm not aware of any third party who could currently perform such an audit - e.g. METR disclaims that here. We committed to soliciting external expert feedback on capabilities and safeguards reports (RSP §7), and to funding new third-party evaluators to grow the ecosystem. Right now though, third-party audit feels to me like a fabricated option rather than a lack of ambition.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Daniel Kokotajlo's Shortform · 2024-10-16T00:42:57.004Z · LW · GW

Thanks Daniel (and Dean) - I'm always glad to hear about people exploring common ground, and the specific proposals sound good to me too.

I think Anthropic already does most of these, as of our RSP update this morning! While I personally support regulation to make such actions universal and binding, I'm glad that we have voluntary commitments in the meantime:

  1. Disclosure of in-development capabilities - in section 7.2 (Transparency and External Input) of our updated RSP, we commit to public disclosures for deployed models, and to notify a relevant U.S. Government entity if any model requires stronger protections than the ASL-2 Standard. I think this is a reasonable balance for a unilateral commitment.
  2. Disclosure of training goal / model spec - as you note, Anthropic publishes both the constitution we train with and our system prompts. I'd be interested in also exploring model-spec-style aspirational documents too.
  3. Public discussion of safety cases and potential risks - there's some discussion in our Core Views essay and RSP; our capability reports and plans for safeguards and future evaluations are published here starting today (with some redactions for e.g. misuse risks).
  4. Whistleblower protections - RSP section 7.1.5 lays out our noncompliance reporting policy, and 7.1.6 a commitment not to use non-disparagement agreements which could impede or discourage publicly raising safety concerns.
Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on A Rational Company - Seeking Advisors · 2024-09-22T08:56:34.764Z · LW · GW

This sounds to me like the classic rationalist failure mode of doing stuff which is unusually popular among rationalists, rather than studying what experts or top performers are doing and then adopting the techniques, conceptual models, and ways of working that actually lead to good results.

Or in other words, the primary thing when thinking about how to optimize a business is not being rationalist; it is to succeed in business (according to your chosen definition).


Happily there's considerable scholarship on business, and CommonCog has done a fantastic job organizing and explaining the good parts. I highly recommend reading and discussing and reflecting on the whole site - it's a better education in business than any MBA program I know of.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AIS terminology proposal: standardize terms for probability ranges · 2024-09-11T09:35:04.561Z · LW · GW

I further suggest that if using these defined terms, instead of including a table of definitions somewhere you include the actual probability range or point estimate in parentheses after the term. This avoids any need to explain the conventions, and makes it clear at the point of use that the author had a precise quantitative definition in mind.

For example: it's likely (75%) that flipping a pair of fair coins will get less than two heads, and extremely unlikely (0-5%) that most readers of AI safety papers are familiar with the quantitative convention proposed above - although they may (>20%) be familiar with the general concept. Note that the inline convention allows for other descriptions if they make the sentence more natural!
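
As a quick sanity check of the coin-flip arithmetic above, a minimal sketch (purely illustrative, not part of the proposed convention):

```python
from itertools import product

# Enumerate the four equally likely outcomes of two fair coin flips
# and count those with fewer than two heads.
outcomes = list(product("HT", repeat=2))
favourable = [o for o in outcomes if o.count("H") < 2]
print(len(favourable) / len(outcomes))  # 0.75, i.e. "likely (75%)"
```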

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Zach Stein-Perlman's Shortform · 2024-09-11T09:22:30.162Z · LW · GW

For what it's worth, I endorse Anthropic's confidentiality policies, and am confident that everyone involved in setting them sees the increased difficulty of public communication as a cost rather than a benefit. Unfortunately, the unilateralist's curse and entangled truths mean that confidential-by-default is the only viable policy.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Benito's Shortform Feed · 2024-09-11T03:56:58.126Z · LW · GW

This feels pretty nitpick-y, but whether or not I'd be interested in taking a bet will depend on the odds - in many cases I might take either side, given a reasonably wide spread. Maybe append "at p >= 0.5" to the descriptions to clarify?

The shorthand trading syntax "$size @ $sell_percent / $buy_percent" is especially nice because it expresses the spread you'd accept to take either side of the bet, e.g. "25 @ 85/15 on rain tomorrow" to offer a bet of $25, selling if you think the probability of rain is >85%, buying if you think it's <15%. Seems hard to build this into a reaction though!
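
A toy parser for that shorthand, to make the reading concrete (the exact grammar and field names here are my own assumptions, not an established standard):

```python
import re

def parse_quote(quote: str) -> dict:
    """Parse the shorthand '<size> @ <sell>/<buy> on <event>'."""
    m = re.fullmatch(r"(\d+) @ (\d+)/(\d+) on (.+)", quote.strip())
    if not m:
        raise ValueError(f"could not parse quote: {quote!r}")
    size, sell, buy, event = m.groups()
    return {
        "stake_usd": int(size),         # dollars offered
        "sell_above": int(sell) / 100,  # sell if your probability is above this
        "buy_below": int(buy) / 100,    # buy if your probability is below this
        "event": event,
    }

print(parse_quote("25 @ 85/15 on rain tomorrow"))
# {'stake_usd': 25, 'sell_above': 0.85, 'buy_below': 0.15, 'event': 'rain tomorrow'}
```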

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-09-04T18:22:03.621Z · LW · GW

Locked in! Whichever way this goes, I expect to feel pretty good about both the process and the outcome :-)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-09-04T09:56:02.687Z · LW · GW

Nice! I look forward to seeing how this resolves.

Ah, by 'size' I meant the stakes, not the number of locks - did you want to bet the maximum $1k against my $10k, or some smaller proportional amount?

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T08:49:33.747Z · LW · GW

Article IV of the Certificate of Incorporation lists the number of shares of each class of stock, and as that's organized by funding round I expect that you could get a fair way by cross-referencing against public reporting.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T08:33:10.202Z · LW · GW

Someone suggested that I point out that this is misleading. The board is not independent: it's two executives, one investor, and one other guy.

As of November this year, the board will consist of the CEO, one investor representative, and three members appointed by the LTBT. I think it's reasonable to describe that as independent, even if the CEO alone would not be, and to be thinking about the from-November state in this document.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Democracy beyond majoritarianism · 2024-09-04T07:50:38.231Z · LW · GW

If you're interested in this area, I suggest looking at existing scholarship such as Manon Revel's.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-22T03:47:33.603Z · LW · GW

I think we're agreed then, if you want to confirm the size? Then we wait for 2027!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on davekasten's Shortform · 2024-08-22T03:23:31.874Z · LW · GW

See e.g. Table 1 of https://nickbostrom.com/information-hazards.pdf

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-22T02:56:22.199Z · LW · GW

That works for me - thanks very much for helping out!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Tapatakt's Shortform · 2024-08-21T15:51:45.417Z · LW · GW

I don't recall any interpretability experiments with TinyStories offhand, but I'd be surprised if there aren't any.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-21T00:36:38.745Z · LW · GW

I don't think that a thing you can only manufacture once is a practically usable lock; having multiple is also practically useful to facilitate picking attempts and in case of damage - imagine that a few hours into an open pick-this-lock challenge, someone bent a part such that the key no longer opens the lock. I'd suggest resolving neutral in this case, as we only saw a partial attempt.

Other conditions:

  • I think it's important that the design could have at least a thousand distinct keys which are non-pickable. It's fine if the theoretical keyspace is larger so long as the verified-secure keyspace is large enough to be useful, and distinct keys/locks need not be manufactured so long as they're clearly possible.
  • I expect the design to be available in advance to people attempting to pick the lock, just as the design principles and detailed schematics of current mechanical locks are widely known - security through obscurity would not demonstrate that the design is better, only that as-yet-secret designs are harder to exploit.

I nominate @raemon as our arbiter, if both he and you are willing, and the majority vote or nominee of the Lightcone team if Raemon is unavailable for some reason (and @habryka approves that).

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Zach Stein-Perlman's Shortform · 2024-08-20T22:03:19.260Z · LW · GW

I think both Zach and I care about labs doing good things on safety, communicating that clearly, and helping people understand both what labs are doing and the range of views on what they should be doing.  I shared Zach's doc with some colleagues, but won’t try for a point-by-point response.  Two high-level responses:

First, at a meta level, you say:

  1. [Probably Anthropic (like all labs) should encourage staff members to talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic, as long as they (1) clarify that they're not speaking for Anthropic and (2) don't share secrets.]

I do feel welcome to talk about my views on this basis, and often do so with friends and family, at public events, and sometimes even in writing on the internet (hi!).  However, it takes way more effort than you might think to avoid inaccurate or misleading statements while also maintaining confidentiality.  Public writing tends to be higher-stakes due to the much larger audience and durability, so I routinely run comments past several colleagues before posting, and often redraft in response (including these comments and this very point!).

My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections.  I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine.  That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.

Imagine, if you will, trying to hold a long conversation about AI risk - but you can’t reveal any information about, or learned from, or even just informative about LessWrong.  Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public and for all that you get pretty regular hostility anyway because of… well, there are plenty of misunderstandings and caricatures to go around.

I run intro-to-AGI-safety courses for Anthropic employees (based on AGI-SF), and we draw a clear distinction between public and confidential resources specifically so that people can go talk to family and friends and anyone else they wish about the public information we cover.

Second, and more concretely, many of these asks are unimplementable for various reasons, and often gesture in a direction without giving reasons to think that there’s a better tradeoff available than we’re already making.  Some quick examples:

  • Both AI Control and safety cases are research areas less than a year old; we’re investigating them and e.g. hiring safety-case specialists, but best-practices we could implement don’t exist yet.  Similarly, there simply aren’t any auditors or audit standards for AI safety yet (see e.g. METR’s statement); we’re working to make this possible but the thing you’re asking for just doesn’t exist yet.  Some implementation questions that “let auditors audit our models” glosses over:

    • If you have dozens of organizations asking to be auditors, and none of them are officially auditors yet, what criteria do you use to decide who you collaborate with?

    • What kind of pre-deployment model access would you provide?  If it’s helpful-only or other nonpublic access, do they meet our security bar to avoid leaking privileged API keys?  (We’ve already seen unauthorized sharing or compromise lead to serious abuse.)

    • How do you decide who gets to say what about the testing?  What if they have very different priorities than you and think that a different level of risk or a different kind of harm is unacceptable?

  • I strongly support Anthropic’s nondisclosure of information about pretraining.  I have never seen a compelling argument that publishing this kind of information is, on net, beneficial for safety.

  • There are many cases where I’d be happy if Anthropic shared more about what we’re doing and what we’re thinking about.  Some of the things you’re asking about I think we’ve already said, e.g. for [7] LTBT changes would require an RSP update, and for [17] our RSP requires us to “enforce an acceptable use policy [against …] using the model to generate content that could cause severe risks to the continued existence of humankind”.

So, saying “do more X” just isn’t that useful; we’ve generally thought about it and concluded that the current amount of X is our best available tradeoff at the moment.  For many of the other asks above, I just disagree with implicit or explicit claims about the facts in question.  Even for the communication issues where I’d celebrate us sharing more—and for some I expect we will—doing so is yet another demand on heavily loaded people and teams, and it can take longer than we’d like to find the time.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-17T18:43:42.717Z · LW · GW

I agree with you that this feels like a 'compact crux' for many parts of the agenda. I'd like to take your bet; let me reflect on whether there are any additional operationalizations or conditions.

Quick proposals:

  • I win at the end of 2026, if there has not been a formally-verified design for a mechanical lock, OR the design does not verify that it cannot be mechanically picked, OR fewer than three consistent physical instances have been manufactured (a total of three that includes prototypes or other designs doesn't count).
  • You win if at the end of 2027, there have been credible and failed expert attempts to pick such a lock (e.g. an open challenge at Defcon). I win if there is a successful attempt.
  • Bet resolves neutral, and we each donate half our stakes to a mutually agreed charity, if it's unclear whether production actually happened, or there were no credible attempts to pick a verified lock.
  • Any disputes resolved by the best judgement of an agreed-in-advance arbiter; I'd be happy with the LessWrong team if you and they also agree.
Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Raemon's Shortform · 2024-08-17T06:17:32.534Z · LW · GW

Hmm, usually when I go looking it's because I remember reading a particular post, but there's always some chance of getting tab-sniped into reading just a few more pages...

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Raemon's Shortform · 2024-08-17T00:27:26.180Z · LW · GW

tags: I use them semi-regularly to find related posts when I want to refer to previous discussions of a topic. They work well for that, and I've occasionally added tags when the post I was looking for wasn't tagged yet.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Humanity isn't remotely longtermist, so arguments for AGI x-risk should focus on the near term · 2024-08-13T01:47:23.014Z · LW · GW

I think it's worth distinguishing between what I'll call 'intrinsic preference discounting', and 'uncertain-value discounting'. In the former case, you inherently care less about what happens in the (far?) future; in the latter case you are impartial but rationally discount future value based on your uncertainty about whether it'll actually happen - perhaps there'll be a supernova or something before anyone actually enjoys the utils! Economists often observe the latter, or some mixture, and attribute it to the former.
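
A minimal numerical sketch of why the two are easy to conflate (toy numbers of my own, not from any study):

```python
# Ten years of equal annual utility, discounted two different ways.
years = range(1, 11)
annual_utility = 1.0

# Intrinsic preference discounting: you simply care 5% less about each successive year.
intrinsic = sum(annual_utility * 0.95 ** t for t in years)

# Uncertain-value discounting: you care equally about every year, but each year
# is only reached with (compounding) probability 0.95 - e.g. catastrophe risk.
uncertain = sum(annual_utility * 0.95 ** t for t in years)

print(round(intrinsic, 2), round(uncertain, 2))
# Identical totals: revealed behavior alone can't distinguish the two stories,
# which is how observers end up attributing one to the other.
```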

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-11T19:45:47.363Z · LW · GW

I am a big fan of verification in principle, but don't think it's feasible in such broad contexts in practice. Most of these proposals are far too vague for me to have an opinion on, but I'd be happy to make bets with anyone interested that

  • (3) won't happen (with nontrivial bounds) due to the difficulty of modeling nondeterministic GPUs, varying floating-point formats, etc.
  • (8) won't be attempted, or will fail at some combination of design, manufacture, or just-being-pickable.  This is a great proposal and a beautifully compact crux for the overall approach.
  • with some mutually agreed operationalization against the success of other listed ideas with a physical component

for up to 10k of my dollars at up to 10:1 odds (i.e. min your $1k), placed by end of 2024, to resolve by end of 2026 or as agreed. (I'm somewhat dubious about very long-term bets for implementation reasons)  I'm also happy to discuss conditioning on a registered attempt.


For what it's worth, I was offered and chose to decline co-authorship of Towards Guaranteed Safe AI. Despite my respect for the authors and strong support for this research agenda (as part of a diverse portfolio of safety research), I disagree with many of the central claims of the paper, and see it as uncritically enthusiastic about formal verification and overly dismissive of other approaches.

I do not believe that any of the proposed approaches actually guarantee safety. Safety is a system property,[1] which includes the surrounding social and technical context; and I see serious conceptual and in-practice-fatal challenges in defining both a world-model and a safety spec. (exercise: what could these look like for a biology lab? Try to block unsafe outcomes without blocking Kevin Esvelt's research!)

I do think it's important to reach appropriately high safety assurances before developing or deploying future AI systems which would be capable of causing a catastrophe. However, I believe that the path there is to extend and complement current techniques, including empirical and experimental approaches alongside formal verification - whatever actually works in practice.


  1. My standard recommendation on this is ch. 1-3 of Engineering a Safer World; I expect her recent book is excellent but haven't yet read enough to recommend it. ↩︎
Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-10T00:51:16.490Z · LW · GW

Mark Zuckerberg has concluded that it is better to open source powerful AI models for the white hats than to try to keep them secret from the black hats.

Of course, Llama models aren't actually open source - here's my summary, here's LeCun testifying under oath (which he regularly contradicts on Twitter).  I think this substantially undercuts Zuckerberg's argument, and I don't believe that it's a sincere position, just tactically useful right now.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on It's time for a self-reproducing machine · 2024-08-08T03:54:25.449Z · LW · GW

I'm just guessing from affect, yep, though I still think that "large project and many years of effort" typically describes considerably smaller challenges than my expectation for producing a complete autofac.

On the steel-vs-vitamins question, I'm thinking about "effort" - loose proxy measurements would be the sale value or production headcount rather than kilograms of output. Precisely because steel is easier to transform, it's much less valuable to do so and thus I expect the billions-of-autofacs to be far less economically valuable than a quick estimate might show. Unless of course they start edging in on vitamin production, but then that's the hard rest-of-the-industrial-economy problem...

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on It's time for a self-reproducing machine · 2024-08-08T03:43:58.297Z · LW · GW

To be clear, I think the autofac concept - with external "vitamins" for electronics etc - is in fact technically feasible right now and if teleoperated has been for decades. It's not economically competitive, but that's a totally different target.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on It's time for a self-reproducing machine · 2024-08-07T22:20:45.413Z · LW · GW

I think you're substantially underestimating the difficulty here, and the proportion of effort which goes into the "starter pack" (aka vitamins) relative to steelworking.

If you're interested in taking this further, I'd suggest:

  • getting involved in the RepRap project
  • initially focusing on just one of producing parts, assembling parts, or controlling machinery. If your system works when teleoperated, or can assemble but not produce a copy of itself, etc., that's already a substantial breakthrough.
  • reading up on NASA's studies on self-replicating machinery, e.g. Freitas 1981 or this paper from 1982 or later work like Chirikjian 2004.
Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on bilalchughtai's Shortform · 2024-07-30T01:40:45.557Z · LW · GW

Well, it seems like a no-brainer to store money you intend to spend after age 60 in such an account; for other purposes it does seem less universally useful. I'd also check the treatment of capital gains, and whether it's included in various assets tests; both can be situationally useful and included in some analogues elsewhere.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · 2024-07-18T22:22:06.963Z · LW · GW

Might be worth putting a short notice at the top of each post saying that, with a link to this post or whatever other resource you'd now recommend? (inspired by the 'Attention - this is a historical document' on e.g. this PEP)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on jacquesthibs's Shortform · 2024-07-07T21:24:42.809Z · LW · GW

I don't think any of these amount to a claim that "to reach ASI, we simply need to develop rules for all the domains we care about". Yes, AlphaGo Zero reached superhuman levels on the narrow task of playing Go, and that's a nice demonstration that synthetic data could be useful, but it's not about ASI and there's no claim that this would be either necessary or sufficient.

(not going to speculate on object-level details though)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Habryka's Shortform Feed · 2024-07-06T00:32:52.038Z · LW · GW

I think that personal incentives are an unhelpful way to try to think about or predict board behavior (for Anthropic and in general), but you can find the current members of our board listed here.

Is there an actual way to criticize Dario and/or Daniela in a way that will realistically be given a fair hearing by someone who, if appropriate, could take some kind of action?

For whom to criticize him/her/them about what? What kind of action are you imagining? For anything I can imagine actually coming up, I'd be personally comfortable raising it directly with either or both of them in person or in writing, and believe they'd give it a fair hearing as well as appropriate follow-up. There are also standard company mechanisms that many people might be more comfortable using (talk to your manager or someone responsible for that area; ask a maybe-anonymous question in various fora; etc). Ultimately executives are accountable to the board, which will be majority-appointed by the Long-Term Benefit Trust from late this year.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Habryka's Shortform Feed · 2024-07-05T20:04:54.516Z · LW · GW

Makes sense - if I felt I had to use an anonymous mechanism, I can see how contacting Daniela about Dario might be uncomfortable. (Although to be clear, I actually think that'd be fine, and I'd also have to think that Sam McCandlish as Responsible Scaling Officer wouldn't handle it.)

If I were doing this today, I guess I'd email another board member, and I'll suggest that we add that as an escalation option.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Habryka's Shortform Feed · 2024-07-04T20:16:53.376Z · LW · GW

OK, let's imagine I had a concern about RSP noncompliance, and felt that I needed to use this mechanism.

(in reality I'd just post in whichever slack channel seemed most appropriate; this happens occasionally for "just wanted to check..." style concerns and I'm very confident we'd welcome graver reports too. Usually that'd be a public channel; for some compartmentalized stuff it might be a private channel and I'd DM the team lead if I didn't have access. I think we have good norms and culture around explicitly raising safety concerns and taking them seriously.)

As I understand it, I'd:

  • Remember that we have such a mechanism and bet that there's a shortcut link. Fail to remember the shortlink name (reports? violations?) and search the list of "rsp-" links; ah, it's rsp-noncompliance. (just did this, and added a few aliases)
  • That lands me on the policy PDF, which explains in two pages the intended scope of the policy, who's covered, the procedure, etc., and contains a link to the third-party anonymous reporting platform. That link is publicly accessible, so I could e.g. make a report from a non-work device or even after leaving the company.
  • I write a report on that platform describing my concerns[1], optionally uploading documents etc. and get a random password so I can log in later to give updates, send and receive messages, etc.
  • The report by default goes to our Responsible Scaling Officer, currently Sam McCandlish. If I'm concerned about the RSO or don't trust them to handle it, I can instead escalate to the Board of Directors (current DRI Daniela Amodei).
  • Investigation and resolution obviously depends on the details of the noncompliance concern.

There are other (pretty standard) escalation pathways for concerns about things that aren't RSP noncompliance. There's not much we can do about the "only one person could have made this report" problem beyond the included strong commitments to non-retaliation, but if anyone has suggestions I'd love to hear them.


  1. I clicked through just now to the point of cursor-in-textbox, but stopped short of submitting a nuisance report. ↩︎

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Habryka's Shortform Feed · 2024-07-01T05:55:12.586Z · LW · GW

I am a current Anthropic employee, and I am not under any such agreement, nor has any such agreement ever been offered to me.

If asked to sign a self-concealing NDA or non-disparagement agreement, I would refuse.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Nathan Young's Shortform · 2024-06-22T16:35:32.219Z · LW · GW

He talked to Gladstone AI founders a few weeks ago; AGI risks were mentioned but not in much depth.