Posts

Anthropic: Reflections on our Responsible Scaling Policy 2024-05-20T04:14:44.435Z
Simple probes can catch sleeper agents 2024-04-23T21:10:47.784Z
Third-party testing as a key ingredient of AI policy 2024-03-25T22:40:43.744Z
Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy 2023-11-01T18:10:31.110Z
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning 2023-10-05T21:01:39.767Z
Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust 2023-09-19T15:09:27.235Z
Anthropic's Core Views on AI Safety 2023-03-09T16:55:15.311Z
Concrete Reasons for Hope about AI 2023-01-14T01:22:18.723Z
In Defence of Spock 2021-04-21T21:34:04.206Z
Zac Hatfield Dodds's Shortform 2021-03-09T02:39:33.481Z

Comments

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on A Rational Company - Seeking Advisors · 2024-09-22T08:56:34.764Z · LW · GW

This sounds to me like the classic rationalist failure mode of doing stuff which is unusually popular among rationalists, rather than studying what experts or top performers are doing and then adopting the techniques, conceptual models, and ways of working that actually lead to good results.

Or in other words, the primary thing when thinking about how to optimize a business is not being rationalist; it is to succeed in business (according to your chosen definition).


Happily there's considerable scholarship on business, and CommonCog has done a fantastic job organizing and explaining the good parts. I highly recommend reading and discussing and reflecting on the whole site - it's a better education in business than any MBA program I know of.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AIS terminology proposal: standardize terms for probability ranges · 2024-09-11T09:35:04.561Z · LW · GW

I further suggest that, if using these defined terms, you include the actual probability range or point estimate in parentheses after the term, instead of including a table of definitions somewhere. This avoids any need to explain the conventions, and makes it clear at the point of use that the author had a precise quantitative definition in mind.

For example: it's likely (75%) that flipping a pair of fair coins will get fewer than two heads, and extremely unlikely (0-5%) that most readers of AI safety papers are familiar with the quantitative convention proposed above - although they may (>20%) be familiar with the general concept. Note that the inline convention allows for other descriptions if they make the sentence more natural!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Zach Stein-Perlman's Shortform · 2024-09-11T09:22:30.162Z · LW · GW

For what it's worth, I endorse Anthropic's confidentiality policies, and am confident that everyone involved in setting them sees the increased difficulty of public communication as a cost rather than a benefit. Unfortunately, the unilateralist's curse and entangled truths mean that confidential-by-default is the only viable policy.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Benito's Shortform Feed · 2024-09-11T03:56:58.126Z · LW · GW

This feels pretty nitpick-y, but whether or not I'd be interested in taking a bet will depend on the odds - in many cases I might take either side, given a reasonably wide spread. Maybe append "at p >= 0.5" to the descriptions to clarify?

The shorthand trading syntax "$size @ $sell_percent / $buy_percent" is especially nice because it expresses the spread you'd accept to take either side of the bet, e.g. "25 @ 85/15 on rain tomorrow" to offer a bet of $25, selling if you think the probability of rain is >85%, buying if you think it's <15%. Seems hard to build this into a reaction though!
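
A toy sketch of how that shorthand could be represented in code (the parsing and field names are my own illustration, not any standard format):

```python
# Parse "size @ sell/buy" offers like "25 @ 85/15" into their three components.
def parse_offer(offer: str) -> dict:
    size, odds = offer.split("@")
    sell, buy = odds.split("/")
    return {
        "stake_dollars": float(size),
        "sell_above_pct": float(sell),  # offerer sells if they think P > this
        "buy_below_pct": float(buy),    # offerer buys if they think P < this
    }

print(parse_offer("25 @ 85/15"))
# {'stake_dollars': 25.0, 'sell_above_pct': 85.0, 'buy_below_pct': 15.0}
```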

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-09-04T18:22:03.621Z · LW · GW

Locked in! Whichever way this goes, I expect to feel pretty good about both the process and the outcome :-)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-09-04T09:56:02.687Z · LW · GW

Nice! I look forward to seeing how this resolves.

Ah, by 'size' I meant the stakes, not the number of locks - did you want to bet the maximum $1k against my $10k, or some smaller proportional amount?

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T08:49:33.747Z · LW · GW

Article IV of the Certificate of Incorporation lists the number of shares of each class of stock, and as that's organized by funding round I expect that you could get a fair way by cross-referencing against public reporting.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-04T08:33:10.202Z · LW · GW

Someone suggested that I point out that this is misleading. The board is not independent: it's two executives, one investor, and one other guy.

As of November this year, the board will consist of the CEO, one investor representative, and three members appointed by the LTBT. I think it's reasonable to describe that as independent, even if the CEO alone would not be, and for this document to be describing the from-November state.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Democracy beyond majoritarianism · 2024-09-04T07:50:38.231Z · LW · GW

If you're interested in this area, I suggest looking at existing scholarship such as Manon Revel's.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-22T03:47:33.603Z · LW · GW

I think we're agreed then, if you want to confirm the size? Then we wait for 2027!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on davekasten's Shortform · 2024-08-22T03:23:31.874Z · LW · GW

See e.g. Table 1 of https://nickbostrom.com/information-hazards.pdf

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-22T02:56:22.199Z · LW · GW

That works for me - thanks very much for helping out!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Tapatakt's Shortform · 2024-08-21T15:51:45.417Z · LW · GW

I don't recall any interpretability experiments with TinyStories offhand, but I'd be surprised if there aren't any.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-21T00:36:38.745Z · LW · GW

I don't think that a thing you can only manufacture once is a practically usable lock; having multiple copies is also useful to facilitate picking attempts and in case of damage - imagine that a few hours into an open pick-this-lock challenge, someone bent a part such that the key no longer opens the lock. I'd suggest resolving neutral in this case, as we only saw a partial attempt.

Other conditions:

  • I think it's important that the design could have at least a thousand distinct keys which are non-pickable. It's fine if the theoretical keyspace is larger so long as the verified-secure keyspace is large enough to be useful, and distinct keys/locks need not be manufactured so long as they're clearly possible.
  • I expect the design to be available in advance to people attempting to pick the lock, just as the design principles and detailed schematics of current mechanical locks are widely known - security through obscurity would not demonstrate that the design is better, only that as-yet-secret designs are harder to exploit.

I nominate @raemon as our arbiter, if both he and you are willing, and the majority vote or nominee of the Lightcone team if Raemon is unavailable for some reason (and @habryka approves that).

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Zach Stein-Perlman's Shortform · 2024-08-20T22:03:19.260Z · LW · GW

I think both Zach and I care about labs doing good things on safety, communicating that clearly, and helping people understand both what labs are doing and the range of views on what they should be doing.  I shared Zach's doc with some colleagues, but won’t try for a point-by-point response.  Two high-level responses:

First, at a meta level, you say:

  1. [Probably Anthropic (like all labs) should encourage staff members to talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic, as long as they (1) clarify that they're not speaking for Anthropic and (2) don't share secrets.]

I do feel welcome to talk about my views on this basis, and often do so with friends and family, at public events, and sometimes even in writing on the internet (hi!).  However, it takes way more effort than you might think to avoid inaccurate or misleading statements while also maintaining confidentiality.  Public writing tends to be higher-stakes due to the much larger audience and durability, so I routinely run comments past several colleagues before posting, and often redraft in response (including these comments and this very point!).

My guess is that most people don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections.  I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine.  That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to engage.

Imagine, if you will, trying to hold a long conversation about AI risk - but you can’t reveal any information about, learned from, or even just informative about LessWrong.  Every claim needs an independent public source, as does any jargon or concept that would give an informed listener information about the site; you have to find different analogies and check that citations are public, and for all that you get pretty regular hostility anyway because of… well, there are plenty of misunderstandings and caricatures to go around.

I run intro-to-AGI-safety courses for Anthropic employees (based on AGI-SF), and we draw a clear distinction between public and confidential resources specifically so that people can go talk to family and friends and anyone else they wish about the public information we cover.

Second, and more concretely, many of these asks are unimplementable for various reasons, and often gesture in a direction without giving reasons to think that there’s a better tradeoff available than we’re already making.  Some quick examples:

  • Both AI Control and safety cases are research areas less than a year old; we’re investigating them and e.g. hiring safety-case specialists, but best practices we could implement don’t exist yet.  Similarly, there simply aren’t any auditors or audit standards for AI safety yet (see e.g. METR’s statement); we’re working to make this possible, but the thing you’re asking for just doesn’t exist yet.  Some implementation questions that “let auditors audit our models” glosses over:

    • If you have dozens of organizations asking to be auditors, and none of them are officially auditors yet, what criteria do you use to decide who you collaborate with?

    • What kind of pre-deployment model access would you provide?  If it’s helpful-only or other nonpublic access, do they meet our security bar to avoid leaking privileged API keys?  (We’ve already seen unauthorized sharing or compromise lead to serious abuse.)

    • How do you decide who gets to say what about the testing?  What if they have very different priorities than you and think that a different level of risk or a different kind of harm is unacceptable?

  • I strongly support Anthropic’s nondisclosure of information about pretraining.  I have never seen a compelling argument that publishing this kind of information is, on net, beneficial for safety.

  • There are many cases where I’d be happy if Anthropic shared more about what we’re doing and what we’re thinking about.  Some of the things you’re asking about I think we’ve already said, e.g. for [7] LTBT changes would require an RSP update, and for [17] our RSP requires us to “enforce an acceptable use policy [against …] using the model to generate content that could cause severe risks to the continued existence of humankind”.

So, saying “do more X” just isn’t that useful; we’ve generally thought about it and concluded that the current amount of X is our best available tradeoff at the moment.  For many more of the other asks above, I just disagree with implicit or explicit claims about the facts in question.  Even for the communication issues where I’d celebrate us sharing more—and for some I expect we will—doing so is yet another demand on heavily loaded people and teams, and it can take longer than we’d like to find the time.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-17T18:43:42.717Z · LW · GW

I agree with you that this feels like a 'compact crux' for many parts of the agenda. I'd like to take your bet; let me reflect on whether there are any additional operationalizations or conditions.

Quick proposals:

  • I win at the end of 2026, if there has not been a formally-verified design for a mechanical lock, OR the design does not verify that it cannot be mechanically picked, OR fewer than three consistent physical instances have been manufactured (e.g. a total of three that includes prototypes or other designs doesn't count).
  • You win if at the end of 2027, there have been credible and failed expert attempts to pick such a lock (e.g. an open challenge at Defcon). I win if there is a successful attempt.
  • Bet resolves neutral, and we each donate half our stakes to a mutually agreed charity, if it's unclear whether production actually happened, or there were no credible attempts to pick a verified lock.
  • Any disputes resolved by the best judgement of an agreed-in-advance arbiter; I'd be happy with the LessWrong team if you and they also agree.
Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Raemon's Shortform · 2024-08-17T06:17:32.534Z · LW · GW

Hmm, usually when I go looking it's because I remember reading a particular post, but there's always some chance of getting tab-sniped into reading just a few more pages...

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Raemon's Shortform · 2024-08-17T00:27:26.180Z · LW · GW

tags: I use them semi-regularly to find related posts when I want to refer to previous discussions of a topic. They work well for that, and I've occasionally added tags when the post I was looking for wasn't tagged yet.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Humanity isn't remotely longtermist, so arguments for AGI x-risk should focus on the near term · 2024-08-13T01:47:23.014Z · LW · GW

I think it's worth distinguishing between what I'll call 'intrinsic preference discounting', and 'uncertain-value discounting'. In the former case, you inherently care less about what happens in the (far?) future; in the latter case you are impartial but rationally discount future value based on your uncertainty about whether it'll actually happen - perhaps there'll be a supernova or something before anyone actually enjoys the utils! Economists often observe the latter, or some mixture, and attribute it to the former.
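
A minimal numerical sketch of the 'uncertain-value' case (toy numbers of my own): an impartial agent that assigns a constant per-year chance the payoff never arrives behaves as if it applies an exponential discount, without intrinsically caring less about the future.

```python
# Toy illustration: a 2% per-year chance the utils never get enjoyed
# (supernova or whatever) looks just like a ~2%/year discount rate.
survival_per_year = 0.98

def uncertainty_discounted_value(utils: float, years_ahead: int) -> float:
    # Expected value today of `utils` enjoyed `years_ahead` years from now.
    return utils * survival_per_year ** years_ahead

print(uncertainty_discounted_value(100, 10))  # ~81.7 expected utils
```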

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-11T19:45:47.363Z · LW · GW

I am a big fan of verification in principle, but don't think it's feasible in such broad contexts in practice. Most of these proposals are far too vague for me to have an opinion on, but I'd be happy to make bets with anyone interested that

  • (3) won't happen (with nontrivial bounds) due to the difficulty of modeling nondeterministic GPUs, varying floating-point formats, etc.
  • (8) won't be attempted, or will fail at some combination of design, manufacture, or just-being-pickable.  This is a great proposal and a beautifully compact crux for the overall approach.
  • with some mutually agreed operationalization against the success of other listed ideas with a physical component

for up to 10k of my dollars at up to 10:1 odds (i.e. min your $1k), placed by end of 2024, to resolve by end of 2026 or as agreed. (I'm somewhat dubious about very long-term bets for implementation reasons)  I'm also happy to discuss conditioning on a registered attempt.


For what it's worth, I was offered and chose to decline co-authorship of Towards Guaranteed Safe AI. Despite my respect for the authors and strong support for this research agenda (as part of a diverse portfolio of safety research), I disagree with many of the central claims of the paper, and see it as uncritically enthusiastic about formal verification and overly dismissive of other approaches.

I do not believe that any of the proposed approaches actually guarantee safety. Safety is a system property,[1] which includes the surrounding social and technical context; and I see serious conceptual and in-practice-fatal challenges in defining both a world-model and a safety spec. (exercise: what could these look like for a biology lab? Try to block unsafe outcomes without blocking Kevin Esvelt's research!)

I do think it's important to reach appropriately high safety assurances before developing or deploying future AI systems which would be capable of causing a catastrophe. However, I believe that the path there is to extend and complement current techniques, including empirical and experimental approaches alongside formal verification - whatever actually works in practice.


  1. My standard recommendation on this is ch. 1-3 of Engineering a Safer World; I expect her recent book is excellent but haven't yet read enough to recommend it. ↩︎
Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Provably Safe AI: Worldview and Projects · 2024-08-10T00:51:16.490Z · LW · GW

Mark Zuckerberg has concluded that it is better to open source powerful AI models for the white hats than to try to keep them secret from the black hats.

Of course, Llama models aren't actually open source - here's my summary, here's LeCun testifying under oath (which he regularly contradicts on Twitter).  I think this substantially undercuts Zuckerberg's argument and don't believe that it's a sincere position, just tactically useful right now.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on It's time for a self-reproducing machine · 2024-08-08T03:54:25.449Z · LW · GW

I'm just guessing from affect, yep, though I still think that "large project and many years of effort" typically describes considerably smaller challenges than my expectation for producing a complete autofac.

On the steel-vs-vitamins question, I'm thinking about "effort" - loose proxy measurements would be the sale value or production headcount rather than kilograms of output. Precisely because steel is easier to transform, it's much less valuable to do so and thus I expect the billions-of-autofacs to be far less economically valuable than a quick estimate might show. Unless of course they start edging in on vitamin production, but then that's the hard rest-of-the-industrial-economy problem...

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on It's time for a self-reproducing machine · 2024-08-08T03:43:58.297Z · LW · GW

To be clear, I think the autofac concept - with external "vitamins" for electronics etc - is in fact technically feasible right now and if teleoperated has been for decades. It's not economically competitive, but that's a totally different target.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on It's time for a self-reproducing machine · 2024-08-07T22:20:45.413Z · LW · GW

I think you're substantially underestimating the difficulty here, and the proportion of effort which goes into the "starter pack" (aka vitamins) relative to steelworking.

If you're interested in taking this further, I'd suggest:

  • getting involved in the RepRap project
  • initially focusing on just one of producing parts, assembling parts, or controlling machinery. If your system works when teleoperated, or can assemble but not produce a copy of itself, etc., that's already a substantial breakthrough.
  • reading up on NASA's studies on self-replicating machinery, e.g. Freitas 1981 or this paper from 1982 or later work like Chirikjian 2004.
Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on bilalchughtai's Shortform · 2024-07-30T01:40:45.557Z · LW · GW

Well, it seems like a no-brainer to store money you intend to spend after age 60 in such an account; for other purposes it does seem less universally useful. I'd also check the treatment of capital gains, and whether it's included in various assets tests; both can be situationally useful and included in some analogues elsewhere.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · 2024-07-18T22:22:06.963Z · LW · GW

Might be worth putting a short notice at the top of each post saying that, with a link to this post or whatever other resource you'd now recommend? (inspired by the 'Attention - this is a historical document' on e.g. this PEP)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on jacquesthibs's Shortform · 2024-07-07T21:24:42.809Z · LW · GW

I don't think any of these amount to a claim that "to reach ASI, we simply need to develop rules for all the domains we care about". Yes, AlphaGo Zero reached superhuman levels on the narrow task of playing Go, and that's a nice demonstration that synthetic data could be useful, but it's not about ASI and there's no claim that this would be either necessary or sufficient.

(not going to speculate on object-level details though)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Habryka's Shortform Feed · 2024-07-06T00:32:52.038Z · LW · GW

I think that personal incentives are an unhelpful way to try to think about or predict board behavior (for Anthropic and in general), but you can find the current members of our board listed here.

Is there an actual way to criticize Dario and/or Daniela in a way that will realistically be given a fair hearing by someone who, if appropriate, could take some kind of action?

For whom to criticize him/her/them about what? What kind of action are you imagining? For anything I can imagine actually coming up, I'd be personally comfortable raising it directly with either or both of them in person or in writing, and believe they'd give it a fair hearing as well as appropriate follow-up. There are also standard company mechanisms that many people might be more comfortable using (talk to your manager or someone responsible for that area; ask a maybe-anonymous question in various fora; etc). Ultimately executives are accountable to the board, which will be majority appointed by the long-term benefit trust from late this year.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Habryka's Shortform Feed · 2024-07-05T20:04:54.516Z · LW · GW

Makes sense - if I felt I had to use an anonymous mechanism, I can see how contacting Daniela about Dario might be uncomfortable. (Although to be clear I actually think that'd be fine, and I'd also have to think that Sam McCandlish as responsible scaling officer wouldn't handle it)

If I was doing this today I guess I'd email another board member; and I'll suggest that we add that as an escalation option.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Habryka's Shortform Feed · 2024-07-04T20:16:53.376Z · LW · GW

OK, let's imagine I had a concern about RSP noncompliance, and felt that I needed to use this mechanism.

(in reality I'd just post in whichever slack channel seemed most appropriate; this happens occasionally for "just wanted to check..." style concerns and I'm very confident we'd welcome graver reports too. Usually that'd be a public channel; for some compartmentalized stuff it might be a private channel and I'd DM the team lead if I didn't have access. I think we have good norms and culture around explicitly raising safety concerns and taking them seriously.)

As I understand it, I'd:

  • Remember that we have such a mechanism and bet that there's a shortcut link. Fail to remember the shortlink name (reports? violations?) and search the list of "rsp-" links; ah, it's rsp-noncompliance. (just did this, and added a few aliases)
  • That lands me on the policy PDF, which explains in two pages the intended scope of the policy, who's covered, the procedure, etc. and contains a link to the third-party anonymous reporting platform. That link is publicly accessible, so I could e.g. make a report from a non-work device or even after leaving the company.
  • I write a report on that platform describing my concerns[1], optionally uploading documents etc. and get a random password so I can log in later to give updates, send and receive messages, etc.
  • The report by default goes to our Responsible Scaling Officer, currently Sam McCandlish. If I'm concerned about the RSO or don't trust them to handle it, I can instead escalate to the Board of Directors (current DRI Daniela Amodei).
  • Investigation and resolution obviously depends on the details of the noncompliance concern.

There are other (pretty standard) escalation pathways for concerns about things that aren't RSP noncompliance. There's not much we can do about the "only one person could have made this report" problem beyond the included strong commitments to non-retaliation, but if anyone has suggestions I'd love to hear them.


  1. I clicked through just now to the point of cursor-in-textbox, but stopped short of submitting a nuisance report. ↩︎

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Habryka's Shortform Feed · 2024-07-01T05:55:12.586Z · LW · GW

I am a current Anthropic employee, and I am not under any such agreement, nor has any such agreement ever been offered to me.

If asked to sign a self-concealing NDA or non-disparagement agreement, I would refuse.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Nathan Young's Shortform · 2024-06-22T16:35:32.219Z · LW · GW

He talked to Gladstone AI founders a few weeks ago; AGI risks were mentioned but not in much depth.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Niche product design · 2024-06-20T16:22:40.471Z · LW · GW

see also: the tyranny of the marginal user

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Maybe Anthropic's Long-Term Benefit Trust is powerless · 2024-06-12T16:52:05.533Z · LW · GW

Incorporating as a Public Benefit Corporation already frees directors' hands; Delaware Title 8, §365 requires them to "balance the pecuniary interests of the stockholders, the best interests of those materially affected by the corporation’s conduct, and the specific public benefit(s) identified in its certificate of incorporation".

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on There Should Be More Alignment-Driven Startups · 2024-05-31T09:31:53.877Z · LW · GW

Without wishing to discourage these efforts, I disagree on a few points here:

Still, the biggest opportunities are often the ones with the lowest probability of success, and startups are the best structures to capitalize on them.

If I'm looking for the best expected value around, that's still monotonic in the probability of success! There are good reasons to think that most organizations are risk-averse (relative to the neutrality of linear $=utils) and startups can be a good way to get around this.

Nonetheless, I remain concerned about regressional Goodhart; and that many founders naively take on the risk appetite of funders who manage a portfolio, without the corresponding diversification (if all your eggs are in one basket, watch that basket very closely). See also Inadequate Equilibria and maybe Fooled by Randomness.

Meanwhile, strongly agreed that AI safety driven startups should be B corps, especially if they're raising money.

Technical quibble: "B Corp" is a voluntary private certification; PBC is a corporate form which imposes legal obligations on directors. I think many of the B Corp criteria are praiseworthy, but certification is neither necessary nor sufficient as an alternative to PBC status - and getting certified is probably a poor use of founders' time and attention, which are at such a premium in a startup.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on There Should Be More Alignment-Driven Startups · 2024-05-31T06:05:22.681Z · LW · GW

My personal opinion is that starting a company can be great, but I've also seen several fail due to the gaps between the founders' personal goals, a work-it-out-later business plan, and the duties that they and their board owe to investors.

IMO any purpose-driven company should be founded as a Public Benefit Corporation, to make it clear in advance and in law that you'll also consider the purpose and the interests of people materially affected by the company alongside investor returns. (cf § 365. Duties of directors)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-21T22:17:16.721Z · LW · GW

Enforcement of mitigations when it's someone else who removes them won't be seen as relevant, since in this religion a contributor is fundamentally not responsible for how the things they release will be used by others.

This may be true of people who talk a lot about open source, but among actual maintainers the attitude is pretty different. If some user causes harm with an overall positive tool, that's on the user; but if the contributor has built something consistently or overall harmful, that is indeed on them. Maintainers tend to avoid working on projects which are mostly useful for surveillance, weapons, etc. for pretty much this reason.

Source: my personal experience as a maintainer and PSF Fellow, and the multiple Python core developers I just checked with at the PyCon sprints.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-21T21:59:04.670Z · LW · GW

Thanks for these clarifications. I didn't realize that the 30% was for the new yellow-line evals rather than the current ones.

That's how I was thinking about the predictions that I was making; others might have been thinking about the current evals, for which those predictions would be more stable.

I'm having trouble parsing this sentence. What you mean by "doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals"? Doesn't pausing include focusing on mitigations and evals?

Of course, but pausing also means we'd have to shuffle people around, interrupt other projects, and deal with a lot of other disruption (the costs of pausing). Ideally, we'd continue updating our yellow-line evals to stay ahead of model capabilities until mitigations are ready.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-21T16:56:48.466Z · LW · GW

The yellow-line evals are already a buffer ('sufficient to rule out red-lines') which are themselves a buffer (6x effective compute) before actually-dangerous situations. Since triggering a yellow-line eval requires pausing until we either have safety and security mitigations in place or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals. I therefore think it's reasonable to keep going basically regardless of the probability of triggering in the next round of evals. I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we'd re-run them ahead of schedule.

I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we'd develop for the next round, and enough calibration training to avoid very low probabilities.

Finally, the point of these estimates is that they can guide research and development prioritization - high estimates suggest that it's worth investing in more difficult yellow-line evals, and/or that elicitation research seems promising. Tying a pause to that estimate is redundant with the definition of a yellow-line, and would risk some pretty nasty epistemic distortions.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-21T16:35:32.814Z · LW · GW

What about whistleblowing or anonymous reporting to governments? If an Anthropic employee was so concerned about RSP implementation (or more broadly about models that had the potential to cause major national or global security threats), where would they go in the status quo?

That really seems more like a question for governments than for Anthropic! For example, the SEC or IRS whistleblower programs operate regardless of what companies purport to "allow", and I think it'd be cool if the AISI had something similar.

If I was currently concerned about RSP implementation per se (I'm not), it's not clear why the government would get involved in a matter of voluntary commitments by a private organization. If there was some concern touching on the White House commitments, Bletchley declaration, Seoul declaration, etc., then I'd look up the appropriate monitoring body; if in doubt the Commerce whistleblower office or AISI seem like reasonable starting points.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-20T06:05:23.622Z · LW · GW

"red line" vs "yellow line"

Passing a red-line eval indicates that the model requires ASL-n mitigations. Yellow-line evals are designed to be easier to implement and/or run, while maintaining the property that if you fail them you would also fail the red-line evals. If a model passes the yellow-line evals, we have to pause training and deployment until we put a higher standard of security and safety measures in place, or design and run new tests which demonstrate that the model is below the red line. For example, leaving out the "register a typo'd domain" step from an ARA eval, because there are only so many good typos for our domain.

assurance mechanisms

Our White House commitments mean that we're already reporting safety evals to the US Government, for example. I think the natural reading of "validated" is some combination of those, though obviously it's very hard to validate that whatever you're doing is 'sufficient' security against serious cyberattacks or safety interventions on future AI systems. We do our best.

I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)

What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn't mean much compared to whether it's actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don't have a non-disparagement agreement). On the other hand, sharing internal docs eats a bunch of time in review before release, risks someone seizing on a misinterpretation and leaping to conclusions, and has other costs.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-20T05:41:06.319Z · LW · GW

I believe that meeting our ASL-2 deployment commitments - e.g. enforcing our acceptable use policy, and data-filtering plus harmlessness evals for any fine-tuned models - with widely available model weights is presently beyond the state of the art. If a project or organization makes RSP-like commitments, evaluates and mitigates risks, and can uphold that while releasing model weights... I think that would be pretty cool.

(also note that e.g. Llama is not open source - I think you're talking about releasing weights; the license doesn't affect safety but as an open-source maintainer the distinction matters to me)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AISN #35: Lobbying on AI Regulation Plus, New Models from OpenAI and Google, and Legal Regimes for Training on Copyrighted Data · 2024-05-16T18:34:45.327Z · LW · GW

While some companies, such as OpenAI and Anthropic, have publicly advocated for AI regulation, Time reports that in closed-door meetings, these same companies "tend to advocate for very permissive or voluntary regulations."

I think that dropping the intermediate text which describes 'more established big tech companies' such as Microsoft substantially changes the meaning of this quote - "these same companies" is not "OpenAI and Anthropic". Full context:

Executives from the newer companies that have developed the most advanced AI models, such as OpenAI CEO Sam Altman and Anthropic CEO Dario Amodei, have called for regulation when testifying at hearings and attending Insight Forums. Executives from the more established big technology companies have made similar statements. For example, Microsoft vice chair and president Brad Smith has called for a federal licensing regime and a new agency to regulate powerful AI platforms. Both the newer AI firms and the more established tech giants signed White House-organized voluntary commitments aimed at mitigating the risks posed by AI systems. But in closed door meetings with Congressional offices, the same companies are often less supportive of certain regulatory approaches

AI lab watch makes it easy to get some background information by comparing commitments made by OpenAI, Anthropic, Microsoft, and some other established big tech companies.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on "Open Source AI" is a lie, but it doesn't have to be · 2024-05-12T20:05:22.159Z · LW · GW

Meta’s Llama3 model is also *not* open source, despite the Chief AI Scientist at the company, Yann LeCun, frequently proclaiming that it is.

This is particularly annoying because he knows better: the latter two of those three tweets are from January 2024, and here's video of his testimony under oath in September 2023: "the Llama system was not made open-source".

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · 2024-04-25T15:35:05.260Z · LW · GW

It's a sparse autoencoder because part of the loss function is an L1 penalty encouraging sparsity in the hidden layer. Otherwise, it would indeed learn a simple identity map!
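
A minimal sketch of how the L1 term enters the objective (my own illustration, not the paper's actual training code; names like `d_hidden` and `l1_coeff` are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # feature activations in the hidden layer
        return self.decoder(f), f

def loss_fn(x, x_hat, f, l1_coeff: float = 1e-3):
    reconstruction = F.mse_loss(x_hat, x)           # reproduce the input...
    sparsity = l1_coeff * f.abs().sum(-1).mean()    # ...using few active features
    return reconstruction + sparsity                # drop the L1 term and an identity map wins
```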

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Scenario planning for AI x-risk · 2024-04-23T23:52:34.982Z · LW · GW

Tom Davidson's work on a compute-centric framework for takeoff speed is excellent, IMO.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on What is the best way to talk about probabilities you expect to change with evidence/experiments? · 2024-04-20T05:17:19.307Z · LW · GW

you CAN predict that there will be evidence with equal probability of each direction.

More precisely, the expected value of upwards and downwards updates should be the same; it's nonetheless possible to be very confident that you'll update in a particular direction - offset by a much larger and proportionately less likely update in the other direction.

For example, I have some chance of winning the lottery this year, not much lower than if I actually bought a ticket. I'm very confident that each day I'll give somewhat lower odds (as there's less time remaining), but being credibly informed that I've won would radically change the odds such that the expectation balances out.
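
A toy calculation (numbers are mine, purely illustrative) showing how the small daily downward updates are balanced by the tiny chance of a huge upward one:

```python
p_win_year = 1e-6                 # prior on winning at some point this year
p_win_today = p_win_year / 100    # roughly one of ~100 remaining draws

p_if_lost = (p_win_year - p_win_today) / (1 - p_win_today)  # slightly lower
p_if_won = 1.0                                              # radically higher

expected_posterior = (1 - p_win_today) * p_if_lost + p_win_today * p_if_won
print(expected_posterior)         # equals the prior, up to floating-point error
```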

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Tamsin Leake's Shortform · 2024-03-17T07:31:03.914Z · LW · GW

I agree that there's no substitute for thinking about this for yourself, but I think that morally or socially counting "spending thousands of dollars on yourself, an AI researcher" as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it's-good funding arrangements in this space for my taste, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it's very easy for donating to people or organizations in your social circle to have substantial negative expected value.

I'm glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on 'Empiricism!' as Anti-Epistemology · 2024-03-14T07:11:51.146Z · LW · GW

Trivially true to the extent that you are about equally likely to observe a thing throughout that timespan; and the Lindy Effect is at least regularly talked of.

But there are classes of observations for which this is systematically wrong: for example, most people who see a ship part-way through a voyage will do so while it's either departing or arriving in port. Investment schemes are just such a class, because markets are usually up to the task of consuming alpha and tend to be better when the idea is widely known - even Buffett's returns have oscillated around the index over the last few years!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Is anyone working on formally verified AI toolchains? · 2024-03-13T01:01:52.839Z · LW · GW

Safety properties aren't the kind of properties you can prove; they're statements about the world, not about mathematical objects. I very strongly encourage anyone reading this comment to go read Leveson's Engineering a Safer World (free pdf from author) through to the end of chapter three - it's the best introduction to systems safety that I know of and a standard reference for anyone working with life-critical systems. how.complexsystems.fail is the short-and-quotable catechism.


I'm not really sure what you mean by "AI toolchain", nor what threat model would have a race condition present an existential risk. More generally, formal verification is a research topic - there are some neat demonstration systems, and they're used in certain niches with relatively small amounts of code and compute, simple hardware, and where high development times are acceptable. None of those are true of AI systems, or even libraries such as Pytorch.

For flavor, some of the most exciting developments in formal methods: I expect the Lean FRO to improve usability, and 'autoformalization' tricks like Proofster (pdf) might also help - but it's still niche, and "proven correct" software can still have bugs from under-specified components, incorrect axioms, or outright hardware issues (e.g. Spectre, Rowhammer, cosmic rays, etc.). The seL4 microkernel is great, but you still have to supply an operating system and application layer, and then ensure the composition is still safe. To test an entire application stack, I'd instead turn to Antithesis, which is amazing so long as you can run everything in an x86 hypervisor (with no GPUs).

(as always, opinions my own)