Third-party testing as a key ingredient of AI policy 2024-03-25T22:40:43.744Z
Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy 2023-11-01T18:10:31.110Z
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning 2023-10-05T21:01:39.767Z
Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust 2023-09-19T15:09:27.235Z
Anthropic's Core Views on AI Safety 2023-03-09T16:55:15.311Z
Concrete Reasons for Hope about AI 2023-01-14T01:22:18.723Z
In Defence of Spock 2021-04-21T21:34:04.206Z
Zac Hatfield Dodds's Shortform 2021-03-09T02:39:33.481Z


Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Tamsin Leake's Shortform · 2024-03-17T07:31:03.914Z · LW · GW

I agree that there's no substitute for thinking about this for yourself, but I think that morally or socially counting "spending thousands of dollars on yourself, an AI researcher" as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it's-good funding arrangements in this space for me, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it's very easy for donating to people or organizations in your social circle to have substantial negative expected value.

I'm glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on 'Empiricism!' as Anti-Epistemology · 2024-03-14T07:11:51.146Z · LW · GW

Trivially true to the extent that you are about equally likely to observe a thing throughout that timespan; and the Lindy Effect is at least regularly talked of.

But there are classes of observations for which this is systematically wrong: for example, most people who see a ship part-way through a voyage will do so while it's either departing or arriving in port. Investment schemes are just such a class, because markets are usually up to the task of consuming alpha and tend to be better when the idea is widely known - even Buffett's returns have oscillated around the index over the last few years!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Is anyone working on formally verified AI toolchains? · 2024-03-13T01:01:52.839Z · LW · GW

Safety properties aren't the kind of properties you can prove; they're statements about the world, not about mathematical objects. I very strongly encourage anyone reading this comment to go read Leveson's Engineering a Safer World (free pdf from author) through to the end of chapter three - it's the best introduction to systems safety that I know of and a standard reference for anyone working with life-critical systems. is the short-and-quotable catechism.

I'm not really sure what you mean by "AI toolchain", nor what threat model would have a race-condition present an existential risk. More generally, formal verification is a research topic - there are some neat demonstration systems and they're used in certain niches with relatively small amounts of code and compute, simple hardware, and where high development times are acceptable. None of those are true of AI systems, or even libraries such as PyTorch.

For flavor, some of the most exciting developments in formal methods: I expect the Lean FRO to improve usability, and 'autoformalization' tricks like Proofster (pdf) might also help - but it's still niche, and "proven correct" software can still have bugs from under-specified components, incorrect axioms, or outright hardware issues (e.g. Spectre, Rowhammer, cosmic rays, etc.). The seL4 microkernel is great, but you still have to supply an operating system and application layer, and then ensure the composition is still safe. To test an entire application stack, I'd instead turn to Antithesis, which is amazing so long as you can run everything in an x86 hypervisor (with no GPUs).

(as always, opinions my own)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on philh's Shortform · 2024-03-09T19:22:41.022Z · LW · GW

I think he's actually quite confused here - I imagine saying

Hang on - you say that (a) we can think, and (b) we are the instantiations of any number of computer programs. Wouldn't instantiating one of those programs be a sufficient condition of understanding? Surely if two things are isomorphic even in their implementation, either both can think, or neither.

(the Turing test suggests 'indistinguishable in input/output behaviour', which I think is much too weak)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Decaeneus's Shortform · 2024-03-03T03:47:10.841Z · LW · GW

See e.g.

Fuzzing is a generally pretty healthy subfield, but even there most peer-reviewed papers in top venues are still completely useless! Importantly, "a 'working' github repo" is really not enough to ensure that your results are reproducible, let alone ensure external validity.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Cryonics p(success) estimates are only weakly associated with interest in pursuing cryonics in the LW 2023 Survey · 2024-03-01T01:26:49.366Z · LW · GW

people’s subjective probability of successful restoration to life in the future, conditional on there not being a global catastrophe destroying civilization before then. This is also known as p(success).

This definition seems relevantly modified by the conditional!

You also seem to be assuming that "probability of revival" could be a monocausal explanation for cryonics interest, but I find that implausible ex ante. Monocausality approximately doesn't exist, and "is being revived good in expectation / good with what probability" are also common concerns. (CF)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Decaeneus's Shortform · 2024-02-21T21:34:08.269Z · LW · GW

Very little, because most CS experiments are not in fact replicable (and that's usually only one of several serious methodological problems).

CS does seem somewhat ahead of other fields I've worked in, but I'd attribute that to the mostly-separate open source community rather than academia per se.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on D0TheMath's Shortform · 2024-02-11T08:25:40.115Z · LW · GW

My impression is that the effects of genes which vary between individuals are essentially independent, and small effects are almost always locally linear. With the amount of measurement noise and number of variables, I just don't think we could pick out nonlinearities or interaction effects of any plausible strength if we tried!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Constituency-sized AI congress? · 2024-02-11T07:11:25.846Z · LW · GW

I think there's a lot of interesting potential in such ideas - but that this isn't ambitious enough! Democracy isn't just about compromising on the issues on the table; the best forms involve learning more and perhaps changing our minds... as well as, yes, trying to find creative win-win outcomes that everyone can at least accept.

I think that trying to improve democracy with better voting systems is fairly similar to trying to improve the economy with better price and capital-allocation systems. In both cases, there have been enormous advances since the mid-1800s; in both there's a realistic prospect of modern computers enabling wildly better-than-historical systems; and in both cases it focuses effort on a technical subproblem which is not sufficient and maybe not even necessary. (and also there's the spectre of communism in Europe haunting both)

A few bodies of thought and work on this that I like:

But as usual, the hard and valuable part is the doing!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Scenario planning for AI x-risk · 2024-02-10T10:24:19.643Z · LW · GW

Most discussions of AI x-risk consider a subset of this [misuse / structural / accidental / agentic] taxonomy. ... Anthropic’s Responsible Scaling Policy is designed with only “misuse” and “autonomy and replication” in mind.

No, we've[1] been thinking about all four of these aspects!

  • Misuse is obvious - our RSP defines risk levels, evals, and corresponding safeguards and mitigations before continued training or deployment.
  • Structural risks are obviously not something we can solve unilaterally, but nor are we neglecting them. The baseline risk comparisons in our RSP are specifically excluding other providers' models, so that e.g. we don't raise the bar on allowable cyberoffense capabilities even if a competitor has already released a more-capable model. (UDT approved strategy!) Between making strong unilateral safety commitments, advancing industry-best-practice, and supporting public policy through e.g. testimony and submissions to government enquiries, I'm fairly confident that our net contribution to structural risks is robustly positive.
  • Accident and agentic risks are IMO on a continuous spectrum - you could think of the underlying factor as "how robustly-goal-pursuing is this system?", with accidents being cases where it was shifted off-goal-distribution and agentic failures coming from a treacherous turn by a schemer. We do technical safety research to address various points on this spectrum, e.g. Constitutional AI or investigating faithfulness of chain-of-thought to improve robustness of prosaic alignment, and our recent Sleeper Agents paper on more agentic risks. Accidents are more linked to specific deployments though, and correspondingly less emphasized in our RSP - though if you can think of a good way to evaluate accident risks before deployment, let me know!

  1. as usual, these are my opinions only, I'm not speaking for my employer. Further hedging omitted for clarity. ↩︎

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Scenario planning for AI x-risk · 2024-02-10T10:21:54.822Z · LW · GW

By “sustainability,” I mean that a theory of victory should ideally not reduce AI x-risk per year to a constant, low level, but instead continue to reduce AI x-risk over time. In the former case, “expected time to failure”[31] would remain constant, and total risk over a long enough time period would inevitably reach unacceptable levels. (For example, a 1% chance of an existential catastrophe per year implies an approximately 63% chance over 100 years.)

Obviously yes, a 1% pa chance of existential catastrophe is utterly unacceptable! I'm not convinced that "continues to reduce over time" is the right framing though; if we achieved a low enough constant rate for a MTBF of many millions of years I'd expect other projects to have higher long-term EV given the very-probably-finite future resources available anyway. I also expect that the challenge is almost entirely in getting to an acceptably low rate, not in the further downward trend, so it's really a moot point.
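
For concreteness, the arithmetic behind the quoted 63% figure and the MTBF framing can be sketched as below. This is a toy model that assumes a constant, independent risk each year; the function names and the one-in-ten-million example rate are my own illustrations, not anything from the post.

```python
def cumulative_risk(annual_risk: float, years: int) -> float:
    """Chance of at least one catastrophe within `years`,
    assuming the same independent risk every year."""
    return 1.0 - (1.0 - annual_risk) ** years


def mean_time_between_failures(annual_risk: float) -> float:
    """Expected number of years until the first failure
    (mean of a geometric distribution with this per-year risk)."""
    return 1.0 / annual_risk


# The quoted example: 1% per year over a century.
print(round(cumulative_risk(0.01, 100), 3))  # 0.634

# A hypothetical constant rate of one in ten million per year
# gives an MTBF on the order of ten million years.
print(mean_time_between_failures(1e-7))
```

Under these assumptions a constant rate only becomes tolerable when it is many orders of magnitude below 1% pa, which is the sense in which getting to the low rate is the hard part.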

(I'm looking forward to retiring from this kind of thing if or when I feel that AI risk and perhaps synthetic biorisk are under control, and going back to low-stakes software engineering r&d... though not making any active plans)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Scenario planning for AI x-risk · 2024-02-10T10:20:45.537Z · LW · GW

However, the data they use to construct inferential relationships are expert forecasts. Therefore, while their four scenarios might accurately describe clusters of expert forecasts, they should only be taken as predictively valuable to the extent that one takes expert forecasts to be predictively valuable.

No, it's plausible that this kind of scenario or cluster is more predictively accurate than taking expert forecasts directly. In practice, this happens when experts disagree on (latent) state variables, but roughly agree on dynamics - for example there might be widespread disagreement on AGI timelines, but agreement that

  • if scaling laws and compute trends hold and no new paradigm is needed, AGI timelines of five to ten years are plausible
  • if the LLM paradigm will not scale to AGI, we should have a wide probability distribution over timelines, say from 2040 -- 2100

and then assigning relative probability to the scenarios can be a later exercise. Put another way, forming scenarios or clusters is more like formulating an internally-coherent hypothesis than updating on evidence.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Scenario planning for AI x-risk · 2024-02-10T10:20:24.809Z · LW · GW

“The most pressing practical question for future work is: why were superforecasters so unmoved by experts’ much higher estimates of AI extinction risk, and why were experts so unmoved by the superforecasters’ lower estimates? The most puzzling scientific question is: why did rational forecasters, incentivized by the XPT to persuade each other, not converge after months of debate and the exchange of millions of words and thousands of forecasts?”

This post by Peter McCluskey, a participating superforecaster, renders the question essentially non-puzzling to me. Doing better would be fairly simple, although attracting and incentivising the relevant experts would be fairly expensive.

  • The questions were in many cases somewhat off from the endpoints we care about, or framed in ways that I believe would distort straightforward attempts to draw conclusions
  • The incentive structure of predicting the apocalypse is necessarily screwy, and using a Keynesian beauty prediction contest doesn't really fix it
  • Most of the experts and superforecasters just don't know much about AI, and thought that (as of 2022) the recent progress was basically just hype. Hopefully it's now clear that this was just wrong?

Some selected quotes:

I didn’t notice anyone with substantial expertise in machine learning. Experts were apparently chosen based on having some sort of respectable publication related to AI, nuclear, climate, or biological catastrophic risks. ... they’re likely to be more accurate than random guesses. But maybe not by a large margin.

Many superforecasters suspected that recent progress in AI was the same kind of hype that led to prior disappointments with AI. I didn't find a way to get them to look closely enough to understand why I disagreed. My main success in that area was with someone who thought there was a big mystery about how an AI could understand causality. I pointed him to Pearl, which led him to imagine that problem might be solvable.

I didn't see much evidence that either group knew much about the subject that I didn't already know. So maybe most of the updates during the tournament were instances of the blind leading the blind. None of this seems to be as strong evidence as the changes, since the tournament, in opinions of leading AI researchers, such as Hinton and Bengio.

I think the core problem is actually that it's really hard to get good public predictions of AI progress, in any more detail than "extrapolate compute spending, hardware price/performance, scaling laws, and then guess at what downstream-task performance that implies (and whether we'll need a new paradigm for AGI [tbc: no!])". To be clear, I think that's a stronger baseline than the forecasting tournament achieved!

But downstream task performance is hard to predict, and there's a fair bit of uncertainty in the other parameters too. Details are somewhere between "trade secrets" and "serious infohazards", and the people who are best at predicting AI progress mostly - for that reason! - work at frontier labs with AI-xrisk-mitigation efforts. I think it's likely that inferring frontier lab [members]'s beliefs from their actions and statements would give you better estimates than another such tournament.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Scenario planning for AI x-risk · 2024-02-10T10:20:08.364Z · LW · GW

I'm a big fan of scenario modelling in general, and loved this post reviewing its application to AI xrisk. Thanks for writing it!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on How has internalising a post-AGI world affected your current choices? · 2024-02-05T08:38:29.197Z · LW · GW

In 2021 I dropped everything else I was doing and moved across the Pacific Ocean to join Anthropic, which I guess puts me in group three. However, I think you should also take seriously the possibility that AGI won't change everything soon - whether because of technical limitations, policy decisions to avoid building more-capable (/dangerous) AI systems, or something which we haven't seen coming at all. Even if you're only wrong about the timing, you could have a remarkably bad time.

So my view is that almost nothing on your list has enough of an upside to be worth the nontrivial chance of very large downsides - though by all means spend more time with family and friends! I believe there was a period in the Cold War where many researchers at RAND declined to save for retirement, but saving for retirement is not actually that expensive. Save for retirement, don't smoke, don't procrastinate about cryonics, and live a life you think is worth living.

And, you know, I think there are concrete reasons for hope and that careful, focussed effort can improve our odds. If AI is going to be a major focus of your life, make that productive instead of nihilistic.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The economy is mostly newbs (strat predictions) · 2024-02-03T09:00:58.097Z · LW · GW

Yep, I'm happy to take the won't-go-down-like-that side of the bet. See you in ten years!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The economy is mostly newbs (strat predictions) · 2024-02-03T07:38:10.164Z · LW · GW

How about "inflation-adjusted market cap of 50% of the Fortune 500 as at Jan 1st 2024 is down by 80% or more as of Jan 1st 2034".

It's a lot easier to measure but I think captures the spirit? I'd be down for an even-odds bet of your chosen size, paid as a donation to GiveWell if I win, or to you or your charity of choice if I lose.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The economy is mostly newbs (strat predictions) · 2024-02-02T23:38:36.544Z · LW · GW

The average investor will notice almost all their investments go to zero except for a few corps

I'd like to bet against this, if you want to formalize it enough to have someone judge it in ten years.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AI #48: The Talk of Davos · 2024-01-26T07:41:51.066Z · LW · GW

"X is open source" has a specific meaning for software, and Llama models are not open source according to this important legal definition.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AI #48: The Talk of Davos · 2024-01-26T07:20:27.656Z · LW · GW

Typo: the link to the Nature news article on Sleeper Agents is broken.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Most People Don't Realize We Have No Idea How Our AIs Work · 2023-12-27T00:32:06.201Z · LW · GW

See Urschleim in Silicon: Return-Oriented Program Evolution with ROPER (2018).

Incidentally this is among my favorite theses, with a beautiful elucidation of 'weird machines' in chapter two. Recommended reading if you're at all interested in computers or computation.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Constellations are Younger than Continents · 2023-12-26T23:24:55.404Z · LW · GW

Agreed, "ancient aeons" would be my preferred edit.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Constellations are Younger than Continents · 2023-12-26T01:21:55.653Z · LW · GW

In that spirit, my complaints about Time Wrote The Rocks:

We gaze upon creation where erosion makes it known,
And count the countless aeons in the banding of the stone.

Geological time isn't countless; the earth is only around 4.54 ± 0.05 billion years old and the universe 13.7 ± 0.2 billion years old. Seems worth getting the age of creation right! In the technical sense, there are exactly four eons: the Hadean, Archean, Proterozoic and Phanerozoic. It's awkward to find a singable replacement though, since "era" and "age" also have technical definitions.

The best replacements I've thought of are to either go with "ancient aeons" and trust everyone to understand that we're not using the technical definition, or "And see the length of history in" if a nod to chronostratigraphy seems worth the cost in lyricism.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AI #43: Functional Discoveries · 2023-12-24T09:38:52.937Z · LW · GW

(Does this always reflect agents’ “real” reasoning? Need more work on this!)

Conveniently, we already published results on this, and the answer is no!  

Per Measuring Faithfulness in Chain-of-Thought Reasoning and Question Decomposition Improves the Faithfulness of Model-Generated Reasoning, chain of thought reasoning is often "unfaithful" - the model reaches a conclusion for reasons other than those given in the chain of thought, and indeed sometimes contrary to it.  Question decomposition - splitting across multiple independent contexts - helps but does not fully eliminate the problem.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Employee Incentives Make AGI Lab Pauses More Costly · 2023-12-23T10:57:44.625Z · LW · GW

While we would obviously prefer to meet all our ASL-3 commitments before we ever train an ASL-3 model, we have of course also thought about and discussed what we'd do in the other case. I expect that it would be a stressful time, but for reasons upstream of the conditional pause!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Finding Sparse Linear Connections between Features in LLMs · 2023-12-09T07:22:19.171Z · LW · GW

Quick plotting tip: when lines (or dots, or anything else) are overlapping, passing alpha=0.6 gives you a bit of transparency and makes it much easier to see what's going on. I think this would make most of your line plots a bit more informative, although I've found it most useful to avoid saturating scatterplots.
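
A minimal sketch of the tip, assuming matplotlib is the plotting library in use (the comment doesn't name one, though `alpha` is the matplotlib keyword). The data here is synthetic and the variable names are only for illustration.

```python
import random

import matplotlib

matplotlib.use("Agg")  # headless backend, so this runs without a display
import matplotlib.pyplot as plt

rng = random.Random(0)

fig, (ax_lines, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Five overlapping random walks: partial transparency keeps each one visible
# where they cross.
for _walk_index in range(5):
    level, walk = 0.0, []
    for _step in range(200):
        level += rng.gauss(0, 1)
        walk.append(level)
    ax_lines.plot(walk, alpha=0.6)

# A dense scatterplot: with alpha below 1, stacked points look darker,
# so the crowded regions stay distinguishable instead of saturating.
xs = [rng.gauss(0, 1) for _ in range(2000)]
ys = [x + rng.gauss(0, 1) for x in xs]
scatter = ax_scatter.scatter(xs, ys, alpha=0.6, s=8)
```

Calling `fig.savefig(...)` with a filename of your choice writes the figure out; for very dense scatterplots a lower alpha may work better than 0.6.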

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Disappointing Table Refinishing · 2023-12-04T06:09:18.451Z · LW · GW

I really, really like the finish from osmo oil (polyx for tables or topoil for eg cutting board level food safety). Requires sanding back the first time, but subsequent coats or touchups are apply-and-buff only; you might want a buffer but I've done it by hand too.

Aesthetically, it's very close to a raw timber finish, but very water resistant (and alcohol resistant!), durable, and maintains the tactile experience of raw wood pretty well too - not sticky or slick or coated. Slightly more work than polyurethane but IMO worth it.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Black Box Biology · 2023-12-01T07:01:54.966Z · LW · GW

You don’t need to understand the mechanism of action. You don’t need an animal model of disease. You just need a reasonable expectation that changing a genetic variant will have a positive impact on the thing you care about.  And guess what?  We already have all that information.  We’ve been conducting genome-wide association studies for over a decade.

This is critically wrong, because association is uninformative about interventions.

Genome-wide association studies are enormously informative about correlations in a particular environment.  In a complex setting like genomics, and where there are systematic correlations between environmental conditions and population genetics to confound your analysis, this doesn't give you much reason to think that interventions will have the desired effect!

I'd recommend reading up on causal statistics; Pearl's Book of Why is an accessible introduction and the wikipedia page is also reasonably good.
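
To illustrate the gap between association and intervention, here is a toy simulation. It is a sketch under strong assumptions and not a model of real genomics: a single hypothetical "environment" variable drives both the variant and the trait, and the variant has zero causal effect by construction.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n, rng, set_variant_to=None):
    """Draw (variant, trait) pairs. The environment confounds both;
    the trait never depends on the variant itself."""
    variants, traits = [], []
    for _ in range(n):
        environment = rng.gauss(0, 1)
        if set_variant_to is None:
            variant = environment + rng.gauss(0, 0.5)  # observational setting
        else:
            variant = set_variant_to  # intervention: we fix the variant directly
        trait = 2.0 * environment + rng.gauss(0, 0.5)
        variants.append(variant)
        traits.append(trait)
    return variants, traits


rng = random.Random(0)

# Observational data shows a strong association...
variants, traits = simulate(10_000, rng)
observed_correlation = pearson(variants, traits)

# ...but setting the variant by hand does nothing to the trait.
_, traits_low = simulate(10_000, rng, set_variant_to=-1.0)
_, traits_high = simulate(10_000, rng, set_variant_to=1.0)
intervention_effect = sum(traits_high) / len(traits_high) - sum(traits_low) / len(traits_low)

print(f"observed correlation: {observed_correlation:.2f}")
print(f"effect of intervening: {intervention_effect:.2f}")
```

In this toy setup the correlation is large while the interventional effect is close to zero, which is the failure mode that a purely associational study cannot rule out on its own.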

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Stupid Question: Why am I getting consistently downvoted? · 2023-11-30T22:54:40.480Z · LW · GW

Done! Setting a calendar reminder; see you in a year.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Stupid Question: Why am I getting consistently downvoted? · 2023-11-30T22:50:50.756Z · LW · GW

Yep, arithmetic matches. However if 10K is the limit you can reasonably afford, I'd be more comfortable betting my $1 against your $2000.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Stupid Question: Why am I getting consistently downvoted? · 2023-11-30T22:31:34.763Z · LW · GW

I have around 2K karma and will take that bet at those odds, for up to 1000 dollars on my side.

Resolution criteria are to ask EY about his views on this sequence as of December 1st 2024, literally "which of Zac or MadHatter won this bet", and resolves no payment if he declines to respond or does not explicitly rule for any other reason.

I'm happy to pay my loss by eg Venmo, and would request winnings as a receipt for your donation to GiveWell's all-grants fund.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Shallow review of live agendas in alignment & safety · 2023-11-29T18:49:48.175Z · LW · GW

Zac Hatfield-Dobbs

Almost but not quite my name!  If you got this from somewhere else, let me know and I'll go ping them too?

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Zac Hatfield Dodds's Shortform · 2023-11-28T23:09:22.843Z · LW · GW

I'd love a "post a hash" feature, where I could make a private post with hashes of a post's title and body. Then when the post is publishable, it could include a verified-by-LW "written at" timestamp as well as the time it was actually published and some time-of-publication post-matter.

Idea prompted by re-reading a private doc I wrote earlier this year, and thinking that it'd be nice to have trustable timestamps if or when it's feasible to publish such docs. Presumably others are in a similar position; it's a neat feature for e.g. predictions (although you'd want some way to avoid the "hash both sides, reveal correct" problem).
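
A rough sketch of the commit-and-reveal flow, assuming SHA-256 over a JSON encoding of the post. The random salt is my own addition rather than part of the idea above: it stops anyone from guessing a short post by brute force, but it does nothing about the "hash both sides, reveal correct" problem.

```python
import hashlib
import json
import secrets


def commit(title, body, salt=None):
    """Return (digest, salt). Publish the digest now; keep salt, title and body private."""
    if salt is None:
        salt = secrets.token_hex(16)
    # JSON encoding keeps the three fields unambiguous, unlike naive concatenation.
    payload = json.dumps([salt, title, body]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest(), salt


def verify(digest, salt, title, body):
    """Check a revealed post against the digest that was published earlier."""
    recomputed, _ = commit(title, body, salt)
    return secrets.compare_digest(recomputed, digest)


# Hypothetical usage: commit today, reveal when the doc is publishable.
digest, salt = commit("A private doc", "Text written earlier this year.")
print(verify(digest, salt, "A private doc", "Text written earlier this year."))  # True
print(verify(digest, salt, "A private doc", "Text edited after the fact."))  # False
```

The trustable timestamp still has to come from whoever stores the digest (here, the site), since the hash only proves the content has not changed since it was recorded.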

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on R&D is a Huge Externality, So Why Do Markets Do So Much of it? · 2023-11-17T15:12:27.427Z · LW · GW

One reason that economically rational firms might produce "too much" R&D for their shareholders: competition! In many industries, if you stop innovating your company will fall behind quickly, and your customers will go elsewhere.

Happily this Red Queen's Race has large positive externalities, and so coordinating to reduce such investments is generally illegal ('restraint of trade').

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Snapshot of narratives and frames against regulating AI · 2023-11-01T22:37:45.672Z · LW · GW

Agreed: this kind of pseudo-openness has all of the downsides of releasing a dual-use capability, and we miss so many benefits from commercial use and innovation.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Snapshot of narratives and frames against regulating AI · 2023-11-01T18:16:34.132Z · LW · GW

One possible hypothesis here is Meta just loves open source and wants everyone to flourish. ... A more complex hypothesis is Meta doesn't actually love open source that much but has a sensible, self-interested strategy

It's worth noting here that Meta is very careful never to describe Llama as open source, because they know perfectly well that it isn't.  For example, here's video of Yann LeCun testifying under oath: "so first of all Llama system was not made open source ... we released it in a way that did not authorize commercial use, we kind of vetted the people who could download the model it was reserved to researchers and academics"

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2023-10-28T03:39:32.219Z · LW · GW

You can find the current closest thing various companies have at

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AI #34: Chipping Away at Chip Exports · 2023-10-19T20:50:40.594Z · LW · GW

Adam Jermyn says Anthropic’s RSP includes fine-tuning-included evals every three months or 4x compute increase, including during training.

You don't need to take anyone's word for this when checking the primary source is so easy: the RSP is public, and the relevant protocol is on page 12:

In more detail, our evaluation protocol is as follows: ... Timing: During model training and fine-tuning, Anthropic will conduct an evaluation of its models for next-ASL capabilities both (1) after every 4x jump in effective compute, including if this occurs mid-training, and (2) every 3 months to monitor fine-tuning/tooling/etc improvements.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Taxonomy of AI-risk counterarguments · 2023-10-16T20:55:10.517Z · LW · GW
  • Constitutional AI: AI can be trained by feedback from other AI based on a "constitution" of rules and principles.
  • (The number of proposed alignment solutions is very large, so the only ones listed here are the two pursued by OpenAI and Anthropic, respectively. ...)

I think describing Constitutional AI as "the solution pursued by Anthropic" is substantially false.  Our 'core views' post describes a portfolio approach to safety research, across optimistic, intermediate, and pessimistic scenarios.

If we're in an optimistic scenario where catastrophic risk from advanced AI is very unlikely, then Constitutional AI or direct successors might be sufficient - but personally I think of such techniques as baselines and building blocks for further research rather than solutions.  If we're not so lucky, then future research and agendas like mechanistic interpretability will be vital.  This alignment forum comment goes into some more detail about our thinking at the time.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on More or Fewer Fights over Principles and Values? · 2023-10-15T22:38:45.943Z · LW · GW

I certainly think there are ways to comport yourself as a professional which have very little in common with Scott's conception of a blankface, although pretending to professionalism is a classic blankface strategy.

A blankface is anyone who enjoys wielding the power entrusted in them to make others miserable by acting like a cog in a broken machine, rather than like a human being with courage, judgment, and responsibility for their actions. A blankface meets every appeal to facts, logic, and plain compassion with the same repetition of rules and regulations and the same blank stare—a blank stare that, more often than not, conceals a contemptuous smile.

A professional may be caught in a broken machine or bureaucracy; your responsibility (and the call for courage and judgement) then is to choose voice and eventually exit rather than loyal complicity.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on What I've been reading, October 2023: The stirrup in Europe, 19th-century art deco, and more · 2023-10-11T22:47:13.854Z · LW · GW

The study of genetics is the study of the causes of genetic variation in the population. Yet genetics has contributed little to our understanding of speciation and nothing to our understanding of extinction (Lewontin, 1974, p. 12).

My understanding is that genetics has in fact contributed enormously to our understanding of speciation and also extinction - and here's a book-length treatment I found from 1977, which post-dates Mokyr's cite but not his book. Fortunately I think the point about macroinventions doesn't actually rest on this analogy.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · 2023-10-06T19:59:34.066Z · LW · GW

The obvious targets are of course Anthropic's own frontier models, Claude Instant and Claude 2.

Problem setup: what makes a good decomposition? discusses what success might look like and enable - but note that decomposing models into components is just the beginning of the work of mechanistic interpretability! Even with perfect decomposition we'd have plenty left to do, unraveling circuits and building a larger-scale understanding of models.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AI #30: Dalle-3 and GPT-3.5-Instruct-Turbo · 2023-09-26T07:27:47.585Z · LW · GW

Yann LeCun made a five minute statement to the Senate Intelligence Committee (which I am considering sufficient newsworthiness it needs to be covered), defending the worst thing you can do, open sourcing AI models.

It sure does sound like that, but later he testifies that:

first of all the Llama system was not made open source ... we released it in a way that did not authorize commercial use, we kind of vetted the people who could download the model; it was reserved to researchers and academics

If you read his prepared opening statement carefully, he never actually claims that Llama is open source; just speaks at length about the virtues of openness and open-source. Easy to end up confused though!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Taking features out of superposition with sparse autoencoders more quickly with informed initialization · 2023-09-23T18:08:32.163Z · LW · GW

I'd love to see a polished version of this work posted to arXiv.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust · 2023-09-20T18:46:27.326Z · LW · GW

We've deliberately set conservative thresholds, such that I don't expect the first models which pass the ASL-3 evals to pose serious risks without improved fine-tuning or agent-scaffolding, and we've committed to re-evaluate to check on that every three months. From the policy:

Ensuring that we never train a model that passes an ASL evaluation threshold is a difficult task. Models are trained in discrete sizes, they require effort to evaluate mid-training, and serious, meaningful evaluations may be very time consuming, since they will likely require fine-tuning.

This means there is a risk of overshooting an ASL threshold when we intended to stop short of it. We mitigate this risk by creating a buffer: we have intentionally designed our ASL evaluations to trigger at slightly lower capability levels than those we are concerned about, while ensuring we evaluate at defined, regular intervals (specifically every 4x increase in effective compute, as defined below) in order to limit the amount of overshoot that is possible. We have aimed to set the size of our safety buffer to 6x (larger than our 4x evaluation interval) so model training can continue safely while evaluations take place. Correct execution of this scheme will result in us training models that just barely pass the test for ASL-N, are still slightly below our actual threshold of concern (due to our buffer), and then pausing training and deployment of that model unless the corresponding safety measures are ready.

I also think that many risks which could emerge in apparently-ASL-2 models will be reasonably mitigable by some mixture of re-finetuning, classifiers to reject harmful requests and/or responses, and other techniques. I've personally spent more time thinking about the autonomous replication than the biorisk evals though, and this might vary by domain.
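
The buffer arithmetic in the quoted policy text can be sketched in a few lines. This is only an illustration under my reading of the quote, namely that both the 4x evaluation interval and the 6x safety buffer are measured in effective compute; the function name, the starting compute of 1.0, and the sample trigger points are all hypothetical.

```python
# Numbers from the quoted policy text; everything else here is a hypothetical
# illustration, not Anthropic's actual evaluation tooling.
EVAL_INTERVAL = 4.0  # evaluate at every 4x increase in effective compute
SAFETY_BUFFER = 6.0  # evals trigger ~6x below the capability level of concern


def first_eval_at_or_after(trigger: float, start: float = 1.0,
                           interval: float = EVAL_INTERVAL) -> float:
    """Return the effective compute of the first scheduled evaluation that
    happens at or after the point where the model would start passing."""
    compute = start
    while compute < trigger:
        compute *= interval
    return compute


# Worst case: the model would start passing just after an evaluation,
# so training runs for almost a full 4x interval before the next check.
for trigger in (1.01, 2.0, 3.9, 4.0, 15.0):
    detected = first_eval_at_or_after(trigger)
    overshoot = detected / trigger          # always below EVAL_INTERVAL
    level_of_concern = trigger * SAFETY_BUFFER
    margin = level_of_concern / detected    # always above 6 / 4 = 1.5
    print(f"trigger={trigger:5.2f} detected={detected:5.1f} "
          f"overshoot={overshoot:.2f}x margin={margin:.2f}x")
```

Under these assumptions the overshoot past the trigger point stays below 4x, which leaves at least a 1.5x margin in effective compute before the level of concern. That margin is what would let training continue while evaluations run.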

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust · 2023-09-20T14:32:12.424Z · LW · GW

Try really hard to avoid this situation for many reasons, among them that de-deployment would suck. I think it's unlikely, but:

(2d) If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will halt further deployment to new customers and assess existing deployment cases for any serious risks which would constitute a safety emergency. Given the safety buffer, de-deployment should not be necessary in the majority of deployment cases. If we identify a safety emergency, we will work rapidly to implement the minimum additional safeguards needed to allow responsible continued service to existing customers.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust · 2023-09-19T20:33:08.703Z · LW · GW

One year is actually the typical term length for board-style positions, but because members can be re-elected their tenure is often much longer. In this specific case of course it's now up to the trustees!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Invading Australia (Endless Formerlies Most Beautiful, or What I Learned On My Holiday) · 2023-09-11T00:43:29.592Z · LW · GW

You may enjoy reading The Future Eaters (Flannery 2004), as an ecological history of Australia and the region - covering the period before first human settlement, the arrival of Indigenous peoples, and later European colonization.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The Human Phase Hypothesis (Why We Might Be Alone) · 2023-08-25T22:13:37.215Z · LW · GW

What is the human-phase hypothesis? Your whole post doesn't actually say!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The Human Phase Hypothesis (Why We Might Be Alone) · 2023-08-25T05:03:31.911Z · LW · GW

I think this post would be substantially improved by adding a one-to-four-sentence abstract at the beginning.

Substantively, I think your proposal is already part of the standard Drake Equation and dicing the sequence of steps a little differently doesn't affect the result. There's also some recent research which many people on LessWrong think explains away the paradox: Dissolving the Fermi Paradox makes the case that we're simply very rare, and Grabby Aliens says that we're instead early. Between them, we have fairly precise quantitative bounds, and I'd suggest familiarizing yourself with the papers and follow-up research.
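
For reference, the standard Drake Equation is a plain product of factors, which is why regrouping or subdividing the steps leaves the estimate unchanged. A sketch follows; the split of $f_l$ into two sub-steps $f_{l,1}$ and $f_{l,2}$ is a hypothetical example, not the post's actual proposal.

```latex
% Standard Drake Equation: expected number of detectable civilizations
N = R_{*} \cdot f_p \cdot n_e \cdot f_l \cdot f_i \cdot f_c \cdot L

% Hypothetical refinement: split one step into two sub-steps,
% with f_l = f_{l,1} \cdot f_{l,2}
N' = R_{*} \cdot f_p \cdot n_e \cdot (f_{l,1} \cdot f_{l,2}) \cdot f_i \cdot f_c \cdot L = N
```

Adding an intermediate step only changes the answer if it changes the estimate of the overall product, not merely how that product is factored.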