Posts

Anthropic: Reflections on our Responsible Scaling Policy 2024-05-20T04:14:44.435Z
Simple probes can catch sleeper agents 2024-04-23T21:10:47.784Z
Third-party testing as a key ingredient of AI policy 2024-03-25T22:40:43.744Z
Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy 2023-11-01T18:10:31.110Z
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning 2023-10-05T21:01:39.767Z
Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust 2023-09-19T15:09:27.235Z
Anthropic's Core Views on AI Safety 2023-03-09T16:55:15.311Z
Concrete Reasons for Hope about AI 2023-01-14T01:22:18.723Z
In Defence of Spock 2021-04-21T21:34:04.206Z
Zac Hatfield Dodds's Shortform 2021-03-09T02:39:33.481Z

Comments

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · 2024-07-18T22:22:06.963Z · LW · GW

Might be worth putting a short notice at the top of each post saying that, with a link to this post or whatever other resource you'd now recommend? (inspired by the 'Attention - this is a historical document' on e.g. this PEP)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on jacquesthibs's Shortform · 2024-07-07T21:24:42.809Z · LW · GW

I don't think any of these amount to a claim that "to reach ASI, we simply need to develop rules for all the domains we care about". Yes, AlphaGo Zero reached superhuman levels on the narrow task of playing Go, and that's a nice demonstration that synthetic data could be useful, but it's not about ASI and there's no claim that this would be either necessary or sufficient.

(not going to speculate on object-level details though)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Habryka's Shortform Feed · 2024-07-06T00:32:52.038Z · LW · GW

I think that personal incentives are an unhelpful way to try to think about or predict board behavior (for Anthropic and in general), but you can find the current members of our board listed here.

Is there an actual way to criticize Dario and/or Daniela in a way that will realistically be given a fair hearing by someone who, if appropriate, could take some kind of action?

Criticism from whom, about what? And what kind of action are you imagining? For anything I can imagine actually coming up, I'd be personally comfortable raising it directly with either or both of them in person or in writing, and believe they'd give it a fair hearing as well as appropriate follow-up. There are also standard company mechanisms that many people might be more comfortable using (talk to your manager or someone responsible for that area; ask a maybe-anonymous question in various fora; etc). Ultimately executives are accountable to the board, which will be majority appointed by the long-term benefit trust from late this year.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Habryka's Shortform Feed · 2024-07-05T20:04:54.516Z · LW · GW

Makes sense - if I felt I had to use an anonymous mechanism, I can see how contacting Daniela about Dario might be uncomfortable. (Although to be clear I actually think that'd be fine, and I'd also have to think that Sam McCandlish as responsible scaling officer wouldn't handle it)

If I was doing this today I guess I'd email another board member; and I'll suggest that we add that as an escalation option.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Habryka's Shortform Feed · 2024-07-04T20:16:53.376Z · LW · GW

OK, let's imagine I had a concern about RSP noncompliance, and felt that I needed to use this mechanism.

(in reality I'd just post in whichever slack channel seemed most appropriate; this happens occasionally for "just wanted to check..." style concerns and I'm very confident we'd welcome graver reports too. Usually that'd be a public channel; for some compartmentalized stuff it might be a private channel and I'd DM the team lead if I didn't have access. I think we have good norms and culture around explicitly raising safety concerns and taking them seriously.)

As I understand it, I'd:

  • Remember that we have such a mechanism and bet that there's a shortcut link. Fail to remember the shortlink name (reports? violations?) and search the list of "rsp-" links; ah, it's rsp-noncompliance. (just did this, and added a few aliases)
  • That lands me on the policy PDF, which explains in two pages the intended scope of the policy, who's covered, the procedure, etc. and contains a link to the third-party anonymous reporting platform. That link is publicly accessible, so I could e.g. make a report from a non-work device or even after leaving the company.
  • I write a report on that platform describing my concerns[1], optionally uploading documents etc. and get a random password so I can log in later to give updates, send and receive messages, etc.
  • The report by default goes to our Responsible Scaling Officer, currently Sam McCandlish. If I'm concerned about the RSO or don't trust them to handle it, I can instead escalate to the Board of Directors (current DRI Daniela Amodei).
  • Investigation and resolution obviously depends on the details of the noncompliance concern.

There are other (pretty standard) escalation pathways for concerns about things that aren't RSP noncompliance. There's not much we can do about the "only one person could have made this report" problem beyond the included strong commitments to non-retaliation, but if anyone has suggestions I'd love to hear them.


  1. I clicked through just now to the point of cursor-in-textbox, but stopped short of submitting a nuisance report. ↩︎

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Habryka's Shortform Feed · 2024-07-01T05:55:12.586Z · LW · GW

I am a current Anthropic employee, and I am not under any such agreement, nor has any such agreement ever been offered to me.

If asked to sign a self-concealing NDA or non-disparagement agreement, I would refuse.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Nathan Young's Shortform · 2024-06-22T16:35:32.219Z · LW · GW

He talked to Gladstone AI founders a few weeks ago; AGI risks were mentioned but not in much depth.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Niche product design · 2024-06-20T16:22:40.471Z · LW · GW

see also: the tyranny of the marginal user

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Maybe Anthropic's Long-Term Benefit Trust is powerless · 2024-06-12T16:52:05.533Z · LW · GW

Incorporating as a Public Benefit Corporation already frees directors' hands; Delaware Title 8, §365 requires them to "balance the pecuniary interests of the stockholders, the best interests of those materially affected by the corporation’s conduct, and the specific public benefit(s) identified in its certificate of incorporation".

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on There Should Be More Alignment-Driven Startups · 2024-05-31T09:31:53.877Z · LW · GW

Without wishing to discourage these efforts, I disagree on a few points here:

Still, the biggest opportunities are often the ones with the lowest probability of success, and startups are the best structures to capitalize on them.

If I'm looking for the best expected value around, that's still monotonic in the probability of success! There are good reasons to think that most organizations are risk-averse (relative to the neutrality of linear $=utils) and startups can be a good way to get around this.

Nonetheless, I remain concerned about regressional Goodhart; and that many founders naively take on the risk appetite of funders who manage a portfolio, without the corresponding diversification (if all your eggs are in one basket, watch that basket very closely). See also Inadequate Equilibria and maybe Fooled by Randomness.

Meanwhile, strongly agreed that AI safety driven startups should be B corps, especially if they're raising money.

Technical quibble: "B Corp" is a voluntary private certification, while PBC is a corporate form which imposes legal obligations on directors. I think many of the B Corp criteria are praiseworthy, but certification is neither necessary nor sufficient as an alternative to PBC status - and getting certified is probably a poor use of time and attention for a startup, when the founders' time and attention are at such a premium.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on There Should Be More Alignment-Driven Startups · 2024-05-31T06:05:22.681Z · LW · GW

My personal opinion is that starting a company can be great, but I've also seen several fail due to the gaps between their personal goals, a work-it-out-later business plan, and the duties that you/your board owes to your investors.

IMO any purpose-driven company should be founded as a Public Benefit Corporation, to make it clear in advance and in law that you'll also consider the purpose and the interests of people materially affected by the company alongside investor returns. (cf § 365. Duties of directors)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-21T22:17:16.721Z · LW · GW

Enforcement of mitigations when it's someone else who removes them won't be seen as relevant, since in this religion a contributor is fundamentally not responsible for how the things they release will be used by others.

This may be true of people who talk a lot about open source, but among actual maintainers the attitude is pretty different. If some user causes harm with an overall positive tool, that's on the user; but if the contributor has built something consistently or overall harmful that is indeed on them. Maintainers tend to avoid working on projects which are mostly useful for surveillance, weapons, etc. for pretty much this reason.

Source: my personal experience as a maintainer and PSF Fellow, and the multiple Python core developers I just checked with at the PyCon sprints.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-21T21:59:04.670Z · LW · GW

Thanks for these clarifications. I didn't realize that the 30% was for the new yellow-line evals rather than the current ones.

That's how I was thinking about the predictions that I was making; others might have been thinking about the current evals where those were more stable.

I'm having trouble parsing this sentence. What you mean by "doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals"? Doesn't pausing include focusing on mitigations and evals?

Of course, but pausing also means we'd have to shuffle people around, interrupt other projects, and deal with a lot of other disruption (the costs of pausing). Ideally, we'd continue updating our yellow-line evals to stay ahead of model capabilities until mitigations are ready.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-21T16:56:48.466Z · LW · GW

The yellow-line evals are already a buffer ('sufficient to rule out red-lines') which are themselves a buffer (6x effective compute) before actually-dangerous situations. Since triggering a yellow-line eval requires pausing until we either have safety and security mitigations in place or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals. I therefore think it's reasonable to keep going basically regardless of the probability of triggering in the next round of evals. I also expect that if we did develop some neat new elicitation technique we thought would trigger the yellow-line evals, we'd re-run them ahead of schedule.

I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we'd develop for the next round, and enough calibration training to avoid very low probabilities.

Finally, the point of these estimates is that they can guide research and development prioritization - high estimates suggest that it's worth investing in more difficult yellow-line evals, and/or that elicitation research seems promising. Tying a pause to that estimate is redundant with the definition of a yellow-line, and would risk some pretty nasty epistemic distortions.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-21T16:35:32.814Z · LW · GW

What about whistleblowing or anonymous reporting to governments? If an Anthropic employee was so concerned about RSP implementation (or more broadly about models that had the potential to cause major national or global security threats), where would they go in the status quo?

That really seems more like a question for governments than for Anthropic! For example, the SEC or IRS whistleblower programs operate regardless of what companies purport to "allow", and I think it'd be cool if the AISI had something similar.

If I was currently concerned about RSP implementation per se (I'm not), it's not clear why the government would get involved in a matter of voluntary commitments by a private organization. If there was some concern touching on the White House commitments, Bletchley declaration, Seoul declaration, etc., then I'd look up the appropriate monitoring body; if in doubt the Commerce whistleblower office or AISI seem like reasonable starting points.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-20T06:05:23.622Z · LW · GW

"red line" vs "yellow line"

Passing a red-line eval indicates that the model requires ASL-n mitigations. Yellow-line evals are designed to be easier to implement and/or run, while maintaining the property that if you fail them you would also fail the red-line evals. If a model passes the yellow-line evals, we have to pause training and deployment until we put a higher standard of security and safety measures in place, or design and run new tests which demonstrate that the model is below the red line. For example, leaving out the "register a typo'd domain" step from an ARA eval, because there are only so many good typos for our domain.

assurance mechanisms

Our White House commitments mean that we're already reporting safety evals to the US Government, for example. I think the natural reading of "validated" is some combination of those, though obviously it's very hard to validate that whatever you're doing is 'sufficient' security against serious cyberattacks or safety interventions on future AI systems. We do our best.

I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)

What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn't mean much compared to whether it's actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don't have a non-disparagement agreement). On the other hand, sharing internal docs eats a bunch of time in pre-release review, risks someone seizing on a misinterpretation and leaping to conclusions, and has other costs.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Anthropic: Reflections on our Responsible Scaling Policy · 2024-05-20T05:41:06.319Z · LW · GW

I believe that meeting our ASL-2 deployment commitments - e.g. enforcing our acceptable use policy, and data-filtering plus harmlessness evals for any fine-tuned models - with widely available model weights is presently beyond the state of the art. If a project or organization makes RSP-like commitments, evaluates and mitigates risks, and can uphold that while releasing model weights... I think that would be pretty cool.

(also note that e.g. Llama is not open source - I think you're talking about releasing weights; the license doesn't affect safety, but as an open-source maintainer the distinction matters to me)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AISN #35: Lobbying on AI Regulation Plus, New Models from OpenAI and Google, and Legal Regimes for Training on Copyrighted Data · 2024-05-16T18:34:45.327Z · LW · GW

While some companies, such as OpenAI and Anthropic, have publicly advocated for AI regulation, Time reports that in closed-door meetings, these same companies "tend to advocate for very permissive or voluntary regulations."

I think that dropping the intermediate text which describes 'more established big tech companies' such as Microsoft substantially changes the meaning of this quote - "these same companies" is not "OpenAI and Anthropic". Full context:

Executives from the newer companies that have developed the most advanced AI models, such as OpenAI CEO Sam Altman and Anthropic CEO Dario Amodei, have called for regulation when testifying at hearings and attending Insight Forums. Executives from the more established big technology companies have made similar statements. For example, Microsoft vice chair and president Brad Smith has called for a federal licensing regime and a new agency to regulate powerful AI platforms. Both the newer AI firms and the more established tech giants signed White House-organized voluntary commitments aimed at mitigating the risks posed by AI systems. But in closed door meetings with Congressional offices, the same companies are often less supportive of certain regulatory approaches

AI lab watch makes it easy to get some background information by comparing commitments made by OpenAI, Anthropic, Microsoft, and some other established big tech companies.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on "Open Source AI" is a lie, but it doesn't have to be · 2024-05-12T20:05:22.159Z · LW · GW

Meta’s Llama3 model is also *not* open source, despite the Chief AI Scientist at the company, Yann LeCun, frequently proclaiming that it is.

This is particularly annoying because he knows better: the latter two of those three tweets are from January 2024, and here's video of his testimony under oath in September 2023: "the Llama system was not made open-source".

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · 2024-04-25T15:35:05.260Z · LW · GW

It's a sparse autoencoder because part of the loss function is an L1 penalty encouraging sparsity in the hidden layer. Otherwise, it would indeed learn a simple identity map!
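
For concreteness, here's a minimal sketch of that loss structure in PyTorch - the layer sizes, the ReLU nonlinearity, and the l1_coeff value are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # hidden "feature" activations
        return self.decoder(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    reconstruction = ((x - x_hat) ** 2).mean()  # how well the input is reproduced
    sparsity = f.abs().sum(dim=-1).mean()       # L1 penalty pushes most activations to zero
    return reconstruction + l1_coeff * sparsity

# With l1_coeff=0 and d_hidden >= d_model, the optimum is just an identity map;
# the L1 term is what forces the hidden layer to learn a sparse basis instead.
```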

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Scenario planning for AI x-risk · 2024-04-23T23:52:34.982Z · LW · GW

Tom Davidson's work on a compute-centric framework for takeoff speed is excellent, IMO.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on What is the best way to talk about probabilities you expect to change with evidence/experiments? · 2024-04-20T05:17:19.307Z · LW · GW

you CAN predict that there will be evidence with equal probability of each direction.

More precisely, the expected value of upwards and downwards updates should be the same; it's nonetheless possible to be very confident that you'll update in a particular direction - offset by a much larger and proportionately less likely update in the other.

For example, I have some chance of winning the lottery this year, not much lower than if I actually bought a ticket. I'm very confident that each day I'll give somewhat lower odds (as there's less time remaining), but being credibly informed that I've won would radically change the odds such that the expectation balances out.
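
A toy version of the lottery example (the prior and the number of draws are made-up numbers), showing that the many small downward updates and the rare huge upward one average out to the prior:

```python
# Suppose my prior of winning at some point this year is p, spread uniformly
# over the D remaining daily draws (both numbers invented for illustration).
p, D = 1e-7, 365

p_news_today = p / D                                     # chance I'm credibly told I won today
posterior_if_news = 1.0                                  # enormous, unlikely upward update
posterior_if_no_news = (p * (D - 1) / D) / (1 - p / D)   # the expected small downward update

expected_posterior = (p_news_today * posterior_if_news
                      + (1 - p_news_today) * posterior_if_no_news)
print(expected_posterior)  # equals the prior p: the updates balance in expectation
```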

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Tamsin Leake's Shortform · 2024-03-17T07:31:03.914Z · LW · GW

I agree that there's no substitute for thinking about this for yourself, but I think that morally or socially counting "spending thousands of dollars on yourself, an AI researcher" as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it's-good funding arrangements in this space for me, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it's very easy for donating to people or organizations in your social circle to have substantial negative expected value.

I'm glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on 'Empiricism!' as Anti-Epistemology · 2024-03-14T07:11:51.146Z · LW · GW

Trivially true to the extent that you are about equally likely to observe a thing throughout that timespan; and the Lindy Effect is at least regularly talked of.

But there are classes of observations for which this is systematically wrong: for example, most people who see a ship part-way through a voyage will do so while it's either departing or arriving in port. Investment schemes are just such a class, because markets are usually up to the task of consuming alpha and tend to be better when the idea is widely known - even Buffett's returns have oscillated around the index over the last few years!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Is anyone working on formally verified AI toolchains? · 2024-03-13T01:01:52.839Z · LW · GW

Safety properties aren't the kind of properties you can prove; they're statements about the world, not about mathematical objects. I very strongly encourage anyone reading this comment to go read Leveson's Engineering a Safer World (free pdf from author) through to the end of chapter three - it's the best introduction to systems safety that I know of and a standard reference for anyone working with life-critical systems. how.complexsystems.fail is the short-and-quotable catechism.


I'm not really sure what you mean by "AI toolchain", nor what threat model would have a race-condition present an existential risk. More generally, formal verification is a research topic - there's some neat demonstration systems and they're used in certain niches with relatively small amounts of code and compute, simple hardware, and where high development times are acceptable. None of those are true of AI systems, or even libraries such as Pytorch.

For flavor, some of the most exciting developments in formal methods: I expect the Lean FRO to improve usability, and 'autoformalization' tricks like Proofster (pdf) might also help - but it's still niche, and "proven correct" software can still have bugs from under-specified components, incorrect axioms, or outright hardware issues (e.g. Spectre, Rowhammer, cosmic rays, etc.). The seL4 microkernel is great, but you still have to supply an operating system and application layer, and then ensure the composition is still safe. To test an entire application stack, I'd instead turn to Antithesis, which is amazing so long as you can run everything in an x86 hypervisor (with no GPUs).

(as always, opinions my own)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on philh's Shortform · 2024-03-09T19:22:41.022Z · LW · GW

I think he's actually quite confused here - I imagine saying

Hang on - you say that (a) we can think, and (b) we are the instantiations of any number of computer programs. Wouldn't instantiating one of those programs be a sufficient condition of understanding? Surely if two things are isomorphic even in their implementation, either both can think, or neither.

(the Turing test suggests 'indistinguishable in input/output behaviour', which I think is much too weak)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Decaeneus's Shortform · 2024-03-03T03:47:10.841Z · LW · GW

See e.g. https://mschloegel.me/paper/schloegel2024sokfuzzevals.pdf

Fuzzing is a generally pretty healthy subfield, but even there most peer-reviewed papers in top venues are still completely useless! Importantly, "a 'working' GitHub repo" is really not enough to ensure that your results are reproducible, let alone ensure external validity.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Cryonics p(success) estimates are only weakly associated with interest in pursuing cryonics in the LW 2023 Survey · 2024-03-01T01:26:49.366Z · LW · GW

people’s subjective probability of successful restoration to life in the future, conditional on there not being a global catastrophe destroying civilization before then. This is also known as p(success).

This definition seems relevantly modified by the conditional!

You also seem to be assuming that "probability of revival" could be a monocausal explanation for cryonics interest, but I find that implausible ex ante. Monocausality approximately doesn't exist, and "is being revived good in expectation / good with what probability" are also common concerns. (CF)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Decaeneus's Shortform · 2024-02-21T21:34:08.269Z · LW · GW

Very little, because most CS experiments are not in fact replicable (and that's usually only one of several serious methodological problems).

CS does seem somewhat ahead of other fields I've worked in, but I'd attribute that to the mostly-separate open source community rather than academia per se.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on D0TheMath's Shortform · 2024-02-11T08:25:40.115Z · LW · GW

My impression is that the effects of genes which vary between individuals are essentially independent, and small effects are almost always locally linear. With the amount of measurement noise and number of variables, I just don't think we could pick out nonlinearities or interaction effects of any plausible strength if we tried!
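
As a rough illustration of that power problem (the cohort size, allele frequencies, and effect sizes below are all invented), a quick simulation shows the sampling noise on an interaction term coming out several times larger than any plausible true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                   # hypothetical cohort size
beta1, beta2, beta12 = 0.02, 0.02, 0.005     # made-up main effects and interaction

estimates = []
for _ in range(200):
    g1 = rng.binomial(2, 0.3, n)             # genotypes coded 0/1/2
    g2 = rng.binomial(2, 0.3, n)
    y = beta1 * g1 + beta2 * g2 + beta12 * g1 * g2 + rng.normal(0, 1, n)
    X = np.column_stack([np.ones(n), g1, g2, g1 * g2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates.append(coef[3])                # fitted interaction coefficient

print(np.std(estimates))  # ~0.02: several times the true 0.005 interaction, so it's lost in the noise
```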

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Constituency-sized AI congress? · 2024-02-11T07:11:25.846Z · LW · GW

I think there's a lot of interesting potential in such ideas - but that this isn't ambitious enough! Democracy isn't just about compromising on the issues on the table; the best forms involve learning more and perhaps changing our minds... as well as, yes, trying to find creative win-win outcomes that everyone can at least accept.

I think that trying to improve democracy with better voting systems is fairly similar to trying to improve the economy with better price and capital-allocation systems. In both cases, there have been enormous advances since the mid-1800s; in both there's a realistic prospect of modern computers enabling wildly better-than-historical systems; and in both cases it focuses effort on a technical subproblem which is not sufficient and maybe not even necessary. (and also there's the spectre of communism in Europe haunting both)

A few bodies of thought and work on this that I like:

But as usual, the hard and valuable part is the doing!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Scenario planning for AI x-risk · 2024-02-10T10:24:19.643Z · LW · GW

Most discussions of AI x-risk consider a subset of this [misuse / structural / accidental / agentic] taxonomy. ... Anthropic’s Responsible Scaling Policy is designed with only “misuse” and “autonomy and replication” in mind.

No, we've[1] been thinking about all four of these aspects!

  • Misuse is obvious - our RSP defines risk levels, evals, and corresponding safeguards and mitigations before continued training or deployment.
  • Structural risks are obviously not something we can solve unilaterally, but nor are we neglecting them. The baseline risk comparisons in our RSP specifically exclude other providers' models, so that e.g. we don't raise the bar on allowable cyberoffense capabilities even if a competitor has already released a more-capable model. (UDT approved strategy!) Between making strong unilateral safety commitments, advancing industry-best-practice, and supporting public policy through e.g. testimony and submissions to government enquiries, I'm fairly confident that our net contribution to structural risks is robustly positive.
  • Accident and agentic risks are IMO on a continuous spectrum - you could think of the underlying factor as "how robustly-goal-pursuing is this system?", with accidents being cases where it was shifted off-goal-distribution and agentic failures coming from a treacherous turn by a schemer. We do technical safety research to address various points on this spectrum, e.g. Constitutional AI or investigating faithfulness of chain-of-thought to improve robustness of prosaic alignment, and our recent Sleeper Agents paper on more agentic risks. Accidents are more linked to specific deployments though, and correspondingly less emphasized in our RSP - though if you can think of a good way to evaluate accident risks before deployment, let me know!

  1. as usual, these are my opinions only, I'm not speaking for my employer. Further hedging omitted for clarity. ↩︎

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Scenario planning for AI x-risk · 2024-02-10T10:21:54.822Z · LW · GW

By “sustainability,” I mean that a theory of victory should ideally not reduce AI x-risk per year to a constant, low level, but instead continue to reduce AI x-risk over time. In the former case, “expected time to failure”[31] would remain constant, and total risk over a long enough time period would inevitably reach unacceptable levels. (For example, a 1% chance of an existential catastrophe per year implies an approximately 63% chance over 100 years.)

Obviously yes, a 1% pa chance of existential catastrophe is utterly unacceptable! I'm not convinced that "continues to reduce over time" is the right framing though; if we achieved a low enough constant rate for an MTBF of many millions of years, I'd expect other projects to have higher long-term EV given the very-probably-finite future resources available anyway. I also expect that the challenge is almost entirely in getting to an acceptably low rate, not in the further downward trend, so it's really a moot point.
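
The arithmetic behind that 63% figure, and the constant-rate framing, for reference:

```python
p_per_year = 0.01
print(1 - (1 - p_per_year) ** 100)  # ≈ 0.634: a constant 1%/year compounds to ~63% over a century
print(1 / p_per_year)               # ≈ 100 years mean time to failure at that constant rate
```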

(I'm looking forward to retiring from this kind of thing if or when I feel that AI risk and perhaps synthetic biorisk are under control, and going back to low-stakes software engineering r&d... though not making any active plans)

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Scenario planning for AI x-risk · 2024-02-10T10:20:45.537Z · LW · GW

However, the data they use to construct inferential relationships are expert forecasts. Therefore, while their four scenarios might accurately describe clusters of expert forecasts, they should only be taken as predictively valuable to the extent that one takes expert forecasts to be predictively valuable.

No, it's plausible that this kind of scenario or cluster is more predictively accurate than taking expert forecasts directly. In practice, this happens when experts disagree on (latent) state variables, but roughly agree on dynamics - for example there might be widespread disagreement on AGI timelines, but agreement that

  • if scaling laws and compute trends hold and no new paradigm is needed, AGI timelines of five to ten years are plausible
  • if the LLM paradigm will not scale to AGI, we should have a wide probability distribution over timelines, say from 2040 to 2100

and then assigning relative probability to the scenarios can be a later exercise. Put another way, forming scenarios or clusters is more like formulating an internally-coherent hypothesis than updating on evidence.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Scenario planning for AI x-risk · 2024-02-10T10:20:24.809Z · LW · GW

“The most pressing practical question for future work is: why were superforecasters so unmoved by experts’ much higher estimates of AI extinction risk, and why were experts so unmoved by the superforecasters’ lower estimates? The most puzzling scientific question is: why did rational forecasters, incentivized by the XPT to persuade each other, not converge after months of debate and the exchange of millions of words and thousands of forecasts?”

This post by Peter McClusky, a participating superforecaster, renders the question essentially non-puzzling to me. Doing better would be fairly simple, although attracting and incentivising the relevant experts would be fairly expensive.

  • The questions were in many cases somewhat off from the endpoints we care about, or framed in ways that I believe would distort straightforward attempts to draw conclusions
  • The incentive structure of predicting the apocalypse is necessarily screwy, and using a Keynesian-beauty-contest style prediction tournament doesn't really fix it
  • Most of the experts and superforecasters just don't know much about AI, and thought that (as of 2022) the recent progress was basically just hype. Hopefully it's now clear that this was just wrong?

Some selected quotes:

I didn’t notice anyone with substantial expertise in machine learning. Experts were apparently chosen based on having some sort of respectable publication related to AI, nuclear, climate, or biological catastrophic risks. ... they’re likely to be more accurate than random guesses. But maybe not by a large margin.

Many superforecasters suspected that recent progress in AI was the same kind of hype that led to prior disappointments with AI. I didn't find a way to get them to look closely enough to understand why I disagreed. My main success in that area was with someone who thought there was a big mystery about how an AI could understand causality. I pointed him to Pearl, which led him to imagine that problem might be solvable.

I didn't see much evidence that either group knew much about the subject that I didn't already know. So maybe most of the updates during the tournament were instances of the blind leading the blind. None of this seems to be as strong evidence as the changes, since the tournament, in opinions of leading AI researchers, such as Hinton and Bengio.

I think the core problem is actually that it's really hard to get good public predictions of AI progress, in any more detail than "extrapolate compute spending, hardware price/performance, scaling laws, and then guess at what downstream-task performance that implies (and whether we'll need a new paradigm for AGI [tbc: no!])". To be clear, I think that's a stronger baseline than the forecasting tournament achieved!

But downstream task performance is hard to predict, and there's a fair bit of uncertainty in the other parameters too. Details are somewhere between "trade secrets" and "serious infohazards", and the people who are best at predicting AI progress mostly - for that reason! - work at frontier labs with AI-xrisk-mitigation efforts. I think it's likely that inferring frontier lab members' beliefs from their actions and statements would give you better estimates than another such tournament.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Scenario planning for AI x-risk · 2024-02-10T10:20:08.364Z · LW · GW

I'm a big fan of scenario modelling in general, and loved this post reviewing its application to AI xrisk. Thanks for writing it!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on How has internalising a post-AGI world affected your current choices? · 2024-02-05T08:38:29.197Z · LW · GW

In 2021 I dropped everything else I was doing and moved across the Pacific Ocean to join Anthropic, which I guess puts me in group three. However, I think you should also take seriously the possibility that AGI won't change everything soon - whether because of technical limitations, policy decisions to avoid building more-capable (/dangerous) AI systems, or something which we haven't seen coming at all. Even if you're only wrong about the timing, you could have a remarkably bad time.

So my view is that almost nothing on your list has enough of an upside to be worth the nontrivial chance of very large downsides - though by all means spend more time with family and friends! I believe there was a period in the Cold War when many researchers at RAND declined to save for retirement, but saving for retirement is not actually that expensive. Save for retirement, don't smoke, don't procrastinate about cryonics, and live a life you think is worth living.

And, you know, I think there are concrete reasons for hope and that careful, focussed effort can improve our odds. If AI is going to be a major focus of your life, make that productive instead of nihilistic.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The economy is mostly newbs (strat predictions) · 2024-02-03T09:00:58.097Z · LW · GW

Yep, I'm happy to take the won't-go-down-like-that side of the bet. See you in ten years!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The economy is mostly newbs (strat predictions) · 2024-02-03T07:38:10.164Z · LW · GW

How about "inflation-adjusted market cap of 50% of the Fortune 500 as at Jan 1st 2024 is down by 80% or more as of Jan 1st 2034".

It's a lot easier to measure but I think captures the spirit? I'd be down for an even-odds bet of your chosen size; paid as a donation to GiveWell if I win, or to you or your charity of choice if I lose.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on The economy is mostly newbs (strat predictions) · 2024-02-02T23:38:36.544Z · LW · GW

The average investor will notice almost all their investments go to zero except for a few corps

I'd like to bet against this, if you want to formalize it enough to have someone judge it in ten years.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AI #48: The Talk of Davos · 2024-01-26T07:41:51.066Z · LW · GW

"X is open source" has a specific meaning for software, and Llama models are not open source according to this important legal definition.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AI #48: The Talk of Davos · 2024-01-26T07:20:27.656Z · LW · GW

Typo: the link to the Nature news article on Sleeper Agents is broken.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Most People Don't Realize We Have No Idea How Our AIs Work · 2023-12-27T00:32:06.201Z · LW · GW

See Urschleim in Silicon: Return-Oriented Program Evolution with ROPER (2018).

Incidentally this is among my favorite theses, with a beautiful elucidation of 'weird machines' in chapter two. Recommended reading if you're at all interested in computers or computation.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Constellations are Younger than Continents · 2023-12-26T23:24:55.404Z · LW · GW

Agreed, "ancient aeons" would be my preferred edit.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Constellations are Younger than Continents · 2023-12-26T01:21:55.653Z · LW · GW

In that spirit, my complaints about Time Wrote The Rocks:

We gaze upon creation where erosion makes it known,
And count the countless aeons in the banding of the stone.

Geological time isn't countless; the earth is only around 4.54 ± 0.05 billion years old and the universe 13.7 ± 0.2 billion years old. Seems worth getting the age of creation right! In the technical sense, there are exactly four eons: the Hadean, Archean, Proterozoic and Phanerozoic. It's awkward to find a singable replacement though, since "era" and "age" also have technical definitions.

The best replacements I've thought of are to either go with "ancient aeons" and trust everyone to understand that we're not using the technical definition, or "And see the length of history in" if a nod to chronostratigraphy seems worth the cost in lyricism.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on AI #43: Functional Discoveries · 2023-12-24T09:38:52.937Z · LW · GW

(Does this always reflect agents’ “real” reasoning? Need more work on this!)

Conveniently, we already published results on this, and the answer is no!  

Per Measuring Faithfulness in Chain-of-Thought Reasoning and Question Decomposition Improves the Faithfulness of Model-Generated Reasoning, chain of thought reasoning is often "unfaithful" - the model reaches a conclusion for reasons other than those given in the chain of thought, and indeed sometimes contrary to it.  Question decomposition - splitting across multiple independent contexts - helps but does not fully eliminate the problem.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Employee Incentives Make AGI Lab Pauses More Costly · 2023-12-23T10:57:44.625Z · LW · GW

While we would obviously prefer to meet all our ASL-3 commitments before we ever train an ASL-3 model, we have of course also thought about and discussed what we'd do in the other case. I expect that it would be a stressful time, but for reasons upstream of the conditional pause!

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Finding Sparse Linear Connections between Features in LLMs · 2023-12-09T07:22:19.171Z · LW · GW

Quick plotting tip: when lines (or dots, or anything else) are overlapping, passing alpha=0.6 gives you a bit of transparency and makes it much easier to see what's going on. I think this would make most of your line plots a bit more informative, although I've found it most useful to avoid saturating scatterplots.
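
For example (a generic matplotlib sketch, not your data):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 500)
for i in range(20):
    y = np.sin(x + 0.1 * i) + 0.05 * np.random.randn(x.size)
    plt.plot(x, y, color="C0", alpha=0.6)  # semi-transparent, so overlaps show up as darker bands
plt.show()

# The same keyword works for scatterplots: plt.scatter(xs, ys, alpha=0.6)
```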

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Disappointing Table Refinishing · 2023-12-04T06:09:18.451Z · LW · GW

I really, really like the finish from Osmo oil (Polyx for tables, or Top Oil for e.g. cutting-board-level food safety). Requires sanding back the first time, but subsequent coats or touchups are apply-and-buff only; you might want a buffer but I've done it by hand too.

Aesthetically, it's very close to a raw timber finish, but very water resistant (and alcohol resistant!), durable, and maintains the tactile experience of raw wood pretty well too - not sticky or slick or coated. Slightly more work than polyurethane but IMO worth it.

Comment by Zac Hatfield-Dodds (zac-hatfield-dodds) on Black Box Biology · 2023-12-01T07:01:54.966Z · LW · GW

You don’t need to understand the mechanism of action. You don’t need an animal model of disease. You just need a reasonable expectation that changing a genetic variant will have a positive impact on the thing you care about.  And guess what?  We already have all that information.  We’ve been conducting genome-wide association studies for over a decade.

This is critically wrong, because association is uninformative about interventions.

Genome-wide association studies are enormously informative about correlations in a particular environment.  In a complex setting like genomics, and where there are systematic correlations between environmental conditions and population genetics to confound your analysis, this doesn't give you much reason to think that interventions will have the desired effect!

I'd recommend reading up on causal statistics; Pearl's Book of Why is an accessible introduction and the wikipedia page is also reasonably good.
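
A few lines of simulation make the point concrete (the ancestry structure, allele frequencies, and effect sizes here are all invented): the variant is strongly associated with the trait, yet editing it would do nothing, because the association runs entirely through a confounder.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Population stratification: ancestry affects both how common the variant is
# and (via environment) the trait, while the variant itself has no causal effect.
ancestry = rng.binomial(1, 0.5, n)
genotype = rng.binomial(2, np.where(ancestry == 1, 0.4, 0.1))
trait = 0.5 * ancestry + rng.normal(0, 1, n)

print(np.corrcoef(genotype, trait)[0, 1])  # clearly nonzero: a GWAS would flag this variant

# But intervening on the genotype leaves the trait unchanged, because the trait
# never depended on it - only on the shared ancestry.
```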