AI safety content you could create

post by Adam Jones (domdomegg) · 2025-01-06

This is a link post for https://adamjones.me/blog/ai-safety-content/

Contents

  Communication of catastrophic AI safety problems outside alignment
  Case studies for analogous problems
  Plans for AI safety
  Defining things

This is a (slightly chaotic and scrappy) list of gaps in the AI safety literature: content I think it would be useful or interesting for someone to create. I’ve broken it down into the sections listed in the contents above.

If you think articles already exist covering the topics described, please verify that they actually meet the criteria, and then tell me.

Communication of catastrophic AI safety problems outside alignment

I’ve previously written about how alignment is not all you need. And before me, others had written great things on parts of these problems. Friends have written up articles on parts of the economic transition, and specifically the intelligence curse.

Few people appear to be working on these problems, despite them seeming extremely important and neglected (and plausibly tractable?). This might be because:

  1. there is little understanding of these problems in the community;
  2. the problems don’t match the existing community’s skills and experiences;
  3. few people have started, so there aren’t obvious tractable in-roads to these problems; and
  4. there aren’t organisations or structures that people can join to work on these problems.

The first two issues could be tackled by spreading these messages more clearly. The corresponding (semi-defined) audiences are:

  1. The existing community. Funders or decision makers at AI policy orgs might be particularly useful.
  2. People who would be useful to add to the community. These might be experts who could help by working on these problems, or at least beginning to think about them (e.g. economists, politics and international relations scholars, military/field strategists). I suspect there are many things where people from these fields will see obvious things we are missing.

There is downside risk here. We want to be particularly careful not to heat up race dynamics further, particularly in messaging to the general public or to people likely to make race-y decisions. For this reason I’m more excited about spreading messages about the coordination problem and the economic transition problem than about the power distribution problem (see my problem naming for more context).

Case studies for analogous problems

Related to the plans discussed below, I think we could probably get a lot of insight into building better plans by looking at case studies of analogous problems throughout history.

Unfortunately, a lot of existing resources on case studies are:

I think people should be flexible as to what they look into here (provided they expect it to be relevant to the key problems in AI safety). Some questions I brainstormed were:

Plans for AI safety

For the last few weeks, I’ve been working on trying to find plans for AI safety. They should cover the whole problem, including the major hurdles after intent alignment. Unfortunately, this has not gone well: my rough conclusion is that there aren’t any very clear and well-publicised plans (or even very plausible stories) for making this go well. (More context on some of this work can be found in BlueDot Impact’s AI safety strategist job posting.)

In short: what series of actions might get us to a state of existential security, ideally at a more granular level than ‘pause’ or ‘regulate companies’?

Things that are kind of going in this direction:

However, many of them stop (or become very vague) after preventing misalignment, or don’t describe how we will achieve the intended outcomes (e.g. bringing about a pause successfully). Additionally, while there has been criticism of some of the above plans, there has been relatively little consensus-building around them, or further development from the community to improve them.

Building a better plan, or improving one of these plans (not just criticising where it fails), would be really valuable.

Defining things

I run BlueDot Impact’s AI safety courses. This involves finding resources that explain what AI safety people are talking about.

There are useful concepts that people in AI safety take for granted, but there are ~no easy-to-find resources explaining them well. It’d be great to fix that.

I imagine these articles primarily as definitions. You might even want to create them as a LessWrong tag, AISafety.info article, or Arbital page (although I’m not certain Arbital is still maintained?). They could be a good match for a Wikipedia page, although I don’t know if they’re big enough for that.

I think these are slightly less important to write than the other ideas, but they might be a good low-stakes entry point.
