AI labs can boost external safety research
post by Zach Stein-Perlman · 2024-07-31T19:30:16.207Z · LW · GW · 1 comment
Frontier AI labs can boost external safety researchers by
- Sharing better access to powerful models (early access, fine-tuning, helpful-only,[1] filters/moderation-off, logprobs, activations)[2] (a minimal activations example follows this list)
- Releasing research artifacts besides models
- Publishing (transparent, reproducible) safety research
- Giving API credits
- Mentoring
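To make the "activations" item above concrete: for open-weights models, any researcher can read intermediate activations directly. A minimal sketch, assuming the Hugging Face transformers library and using GPT-2 purely as an illustrative open model:

```python
# Minimal sketch: reading intermediate activations from an open-weights model.
# Assumes the Hugging Face transformers library; GPT-2 is just an illustrative choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Interpretability needs activations.", return_tensors="pt")
with torch.no_grad():
    # output_hidden_states=True returns the residual-stream state after every layer
    outputs = model(**inputs, output_hidden_states=True)

hidden_states = outputs.hidden_states  # tuple: embedding output + one tensor per layer
print(len(hidden_states), hidden_states[-1].shape)  # 13 tensors of shape (1, seq_len, 768) for GPT-2 small
```

For closed frontier models, this kind of access exists only if the lab grants it, which is the point of the list above.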
Here's what the labs have done (besides just publishing safety research[3]).
Anthropic:
- Releasing resources including RLHF and red-teaming datasets, an interpretability notebook, and model organisms prompts and transcripts (see the dataset-loading sketch after this list)
- Supporting creation of safety-relevant evals and tools for evals
- Giving free API access to some Open Philanthropy (OP) grantees and giving some researchers $1K (or sometimes more) in API credits
- (Giving deep model access to Ryan Greenblatt [LW(p) · GW(p)])
- (External mentoring, in particular via MATS)
- [No fine-tuning or deep access, except for Ryan]
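As an example of how usable these released resources are: the RLHF and red-teaming data (Anthropic's hh-rlhf dataset) loads straight from Hugging Face. A minimal sketch, assuming the Hugging Face datasets library; the data_dir for the red-teaming transcripts follows the dataset card:

```python
# Minimal sketch: loading Anthropic's publicly released RLHF / red-teaming data.
# Assumes the Hugging Face datasets library.
from datasets import load_dataset

# Human-preference comparisons used for RLHF (chosen vs. rejected responses)
hh = load_dataset("Anthropic/hh-rlhf", split="train")
print(hh[0]["chosen"][:200])

# Red-teaming transcripts live in the same repo; per the dataset card they are
# loaded by pointing at the red-team-attempts directory.
red_team = load_dataset("Anthropic/hh-rlhf", data_dir="red-team-attempts", split="train")
print(red_team.column_names)
```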
Google DeepMind:
- Publishing their model evals for dangerous capabilities and sharing resources for reproducing some of them
- Releasing Gemma SAEs
- Releasing Gemma weights
- (External mentoring, in particular via MATS)
- [No fine-tuning or deep access to frontier models]
OpenAI:
- OpenAI Evals
- Superalignment Fast Grants
- Maybe giving better API access to some OP grantees
- Fine-tuning GPT-3.5 (and "GPT-4 fine-tuning is in experimental access"; OpenAI shared GPT-4 fine-tuning access with academic researchers including Jacob Steinhardt and Daniel Kang in 2023)
- Update: GPT-4o fine-tuning
- Early access: shared GPT-4 with a few safety researchers including Rachel Freedman [LW(p) · GW(p)] before release
- API gives top 5 logprobs
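To illustrate the last point: the Chat Completions API can return per-token log probabilities plus the top 5 alternatives at each position. A minimal sketch, assuming the official openai Python client and a valid API key; the model name is illustrative:

```python
# Minimal sketch: requesting top-5 logprobs from the OpenAI Chat Completions API.
# Assumes the official openai Python client and OPENAI_API_KEY in the environment;
# the model name is just an illustrative choice.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Name a primary color."}],
    max_tokens=5,
    logprobs=True,    # return the logprob of each sampled token
    top_logprobs=5,   # and the 5 most likely alternatives at each position
)

for tok in resp.choices[0].logprobs.content:
    alternatives = {t.token: round(t.logprob, 3) for t in tok.top_logprobs}
    print(tok.token, alternatives)
```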
Meta AI:
- Releasing Llama weights
Microsoft:
- [Nothing]
xAI:
- [Nothing]
Related papers:
- Structured access for third-party research on frontier AI models (Bucknall and Trager 2023)
- Black-Box Access is Insufficient for Rigorous AI Audits (Casper et al. 2024)
- (The paper is about audits, like for risk assessment and oversight; this post is about research)
- A Safe Harbor for AI Evaluation and Red Teaming (Longpre et al. 2024)
- Structured Access (Shevlane 2022)
Footnotes:
1. "Helpful-only" refers to the version of the model RLHFed/RLAIFed/fine-tuned for helpfulness but not harmlessness.
2. Releasing model weights will likely be dangerous once models are more powerful. All past releases seem fine, but Meta's poor risk assessment and its lack of a plan to make release decisions conditional on risk assessment are concerning.
3. And an unspecified amount of funding via Frontier Model Forum grants.
1 comment
comment by Nathan Young · 2024-10-09T12:40:51.113Z · LW(p) · GW(p)
Yeah this seems like a good point. Not a lot to argue with, but yeah underrated.