AI labs can boost external safety research
post by Zach Stein-Perlman · 2024-07-31T19:30:16.207Z · LW · GW · 1 comment
Frontier AI labs can boost external safety researchers by
- Sharing better access to powerful models (early access, fine-tuning, helpful-only,[1] filters/moderation-off, logprobs, activations)[2] (a minimal activations example follows this list)
- Releasing research artifacts besides models
- Publishing (transparent, reproducible) safety research
- Giving API credits
- Mentoring
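To make the "activations" item above concrete: for open-weights models, any researcher can read intermediate activations directly. A minimal sketch, assuming the Hugging Face transformers library and using GPT-2 purely as an illustrative open model:

```python
# Minimal sketch: reading intermediate activations from an open-weights model.
# Assumes the Hugging Face transformers library; GPT-2 is just an illustrative choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Interpretability needs activations.", return_tensors="pt")
with torch.no_grad():
    # output_hidden_states=True returns the residual-stream state after every layer
    outputs = model(**inputs, output_hidden_states=True)

hidden_states = outputs.hidden_states  # tuple: embedding output + one tensor per layer
print(len(hidden_states), hidden_states[-1].shape)  # 13 tensors of shape (1, seq_len, 768) for GPT-2 small
```

For closed frontier models, this kind of access exists only if the lab grants it, which is the point of the list above.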
Here's what the labs have done (besides just publishing safety research[3]).
Anthropic:
- Releasing resources including RLHF and red-teaming datasets, an interpretability notebook, and model organisms prompts and transcripts (see the dataset-loading sketch after this list)
- Supporting creation of safety-relevant evals and tools for evals
- Giving free API access to some Open Philanthropy (OP) grantees and giving some researchers $1K (or sometimes more) in API credits
- (Giving deep model access to Ryan Greenblatt [LW(p) · GW(p)])
- (External mentoring, in particular via MATS)
- [No fine-tuning or deep access, except for Ryan]
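As an example of how usable these released resources are: the RLHF and red-teaming data (Anthropic's hh-rlhf dataset) loads straight from Hugging Face. A minimal sketch, assuming the Hugging Face datasets library; the data_dir for the red-teaming transcripts follows the dataset card:

```python
# Minimal sketch: loading Anthropic's publicly released RLHF / red-teaming data.
# Assumes the Hugging Face datasets library.
from datasets import load_dataset

# Human-preference comparisons used for RLHF (chosen vs. rejected responses)
hh = load_dataset("Anthropic/hh-rlhf", split="train")
print(hh[0]["chosen"][:200])

# Red-teaming transcripts live in the same repo; per the dataset card they are
# loaded by pointing at the red-team-attempts directory.
red_team = load_dataset("Anthropic/hh-rlhf", data_dir="red-team-attempts", split="train")
print(red_team.column_names)
```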
Google DeepMind:
- Publishing their model evals for dangerous capabilities and sharing resources for reproducing some of them
- Releasing Gemma SAEs
- Releasing Gemma weights
- (External mentoring, in particular via MATS)
- [No fine-tuning or deep access to frontier models]
OpenAI:
- OpenAI Evals
- Superalignment Fast Grants
- Maybe giving better API access to some OP grantees
- Fine-tuning GPT-3.5 (and "GPT-4 fine-tuning is in experimental access"; OpenAI shared GPT-4 fine-tuning access with academic researchers including Jacob Steinhardt and Daniel Kang in 2023)
- Update: GPT-4o fine-tuning
- Early access: shared GPT-4 with a few safety researchers including Rachel Freedman [LW(p) · GW(p)] before release
- API gives top 5 logprobs
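To illustrate the last point: the Chat Completions API can return per-token log probabilities plus the top 5 alternatives at each position. A minimal sketch, assuming the official openai Python client and a valid API key; the model name is illustrative:

```python
# Minimal sketch: requesting top-5 logprobs from the OpenAI Chat Completions API.
# Assumes the official openai Python client and OPENAI_API_KEY in the environment;
# the model name is just an illustrative choice.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Name a primary color."}],
    max_tokens=5,
    logprobs=True,    # return the logprob of each sampled token
    top_logprobs=5,   # and the 5 most likely alternatives at each position
)

for tok in resp.choices[0].logprobs.content:
    alternatives = {t.token: round(t.logprob, 3) for t in tok.top_logprobs}
    print(tok.token, alternatives)
```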
Meta AI:
- Releasing Llama weights
Microsoft:
- [Nothing]
xAI:
- [Nothing]
Related papers:
- Structured access for third-party research on frontier AI models (Bucknall and Trager 2023)
- Black-Box Access is Insufficient for Rigorous AI Audits (Casper et al. 2024)
- (The paper is about audits, like for risk assessment and oversight; this post is about research)
- A Safe Harbor for AI Evaluation and Red Teaming (Longpre et al. 2024)
- Structured Access (Shevlane 2022)
Footnotes:
1. "Helpful-only" refers to the version of the model RLHFed/RLAIFed/fine-tuned for helpfulness but not harmlessness.
2. Releasing model weights will likely be dangerous once models are more powerful. All past releases seem fine, but Meta's poor risk assessment and its lack of a plan to make release decisions conditional on risk assessment are concerning.
3. And an unspecified amount of funding via Frontier Model Forum grants.
1 comment
comment by Nathan Young · 2024-10-09T12:40:51.113Z · LW(p) · GW(p)
Yeah this seems like a good point. Not a lot to argue with, but yeah underrated.