IAPS: Mapping Technical Safety Research at AI Companies
post by Zach Stein-Perlman · 2024-10-24T20:30:41.159Z · LW · GW · 13 comments
This is a link post for https://www.iaps.ai/research/mapping-technical-safety-research-at-ai-companies
Comments sorted by top scores.
comment by Arthur Conmy (arthur-conmy) · 2024-10-26T20:54:16.399Z · LW(p) · GW(p)
I assume all the data is fairly noisy: scanning for the domain I know in https://raw.githubusercontent.com/Oscar-Delaney/safe_AI_papers/refs/heads/main/Automated%20categorization/final_output.csv, it misses roughly half of the GDM mech interp output from the specified window, and it also mislabels https://arxiv.org/abs/2208.08345 and https://arxiv.org/abs/2407.13692 as mech interp (though two labels are applied to these papers and I didn't dig into which was used).
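For reference, this kind of spot check is easy to reproduce against the released CSV. The sketch below runs on a toy inline sample rather than the real file, and the column names (`title`, `organization`, `category`) are my assumptions; the actual file may use different headers.

```python
import csv
import io

# Toy sample mimicking the released final_output.csv; the column names
# and example rows are assumptions, not the real file's contents.
SAMPLE = """title,organization,category
Tracr: Compiled Transformers as a Laboratory for Interpretability,GDM,Mech Interp
Discovering Agents,GDM,Safety by Design
Prover-Verifier Games,OpenAI,Enhancing Human Feedback
"""

def papers_by_category(csv_text, org, category):
    """Return titles from `org` that were labelled with `category`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["title"] for row in reader
            if row["organization"] == org and row["category"] == category]

print(papers_by_category(SAMPLE, "GDM", "Mech Interp"))
```

To check the real data, one would replace `SAMPLE` with the downloaded CSV text and compare the returned titles against a hand-compiled list of GDM mech interp papers from the same window.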
Replies from: Oscar Delaney, Zach Stein-Perlman
↑ comment by Oscar (Oscar Delaney) · 2024-10-28T10:47:16.640Z · LW(p) · GW(p)
Thanks for engaging with our work, Arthur! Perhaps I should have signposted this more clearly in the GitHub repo as well as the report, but the categories assigned by GPT-4o were not final; we reviewed its categories and made changes where necessary. The final categories we gave are available here. We put the discovering agents paper under 'safety by design' and labelled the prover-verifier games paper 'enhancing human feedback'. (Though for some papers the best categorization may of course not be clear, e.g. if a paper touches on multiple safety research areas.)
If you have the links handy I would be interested in which GDM mech interp papers we missed, and I can look into where our methodologies went wrong.
Replies from: arthur-conmy
↑ comment by Arthur Conmy (arthur-conmy) · 2024-11-02T11:32:23.163Z · LW(p) · GW(p)
- Here are the other GDM mech interp papers missed:
- We have some blog posts of comparable standard to the Anthropic circuit updates listed:
- You use a very wide scope for the "enhancing human feedback" (basically any post-training paper mentioning 'align'-ing anything). So I will use a wide scope for what counts as mech interp and also include:
- https://arxiv.org/abs/2401.06102
- https://arxiv.org/abs/2304.14767
- There are a few other papers from the PAIR group, as well as from Mor Geva and Been Kim, but mostly with Google Research affiliations, so it seems fine not to include these, as IIRC you weren't counting pre-GDM-merger Google Research/Brain work
↑ comment by Oscar (Oscar Delaney) · 2024-11-05T11:39:20.243Z · LW(p) · GW(p)
Thanks for that list of papers/posts. Most of the papers you linked are not included because they did not surface in either of our search strategies: (1) titles containing specific keywords that we searched for on arXiv; (2) the paper being linked on the company's website. I agree this is a limitation of our methodology. We won't add these papers in now, as that would be somewhat ad hoc and inconsistent between the companies.
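Search strategy (1) can be sketched as a simple title filter. The keyword list below is illustrative only, not the one actually used in the report, and real usage would pull titles from the arXiv API rather than a hardcoded list.

```python
# Sketch of search strategy (1): keep papers whose titles contain any of
# a set of safety-related keywords. KEYWORDS here is an assumption for
# illustration, not the report's actual list.
KEYWORDS = ["unlearning", "jailbreak", "interpretability", "alignment", "red team"]

def matches_keywords(title, keywords=KEYWORDS):
    """Case-insensitive substring match of any keyword against the title."""
    t = title.lower()
    return any(k in t for k in keywords)

titles = [
    "Machine Unlearning of Pre-trained Large Language Models",
    "Scaling Laws for Neural Language Models",
]
print([t for t in titles if matches_keywords(t)])
```

A filter like this illustrates the limitation discussed above: a mech interp paper whose title avoids the chosen keywords, and which isn't linked on the company's site, would be missed entirely.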
Re the blog posts from Anthropic and what counts as a paper, I agree this is a tricky demarcation problem. We included the 'Circuit Updates' because it was linked to as a 'paper' on the Anthropic website. Even if GDM has a higher bar for what counts as a 'paper' than Anthropic, I think we don't really want to be adjudicating this, so I feel comfortable just deferring to each company about what counts as a paper for them.
Replies from: nc
↑ comment by nc · 2024-11-05T11:49:47.429Z · LW(p) · GW(p)
I would have found it helpful in your report for there to be a ROSES-type diagram or other flowchart showing the steps in your paper collation. This would bring it closer in line with other scoping reviews and would have made it easier to understand your methodology.
↑ comment by Zach Stein-Perlman · 2024-10-26T21:31:32.594Z · LW(p) · GW(p)
@Zoe Williams [LW · GW]
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-10-24T20:54:43.852Z · LW(p) · GW(p)
My main takeaway would be that this seems like quite strong evidence towards the view expressed in https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/, that most safety research doesn't come from the top labs.
Replies from: Zach Stein-Perlman
↑ comment by Zach Stein-Perlman · 2024-10-24T21:07:46.141Z · LW(p) · GW(p)
How does this paper suggest "that most safety research doesn't come from the top labs"?
Replies from: bogdan-ionut-cirstea
↑ comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-10-24T21:28:33.995Z · LW(p) · GW(p)
Indirectly, because 90 papers seems like a tiny number compared to what got published on arXiv during that same time interval. Depending on how one counts, I wouldn't be surprised if there were >90 papers from outside the labs even looking only at the unlearning category.
Replies from: ryan_greenblatt
↑ comment by ryan_greenblatt · 2024-10-24T23:54:22.508Z · LW(p) · GW(p)
Counting the number of papers isn't going to be a good strategy.
I do think total research outside of labs looks competitive with research from labs, and research done outside of labs has probably produced more differential safety progress in total.
I also think open weight models are probably good so far in terms of making AI more likely to go well (putting aside norms and precedents of releases), though I don't think this is necessarily implied from "more research happens outside of labs".
Replies from: Lanrian
↑ comment by Lukas Finnveden (Lanrian) · 2024-10-25T17:14:30.511Z · LW(p) · GW(p)
probably research done outside of labs has produced more differential safety progress in total
To be clear — this statement is consistent with companies producing way more safety research than non-companies, if companies also produce even way more capabilities progress than non-companies? (Which I would've thought is the case, though I'm not well-informed. Not sure if "total research outside of labs look competitive with research from labs" is meant to deny this possibility, or if you're only talking about safety research there.)
Replies from: ryan_greenblatt
↑ comment by ryan_greenblatt · 2024-10-25T18:12:19.514Z · LW(p) · GW(p)
I'm just talking about research intended to be safety/safety-adjacent. As in: of this research, what has the quality-weighted differential safety progress been?
Probably the word "differential" was just a mistake.
comment by sanyer (santeri-koivula) · 2024-12-02T11:55:16.747Z · LW(p) · GW(p)
Really interesting work! I have two questions:
1. In the 'model organisms of misalignment' section, it is stated that AI companies might be nervous about researching model organisms because it could increase the likelihood of new regulation, since it would provide more evidence of concerning properties in AI systems. Doesn't this depend on what kind of model organisms the company expects to be able to develop? If model organisms are difficult to find, that would be evidence that alignment is easier, and thus there would be less need for regulation.
2. Why didn't you list AI control work as one of the areas that may be slow to progress without efforts from outside the labs? According to your incentives analysis, AI companies don't seem to have many incentives to pursue this kind of work, and there were zero papers on AI control.