Questions for labs

post by Zach Stein-Perlman · 2024-04-30T22:15:55.362Z · LW · GW · 11 comments


  OpenAI-Microsoft relationship
  [At least three different labs]

Associated with AI Lab Watch [LW · GW], I sent questions to some labs a week ago (except I failed to reach Microsoft). I didn't really get any replies (one person replied in their personal capacity; this was very limited and they didn't answer any questions). Here are most of those questions, with slight edits since I shared them with the labs + questions I asked multiple labs condensed into the last two sections.

Lots of my questions are normal I didn't find public info on this safety practice and I think you should explain questions. Some are more like it's pretty uncool that I can't find the answer to this — like: breaking commitments, breaking not-quite-commitments and not explaining, having ambiguity around commitments, and taking credit for stuff[1] when it's very unclear that you should get credit are pretty uncool.


Internal governance stuff (I'm personally particularly interested in these questions—I think Anthropic has tried to set up great internal governance systems and maybe it has succeeded but it needs to share more information for that to be clear from the outside):

Details of when the RSP triggers evals:

Deployment commitments: does Anthropic consider itself bound by any commitments about deployment it has made in the past besides those in its RSP (and in particular not meaningfully advancing the frontier [LW(p) · GW(p)])? (Why hasn't it clarified this after the confusion around Anthropic's commitments after the Claude 3 launch?)

You shared Claude 2 with METR. Did you let METR or any other external parties do model evals for dangerous capabilities on Claude 3 before deployment? Do you plan for such things in the future? (To be clear, I don't just mean using external red-teamers, I mean using experts in eliciting capabilities and doing evals who have substantial independence.)

I think it would be really cool if y'all said more about lots of safety stuff, to help other labs do better. Like, safety evals, and red-teaming, and the automated systems mentioned here, and all the other safety stuff you do.


Preparedness Framework:

Internal governance:

What are your plans for the profit cap for investors?

You shared GPT-4 with METR. Are you planning to let METR or any other external parties do model evals for dangerous capabilities before future deployments? (To be clear, I don't just mean using external red-teamers, I mean using experts in eliciting capabilities and doing evals who have substantial independence.)


DeepMind and Google have various councils and teams related to safety (see e.g. here): the Responsible AI Council, the Responsible Development and Innovation team, the Responsibility and Safety Council, the Advanced Technology Review Council, etc. From the outside, it's difficult to tell whether they actually improve safety. Can you point me to details on what they do, what exactly they're responsible for, what their powers are, etc.?

My impression is that when push comes to shove, Google can do whatever it wants with DeepMind; DeepMind and its leadership have no hard power. Is this correct? If it is mistaken, can you clarify the relationship between DeepMind and Google?

I hope DeepMind will soon make RSP-y commitments — like, dangerous capability evals before deployment + at least one risk threshold and how it would affect deployment decisions + a plan for making safety arguments after models have dangerous capabilities.

I wish DeepMind (or Google) would articulate a safety plan in the company's voice, clearly supported by leadership, rather than leaving DeepMind safety folks to do so in their personal voices [? · GW].

I'm kind of confused about the extent to which DeepMind and Google have a mandate for preventing extreme risks and sharing the benefits of powerful AI. Would such a mandate be expressed in places besides DeepMind's About page, Google's AI Principles, and Google's Responsible AI Practices? Do any internal oversight bodies have explicit mandates along these lines?

What's the status of DeepMind's old Operating Principles?

Terms of service:


Microsoft says:

When it comes to frontier model deployment, Microsoft and OpenAI have together defined capability thresholds that act as a trigger to review models in advance of their first release or downstream deployment. The scope of a review, through our joint Microsoft-OpenAI Deployment Safety Board (DSB), includes model capability discovery. We established the joint DSB’s processes in 2021, anticipating a need for a comprehensive pre-release review process focused on AI safety and alignment, well ahead of regulation or external commitments mandating the same.

We have exercised this review process with respect to several frontier models, including GPT-4. Using Microsoft’s Responsible AI Standard and OpenAI’s experience building and deploying advanced AI systems, our teams prepare detailed artefacts for the joint DSB review. Artefacts record the process by which our organizations have mapped, measured, and managed risks, including through the use of adversarial testing and third-party evaluations as appropriate. We continue to learn from, and refine, the joint DSB process, and we expect it to evolve over time.

How do you measure capabilities? What are the capability thresholds? How does the review work? Can you share the artifacts? What are other details I'd want to know?


Is there a risk or capability threshold beyond which Meta AI would stop releasing model weights? What could lead Meta AI to stop releasing model weights?

Does Meta AI measure the risk of autonomous replication, per METR and the white house commitments? Does it plan to?

OpenAI-Microsoft relationship

[At least three different labs]

[new] Did you commit to share models with UK AISI before deployment? (I suspect Rishi Sunak or Politico hallucinated this.)

Do you let external parties do model evals for dangerous capabilities before you deploy models? (To be clear, I don't just mean using external red-teamers, I mean using experts in eliciting capabilities and doing evals who have substantial independence.)

Do you do anything to avoid helping others create powerful models (via model inversion or just imitation learning)? Do you enforce any related provisions in your terms of service?

Do you have a process for staff to follow if they have concerns about safety? (Where have you written about it, or can you share details?) What about concerns/suggestions about risk assessment policies or their implementation?

Do you ever keep research private for safety reasons? How do you decide what research to publish; how do safety considerations determine what you publish?

Do AI oversight bodies within the lab have a mandate for safety and preventing extreme risks? Please share details on how they work.


Do you require nontrivial KYC for some types of model access? If so, please explain or point me to details.

What do you do to improve adversarial robustness, e.g. to prevent jailbreaking? (During training and especially at inference-time.)

Can you promise that you don't use non-disparagement agreements (nor otherwise discourage current or past staff or board members from talking candidly about their impressions of and experiences with the company)?

I'm currently uncertain about how system prompts can prevent misuse. Maybe they can't and I won't write about this. But in case they can: How do you think about system prompts and preventing misuse?

(Note: if a public source is kinda related to a question but doesn't directly answer it, probably I'm already aware of it and mention it on AI Lab Watch.)

  1. ^

    I'm thinking of internal governance stuff and risk assessment / RSP-y stuff.

  2. ^

    At least among my friends, Anthropic gets way more credit for LTBT than it deserves based on public information. For all we know publicly, Google and Amazon can abrogate the LTBT at will! Failing to share the details on the LTBT but claiming credit for the LTBT being great is not cool.

  3. ^


    Engaging external experts has the advantage of leveraging specialized domain expertise and increasing the likelihood of an unbiased audit. Initially, we expected this collaboration [with METR] to be straightforward, but it ended up requiring significant science and engineering support on our end. Providing full-time assistance diverted resources from internal evaluation efforts.

    Anthropic cofounder Jack Clark:

    Pre-deployment testing is a nice idea but very difficult to implement.


Comments sorted by top scores.

comment by Akash (akash-wasil) · 2024-05-02T16:36:52.877Z · LW(p) · GW(p)

@Zach Stein-Perlman [LW · GW], great work on this. I would be interested in you brainstorming some questions that have to do with the lab's stances toward (government) AI policy interventions.

After a quick 5 min brainstorm, here are some examples of things that seem relevant:

  • I remember hearing that OpenAI lobbied against the EU AI Act– what's up with that?
  • I heard a rumor that Congresspeople and their teams reached out to Sam/OpenAI after his testimony. They allegedly asked for OpenAI's help to craft legislation around licensing, and then OpenAI refused. Is that true? 
  • Sam said we might need an IAEA for AI at some point– what did he mean by this? At what point would he see that as valuable?
  • In general, what do labs think the US government should be doing? What proposals would they actively support or even help bring about? (Flagging ofc that there are concerns about actual and perceived regulatory capture, but there are also major advantages to having industry players support & contribute to meaningful regulation).
  • Senator Cory Booker recently asked Jack Clark something along the lines of "what is your top policy priority right now//what would you do if you were a Senator." Jack responded with something along the lines of "I would make sure the government can deploy AI successfully. We need a testing regime to better understand risks, but the main risk is that we don't use AI enough, and we need to make sure we stay at the cutting edge." What's up with that?
  • Why haven't Dario and Jack made public statements about specific government interventions? Do they believe that there are some circumstances under which a moratorium would need to be implemented, labs would need to be nationalized (or internationalized), or something else would need to occur to curb race dynamics? (This could be asked to any of the lab CEOs/policy team leads– I don't mean to be picking on Anthropic, though I think Sam/OpenAI have had more public statements here, and I think the other labs are scoring more poorly across the board//don't fully buy into the risks in the first place.)
  • Big tech is spending a lot of money on AI lobbying. How much is each lab spending (this is something you can estimate with publicly available data), and what are they actually lobbying for/against?

I imagine there's a lot more in this general category of "labs and how they are interacting with governments and how they are contributing to broader AI policy efforts", and I'd be excited to see AI Lab Watch (or just you) dive into this more.

Replies from: Zach Stein-Perlman
comment by Zach Stein-Perlman · 2024-05-02T18:58:23.395Z · LW(p) · GW(p)

Thanks. Briefly:

I'm not sure what the theory of change for listing such questions is.

In the context of policy advocacy, think it's sometimes fine/good for labs to say somewhat different things publicly vs privately. Like, if I was in charge of a lab and believed (1) the EU AI Act will almost certainly pass and (2) it has some major bugs that make my life harder without safety benefits, I might publicly say "I support (the goals of) the EU AI Act" and privately put some effort into removing those bugs, which is technically lobbying to weaken the Act.

(^I'm not claiming that particular labs did ~this rather than actually lobby against the Act. I just think it's messy and regulation isn't a one-dimensional thing that you're for or against.)

Edit: this comment was misleading and partially replied to a strawman. I agree it would be good for the labs and their leaders to publicly say some things about recommended regulation (beyond what they already do) and their lobbying. I'm nervous about trying to litigate rumors for reasons I haven't explained.

Edit 2: based on,, and background information, I believe that OpenAI, Microsoft, Google, and Meta privately lobbied to make the EU AI Act worse—especially by lobbying against rules for foundation models—and that this is inconsistent with OpenAI's and Altman's public statements.

Replies from: akash-wasil
comment by Akash (akash-wasil) · 2024-05-02T20:15:27.322Z · LW(p) · GW(p)

Right now, I think one of the most credible ways for a lab to show its committment to safety is through its engagement with governments.

I didn’t mean to imply that a lab should automatically be considered “bad” if its public advocacy and its private advocacy differ.

However, when assessing how “responsible” various actors are, I think investigating questions relating to their public comms, engagement with government, policy proposals, lobbying efforts, etc would be valuable.

If Lab A had slightly better internal governance but lab B had better effects on “government governance”, I would say that lab B is more “responsible” on net.

comment by Zach Stein-Perlman · 2024-05-08T23:00:29.898Z · LW(p) · GW(p)

Yay @Zac Hatfield-Dodds [LW · GW] of Anthropic for feedback and corrections including clarifying a couple of Anthropic's policies. Two pieces of not-previously-public information:

  • I was disappointed that Anthropic's Responsible Scaling Policy only mentions evaluation "During model training and fine-tuning." Zac told me "this was a simple drafting error - our every-three months evaluation commitment is intended to continue during deployment. This has been clarified for the next version, and we've been acting accordingly all along." Yay.
  • I said labs should have a "process for staff to escalate concerns about safety" and "have a process for staff and external stakeholders to share concerns about risk assessment policies or their implementation with the board and some other staff, including anonymously." I noted that Anthropic's RSP includes a commitment to "Implement a non-compliance reporting policy." Zac told me "Beyond standard internal communications channels, our recently formalized non-compliance reporting policy meets these criteria [including independence], and will be described in the forthcoming RSP v1.1." Yay.

I think it's cool that Zac replied (but most of my questions for Anthropic remain).

I have not yet received substantive corrections/clarifications from any other labs.

(I have made some updates to based on Zac's feedback—and revised Anthropic's score from 45 to 48—but have not resolved all of it.)

comment by ryan_greenblatt · 2024-05-01T04:07:41.209Z · LW(p) · GW(p)

I get it if you're worried about leaks but I don't get how it could be a hard engineering problem — just share API access early, with fine-tuning

Fine-tuning access can be extremely expensive if implemented naively and it's plausible that cheap (LoRA) fine-tuning isn't even implemented for new models internally for a while at AI labs. If you make the third party groups pay for it than I suppose this isn't a problem, but the costs could be vast.

I agree that inference access is cheap.

Replies from: Fabien
comment by Fabien Roger (Fabien) · 2024-05-01T10:07:08.381Z · LW(p) · GW(p)

What do you expect to be expensive? The engineer hours to build the fine-tuning infra? Or the actual compute for fine-tuning?

Given the amount of internal fine-tuning experiments going on for safety stuff, I'd be surprised if the infra was a bottleneck, though maybe there is a large overhead in making these find-tuned models available through an API.

I'd be even more surprised if the cost of compute was significant compared to the rest of the activity the lab is doing (I think fine-tuning on a few thousand sequences is often enough for capabilities' evaluations, you rarely need massive training runs).

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-05-01T16:22:03.369Z · LW(p) · GW(p)

Compute for doing inference on the weights if you don't have LoRA finetuning set up properly.

My implicit claim is that there maybe isn't that much fine-tuning stuff internally.

Replies from: Zach Stein-Perlman, Fabien
comment by Zach Stein-Perlman · 2024-05-21T06:58:10.824Z · LW(p) · GW(p)

Maybe setting up custom fine-tuning is hard and labs often only set it up during deployment...

(Separately, it would be nice if OpenAI and Anthropic let some safety researchers do fine-tuning now.)

comment by Fabien Roger (Fabien) · 2024-05-02T14:49:48.293Z · LW(p) · GW(p)

Isn't that only ~10x more expensive than running the forward-passes (even if you don't do LoRA)? Or is it much more because of communications bottlenecks + the infra being taken by the next pretraining run (without the possibility to swap the model in and out).

comment by faul_sname · 2024-05-01T17:31:02.505Z · LW(p) · GW(p)

What does Anthropic owe to its investors/stockholders? (any fiduciary duty? any other promises or obligations?) I think balancing their interests with pursuit of the mission; anything more concrete?

This sounds like one of those questions that's a terrible idea to answer in writing without extensive consultation with your legal department. How long was the time period between when you asked this question and when you made this post?

Replies from: Zach Stein-Perlman
comment by Zach Stein-Perlman · 2024-05-01T17:59:50.013Z · LW(p) · GW(p)

This post is not trying to shame labs for failing to answer before; I didn't try hard to get them to answer. (The period was one week but I wasn't expecting answers to my email / wouldn't expect to receive a reply even if I waited longer.)

(Separately, I kinda hope the answers to basic questions like this are already written down somewhere...)