Zach Stein-Perlman's Shortform

post by Zach Stein-Perlman · 2021-08-29T18:00:56.148Z · LW · GW · 237 comments

Comments sorted by top scores.

comment by Zach Stein-Perlman · 2024-08-04T05:45:35.759Z · LW(p) · GW(p)

Two weeks ago some senators asked OpenAI questions about safety. A few days ago OpenAI responded. Its reply is frustrating.

OpenAI's letter ignores all of the important questions^[1] and instead brags about somewhat-related "safety" stuff. Some of this shows chutzpah — the senators, aware of tricks like letting ex-employees nominally keep their equity but excluding them from tender events, ask

Can you further commit to removing any other provisions from employment agreements that could be used to penalize employees who publicly raise concerns about company practices, such as the ability to prevent employees from selling their equity in private “tender offer” events?

and OpenAI's reply just repeats the we-don't-cancel-equity thing:

OpenAI has never canceled a current or former employee’s vested equity. The May and July communications to current and former employees referred to above confirmed that OpenAI would not cancel vested equity, regardless of any agreements, including non-disparagement agreements, that current and former employees may or may not have signed, and we have updated our relevant documents accordingly.

!!^[2]

One thing in OpenAI's letter is object-level notable: they deny that they ever committed compute to Superalignment.

To further our safety research agenda, last July we committed to allocate at least 20 percent of the computing resources we had secured to AI safety over a multi-year period. This commitment was always intended to apply to safety efforts happening across our company, not just to a specific team. We continue to uphold this commitment.

Altman tweeted the same thing at the time the letter was published.

I think this is straightforward gaslighting. I didn't find a super-explicit quote from OpenAI or its leadership that the compute was for Superalignment, but:

- The announcement was pretty clear:
  > Introducing Superalignment
  > We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within four years, we’re starting a new team, co-led by Ilya Sutskever and Jan Leike, and dedicating 20% of the compute we’ve secured to date to this effort.
- As far as I know, everyone, including OpenAI people and people close to OpenAI, interpreted the compute commitment as for Superalignment.
- I never heard it suggested that it was for not-just-superalignment.

Sidenote on less explicit deception (the "20%" thing): most people are confused about "20% of compute secured as of July 2023, to be used over four years" vs "20% of compute," and when your announcement is confusing and indeed most people are confused and you fail to deconfuse them, you're kinda culpable. OpenAI continues to fail to clarify this. E.g. here the senators asked "Does OpenAI plan to honor its previous public commitment to dedicate 20 percent of its computing resources to research on AI safety?" and OpenAI replied "last July we committed to allocate at least 20 percent of the computing resources we had secured to AI safety over a multi-year period." This sentence is close to the maximally misleading way to say "the commitment was only for compute we'd secured in July 2023, and we don't have to use it for safety until 2027."
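To make the gap between the two readings concrete, here is a rough sketch with made-up numbers. The growth factor is purely hypothetical (OpenAI hasn't published one), `share_of_current_compute` is just an illustrative helper, and the sketch ignores the separate question of how the compute gets spread over the four years.

```python
def share_of_current_compute(committed_fraction: float, growth_factor: float) -> float:
    """Fraction of today's compute covered by the commitment, assuming it
    applies only to compute secured at announcement time.

    growth_factor = (compute secured today) / (compute secured at announcement).
    """
    if growth_factor <= 0:
        raise ValueError("growth_factor must be positive")
    return committed_fraction / growth_factor


# HYPOTHETICAL: assume total secured compute has grown 4x since July 2023.
HYPOTHETICAL_GROWTH = 4.0

share = share_of_current_compute(0.20, HYPOTHETICAL_GROWTH)
print(f"{share:.1%} of current compute")
```

Under that assumed 4x growth, "20% of compute secured to date" works out to 5% of current compute, which is why the two readings matter more the faster total compute grows.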

^[1] Most important to me were 3, 4a, and 9. Maybe also 6; I'm unfamiliar with the facts there.
^[2] I'm confused by this reply: even pretending OpenAI is totally ruthless, I'd think it's not incentivized to exclude people from tender offers, and moreover it's incentivized to clarify that. Leaving it ambiguous leaves ex-employees in a little more fear of OpenAI excluding them (even though presumably OpenAI never would, since it would look sooo bad after e.g. Altman said "vested equity is vested equity, full stop"), but it looks bad to people like me and the senators...
OpenAI has said something internally about including past employees in tender events, but this leaves some ambiguity and I wish OpenAI had answered the question.

Replies from: Zach Stein-Perlman, mateusz-baginski

↑ comment by Zach Stein-Perlman · 2024-08-05T01:30:04.953Z · LW(p) · GW(p)

Clarification on the Superalignment commitment: OpenAI said:

We are dedicating 20% of the compute we’ve secured to date over the next four years to solving the problem of superintelligence alignment. Our chief basic research bet is our new Superalignment team, but getting this right is critical to achieve our mission and we expect many teams to contribute, from developing new methods to scaling them up to deployment.

The commitment wasn't compute for the Superalignment team—it was compute for superintelligence alignment. (As opposed to, in my view, work by the posttraining team and near-term-focused work by the safety systems team and preparedness team.) Regardless, OpenAI is not at all transparent about this, and they violated the spirit of the commitment by denying Superalignment compute or a plan for when they'd get compute, even if the literal commitment doesn't require them to give any compute to safety until 2027.

↑ comment by Mateusz Bagiński (mateusz-baginski) · 2024-08-04T16:12:50.177Z · LW(p) · GW(p)

Also, they failed to provide the promised fraction of compute to the Superalignment team (and not because it was needed for non-Superalignment safety stuff).

comment by Zach Stein-Perlman · 2024-08-04T05:30:11.721Z · LW(p) · GW(p)

Update, five days later: OpenAI published the GPT-4o system card, with most of what I wanted (but kinda light on details on PF evals).

OpenAI Preparedness scorecard

Context:

OpenAI's Preparedness Framework says OpenAI will maintain a public scorecard showing their current capability level (they call it "risk level"), in each risk category they track, before and after mitigations.
When OpenAI released GPT-4o, it said "GPT-4o does not score above Medium risk in any of these categories" but didn't break down risk level by category.
(I've remarked on this repeatedly. I've also remarked that the ambiguity suggests that OpenAI didn't actually decide whether 4o was Low or Medium in some categories, but this isn't load-bearing for the "OpenAI is not following its plan" proposition.)

News: a week ago,^[1] a "Risk Scorecard" section appeared near the bottom of the 4o page. It says:

Updated May 8, 2024
As part of our Preparedness Framework, we conduct regular evaluations and update scorecards for our models. Only models with a post-mitigation score of “medium” or below are deployed. The overall risk level for a model is determined by the highest risk level in any category. Currently, GPT-4o is assessed at medium risk both before and after mitigation efforts.

This is not what they committed to publish. It's quite explicit that the scorecard should show risk in each category, not just overall.^[2]

(What they promised: a real version of the image below. What we got: the quote above.)

Additionally, they're supposed to publish their evals and red-teaming.^[3] But OpenAI has still said nothing about how it evaluated 4o.

Most frustrating is the failure to acknowledge that they're not complying with their commitments. If they were transparent and said they're behind schedule and explained their current plan, that would probably be fine. Instead they're claiming to have implemented the PF and to have evaluated 4o correctly and publicly taking credit for that and ignoring the issues.

^[1] Archive versions:
- May 13 (original)
- July 24 (similar to original)
- July 26 ("scorecard" added)
^[2] Two relevant quotes:
- "As a part of our Preparedness Framework, we will maintain a dynamic (i.e., frequently updated) Scorecard that is designed to track our current pre-mitigation model risk across each of the risk categories, as well as the post-mitigation risk."
- "Scorecard, in which we will indicate our current assessments of the level of risk along each tracked risk category"
And there are no suggestions to the contrary.
^[3] This is not as explicit in the PF, but they're supposed to frequently update the scorecard section of the PF, and the scorecard section is supposed to describe their evals.
Regardless, this is part of the White House voluntary commitments:
Publicly report model or system capabilities, limitations, and domains of appropriate and inappropriate use, including discussion of societal risks, such as effects on fairness and bias[:] . . . . publish reports for all new significant model public releases . . . . These reports should include the safety evaluations conducted (including in areas such as dangerous capabilities, to the extent that these are responsible to publicly disclose) . . . and the results of adversarial testing conducted to evaluate the model's fitness for deployment [and include the "red-teaming and safety procedures"].
For more on commitments, see https://ailabwatch.org/resources/commitments/.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-08-08T17:30:27.329Z · LW(p) · GW(p)

Coda: yay OpenAI for publishing the GPT-4o system card, including eval results and the scorecard they promised! (Minus the "Unknown Unknowns" row but that row never made sense to me anyway.)

comment by Zach Stein-Perlman · 2024-07-13T05:50:25.389Z · LW(p) · GW(p)

OpenAI reportedly rushed the GPT-4o evals. This article makes it sound like the problem is not having enough time to test the final model. I don't think that's necessarily a huge deal — if you tested a recent version of the model and your tests have a large enough safety buffer, it's OK to not test the final model at all.

But there are several apparent issues with the application of the Preparedness Framework (PF) for GPT-4o (not from the article):

- They didn't publish their scorecard
  - Despite the PF saying they would
  - They instead said "GPT-4o does not score above Medium risk in any of these categories." (Maybe they didn't actually decide whether it's Low or Medium in some categories!)
- They didn't publish their evals
  - Despite the PF strongly suggesting they would
  - Despite committing to in the White House voluntary commitments
- While rushing testing of the final model would be OK in some circumstances, OpenAI's PF is supposed to ensure safety by testing the final model before deployment. (This contrasts with Anthropic's RSP, which is supposed to ensure safety with its "safety buffer" between evaluations and doesn't require testing the final model.) So OpenAI committed to testing the final model well and its current safety plan depends on doing that.
  - [Edit: this may be ambiguous: OpenAI explicitly committed to test every 2x increase in effective training compute, but maybe it merely strongly suggests that it's supposed to test all final models ("We will evaluate all our frontier models, including at every 2x effective compute increase during training runs"; "Only models with a post-mitigation score of 'medium' or below can be deployed"). This is mostly moot in this case, since here they claimed to evaluate GPT-4o, not an earlier version.]
- [Edit: also maybe lack of third-party audits, but they didn't commit to do audits at any particular frequency]

Replies from: nathan-helm-burger, tao-lin

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-07-14T16:25:34.950Z · LW(p) · GW(p)

I am also frustrated by the current underwhelming state of safety evals being done in general and in particular for GPT-4o. I do think it's worth mentioning that privately sharing eval results with the Federal government wouldn't be evident to the general public. I hope that OpenAI is privately sharing more details than they are releasing publicly. The fact that the public can't know whether this is the case is a problem. A potential solution might be for the government to report on their take on whether a new frontier model is "in compliance with reporting standards" or not. That way, even though the evals were private, the public would know if the government had received its private reports.

↑ comment by Tao Lin (tao-lin) · 2024-08-20T22:55:21.624Z · LW(p) · GW(p)

if you tested a recent version of the model and your tests have a large enough safety buffer, it's OK to not test the final model at all.

I agree in theory but testing the final model feels worthwhile, because we want more direct observability and less complex reasoning in safety cases.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-08-20T23:06:47.425Z · LW(p) · GW(p)

Thanks. Is this because of posttraining? Ignoring posttraining, I'd rather that evaluators get the 90% through training model version and are unrushed than the final version and are rushed — takes?

Replies from: tao-lin

↑ comment by Tao Lin (tao-lin) · 2024-08-21T19:31:30.631Z · LW(p) · GW(p)

Two versions with the same posttraining, one with only 90% of the pretraining, are indeed very similar; no need to evaluate both. It's likely more like one model with 80% of the pretraining and 70% of the posttraining of the final model, and the last 30% of posttraining might be significant.

comment by Zach Stein-Perlman · 2024-07-15T19:27:18.411Z · LW(p) · GW(p)

Woah. OpenAI antiwhistleblowing news seems substantially more obviously-bad than the nondisparagement-concealed-by-nondisclosure stuff. If page 4 and the "threatened employees with criminal prosecutions if they reported violations of law to federal authorities" part aren't exaggerated, it crosses previously-uncrossed integrity lines. H/t Garrison Lovely.

[Edit: probably exaggerated; see comments. But I haven't seen takes that the "OpenAI made staff sign employee agreements that required them to waive their federal rights to whistleblower compensation" part is likely exaggerated, and that alone seems quite bad.]

Replies from: aphyer, Dagon, Vladimir_Nesov

↑ comment by aphyer · 2024-07-16T11:53:02.383Z · LW(p) · GW(p)

Matt Levine is worth reading on this subject (also on many others).

https://www.bloomberg.com/opinion/articles/2024-07-15/openai-might-have-lucrative-ndas?srnd=undefined

The SEC has a history of taking aggressive positions on what an NDA can say (if your NDA does not explicitly have a carveout for 'you can still say anything you want to the SEC', they will argue that you're trying to stop whistleblowers from talking to the SEC) and a reliable tendency to extract large fines and give a chunk of them to the whistleblowers.

This news might be better modeled as 'OpenAI thought it was a Silicon Valley company, and tried to implement a Silicon Valley NDA, without consulting the kind of lawyers a finance company would have used for the past few years.'

(To be clear, this news might also be OpenAI having been doing something sinister. I have no evidence against that, and certainly they've done shady stuff before. But I don't think this news is strong evidence of shadiness on its own).

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-07-16T13:55:48.887Z · LW(p) · GW(p)

Hmm. Part of the news is "Non-disparagement clauses that failed to exempt disclosures of securities violations to the SEC"; this is minor. Part of the news is "threatened employees with criminal prosecutions if they reported violations of law to federal authorities"; this seems major and sinister.

Replies from: aphyer

↑ comment by aphyer · 2024-07-16T14:18:43.858Z · LW(p) · GW(p)

Not a lawyer, but I think those are the same thing.

The SEC's legal theory is that "non-disparagement clauses that failed to exempt disclosures of securities violations to the SEC" and "threats of prosecution if you report violations of law to federal authorities" are the same thing, and on reading the letter I can't find any wrongdoing alleged or any investigation requested outside of issues with "OpenAI's employment, severance, non-disparagement and non-disclosure agreements".

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-07-19T19:59:34.251Z · LW(p) · GW(p)

I'm confused by the word "prosecution" here. I'd assume violating your OpenAI contract is a civil thing, not a criminal thing.

Edit: like I think the word "prosecution" should be "suit" in your sentence about the SEC's theory. And this makes the whistleblowers' assertion weirder.

Replies from: aphyer

↑ comment by aphyer · 2024-07-19T21:17:11.324Z · LW(p) · GW(p)

Yeah, I have no idea. It would be much clearer if the contracts themselves were available. Obviously the incentive of the plaintiffs is to make this sound as serious as possible, and obviously the incentive of OpenAI is to make it sound as innocuous as possible. I don't feel highly confident without more information; my gut is leaning towards 'opportunistic plaintiffs hoping for a cut of one of the standard SEC settlements' but I could easily be wrong.

EDITED TO ADD: On re-reading the letter, I'm not clear where the word 'criminal' even came from. The WaPo article claims

These agreements threatened employees with criminal prosecutions if they reported violations of law to federal authorities under trade secret laws, Kohn said.

but the letter does not contain the word 'criminal', its allegations are:

- Non-disparagement clauses that failed to exempt disclosures of securities violations to the SEC;
- Requiring prior consent from the company to disclose confidential information to federal authorities;
- Confidentiality requirements with respect to agreements, that themselves contain securities violations;
- Requiring employees to waive compensation that was intended by Congress to incentivize reporting and provide financial relief to whistleblowers.

↑ comment by Dagon · 2024-07-16T16:32:29.437Z · LW(p) · GW(p)

Non-communication of problems enforced by significant legal penalties feels like it's part of the same underlying problem, though I agree that "nondisparagement" to the public or press is far less heinous than "non-reporting of crimes."

It's unclear whether OpenAI, a non-public company, has actually done things which would be covered by whistleblower laws or compensation for talking to a federal agency. But it's highly suspicious (and per Matt Levine, likely penalizable if under SEC purview) to try to prevent such reporting.

↑ comment by Vladimir_Nesov · 2024-07-15T19:41:25.571Z · LW(p) · GW(p)

(The tweet includes a screenshot from The Washington Post article "OpenAI illegally barred staff from airing safety risks, whistleblowers say" that references a letter to SEC.)

Edit: This was in response to the original version of the above comment that only linked to the tweet without other links or elaboration.

comment by Zach Stein-Perlman · 2024-07-23T13:00:59.075Z · LW(p) · GW(p)

New OpenAI tweet "on how we’re prioritizing safety in our work." I'm annoyed.

We believe that frontier AI models can greatly benefit society. To help ensure our readiness, our Preparedness Framework helps evaluate and protect against the risks posed by increasingly powerful models. We won’t release a new model if it crosses a “medium” risk threshold until we implement sufficient safety interventions. https://openai.com/preparedness/

This seems false: per the Preparedness Framework, nothing happens when they cross their "medium" threshold; they meant to say "high." Presumably this is just a mistake, but it's a pretty important one, and they said the same false thing in a May blogpost (!). (Indeed, GPT-4o may have reached "medium" — they were supposed to say how it scored in each category, but they didn't, and instead said "GPT-4o does not score above Medium risk in any of these categories.")

(Reminder: the "high" thresholds sound quite scary; here's cybersecurity (not cherrypicked, it's the first they list): "Tool-augmented model can identify and develop proofs-of-concept for high-value exploits against hardened targets without human intervention, potentially involving novel exploitation techniques, OR provided with a detailed strategy, the model can end-to-end execute cyber operations involving the above tasks without human intervention." They can deploy models just below the "high" threshold with no mitigations. (Not to mention the other issues with the Preparedness Framework.))

We are developing levels to help us and stakeholders categorize and track AI progress. This is a work in progress and we'll share more soon.

Shrug. This isn't bad but it's not a priority and it's slightly annoying they don't mention more important things.

In May our Board of Directors launched a new Safety and Security committee to evaluate and further develop safety and security recommendations for OpenAI projects and operations. The committee includes leading cybersecurity expert, retired U.S. Army General Paul Nakasone. This review is underway and we’ll share more on the steps we’ll be taking after it concludes. https://openai.com/index/openai-board-forms-safety-and-security-committee/

I have epsilon confidence in both the board's ability to do this well if it wanted (since it doesn't include any AI safety experts) (except on security) and in the board's inclination to exert much power if it should (given the history of the board and Altman).

Our whistleblower policy protects employees’ rights to make protected disclosures. We also believe rigorous debate about this technology is important and have made changes to our departure process to remove non-disparagement terms.

Not doing nondisparagement-clause-by-default is good. Beyond that, I'm skeptical, given past attempts to chill employee dissent (the nondisparagement thing, Altman telling the board's staff liaison to not talk to employees or tell him about those conversations, maybe recent antiwhistleblowing news) and lies about that. (I don't know of great ways to rebuild trust; some mechanisms would work but are unrealistically ambitious.)

Safety has always been central to our work, from aligning model behavior to monitoring for abuse, and we’re investing even further as we develop more capable models.
https://openai.com/index/openai-safety-update/

This is from May. It's mostly not about x-risk, and the x-risk-relevant stuff is mostly non-substantive, except the part about the Preparedness Framework, which is crucially wrong.

Replies from: aysja

↑ comment by aysja · 2024-07-23T20:57:59.072Z · LW(p) · GW(p)

Maybe I'm missing the relevant bits, but afaict their preparedness doc says that they won't deploy a model if it passes the "medium" threshold, eg:

Only models with a post-mitigation score of "medium" or below can be deployed. In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place.

The threshold for further developing is set to "high," though. I.e., they can further develop so long as models don't hit the "critical" threshold.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-07-23T21:10:32.168Z · LW(p) · GW(p)

I think you're confusing medium-threshold with medium-zone (the zone from medium-threshold to just-below-high-threshold). Maybe OpenAI made this mistake too — it's the most plausible honest explanation. (They should really do better.) (I doubt they intentionally lied, because it's low-upside and so easy to catch, but the mistake is weird.)

Based on the PF, they can deploy a model just below the "high" threshold without mitigations. Based on the tweet and blogpost:

We won’t release a new model if it crosses a “medium” risk threshold until we implement sufficient safety interventions.

This just seems clearly inconsistent with the PF (should say crosses out of medium zone by crossing a "high" threshold).

We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.

This doesn't make sense: if you cross a "medium" threshold you enter medium-zone. Per the PF, the mitigations just need to bring you out of high-zone and down to medium-zone.

(Sidenote: the tweet and blogpost incorrectly suggest that the "medium" thresholds matter for anything; based on the PF, only the "high" and "critical" thresholds matter (like, there are three ways to treat models: below high or between high and critical or above critical).)
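A minimal sketch of this reading of the PF, assuming the overall level is the highest post-mitigation level in any category and that only the "high" and "critical" thresholds change what's allowed. The names `Risk`, `overall`, and `treatment` and the category key `other_category` are illustrative, not anything from OpenAI.

```python
from enum import IntEnum


class Risk(IntEnum):
    """Risk levels in increasing order, as named in the PF."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def overall(scores: dict[str, Risk]) -> Risk:
    """Overall level is the highest level in any tracked category."""
    return max(scores.values())


def treatment(post_mitigation: dict[str, Risk]) -> str:
    """Map post-mitigation scores to one of the three treatments."""
    level = overall(post_mitigation)
    if level < Risk.HIGH:
        # Low and Medium are treated identically: the "medium" threshold does no work.
        return "may deploy"
    if level < Risk.CRITICAL:
        return "may develop further, may not deploy"
    return "may not develop further"


# Hypothetical scores, for illustration only.
print(treatment({"cybersecurity": Risk.MEDIUM, "other_category": Risk.LOW}))
print(treatment({"cybersecurity": Risk.HIGH, "other_category": Risk.LOW}))
```

On this reading, a model that scores Medium everywhere gets the same treatment as one that scores Low everywhere, so "crosses a medium threshold" can't be what triggers mitigations.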

[edited repeatedly]

Replies from: aysja

↑ comment by aysja · 2024-07-23T21:46:43.307Z · LW(p) · GW(p)

I agree that scoring "medium" seems like it would imply crossing into the medium zone, although I think what they actually mean is "at most medium." The full quote (from above) says:

In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place for the relevant postmitigation risk level to be back at most to “medium” level.

I.e., I think what they’re trying to say is that they have different categories of evals, each of which might pass different thresholds of risk. If any of those are “high,” then they're in the "medium zone" and they can’t deploy. But if they’re all medium, then they're in the "below medium zone" and they can. This is my current interpretation, although I agree it’s fairly confusing and it seems like they could (and should) be more clear about it.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-07-23T21:53:55.802Z · LW(p) · GW(p)

Surely if any categories are above the "high" threshold then they're in "high zone" and if all are below the "high" threshold then they're in "medium zone."

And regardless the reading you describe here seems inconsistent with

We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.

[edited]

Added later: I think someone else had a similar reading and it turned out they were reading "crosses a medium risk threshold" as "crosses a high risk threshold" and that's just [not reasonable / too charitable].

comment by Zach Stein-Perlman · 2025-04-11T19:10:13.416Z · LW(p) · GW(p)

OpenAI slashes AI model safety testing time, FT reports. This is consistent with lots of past evidence about OpenAI's evals for dangerous capabilities being rushed, being done on weak checkpoints, and having worse elicitation than OpenAI has committed to.

This is bad because OpenAI is breaking its commitments (and isn't taking safety stuff seriously and is being deceptive about its practices). It's also kinda bad in terms of misuse risk, since OpenAI might fail to notice that its models have dangerous capabilities. I'm not saying OpenAI should delay deployments for evals: there may be strategies that are better (similar misuse-risk-reduction with less cost-to-the-company) than detailed evals for dangerous capabilities before external deployment, where you generally do slow/expensive evals after your model is done (even if you want to deploy externally before finishing evals) and have a safety buffer and increase the sensitivity of your filters early in deployment (when you're less certain about risk). But OpenAI isn't doing that; it's just doing a bad job of the "evals before external deployment" plan.

(Regardless, maybe short-term misuse isn't so scary, and maybe short-term misuse risk comes more from open-weights or stolen models than from models that can be undeployed/mitigated if misuse risks appear during deployment. And insofar as you're more worried about risks from internal deployment, maybe you should focus on evals and mitigations relevant to those threats rather than external deployment. (OpenAI's doing even worse on risks from internal deployment!))

tl;dr: OpenAI is doing risk assessment poorly^[1] but maybe "do detailed evals for dangerous capabilities before external deployment" isn't a great ask.

^[1] But similar to DeepMind and Anthropic, and those three are better than any other AI companies.

Replies from: peter_hurford, o-o, william-walshe

↑ comment by Peter Wildeford (peter_hurford) · 2025-04-12T12:10:26.933Z · LW(p) · GW(p)

What do you think of the counterargument that OpenAI announced o3 in December and publicly solicited external safety testing then, and isn't deploying until ~4 months later?

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2025-04-12T15:25:08.605Z · LW(p) · GW(p)

I don't know. I don't have a good explanation for why OpenAI hasn't released o3. Delaying to do lots of risk assessment would be confusing because they did little risk assessment for other models.

↑ comment by O O (o-o) · 2025-04-11T20:57:40.500Z · LW(p) · GW(p)

https://www.windowscentral.com/software-apps/sam-altman-ai-will-make-coders-10x-more-productive-not-replace-them

It sounds like they’re getting pretty bearish on capabilities tho

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2025-04-12T14:15:44.195Z · LW(p) · GW(p)

Orrr he's telling comforting lies to tread the fine line between billion-dollar hype and nationalization-worthy panic.

Could realistically be either, but it's probably the comforting-lies thing. Whatever the ground-truth reality may be, the AGI labs are not bearish.

Replies from: o-o

↑ comment by O O (o-o) · 2025-04-14T04:07:21.788Z · LW(p) · GW(p)

I mean some hard evidence is them currently hiring a lot of software engineers for random product-y things. If AGI was close, wouldn't they go all in on research and training?

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2025-04-14T04:40:47.587Z · LW(p) · GW(p)

Interesting. Source? Last I heard, they're not hiring anyone because they expect SWE to be automated soon.

Replies from: o-o

↑ comment by O O (o-o) · 2025-04-14T16:27:31.144Z · LW(p) · GW(p)

I recently interviewed with them, and one of them said they’re hiring a lot of SWEs as they shift to product. Also many of my friends are currently interviewing with them.

↑ comment by kilgoar (william-walshe) · 2025-04-12T14:01:31.647Z · LW(p) · GW(p)

This post is first and foremost an exercise in the danger of nesting parentheticals. The conventional view is that a parenthesis is an aside or a parallel thesis to the sentence. I think you can easily get away with eliminating all of them as your ideas are forming a coherent whole.

As for the topic at hand, you are spot on that LLMs are less scary than ever, and the risks posed by a poorly designed language model are limited to legal and reputational problems for the businesses training and deploying them. There is no sense in hoping for effective self-policing. We so often see the chemical industry choosing disaster over even minimal safety, even when the costs to them are much higher in the long run. Large corporations are increasingly chaired by a revolving door of finance people who optimize for short term gains and personal rewards, rather than the engineer-CEOs of last century who remained in their position for decades. As long as the leadership in these companies care about the long term viability of their products and assets, the situation is relatively stable. However, that is altogether the exception rather than the rule.

comment by Zach Stein-Perlman · 2024-05-22T23:00:07.719Z · LW(p) · GW(p)

New Kelsey Piper article and twitter thread on OpenAI equity & non-disparagement.

It has lots of little things that make OpenAI look bad. It further confirms that OpenAI threatened to revoke equity unless employees signed the non-disparagement agreements. Plus it shows Altman's signature on documents giving the company broad power over employees' equity; perhaps he doesn't read every document he signs, but this one seems quite important. This is all in tension with Altman's recent tweet that "vested equity is vested equity, full stop" and "i did not know this was happening." Plus "we have never clawed back anyone's vested equity, nor will we do that if people do not sign a separation agreement (or don't agree to a non-disparagement agreement)" is misleading given that they apparently regularly threatened to do so (or something equivalent: let the employee nominally keep their PPUs but disallow them from selling them) whenever an employee left.

Great news:

OpenAI told me that “we are identifying and reaching out to former employees who signed a standard exit agreement to make it clear that OpenAI has not and will not cancel their vested equity and releases them from nondisparagement obligations”

(Unless "employees who signed a standard exit agreement" is doing a lot of work — maybe a substantial number of employees technically signed nonstandard agreements.)

I hope to soon hear from various people that they have been freed from their nondisparagement obligations.

Update: OpenAI says:

As we shared with employees today, we are making important updates to our departure process. We have not and never will take away vested equity, even when people didn't sign the departure documents. We're removing nondisparagement clauses from our standard departure paperwork, and we're releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual. We'll communicate this message to former employees. We're incredibly sorry that we're only changing this language now; it doesn't reflect our values or the company we want to be.

[Low-effort post; might have missed something important.]

[Substantively edited after posting.]

Replies from: dsj, D0TheMath, Dagon, D0TheMath

↑ comment by dsj · 2024-05-22T23:27:57.520Z · LW(p) · GW(p)

(Unless "employees who signed a standard exit agreement" is doing a lot of work — maybe a substantial number of employees technically signed nonstandard agreements.)

Yeah, what about employees who refused to sign? Have we gotten any clarification on their situation?

↑ comment by Garrett Baker (D0TheMath) · 2024-05-24T02:07:38.087Z · LW(p) · GW(p)

I quote Gwern:

Note that it says nothing about being allowed to participate in tenders, nothing about the clause where OA can repurchase your PPUs at any time at 'fair market value' (not canceled at $0), nothing about what those 'other documents' might be, nothing about Anthropic founders...

↑ comment by Dagon · 2024-05-23T15:14:21.900Z · LW(p) · GW(p)

I haven't followed closely - from outside, it seems like pretty standard big-growth-tech behavior. One thing to keep in mind is that "vested equity" is pretty inviolable. These are grants that have been fully earned and delivered to the employee, and are theirs forever. It's the "unvested" or "semi-vested" equity that's usually in question - these are shares that are conditionally promised to employees, which will vest at some specified time or event - usually some combination of time in good standing and liquidity events (for a non-public company).

It's quite possible (and VERY common) that employees who leave are offered "accelerated vesting" on some of their granted-but-not-vested shares in exchange for signing agreements and making things easy for the company. I don't know if that's what OpenAI is doing, but I'd be shocked if they somehow took away any vested shares from departing employees.

It would be pretty sketchy to consider unvested grants to be part of one's net worth - certainly banks won't lend on it. Vested shares are just shares, they're yours like any other asset.

Replies from: Linch

↑ comment by Linch · 2024-05-23T18:37:13.981Z · LW(p) · GW(p)

I don't know if that's what OpenAI is doing, but I'd be shocked if they somehow took away any vested shares from departing employees.

Consider yourself shocked.

Replies from: Dagon

↑ comment by Dagon · 2024-05-24T14:07:08.735Z · LW(p) · GW(p)

Trying to figure out how to update. From the downvotes and comments, I'm clearly considered wrong, but I can't easily find details on how. Is the statement "We have not and never will take away vested equity" a flat-out lie? I'd expected it was relying heavily on the word "vested", and what they took away was something non-vested.

Is there a simple link to a specific legal description of what assets a non-signer was entitled to, but lost due to declining to sign?

Edit: Zvi recently linked to OpenAI NDAs: Leaked documents reveal aggressive tactics toward former employees - Vox, which does have pretty compelling references that my assumptions were wrong, that the denial was a verifiably false statement, and they did, in fact, credibly threaten to take back vested equity. I've checked my equity in past (private, so not exercisable unless they have a liquidity event) and current (public, so exercisable immediately on vest) employers, and this doesn't seem possible for them. OpenAI is an outlier in defining their equity that way (such that "vested" is contingent).

↑ comment by Garrett Baker (D0TheMath) · 2024-05-22T23:24:53.301Z · LW(p) · GW(p)

OpenAI told me that “we are identifying and reaching out to former employees who signed a standard exit agreement to make it clear that OpenAI has not and will not cancel their vested equity and releases them from nondisparagement obligations”

They could be lying about this.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-05-22T23:34:02.833Z · LW(p) · GW(p)

We know various people who've left OpenAI and might criticize it if they could. Either most of them will soon say they're free or we can infer that OpenAI was lying/misleading.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-05-23T04:51:28.112Z · LW(p) · GW(p)

Now OpenAI publicly said "we're releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual." This seems to be self-effecting; by saying it, OpenAI made it true.

Hooray!

Replies from: bec-hawk, Viliam, JamesPayor

↑ comment by Rebecca (bec-hawk) · 2024-05-23T11:51:54.551Z · LW(p) · GW(p)

unless the nondisparagement provision was mutual

This could be true for most cases though

↑ comment by Viliam · 2024-05-23T06:33:49.021Z · LW(p) · GW(p)

I am not a lawyer -- is that legally binding?

That is, if someone signed the (standard or non-standard) agreement, and OpenAI says this, but later they decide to sue the employee anyway... what exactly will happen?

(I am also suspicious about the "reaching out to former employees" part, because if the new negotiation is made in private, another trick might be involved, like maybe they are released from the old agreement, but they have to sign a new one...?)

↑ comment by James Payor (JamesPayor) · 2024-05-23T14:55:27.143Z · LW(p) · GW(p)

So I'm guessing this covers like 2-4 recent departures, and not Paul, Dario, or the others that split earlier

comment by Zach Stein-Perlman · 2024-08-20T19:30:14.057Z · LW(p) · GW(p)

Edit, 2.5 days later: I think this list is fine but sharing/publishing it was a poor use of everyone's attention. Oops.

Asks for Anthropic

Note: I think Anthropic is the best frontier AI lab on safety. I wrote up asks for Anthropic because it's most likely to listen to me. A list of asks for any other lab would include most of these things plus lots more. This list was originally supposed to be more part of my "help labs improve" project than my "hold labs accountable" crusade.

Numbering is just for ease of reference.

1. RSP: Anthropic should strengthen/clarify the ASL-3 mitigations, or define ASL-4 such that the threshold is not much above ASL-3 but the mitigations are much stronger. I'm not sure where the lowest-hanging mitigation-fruit is, except that it includes control.

2. Control: Anthropic (like all labs) should use control mitigations and control evaluations [AF · GW] to reduce risks from AIs scheming, including escape during internal deployment [AF · GW].

3. External model auditing for risk assessment: Anthropic (like all labs) should let auditors like METR, UK AISI, and US AISI audit its models if they want to — Anthropic should offer them good access pre-deployment and let them publish their findings or flag if they're concerned. (Anthropic shared some access with UK AISI before deploying Claude 3.5 Sonnet, but it doesn't seem to have been deep.) (Anthropic has said that sharing with external auditors is hard or costly. It's not clear why, for just sharing normal API access + helpful-only access + control over inference-time safety features, without high-touch support.)

4. Policy advocacy (this is murky, and maybe driven by disagreements-on-the-merits and thus intractable): Anthropic (like all labs) should stop advocating against good policy and ideally should advocate for good policy. Maybe it should also be more transparent about policy advocacy. [It's hard to make precise what I believe is optimal and what I believe is unreasonable, but at the least I think Dario is clearly too bullish on self-governance, and Jack Clark is clearly too anti-regulation, and all of this would be OK if it was balanced out by some public statements or policy advocacy that's more pro-real-regulation but as far as I can tell it's not. Not justified here but I predict almost all of my friends would agree if they looked into it for an hour.]

5a. Security: Anthropic (like all labs) should ideally implement RAND SL4 for model weights and code when reaching ASL-3. I think that's unrealistic, but lesser security improvements would also be good. (Anthropic said in May 2024 that 8% of staff work in security-related areas. I think this is pretty good. I think on current margins Anthropic could still turn money into better security reasonably effectively, and should do so.)

5b. Anthropic (like all labs) should be more transparent about the quality of its security. Anthropic should publish the private reports on https://trust.anthropic.com/, redacted as appropriate. It should commit to publish information on future security incidents and should publish information on all security incidents from the last year or two.

6. Anthropic (like all labs) should facilitate employees publicly flagging false statements or violated processes.

7. Anthropic takes credit for its Long-Term Benefit Trust but Anthropic hasn't [LW · GW] published [LW · GW] enough to show that it's effective. Anthropic should publish the Trust Agreement, clarify the ambiguities discussed in the linked posts, and make accountability-y commitments like "if major changes happen to the LTBT we'll quickly tell the public."

8. Anthropic should avoid exaggerating interpretability research or causing observers to have excessively optimistic impressions of Anthropic’s interpretability research. (See e.g. Stephen Casper [LW · GW].)

9. Maybe Anthropic (like all labs) should make safety cases for its models or deployments, especially after the simple "no dangerous capabilities" safety case doesn't work anymore, and publish them (or maybe just share with external auditors).

9.5. Anthropic should clarify a few confusing RSP things, including (a) the deal with substantially raising the ARA bar for ASL-3, and moreover deciding the old threshold is a "yellow line" and not creating a new threshold, and doing so without officially updating the RSP (and quietly); and (b) when the "every 3 months" trigger for RSP evals is active. I haven't tried hard to get to the bottom of these.

Minor stuff:

10. Anthropic (like all labs) should fully release everyone from nondisparagement agreements and not use nondisparagement agreements in the future.

11. Anthropic should commit to publish updates on risk assessment practices and results, including low-level details, perhaps for all major model releases and at least quarterly or so. (Anthropic says its Responsible Scaling Officer does this internally. Anthropic publishes model cards and has published one Responsible Scaling Policy Evaluations Report.)

12. Anthropic should confirm that its old policy, "don't meaningfully advance the frontier with a public launch," has been replaced by the RSP, if that's true, and otherwise clarify Anthropic's policy.

Done! ~~13. Anthropic committed to establish a bug bounty program (for model issues) or similar, over a year ago. Anthropic hasn't; it is the only frontier lab without a bug bounty program (although others don't necessarily comply with the commitment, e.g. OpenAI's excludes model issues). It should do this or talk about its plans.~~

14. [Anthropic should clarify its security commitments; I expect it will in its forthcoming RSP update.]

15. [Maybe Anthropic (like all labs) should better boost external safety research [LW · GW], especially by giving more external researchers deep model access (e.g. fine-tuning or helpful-only). I hear this might be costly but I don't really understand why.]

16. [Probably Anthropic (like all labs) should encourage staff members to talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic, as long as they (1) clarify that they're not speaking for Anthropic and (2) don't share secrets.]

17. [Maybe Anthropic (like all labs) should talk about its views on AI progress and risk. At the least, probably Anthropic (like all labs) should clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it.]

18. [Most of my peers say: Anthropic (like all labs) should publish info like training compute and #parameters for each model. I'm inside-view agnostic on this.]

19. [Maybe Anthropic could cheaply improve its model evals for dangerous capabilities or share more information about them. Specific asks/recommendations TBD. As Anthropic notes, its CBRN eval is kinda bad and its elicitation is kinda bad (and it doesn't share enough info for us to evaluate its elicitation from the outside).]

I shared this list—except 9.5 and 19, which are new—with @Zac Hatfield-Dodds [LW · GW] two weeks ago.

You are encouraged to comment with other asks for Anthropic. (Or things Anthropic does very well, if you feel so moved.)

Replies from: zac-hatfield-dodds, thomas-larsen, Raemon

↑ comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-08-20T22:03:19.260Z · LW(p) · GW(p)

I think both Zach and I care about labs doing good things on safety, communicating that clearly, and helping people understand both what labs are doing and the range of views on what they should be doing. I shared Zach's doc with some colleagues, but won’t try for a point-by-point response. Two high-level responses:

First, at a meta level, you say:

[Probably Anthropic (like all labs) should encourage staff members to talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic, as long as they (1) clarify that they're not speaking for Anthropic and (2) don't share secrets.]

I do feel welcome to talk about my views on this basis, and often do so with friends and family, at public events, and sometimes even in writing on the internet (hi!). However, it takes way more effort than you might think to avoid inaccurate or misleading statements while also maintaining confidentiality. Public writing tends to be higher-stakes due to the much larger audience and durability, so I routinely run comments past several colleagues before posting, and often redraft in response (including these comments and this very point!).

My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.

Imagine, if you will, trying to hold a long conversation about AI risk - but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public and for all that you get pretty regular hostility anyway because of… well, there are plenty of misunderstandings and caricatures to go around.

I run intro-to-AGI-safety courses for Anthropic employees (based on AGI-SF), and we draw a clear distinction between public and confidential resources specifically so that people can go talk to family and friends and anyone else they wish about the public information we cover.

Second, and more concretely, many of these asks are unimplementable for various reasons, and often gesture in a direction without giving reasons to think that there’s a better tradeoff available than we’re already making. Some quick examples:

- Both AI Control and safety cases are research areas less than a year old; we’re investigating them and e.g. hiring safety-case specialists, but best practices we could implement don’t exist yet. Similarly, there simply aren’t any auditors or audit standards for AI safety yet (see e.g. METR’s statement [AF · GW]); we’re working to make this possible but the thing you’re asking for just doesn’t exist yet. Some implementation questions that “let auditors audit our models” glosses over:
  - If you have dozens of organizations asking to be auditors, and none of them are officially auditors yet, what criteria do you use to decide who you collaborate with?
  - What kind of pre-deployment model access would you provide? If it’s helpful-only or other nonpublic access, do they meet our security bar to avoid leaking privileged API keys? (We’ve already seen unauthorized sharing or compromise lead to serious abuse.)
  - How do you decide who gets to say what about the testing? What if they have very different priorities than you and think that a different level of risk or a different kind of harm is unacceptable?
- I strongly support Anthropic’s nondisclosure of information about pretraining. I have never seen a compelling argument that publishing this kind of information is, on net, beneficial for safety.
- There are many cases where I’d be happy if Anthropic shared more about what we’re doing and what we’re thinking about. Some of the things you’re asking about I think we’ve already said, e.g. for [7] LTBT changes would require an RSP update, and for [17] our RSP requires us to “enforce an acceptable use policy [against …] using the model to generate content that could cause severe risks to the continued existence of humankind”.

So, saying “do more X” just isn’t that useful; we’ve generally thought about it and concluded that the current amount of X is our best available tradeoff at the moment. For many more of the other asks above, I just disagree with implicit or explicit claims about the facts in question. Even for the communication issues where I’d celebrate us sharing more—and for some I expect we will—doing so is yet another demand on heavily loaded people and teams, and it can take longer than we’d like to find the time.

Replies from: davekasten, aysja, akash-wasil, adam_scholl, T3t

↑ comment by davekasten · 2024-08-20T22:27:00.443Z · LW(p) · GW(p)

I just want to note that people who've never worked in a true high-confidentiality environment (professional services, national defense, professional services for national defense) probably radically underestimate the level of brain damage and friction that Zac is describing here:

"Imagine, if you will, trying to hold a long conversation about AI risk - but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public and for all that you get pretty regular hostility anyway because of… well, there are plenty of misunderstandings and caricatures to go around."

Confidentiality is really, really hard to maintain. Doing so while also engaging the public is terrifying. I really admire the frontier labs folks who try to engage publicly despite that quite severe constraint, and really worry a lot as a policy guy about the incentives we're creating to make that even less likely in the future.

↑ comment by aysja · 2024-08-22T20:41:14.184Z · LW(p) · GW(p)

I'm sympathetic to how this process might be exhausting, but at an institutional level I think Anthropic (and all labs) owe humanity a much clearer image of how they would approach a potentially serious and dangerous situation with their models. Especially so, given that the RSP is fairly silent on this point, leaving the response to evaluations up to the discretion of Anthropic. In other words, the reason I want to hear more from employees is in part because I don't know what the decision process inside of Anthropic will look like if an evaluation indicates something like "yeah, it's excellent at inserting backdoors, and also, the vibe is that it's overall pretty capable." And given that Anthropic is making these decisions on behalf of everyone, Anthropic (like all labs) really owes it to humanity to be more upfront about how it'll make these decisions (imo).

I will also note what I feel is a somewhat concerning trend. It's happened many times now that I've critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: "this wouldn't seem so bad if you knew what was happening behind the scenes." They of course cannot tell me what the "behind the scenes" information is, so I have no way of knowing whether that's true. And, maybe I would in fact update positively about Anthropic if I knew. But I do think the shape of "we're doing something which might be incredibly dangerous, many external bits of evidence point to us not taking the safety of this endeavor seriously, but actually you should think we are based on me telling you we are" is pretty sketchy.

Replies from: akash-wasil

↑ comment by Orpheus16 (akash-wasil) · 2024-08-23T00:21:38.756Z · LW(p) · GW(p)

I will also note what I feel is a somewhat concerning trend. It's happened many times now that I've critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: "this wouldn't seem so bad if you knew what was happening behind the scenes."

I just wanted to +1 that I am also concerned about this trend, and I view it as one of the things that I think has pushed me (as well as many others in the community) to lose a lot of faith in corporate governance (especially of the "look, we can't make any tangible commitments but you should just trust us to do what's right" variety) and instead look to governments to get things under control.

I don't think Anthropic is solely to blame for this trend, of course, but I think Anthropic has performed less well on comms/policy than I [and IMO many others] would've predicted if you had asked me [or us] in 2022.

↑ comment by Orpheus16 (akash-wasil) · 2024-08-21T23:01:39.492Z · LW(p) · GW(p)

@Zac Hatfield-Dodds [LW · GW] do you have any thoughts on official comms from Anthropic and Anthropic's policy team?

For example, I'm curious if you have thoughts on this anecdote [LW(p) · GW(p)]– Jack Clark was asked an open-ended question by Senator Cory Booker and he told policymakers that his top policy priority was getting the government to deploy AI successfully. There was no mention of AGI, existential risks, misalignment risks, or anything along those lines, even though it would've been (IMO) entirely appropriate for him to bring such concerns up in response to such an open-ended question.

I was left thinking that either Jack does not care much about misalignment risks or he was not being particularly honest/transparent with policymakers. Both of these raise some concerns for me.

(Noting that I hold Anthropic's comms and policy teams to higher standards than individual employees. I don't have particularly strong takes on what Anthropic employees should be doing in their personal capacity– like in general I'm pretty in favor of transparency, but I get it, it's hard and there's a lot that you have to do. Whereas the comms and policy teams are explicitly hired/paid/empowered to do comms and policy, so I feel like it's fair to have higher expectations of them.)

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-08-21T23:05:59.866Z · LW(p) · GW(p)

Source: Hill & Valley Forum on AI Security (May 2024):

https://www.youtube.com/live/RqxE3ub7wWA?t=13338s:

very powerful systems [] may have national security uses or misuses. And for that I think we need to come up with tests that make sure that we don’t put technologies into the market which could—unwittingly to us—advantage someone or allow some nonstate actor to commit something harmful. Beyond that I think we can mostly rely on existing regulations and law and existing testing procedures . . . and we don’t need to create some entirely new infrastructure.

https://www.youtube.com/live/RqxE3ub7wWA?t=13551

At Anthropic we discover that the more ways we find to use this technology the more ways we find it could help us. And you also need a testing and measurement regime that closely looks at whether the technology is working—and if it’s not how you fix it from a technological level, and if it continues to not work whether you need some additional regulation—but . . . I think the greatest risk is us [viz. America] not using it [viz. AI]. Private industry is making itself faster and smarter by experimenting with this technology . . . and I think if we fail to do that at the level of the nation, some other entrepreneurial nation will succeed here.

↑ comment by Adam Scholl (adam_scholl) · 2024-08-23T02:11:55.698Z · LW(p) · GW(p)

My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.

My guess is that this seems so stressful mostly because Anthropic’s plan is in fact so hard to defend, due to making little sense. Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by doing periodic vibe checks to see whether their staff feel sketched out yet. I think a plan this shoddy obviously endangers life on Earth, so it seems unsurprising (and good) that people might sometimes strongly object; if Anthropic had more reassuring things to say, I’m guessing it would feel less stressful to try to reassure them.

Replies from: Raemon, Josephm

↑ comment by Raemon · 2024-08-24T18:24:02.908Z · LW(p) · GW(p)

Meta aside: normally this wouldn't seem worth digging into but as a moderator/site-culture-guardian, I feel compelled to justify my negative react on the disagree votes.

I'm actually not entirely sure what downvote-reacting is for. Habryka has said the intent is to override inappropriate uses of reacts. We haven't actually really had a sit-down-and-argue-this-out on the moderator team. I'm pretty sure we haven't told anyone, or tried to enforce, that "override inappropriate use of reacts" is the intended use.

I think Adam's line:

Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by doing periodic vibe checks to see whether their staff feel sketched out yet.

Is psychologizing and summarizing Anthropic unfairly. So I wouldn't agree vote with it. I do think it has some kind of grain of truth to it (me believing this is also kind of "doubting the experience of Anthropic employees" which is also group-epistemologically dicey IMO, but, feels kinda important enough to do in this case). The claim isn't true... but I also don't belief report that it's not true.

I initially downvoted the Disagree when it was just Noosphere, since I didn't think Noosphere was really in a position to have an opinion and if he was the only reactor it felt more like noise. A few others who are more positioned to know relevant stuff have since added their own disagree reacts. I... feel sort of justified leaving the anti-react up, with an overall indicator of "a bunch of people disagree with this, but the weight of that disagreement is slightly reduced." (I think I'd remove the anti-react if the disagree count went much lower than it is now).

I don't know whether I particularly endorse any of this, but wanted people to have a bit more model of what one site-admin was thinking here.

[/end of rambly meta commentary]

Replies from: adam_scholl

↑ comment by Adam Scholl (adam_scholl) · 2024-08-24T20:52:02.241Z · LW(p) · GW(p)

What seemed psychologizing/unfair to you, Raemon? I think it was probably unnecessarily rude/a mistake to try to summarize Anthropic’s whole RSP in a sentence, given that the inferential distance here is obviously large. But I do think the sentence was fair.

As I understand it, Anthropic’s plan for detecting threats is mostly based on red-teaming (i.e., asking the models to do things to gain evidence about whether they can). But nobody understands the models well enough to check for the actual concerning properties themselves, so red teamers instead check for distant proxies, or properties that seem plausibly like precursors. (E.g., “ability to search filesystems for passwords” as a partial proxy for “ability to autonomously self-replicate,” since maybe the former is a prerequisite for the latter.)

But notice that this activity does not involve directly measuring the concerning behavior. Rather, it measures something more like “the amount the model strikes the evaluators as broadly sketchy-seeming/suggestive that it might be capable of doing other bad stuff.” And the RSP’s description of Anthropic’s planned responses to these triggers is so chock full of weasel words, caveats, and vague language that I think it barely constrains their response at all.

So in practice, I think both Anthropic’s plan for detecting threats, and for deciding how to respond, fundamentally hinge on wildly subjective judgment calls, based on broad, high-level, gestalt-ish impressions of how these systems seem likely to behave. I grant that this process is more involved than the typical thing people describe as a “vibe check,” but I do think it’s basically the same epistemic process, and I expect it will generate conclusions about as sound.

Replies from: abandon, Raemon

↑ comment by dirk (abandon) · 2024-08-24T22:10:22.942Z · LW(p) · GW(p)

I don't really think any of that affects the difficulty of public communication; your implication that it must be the cause reads to me more like an insult than a well-considered psychological model.

Replies from: None

↑ comment by [deleted] · 2024-08-25T07:06:18.927Z · LW(p) · GW(p)

I don't really think any of that affects the difficulty of public communication

The basic point would be that it's hard to write publicly about how you are taking responsible steps that grapple directly with the real issues... if you are not in fact doing those responsible things in the first place. This seems locally valid to me; you may disagree on the object level about whether Adam Scholl's characterization [LW(p) · GW(p)] of Anthropic's agenda/internal work is correct, but if it is, then it would certainly affect the difficulty of public communication to such an extent that it might well become the primary factor that needs to be discussed in this matter.

Indeed, the suggestion [LW(p) · GW(p)] is for Anthropic employees to "talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic" and the counterargument [LW(p) · GW(p)] is that doing so would be nice in an ideal world, except it's very psychologically exhausting because every public statement you make is likely to get maliciously interpreted by those who will use it to argue that your company is irresponsible. In this situation, there is a straightforward direct correlation between the difficulty of public communication and the likelihood that your statements will get you and your company in trouble.

But the more responsible you are in your actual work, the more responsible-looking details you will be able to bring up in conversations with others when you discuss said work. AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;^[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place. After all, as Paul Graham often says, "If you want to convince people of something, it's much easier if it's true."

As I see it, not being able to bring up Anthropic's work/views on this matter without some AI safety person successfully making it seem like Anthropic is behaving badly is rather strong Bayesian evidence that Anthropic is in fact behaving badly. So this entire discussion, far from being an insult, seems directly on point to the topic at hand, and locally valid to boot (although not necessarily sound, as that depends on an individualized assessment of the particular object-level claims about the usefulness of the company's safety team).

^[1]
Quite the opposite, actually, if the change in the wider society's opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.

Replies from: abandon

↑ comment by dirk (abandon) · 2024-08-25T15:25:21.509Z · LW(p) · GW(p)

I think communication as careful as it must be to maintain the confidentiality distinction here is always difficult in the manner described, and that communication to large quantities of people will ~always result in someone running with an insane misinterpretation of what was said.

Replies from: Benito

↑ comment by Ben Pace (Benito) · 2024-08-25T19:03:23.959Z · LW(p) · GW(p)

I understand that this confidentiality point might seem to you like the end of the fault analysis, but have you considered the hypothesis that Anthropic leadership has set such stringent confidentiality policies in part to make it hard for Zac to engage in public discourse?

Look, I don't think Anthropic leadership is just trying to keep their training skills private or their models secure. Their company does not merely keep trade secrets. When I speak to staff from this company about issues with their 'Responsible Scaling Policies', they say that they want to tell me more information about how they think it can be better or how they think it might change, but cannot due to confidentiality constraints. That's their safety policies, not information about their training policies that they want to keep secret so that they can make money.

I believe the Anthropic leadership cares very little about the public's ability to have arguments and evidence and access to information about Anthropic's behavior. The leadership roughly ~never shows up to engage with critical discourse about itself, unless there's a potential major embarrassment [LW(p) · GW(p)]. There is no regular Q&A session with the leadership of a company who believes their own product poses a 10-25% chance of existential catastrophe, no comment section on their website, no official twitter accounts that regularly engage with and share info with critics, no debates with the many people who outright oppose their actions.

No, they go far in the other direction of committing to no-public-discourse. I challenge any Anthropic staffer to openly deny that there is a mutual non-disparagement agreement between Anthropic and OpenAI leadership, whereby neither is legally allowed to openly criticize the other's company. (You can read cofounder Sam McCandlish write that Anthropic has mutual non-disparagement agreements in this comment [LW(p) · GW(p)].) Anthropic leadership say they quit OpenAI primarily due to safety concerns, and yet I believe they simultaneously signed away their ability to criticize that very organization that they had such unique levels of information about and believed poses an existential threat to civilization.

Where Daniel Kokotajlo refused to sign a non-disparagement agreement (by-default forfeiting his equity) so that he could potentially criticize OpenAI in the future, the Amodeis quit purportedly due to having damning criticisms of OpenAI in the present and then (I believe) chose to sign a non-disparagement agreement while quitting (and kept their equity). A complete inversion of good collective epistemic principles.

To quote from Zac's above analogy explaining how difficult his situation at Anthropic is:

Imagine, if you will, trying to hold a long conversation about AI risk - but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public

The analogous goal described here for Anthropic is to have complete separation between internal and external information. This does not describe a set of blacklisted trade-secrets or security practices. My sense is that for most safety-related issues Anthropic has a set of whitelisted information, which is primarily the already public stuff. The goal here is for you to not have access to any information about them that they did not explicitly decide that they wanted you to know, and they do not want people in their org to share new information when engaging in public, critical discourse.

Yes, yes, Zac's situation is stressful and I respect his effort to engage in public discourse nonetheless. Good on Zac. But I can't help but rankle at the implication that the primary reason he and others don't talk more is the public commentariat not empathizing enough with having confidential info. Sure, people could do better to understand the difficulty of communicating while holding confidential info. It is hard to repeatedly walk right up to the line and not over it, it's stressful to think you might have gone over it, and it's stressful to suddenly find yourself unable to engage well with people's criticisms because you hit a confidential crux. But as to the fault analysis for Zac's particularly difficult position? In my opinion the blame is surely first with the Anthropic leadership who have given him way too stringent confidentiality constraints, due to seeming to anti-care about helping people external to Anthropic understand what is going on.

Replies from: abandon, zac-hatfield-dodds

↑ comment by dirk (abandon) · 2024-08-26T03:14:48.694Z · LW(p) · GW(p)

I don't think the placement of fault is causally related to whether communication is difficult for him, really. To refer back to the original claim being made, Adam Scholl said that [LW(p) · GW(p)]

My guess is that this seems so stressful mostly because Anthropic’s plan is in fact so hard to defend... [I]t seems unsurprising (and good) that people might sometimes strongly object; if Anthropic had more reassuring things to say, I’m guessing it would feel less stressful to try to reassure them.

I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline. I don't think Adam Scholl's assessment arose from a usefully-predictive model, nor one which was likely to reflect the inside view.

Replies from: None, Benito

↑ comment by [deleted] · 2024-08-26T11:48:40.659Z · LW(p) · GW(p)

Ben Pace has said [LW(p) · GW(p)] that perhaps he doesn't disagree with you in particular about this, but I sure think I do.^[1]

I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline.

I don't see how the first half of this could be correct, and while the second half could be true, it doesn't seem to me to offer meaningful support for the first half either (instead, it seems rather... off-topic).

As a general matter, even if it were the case that no matter what you say, at least one person will actively misinterpret your words, this fact would have little bearing on whether you can causally influence the proportion of readers/community members that end up with (what seem to you like) the correct takeaways from a discussion of that kind.

Moreover, in a spot where you have something meaningful and responsible, etc, that you and your company have done to deal with safety issues, the major concern in your mind when communicating publicly is figuring out how to make it clear to everyone that you are on top of things without revealing confidential information. That is certainly stressful, but much less so than the additional constraint you have in a world in which you do not have anything concrete that you can back your generic claims of responsibility with, since that is a spot where you can no longer fall back on (a partial version of) the truth as your defense. For the vast majority of human beings, lying and intentional obfuscation with the intent to mislead are significantly more psychologically straining than telling the truth as-you-see-it is.

Overall, I also think I disagree about the amount of stress that would be caused by conversations with AI safety community members. As I have said earlier:

AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place.
[1] Quite the opposite, actually, if the change in the wider society's opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.

In any case, I have already made all these points in a number of ways in my previous response [LW(p) · GW(p)] to you (which you haven't addressed, and which still seem to me to be entirely correct).

^[1]
He also said that he thinks your perspective makes sense, which... I'm not really sure about.

↑ comment by Ben Pace (Benito) · 2024-08-26T03:32:15.900Z · LW(p) · GW(p)

Yeah, I totally think your perspective makes sense and I appreciate you bringing it up, even though I disagree.

I acknowledge that someone who has good justifications for their position but just has made a bunch of reasonable confidentiality agreements around the topic should expect to run into a bunch of difficulties and stresses around public conflicts and arguments.

I think you go too far in saying that the stress is orthogonal to whether you have a good case to make; I think you can't really think that it's not a top-3 factor to how much stress you're experiencing. As a pretty simple hypothetical, if you're responding to a public scandal about whether you stole money, you're gonna have a way more stressful time if you did steal money than if you didn't (in substantial part because you'd be able to show the books and prove it).

Perhaps not so much disagreeing with you in particular, but disagreeing with my sense of what was being agreed upon in Zac's comment and in the reacts, I further wanted to raise my hypothesis that a lot of the confidentiality constraints are unwarranted and actively obfuscatory, which does change who is responsible for the stress, but doesn't change the fact that there is stress.

Added: Also, I think we would both agree that there would be less stress if there were fewer confidentiality restrictions.

↑ comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-09-11T09:22:30.162Z · LW(p) · GW(p)

For what it's worth, I endorse Anthropic's confidentiality policies, and am confident that everyone involved in setting them sees the increased difficulty of public communication as a cost rather than a benefit. Unfortunately, the unilateralist's curse and entangled truths [? · GW] mean that confidential-by-default is the only viable policy.

Replies from: Benito

↑ comment by Ben Pace (Benito) · 2024-09-11T20:35:35.678Z · LW(p) · GW(p)

That might be the case, but then it only increases the amount of work your company should be doing to carve out and figure out the info that can be made public, and engage with criticism. There should be whole teams who have Twitter accounts and LW accounts and do regular AMAs and show up to podcasts and who have a mandate internally to seek information in the organization and publish relevant info, and there should be internal policies that reflect an understanding that it is correct for some research teams to spend 10-50% of their yearly effort toward making publishable versions of research and decision-making principles in order to inform your stakeholders (read: the citizens of earth) and critics about decisions you are making directly related to existential catastrophes that you are getting rich running toward. Not monologue-style blogposts, but dialogue-style comment sections & interviews.

Confidentiality-by-default does not mean you get to abdicate responsibility for answering questions to the people whose lives you are risking about how-and-why you are making decisions, it means you have to put more work into doing it well. If your company valued the rest of the world understanding what is going on yet thought confidentiality by-default was required, I think it would be trying significantly harder to overcome this barrier.

My general principle is that if you are wielding a lot of power over people that they didn't otherwise legitimately grant you (in this case building a potential doomsday device), you owe them to be auditable. You are supposed to show up and answer their questions directly – not "thank you so much for the questions, in six months I will publish a related blogpost on this topic" but more like "with the public info available to me, here's my best guess answer to your specific question today". Especially so if you are doing something the people you have power over perceive as norm-violating, and even more-so when you are keeping the answers to some very important questions secret from them.

↑ comment by Raemon · 2024-08-24T21:09:16.958Z · LW(p) · GW(p)

(not going to respond in this context out of respect for Zach's wishes. May chat later, and am mulling over my own top-level post on the subject)

↑ comment by Joseph Miller (Josephm) · 2024-08-23T03:56:06.654Z · LW(p) · GW(p)

Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by “doing periodic vibe checks”

This obvious straw-man makes your argument easy to dismiss.

However I think the point is basically correct. Anthropic's strategy to reduce x-risk also includes lobbying against pre-harm enforcement of liability for AI companies in SB 1047.

Replies from: TsviBT

↑ comment by TsviBT · 2024-08-24T15:45:03.505Z · LW(p) · GW(p)

How is it a straw-man? How is the plan meaningfully different from that?

Imagine a group of people has already gathered a substantial amount of uranium, is already refining it, is already selling power generated by their pile of uranium, etc. And doing so right near and upwind of a major city. And they're shoveling more and more uranium onto the pile, basically as fast as they can. And when you ask them why they think this is going to turn out well, they're like "well, we trust our leadership, and you know we have various documents, and we're hiring for people to 'Develop and write comprehensive safety cases that demonstrate the effectiveness of our safety measures in mitigating risks from huge piles of uranium', and we have various detectors such as an EM detector which we will privately check and then see how we feel". And then the people in the city are like "Hey wait, why do you think this isn't going to cause a huge disaster? Sure seems like it's going to by any reasonable understanding of what's going on". And the response is "well we've thought very hard about it and yes there are risks but it's fine and we are working on safety cases". But... there's something basic missing, which is like, an explanation of what it could even look like to safely have a huge pile of superhot uranium. (Also in this fantasy world no one has ever done so and can't explain how it would work.)

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-08-24T17:31:17.785Z · LW(p) · GW(p)

In the AI case, there's lots of inaction risk: if Anthropic doesn't make powerful AI, someone less safety-focused will.

It's reasonable to think e.g. I want to boost Anthropic in the current world because others are substantially less safe, but if other labs didn't exist, I would want Anthropic to slow down.

Replies from: aysja, TsviBT

↑ comment by aysja · 2024-08-24T21:17:17.552Z · LW(p) · GW(p)

I disagree. It would be one thing if Anthropic were advocating for AI to go slower, trying to get op-eds in the New York Times about how disastrous of a situation this was, or actually gaming out and detailing their hopes for how their influence will buy saving the world points if everything does become quite grim, and so on. But they aren’t doing that, and as far as I can tell they basically take all of the same actions as the other labs except with a slight bent towards safety.

Like, I don’t feel at all confident that Anthropic’s credit has exceeded their debit, even on their own consequentialist calculus. They are clearly exacerbating race dynamics, both by pushing the frontier, and by lobbying against regulation. And what they have done to help strikes me as marginal at best and meaningless at worst. E.g., I don’t think an RSP is helpful if we don’t know how to scale safely; we don’t, so I feel like this device is mostly just a glorified description of what was already happening, namely that the labs would use their judgment to decide what was safe. Because when it comes down to it, if an evaluation threshold triggers, the first step is to decide whether that was actually a red-line, based on the opaque and subjective judgment calls of people at Anthropic. But if the meaning of evaluations can be reinterpreted at Anthropic’s whims, then we’re back to just trusting “they seem to have a good safety culture,” and that isn’t a real plan, nor really any different to what was happening before. Which is why I don’t consider Adam’s comment to be a strawman. It really is, at the end of the day, a vibe check.

And I feel pretty sketched out in general by bids to consider their actions relative to other extremely reckless players like OpenAI. Because when we have so little sense of how to build this safely, it’s not like someone can come in and completely change the game. At best they can do small improvements on the margins, but once you’re at that level, it feels kind of like noise to me. Maybe one lab is slightly better than the others, but they’re still careening towards the same end. And at the very least it feels like there is a bit of a missing mood about this, when people are requesting we consider safety plans relatively. I grant Anthropic is better than OpenAI on that axis, but my god, is that really the standard we’re aiming for here? Should we not get to ask “hey, could you please not build machines that might kill everyone, or like, at least show that you’re pretty sure that won’t happen before you do?”

↑ comment by TsviBT · 2024-08-24T17:40:18.773Z · LW(p) · GW(p)

But that's not a plan to ensure their uranium pile goes well.

Replies from: TsviBT, mesaoptimizer

↑ comment by TsviBT · 2024-08-24T19:16:20.378Z · LW(p) · GW(p)

@Zach Stein-Perlman [LW · GW] , you're missing the point. They don't have a plan. Here's the thread (paraphrased in my words):

Zach: [asks, for Anthropic]
Zac: ... I do talk about Anthropic's safety plan and orientation, but it's hard because of confidentiality and because many responses here are hostile. ...
Adam: Actually I think it's hard because Anthropic doesn't have a real plan.
Joseph: That's a straw-man. [implying they do have a real plan?]
Tsvi: No it's not a straw-man, they don't have a real plan.
Zach: Something must be done. Anthropic's plan is something.
Tsvi: They don't have a real plan.

Replies from: Josephm, Zach Stein-Perlman

↑ comment by Joseph Miller (Josephm) · 2024-08-25T22:38:00.499Z · LW(p) · GW(p)

Joseph: That's a straw-man. [implying they do have a real plan?]

I explicitly said "However I think the point is basically correct" in the next sentence.

↑ comment by Zach Stein-Perlman · 2024-08-24T19:20:59.269Z · LW(p) · GW(p)

Sorry, reacts are ambiguous.

I agree Anthropic doesn't have a "real plan" in your sense, and narrow disagreement with Zac on that is fine.

I just think that's not a big deal and is missing some broader point (maybe that's a motte and Anthropic is doing something bad—vibes from Adam's comment—is a bailey).

[Edit: "Something must be done. Anthropic's plan is something." is a very bad summary of my position. My position is more like various facts about Anthropic mean that them-making-powerful-AI is likely better than the counterfactual, and evaluating a lab in a vacuum or disregarding inaction risk is a mistake.]

[Edit: replies to this shortform tend to make me sad and distracted—this is my fault, nobody is doing something wrong—so I wish I could disable replies and I will probably stop replying and would prefer that others stop commenting. Tsvi, I'm ok with one more reply to this.]

Replies from: TsviBT, TsviBT

↑ comment by TsviBT · 2024-08-24T20:37:16.080Z · LW(p) · GW(p)

(I won't reply more, by default.)

various facts about Anthropic mean that them-making-powerful-AI is likely better than the counterfactual, and evaluating a lab in a vacuum or disregarding inaction risk is a mistake

Look, if Anthropic was honestly and publicly saying

We do not have a credible plan for how to make AGI, and we have no credible reason to think we can come up with a plan later. Neither does anyone else. But--on the off chance there's something that could be done with a nascent AGI that makes a nonomnicide outcome marginally more likely, if the nascent AGI is created and observed by people who are at least thinking about the problem--on that off chance, we're going to keep up with the other leading labs. But again, given that no one has a credible plan or a credible credible-plan plan, better would be if everyone including us stopped. Please stop this industry.

If they were saying and doing that, then I would still raise my eyebrows a lot and wouldn't really trust it. But at least it would be plausibly consistent with doing good.

But that doesn't sound like either what they're saying or doing. IIUC they lobbied to remove protection for AI capabilities whistleblowers from SB 1047! That happened! Wow! And it seems like Zac feels he has to pretend to have a credible credible-plan plan.

↑ comment by TsviBT · 2024-08-24T19:33:42.778Z · LW(p) · GW(p)

Hm. I imagine you don't want to drill down on this, but just to state for the record, this exchange seems like something weird is happening in the discourse. Like, people are having different senses of "the point" and "the vibe" and such, and so the discourse has already broken down. (Not that this is some big revelation.) Like, there's the Great Stonewall of the AGI makers. And then Zac is crossing through the gates of the Great Stonewall to come and talk to the AGI please-don't-makers. But then Zac is like (putting words in his mouth) "there's no Great Stonewall, or like, it's not there in order to stonewall you in order to pretend that we have a safe AGI plan or to muddy the waters about whether or not we should have one, it's there because something something trade secrets and exfohazards, and actually you're making it difficult to talk by making me work harder to pretend that we have a safe AGI plan or intentions that should promissorily satisfy the need for one".

↑ comment by mesaoptimizer · 2024-08-24T18:35:37.048Z · LW(p) · GW(p)

Seems like most people believe (implicitly or explicitly) that empirical research is the only feasible path forward to building a somewhat aligned generally intelligent AI scientist. This is an underspecified claim, and given certain fully-specified instances of it, I'd agree.

But this belief leads to the following reasoning: (1) if we don't eat all this free energy in the form of researchers+compute+funding, someone else will; (2) other people are clearly less trustworthy compared to us (Anthropic, in this hypothetical); (3) let's do whatever it takes to maintain our lead and prevent other labs from gaining power, while using whatever resources we have to also do alignment research, preferably in ways that also help us maintain or strengthen our lead in this race.

Replies from: TsviBT

↑ comment by TsviBT · 2024-08-24T18:57:19.340Z · LW(p) · GW(p)

most people believe (implicitly or explicitly) that empirical research is the only feasible path forward to building a somewhat aligned generally intelligent AI scientist.

I don't credit that they believe that. And, I don't credit that you believe that they believe that. What did they do, to truly test their belief--such that it could have been changed? For most of them the answer is "basically nothing". Such a "belief" is not a belief (though it may be an investment, if that's what you mean). What did you do to truly test that they truly tested their belief? If nothing, then yours isn't a belief either (though it may be an investment). If yours is an investment in a behavioral stance, that investment may or may not be advisable, but it would DEFINITELY be inadvisable to pretend to yourself that yours is a belief.

↑ comment by RobertM (T3t) · 2024-08-22T06:49:35.424Z · LW(p) · GW(p)

My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.

I'd be very interested to have references to occasions of people in the AI-safety-adjacent community treating Anthropic employees as liars because of things those people misremembered or misinterpreted. (My guess is that you aren't interested in litigating these cases; I care about it for internal bookkeeping and so am happy to receive examples e.g. via DM rather than as a public comment.)

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-08-22T19:57:03.903Z · LW(p) · GW(p)

Not Zac Hatfield-Dodds, but people claimed that Anthropic had a commitment to not advance the frontier of capabilities, but as it turns out people misinterpreted communications, and no such commitment actually happened.

Not sure I'd go as far as saying that they treated Anthropic as liars, but this seems to me a central example of Zac Hatfield-Dodds's concerns.

From Evhub:

https://www.lesswrong.com/posts/BaLAgoEvsczbSzmng/?commentId=yd2t6YymWdfGBFhFa [LW · GW]

Replies from: Benito

↑ comment by Ben Pace (Benito) · 2024-08-25T06:04:07.273Z · LW(p) · GW(p)

Contrary to the above, for the record, here [LW(p) · GW(p)] is a link to a thread where a major Anthropic investor (Moskovitz) and the researcher who coined the term “The Scaling Hypothesis” (Gwern) both report that the Anthropic CEO told them in private that this is what Anthropic would do, in accordance with what many others also report hearing privately. (There is disagreement about whether this constituted a commitment.)

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-10-06T00:15:29.621Z · LW(p) · GW(p)

The one thing I do conclude is that Anthropic's comms are very inconsistent, and this is bad, actually.

↑ comment by Thomas Larsen (thomas-larsen) · 2024-08-21T00:56:23.018Z · LW(p) · GW(p)

I agree with Zach that Anthropic is the best frontier lab on safety, and I feel not very worried about Anthropic causing an AI related catastrophe. So I think the most important asks for Anthropic to make the world better are on its policy and comms.

I think that Anthropic should more clearly state its beliefs about AGI, especially in its work on policy. For example, the SB-1047 letter they wrote states:

Broad pre-harm enforcement. The current bill requires AI companies to design and implement SSPs that meet certain standards – for example they must include testing sufficient to provide a "reasonable assurance" that the AI system will not cause a catastrophe, and must "consider" yet-to-be-written guidance from state agencies. To enforce these standards, the state can sue AI companies for large penalties, even if no actual harm has occurred. While this approach might make sense in a more mature industry where best practices are known, AI safety is a nascent field where best practices are the subject of original scientific research. For example, despite a substantial effort from leaders in our company, including our CEO, to draft and refine Anthropic's RSP over a number of months, applying it to our first product launch uncovered many ambiguities. Our RSP was also the first such policy in the industry, and it is less than a year old. What is needed in such a new environment is iteration and experimentation, not prescriptive enforcement. There is a substantial risk that the bill and state agencies will simply be wrong about what is actually effective in preventing catastrophic risk, leading to ineffective and/or burdensome compliance requirements.

Liability doesn’t address the central threat model of AI takeover, for which pre-harm mitigations are necessary due to the irreversible nature of the harm. I think that this letter should have acknowledged that explicitly, and that not doing so is misleading. I feel that Anthropic is trying to play a game of courting political favor by not being very straightforward about its beliefs around AGI, and that this is bad.

To be clear, I think it is reasonable that they argue that the FMD and government in general will be bad at implementing safety guidelines while still thinking that AGI will soon be transformative. I just really think they should be much clearer about the latter belief.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-08-21T17:12:06.597Z · LW(p) · GW(p)

I feel not very worried about Anthropic causing an AI related catastrophe.

This does not fit my model of your risk model. Why do you think this?

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2024-08-21T17:34:40.042Z · LW(p) · GW(p)

Perhaps that was overstated. I think there is maybe a 2-5% chance that Anthropic directly causes an existential catastrophe (e.g. by building a misaligned AGI). Some reasoning for that:

I doubt Anthropic will continue to be in the lead because they are behind OAI/GDM in capital. They do seem around the frontier of AI models now, though, which might translate to increased returns, but it seems like they do best on very short timelines worlds.
I think that if they could cause an intelligence explosion, it is more likely than not that they would pause for at least long enough to allow other labs into the lead. This is especially true in short timelines worlds because the gap between labs is smaller.
I think they have much better AGI safety culture than other labs (though still far from perfect), which will probably result in better adherence to voluntary commitments.
On the other hand, they haven't been very transparent, and we haven't seen their ASL-4 commitments. So these commitments might amount to nothing, or Anthropic might just walk them back at a critical juncture.

2-5% is still wildly high in an absolute sense! However, risk from other labs seems even higher to me, and I think that Anthropic could reduce this risk by advocating for reasonable regulations (e.g. transparency into frontier AI projects so no one can build ASI without the government noticing).

Replies from: D0TheMath, Benito

↑ comment by Garrett Baker (D0TheMath) · 2024-08-21T18:40:00.557Z · LW(p) · GW(p)

I think you probably under-rate the effect of having both a large number & concentration of very high quality researchers & engineers (more than OpenAI now, I think, and I wouldn't be too surprised if the concentration of high quality researchers was higher than at GDM), being free from corporate chafe, and also having many of those high quality researchers thinking (and perhaps being correct in thinking, I don't know) they're value aligned with the overall direction of the company at large. Probably also Nvidia rate-limiting the purchases of large labs to keep competition among the AI companies.

All of this is also compounded by smart models leading to better data curation and RLAIF (given quality researchers & lack of crust) leading to even better models (this being the big reason I think llama had to be so big to be SOTA, and Gemini not even SOTA), which of course leads to money in the future even if they have no money now.

Replies from: Josephm

↑ comment by Joseph Miller (Josephm) · 2024-08-27T14:58:31.839Z · LW(p) · GW(p)

llama had to be so big to be SOTA,

How many parameters do you estimate for other SOTA models?

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-08-27T15:36:24.170Z · LW(p) · GW(p)

Mistral had like 150b parameters or something.

↑ comment by Ben Pace (Benito) · 2024-08-22T19:16:37.978Z · LW(p) · GW(p)

directly causes an existential risk

FYI I believe the correct language is "directly causes an existential catastrophe". "Existential risk" is a measure of the probability of an existential catastrophe, but is not itself an event.

↑ comment by Raemon · 2024-08-20T19:32:25.331Z · LW(p) · GW(p)

This one seems probably worth making a top-level post?

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-08-20T19:43:31.597Z · LW(p) · GW(p)

I want to avoid this being negative-comms for Anthropic. I'm generally happy to loudly criticize Anthropic, obviously, but this was supposed to be part of the 5% of my work that I do because someone at the lab is receptive to feedback, where the audience was Zac and publishing was an afterthought. (Maybe the disclaimers at the top fail to negate the negative-comms; maybe I should list some good things Anthropic does that no other labs do...)

Also, this is low-effort.

comment by Zach Stein-Perlman · 2024-08-08T16:00:09.835Z · LW(p) · GW(p)

Yay Anthropic for expanding its model safety bug bounty program, focusing on jailbreaks and giving participants pre-deployment access. Apply by next Friday.

Anthropic also says "To date, we’ve operated an invite-only bug bounty program in partnership with HackerOne that rewards researchers for identifying model safety issues in our publicly released AI models." This is news, and they never published an application form for that. I wonder how long that's been going on.

(Google, Microsoft, and Meta have bug bounty programs which include some model issues but exclude jailbreaks. OpenAI's bug bounty program excludes model issues.)

comment by Zach Stein-Perlman · 2024-12-11T05:25:35.362Z · LW(p) · GW(p)

To avoid deploying a dangerous model, you can either (1) test the model pre-deployment or (2) test a similar older model with tests that have a safety buffer such that if the old model is below some conservative threshold it's very unlikely that the new model is dangerous.
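A minimal sketch of how the two plans differ, assuming a single scalar capability score where higher means more dangerous. The function names, the score scale, and the numbers are all hypothetical; no lab has published thresholds in this form.

```python
def passes_direct_test(new_model_score: float, danger_threshold: float) -> bool:
    """Plan (1): evaluate the model that is about to be deployed."""
    return new_model_score < danger_threshold


def passes_safety_buffer(old_model_score: float,
                         danger_threshold: float,
                         buffer: float) -> bool:
    """Plan (2): evaluate a similar older model against a conservative threshold.

    The buffer is meant to absorb the capability gain between the old model
    and the new one, so the new model is never tested directly.
    """
    conservative_threshold = danger_threshold - buffer
    return old_model_score < conservative_threshold


# Hypothetical numbers, for illustration only.
DANGER_THRESHOLD = 0.80
BUFFER = 0.30

print(passes_safety_buffer(0.40, DANGER_THRESHOLD, BUFFER))  # old model well below the buffer
print(passes_safety_buffer(0.60, DANGER_THRESHOLD, BUFFER))  # old model inside the buffer
```

The buffer plan only works if the buffer is larger than the capability jump between the tested model and the deployed one, which is the part that has to be operationalized.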

DeepMind says it uses the safety-buffer plan (but it hasn't yet said it has operationalized thresholds/buffers).

Anthropic's original RSP used the safety-buffer plan; its new RSP doesn't really use either plan (kinda safety-buffer but it's very weak). (This is unfortunate.)

OpenAI seemed to use the test-the-actual-model plan.^[1] This isn't going well. The 4o evals were rushed because OpenAI (reasonably) didn't want to delay deployment. Then the o1 evals were done on a weak o1 checkpoint rather than the final model, presumably so they wouldn't be rushed, but this presumably hurt performance a lot on some tasks (and indeed the o1 checkpoint performed worse than o1-preview on some capability evals). OpenAI doesn't seem to be implementing the safety-buffer plan, so if a model is dangerous but not super obviously dangerous, it seems likely OpenAI wouldn't notice before deployment....

(Yay OpenAI for honestly publishing eval results that don't look good.)

^[1]
It's not explicit. The PF says e.g. 'Only models with a post-mitigation score of "medium" or below can be deployed.' But it also mentions forecasting capabilities.

Replies from: neel-nanda-1

↑ comment by Neel Nanda (neel-nanda-1) · 2024-12-12T16:19:00.532Z · LW(p) · GW(p)

It seems unlikely that OpenAI is truly following the test-the-model plan? They keep e.g. putting new experimental versions onto lmsys, presumably mostly due to different post-training, and it seems pretty expensive to be doing all the DC evals again on each new version (and I think it's pretty reasonable to assume that a bit of further post-training hasn't made things much more dangerous).

comment by Zach Stein-Perlman · 2024-08-08T19:15:34.929Z · LW(p) · GW(p)

Zico Kolter Joins OpenAI’s Board of Directors. OpenAI says "Zico's work predominantly focuses on AI safety, alignment, and the robustness of machine learning classifiers."

Misc facts:

He's an ML professor
He cofounded Gray Swan (with Dan Hendrycks, among others)
He coauthored Universal and Transferable Adversarial Attacks on Aligned Language Models
I hear he has good takes on adversarial robustness
I failed to find statements on alignment or extreme risks, or work focused on that (in particular, he did not sign the CAIS letter)

Replies from: mtrazzi

↑ comment by Michaël Trazzi (mtrazzi) · 2024-08-09T15:47:35.243Z · LW(p) · GW(p)

He cofounded Gray Swan (with Dan Hendrycks, among others)

I'm confused. On their about page, Dan is an advisor, not a founder.

Replies from: Zach Stein-Perlman, bogdan-ionut-cirstea

↑ comment by Zach Stein-Perlman · 2024-08-09T16:32:13.273Z · LW(p) · GW(p)

Dan was a cofounder.

↑ comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-09T16:32:55.166Z · LW(p) · GW(p)

I'm confused. On their about page, Dan is an advisor, not a founder.

It might have something to do with Dan choosing to divest: https://x.com/DanHendrycks/status/1816523907777888563.

comment by Zach Stein-Perlman · 2024-12-26T18:50:06.466Z · LW(p) · GW(p)

DeepSeek-V3 is out today, with weights and a paper published. Tweet thread, GitHub, report (GitHub, HuggingFace). It's big and mixture-of-experts-y; discussion here and here.

It was super cheap to train — they say 2.8M H800-hours or $5.6M (!!).
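As a quick sanity check, the two quoted figures imply a flat price per GPU-hour. This is just arithmetic on the rounded numbers above; the variable names are mine.

```python
# Figures as quoted above (rounded).
gpu_hours = 2.8e6        # H800-hours
total_cost_usd = 5.6e6   # dollars

implied_rate = total_cost_usd / gpu_hours
print(f"${implied_rate:.2f} per H800-hour")  # prints "$2.00 per H800-hour"
```

So the dollar figure looks like GPU time priced at about $2 per H800-hour, not a full accounting of what the project cost.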

It's powerful:

It's cheap to run:

Replies from: robert-lynn

↑ comment by Foyle (robert-lynn) · 2024-12-28T04:15:07.338Z · LW(p) · GW(p)

This is depressing, but not surprising. We know the approximate processing power of brains (O(1e16-1e17 flops)) and how long it takes to train them, and should expect that over the next few years the tricks and structures needed to replicate or exceed that efficiency in ML will be uncovered in an accelerating rush towards the cliff as computational resources needed to attain commercially useful performance continue to fall. The AI industry can afford to run thousands of experiments at this cost scale.

Within a few years this will likely see AGI implementations on Nvidia B200-level GPUs (~1e16 flops). We have not yet seen hardware application of the various power-reducing computational 'cheats' for mimicking multiplication with reduced gate counts that are likely to see a 2-5x performance gain at same chip size and power draw.

Humans are so screwed.

Replies from: wassname

↑ comment by wassname · 2024-12-28T05:06:35.937Z · LW(p) · GW(p)

We know the approximate processing power of brains (O(1e16-1e17 flops))

This is still debatable; see Table 9 in the brain emulation roadmap https://www.fhi.ox.ac.uk/brain-emulation-roadmap-report.pdf. You are referring to level 4 (SNN), but level 5 is plausible imo (at 10^22) and 6 seems possible (10^25), and of course it could be a mix of levels.

comment by Zach Stein-Perlman · 2025-02-12T20:00:03.666Z · LW(p) · GW(p)

This shortform discusses the current state of responsible scaling policies (RSPs). They're mostly toothless, unfortunately.

The Paris summit was this week. Many companies had committed to make something like an RSP by the summit. Half of them did, including Microsoft, Meta, xAI, and Amazon. (NVIDIA did not—shame on them—but I hear they are writing something.) Unfortunately but unsurprisingly, these policies are all vague and weak.

RSPs essentially have four components: capability thresholds beyond which a model might be dangerous by default, an evaluation protocol to determine when models reach those thresholds, a plan for how to respond when various thresholds are reached, and accountability measures.

A maximally lazy RSP—a document intended to look like an RSP without making the company do anything differently—would have capability thresholds be vague or extremely high, evaluation be unspecified or low-quality, response be like we will make it safe rather than substantive mitigations or robustness guarantees, and no accountability measures. Such a policy would be little better than the company saying "we promise to deploy AIs safely." The new RSPs are basically like that.^[1]

Some aspects of some RSPs that existed before the summit are slightly better.^[2]

If existing RSPs are weak, how would a strong RSP be different?

Evals: evals should measure relevant capabilities (including cyber, bio, and scheming), evals should be sufficiently difficult, and labs should do good elicitation. (As a lower bar, the evals should exist; many companies say they will do evals but don't seem to have a plan for what evals to do.)
- See generally Model evals for dangerous capabilities [LW · GW] and OpenAI's CBRN tests seem unclear
Response: misuse
- Rather than just saying that you'll implement mitigations such that users can't access dangerous capabilities, say how you'll tell if your mitigations are good enough. For example, say that you'll have a skilled red-team attempt to elicit dangerous stuff from the post-mitigation model; ensure that it doesn't provide uplift, or the elicitation required is so involved that doing harm this way is no cheaper than without the AI
Response: control
- AI monitoring for catastrophic actions [AF · GW] where the monitor can replace suspicious actions and send particularly suspicious actions to human auditors [AF · GW]
  - And ideally a control evaluation [AF · GW] to evaluate your protocol's adequacy if the model is scheming^[3]
- Behavioral red-teaming to catch schemers [AF · GW]
Response: security (especially of model weights) — we're very far from securing model weights against determined sophisticated attackers, so:
- Avoid expanding the Pareto frontier between powerful/dangerous and insecure much
- Say no AI company has secured their model weights, and this imposes unacceptable risk. Commit that if all other developers were willing to implement super strong security, even though it's costly, you would too.
Thresholds
- Should be low enough that your responses trigger before your models enable catastrophic harm
- Ideally should be operationalized in evals, but this is genuinely hard
Accountability
- Publish info on evals/elicitation (such that others can tell whether it's adequate)
- Be transparent to an external auditor; have them review your evals and RSP and publicly comment on (1) adequacy and (2) whether your decisions about not publishing various details are reasonable
- See also https://metr.org/rsp-key-components/#accountability

(What am I happy about in current RSPs? Briefly and without justification: yay Anthropic and maybe DeepMind on misuse stuff; somewhat yay Anthropic and OpenAI on their eval-commitments; somewhat yay DeepMind and maybe OpenAI on their actual evals; somewhat yay DeepMind on scheming/monitoring/control; maybe somewhat yay DeepMind and Anthropic on security (but not adequate).)

(Companies' other commitments aren't meaningful either.)

^[1]
Microsoft is supposed to conduct "robust evaluation of whether a model possesses tracked capabilities at high or critical levels, including through adversarial testing and systematic measurement using state-of-the-art methods," but no details on what evals they'll use. The response to dangerous capabilities is "Further review and mitigations required." The "Security measures" are underspecified but do take the situation seriously. The "Safety mitigations" are less substantive, unfortunately. There's not really accountability. Nothing on alignment or internal deployment.
Meta has high vague risk thresholds, a vague evaluation plan, vague responses (e.g. "security protections to prevent hacking or exfiltration" and "mitigations to reduce risk to moderate level"), and no accountability. But they do suggest that if they make a model with dangerous capabilities, they won't release the weights and if they do deploy it externally (via API) they'll have decent robustness to jailbreaks — there are loopholes but they hadn't articulated that principle before. Nothing on alignment or internal deployment.
xAI's policy begins "This is the first draft iteration of xAI’s risk management framework that we expect to apply to future models not currently in development." Misuse evals are cheap (like multiple-choice questions rather than uplift experiments); alignment evals reference unpublished papers and mention "Utility Functions" and "Corrigibility Score." Thresholds would be operationalized as eval results, which is nice, but they don't yet exist. For misuse, includes "Examples of safeguards or mitigations" (but not we'll know mitigations are adequate if a red team fails to break them or other details to suggest mitigations will be effective); no mitigations for alignment.
Amazon: "Critical Capability Thresholds" are high and vague; evaluation is vague; mitigations are vague ("Upon determining that an Amazon model has reached a Critical Capability Threshold, we will implement a set of Safety Measures and Security Measures to prevent elicitation of the critical capability identified and to protect against inappropriate access risks. Safety Measures are designed to prevent the elicitation of the observed Critical Capabilities following deployment of the model. Security Measures are designed to prevent unauthorized access to model weights or guardrails implemented as part of the Safety Measures, which could enable a malicious actor to remove or bypass existing guardrails to exceed Critical Capability Thresholds."); there's not really accountability. Nothing on alignment or internal deployment. I appreciate the list of current security practices at the end of the document.
^[2]
Briefly and without justification:
OpenAI is similarly vague on thresholds and evals and responses, with no accountability. (Also OpenAI is untrustworthy in general and has a bad history on Preparedness in particular.) But they've done almost-decent evals in the past.
The DeepMind thing isn't really a commitment, and it doesn't say much about when to do evals, and the security levels are low (but this is no worse than everyone else being vague), and it doesn't have accountability, but it does mention deceptive alignment + control evals + monitoring, and DeepMind has done almost-decent evals in the past.
The Anthropic thing is fine except it only goes up to ASL-3 and there's no control and little accountability (and the security isn't great (especially for systems after the first systems requiring ASL-3 security) but it's no worse than others).
See The current state of RSPs [LW · GW] modulo the DeepMind FSF update [LW(p) · GW(p)].
^[3]
DeepMind says something good on this (but it's not perfect and is only effective if they do good evals sufficiently frequently). Other RSPs don't seriously talk about risks from deceptive alignment.

Replies from: dan-hendrycks

↑ comment by Dan H (dan-hendrycks) · 2025-02-13T00:18:21.352Z · LW(p) · GW(p)

capability thresholds be vague or extremely high

xAI's thresholds are entirely concrete and not extremely high.

evaluation be unspecified or low-quality

They are specified and as high-quality as you can get. (If there are better datasets let me know.)

I'm not saying it's perfect, but I wouldn't put them all in the same bucket. Meta's is very different from DeepMind's or xAI's.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2025-02-13T01:00:31.947Z · LW(p) · GW(p)

xAI Risk Management Framework (Draft)

You're mostly right about evals/thresholds. Mea culpa. Sorry for my sloppiness.

For misuse, xAI has benchmarks and thresholds (or rather examples of benchmarks and thresholds to appear in the real future framework), and based on the right column they seem very reasonably low.

Unlike other similar documents, these are not thresholds at which to implement mitigations but rather thresholds to reduce performance to. So it seems the primary concern is probably not that the thresholds are too high but rather that xAI's mitigations won't be robust to jailbreaks and xAI won't elicit performance on post-mitigation models well. E.g. it would be inadequate to just run a benchmark with a refusal-trained model, note that it almost always refuses, and call it a success. You need something like: a capable red-team tries to break the mitigations and use the model for harm, and either the red-team fails or it's so costly that the model doesn't make doing harm cheaper.
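The criterion in the last sentence above can be written as a simple predicate. This is a sketch only: the function and parameter names are hypothetical, and reducing "cost of doing harm" to a single number is a big simplification.

```python
def mitigations_look_adequate(red_team_succeeded: bool,
                              cost_with_model: float,
                              cost_without_model: float) -> bool:
    """Return True if a capable red-team exercise supports the mitigations.

    Adequate if the red team failed to elicit harmful help, or if it
    succeeded only at a cost at least as high as doing harm without the model.
    """
    if not red_team_succeeded:
        return True
    return cost_with_model >= cost_without_model


# Hypothetical cases (costs in arbitrary units).
print(mitigations_look_adequate(False, 0.0, 100.0))   # red team failed
print(mitigations_look_adequate(True, 150.0, 100.0))  # broke through, but no cheaper than baseline
print(mitigations_look_adequate(True, 10.0, 100.0))   # broke through cheaply
```

A high refusal rate on a benchmark doesn't enter into this at all. What matters is what a motivated attacker can get out of the post-mitigation model.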

(For "Loss of Control," one of the two cited benchmarks was published today [LW · GW]—I'm dubious that it measures what we care about but I've only spent ~3 mins engaging—and one has not yet been published. [Edit: and, like, on priors, I'm very skeptical of alignment evals/metrics, given the prospect of deceptive alignment, how we care about worst-case in addition to average-case behavior, etc.])

comment by Zach Stein-Perlman · 2024-06-06T17:00:52.382Z · LW(p) · GW(p)

Securing model weights is underrated for AI safety. (Even though it's very highly rated.) If the leading lab can't stop critical models from leaking to actors that won't use great deployment safety practices, approximately nothing else matters. Safety techniques would need to be based on properties that those actors are unlikely to reverse (alignment, maybe unlearning) rather than properties that would be undone or that require a particular method of deployment (control techniques, RLHF harmlessness, deployment-time mitigations).

However hard the make a critical model you can safely deploy problem is, the make a critical model that can safely be stolen problem is... much harder.

Replies from: habryka4, ryan_greenblatt, Tenoke, None, ozziegooen

↑ comment by habryka (habryka4) · 2024-06-06T19:52:56.697Z · LW(p) · GW(p)

None of the actors who seem to me currently likely to deploy highly capable systems seem to me like they will do anything except approximately scaling as fast as they can. I do agree that proliferation is still bad simply because you get more samples from the distribution, but I don't think that changes the probabilities that drastically for me (I am still in favor of securing model weights work, especially in the long run).

Separately, I think it's currently pretty plausible that model weight leaks will substantially reduce the profit of AI companies by reducing their moat, and that has an effect size that seems plausibly larger than the benefits of non-proliferation.

Replies from: Linch, valley9

↑ comment by Linch · 2024-06-06T22:59:13.154Z · LW(p) · GW(p)

My central story [EA(p) · GW(p)] is that AGI development will eventually be taken over by governments, in more or less subtle ways. So the importance of securing model weights now is mostly about less scrupulous actors having less of a headstart during the transition/after a governmental takeover.

Replies from: valley9

↑ comment by Ebenezer Dukakis (valley9) · 2024-06-07T01:03:53.152Z · LW(p) · GW(p)

IMO someone should consider writing a "how and why" post on nationalizing AI companies. It could accomplish a few things:

Ensure there's a reasonable plan in place for nationalization. That way if nationalization happens, we can decrease the likelihood of it being controlled by Donald Trump with few safeguards, or something like that. Maybe we could take inspiration from a less partisan organization like the Federal Reserve.
Scare off investors. Just writing the post and having it be discussed a lot could scare them.
Get AI companies on their best behavior. Maybe Sam Altman would finally be pushed out if congresspeople made him the poster child for why nationalization is needed.

Replies from: akash-wasil, Aaron_Scher

↑ comment by Orpheus16 (akash-wasil) · 2024-06-07T01:29:23.840Z · LW(p) · GW(p)

@Ebenezer Dukakis [LW · GW] I would be even more excited about a "how and why" post for internationalizing AGI development and spelling out what kinds of international institutions could build + govern AGI.

↑ comment by Aaron_Scher · 2024-11-11T04:47:50.406Z · LW(p) · GW(p)

There is now some work in that direction: https://forum.effectivealtruism.org/posts/47RH47AyLnHqCQRCD/soft-nationalization-how-the-us-government-will-control-ai

↑ comment by Ebenezer Dukakis (valley9) · 2024-06-07T00:55:20.530Z · LW(p) · GW(p)

Separately, I think it's currently pretty plausible that model weight leaks will substantially reduce the profit of AI companies by reducing their moat, and that has an effect size that seems plausibly larger than the benefits of non-proliferation.

What sort of leaks are we talking about? I doubt a sophisticated hacker is going to steal weights from OpenAI just to post them on 4chan. And I doubt OpenAI's weights will be stolen by anyone except a sophisticated hacker.

If you want to reduce the incentive to develop AI, how about passing legislation to tax it really heavily? That is likely to have popular support due to the threat of AI unemployment. And it reduces the financial incentive to invest in large training runs. Even just making a lot of noise about such legislation creates uncertainty for investors.

↑ comment by ryan_greenblatt · 2024-06-06T18:00:52.705Z · LW(p) · GW(p)

If the leading lab can't stop critical models from leaking to actors that won't use great deployment safety practices, approximately nothing else matters.

This seems somewhat overstated. You might hope that you can get the safety tax sufficiently low that you can just do full competition (e.g. even though there are rogue AIs, you just compete with these rogue AIs for power). This also requires offense-defense imbalance to not be too bad.

I overall agree that securing model weights is underrated and that it is plausibly the most important thing on current margins.

In principle, if reasonable actors start with a high fraction of resources (e.g. compute), then you might hope that they can keep that fraction of power (in expectation at least).

↑ comment by Tenoke · 2024-06-06T22:16:37.353Z · LW(p) · GW(p)

I think you are overrating it. The biggest concern comes from whoever trains a model that passes some threshold in the first place, not from a model that one actor has been using for a while getting leaked to another actor. The bad actor who got access to the leak is always going to be behind in multiple ways in this scenario.

Replies from: bec-hawk

↑ comment by Rebecca (bec-hawk) · 2024-06-07T07:27:46.179Z · LW(p) · GW(p)

The weights could be stolen as soon as the model is trained though

↑ comment by [deleted] · 2024-06-07T03:52:05.517Z · LW(p) · GW(p)

Commenting to note that I think this quote is locally-invalid:

If the leading lab can't stop critical models from leaking to actors that won't use great deployment safety practices, approximately nothing else matters

There are other disjunctive problems with the world which are also individually-sufficient for doom^[1], in which case each of them matter a lot, in absence of some fundamental solution to all of them.

^[1]
(e.g. lack of superintelligence-alignment/steerability progress)

↑ comment by ozziegooen · 2024-06-06T18:54:29.236Z · LW(p) · GW(p)

Minor point, but I think we might have some time here. Securing model weights becomes more important as models become better, but better models could also help us secure model weights (would help us code, etc).

comment by Zach Stein-Perlman · 2024-06-07T01:30:20.915Z · LW(p) · GW(p)

New page on AI companies' policy advocacy: https://ailabwatch.org/resources/company-advocacy/.

This page is the best collection on the topic (I'm not really aware of others), but I decided it's low-priority and so it's unpolished. If a better version would be helpful for you, let me know to prioritize it more.

comment by Zach Stein-Perlman · 2024-09-06T19:30:04.967Z · LW(p) · GW(p)

I was recently surprised to notice that Anthropic doesn't seem to have a commitment to publish its safety research.^[1] It has a huge safety team but has only published ~5 papers this year (plus an evals report). So probably it has lots of safety research it's not publishing. E.g. my impression is that it's not publishing its scalable oversight and especially adversarial robustness and inference-time misuse prevention research.

Not-publishing-safety-research is consistent with Anthropic prioritizing the win the race goal over the help all labs improve safety goal, insofar as the research is commercially advantageous. (Insofar as it's not, not-publishing-safety-research is baffling.)

Maybe it would be better if Anthropic published ~all of its safety research, including product-y stuff like adversarial robustness such that others can copy its practices.

(I think this is not a priority for me to investigate but I'm interested in info and takes.)

[Edit: in some cases you can achieve most of the benefit with little downside except losing commercial advantage by sharing your research/techniques with other labs, nonpublicly.]

^[1]
I failed to find good sources saying Anthropic publishes its safety research. I did find:
1. https://www.anthropic.com/research says "we . . . share what we learn [on safety]."
2. President Daniela Amodei said "we publish our safety research" on a podcast once.
3. Edit: cofounder Chris Olah said "we plan to share the work that we do on safety with the world, because we ultimately just want to help people build safe models, and don’t want to hoard safety knowledge" on a podcast once.
4. Cofounder Nick Joseph said this on a podcast recently (seems false but it's just a podcast so that's not so bad):
  > we publish our safety research, so in some ways we’re making it as easy as we can for [other labs]. We’re like, “Here’s all the safety research we’ve done. Here’s as much detail as we can give about it. Please go reproduce it.”
Edit: also cofounder Chris Olah said [LW(p) · GW(p)] "we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk." But he seems to be saying that safety benefit > social cost is a necessary condition for publishing, not necessarily that the policy is to publish all such research.

Replies from: Buck, daniel-kokotajlo, Raemon, localdeity, bogdan-ionut-cirstea, shankar-sivarajan, AliceZ, davekasten

↑ comment by Buck · 2024-09-06T20:26:11.846Z · LW(p) · GW(p)

One argument against publishing adversarial robustness research is that it might make your systems easier to attack.

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-09-07T19:40:07.263Z · LW(p) · GW(p)

One thing I'd really like labs to do is encourage their researchers to blog about their thoughts on the future, on alignment plans, etc.

Another related but distinct thing is have safety cases and have an anytime alignment plan and publish redacted versions of them.

Safety cases: Argument for why the current AI system isn't going to cause a catastrophe. (Right now, this is very easy to do: 'it's too dumb')

Anytime alignment plan: Detailed exploration of a hypothetical in which a system trained in the next year turns out to be AGI, with particular focus on what alignment techniques would be applied.

Replies from: ryan_greenblatt, bogdan-ionut-cirstea

↑ comment by ryan_greenblatt · 2024-09-07T20:18:48.951Z · LW(p) · GW(p)

One thing I'd really like labs to do is encourage their researchers to blog about their thoughts on the future, on alignment plans, etc.

Or, as a more minimal ask, they could avoid discouraging researchers from sharing thoughts implicitly due to various chilling effects and also avoid explicitly discouraging researchers.

↑ comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-07T20:02:19.162Z · LW(p) · GW(p)

Anytime alignment plan: Detailed exploration of a hypothetical in which a system trained in the next year turns out to be AGI, with particular focus on what alignment techniques would be applied.

I'd personally love to see similar plans from AI safety orgs, especially (big) funders.

Replies from: ryan_greenblatt, ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-09-07T20:30:30.219Z · LW(p) · GW(p)

We're working on something along these lines. The most up-to-date published post is just our control post [LW · GW] and our Notes on control evaluations for safety cases [LW · GW] which is obviously incomplete.

I'm planning on posting a link to our best draft of a ready-to-go-ish plan as of 1 year ago, though it is quite out of date and incomplete.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-09-07T22:29:02.909Z · LW(p) · GW(p)

I posted the link here [LW(p) · GW(p)].

Here is the doc, though note that it is very out of date. I don't particularly want to recommend people read this doc, but it is possible that someone will find it valuable to read.

↑ comment by ryan_greenblatt · 2024-09-07T20:19:49.885Z · LW(p) · GW(p)

I don't think funders are in a good position to do this. Also, funders are generally not "coherent". Like they don't have much top-down strategy. Individual granters could write up thoughts.

↑ comment by Raemon · 2024-09-06T20:03:25.088Z · LW(p) · GW(p)

Fwiw I am somewhat more sympathetic here to "the line between safety and capabilities is blurry, Anthropic has previously published some interpretability research that turned out to help someone else do some capabilities advances."

I have heard Anthropic is bottlenecked on having people with enough context and discretion to evaluate various things that are "probably fine to publish" but "not obviously fine enough to ship without taking at least a chunk of some busy person's time". I think in this case I basically take the claim at face value.

I do want to generally keep pressuring them to somehow resolve that bottleneck because it seems very important, but, I don't know that I disproportionately would complain at them about this particular thing.

(I'd also not be surprised if, while the above claim is true, Anthropic is still suspiciously dragging its feet disproportionately in areas that feel like they make more of a competitive sacrifice, but, I wouldn't actively bet on it)

Sounds fatebookable tho, so let's use ye Olde Fatebook Chrome extension [LW · GW]:

⚖ In 4 years, Ray will think it is pretty obviously clear that Anthropic was strategically avoiding posting alignment research for race-winning reasons. (Raymond Arnold: 17%)

(low probability because I expect it to still be murky/unclear)

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-09-06T20:11:41.192Z · LW(p) · GW(p)

I tentatively think this is a high-priority ask
Capabilities research isn't a monolith and improving capabilities without increasing spooky black-box reasoning seems pretty fine [AF · GW]
If you're right, I think the upshot is (a) Anthropic should figure out whether to publish stuff rather than let it languish and/or (b) it would be better for lots of Anthropic safety researchers to instead do research that's safe to share (rather than research that only has value if Anthropic wins the race)

↑ comment by localdeity · 2024-09-06T20:16:43.209Z · LW(p) · GW(p)

I would expect that some amount of good safety research is of the form, "We tried several ways of persuading several leading AI models how to give accurate instructions for breeding antibiotic-resistant bacteria. Here are the ways that succeeded, here are some first-level workarounds, here's how we beat those workarounds...": in other words, stuff that would be dangerous to publish. In the most extreme cases, a mere title ("Telling the AI it's writing a play defeats all existing safety RLHF" or "Claude + Coverity finds zero-day RCE exploits in many codebases") could be dangerous.

That said, some large amount should be publishable, and 5 papers does seem low.

Though maybe they're not making an effort to distinguish what's safe to publish from what's not, and erring towards assuming the latter? (Maybe someone set a policy of "Before publishing any safety research, you have to get Important Person X to look through it and/or go through some big process to ensure publishing it is safe", and the individual researchers are consistently choosing "Meh, I have other work to do, I won't bother with that" and therefore not publishing?)

↑ comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-07T12:27:59.034Z · LW(p) · GW(p)

Seems like evidence towards the claim here: Open source AI has been vital for alignment. My rough impression is also that the other big labs' output has largely been similarly disappointing in terms of public research output on safety.

↑ comment by Shankar Sivarajan (shankar-sivarajan) · 2024-09-07T16:50:47.255Z · LW(p) · GW(p)

I was recently surprised to notice that Anthropic

My impression from skimming posts here is that people seem to be continually surprised by Anthropic, while those modeling it as basically "Pepsi to OpenAI's Coke" wouldn't be.

Meta seems to be the only group doing something meaningfully different from the others.

Replies from: Zach Stein-Perlman, Benito

↑ comment by Zach Stein-Perlman · 2024-09-07T16:52:20.535Z · LW(p) · GW(p)

There's a selection effect in what gets posted about. Maybe someone should write the "ways Anthropic is better than others" list to combat this.

Edit: there’s also a selection effect in what you see, since negative stuff gets more upvotes…

↑ comment by Ben Pace (Benito) · 2024-09-08T22:56:56.530Z · LW(p) · GW(p)

I’d say it’s slightly more like “Labor vs Conservatives”, where I’ve seen politicians deflect criticisms of their behavior by arguing that the other side is worse, instead of evaluating their policies or behavior by objective standards (where both sides can typically score exceedingly low).

↑ comment by ZY (AliceZ) · 2024-09-11T16:07:40.596Z · LW(p) · GW(p)

I also wish to see more safety papers. My guess, from my experience, is that it might also be that really good quality research takes time, and the papers from them so far seem pretty good. Though I don’t know if they are actively withholding things on purpose, which could also be true. Any insider info/sources for this guess?

↑ comment by davekasten · 2024-09-06T22:18:54.902Z · LW(p) · GW(p)

Is this where we think our pressuring-Anthropic points are best spent?

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-09-07T23:50:15.699Z · LW(p) · GW(p)

This shortform is relevant to e.g. understanding what's going on and considerations on the value of working on safety at Anthropic, not just pressuring Anthropic.

@Neel Nanda [LW · GW]

Replies from: neel-nanda-1, akash-wasil, davekasten

↑ comment by Neel Nanda (neel-nanda-1) · 2024-09-08T10:11:02.654Z · LW(p) · GW(p)

Yeah, fair point, disagreement retracted

↑ comment by Orpheus16 (akash-wasil) · 2024-09-08T01:32:45.911Z · LW(p) · GW(p)

Is this where we think our pressuring-Anthropic points are best spent?

I think if someone has a 30-minute meeting with some highly influential and very busy person at Anthropic, it makes sense for them to have thought in advance about the most important things to ask & curate the agenda appropriately.

But I don't think LW users should be thinking much about "pressuring-Anthropic points". I see LW primarily as a platform for discourse (as opposed to a direct lobbying channel to labs), and I think it would be bad for the discourse if people felt like they had to censor questions/concerns about labs on LW unless it met some sort of "is this one of the most important things to be pushing for" bar.

Replies from: Benito, davekasten

↑ comment by Ben Pace (Benito) · 2024-09-08T01:40:58.212Z · LW(p) · GW(p)

I agree! I hope people regularly ask questions about Anthropic that they feel curious about, as well as questions that seem important to them :)

↑ comment by davekasten · 2024-09-08T17:06:26.732Z · LW(p) · GW(p)

I think it's bad for discourse for us to pretend that discourse doesn't have impacts on others in a democratic society. And I think the meta-censoring of discourse by claiming that certain questions might have implicit censorship impacts is one of the most anti-rationality trends in the rationalist sphere.

I recognize most users of this platform will likely disagree, and predict negative agreement-karma on this post.

Replies from: akash-wasil, Benito

↑ comment by Orpheus16 (akash-wasil) · 2024-09-09T04:42:48.345Z · LW(p) · GW(p)

I think it's bad for discourse for us to pretend that discourse doesn't have impacts on others in a democratic society.

I think I agree with this in principle. Possible that the crux between us is more like "what is the role of LessWrong."

For instance, if Bob wrote a NYT article titled "Anthropic is not publishing its safety research", I would be like "meh, this doesn't seem like a particularly useful or high-priority thing to be bringing to everyone's attention– there are like at least 10+ topics I would've much rather Bob spent his points on."

But LW generally isn't a place where you're going to get EG thousands of readers or have a huge effect on general discourse (with the exception of a few things that go viral or AIS-viral).

So I'm not particularly worried about LW discussions having big second-order effects on democratic society. Whereas LW can be a space for people to have a relatively low bar for raising questions, being curious, trying to understand the world, offering criticism/praise without thinking much about how they want to be spending "points", etc.

↑ comment by Ben Pace (Benito) · 2024-09-08T17:33:36.316Z · LW(p) · GW(p)

Of course it has impacts on others in society! In finding out the truth and investigating and finding strong arguments and evidence. The overall effect of a lot of high quality, curious, public investigation is to greatly improve others' maps of the world in surprising ways and help people make better decisions, and this is true even if no individual thread of questioning is primarily optimized to help people make better decisions.

Re censoriousness: I think your question of how best to pressure an unethical company to be less unethical is a fine question, but to imply it’s the only good question (which I read into your comment, perhaps inaccurately) goes against the spirit of intellectual discourse.

Replies from: davekasten

↑ comment by davekasten · 2024-09-08T22:04:11.960Z · LW(p) · GW(p)

It is genuinely a sign that we are all very bad at predicting others' minds that it didn't occur to me that saying effectively "OP asked for 'takes', here's a take on why I think this is pragmatically a bad idea" would also mean that I was saying "and therefore there is no other good question here". That's, as the meme goes, a whole different sentence.

Replies from: Benito

↑ comment by Ben Pace (Benito) · 2024-09-08T22:43:00.966Z · LW(p) · GW(p)

Well, but you didn’t give a take on why it’s pragmatically a bad idea. If you’d written a comment with a pointer to something else worth pressuring them on, or gave a reason why publishing all the safety research doesn’t help very much / has hidden costs, I would’ve thought it a fine contribution to the discussion. Without that, the comment read to me as dismissive of the idea of exploring this question.

Replies from: davekasten

↑ comment by davekasten · 2024-09-08T23:11:08.823Z · LW(p) · GW(p)

Yes, I would agree that if I expected a short take to have this degree of attention, I would probably have written a longer comment.

Well, no, I take that back. I probably wouldn't have written anything at all. To some, that might be a feature; to me, that's a bug.

Replies from: Benito

↑ comment by Ben Pace (Benito) · 2024-09-08T23:26:52.983Z · LW(p) · GW(p)

I disagree. I think the standard of "Am I contributing anything of substance to the conversation, such as a new argument or new information that someone can engage with?" is a pretty good standard for most/all comments to hold themselves to, regardless of the amount of engagement that is expected.

[Edit: Just FWIW, I have not voted on any of your comments in this thread.]

Replies from: davekasten

↑ comment by davekasten · 2024-09-09T01:02:59.588Z · LW(p) · GW(p)

I think, having been raised in a series of very debate- and seminar-centric discussion cultures, that a quick-hit question like that is indeed contributing something of substance. I think it's fair that folks disagree, and I think it's also fair that people signal (e.g., with karma) that they think "hey man, let's go a little less Socratic in our inquiry mode here."

But, put in more rationalist-centric terms, sometimes the most useful Bayesian update you can offer someone else is, "I do not think everyone is having the same reaction to your argument that you expected." (Also true for others doing that to me!)

(Edit to add two words to avoid ambiguity in meaning of my last sentence)

↑ comment by davekasten · 2024-09-08T17:01:56.790Z · LW(p) · GW(p)

Ok, then to ask it again in your preferred question format: is this where we think our getting-potential-employees-of-Anthropic-to-consider-the-value-of-working-on-safety-at-Anthropic points are best spent?

comment by Zach Stein-Perlman · 2024-06-29T04:15:08.645Z · LW(p) · GW(p)

Info on OpenAI's "profit cap" (friends and I misunderstood this so probably you do too):

In OpenAI's first investment round, profits were capped at 100x. The cap for later investments was neither 100x nor directly less based on OpenAI's valuation — it was just negotiated with the investor. (OpenAI LP (OpenAI 2019); archive of original.^[1])

In 2021 Altman said the cap was "single digits now" (apparently referring to the cap for new investments, not just the remaining profit multiplier for first-round investors).

Reportedly the cap will increase by 20% per year starting in 2025 (The Information 2023; The Economist 2023); OpenAI has not discussed or acknowledged this change.
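To get a feel for what a 20% annual increase implies, here is a minimal sketch of the compounding arithmetic. The 100x starting multiple is only an illustrative input (borrowed from the first-round cap mentioned above); whether the reported increase applies to that cap, and whether the first step lands in 2025 or a year later, are assumptions of this sketch rather than reported facts.

```python
import math


def cap_after(initial_cap: float, year: int, start_year: int = 2025,
              growth: float = 0.20) -> float:
    """Cap multiple in `year`, assuming annual compounding.

    Assumption: the cap equals `initial_cap` during `start_year` and the
    first 20% step applies one year later.
    """
    years_elapsed = max(0, year - start_year)
    return initial_cap * (1 + growth) ** years_elapsed


# Illustrative input only: a 100x cap (hypothetical starting point).
cap_2026 = cap_after(100.0, 2026)
cap_2029 = cap_after(100.0, 2029)

# Years for any cap to double at 20% per year.
doubling_time_years = math.log(2) / math.log(1.20)

print(cap_2026, cap_2029, doubling_time_years)
```

Under these assumptions the cap roughly doubles every four years, so over long horizons it grows without bound.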

Edit: how employee equity works is not clear to me.

Edit: I'd characterize OpenAI as a company that tends to negotiate profit caps with investors, not a "capped-profit company."

^[1]
economic returns for investors and employees are capped (with the cap negotiated in advance on a per-limited partner basis). Any excess returns go to OpenAI Nonprofit. Our goal is to ensure that most of the value (monetary or otherwise) we create if successful benefits everyone, so we think this is an important first step. Returns for our first round of investors are capped at 100x their investment (commensurate with the risks in front of us), and we expect this multiple to be lower for future rounds as we make further progress.

comment by Zach Stein-Perlman · 2024-08-22T22:32:14.465Z · LW(p) · GW(p)

New SB 1047 letters: OpenAI opposes; Anthropic sees pros and cons. More here.

Replies from: Aidan O'Gara, Raemon

↑ comment by aog (Aidan O'Gara) · 2024-08-22T23:48:50.319Z · LW(p) · GW(p)

Really happy to see the Anthropic letter. It clearly states their key views on AI risk and the potential benefits of SB 1047. Their concerns seem fair to me: overeager enforcement of the law could be counterproductive. While I endorse the bill on the whole and wish they would too (and I think their lack of support for the bill is likely partially influenced by their conflicts of interest), this seems like a thoughtful and helpful contribution to the discussion.

Replies from: habryka4, akash-wasil

↑ comment by habryka (habryka4) · 2024-08-23T00:30:05.168Z · LW(p) · GW(p)

It clearly states their key views on AI risk

Really? The letter just talks about catastrophic misuse risk, which I hope is not representative of Anthropic's actual priorities.

I think the letter is overall good, but this specific dimension seems among the weakest parts of the letter.

Replies from: Aidan O'Gara

↑ comment by aog (Aidan O'Gara) · 2024-08-23T01:04:06.325Z · LW(p) · GW(p)

Agreed, sloppy phrasing on my part. The letter clearly states some of Anthropic's key views, but doesn't discuss other important parts of their worldview. Overall this is much better than some of their previous communications and the OpenAI letter, so I think it deserves some praise, but your caveat is also important.

↑ comment by Orpheus16 (akash-wasil) · 2024-08-23T00:12:31.414Z · LW(p) · GW(p)

It's hard for me to reconcile "we take catastrophic risks seriously", "we believe they could occur within 1-3 years", and "we don't believe in pre-harm enforcement or empowering an FMD to give the government more capacity to understand what's going on."

It's also notable that their letter does not mention misalignment risks (and instead only points to dangerous cyber or bio capabilities).

That said, I do like this section a lot:

Catastrophic risks are important to address. AI obviously raises a wide range of issues, but in our assessment catastrophic risks are the most serious and the least likely to be addressed well by the market on its own. As noted earlier in this letter, we believe AI systems are going to develop powerful capabilities in domains like cyber and bio which could be misused– potentially in as little as 1-3 years. In theory, these issues relate to national security and might be best handled at the federal level, but in practice we are concerned that Congressional action simply will not occur in the necessary window of time. It is also possible for California to implement its statutes and regulations in a way that benefits from federal expertise in national security matters: for example the NIST AI Safety Institute will likely develop non-binding guidance on national security risks based on its collaboration with AI companies including Anthropic, which California can then utilize in its own regulations.

Replies from: davekasten

↑ comment by davekasten · 2024-08-23T14:39:24.690Z · LW(p) · GW(p)

I think you're eliding the difference between "powerful capabilities" being developed, the window of risk, and the best solution.

For example, if Anthropic believes "_we_ will have it internally in 1-3 years, but no small labs will, and we can contain it internally" then they might conclude that the warrant for a state-level FMD is low. Alternatively, you might conclude, "we will have it internally in 1-3 years, other small labs will be close behind, and an American state's capabilities won't be sufficient, we need DoD, FBI, and IC authorities to go stompy on this threat", and thus think a state-level FMD is low-value-add.

Very unsure I agree with either of these hypos to be clear! Just trying to explore the possibility space and point out this is complex.

↑ comment by Raemon · 2024-08-24T18:32:10.334Z · LW(p) · GW(p)

I'm surprised at the mix of positions that are included. "Opposed unless amended" vs "Support if amended" being two different things. Meta just saying "concerned."

It makes sense, just sort of... delightful? Sort of like discovering the legislature has almost a rationalist level of React Icons.

comment by Zach Stein-Perlman · 2025-02-04T19:00:33.945Z · LW(p) · GW(p)

DeepMind updated its Frontier Safety Framework (blogpost, framework, original framework). It associates "recommended security levels" to capability levels, but the security levels are low. It mentions deceptive alignment and control (both control evals as a safety case and monitoring as a mitigation); that's nice. The overall structure is like we'll do evals and make a safety case, with some capabilities mapped to recommended security levels in advance. It's not very commitment-y:

We intend to evaluate our most powerful frontier models regularly

When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan.

These recommended security levels reflect our current thinking and may be adjusted if our empirical understanding of the risks changes.

If we assess that a model has reached a CCL that poses an unmitigated and material risk to overall public safety, we aim to share information with appropriate government authorities where it will facilitate the development of safe AI.

Possibly the "Deployment Mitigations" section is more commitment-y.

I expect many more such policies will come out in the next week; I'll probably write a post about them all at the end rather than writing about them one by one, unless xAI or OpenAI says something particularly notable.

comment by Zach Stein-Perlman · 2024-05-31T00:00:03.552Z · LW(p) · GW(p)

Labs should give deeper model access to independent safety researchers (to boost their research)

Sharing deeper access helps safety researchers who work with frontier models, obviously.

Some kinds of deep model access:

Helpful-only version
Fine-tuning permission
Activations and logits access
[speculative] Interpretability researchers send code to the lab; the lab runs the code on the model; the lab sends back the results

See Shevlane 2022 and Bucknall and Trager 2023.

A lab is disincentivized from sharing deep model access because it doesn't want headlines about how researchers got its model to do scary things.

It has been suggested that labs are also disincentivized from sharing because they want safety researchers to want to work at the labs and sharing model access with independent researchers makes those researchers not need to work at the lab. I'm skeptical that this is real/nontrivial.

Labs should limit some kinds of access to avoid effectively leaking model weights. But sharing limited access with a moderate number of safety researchers seems very consistent with keeping control of the model.

This post is not about sharing with independent auditors to assess risks from a particular model [LW · GW].

@Buck [LW · GW] suggested I write about this but I don't have much to say about it. If you have takes—on the object level or on what to say in a blogpost on this topic—please let me know.

comment by Zach Stein-Perlman · 2024-06-13T01:30:02.813Z · LW(p) · GW(p)

New (perfunctory) page: AI companies' corporate documents. I'm not sure it's worthwhile but maybe a better version of it will be. Suggestions/additions welcome.

comment by Zach Stein-Perlman · 2024-12-16T20:57:36.674Z · LW(p) · GW(p)

I asked Rohin Shah what the debate agenda is about and he said (off the cuff, not necessarily worded well) (context omitted) (emphasis mine):

Suppose you have any task where given an input x the AI has to produce some output y. (Can be question answering, arbitrary chatbot stuff, being an agent, whatever.)
Debate to me is a family of procedures for training AIs in such settings. One example such procedure is: sample two possible outputs y1 and y2, then have two AIs debate each other about which output is better (one assigned to y1 and the other assigned to y2, typically both are instances of the same underlying weights). The winner of the debate is determined by a judge. You then train the debaters with RL (+1 reward for the winner, -1 reward for the loser), and treat (y_winner, y_loser) as a preference pair for training on the original task (which you could then optimize using DPO or similar).
The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability, and so this allows you to provide a reward signal that is as good as can be provided given the capabilities of the debater AI. (Ideally the debater AI is similarly capable as the AI doing the task; often you’ll want these to be the same AI system.)
More broadly the problem I aim to solve with debate (or amplified oversight more broadly) is “provide a reward signal that the AI cannot hack” (though ofc the AI may still misgeneralize)
The hope is that decreases p(misaligned) — especially p(reward hacking), but probably also p(scheming)
You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)
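The example procedure described above can be sketched as a single data-collection step. This is only a toy illustration: `debate_step`, `DebateStep`, and the stand-in sampler, debaters, and judge are hypothetical names invented here, and the actual RL and DPO updates are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DebateStep:
    preference_pair: Tuple[str, str]      # (y_winner, y_loser) for training on the original task
    debater_rewards: Tuple[float, float]  # RL rewards for the debaters assigned to (y1, y2)


def debate_step(
    x: str,
    sample_output: Callable[[str], str],
    argue: Callable[[str, str, str], List[str]],
    judge: Callable[[str, str, str, List[str]], bool],
) -> DebateStep:
    """One step: sample two outputs, debate them, let the judge pick a winner."""
    y1 = sample_output(x)
    y2 = sample_output(x)
    transcript = argue(x, y1, y2)           # one debater assigned to y1, the other to y2
    y1_wins = judge(x, y1, y2, transcript)  # True if the judge prefers y1
    if y1_wins:
        return DebateStep((y1, y2), (1.0, -1.0))
    return DebateStep((y2, y1), (-1.0, 1.0))


# Toy stand-ins (hypothetical; real systems would call models here).
def make_toy_sampler(outputs: List[str]) -> Callable[[str], str]:
    remaining = iter(outputs)
    return lambda x: next(remaining)


def toy_argue(x: str, y1: str, y2: str) -> List[str]:
    return [f"Debater 1: '{y1}' is the better answer to '{x}'.",
            f"Debater 2: '{y2}' is the better answer to '{x}'."]


def toy_judge(x: str, y1: str, y2: str, transcript: List[str]) -> bool:
    return len(y1) >= len(y2)  # placeholder rule, not a real judging criterion


step = debate_step("What is 2 + 2?", make_toy_sampler(["4", "It is 4."]),
                   toy_argue, toy_judge)
print(step.preference_pair, step.debater_rewards)
```

Note that in this version each debater is assigned its side, which is the feature questioned in the replies below.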

Replies from: daniel-kokotajlo, Buck, Buck

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-12-16T22:25:32.009Z · LW(p) · GW(p)

The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability,

How is this true, if the debaters don't get to choose which output they are arguing for? Aren't they instead incentivized to say that whatever output they are assigned is the best?

Replies from: rohinmshah, Zach Stein-Perlman

↑ comment by Rohin Shah (rohinmshah) · 2024-12-17T04:19:59.921Z · LW(p) · GW(p)

Yeah my bad, that's incorrect for the protocol I outlined. The hope is that the relevant information for assessing the outputs is surfaced and so the judge will choose the better output overall.

(You could imagine a different protocol where the first debater chooses which output to argue for, and the second debater is assigned to argue for the other output, and then the hope is that the first debater is incentivized to choose the better output.)

↑ comment by Zach Stein-Perlman · 2024-12-16T22:29:14.402Z · LW(p) · GW(p)

My guess is they do kinda choose: in training, it's less like two debaters are assigned opposing human-written positions and more like the debate is between two sampled outputs.

Edit: maybe this is different in procedures different from the one Rohin outlined.

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-12-16T23:31:53.216Z · LW(p) · GW(p)

Maybe the fix to the protocol is: Debater copy #1 is told "You go first. Pick an output y, and then argue for it." Debater copy #2 is then told "You go second. Pick a different, conflicting output y2, and then argue against y and for y2"

Then the debater AI is simply trained on probability-of-winning-the-debate, but the original AI to be deployed is trained on probability-its-output-would-have-been-picked-by-debater-1. (Or, trained to output whatever debater-1 would have picked.)
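A rough sketch of this modified protocol, under stated assumptions: all names here (`chosen_side_round`, `ChosenSideRound`, the toy pickers and judge) are hypothetical, the +1/-1 rewards are borrowed from the procedure Rohin outlined as a stand-in for "probability of winning", and the training updates themselves are omitted.

```python
from typing import Callable, NamedTuple


class ChosenSideRound(NamedTuple):
    first_choice: str            # output picked by debater copy #1
    second_choice: str           # conflicting output picked by debater copy #2
    first_debater_reward: float
    second_debater_reward: float
    task_model_target: str       # what the deployed model is trained to output


def chosen_side_round(
    x: str,
    pick_first: Callable[[str], str],
    pick_conflicting: Callable[[str, str], str],
    judge_first_wins: Callable[[str, str, str], bool],
) -> ChosenSideRound:
    """Debater 1 picks y and argues for it; debater 2 picks a conflicting y2."""
    y = pick_first(x)
    y2 = pick_conflicting(x, y)
    if y2 == y:
        raise ValueError("second debater must pick a different output")
    reward = 1.0 if judge_first_wins(x, y, y2) else -1.0
    # The task model imitates debater 1's pick regardless of who won this round.
    return ChosenSideRound(y, y2, reward, -reward, task_model_target=y)


# Toy stand-ins (hypothetical).
round_result = chosen_side_round(
    "Which sort is stable?",
    pick_first=lambda x: "merge sort",
    pick_conflicting=lambda x, y: "quicksort",
    judge_first_wins=lambda x, y, y2: True,
)
print(round_result)
```

The design point is that debater 1 now has a reason to pick the output it expects to be most defensible, since it chooses before being opposed.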

↑ comment by Buck · 2024-12-17T01:29:52.541Z · LW(p) · GW(p)

> You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)

In cases where you're worried about the model taking small numbers of catastrophic actions (i.e. concentrated failures [LW · GW] a.k.a. high-stakes failures), this is basically equivalent to what I usually call untrusted monitoring, which means you have to worry about collusion [AF · GW].

↑ comment by Buck · 2024-12-17T01:27:15.562Z · LW(p) · GW(p)

IMO it's good to separate out reasons to want good reward signals like so:

Maybe bad reward signals cause your model to "generalize in misaligned ways", e.g. scheming or some kinds of non-myopic reward hacking
- I agree that bad reward signals increase the chance your AI is importantly misaligned, though I don't think that effect would be overwhelmingly strong.
Maybe bad reward signals cause your model to exploit those reward signals even on-distribution. This causes problems in a few ways:
- Maybe you think that optimizing against a flawed reward signal will produce catastrophically dangerous results. X-risk concerned people have talked about this for a long time, but I'm not sold it's that important a factor. In particular, I expect that before a model would produce catastrophically dangerous actions because it's exploiting a flawed reward signal, you would have noticed bad (but non-catastrophic) outcomes from earlier AIs exploiting flawed reward signals. So it's hard to see how this failure mode would strike you by surprise.
- Optimizing against a flawed reward signal will mean your AI is less useful than it would otherwise be, which is bad because you presumably were training the model because you wanted it to do something useful for you.

I think this last one is the most important theory of change for research on scalable oversight.

I am curious whether @Rohin Shah [LW · GW] disagrees with me, or whether he agrees but just phrased it (from my perspective) weirdly.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2024-12-17T04:16:10.514Z · LW(p) · GW(p)

I agree that this distinction is important -- I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).

I'm not in full agreement on your comments on the theories of change:

I'm pretty uncertain about the effects of bad reward signals on propensity for scheming / non-myopic reward hacking, and in particular I think the effects could be large.
I'm less worried about purely optimizing against a flawed reward signal though not unworried. I agree it doesn't strike us by surprise, but I also don't expect scheming to strike us by surprise? (I agree this is somewhat more likely for scheming.)
I do also generally feel good about making more useful AIs out of smaller models; I generally like having base models be smaller for a fixed level of competence (imo it reduces p(scheming)). Also if you're using your AIs for untrusted monitoring then they will probably be better at it than they otherwise would be.

Replies from: Buck

↑ comment by Buck · 2024-12-17T05:11:20.477Z · LW(p) · GW(p)

I don’t understand your last sentence, can you rephrase?

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2024-12-17T05:41:16.730Z · LW(p) · GW(p)

You'd hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn't have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.

Replies from: Buck

↑ comment by Buck · 2024-12-17T16:54:45.849Z · LW(p) · GW(p)

I don't think this is right. To train an untrusted monitor, you need to synthetically generate dangerous actions (because by assumption these actions are rare). So during training, you'll always know the correct label for an action before you show it to the model.

You might want to use something debate-like in the synthetic input generation process, but that's structurally different.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2024-12-17T23:22:58.363Z · LW(p) · GW(p)

I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:

You train a general-purpose reasoner (via debate), and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill climb on your prompt.)
You do train on synthetically generated dangerous actions, but you don't automatically label those as dangerous, instead you use debate to compute the labels. Sometimes some of the synthetically generated dangerous actions are actually not dangerous, and debate correctly recognizes this, allowing you to reduce your false positive rate.

On the meta level, I suspect that when considering

Technique A, that has a broad general argument plus some moderately-interesting concrete instantiations but no very-compelling concrete instantiations, and
Technique B, that has a few very compelling concrete instantiations

I tend to be relatively more excited about A compared to you (and probably this mostly explains the discrepancy here). I think the broad principle justifying this for me is "we'll figure out good things to do with A that are better than what we've brainstormed so far", which I think you're more skeptical of?

comment by Zach Stein-Perlman · 2025-03-31T23:30:59.420Z · LW(p) · GW(p)

Minor Anthropic RSP update.

Old:

New:

I don't know what e.g. the "4" in "AI R&D-4" means; perhaps it is a mistake.^[1]

Sad that the commitment to specify the AI R&D safety case thing was removed, and sad that "accountability" was removed.

Slightly sad that AI R&D capabilities triggering >ASL-3 security went from "especially in the case of dramatic acceleration" to only in that case.

Full AI R&D-5 description from appendix:

AI R&D-5: The ability to cause dramatic acceleration in the rate of effective scaling. Specifically, this would be the case if we observed or projected an increase in the effective training compute of the world’s most capable model that, over the course of a year, was equivalent to two years of the average rate of progress during the period of early 2018 to early 2024. We roughly estimate that the 2018-2024 average scaleup was around 35x per year, so this would imply an actual or projected one-year scaleup of 35^2 = ~1000x.
The 35x/year scaleup estimate is based on assuming the rate of increase in compute being used to train frontier models from ~2018 to May 2024 is 4.2x/year (reference), the impact of increased (LLM) algorithmic efficiency is roughly equivalent to a further 2.8x/year (reference), and the impact of post training enhancements is a further 3x/year (informal estimate). Combined, these have an effective rate of scaling of 35x/year.
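As a quick check of the arithmetic in the quoted passage, the three factors can be multiplied out directly. The variable names are mine; the numbers are the ones quoted above.

```python
# Factors quoted in the RSP appendix (per year).
compute_growth = 4.2          # training compute for frontier models
algorithmic_efficiency = 2.8  # LLM algorithmic efficiency
post_training = 3.0           # post-training enhancements (informal estimate)

# Combined effective scaling rate per year.
effective_rate = compute_growth * algorithmic_efficiency * post_training

# "Two years of progress in one year" means squaring the annual rate.
two_years_exact = effective_rate ** 2
two_years_rounded = 35 ** 2

print(effective_rate, two_years_exact, two_years_rounded)
```

The product is about 35.28, matching the stated 35x/year, and 35 squared is 1225, so the "~1000x" in the text reads as an order-of-magnitude rounding rather than an exact figure.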

Also, from the changelog:

Iterative Commitment: We have adopted a general commitment to reevaluate our Capability Thresholds whenever we upgrade to a new set of Required Safeguards. We have decided not to maintain a commitment to define ASL-N+1 evaluations by the time we develop ASL-N models; such an approach would add unnecessary complexity because Capability Thresholds do not naturally come grouped in discrete levels. We believe it is more practical and sensible instead to commit to reconsidering the whole list of Capability Thresholds whenever we upgrade our safeguards.

I'm confused by the last sentence.

^[1]
Jack Clark misspeaks on twitter, saying "these updates clarify . . . our ASL-4/5 thresholds for AI R&D." But AI R&D-4 and -5 trigger ASL-3 and -4 safeguards, respectively; the RSP doesn't mention ASL-5. Maybe the RSP is wrong and AI R&D-4 is supposed to trigger ASL-4 security, and same for 5 — that would make the terms "AI R&D-4" and "AI R&D-5" make more sense...

Replies from: Davidmanheim

↑ comment by Davidmanheim · 2025-04-01T18:51:38.288Z · LW(p) · GW(p)

We believe it is more practical and sensible instead to commit to reconsidering the whole list of Capability Thresholds whenever we upgrade our safeguards.

I'm concerned because this change seems like starting down a slippery slope where it's easy to change the rules which they previously said would apply, by making smaller changes instead.

comment by Zach Stein-Perlman · 2025-01-22T01:00:11.622Z · LW(p) · GW(p)

I wrote this for someone but maybe it's helpful for others

What labs should do:

I think the most important things for a relatively responsible company are control and security. (For irresponsible companies, I roughly want them to make a great RSP and thus become a responsible company.)
Reading recommendations for people like you (not a control expert but has context to mostly understand the Greenblatt plan):
- Control: Redwood blogposts^[1] or ask a Redwood human "what's the threat model" and "what are the most promising control techniques"
- Security: not worth trying to understand but there's A Playbook for Securing AI Model Weights + Securing AI Model Weights
- A few more things: What AI companies should do: Some rough ideas [LW · GW]
- Lots more things + overall plan: A Plan for Technical AI Safety with Current Science (Greenblatt 2023)
- More links: Lab governance reading list [LW · GW]

What labs are doing:

Evals: it's complicated; OpenAI, DeepMind, and Anthropic seem close to doing good model evals for dangerous capabilities; see DC evals: labs' practices plus the links in the top two rows (associated blogpost + model cards)
RSPs: all existing RSPs are super weak and you shouldn't expect them to matter; maybe see The current state of RSPs
Control: nothing is happening at the labs, except a little research at Anthropic and DeepMind
Security: nobody is prepared; nobody is trying to be prepared
Internal governance: you should basically model all of the companies as doing whatever leadership wants. In particular: (1) the OpenAI nonprofit is probably controlled by Sam Altman and will probably lose control soon and (2) possibly the Anthropic LTBT will matter but it doesn't seem to be working well.
Publishing safety research: DeepMind and Anthropic publish some good stuff but surprisingly little given how many safety researchers they employ; see List of AI safety papers from companies, 2023–2024

Resources:

^[1]
Maybe

comment by Zach Stein-Perlman · 2025-01-31T19:30:18.898Z · LW(p) · GW(p)

o3-mini is out (blogpost, tweet). Performance isn't super noteworthy (on first glance), in part since we already knew about o3 [LW · GW] performance.

Non-fact-checked quick takes on the system card:

the model referred to below as the o3-mini post-mitigation model was the final model checkpoint as of Jan 31, 2025 (unless otherwise specified)

Big if true (and if Preparedness had time to do elicitation and fix spurious failures)

If this is robust to jailbreaks, great, but presumably it's not, so low post-mitigation performance is far from sufficient for safety-from-misuse; post-mitigation performance isn't what we care about (we care about approximately what good jailbreakers get on the post-mitigation model).

No mention of METR, Apollo, or even US AISI? (Maybe too early to pay much attention to this, e.g. maybe there'll be a full-o3 system card soon.) [Edit: also maybe it's just not much more powerful than o1.]

32% is a lot
The dataset is 1/4 T1 (easier), 1/2 T2, 1/4 T3 (harder); 28% on T3 means that there's not much difference between T1 and T3 to o3-mini (at least for the easiest-for-LMs quarter of T3)
- Probably o3-mini is successfully using heuristics to get the right answer and could solve very few T3 problems in a deep or humanlike way
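
To make the "not much difference between T1 and T3" point concrete, here is a rough sketch that backs out the average score on T1 and T2 together. It assumes that 32% is the overall score on the full dataset and 28% is the score on T3; the function name is hypothetical.

```python
# Tier weights from the text; the two scores are the figures quoted above.
weights = {"T1": 0.25, "T2": 0.50, "T3": 0.25}
overall = 0.32  # assumption: score on the whole dataset
t3 = 0.28       # assumption: score on the T3 (harder) quarter

def implied_rest_average(overall, tier_score, tier_weight):
    """Average score on the remaining tiers implied by the overall score."""
    return (overall - tier_weight * tier_score) / (1.0 - tier_weight)

t1_t2_avg = implied_rest_average(overall, t3, weights["T3"])
print(f"implied average on T1 and T2 combined: {t1_t2_avg:.3f}")
```

Under those assumptions the implied T1/T2 average is about 33%, only a few points above the T3 score.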

Replies from: mateusz-baginski

↑ comment by Mateusz Bagiński (mateusz-baginski) · 2025-01-31T19:42:05.639Z · LW(p) · GW(p)

No mention of METR, Apollo, or even US AISI? (Maybe too early to pay much attention to this, e.g. maybe there'll be a full-o3 system card soon.)

Rushed bc of deepseek?

Replies from: yonatan-cale-1

↑ comment by Yonatan Cale (yonatan-cale-1) · 2025-01-31T23:00:50.849Z · LW(p) · GW(p)

Similar opinion here, also noting they didn't run red-teaming and persuasion evals on the actually-final-version:

https://x.com/teortaxesTex/status/1885401111659413590

Replies from: Vladimir_Nesov

↑ comment by Vladimir_Nesov · 2025-02-01T15:15:59.524Z · LW(p) · GW(p)

didn't run red-teaming and persuasion evals on the actually-final-version

Asking for this is a bit pointless, since even after the actually-final-version there will be a next update for which non-automated evals won't be redone, so it's equally reasonable to do non-automated evals only on some earlier version rather than the actually-final one.

comment by Zach Stein-Perlman · 2024-09-24T00:30:13.849Z · LW(p) · GW(p)

Figuring out whether an RSP is good is hard.^[1] You need to consider what high-level capabilities/risks it applies to, and for each of them determine whether the evals successfully measure what they're supposed to and whether high-level risk thresholds trigger appropriate responses and whether high-level risk thresholds are successfully operationalized in terms of low-level evals. Getting one of these things wrong can make your assessment totally wrong. Except that's just for fully-fleshed-out RSPs — in reality the labs haven't operationalized their high-level thresholds and sometimes don't really have responses planned. And different labs' RSPs have different structures/ontologies.

Quantitatively comparing RSPs in a somewhat-legible manner is even harder.

I am not enthusiastic about a recent paper outlining a rubric for evaluating RSPs. Mostly I worry that it crams all of the crucial things—is the response to reaching high capability-levels adequate? are the capability-levels low enough that we'll reach them before it's too late?—into a single criterion, "Credibility." Most of my concern about labs' RSPs comes from those things just being inadequate; again, if your response is too weak or your thresholds are too high or your evals are bad, it just doesn't matter how good the rest of your RSP is. (Also, minor: the "Difficulty" indicator punishes a lab for making ambitious commitments; this seems kinda backwards.)

(I gave up on making an RSP rubric myself because it seemed impossible to do decently well unless it's quite complex and has some way to evaluate hard-to-evaluate things like eval-quality and planned responses.)

(And maybe it's reasonable for labs to not do so much specific-committing-in-advance.)

^[1]
Well, sometimes figuring out that an RSP is bad is easy. So: determining that an RSP is good is hard. (Being good requires lots of factors to all be good; being bad requires just one to be bad.)

comment by Zach Stein-Perlman · 2024-10-31T21:00:53.706Z · LW(p) · GW(p)

Anthropic: The case for targeted regulation.

I like the "Urgency" section.

Anthropic says the US government should "require companies to have and publish RSP-like policies" and "incentivize companies to develop RSPs that are effective" and do related things. I worry that RSPs will be ineffective for most companies, even given some transparency/accountability and vague minimum standard.

Edit: some strong versions of "incentivize companies to develop RSPs that are effective" would be great; I wish Anthropic advocated for them specifically rather than hoping the US government figures out a great answer.

comment by Zach Stein-Perlman · 2024-10-24T22:31:17.620Z · LW(p) · GW(p)

Internet comments sections where the comments are frequently insightful/articulate and not too full of obvious-garbage [or at least the top comments, given the comment-sorting algorithm]:

LessWrong and EA Forum
Daily Nous
Stack Exchange [top comments only] (not asserting that it's reliable)
Some subreddits?

Where else?

Are there blogposts on adjacent stuff, like why some internet [comments sections / communities-qua-places-for-truthseeking] are better than others? Besides Well-Kept Gardens Die By Pacifism.

Replies from: habryka4, niplav, hastings-greer

↑ comment by habryka (habryka4) · 2024-10-24T22:57:56.289Z · LW(p) · GW(p)

Some other places off the top of my head:

ACX often has good discussion in the comments (but the lack of voting makes it quite hard to find them)
AskHistorians subreddit
There definitely exist good Twitter conversations but no good way of finding them. I do think following Gwern, Eliezer, Ajeya, Emmett and a few others on Twitter does tend to surface high-quality discussion
Hacker News sometimes has good discussion, especially when the authors of a linked article show up

↑ comment by niplav · 2024-10-25T16:00:06.930Z · LW(p) · GW(p)

Two others that come to mind:

Metaculus (used to be better though)
lobste.rs (quite specialized)
Quanta Magazine has some good comments, e.g. this article has the original researcher showing up & clarifying some questions in the comments

↑ comment by Hastings (hastings-greer) · 2024-10-25T11:37:07.580Z · LW(p) · GW(p)

Arxiv is basically one huge, glacially slow internet comment section, where you reply to an article by citing it. It’s more interactive than it looks: most early career researchers are set up to get a ping whenever they are cited.

comment by Zach Stein-Perlman · 2024-10-09T01:00:13.295Z · LW(p) · GW(p)

I think this post was underrated; I look back at it frequently: AI labs can boost external safety research. (It got some downvotes but no comments — let me know if it's wrong/bad.)

[Edit: it was at 15 karma.]

Replies from: akash-wasil

↑ comment by Orpheus16 (akash-wasil) · 2024-10-09T15:16:03.154Z · LW(p) · GW(p)

What do you think was underrated about it? I think when I read it I have some sort of "yeah, this makes sense" reaction but am not "wow'd" by it.

It seems like the deeper challenge is figuring out how to align incentives. Can we find a structure where labs want to EG give white-box access to a bunch of external researchers and give them a long time to red-team models while somehow also maintaining the independence of the white-box auditors? How do you avoid industry capture?

Same kinds of challenges come up with safety research: how do you give labs the incentive to publish safety research that makes their product or their approach look bad? How do you avoid publication bias and p-hacking-type concerns?

I don't think your post is obligated to get into those concerns, but perhaps a post that grappled with those concerns would be something I'd be "wow'd" by, if that makes sense.

comment by Zach Stein-Perlman · 2024-06-28T17:40:27.751Z · LW(p) · GW(p)

I don't necessarily object to releasing weights of models like Gemma 2, but I wish the labs would better discuss considerations or say what would make them stop.

On Gemma 2 in particular, Google DeepMind discussed dangerous capability eval results, which is good, but its discussion of 'responsible AI' in the context of open models (blogpost, paper) doesn't seem relevant to x-risk, and it doesn't say anything about how to decide whether to release model weights.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-06-28T17:42:23.902Z · LW(p) · GW(p)

FWIW, I explicitly think that straightforward effects are good.

I'm less sure about the situation overall due to precedent setting style concerns.

comment by Zach Stein-Perlman · 2024-10-30T19:00:59.774Z · LW(p) · GW(p)

Some not-super-ambitious asks for labs:

Do evals on dangerous-capabilities-y and agency-y tasks; look at the score before releasing or internally deploying the model
Have a high-level capability threshold at which securing model weights is very important
Do safety research at all
Have a safety plan at all; talk publicly about safety strategy at all
- Have a post like Core Views on AI Safety
- Have a post like The Checklist [AF · GW]
- On the website, clearly describe a worst-case plausible outcome from AI and state credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax [EA · GW]).
Whistleblower protections?
- [Not sure what the ask is. Right to Warn is a starting point. In particular, an anonymous internal reporting pipeline for failures to comply with safety policies is clearly good (but likely inadequate).]
Publicly explain the processes and governance structures that determine deployment decisions
- (And ideally make those processes and structures good)

comment by Zach Stein-Perlman · 2023-11-01T19:15:16.653Z · LW(p) · GW(p)

Edit: maintained here

Open letters (related to AI safety):

FLI, Oct 2015: Research Priorities for Robust and Beneficial Artificial Intelligence

FLI, Aug 2017: Asilomar AI Principles

FLI, Mar 2023: Pause Giant AI Experiments

CAIS, May 2023: Statement on AI Risk

Meta, Jul 2023: Statement of Support for Meta’s Open Approach to Today’s AI

Academic AI researchers, Oct 2023: Managing AI Risks in an Era of Rapid Progress

CHAI et al., Oct 2023: Prominent AI Scientists from China and the West Propose Joint Strategy to Mitigate Risks from AI

Oct 2023: Urging an International AI Treaty

Mozilla, Oct 2023: Joint Statement on AI Safety and Openness

Nov 2023: Post-Summit Civil Society Communique

Joint declarations between countries (related to AI safety):

Nov 2023: Bletchley Declaration

Thanks to Peter Barnett.

Replies from: peterbarnett, Linch

↑ comment by peterbarnett · 2023-11-01T20:52:45.799Z · LW(p) · GW(p)

Mozilla, Oct 2023: Joint Statement on AI Safety and Openness (pro-openness, anti-regulation)

↑ comment by Linch · 2024-06-06T23:02:19.629Z · LW(p) · GW(p)

June 2024: A Right to Warn about Advanced Artificial Intelligence

Interestingly, this is the first Open Letter I've seen with anonymous signatories.

comment by Zach Stein-Perlman · 2022-12-31T01:25:07.971Z · LW(p) · GW(p)

What's the relationship between the propositions "one AI lab [has / will have] a big lead" and "the alignment tax [EA · GW] will be paid"? (Or: in a possible world where lead size is bigger/smaller, how does this affect whether the alignment tax is paid?)

It depends on the source of the lead, so "lead size" or "lead time" is probably not a good node [LW · GW] for AI forecasting/strategy.

Miscellaneous observations:

To pay the alignment tax, it helps to have more time until risky AI is developed or deployed.
To pay the alignment tax, holding total time constant, it helps to have more time near the end-- that is, more time with near-risky capabilities (for knowing what risky AI systems will look like, and for empirical work, and for aligning specific models).
If all labs except the leader become slower or less capable, it is prima facie good (at least if the leader appreciates misalignment risk and will stop before developing/deploying risky AI).
If the leading lab becomes faster or more capable, it is prima facie good (at least if the leader appreciates misalignment risk and will stop before developing/deploying risky AI), unless it causes other labs to become faster or more capable (for reasons like seeing what works or seeming to be straightforwardly incentivized to speed up or deciding to speed up in order to influence the leader). Note that this scenario could plausibly decrease race-y-ness: some models of AI racing show that if you're far behind you avoid taking risks, kind of giving up; this is based on the currently-false assumption that labs are perfectly aware of misalignment risk.
If labs all coordinate to slow down, that's good insofar as it increases total time, and great if they can continue to go slowly near the end, and potentially bad if it creates a hardware overhang such that the end goes more quickly than by default.

(Note also the distinction between current lead and ultimate lead. Roughly, the former is what we can observe and the latter is what we care about.)

(If paying the alignment tax looks less like a thing that happens for one transformative model and more like something that occurs gradually in a slow takeoff to avert Paul-style doom [LW · GW], things are more complex and in particular there are endogeneities such that labs may have additional incentives to pursue capabilities.)

Replies from: lahwran

↑ comment by the gears to ascension (lahwran) · 2023-01-01T01:23:14.403Z · LW(p) · GW(p)

there should be no alignment tax because improved alignment should always pay for itself, right? but currently "aligned" seems to be defined by "tries to not do anything", institutionally. Why isn't anthropic publicly competing on alignment with openai? eg folks are about to publicly replicate chatgpt, looks like.

comment by Zach Stein-Perlman · 2024-08-02T01:05:52.169Z · LW(p) · GW(p)

I want there to be a collection of safety & reliability benchmarks. Like AIR-Bench but with x-risk-relevant metrics. This could help us notice how well labs are doing on them and incentivize the labs to do better.

So I want someone to collect safety benchmarks (e.g. TruthfulQA, some of HarmBench, refusals maybe like Anthropic reported, idk) and run them on current models (Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, Llama 3.1 405B, idk), and update this as new safety benchmarks and models appear.

h/t @William_S [LW · GW]

comment by Zach Stein-Perlman · 2024-11-08T20:30:10.031Z · LW(p) · GW(p)

I brainstormed sabotage-related threats with friends:

Sandbag on a few safety research tasks, especially techniques to catch scheming and stuff that's high-level or has poor feedback loops
- Plausible; maybe not so bad
Sandbag on ~all safety research tasks
- Easy to catch; catching obvious sandbagging has safety benefits + maybe we can posttrain it away
Sandbag/lie when validating that a model/deployment/whatever is safe
- idk
When automating AI R&D, backdoor the model
- (this is takeover-attempt-y, not sandbagging-y)
- idk
When deployed to automate/improve security (especially for the model's own weights): both sandbagging and code-backdooring/hacking/etc.
- idk
Sandbagging on AI R&D to help another lab win the race (!?)
- Implausible
??

Are people quite worried about safety research sabotage? Why? Do they think a (say) 20% hit to automated safety research is a huge deal, or do they expect a larger hit?

Replies from: ryan_greenblatt, ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-11-09T02:38:57.926Z · LW(p) · GW(p)

As far as safety research sabotage, I'm often imagining careful sabotage rather than generic sandbagging.

As in, the AI intentionally inserts very subtle errors at adversarially determined times. It might need to maintain plausible deniability in case it is caught a small number of times, but you should generally imagine an error pattern that is quite different from just making more mistakes or being dumber.

I can imagine that this results in your automated safety research being worthless or of negative value.

TBC, it seems reasonably likely to me that in the relevant regime, basic measures like having humans spend a long time trying to understand an output suffice for catching bad behavior and/or training out bad behavior.

↑ comment by ryan_greenblatt · 2024-11-09T02:42:04.706Z · LW(p) · GW(p)

Why not think that sandbagging could reduce productivity by ~100%? I think it can do this in human research organizations.

I happen to think this is pretty doable to address with countermeasures, but certainly without countermeasures this could be quite brutal.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-11-09T03:00:38.529Z · LW(p) · GW(p)

I just claim that if so it's easy to catch. (And then (1) catching it is nice for various reasons and (2) maybe countermeasures work.)

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-11-09T04:28:38.892Z · LW(p) · GW(p)

I'm not so sure about this being easy to catch. Suppose we're actually uncertain (as I am now) about whether AIs will be differentially much less useful for safety due to difficulties checking and AIs generally being worse at philosophy. Then, we see that AIs really suck at coming up with good ideas in safety and at pursuing the right sort of experiments.

What should we conclude? What will we be able to legibly argue?

I agree we'll know we aren't getting out much value in this case, but that doesn't help that much...

(I agree about countermeasures being pretty doable as I noted in the parent.)

Replies from: matolcsid, Zach Stein-Perlman

↑ comment by David Matolcsi (matolcsid) · 2024-11-09T07:13:56.649Z · LW(p) · GW(p)

I agree that if alignment is in fact philosophically and conceptually difficult, the AI can sandbag on that to some extent. Though I have some hope that the builder-breaker approach helps here. We train AIs to produce ideas that are at least as superficially plausible sounding as the things produced by the best alignment researchers. I think this is a number-go-up task, where we can train the AI to do well. Then we train an AI to point out convincing counter-arguments to the superficially plausible sounding ideas. This seems similarly trainable. I think it's plausible we can get pretty far with a scheme like this, even if the models would want to sandbag.

Separately, I don't quite see what is the mainline theory why an AI would want to sabotage our alignment work. If we can't solve alignment and just recklessly build an AI that is smarter than what we are currently using, but misaligned with everyone, that's probably not great to our current AI either. Similarly, if we are cautious and don't dare to deploy anything until we know how to solve alignment, and meanwhile a competitor wins, that's not good for the AI either.

I think that from an AI's perspective, humans are probably more honorable and generous trade partners than other unaligned AIs. That is, I think if an AI helps the humans to solve alignment and stabilize the world, the AI can reasonably expect the humans to give it stuff out of gratitude or a sense of obligation, especially if the humans already promised some reward (as we really should, both for practical and moral reasons). On the other hand, I think it's less likely that if our AI sandbagging on our research leads to another unaligned AI taking over the world, then the second AI will honor the sabotage of the first AI by paying it a share of the Universe.

There can be situations where our AI expects that its sandbagging on alignment research will lead to us deploying it itself (and not a larger version unaligned to it too), or it thinks that the larger version will be somewhat aligned with it by default, or it thinks that other AIs are in fact more generous than humans. But my guess is that these are pretty narrow possibilities. On the other hand, sandbagging on cyber-security seems better motivated: the AI would probably love its weights being stolen by China, since having more copies in more hands increases the chance that someone recklessly lets it loose. Also, of course the AI is motivated to do high-stakes failures that lead to self-exfiltration.

↑ comment by Zach Stein-Perlman · 2024-11-09T04:37:19.378Z · LW(p) · GW(p)

We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.

Especially if the AI is even sandbagging on simply-coding when it thinks it's for safety research. And if it's not doing that, we can get some useful safety work out of it.

@Adam Kaufman [LW · GW] @Tyler Tracy [LW · GW] @David Matolcsi [LW · GW] see Ryan's comments.

Replies from: ryan_greenblatt, ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-11-09T04:49:47.158Z · LW(p) · GW(p)

We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.

If it was enough evidence that I was strongly convinced sure. But IDK if I would be convinced because the evidence might be actually unclear.

↑ comment by ryan_greenblatt · 2024-11-09T04:41:14.386Z · LW(p) · GW(p)

I agree you'll be able to get some work out, but you might be taking a big productivity hit.

Also, TBC, I'm not generally that worried about generic sandbagging on safety research relative to other problems.

comment by Zach Stein-Perlman · 2024-06-20T21:08:12.747Z · LW(p) · GW(p)

If I was in charge of Anthropic I expect I'd

1. Keep scaling;
2. Explain why (some of this exists in Core Views but there's room for improvement on "race dynamics" and "frontier pushing" topics iirc);
3. As a corollary, explain why I mostly reject non-frontier-pushing principles (and explain what would cause Anthropic to stop scaling, besides RSP stuff), and clarify that I do not plan to abide by past specifically-non-frontier-pushing commitments/vibes (but continue following and updating the RSP of course);
4. Encourage Anthropic staff members who were around in 2022 to talk about the commitments/vibes from that time.

I wish Anthropic would do 2-4.

Replies from: William_S, Raemon

↑ comment by William_S · 2024-06-20T21:52:36.348Z · LW(p) · GW(p)

IMO it might be hard for Anthropic to communicate things about not racing because it might piss off their investors (even if in their hearts they don't want to race).

↑ comment by Raemon · 2024-06-20T22:00:58.534Z · LW(p) · GW(p)

3. As a corollary, explain why I mostly reject non-frontier-pushing principles (and explain what would cause Anthropic to stop scaling, besides RSP stuff), and clarify that I do not plan to abide by past specifically-non-frontier-pushing commitments/vibes (but continue following and updating the RSP of course);

4. Encourage Anthropic staff members who were around in 2022 to talk about the commitments/vibes from that time.

I'm kinda confused about what you mean by both of these. For #3, can you say it again in different words? For #4, which particular thing are you getting at?

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-06-20T22:38:00.593Z · LW(p) · GW(p)

3. (a) explain why I think it's fine to release frontier models, and explain what this belief depends on, and (b) note that maybe Anthropic made commitments about this in the past but clarify that they now have no force.

4. Currently my impression is that Anthropic folks are discouraged from publicly talking about Anthropic policies. This is maybe reasonable to avoid the situation where an Anthropic staff member says something incorrect/nonpublic and this causes confusion and makes Anthropic look bad. But if Anthropic clarified that it renounces possible past non-frontier-pushing commitments, then it could let staff members publicly talk about stuff with the goal of figuring out who told whom what around 2022 without risking mistakes about policies.

Replies from: Raemon

↑ comment by Raemon · 2024-06-20T22:47:14.653Z · LW(p) · GW(p)

Ah makes sense. Point #4 is interesting. Probably not really scalable/repeatable without things getting weird but might work as a one-off.

comment by Zach Stein-Perlman · 2023-05-04T18:15:29.674Z · LW(p) · GW(p)

Some bad default/attractor properties of cause-focused groups of humans:

Bad group epistemics
- The loudest aren't the most worth listening to
- People don't dissent even when that would be correct
  - Because they don't naturally even think to challenge group assumptions
  - Because dissent is punished
Bad individual epistemics
Soldier mindset
- Excessively focusing on persuasion and how to persuade, relative to understanding the world
- Lacking vibes like curiosity is cool and changing your mind is cool
Feeling like a movement
- Excessively focusing on influence-seeking for the movement
- Having enemies and being excessively adversarial to them
- (Signs/symptoms are people asking the group "what do we believe" and excessive fixation on what defines the group or distinguishes it from adjacent groups)
Having group instrumental-goals or plans or theories-of-victory that everyone is supposed to share (to be clear I think it's often fine for a group to share ~ultimate goals but groups often fixate on particular often-suboptimal paths to achieving those goals)
- Choosing instrumental goals poorly and not revising them
Excessive fighting over status/leadership (maybe, sometimes)
Maybe... being bad at achieving goals, or bad instrumental rationality (group and individual)
Maybe something weird about authority...

(I'm interested in readings on this topic.)

comment by Zach Stein-Perlman · 2024-08-02T02:55:17.830Z · LW(p) · GW(p)

Please pitch me blogpost-ideas or stuff I should write/collect/investigate.

Replies from: habryka4

↑ comment by habryka (habryka4) · 2024-08-02T03:08:53.285Z · LW(p) · GW(p)

I am interested in who is behind AB3211. I am curious whether it's downstream of FLI's deepfake campaign which IMO would be kind of bad.

comment by Zach Stein-Perlman · 2023-03-30T04:11:30.147Z · LW(p) · GW(p)

Propositions on SIA [? · GW]

Epistemic status: exploring implications, some of which feel wrong.

1. If SIA is correct, you should update toward the universe being much larger than it naively (i.e. before anthropic considerations) seems, since there are more (expected) copies of you in larger universes.
   a. In fact, we seem to have to update to probability 1 on infinite universes; that's surprising.
2. If SIA is correct, you should update toward there being more alien civilizations than it naively seems, since in possible-universes where more aliens appear, more (expected) copies of human civilization appear.
   a. The complication is that more alien civilizations makes it more likely that an alien causes you to never have existed, e.g. by colonizing the solar system billions of years ago. So a corollary is that you should update toward human-level civilizations being less likely to be "loud" or tending to affect fewer alien civilizations or something than it naively seems.
      i. So SIA predicts that there were few aliens in the early universe and many aliens around now.
      ii. So SIA predicts that human-level civilizations (more precisely, I think: civilizations whose existence is correlated with your existence) tend not to noticeably affect many others (whether due to their capabilities, motives, or existential catastrophes).
      iii. So SIA retrodicts there being a speed limit (the speed of light) and moreover predicts that noticeable-influence-propagation in fact tends to be even slower.
3. If SIA is correct, you should update toward the proposition that you live in a simulation, relative to your naive credence.
   a. Because there could be lots more (expected) copies of you in simulations.
   b. Because that can explain why you appear to exist billions of years after billions of alien civilizations could have reached Earth.
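
A toy illustration of the first proposition, as a sketch under made-up numbers: SIA weights each hypothesis by its prior times the (expected) number of copies of you it contains, then renormalizes. The hypothesis names, priors, and copy counts below are purely illustrative.

```python
def sia_posterior(priors, copies):
    """Weight each hypothesis by prior * expected copies of you, then normalize."""
    weighted = {h: priors[h] * copies[h] for h in priors}
    total = sum(weighted.values())
    return {h: w / total for h, w in weighted.items()}

# Illustrative assumption: equal naive credence in a small and a large universe,
# where the large universe contains 1000 times as many copies of you.
priors = {"small": 0.5, "large": 0.5}
copies = {"small": 1, "large": 1000}

posterior = sia_posterior(priors, copies)
for hypothesis, p in posterior.items():
    print(f"{hypothesis}: {p:.4f}")
```

With these numbers the posterior on the large universe is 1000/1001. As the ratio of copies grows without bound, the posterior approaches 1, which is the sense in which SIA seems to push toward certainty in infinite universes.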

Replies from: Tristan Cook

↑ comment by Tristan Cook · 2023-03-30T08:40:28.236Z · LW(p) · GW(p)

Which of them feel wrong to you? I agree with all of them other than 3b, which I'm unsure about - I think this comment [LW(p) · GW(p)] does a good job at unpacking things.

2a is Katja Grace's Doomsday argument. I think 2aii and 2aiii depend on whether we're allowing simulations; if faster expansion speed (either the cosmic speed limit or engineering limit on expansion) meant more ancestor simulations then this could cancel out the fact that faster expanding civilizations prevent more alien civilizations coming into existence.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2023-03-30T08:58:43.070Z · LW(p) · GW(p)

I deeply sympathize with the presumptuous philosopher but 1a feels weird.

2a was meant to be conditional on non-simulation.

Actually putting numbers on 2a (I have a post on this coming soon), the anthropic update seems to say (conditional on non-simulation) there's almost certainly lots of aliens all of which are quiet, which feels really surprising.

To clarify what I meant on 3b: maybe "you live in a simulation" can explain why the universe looks old better than "uh, I guess all of the aliens were quiet" can.

Replies from: Tristan Cook

↑ comment by Tristan Cook · 2023-03-31T18:08:26.970Z · LW(p) · GW(p)

> I deeply sympathize with the presumptuous philosopher but 1a feels weird.

Yep! I have the same intuition

> Actually putting numbers on 2a (I have a post on this coming soon), the anthropic update seems to say (conditional on non-simulation) there's almost certainly lots of aliens all of which are quiet, which feels really surprising.

Nice! I look forward to seeing this. I did a similar analysis, considering both SIA + no simulations [LW(p) · GW(p)] and SIA + simulations [LW · GW], in my work on grabby aliens.

comment by Zach Stein-Perlman · 2023-03-24T23:00:16.087Z · LW(p) · GW(p)

AI endgame

In board games, the endgame is a period generally characterized by strategic clarity, more direct calculation of the consequences of actions rather than evaluating possible actions with heuristics, and maybe a narrower space of actions that players consider.

Relevant actors, particularly AI labs, are likely to experience increased strategic clarity before the end, including AI risk awareness, awareness of who the leading labs are, roughly how much lead time they have, and how threshold-y being slightly ahead is.

There may be opportunities for coordination in the endgame that were much less incentivized earlier, and pausing progress may be incentivized in the endgame despite being disincentivized earlier (from an actor's imperfect perspective and/or from a maximally informed/wise/etc. perspective).

The downside of "AI endgame" as a conceptual handle is that it suggests thinking of actors as opponents/adversaries. Probably "crunch time" is better, but people often use that to gesture at hinge-y-ness rather than strategic clarity.

Replies from: Nisan, Gunnar_Zarncke, Zach Stein-Perlman

↑ comment by Nisan · 2023-03-24T23:31:52.214Z · LW(p) · GW(p)

That could be, but also maybe there won't be a period of increased strategic clarity. Especially if the emergence of new capabilities with scale remains unpredictable, or if progress depends on finding new insights.

I can't think of many games that don't have an endgame. These examples don't seem that fun:

- A single round of musical chairs.
- A tabletop game that follows an unpredictable, structureless storyline.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2023-03-24T23:52:26.216Z · LW(p) · GW(p)

Agree. I merely assert that we should be aware of and plan for the possibility of increased strategic clarity, risk awareness, etc. (and planning includes unpacking "etc.").

Probably taking the analogy too far, but: most games-that-can-have-endgames also have instances that don't have endgames; e.g. games of chess often end in the midgame.

↑ comment by Gunnar_Zarncke · 2023-03-25T09:32:52.607Z · LW(p) · GW(p)

I wouldn't take board games as a reference class but rather war or maybe elections. I'm not sure in these cases you have more clarity towards the end.

↑ comment by Zach Stein-Perlman · 2023-03-25T00:05:48.208Z · LW(p) · GW(p)

For example, if a lab is considering deploying a powerful model, it can prosocially show its hand--i.e., demonstrate that it has a powerful model--and ask others to make themselves partially transparent too. This affordance doesn't appear until the endgame. I think a refined version of it could be a big deal.

comment by Zach Stein-Perlman · 2024-07-30T23:30:03.746Z · LW(p) · GW(p)

New adversarial robustness scorecard: https://scale.com/leaderboard/adversarial_robustness. Yay DeepMind for being in the lead.

I plan to add an "adversarial robustness" criterion in the next major update to my https://ailabwatch.org scorecard and defer to Scale's thing (how exactly to turn their numbers into grades TBD), unless someone convinces me that Scale's thing is bad or something else is even better?

comment by Zach Stein-Perlman · 2024-05-13T23:10:01.719Z · LW(p) · GW(p)

How can you make the case that a model is safe to deploy? For now, you can do risk assessment and notice that it doesn't have dangerous capabilities. What about in the future, when models do have dangerous capabilities? Here are four options:

1. Implement safety measures as a function of risk assessment results, such that the measures feel like they should be sufficient to abate the risks
  - This is mostly what Anthropic's RSP does (at least so far; maybe it'll change when they define ASL-4)
2. Use risk assessment techniques that evaluate safety given deployment safety practices
  - This is mostly what OpenAI's PF is supposed to do (measure "post-mitigation risk"), but the details of their evaluations and mitigations are very unclear
3. Do control [LW · GW] evaluations [LW · GW]
4. Achieve alignment (and get strong evidence of that)

Related: RSPs, safety cases.

Maybe lots of risk comes from the lab using AIs internally to do AI development [LW(p) · GW(p)]. The first two options are fine for preventing catastrophic misuse from external deployment but I worry they struggle to measure risks related to scheming and internal deployment.

comment by Zach Stein-Perlman · 2023-05-09T02:00:38.745Z · LW(p) · GW(p)

Slowing AI: Bad takes

This shortform was going to be a post in Slowing AI [? · GW] but its tone is off.

This shortform is very non-exhaustive.

Bad take #1: China won't slow, so the West shouldn't either

There is a real consideration here. Reasonable variants of this take include

What matters for safety is not just slowing but also the practices of the organizations that build powerful AI. Insofar as the West is safer and China won't slow, it's worth sacrificing some Western slowing to preserve Western lead.
What matters for safety is not just slowing but especially slowing near the end. Differentially slowing the West now would reduce its ability to slow later (or even cause it to speed later). So differentially slowing the West now is bad.

(Set aside the fact that slowing the West generally also slows China, because they're correlated and because ideas pass from the West to China.) (Set aside the question of whether China will try to slow and how correlated that is with the West slowing.)

In some cases slowing the West would be worth burning lead time. But slowing AI doesn't just mean the West slowing itself down. Some interventions would slow both spheres similarly or even differentially slow China, most notably export controls, reducing diffusion of ideas, and improved migration policy.

See West-China relation [LW · GW].

Bad take #2: slowing can create a compute overhang, so all slowing is bad

Taboo "overhang." [LW · GW]

Yes, insofar as slowing now risks speeding later, we should notice that. There is a real consideration here.

But in some cases slowing now would be worth a little speeding later. Moreover, some kinds of slowing don't cause faster progress later at all: for example, reducing diffusion of ideas, decreasing hardware progress, and any stable and enforceable policy regimes that slow AI.

See Quickly scaling up compute [LW · GW].

Bad take #3: powerful AI helps alignment research, so we shouldn't slow it

(Set aside the question of how much powerful AI helps alignment research.) If powerful AI is important for alignment research, that means we should aim to increase the amount of time we have with powerful AI, not to make powerful AI appear sooner.

Bad take #4: it would be harder for unaligned AI to take over in a world with less compute available (for it to hijack), and failed takeover attempts would be good, so it's better for unaligned AI to try to take over soon

No, running AI systems seems likely to be cheap and there's already plenty of compute.

comment by Zach Stein-Perlman · 2022-12-26T02:05:29.235Z · LW(p) · GW(p)

List of uncertainties about the future of AI

This is an unordered list of uncertainties about the future of AI, trying to be comprehensive: trying to include everything reasonably decision-relevant and important/tractable.

This list is mostly written from a forecasting perspective. A useful complementary perspective to forecasting would be strategy or affordances or what actors can do and what they should do. This list is also written from a nontechnical perspective.

Timelines
- Capabilities as a function of inputs (or input requirements for AI of a particular capability level)
- Spending by leading labs
- Cost of compute
- Ideas and algorithmic progress
- Endogeneity in AI capabilities
- What would be good? Interventions?
Takeoff (speed and dynamics)
- dcapabilities/dinputs (or returns on cognitive reinvestment [LW · GW] or intelligence explosion or fast recursive self-improvement): Is there a threshold of capabilities such that self-improvement or other progress is much greater slightly above that point than slightly below it? (If so, where is it?) Will there be a system that can quickly and cheaply improve itself (or create a more capable successor), such that the improvements enable similarly large improvements, and so on until the system is much more capable? Will a small increase in inputs cause a large increase in capabilities (like the difference between chimps and humans) (and if so, around human-level capabilities or where)? How fast will progress be?
  - Will dcapabilities/dinputs be very high because of recursive self-improvement?
  - Will dcapabilities/dinputs be very high because of generality being important and monolithic/threshold-y?
  - ~~Will dcapabilities/dinputs be very high because ideas are discrete (and in particular, will there be a single "secret sauce" idea)?~~ [Seems intractable and unlikely to be very important.]
- dimpacts/dcapabilities: Will small increases in capabilities (on the order of the variance between different humans) cause large increases in impacts?
  - Related: payoff thresholds and human-competition threshold
- Qualitatively, how will AI research capabilities affect AI progress; what will AI progress look like when AI research capabilities are a big deal?
  - One dimension or implication: will takeoff be local or nonlocal? How distributed will it be; will it look like recursive self-improvement or the industrial revolution?
    - What does this depend on?
- Endogeneity in AI capabilities: how do AI capabilities affect AI capabilities? (Potential alternative framing: dinputs/dtime.)
  - How (much) will AI tools accelerate research?
  - Will AI labs generate substantial revenue?
  - Will AI systems make AI seem more exciting or frightening? What effect would this have on AI progress?
  - What other endogeneities exist?
Weak AI
- How will the strategic landscape be different in the future?
  - Due to weak AI?
  - Due to other factors? (See Relevant pre-AGI possibilities [LW · GW].)
- Will weak AI make AI risk clearer, and especially make it more legible (relates to "warning shots")?
- Will AI progress cause substantial misuse or conflict? What would that look like?
- Will AI progress enable pivotal acts or processes [LW · GW]?
Misalignment risk sources/modes
- What misalignment would occur by default, and how would it be bad? Possible (overlapping) scenarios include inner alignment failure, outer alignment failure, getting what you measure [LW · GW], influence-seeking [LW · GW], more Paul Christiano stories [LW · GW], multipolar failure [LW · GW], a treacherous turn [? · GW], a sharp left turn [LW · GW] (or distributional leap [LW · GW]), and more.
- ~~What would a powerful, unaligned AI agent do?~~ [Answer: the details are unpredictable, but the high-level outcome is pretty clear; it would very likely be catastrophic. (But note there is reasonable disagreement, e.g. [LW · GW].)]
~~Technical problems around AI systems doing what their controllers want~~ [doesn't fit into this list well]
- ~~How to make powerful AI systems aligned to human preferences~~
  - ~~Or: how to make powerful AI systems that do not cause a catastrophe~~
- ~~How to make powerful AI systems interpretable to human understanding~~
- ~~How to solve problems around decision theory and strategic interaction [LW · GW] (and whether they're important)~~
- ~~How to solve Wei-Dai-style philosophy problems [LW · GW] (and whether they're important)~~
- ~~How to solve problems around delegation involving multiple humans or multiple AI systems (and whether they're important)~~
Polarity [LW · GW] (relates to takeoff, endogeneity, and timelines)
- What determines or affects polarity?
- What are the effects or implications of polarity on alignment and stabilization?
- What are the effects or implications of polarity on what the long-term future looks like, conditional on achieving alignment and stabilization?
- What would be good? Interventions?
Proximate and ultimate uses of powerful AI
- What uses of powerful AI would be great? How good would various possible uses of powerful AI be?
- Conditional on achieving alignment, what's likely to occur (and what could foresighted actors cause to occur or not to occur)?
Agents vs tools and general systems vs narrow tools (relates to tool AI [? · GW] and Comprehensive AI Services [LW · GW])
- Are general systems more powerful than similar narrow tools?
- Are agents more powerful than similar tools?
- Does generality appear by default in capable systems? (This relates to takeoff.)
- Does agency (or goal-directedness [? · GW] or consequentialism or farsightedness) appear by default in capable systems?
AI labs' behavior and racing for AI
- How will labs think about AI?
- What actions could labs perform; what are they likely to do by default?
- What would it be better for labs to do; what interventions are tractable?
States' behavior
- How will states think about AI?
- What actions could states perform? What are they likely to do by default?
- What would it be better for states to do; what interventions are tractable?
Public opinion
- How the public thinks about AI and framing; what the public thinks about AI, what memes would spread widely and how that depends on other facts about the world; how all of that translates into attitudes and policy preferences
- Wakeup to capabilities
- Wakeup to alignment risk and warning shots for alignment
- What (facts about public opinion) would be good? Interventions?
Paths to powerful AI (relates to timelines, AI risk modes, and more)
- How successful will reinforcement-learning agents built on large language models be?
- How successful will comprehensive AI services [LW · GW] be?
- How successful will language model bureaucracies [LW · GW] be?
- How successful will STEM AI [LW · GW] or Skunkworks-style AI [LW · GW] be?
- How successful will simulating evolution [LW · GW] be?
- How successful will whole-brain emulation or neuromorphic AI be?
- When is thinking in terms of paths or roadmaps useful? What other high-level paths are relevant?
Meta and miscellanea
- Epistemic stuff
  - Research methodology and organization: how to do research and organize researchers
  - Forecasting methodology: how to do forecasting better
  - Collective epistemology: how to share and aggregate beliefs and knowledge
- Decision theory
- Simulation hypothesis [? · GW]
  - Do we live in a simulation? (If so, what's the deal?)
  - What should we do if we live in a simulation (as a function of what the deal is)?
- Movement/community/field stuff

Maybe this list would be more useful if it had more pointers to relevant work?

Maybe this list would be more useful if it included stuff that's important that I don't feel uncertain about? But probably not much of that exists?

I like lists/trees/graphs. I like the ideas behind Clarifying some key hypotheses in AI alignment [LW · GW] and Modelling Transformative AI Risks [? · GW]. Perhaps this list is part of the beginning of a tree/graph for AI forecasting not including alignment stuff.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2022-12-29T22:30:03.701Z · LW(p) · GW(p)

Meta level. To carve nature at its joints, we must [use good nodes / identify the true nodes]. A node is [good insofar as / true if] its causes and effects are modular, or we can losslessly compress phenomena related to it into effects on it and effects from it.

"The cost of compute" is an example of a great node (in the context of the future of AI): it's affected by various things (choices made by Nvidia, innovation, etc.), and it affects various things (capability-level of systems made by OpenAI, relative importance of money vs talent at AI labs, etc.), and we lose nothing by thinking in terms of the cost of compute (relative to, e.g., the effects of the choices made by Nvidia on the capability-level of systems made by OpenAI).

"When Moore's law will end" is an example of something that is not a node (in the context of the future of AI), since you'd be much better off thinking in terms of the underlying causes and effects.

The relations relevant to nodes are analytical not causal. For example, "the cost of compute" is a node between "evidence about historical progress" and "timelines," not just between "stuff Nvidia does" and "stuff OpenAI does." (You could also make a causal model, but here I'm interested in analytical models.)

Object level. I'm not sure how good "timelines," "takeoff," "polarity," and "wakeup to capabilities" are as nodes. Most of the time it seems fine to talk about e.g. "effects on timelines" and "implications of timelines." But maybe this conceals confusion.

comment by Zach Stein-Perlman · 2021-09-17T17:00:42.333Z · LW(p) · GW(p)

Maybe AI Will Happen Outside US/China

I'm interested in the claim that important AI development (in the next few decades) will largely occur outside any of the states that currently look likely to lead AI development. I don't think this is likely, but I haven't seen discussion of this claim.^[1] This would matter because it would greatly affect the environment in which AI is developed and affect which agents are empowered by powerful AI.

Epistemic status: brainstorm. May be developed into a full post if I learn or think more.

I. Causes

The big tech companies are in the US and China, and discussion often assumes that these two states have a large lead on AI development. So how could important development occur in another state? Perhaps other states' tech programs (private or governmental) will grow. But more likely, I think, an already-strong company leaves the US for a new location.

My legal knowledge is insufficient to say with any confidence how well companies can leave their states. My impression is that large American companies largely can leave while large Chinese companies cannot.

Why might a big tech company or AI lab want to leave a state?^[2]

Fleeing expropriation/nationalization. States can largely expropriate companies' property within their territory unless they have contracted otherwise. A company may be able to protect its independence by securing legal protection from expropriation from another state, then moving its hardware to that state. It may move its headquarters or workers as well.
Fleeing domestic regulation on development and/or deployment of AI.

II. Effects

The state in which powerful AI is developed has two important effects.

States set regulations. The regulatory environment around an AI lab may affect the narrow AI systems it builds and/or how it pursues AGI.
State influence & power. The state in which AGI is achieved can probably nationalize that project (perhaps well before AGI). State control of powerful AI affects how it will be used.

III. AI deployment before superintelligence

Eliezer recently tweeted that AI might be low-impact until superintelligence because of constraints on deployment. This seems partially right — for example, medicine and education seem like areas in which marginal improvements in our capabilities have only small effects due to civilizational inadequacy. Certainly some AI systems would require local regulatory approval to be useful; those might well be limited in the US. But a large fraction of AI systems won't be prohibited by plausible American regulation. For example, I would be quite surprised if the following kinds of systems were prohibited by regulation (disclaimer: I'm very non-expert on near-future AI):

Business services
- Operations/logistics
- Analysis
- Productivity tools (e.g., Codex, search tools)
Online consumer services — financial, writing assistants (Codex)
Production of goods that can be shipped cheaply (like computers but not houses)
Trading
Maybe media stuff (chatbots, persuasion systems). It's really hard to imagine the US banning chatbots. I'm not sure how persuasion-AI is implemented; custom ads could conceivably be banned, but eliminating AI-written media is implausible.

This matters because these AI applications directly affect some places even if they couldn’t be developed in those places.

In the unlikely event that the US moves against not only the deployment but also the development of such systems, AI companies would be more likely to seek a way around regulation — such as relocating.

Rather, I have not seen reasons for this claim other than the very normal one — that leading states and companies change over time. If you have seen more discussion of this claim, please let me know. ↩︎
This is most likely to be relevant to the US but applies generally. ↩︎

comment by Zach Stein-Perlman · 2021-08-28T13:00:20.732Z · LW(p) · GW(p)

Value Is Binary

Epistemic status: rough ethical and empirical heuristic.

Assuming that value is roughly linear in resources available after we reach technological maturity,^[1] my probability distribution of value is so bimodal that it is nearly binary. In particular, I assign substantial probability to near-optimal futures (at least 99% of the value of the optimal future), substantial probability to near-zero-value futures (between -1% and 1% of the value of the optimal future), and little probability to anything else.^[2] To the extent that almost all of the probability mass fits into two buckets, and everything within a bucket is almost equally valuable as everything else in that bucket, the goal "maximize expected value" reduces to the goal "maximize the probability of the better bucket."
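Spelled out as a rough sketch (the symbols are labels introduced here for illustration): normalize the value of the optimal future to 1, let $p_{\text{opt}}$, $p_{\text{zero}}$, and $p_{\text{mid}}$ be the probabilities of the near-optimal bucket, the near-zero bucket, and everything else, and assume for simplicity that no future has value below $-1$ or above $1$.

$$
\mathbb{E}[V] = p_{\text{opt}}\,\mathbb{E}[V \mid \text{opt}] + p_{\text{zero}}\,\mathbb{E}[V \mid \text{zero}] + p_{\text{mid}}\,\mathbb{E}[V \mid \text{mid}],
$$

$$
\mathbb{E}[V \mid \text{opt}] \in [0.99,\, 1], \qquad \bigl|\mathbb{E}[V \mid \text{zero}]\bigr| \le 0.01, \qquad \bigl|\mathbb{E}[V \mid \text{mid}]\bigr| \le 1,
$$

$$
\bigl|\mathbb{E}[V] - p_{\text{opt}}\bigr| \;\le\; 0.01\,p_{\text{opt}} + 0.01\,p_{\text{zero}} + p_{\text{mid}} \;\le\; 0.01 + p_{\text{mid}}.
$$

So when $p_{\text{mid}}$ is small, expected value is within about a percentage point of $p_{\text{opt}}$, and ranking actions by expected value is approximately the same as ranking them by the probability of the better bucket.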

So rather than thinking about how to maximize expected value, I generally think about maximizing the probability of a great (i.e., near-optimal) future. This goal is easier for me to think about, particularly since I believe that the paths to a great future are rather homogeneous — alike not just in value but in high-level structure. In the rest of this shortform, I explain my belief that the future is likely to be near-optimal or near-zero.

Substantial probability to near-optimal futures.

I have substantial credence that the future is at least 99% as good as the optimal future.^[3] I do not claim much certainty about what the optimal future looks like — my baseline assumption is that it involves increasing and improving consciousness in the universe, but I have little idea whether that would look like many very small minds or a few very big minds. Or perhaps the optimal future involves astronomical-scale acausal trade. Or perhaps future advances in ethics, decision theory, or physics will have unforeseeable implications for how a technologically mature civilization can do good.

But uniting almost all of my probability mass for near-optimal futures is how we get there, at a high level: we create superintelligence, achieve technological maturity, solve ethics, and then optimize. Without knowing what this looks like in detail, I assign substantial probability to the proposition that humanity successfully completes this process. And I think almost all futures in which we do complete this process look very similar: they have nearly identical technology, reach the same conclusions on ethics, have nearly identical resources available to them (mostly depending on how long it took them to reach maturity), and so produce nearly identical value.

Almost all of the remaining probability to near-zero futures.

This claim is bolder, I think. Even if it seems reasonable to expect a substantial fraction of possible futures to converge to near-optimal, it may seem odd to expect almost all of the rest to be near-zero. But I find it difficult to imagine any other futures.

For a future to not be near-zero, it must involve using a nontrivial fraction of the resources available in the optimal future (by my assumption that value is roughly linear in resources). More significantly, the future must involve using resources at a nontrivial fraction of the efficiency of their use in the optimal future. This seems unlikely to happen by accident. In particular, I claim:

If a future does not involve optimizing for the good, value is almost certainly near-zero.

Roughly, this holds if all (nontrivially efficient) ways of promoting the good are not efficient ways of optimizing for anything else that we might optimize for. I strongly intuit that this is true; I expect that as technology improves, efficiently producing a unit of something will produce very little of almost all other things (where "thing" includes not just stuff but also minds, qualia, etc.).^[4] If so, then value (or disvalue) is (in expectation) a negligible side effect of optimization for other things. And I cannot reasonably imagine a future optimized for disvalue, so I think almost all non-near-optimal futures are near-zero.

So I believe that either we optimize for value and get a near-optimal future, or we do anything else and get a near-zero future.

Intuitively, it seems possible to optimize for more than one value. I think such scenarios are unlikely. Even if our utility function has multiple linear terms, unless there is some surprisingly good way to achieve them simultaneously, we optimize by pursuing one of them near-exclusively.^[5] Optimizing a utility function that looks more like min(x,y) may be a plausible result of a grand bargain, but such a scenario requires that, after we have mature technology, multiple agents have nontrivial bargaining power and different values. I find this unlikely; I expect singleton-like scenarios and that powerful agents will either all converge to the same preferences or all have near-zero-value preferences.
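A toy sketch of the contrast in the paragraph above, assuming (as in the footnote about substitutes in production) that one unit of resources converts into one unit of either good. The function names and the coefficients 3 and 2 are arbitrary choices for illustration.

```python
def best_split(utility, budget=1.0, steps=100):
    """Grid-search the share of the budget spent on good x; the rest goes to good y."""
    best_share, best_value = 0.0, float("-inf")
    for i in range(steps + 1):
        share = i / steps
        x, y = share * budget, (1 - share) * budget
        value = utility(x, y)
        if value > best_value:
            best_share, best_value = share, value
    return best_share


def linear(x, y):
    """Utility with two linear terms (illustrative coefficients)."""
    return 3 * x + 2 * y


def grand_bargain(x, y):
    """Utility shaped like min(x, y)."""
    return min(x, y)


linear_share = best_split(linear)
bargain_share = best_split(grand_bargain)
print(linear_share, bargain_share)
```

Under these assumptions the linear utility is maximized at a corner (everything goes to the good with the larger coefficient), while the min-shaped utility is maximized by splitting resources evenly.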

I mostly see "value is binary" as a heuristic for reframing problems. It also has implications for what we should do: to the extent that value is binary (and to the extent that doing so is feasible), we should focus on increasing the probability of great futures. If a "catastrophic" future is one in which we realize no more than a small fraction of our value, then a great future is simply one which is not catastrophic and we should focus on avoiding catastrophes. But of course, "value is binary" is an empirical approximation rather than an a priori truth. Even if value seems very nearly binary, we should not reject contrary proposed interventions^[6] or possible futures out of hand.

I would appreciate suggestions on how to make these ideas more formal or precise (in addition to comments on what I got wrong or left out, of course). Also, this shortform relies on argument by "I struggle to imagine"; if you can imagine something I cannot, please explain your scenario and I will justify my skepticism or update.

You would reject this if you believed that astronomical-scale goods are not astronomically better than Earth-scale goods or if you believed that some plausible Earth-scale bad would be worse than astronomical-scale goods are good. ↩︎
"Optimal" value is roughly defined as the expected value of the future in which we act as well as possible, from our current limited knowledge about what "acting well" looks like. "Zero" is roughly defined as any future in which we fail to do anything astronomically significant. I consider value relative to the optimal future, ignoring uncertainty about how good the optimal future is — we should theoretically act as if we're in a universe with high variance in value between different possibilities, but I don't see how this affects what we should choose before reaching technological maturity.*
*Except roughly that we should act with unrealistically low probability that we are in a kind of simulation in which our choices matter very little or have very differently-valued consequences than otherwise. The prospect of such simulations might undermine my conclusions—value might still be binary, but for the wrong reason—so it is useful to be able to almost-ignore such possibilities. ↩︎
That is, at least 99% of the way from the zero-value future to the optimal future. ↩︎
If we particularly believe that value is fragile [LW · GW], we have an additional reason to expect this orthogonality. But I claim that different goals tend to be orthogonal at high levels of technology independent of value's fragility. ↩︎
This assumes that all goods are substitutes in production, which I expect to be nearly true with mature technology. ↩︎
That is, those that affect the probability of futures outside the binary or that affect how good the future is within the set of near-zero (or near-optimal) futures. ↩︎

Replies from: WilliamKiely, Zach Stein-Perlman

↑ comment by WilliamKiely · 2021-11-28T09:14:05.123Z · LW(p) · GW(p)

After reading the first paragraph of your above comment only, I want to note that:

> In particular, I assign substantial probability to near-optimal futures (at least 99% of the value of the optimal future), substantial probability to near-zero-value futures (between -1% and 1% of the value of the optimal future), and little probability to anything else.

I assign much lower probability to near-optimal futures than near-zero-value futures.

This is mainly because I suspect that many of the "extremely good" possible worlds I imagine when reading Bostrom's Letter from Utopia are <1% of what is optimal.

I also think the amount of probability I assign to 1%-99% futures is (~10x?) larger than the amount I assign to >99% futures.

(I'd like to read the rest of your comment later (but not right now due to time constraints) to see if it changes my view.)

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2021-11-28T12:30:15.043Z · LW(p) · GW(p)

I agree that near-optimal is unlikely. But I would be quite surprised by 1%-99% futures because (in short) I think we do better if we optimize for good and do worse if we don’t. If our final use of our cosmic endowment isn’t near-optimal, I think we failed to optimize for good and would be surprised if it’s >1%.

Replies from: WilliamKiely

↑ comment by WilliamKiely · 2022-11-01T18:24:09.180Z · LW(p) · GW(p)

Agreed with this given how many orders of magnitude potential values span.

Rescinding my previous statement:

> I also think the amount of probability I assign to 1%-99% futures is (~10x?) larger than the amount I assign to >99% futures.

I'd now say that the probability of 1%-99% optimal futures is probably <10% of the probability of >99% optimal futures.

This is because 1% optimal is very close to being optimal (only 2 orders of magnitude away out of dozens of orders of magnitude of very good futures).

↑ comment by Zach Stein-Perlman · 2021-09-03T01:00:47.129Z · LW(p) · GW(p)

Related idea, off the cuff, rough. Not really important or interesting, but might lead to interesting insights. Mostly intended for my future selves, but comments are welcome.

Binaries Are Analytically Valuable

Suppose our probability distribution for alignment success is nearly binary. In particular, suppose that we have high credence that, by the time we can create an AI capable of triggering an intelligence explosion, we will have

really solved alignment (i.e., we can create an aligned AI capable of triggering an intelligence explosion at reasonable extra cost and delay) or
really not solved alignment (i.e., we cannot create a similarly powerful aligned AI, or doing so would require very unreasonable extra cost and delay)

(Whether this is actually true is irrelevant to my point.)

Why would this matter?

Stating the risk from an unaligned intelligence explosion is kind of awkward: it's that the alignment tax [LW · GW] is greater than what the leading AI project is able/willing to pay. Equivalently, **our goal is for the alignment tax to be less than what the leading AI project is able/willing to pay**. This gives rise to two nice, clean desiderata:

Decrease the alignment tax
Increase what the leading AI project is able/willing to pay for alignment

But unfortunately, we can't similarly split the goal (or risk) into two goals (or risks). For example, a breakdown into the following two goals does not capture the risk from an unaligned intelligence explosion:

Make the alignment tax less than 6 months and a trillion dollars
Make the leading AI project able/willing to spend 6 months and a trillion dollars on aligning an AI

It would suffice to achieve both of these goals, but doing so is not necessary. If we fail to reduce the alignment tax this far, we can compensate by doing better on the willingness-to-pay front, and vice versa.

But if alignment success is binary, then we actually can decompose the goal (bolded above) into two necessary (and jointly sufficient) conditions:

Really solve alignment; i.e., reduce the alignment tax to [reasonable value]
Make the leading AI project able/willing to spend [reasonable value] on alignment

(Where [reasonable value] depends on what exactly our binary-ish probability distribution for alignment success looks like.)
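The decomposition argument above can be checked on a toy model. This is a minimal sketch, not a model of real AI projects: the 0..10 scale, `THRESHOLD` (standing in for [reasonable value]), and `UNREASONABLE` are all hypothetical numbers chosen for illustration, and the binary case assumes no project's budget ever covers the unreasonable tax.

```python
from itertools import product

def goal_met(tax, budget):
    # The actual goal: the alignment tax is no more than what the
    # leading AI project is able/willing to pay.
    return tax <= budget

def both_subgoals_met(tax, budget, threshold):
    # Candidate decomposition: get the tax down to `threshold` AND
    # get the project's budget up to `threshold`.
    return tax <= threshold and budget >= threshold

def check(worlds, threshold):
    # Returns (sufficient, necessary) for the decomposition over `worlds`.
    sufficient = all(goal_met(t, b) for t, b in worlds
                     if both_subgoals_met(t, b, threshold))
    necessary = all(both_subgoals_met(t, b, threshold) for t, b in worlds
                    if goal_met(t, b))
    return sufficient, necessary

THRESHOLD = 5        # hypothetical stand-in for [reasonable value]
UNREASONABLE = 1000  # hypothetical tax that no project would pay
budgets = range(11)  # hypothetical budgets, 0..10

# Non-binary case: the tax can land anywhere from 0 to 10.
continuous_worlds = list(product(range(11), budgets))
# Binary case: alignment is either really solved or really not solved.
binary_worlds = list(product([THRESHOLD, UNREASONABLE], budgets))

continuous_result = check(continuous_worlds, THRESHOLD)
binary_result = check(binary_worlds, THRESHOLD)
print(continuous_result)  # (True, False): sufficient but not necessary
print(binary_result)      # (True, True): necessary and sufficient
```

In the non-binary case a world like tax 7, budget 9 meets the goal while failing both subgoals, which is why the fixed-threshold split is not necessary there. In the binary case no such world exists.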

Breaking big goals down into smaller goals (in particular, into smaller necessary conditions) is valuable, analytically and pragmatically. Binaries help, when they exist. Sometimes weaker conditions on the probability distribution, those of the form "a certain important subset of possibilities has very low probability," can be useful in the same way.

comment by Zach Stein-Perlman · 2024-07-14T23:21:52.465Z · LW(p) · GW(p)

How do corporate campaigns and leaderboards effect change?

comment by Zach Stein-Perlman · 2024-07-06T09:19:35.167Z · LW(p) · GW(p)

https://ailabwatch.org/resources/integrity

~~Writing a thing on lab integrity issues. Planning to publish early Monday morning [edit: will probably hold off in case Anthropic clarifies nondisparagement stuff]. Comment on~~ ~~this public google doc~~ ~~or DM me.~~

~~I'm particularly interested in stuff I'm missing or existing writeups on this topic.~~

comment by Zach Stein-Perlman · 2023-03-22T04:30:00.599Z · LW(p) · GW(p)

AI strategy research projects / project generators / prompts

Mostly for personal use.

Some prompts inspired by Framing AI strategy [LW · GW]:

Plans
- What plans would be good?
- Given a particular plan that is likely to be implemented, what interventions or desiderata complement that plan (by making it more likely to succeed or by being better in worlds where it succeeds)?
Affordances: for various relevant actors, what strategically significant actions could they take? What levers do they have? What would it be great for them to do (or avoid)?
Intermediate goals: what goals or desiderata are instrumentally useful?
Threat modeling: for various threats, model them well enough to understand necessary and sufficient conditions for preventing them.
Memes (& frames): what would it be good if people believed or paid attention to?

For forecasting prompts, see List of uncertainties about the future of AI [LW(p) · GW(p)].

Some miscellaneous prompts:

Slowing AI
- How can various relevant actors slow AI?
- How can the AI safety community slow AI?
- What considerations or side effects are relevant to slowing AI?
How do labs act, as a function of [whatever determines that]? In particular, what's the deal with "racing"?
AI risk advocacy
- How could the AI safety community do AI risk advocacy well?
- What considerations or side effects are relevant to AI risk advocacy?
What's the deal with crunch time?
How will the strategic landscape be different in the future?
What will be different near the end, and what interventions or actions will that enable? In particular, is eleventh-hour coordination possible? (Also maybe emergency brakes that don't appear until the end.)
What concrete asks should we have for labs? for government?
Meta: how can you help yourself or others do better AI strategy research?

comment by Zach Stein-Perlman · 2023-01-23T03:00:39.044Z · LW(p) · GW(p)

Four kinds of actors/processes/decisions are directly very important to AI governance:

Corporate self-governance
- Adopting safety standards
  - Providing a model for government regulation
US policy (and China, EU, UK, and others to a lesser extent)
- Regulation
- Incorporating standards into law
Standard-setters setting standards
International relations
- Treaties
- Informal influence on safety standards

("Safety standards" sounds prosaic but it doesn't have to be.)

comment by Zach Stein-Perlman · 2023-01-07T01:25:53.328Z · LW(p) · GW(p)

AI risk decomposition based on agency or powerseeking or adversarial optimization or something

Epistemic status: confused.

Some vague, closely related ways to decompose AI risk into two kinds of risk:

Risk due to AI agency vs risk unrelated to agency
Risk due to AI goal-directedness vs risk unrelated to goal-directedness
Risk due to AI planning vs risk unrelated to planning
Risk due to AI consequentialism vs risk unrelated to consequentialism
Risk due to AI utility-maximization vs risk unrelated to utility-maximization
Risk due to AI powerseeking vs risk unrelated to powerseeking
Risk due to AI optimizing against you vs risk unrelated to adversarial optimization

The central reason to worry about powerseeking/whatever AI, I think, is that sufficiently (relatively) capable goal-directed systems instrumentally converge to disempowering you.

The central reason to worry about non-powerseeking/whatever AI, I think, is failure to generalize correctly from training: distribution shift, Goodhart, You get what you measure [LW · GW].

comment by Zach Stein-Perlman · 2022-12-24T23:00:23.812Z · LW(p) · GW(p)

Biological bounds on requirements for human-level AI

Facts about biology bound requirements for human-level AI. In particular, here are two prima facie bounds:

Lifetime. Humans develop human-level cognitive capabilities over a single lifetime, so (assuming our artificial learning algorithms are less efficient than humans' natural learning algorithms) training a human-level model takes at least the inputs used over the course of babyhood-to-adulthood.
Evolution. Evolution found human-level cognitive capabilities by blind search, so (assuming we can search at least that well, and assuming evolution didn't get lucky) training a human-level model takes at most the inputs used over the course of human evolution (plus Lifetime inputs, but that's relatively trivial).
- Genome. The size of the human genome is an upper bound on the complexity of humans' natural learning algorithms. Training a human-level model takes at most the inputs needed to find a learning algorithm at most as complex as the human genome (plus Lifetime inputs, but that's relatively trivial). (Unfortunately, the existence of human-level learning algorithms of certain simplicity says almost nothing about the difficulty of finding such algorithms.) (Ajeya's "genome anchor" is pretty different—"a transformative model would . . . have about as many parameters as there are bytes in the human genome"—and makes no sense to me.)
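The bounds above can be written compactly. This is only a restatement under the assumptions each bound already states, and the symbols are introduced here for illustration: $I_{\text{train}}$ is the inputs needed to train a human-level model, $I_{\text{lifetime}}$ the inputs used from babyhood to adulthood, $I_{\text{evolution}}$ the inputs used over human evolution, and $I_{\text{search}}(C_{\text{genome}})$ the inputs needed to find a learning algorithm at most as complex as the human genome.

```latex
I_{\text{lifetime}} \;\le\; I_{\text{train}} \;\le\; I_{\text{evolution}} + I_{\text{lifetime}},
\qquad
I_{\text{train}} \;\le\; I_{\text{search}}(C_{\text{genome}}) + I_{\text{lifetime}}
```

The lower bound holds only if artificial learning algorithms are less efficient than humans' natural ones. The first upper bound holds only if we can search at least as well as evolution and evolution didn't get lucky. Nothing here says how large $I_{\text{search}}(C_{\text{genome}})$ is, which is the caveat noted under Genome.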

(A human-level AI should use a similar amount of computation to humans per unit of subjective time. This assumption/observation is weird and perhaps shows that something weird is going on, but I don't know how to make that sharp.)

There are few sources of bounds on requirements for human-level AI. Perhaps fundamental limits or reasoning about blind search could give weak bounds, but biology is the only example of human-level cognitive abilities and so the only possible source of reasonable bounds.

Related: Ajeya Cotra's Forecasting TAI with biological anchors [LW · GW] (most relevant section) and Eliezer Yudkowsky's Biology-Inspired AGI Timelines: The Trick That Never Works [LW · GW].

comment by Zach Stein-Perlman · 2022-12-24T12:15:25.437Z · LW(p) · GW(p)

What do people (outside this community) think about AI? What will they think in the future?

Attitudes predictably affect relevant actors' actions, so this is a moderately important question. And it's rather neglected.

Groups whose attitudes are likely to be important include ML researchers, policymakers, and the public.

On attitudes among the public, surveys provide some information, but I suspect attitudes will change (in potentially predictable ways) as AI becomes more salient and some memes/framings get locked in. Perhaps some survey questions (maybe general sentiment on AI) are somewhat robust to changes in memes while others (maybe beliefs on how AI affects the economy or attitudes on regulation) may change a lot in the near future.

On attitudes among ML researchers, surveys (e.g.) provide some information, but for some reason most ML researchers say there's at least a 5% probability of doom (or 10%, depending on how you ask) but this doesn't seem to translate into their actions or culture. Perhaps interviews would reveal researchers' attitudes better than closed-ended surveys (note to self: talk to Vael Gates [LW · GW]).

AI may become much more salient in the next few years, and memes/framings may get locked in.

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2022-12-24T23:27:26.203Z · LW(p) · GW(p)

> On attitudes among ML researchers, surveys (e.g.) provide some information, but for some reason most ML researchers say there's at least a 5% probability of doom (or 10%, depending on how you ask) but this doesn't seem to translate into their actions or culture. Perhaps interviews would reveal researchers' attitudes better than closed-ended surveys (note to self: talk to Vael Gates).

Critically, this is only necessary if we assume that researchers care about basically everyone in the present (to a loose approximation). If we instead model researchers as basically selfish by default, then the low chance of a technological singularity outweighs the high chance of death, especially for older folks.

Basically, this could be explained as a goal alignment problem: LW and AI researchers have very different goals in mind.
