This is too strong. For example, releasing the product would be correct if someone else would soon do something similar anyway, you're safer than them, and releasing first lets you capture more of the free energy. (That's not the case here, but it's not as straightforward as you suggest, especially given your "Regardless of how good their alignment plans are" and your claim "There's just no good reason to do that, except short-term greed".)
Constellation (which I think has some important FHI-like virtues, although it makes different tradeoffs and misses on others)
What is Constellation missing or what should it do? (Especially if you haven't already told the Constellation team this.)
Harry let himself be pulled, but as Hermione dragged him away, he said, raising his voice even louder, "It is entirely possible that in a thousand years, the fact that FHI was at Oxford will be the only reason anyone remembers Oxford!"
Yes, but possibly the lab has its own private scaffolding that is better for its model than any other existing scaffolding, perhaps because it trained the model to use that specific scaffolding, and it can initially withhold that scaffolding from users.
(Maybe it’s impossible to give API access to scaffolding and keep the scaffolding private? Idk.)
Edit: Plus what David says.
Suppose you can take an action that decreases net P(everyone dying) but increases P(you yourself kill everyone), and leaves all else equal. I claim you should take it; everyone is better off if you take it.
I deny "deontological injunctions." I want you and everyone to take the actions that lead to the best outcomes, not that keep your own hands clean. I'm puzzled by your expectation that I'd endorse "deontological injunctions."
This situation seems identical to the trolley problem in the relevant ways. I think you should avoid letting people die, not just avoid killing people.
[Note: I roughly endorse heuristics like if you're contemplating crazy-sounding actions for strange-sounding reasons, you should suspect that you're confused about your situation or the effects of your actions, and you should be more cautious than your naive calculations suggest. But that's very different from deontology.]
I guess I'm more willing to treat Anthropic's marketing as not-representing-Anthropic.
Like, when OpenAI marketing says "GPT-4 is our most aligned model yet!" you could say this shows that OpenAI deeply misunderstands alignment, but I tend to ignore it. Even mostly when Sam Altman says it himself.
[Edit after habryka's reply: my weak independent impression is that often the marketing people say stuff that the leadership and most technical staff disagree with, and if you use marketing-speak to substantially predict what-leadership-and-staff-believe you'll make worse predictions.]
Thanks.
I guess I'm more willing to treat Anthropic's marketing as not-representing-Anthropic. Shrug. [Edit: like, maybe it's consistent-with-being-a-good-guy-and-basically-honest to exaggerate your product in a similar way to everyone else. (You risk the downsides of creating hype but that's a different issue than the integrity thing.)]
It is disappointing that Anthropic hasn't clarified its commitments after the post-launch confusion, one way or the other.
Sorry for using the poll to support a different proposition. Edited.
To make sure I understand your position (and Ben's):
1. Dario committed to Dustin that Anthropic wouldn't "meaningfully advance the frontier" (according to Dustin).
2. Anthropic senior staff privately gave AI safety people the impression that Anthropic would stay behind/at the frontier (although nobody has quotes).
3. Claude 3 Opus meaningfully advanced the frontier? Or it only slightly advanced the frontier, but Anthropic marketed it as a substantial advance, so they're being similarly low-integrity?
...I don't think Anthropic violated its deployment commitments. I mostly believe y'all about (2)—I didn't know about it until people asserted it right after the Claude 3 release, but I haven't been around the community, much less well-connected in it, for long—but that feels like an honest miscommunication to me. If I'm missing "past Anthropic commitments," please point to them.
Most of us agree with you that deploying Claude 3 was reasonable, although I for one disagree with your reasoning. The criticism was mostly about the release being (allegedly) inconsistent with Anthropic's past statements/commitments on releases.
[Edit: the link shows that most of us don't think deploying Claude 3 increased AI risk, not think deploying Claude 3 was reasonable.]
"turning out to be right" is CAIS's strength
This is CAIP, not CAIS; CAIP doesn't really have a track record yet.
In addition to the bill, CAIP has a short summary and a long summary.
Try greaterwrong.com, turn on anti-kibbitzer, sort comments by New
If bid-ask spreads are large, consider rolling less often and holding calls that expire at different times, so that every time you roll you're only rolling half of your calls.
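A toy sketch (with made-up numbers of mine, not from the comment) of what such a staggered schedule could look like, assuming two tranches of calls each rolled once a year but offset by six months:

```python
# Hypothetical parameters, purely illustrative.
position_value = 100_000   # notional exposure held via calls
premium_frac = 0.10        # option premium as a fraction of notional
spread_frac = 0.02         # bid-ask spread paid per roll, as a fraction of premium traded

tranches = [
    {"name": "tranche A", "roll_month": 1},   # rolls every January
    {"name": "tranche B", "roll_month": 7},   # rolls every July
]

for month in range(1, 13):
    for t in tranches:
        if month == t["roll_month"]:
            # Each roll only touches this tranche's half of the position.
            rolled_premium = (position_value / len(tranches)) * premium_frac
            cost = rolled_premium * spread_frac
            print(f"month {month:2d}: roll {t['name']}, "
                  f"premium traded ~${rolled_premium:,.0f}, spread cost ~${cost:,.0f}")
```

In this toy accounting, staggering doesn't change the total spread paid per year (rolling less often is what does that); it just makes each individual roll half the size.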
@gwern I've failed to find a source saying that Hydrazine invested in OpenAI. If it did, that would be a big deal; it would make this a lie.
Hydrazine investment in OA
Source?
Cool, changing my vote from "no" to "yes"
You left out the word "meaningfully," which is quite important.
I think deploying Claude 3 was fine and most AI safety people are confused about the effects of deploying frontier-ish models. I haven't seen anyone else articulate my position recently so I'd probably be down to dialogue. Or maybe I should start by writing a post.
[Edit: this comment got lots of agree-votes and "Deploying Claude 3 increased AI risk" got lots of disagreement so maybe actually everyone agrees it was fine.]
Deploying Claude 3 increased AI risk.
AGI is defined as "a highly autonomous system that outperforms humans at most economically valuable work." We can hope, but it's definitely not clear that AGI comes before existentially-dangerous-AI.
New misc remark:
It's not clear how the PF interacts with sharing models with Microsoft (or others). In particular, if OpenAI is required to share its models with Microsoft and Microsoft can just deploy them, even a great PF wouldn't stop dangerous models from being deployed. See OpenAI-Microsoft partnership.
Good work.
One plausibly-important factor I wish you'd tracked: whether the company offers bug bounties.
[Edit: didn't mean to suggest David's post is redundant.]
I agree. But I claim saying "I can't talk about the game itself, as that's forbidden by the rules" is like saying "I won't talk about the game itself because I decided not to" -- the underlying reason is unclear.
Unfortunately, I can't talk about the game itself, as that's forbidden by the rules.
You two can just change the rules... I'm confused by this rule.
The control-y plan I'm excited about doesn't feel to me like "squeeze useful work out of clearly misaligned models." It's like "use scaffolding/oversight to make using a model safer, and get decent arguments that using the model (in certain constrained ways) is pretty safe even if it's scheming." Then if you ever catch the model scheming, do some combination of (1) stop using it, (2) produce legible evidence of scheming, and (3) do few-shot catastrophe prevention. But I defer to Ryan here.
Duplicate of https://www.lesswrong.com/posts/6uJBqeG5ywE6Wcdqw/survey-of-2-778-ai-authors-six-parts-in-pictures
New misc remark:
- OpenAI's commitments about deployment seem to just refer to external deployment, unfortunately.
- This isn't explicit, but they say "Deployment in this case refers to the spectrum of ways of releasing a technology for external impact."
- This contrasts with Anthropic's RSP, in which "deployment" includes internal use.
Good post.
Added to the post:
Edit, one day later: the structure seems good, but I'm very concerned that the thresholds for High and Critical risk in each category are way too high, such that e.g. a system could very plausibly kill everyone without reaching Critical in any category. See pp. 8–11. If so, that's a fatal flaw for a framework like this. I'm interested in counterarguments; for now, praise mostly retracted; oops. I still prefer this to no RSP-y-thing, but I was expecting something stronger from OpenAI. I really hope they lower thresholds for the finished version of this framework.
Any mention of what the "mitigations" in question would be?
Not really. For now, OpenAI mostly mentions restricting deployment (this section is pretty disappointing):
A central part of meeting our safety baselines is implementing mitigations to address various types of model risk. Our mitigation strategy will involve both containment measures, which help reduce risks related to possession of a frontier model, as well as deployment mitigations, which help reduce risks from active use of a frontier model. As a result, these mitigations might span increasing compartmentalization, restricting deployment to trusted users, implementing refusals, redacting training data, or alerting distribution partners.
I predict that if you read the doc carefully, you'd say "probably net-harmful relative to just not pretending to have a safety plan in the first place."
Some personal takes in response:
- Yeah, largely the letter of the law isn't sufficient.
- Some evals are hard to goodhart. E.g. "can red-teamers demonstrate problems (given our mitigations)" is pretty robust — if red-teamers can't demonstrate problems, that's good evidence of safety (for deployment with those mitigations), even if the mitigations feel jury-rigged.
- Yeah, this is intended to be complemented by superalignment.
(I edited my comment to add the market, sorry for confusion.)
(Separately, a market like you describe might still be worth making.)
(The label "RSP" isn't perfect but it's kinda established now. My friends all call things like this "RSPs." And anyway I don't think "PFs" should become canonical instead. I predict change in terminology will happen ~iff it's attempted by METR or multiple frontier labs together. For now, I claim we should debate terminology occasionally but follow standard usage when trying to actually communicate.)
Nice. Another possible market topic: the mix of Low/Medium/High/Critical on their Scorecard, when they launch it or on 1 Jan 2025 or something. Hard to operationalize because we don't know how many categories there will be, and we care about both pre-mitigation and post-mitigation scores.
Made a simple market:
Mea culpa. Embarrassed I forgot that. Yay Anthropic too!
Edited the post.
But other labs are even less safe, and not far behind.
Yes, largely alignment is an unsolved problem on which progress is an exogenous function of time. But to a large extent we're safer with safety-interested labs developing powerful AI: this will boost model-independent alignment research, make particular critical models more likely to be aligned/controlled, help generate legible evidence that alignment is hard (insofar as that exists), and maybe enable progress to pause at a critical moment.
See Would edits to the adult brain even do anything?.
(Not endorsing the post or that section, just noticing that it seems relevant to your complaint.)
I think you use "AI governance" to mean "AI policy," thereby excluding e.g. lab governance (e.g. structured access and RSPs). But possibly you mean to imply that AI governance minus AI policy is not a priority.
(I disagree. Indeed, until recently governance people had very few policy asks for government.)
Did that change because people finally finished doing enough basic strategy research to know what policies to ask for?
Yeah, that's Luke Muehlhauser's claim; see the first paragraph of the linked piece.
I mostly agree with him. I wasn't doing AI governance years ago but my impression is they didn't have many/good policy asks. I'd be interested in counterevidence — like pre-2022 (collections of) good policy asks.
Anecdotally, I think I know one AI safety person who was doing influence-seeking-in-government and was on a good track but quit (to do research) because they couldn't leverage their influence: the AI governance community didn't really have asks for (the US federal) government.
Like, the whole appeal of governance as an approach to AI safety is that it's (supposed to be) bottlenecked mainly on execution, not on research.
(I disagree. Indeed, until recently governance people had very few policy asks for government.)
(Also note that lots of "governance" research is ultimately aimed at helping labs improve their own safety. Central example: Structured access.)
Most don't do policy at all. Many do research. Since you're incredulous, here are some examples of great AI governance research (which don't synergize much with talking to policymakers):
How did various companies do on the requests? Here is how the UK graded them.
CFI reviewers (UK Government)
This wasn't the UK or anything official — it was just some AI safety folks.
(I agree that the scores feel inflated, mostly due to insufficiently precise recommendations and scoring on the basis of whether a company checks the box or not—e.g. all companies got 2/2 on doing safety research, even though some companies are >100x better at it than others—and also just generous grading.)
I am excited about this. I've also recently been interested in ideas like "nudge researchers to write 1-5 page research agendas, then collect them and advertise the collection."
Possible formats:
- A huge google doc (maybe based on this post); anyone can comment; there are one or more maintainers; maintainers approve ~all suggestions by researchers about their own research topics and consider suggestions by random people.
- A directory of google docs on particular agendas; the individual google docs are each owned by a relevant researcher, who is responsible for maintaining them; some maintainer-of-the-whole-project occasionally nudges researchers to update their docs and reassigns the topic to someone else if necessary. Random people can make suggestions too.
- (Alex, I think we can do much better than the best textbooks format in terms of organization, readability, and keeping up to date.)
I am interested in helping make something like this happen. Or if it doesn't happen soon I might try to do it (but I'm not taking responsibility for making this happen). Very interested in suggestions.
(One particular kind-of-suggestion: is there a taxonomy/tree of alignment research directions you like, other than the one in this post? (Note to self: taxonomies have to focus on either methodology or theory of change... probably organize by theory of change and don't hesitate to point to the same directions/methodologies/artifacts in multiple places.))
It's "a unified methodology" but I claim it has two very different uses: (1) determining whether a model is safe (in general or within particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.
Thanks!
I think there's another agenda like make untrusted models safe but useful by putting them in a scaffolding/bureaucracy—of filters, classifiers, LMs, humans, etc.—such that at inference time, takeover attempts are less likely to succeed and more likely to be caught. See Untrusted smart models and trusted dumb models (Shlegeris 2023). Other relevant work:
- AI safety with current science (Shlegeris 2023)
- Preventing Language Models From Hiding Their Reasoning (Roger and Greenblatt 2023)
- Coup probes (Roger 2023)
- (Some other work boosts this agenda but isn't motivated by it — perhaps activation engineering and chain-of-thought)
[Edit: now AI Control (Shlegeris et al. 2023) and Catching AIs red-handed (Greenblatt and Shlegeris 2024).]
[Edit: I make a bid for an expert—probably someone at Redwood—to make a public reading list on this control agenda.]
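A minimal, hypothetical sketch of the kind of scaffolding this agenda describes (roughly the trusted-monitoring setup from the work cited above; the class, function names, and threshold are my own illustrations, not from those posts): an untrusted smart model proposes an action, a trusted dumber model scores how suspicious it is, and sufficiently suspicious actions are blocked and escalated to human audit.

```python
from dataclasses import dataclass
from typing import Callable, Optional

def flag_for_human_audit(task: str, action: str, suspicion: float) -> None:
    # Stand-in for "surface this to humans / produce legible evidence of scheming".
    print(f"AUDIT (suspicion={suspicion:.2f}): task={task!r} action={action!r}")

@dataclass
class ControlledAgent:
    untrusted_propose: Callable[[str], str]         # smart, possibly-scheming model
    trusted_suspicion: Callable[[str, str], float]  # weaker but trusted monitor, returns 0..1
    audit_threshold: float = 0.95                   # illustrative threshold

    def act(self, task: str) -> Optional[str]:
        action = self.untrusted_propose(task)
        suspicion = self.trusted_suspicion(task, action)
        if suspicion >= self.audit_threshold:
            # Don't execute the action; catching an attempt is itself valuable.
            flag_for_human_audit(task, action, suspicion)
            return None
        return action

# Toy usage with stub models:
agent = ControlledAgent(
    untrusted_propose=lambda task: f"do({task})",
    trusted_suspicion=lambda task, action: 0.1,
)
print(agent.act("summarize logs"))  # -> "do(summarize logs)"
```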
Update: Greg Brockman quit.
Update: Sam and Greg say:
Sam and I are shocked and saddened by what the board did today.
Let us first say thank you to all the incredible people who we have worked with at OpenAI, our customers, our investors, and all of those who have been reaching out.
We too are still trying to figure out exactly what happened. Here is what we know:
- Last night, Sam got a text from Ilya asking to talk at noon Friday. Sam joined a Google Meet and the whole board, except Greg, was there. Ilya told Sam he was being fired and that the news was going out very soon.
- At 12:19pm, Greg got a text from Ilya asking for a quick call. At 12:23pm, Ilya sent a Google Meet link. Greg was told that he was being removed from the board (but was vital to the company and would retain his role) and that Sam had been fired. Around the same time, OpenAI published a blog post.
- As far as we know, the management team was made aware of this shortly after, other than Mira who found out the night prior.
The outpouring of support has been really nice; thank you, but please don’t spend any time being concerned. We will be fine. Greater things coming soon.
Update: three more resignations including Jakub Pachocki.
Sam Altman's firing as OpenAI CEO was not the result of "malfeasance or anything related to our financial, business, safety, or security/privacy practices" but rather a "breakdown in communications between Sam Altman and the board," per an internal memo from chief operating officer Brad Lightcap seen by Axios.
Update: Sam is planning to launch something (no details yet).
Update: Sam may return as OpenAI CEO.
Update: Tigris.
Update: talks with Sam and the board.
Update: Mira wants to hire Sam and Greg in some capacity; board still looking for a permanent CEO.
Update: Emmett Shear is interim CEO; Sam won't return.
Update: lots more resignations (according to an insider).
Update: Sam and Greg leading a new lab in Microsoft.
Update: total chaos.
Has anyone collected their public statements on various AI x-risk topics anywhere?
A bit, not shareable.
Helen is an AI safety person. Tasha is on the Effective Ventures board. Ilya leads superalignment. Adam signed the CAIS statement.
Thanks!
automating the world economy will take longer
I'm curious what fraction-of-2023-tasks-automatable and maybe fraction-of-world-economy-automated you think will occur at e.g. overpower time, and the median year for that. (I sometimes notice people assuming 99%-automatability occurs before all the humans are dead, without realizing they're assuming anything.)