Common misconceptions about OpenAI
post by Jacob_Hilton · 2022-08-25T14:02:26.257Z · LW · GW · 154 comments
I have recently encountered a number of people with misconceptions about OpenAI. Some common impressions are accurate, and others are not. This post is intended to provide clarification on some of these points, to help people know what to expect from the organization and to figure out how to engage with it. It is not intended as a full explanation or evaluation of OpenAI's strategy.
The post has three sections:
- Common accurate impressions
- Common misconceptions
- Personal opinions
The bolded claims in the first two sections are intended to be uncontroversial, i.e., most informed people would agree with how they are labeled (correct versus incorrect). I am less sure about how commonly believed they are. The bolded claims in the last section I think are probably true, but they are more open to interpretation and I expect others to disagree with them.
Note: I am an employee of OpenAI. Sam Altman (CEO of OpenAI) and Mira Murati (CTO of OpenAI) reviewed a draft of this post, and I am also grateful to Steven Adler, Steve Dowling, Benjamin Hilton, Shantanu Jain, Daniel Kokotajlo, Jan Leike, Ryan Lowe, Holly Mandel and Cullen O'Keefe for feedback. I chose to write this post and the views expressed in it are my own.
Common accurate impressions
Correct: OpenAI is trying to directly build safe AGI.
OpenAI's Charter states: "We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome." OpenAI leadership describe trying to directly build safe AGI as the best way to currently pursue OpenAI's mission, and have expressed concern about scenarios in which a bad actor is first to build AGI and chooses to misuse it.
Correct: the majority of researchers at OpenAI are working on capabilities.
Researchers on different teams often work together, but it is still reasonable to loosely categorize OpenAI's researchers (around half the organization) at the time of writing as approximately:
- Capabilities research: 100
- Alignment research: 30
- Policy research: 15
Correct: the majority of OpenAI employees did not join with the primary motivation of reducing existential risk from AI specifically.
My strong impressions, which are not based on survey data, are as follows. Across the company as a whole, a minority of employees would cite reducing existential risk from AI as their top reason for joining. A significantly larger number would cite reducing risk of some kind, or other principles of beneficence put forward in the OpenAI Charter, as their top reason for joining. Among people who joined to work in a safety-focused role, a larger proportion of people would cite reducing existential risk from AI as a substantial motivation for joining, compared to the company as a whole. Some employees have become motivated by existential risk reduction since joining OpenAI.
Correct: most interpretability research at OpenAI stopped after the Anthropic split.
Chris Olah led interpretability research at OpenAI before becoming a cofounder of Anthropic. Although several members of Chris's former team still work at OpenAI, most of them are no longer working on interpretability.
Common misconceptions
Incorrect: OpenAI is not working on scalable alignment.
OpenAI has teams focused both on practical alignment (trying to make OpenAI's deployed models as aligned as possible) and on scalable alignment (researching methods for aligning models that are beyond human supervision, which could potentially scale to AGI). These teams work closely with one another. Its recently-released alignment research includes self-critiquing models (AF discussion), InstructGPT, WebGPT (AF discussion) and book summarization (AF discussion). OpenAI's approach to alignment research is described here, and includes as a long-term goal an alignment MVP (AF discussion).
Incorrect: most people who were working on alignment at OpenAI left for Anthropic.
The main group of people working on alignment (other than interpretability) at OpenAI at the time of the Anthropic split at the end of 2020 was the Reflection team, which has since been renamed to the Alignment team. Of the 7 members of the team at that time (who are listed on the summarization paper), 4 are still working at OpenAI, and none are working at Anthropic. Edited to add: this fact alone is not intended to provide a complete picture of the Anthropic split, which is more complicated than I am able to explain here.
Incorrect: OpenAI is a purely for-profit organization.
OpenAI has a hybrid structure in which the highest authority is the board of directors of a non-profit entity. The members of the board of directors are listed here. In legal paperwork signed by all investors, it is emphasized that: "The [OpenAI] Partnership exists to advance OpenAI Inc [the non-profit entity]'s mission of ensuring that safe artificial general intelligence is developed and benefits all of humanity. The General Partner [OpenAI Inc]'s duty to this mission and the principles advanced in the OpenAI Inc Charter take precedence over any obligation to generate a profit. The Partnership may never make a profit, and the General Partner is under no obligation to do so."
Incorrect: OpenAI is not aware of the risks of race dynamics.
OpenAI's Charter contains the following merge-and-assist clause: "We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”"
Incorrect: OpenAI leadership is dismissive of existential risk from AI.
OpenAI has a Governance team (within Policy Research) that advises leadership and is focused on strategy for avoiding existential risk from AI. In multiple recent all-hands meetings, OpenAI leadership have emphasized to employees the need to scale up safety efforts over time, and encouraged employees to familiarize themselves with alignment ideas. OpenAI's Chief Scientist, Ilya Sutskever, recently pivoted to spending 50% of his time on safety.
Personal opinions
Opinion: OpenAI leadership cares about reducing existential risk from AI.
I think that OpenAI leadership are familiar with, and agree with, the basic case for concern, and appreciate the magnitude of what's at stake. Existential risk is an important factor, but not the only factor, in OpenAI leadership's decision making. OpenAI's alignment work is much more than just a token effort.
Opinion: capabilities researchers at OpenAI have varying attitudes to existential risk.
I think that capabilities researchers at OpenAI have a wide variety of views, including some with long timelines who are skeptical of attempts to mitigate risk now, and others who are supportive but may consider the question to be outside their area of expertise. Some capabilities researchers actively look for ways to help with alignment, or to learn more about it.
Opinion: disagreements about OpenAI's strategy are substantially empirical.
I think that some of the main reasons why people in the alignment community might disagree with OpenAI's strategy are largely disagreements about empirical facts. In particular, compared to people in the alignment community, OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities, and are more concerned about bad actors developing and misusing AGI. I would expect OpenAI leadership to change their mind on these questions given clear enough evidence to the contrary.
Opinion: I am personally extremely uncertain about strategy-related questions.
I do not spend most of my time thinking about strategy. If I were forced to choose between OpenAI speeding up or slowing down its work on capabilities, my guess is that I would end up choosing the latter, all else equal, but I am very unsure.
Opinion: OpenAI's actions have drawn a lot of attention to large language models.
I think that the release of GPT-3 and the OpenAI API led to significantly increased focus and somewhat of a competitive spirit around large language models. I consider there to be advantages and disadvantages to this. I don't think OpenAI predicted this in advance, and believe that it would have been challenging, but not impossible, to foresee this.
Opinion: OpenAI is deploying models in order to generate revenue, but also to learn about safety.
I think that OpenAI is trying to generate revenue through deployment in order to directly create value and in order to fund further research and development. At the same time, it also uses deployment as a way to learn in various ways, and about safety in particular.
Opinion: OpenAI's particular research directions are driven in large part by researchers.
I think that OpenAI leadership has control over staffing and resources that affects the organization's overall direction, but that particular research directions are largely delegated to researchers, because they have the most relevant context. OpenAI would not be able to do impactful alignment research without researchers who have a strong understanding of the field. If there were talented enough researchers who wanted to lead new alignment efforts at OpenAI, I would expect them to be enthusiastically welcomed by OpenAI leadership.
Opinion: OpenAI should be focusing more on alignment.
I think that OpenAI's alignment research in general, and its scalable alignment research in particular, has significantly higher average social returns than its capabilities research on the margin.
Opinion: OpenAI is a great place to work to reduce existential risk from AI.
I think that the Alignment, RL, Human Data, Policy Research, Security, Applied Safety, and Trust and Safety teams are all doing work that seems useful for reducing existential risk from AI.
154 comments
Comments sorted by top scores.
comment by Adam Scholl (adam_scholl) · 2022-08-27T15:54:54.544Z · LW(p) · GW(p)
One comment in this thread compares the OP to Philip Morris' claims to be working toward a “smoke-free future.” I think this analogy is overstated, in that I expect Philip Morris is being more intentionally deceptive than Jacob Hilton here. But I quite liked the comment anyway, because I share the sense that (regardless of Jacob's intention) the OP has an effect much like safetywashing, and I think the exaggerated satire helps make that easier to see.
The OP is framed as addressing common misconceptions about OpenAI, of which it lists five:
- OpenAI is not working on scalable alignment.
- Most people who were working on alignment at OpenAI left for Anthropic.
- OpenAI is a purely for-profit organization.
- OpenAI is not aware of the risks of race dynamics.
- OpenAI leadership is dismissive of existential risk from AI.
Of these, I think 1, 3, and 4 address positions that are held by basically no one. So by “debunking” much dumber versions of the claims people actually make, the post gives the impression of engaging with criticism, without actually meaningfully doing that. 2 at least addresses a real argument, but at least as I understand it, is quite misleading—while technically true, it seriously underplays the degree to which there was an exodus of key safety-conscious staff, who left because they felt OpenAI leadership was too reckless. So of these, only 5 strikes me as responding non-misleadingly to a real criticism people actually regularly make.
In response to the Philip Morris analogy, Jacob advised caution:
rhetoric like this seems like an excellent way to discourage OpenAI employees from ever engaging with the alignment community.
For many years, the criticism I heard of OpenAI in private was dramatically more vociferous than what I heard in public. I think much of this was because many people shared Jacob’s concern—if we say what we actually think about their strategy, maybe they’ll write us off as enemies, and not listen later when it really counts?
But I think this is starting to change. I’ve seen a lot more public criticism lately, which I think is probably at least in part because it’s become so obvious that the strategy of mincing our words hasn't worked. If they mostly ignore all but the very most optimistic alignment researchers now, why should we expect that will change later, as long as we keep being careful to avoid stating any of our offensive-sounding beliefs?
From talking with early employees and others, my impression is that OpenAI’s founding was incredibly reckless, in the sense that they rushed to deploy their org, before first taking much time to figure out how to ensure that went well. The founders' early comments about accident risk mostly strike me as so naive and unwise, that I find it hard to imagine they thought much at all about the existing alignment literature before deciding to charge ahead and create a new lab. Their initial plan—the one still baked into their name—would have been terribly dangerous if implemented, for reasons I’d think should have been immediately obvious to them had they stopped to think hard about accident risk at all.
And I think their actions since then have mostly been similarly reckless. When they got the scaling laws result, they published a paper about it, thereby popularizing the notion that “just making the black box bigger” might be a viable path to AGI. When they demoed this strategy with products like GPT-3, DALL-E, and CLIP, they described much of the architecture publicly, inspiring others to pursue similar research directions.
So in effect, as far as I can tell, they created a very productive “creating the x-risk” department, alongside a smaller “mitigating that risk” department—the presence of which I take the OP to describe as reassuring—staffed by a few of the most notably optimistic alignment researchers, many of whom left because even they felt too worried about OpenAI’s recklessness.
After all of that, why would we expect they’ll suddenly start being prudent and cautious when it comes time to deploy transformative tech? I don’t think we should.
My strong bet is that OpenAI leadership are good people, in the standard deontological sense, and I think that’s overwhelmingly the sense that should govern interpersonal interactions. I think they’re very likely trying hard, from their perspective, to make this go well, and I urge you, dear reader, not to be an asshole to them. Figuring out what makes sense is hard; doing things is hard; attempts to achieve goals often somehow accidentally end up causing the opposite thing to happen; nobody will want to work with you if small strategic updates might cause you to suddenly treat them totally differently.
But I think we are well past the point where it plausibly makes sense for pessimistic folks to refrain from stating their true views about OpenAI (or any other lab) just to be polite. They didn’t listen the first times alignment researchers screamed in horror, and they probably won’t listen the next times either. So you might as well just say what you actually think—at least that way, anyone who does listen will find a message worth hearing.
↑ comment by Ofer (ofer) · 2022-08-28T10:02:56.070Z · LW(p) · GW(p)
Another bit of evidence about OpenAI that I think is worth mentioning in this context: OPP recommended a grant of $30M to OpenAI in a deal that involved OPP's then-CEO becoming a board member of OpenAI. OPP hoped that this would allow them to make OpenAI improve its approach to safety and governance. Later, OpenAI appointed both the CEO's fiancée and the fiancée's sibling to VP positions.
↑ comment by Vaniver · 2022-08-29T21:49:55.139Z · LW(p) · GW(p)
Both of whom then left for Anthropic with the split, right?
↑ comment by Ofer (ofer) · 2022-08-30T04:54:38.643Z · LW(p) · GW(p)
Yes. To be clear, the point here is that OpenAI's behavior in that situation seems similar to how, seemingly, for-profit companies sometimes try to capture regulators by paying their family members. (See 30 seconds from this John Oliver monologue as evidence that such tactics are not rare in the for-profit world.)
↑ comment by Vaniver · 2022-08-30T17:18:41.600Z · LW(p) · GW(p)
Makes sense; it wouldn't surprise me if that's what's happening. I think this perhaps understates the degree to which the attempts at capture were mutual--a theory of change where OPP gives money to OpenAI in exchange for a board seat and the elevation of safety-conscious employees at OpenAI seems like a pretty good way to have an effect. [This still leaves the question of how OPP assesses safety-consciousness.]
I should also note that I find the 'nondisparagement agreements' people have signed with OpenAI somewhat troubling, because they mean many people with high context will not be writing comments like Adam Scholl's above even if they wanted to, and so the absence of evidence is not as much evidence of absence as one would hope.
↑ comment by Ofer (ofer) · 2022-08-30T17:45:49.834Z · LW(p) · GW(p)
Does everyone who works at OpenAI sign a non-disparagement agreement? (Including those who work on governance/policy?)
↑ comment by gugu (gabor-fuisz) · 2022-09-01T03:12:49.081Z · LW(p) · GW(p)
Sooo this was such an intriguing idea that I did some research -- but reality appears to be more boring:
In a recent informal discussion, I believe said OPP CEO remarked that he had to give up the OpenAI board seat because his fiancée joining Anthropic created a conflict of interest. Naively this is much more likely, and I think it is much better supported by the timelines.
According to LinkedIn, the fiancée in question had already joined as a VP in 2018 and was promoted to a probably more serious position in 2020, and her sibling was promoted to VP in 2019.
The Anthropic split occurred in June 2021.
A new board member (who is arguably very aligned with OPP) was inducted in September 2021, probably in place of the OPP CEO.
It is unclear exactly when the OPP CEO left the board, but I would guess sometime in 2021. This seems better explained by "conflict of interest with his fiancée joining/cofounding Anthropic", and OpenAI putting another OPP-aligned board member in his place wouldn't make for very productive scheming.
↑ comment by habryka (habryka4) · 2022-09-01T20:42:09.498Z · LW(p) · GW(p)
The "conflict of interest" explanation also matches my understanding of the situation better.
↑ comment by William_S · 2022-09-03T16:20:08.596Z · LW(p) · GW(p)
(I work at OpenAI). Is the main thing you think has the effect of safetywashing here the claim that the misconceptions are common? Like if the post was "some misconceptions I've encountered about OpenAI" it would mostly not have that effect? (Point 2 was edited to clarify that it wasn't a full account of the Anthropic split.)
↑ comment by [deleted] · 2022-08-28T09:15:18.275Z · LW(p) · GW(p)
“the presence of which I take the OP to describe as reassuring”
I get the sense from this, and from the rest of your comment, that you think we should in fact not find this even mildly reassuring. I'm not going to argue with such a claim, because I don't think such an effort on my part would be very useful to anyone. However, if I'm not completely off base or overstating your position (which I totally could be), then could you go into some more detail as to why you think we shouldn't find their presence reassuring at all?
↑ comment by lc · 2022-08-28T11:47:18.313Z · LW(p) · GW(p)
Suppose you're in middle school, and one day you learn that your teachers are planning a mandatory field trip, during which the entire grade will jump off of a skyscraper without a parachute. You approach a school administrator to talk to them about how dangerous that would be, and they say, "Don't worry! We'll all be wearing hard hats the entire time."
Hearing that probably does not reassure you even a little bit, because hard hats alone would not nudge the probability of death below ~100%. It might actually make you more worried, because the fact that they have a prepared response means school administrators were aware of potential issues and then decided the hard hat solution was appropriate. It's generally harder to argue someone out of believing in an incorrect solution to a problem, than into believing the problem exists in the first place.
This analogy overstates the obviousness of (and my personal confidence in) the risk, but to a lot of alignment researchers it's an essentially accurate metaphor for how ineffective they think OpenAI's current precautions will turn out in practice, even if making a doomsday AI feels like a more "understandable" mistake.
comment by Tomás B. (Bjartur Tómas) · 2022-08-25T14:54:24.476Z · LW(p) · GW(p)
Incorrect: OpenAI leadership is dismissive of existential risk from AI.
So the reason I think this is that very high-level people have made claims like “the orthogonality thesis is probably false”, and someone I know who talked to a very, very, very high-level person at OpenAI had to explain to them that inner alignment is a thing. If they actually cared, I would expect the leadership to have more familiarity with their critics' arguments.
No one remembers now, but the founding rhetoric was also pretty bad, though walked back I suppose.
Also, I often see them claim their AI ethics work (train a model not to offend the average Berkeley humanities grad - possibly not useless, I suppose, but not exactly going to save our lightcone) is important alignment work. Obviously, what is going on inside is not legible to me, but what I see from the outside has mostly been disheartening. Their recent blog on alignment was an exception to this.
Though there are people with their priorities straight at OpenAI, I see little evidence that this is true of their leadership. I’m not confident an organization can be net beneficial when this is the case.
↑ comment by Quadratic Reciprocity · 2022-08-26T04:05:07.133Z · LW(p) · GW(p)
If we're thinking about the same "very, very, very high-level person at OpenAI", it does seem like this person now buys that inner alignment is a thing and is concerned about it (or says he's concerned). It is scary because people at these AI labs don't know all that much about AI alignment but also hopeful because they don't seem to disagree with it and maybe just need to be given the arguments in a good way by someone they would listen to?
↑ comment by Tomás B. (Bjartur Tómas) · 2022-08-26T05:29:23.010Z · LW(p) · GW(p)
I suspect we are thinking about the same person, and it is heartening that they changed their mind.
↑ comment by Wei Dai (Wei_Dai) · 2022-08-26T17:16:34.425Z · LW(p) · GW(p)
Also, I often see them claim their AI ethics work (train a model not to offend the average Berkeley humanities grad—possibly not useless, I suppose, but not exactly going to save our lightcone) is important alignment work.
Wait, you don't think this (I mean the training, not the offending) is a safety problem in and of itself? (See also my previous comment about this.)
comment by Lauro Langosco · 2022-08-27T14:20:20.597Z · LW(p) · GW(p)
People at OpenAI regularly say things like
- Our current path [to solve alignment] is very promising (https://twitter.com/janleike/status/1562501343578689536)
- [...] even without fundamentally new alignment ideas, we can likely build sufficiently aligned AI systems to substantially advance alignment research itself (https://openai.com/blog/our-approach-to-alignment-research/ )
And you say:
- OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities
AFAICT, no-one from OpenAI has publicly explained why they believe that RLHF + amplification is supposed to be enough to safely train systems that can solve alignment for us. The blog post linked above says "we believe" four times, but does not take the time to explain why anyone believes these things.
Writing up this kind of reasoning is time-intensive, but I think it would be worth it: if you're right, then the value of information for the rest of the community is huge; if you're wrong, it's an opportunity to change your minds.
↑ comment by Evan R. Murphy · 2023-02-07T00:15:11.032Z · LW(p) · GW(p)
AFAICT, no-one from OpenAI has publicly explained why they believe that RLHF + amplification is supposed to be enough to safely train systems that can solve alignment for us. The blog post linked above says "we believe" four times, but does not take the time to explain why anyone believes these things.
Probably true at the time, but in December Jan Leike did write in some detail about why he's optimistic about OpenAI approach: https://aligned.substack.com/p/alignment-optimism
comment by johnswentworth · 2022-08-25T16:02:41.977Z · LW(p) · GW(p)
Opinion: disagreements about OpenAI's strategy are substantially empirical.
I think that some of the main reasons why people in the alignment community might disagree with OpenAI's strategy are largely disagreements about empirical facts. In particular, compared to people in the alignment community, OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities, and are more concerned about bad actors developing and misusing AGI. I would expect OpenAI leadership to change their mind on these questions given clear enough evidence to the contrary.
See, this is exactly the problem. Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late. That is the fundamental reason why alignment is harder than other scientific fields. Goodhart problems in outer alignment, deception in inner alignment, phase change in hard takeoff, "getting what you measure" in slow takeoff, however you frame it the issue is the same: things look fine early on, and go wrong later.
And as far as I can tell, OpenAI as an org just totally ignores that whole class of issues/arguments, and charges ahead assuming that if they don't see a problem then there isn't a problem (and meanwhile does things which actively select for hiding problems, like e.g. RLHF).
↑ comment by Jacob_Hilton · 2022-08-25T17:14:37.537Z · LW(p) · GW(p)
To clarify, by "empirical" I meant "relating to differences in predictions" as opposed to "relating to differences in values" (perhaps "epistemic" would have been better). I did not mean to distinguish between experimental versus conceptual evidence. I would expect OpenAI leadership to put more weight on experimental evidence than you, but to be responsive to evidence of all kinds. I think that OpenAI leadership are aware of most of the arguments you cite, but came to different conclusions after considering them than you did.
↑ comment by Joe Collman (Joe_Collman) · 2022-08-25T18:23:57.812Z · LW(p) · GW(p)
[First of all, many thanks for writing the post; it seems both useful and the kind of thing that'll predictably attract criticism]
I'm not quite sure what you mean to imply here (please correct me if my impression is inaccurate - I'm describing how-it-looks-to-me, and I may well be wrong):
I would expect OpenAI leadership to put more weight on experimental evidence than you...
Specifically, John's model (and mine) has:
X = [Class of high-stakes problems on which we'll get experimental evidence before it's too late]
Y = [Class of high-stakes problems on which we'll get no experimental evidence before it's too late]
Unless we expect Y to be empty, when we're talking about Y-problems the weighting is irrelevant: we get no experimental evidence.
Weighting of evidence is an issue when dealing with a fixed problem.
It seems here as if it's being used to select the problem: we're going to focus on X-problems because we put a lot of weight on experimental evidence. (obviously silly, so I don't imagine anyone consciously thinks like this - but out-of-distribution intuitions may be at work)
What kind of evidence do you imagine would lead OpenAI leadership to change their minds/approach?
Do you / your-model-of-leadership believe that there exist Y-problems?
↑ comment by Jacob_Hilton · 2022-08-25T18:54:30.026Z · LW(p) · GW(p)
I don't think I understand your question about Y-problems, since it seems to depend entirely on how specific something can be and still count as a "problem". Obviously there is already experimental evidence that informs predictions about existential risk from AI in general, but we will get no experimental evidence of any exact situation that occurs beforehand. My claim was more of a vague impression about how OpenAI leadership and John tend to respond to different kinds of evidence in general, and I do not hold it strongly.
↑ comment by Joe Collman (Joe_Collman) · 2022-08-25T20:15:49.218Z · LW(p) · GW(p)
To rephrase, it seems to me that in some sense all evidence is experimental. What changes is the degree of generalisation/abstraction required to apply it to a particular problem.
Once we make the distinction between experimental and non-experimental evidence, then we allow for problems on which we only get the "non-experimental" kind - i.e. the kind requiring sufficient generalisation/abstraction that we'd no longer tend to think of it as experimental.
So the question on Y-problems becomes something like:
- Given some characterisation of [experimental evidence] (e.g. whatever you meant that OpenAI leadership would tend to put more weight on than John)...
- ...do you believe there are high-stakes problems for which we'll get no decision-relevant [experimental evidence] before it's too late?
↑ comment by Richard_Ngo (ricraz) · 2022-08-25T20:43:44.266Z · LW(p) · GW(p)
Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late.
I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd still be hard to convert that into a solution for alignment. Nor do I believe we'll see no empirical evidence of power-seeking behavior before it's too late (and I think opinions amongst alignment researchers are pretty divided on this question).
↑ comment by johnswentworth · 2022-08-25T22:06:22.398Z · LW(p) · GW(p)
I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd be very hard to convert that into a solution for alignment.
If I imagine that we magically had a boxing setup which let us experiment with powerful AGI alignment without dying, I do agree it would still be hard to solve alignment. But it wouldn't be harder than the core problems of any other field of science/engineering. It wouldn't be unusually hard, by the standards of technical research.
Of course, "empirical evidence of power-seeking behavior" is a lot weaker than a magical box. With only that level of empirical evidence, most of the "no empirical feedback" problem would still be present. More on that next.
Nor do I believe we'll see no empirical evidence of power-seeking behavior before it's too late (and I think opinions amongst alignment researchers are pretty divided on this question).
The key "lack of empirical feedback" property in Goodhart, deceptive alignment, hard left turn, get what you measure, etc, is this: for any given AI, it will look fine early on (e.g. in training or when optimization power is low) and then things will fall apart later on. If we are lucky enough to be in a very-slow-takeoff world, then an empirically-minded person might still notice that their AIs keep falling apart in deployment, and conclude that alignment is a problem. I don't put very high probability on that (partly because of the very-slow-takeoff assumption and partly because scenarios like getting what we measure don't necessarily look like a problem with the AI), but I buy it as a basically-plausible story.
But that doesn't really change the problem that much, for multiple reasons (any one of which is sufficient):
- Even if we put only a low probability on not getting a warning shot, we probably don't want to pursue a strategy in which humanity goes extinct if we don't get a fire alarm. Thinking we'll probably get a warning shot makes sense to me; relying on a warning shot while plowing ahead building AGI is idiotic. The downside is far too large for the risk to make sense unless we are unrealistically confident that there will definitely be a warning shot.
- Training AI in ways which will obviously incentivize it to hide problems (i.e. RLHF), and therefore make a warning shot less likely, is similarly foolish even if we think we'll probably get a warning shot.
- The failure mode I actually think is most likely: we do get a warning shot, and then people try to train away the problems until the problems cease to be visible. And that fails because of Goodhart, deception, hard left turn, getting what you measure, etc.
Psychologizing a bit: I suspect both RLHF and reliance on warning shots are symptoms of a more general cognitive pattern where people just don't believe in anything they can't see, and "iterate until we can't see any problem" is very much the sort of strategy I expect such people to use. (I believe it's also the strategy suggested by OP's phrase "empirical methods that rely on capabilities".) It's not just about "not getting empirical evidence" in terms of a warning shot, it's about not getting empirical evidence about alignment of any given powerful AGI until it's too late, and that problem interacts very poorly with a mindset where people iterate a lot and don't believe in problems they can't see.
That's the sort of problem which doesn't apply in most scientific/engineering fields. In most fields, "iterate until we can't see any problem" is a totally reasonable strategy. Alignment as a field is unusually hard because we can't use that strategy; the core failure modes we're worried about all involve problems which aren't visible until later on the AI at hand.
↑ comment by habryka (habryka4) · 2022-08-25T20:49:18.857Z · LW(p) · GW(p)
Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".
My model is that John is talking about "evidence on whether an AI alignment solution is sufficient", and you understood him to say "evidence on whether the AI Alignment problem is real/difficult". My guess is you both agree on the former, but I am not confident.
↑ comment by Richard_Ngo (ricraz) · 2022-08-25T21:20:49.846Z · LW(p) · GW(p)
Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).
I don't really know what "reliable empirical feedback" means in this context - if you have sufficiently reliable feedback mechanisms, then you've solved most of the alignment problem. But, out of the things John listed:
Goodhart problems in outer alignment, deception in inner alignment, phase change in hard takeoff, "getting what you measure" in slow takeoff
I expect that we'll observe a bunch of empirical examples of each of these things happening (except for the hard takeoff phase change), and not know how to fix them.
↑ comment by habryka (habryka4) · 2022-08-26T22:53:31.037Z · LW(p) · GW(p)
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).
I do really feel like it would have been really extremely hard to build rockets if we had to get it right on the very first try.
I think that for rockets, the fact that it is so costly to experiment with stuff explains the majority of the difficulty of rocket engineering. I agree you also have very little chance of building a successful space rocket without having a good understanding of Newtonian mechanics and some aspects of relativity, but I don't know; if I could just launch a rocket every day without bad consequences, I am pretty sure I wouldn't really need a deep understanding of either of those, or would easily figure out the relevant bits as I kept experimenting.
The reason why rocket science relies so much on having solid theoretical models is because we have to get things right in only a few shots. I don't think you really needed any particularly good theory to build trains for example. Just a lot of attempts and tinkering.
↑ comment by Richard_Ngo (ricraz) · 2022-08-29T09:45:54.440Z · LW(p) · GW(p)
At a sufficiently high level of abstraction, I agree that "cost of experimenting" could be seen as the core difficulty. But at a very high level of abstraction, many other things could also be seen as the core difficulty, like "our inability to coordinate as a civilization" or "the power of intelligence" or "a lack of interpretability", etc. Given this, John's comment seemed like mainly rhetorical flourishing rather than a contentful claim about the structure of the difficult parts of the alignment problem.
Also, I think that "on our first try" thing isn't a great framing, because there are always precursors (e.g. we landed a man on the moon "on our first try" but also had plenty of tries at something kinda similar). Then the question is how similar, and how relevant, the precursors are - something where I expect our differing attitudes about the value of empiricism to be the key crux.
↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-08-29T15:08:18.027Z · LW(p) · GW(p)
Well you could probably build a rocket that looks like it works, anyways. Could you build one you would want to try to travel to the moon in? (Are you imagining you get to fly in these rockets? Or just launch and watch from ground? I was imagining the 2nd...)
↑ comment by johnswentworth · 2022-08-25T22:12:56.311Z · LW(p) · GW(p)
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).
I basically buy that argument, though I do still think lack of shots is the main factor which makes alignment harder than most other technical fields in their preparadigmatic stage.
↑ comment by Roman Leventov · 2022-08-26T09:47:09.259Z · LW(p) · GW(p)
"Harder" can have two meanings: "the program (of design, and the proof) is longer" and "the program is less likely to be generated in the real world". These meanings are correlated, but not identical.
comment by habryka (habryka4) · 2022-08-25T20:58:20.135Z · LW(p) · GW(p)
The main group of people working on alignment (other than interpretability) at OpenAI at the time of the Anthropic split at the end of 2020 was the Reflection team, which has since been renamed to the Alignment team. Of the 7 members of the team at that time (who are listed on the summarization paper), 4 are still working at OpenAI, and none are working at Anthropic.
I think this is literally true, but at least as far as I know is not really conveying the underlying dynamics and so I expect readers to walk away with the wrong impression.
Again, I might be totally wrong here, but as far as I understand, the underlying dynamic is that there was a substantial contingent of people who worked at OpenAI because they cared about safety but worked in a variety of different roles, including many engineering roles. That contingent had pretty strong disagreements with leadership about a mixture of safety and other operating priorities (but I think mostly safety). Dario in particular had led a lot of the capabilities research and was dissatisfied with how the organization was run.
Dario left and founded Anthropic, taking a substantial amount of engineering and research talent with him (I don't know the details, but I've heard statements to the effect that he took 2/4 top engineers), and around the same time a substantial contingent of other people concerned about safety also left the organization, since I think they became much less optimistic about their ability to do safety research in the organization in the absence of Dario.
Some of them went to Anthropic, others went to Redwood, others went and did their own thing (e.g. Paul). Some previous OpenAI staff that had left earlier then joined Anthropic.
I think it is interesting that of the one team that was officially working on safety, nobody directly went to Anthropic (except Dario himself), but the above paragraph is, I think, failing to convey the degree to which there was a substantial exodus out of OpenAI into Anthropic, and a general exodus of safety-concerned people out of OpenAI.
↑ comment by Howie Lempel (howie-lempel-2) · 2022-08-29T12:53:33.472Z · LW(p) · GW(p)
[I privately wrote the following quick summary of some publicly-available information on (~safety-relevant) talent leaving OpenAI since the founding of Anthropic. Seems worth pasting here since it already exists but I'd have been more careful if I wrote it with public sharing in mind, it's not comprehensive, and I don't have time to really edit. I'd advise against updating too hard on it because:
- I basically don't have any visibility into OpenAI
- Inferences from LinkedIn often don't give a super accurate sense of somebody's contribution.
- I wrote down what I know about departures from OpenAI but didn't try to write up new hires in the same way.
- It's often impossible for people at orgs to talk publicly about personnel issues/departures so if Jacob/others don't correct me, it's not very strong evidence that nothing below is inaccurate/misleading.]
The main group of people working on alignment (other than interpretability) at OpenAI at the time of the Anthropic split at the end of 2020 was the Reflection team, which has since been renamed to the Alignment team. Of the 7 members of the team at that time (who are listed on the summarization paper), 4 are still working at OpenAI, and none are working at Anthropic.
Like Habryka, I believe it's literally true that nobody from the "Alignment team" left for Anthropic and 4/7 are still working at OpenAI. But it seems possible that things look different if you weight by seniority and account for potential contributions to OpenAI's attention to existential safety made by people who weren’t technical safety researchers, who were researchers on another team, etc.
Important: I don't know why the below people left OpenAI and their inclusion doesn't mean there's any bad blood between them or that they necessarily have criticisms of OpenAI's attitude toward safety.
If I understand correctly,
1 The alignment team lost its team lead (Paul).
2 Two senior people who weren’t counted as on the team but oversaw it or helped with its research direction left for Anthropic.
- VP of Safety and Policy (Daniela), whose LinkedIn says she oversaw the safety and policy teams
- VP of Research (Dario), who was the Team Lead for AI Safety before he got promoted and says he built and led several of their long-term safety teams, left for Anthropic. He was also an author on the summarization paper Jacob references. I'd guess that he continued to be a contributor to their AI safety work after being promoted.
3 The head of the interpretability team (Chris Olah), which is one of the other teams that seems most relevant to existential safety, left for Anthropic.
- (Jacob acknowledges this earlier in the post)
4 Other Anthropic co-founders who left OpenAI include
- Tom Brown (led the engineering of GPT-3)
- Sam McCandlish and Jared Kaplan (just a consultant), who I think led their scaling laws research? I think I heard Jared is leading an Anthropic alignment team? I think Sam M did a fellowship on the safety team before building the scaling laws team
5 Another person who worked on technical safety at OpenAI and left for Anthropic
- Tom Henighan was on the technical staff (safety team), but I guess not on the alignment team?
6 Several people on the policy team left for Anthropic including the director and two EAs who are interested in alignment.
- Policy Director, Jack Clark
- Danny Hernandez
- Amanda Askell
7 Another EA who I believe cares about alignment and left OpenAI for Anthropic:
- Nicholas Joseph
8 Other people I don’t know who left for Anthropic
- Kamal Ndousse
- Benjamin Mann. LinkedIn says he was on their security and safety working groups
9 Holden is no longer on OpenAI's board (though Helen Toner now is).
On the other hand, they’ve also hired some EAs who care about alignment since then. I believe examples include:
- Jan Leike, alignment team lead
- Richard Ngo, team lead(?) for futures subteam of the policy team
- Daniel Kokotajlo, futures subteam
- Surely others I don't know of or am leaving out
↑ comment by Jacob_Hilton · 2022-08-26T14:51:14.694Z · LW(p) · GW(p)
Without commenting on the specifics, I have edited to the post to mitigate potential confusion: "this fact alone is not intended to provide a complete picture of the Anthropic split, which is more complicated than I am able to explain here".
comment by lc · 2022-08-27T06:00:28.754Z · LW(p) · GW(p)
Here is a similar post one could make about a different company:
A friend of mine has recently encountered a number of people with misconceptions about their employer, Philip Morris International (PMI). Some common impressions are accurate, and others are not. He encouraged me to write a post intended to provide clarification on some of these points, to help people know what to expect from the organization and to figure out how to engage with it. It is not intended as a full explanation or evaluation of Philip Morris's strategy.
Common accurate impressions
- Philip Morris International is the world's largest producer of cigarettes.
- The majority of employees at Philip Morris International work on tobacco production and marketing for the developing world.
- The majority of Philip Morris International's employees did not join with the primary motivation of reducing harm from tobacco smoke specifically.
PMI is the largest tobacco company in the world when measuring by market capitalization or revenue. PMI has six multibillion US$ brands and ships tens of billions of units to (in order of volume) southeast Asia, the European Union, the Middle East and Africa, Eastern Europe, The Americas, and East Asia & Australia.
Common misconceptions
Incorrect: PMI is not working on alternatives to cigarettes.
Philip Morris International is one of the largest funders in the world of research and development on smoke-free products. PMI established and supported the Foundation for a Smoke-Free World in 2017, and pledged to provide it with over one billion dollars over the next twelve years. By 2018 it had spent several billion dollars on developing alternatives like iQOS through its research house in Switzerland, and all told since its inception PMI has spent over $9 billion investing in research on smoke-free alternatives.
Incorrect: Most people who were working on cigarette alternatives left during the Altria Group/PMI split.
Until a spin-off in March 2008, Philip Morris International was an operating company of Altria. Altria explained the spin-off by arguing PMI would have more "freedom" outside the responsibilities and standards of American corporate ownership in terms of potential litigation and legislative restrictions to "pursue sales growth in emerging markets", while Altria focuses on the American domestic market. PMI kept the majority of its international research houses during the split.
Incorrect: PMI leadership is dismissive of risks caused by smoking.
Since the public outcry in the 90s, Philip Morris spinoffs have been very public about their acknowledgement of the health problems that cigarettes cause. You can tell they care about their customers because their website is basically ~80% dedicated to their publicly declared quest to end smoking and reach a target of 50% of revenue coming from smokeless tobacco products by 2025.
↑ comment by Ben Pace (Benito) · 2022-08-27T17:28:49.810Z · LW(p) · GW(p)
I found this comment helpful for me as I was trying to understand AI labs' roles in all this. Please consider retracting the retraction :)
↑ comment by lc · 2022-08-28T11:38:49.748Z · LW(p) · GW(p)
Now that I have your blessing I shall do that! I was mostly worried cause I have a history of making unhelpfully aggressive AI safety-related comments and I didn't want moderators to get frustrated with me again (which, to be clear, so far has happened only for very understandable reasons).
↑ comment by Slider · 2022-08-27T16:20:26.149Z · LW(p) · GW(p)
The parent seems to be retracted, but I wish to express that the satire angle did give quite a clear picture of some dynamics that could otherwise get watered down to the point of irrelevance. With its length and intensity it might have been unfriendlier than it needed to be.
So, in brief and abstract terms: an oil company that promises carbon reductions out of social responsibility can be facing a conflict of interests, and might not be pushing in both directions with the same gusto.
So with an organisation both making AI happen and not happen, the left hand spinning what the right hand is doing is relatively likely.
↑ comment by Jacob_Hilton · 2022-08-27T08:30:00.246Z · LW(p) · GW(p)
I obviously think there are many important disanalogies, but even if there weren't, rhetoric like this seems like an excellent way to discourage OpenAI employees from ever engaging with the alignment community, which seems like a pretty bad thing to me.
↑ comment by gadyp (gadypdj) · 2022-08-27T13:14:07.858Z · LW(p) · GW(p)
I'd agree if somebody else wrote what you wrote but I don't think it's appropriate for you as an OpenAI employee to say that.
↑ comment by Jacob_Hilton · 2022-08-27T15:26:35.695Z · LW(p) · GW(p)
Thank you for causing me to reconsider. I should have said "other OpenAI employees". I do not intend to disengage from the alignment community because of critical rhetoric, and I apologize if my comment came across as a threat to do so. I am concerned about further breakdown of communication between the alignment community and AI labs where alignment solutions may need to be implemented.
I don't immediately see any other reason why my comment might have been inappropriate, but I welcome your clarification if I am missing something.
↑ comment by gadyp (gadypdj) · 2022-08-28T23:40:23.776Z · LW(p) · GW(p)
Thanks for the clarification.
comment by Adam Scholl (adam_scholl) · 2022-08-26T11:23:02.984Z · LW(p) · GW(p)
Incorrect: OpenAI leadership is dismissive of existential risk from AI.
Why, then, would they continue to build the technology which causes that risk? Why do they consider it morally acceptable to build something which might well end life on Earth?
↑ comment by paulfchristiano · 2022-08-27T19:41:09.947Z · LW(p) · GW(p)
A common view is that the timelines to risky AI are largely driven by hardware progress and deep learning progress occurring outside of OpenAI. Many people (both at OpenAI and elsewhere) believe that questions of who builds AI and how are very important relative to acceleration of AI timelines. This is related to lower estimates of alignment risk, higher estimates of the importance of geopolitical conflict, and (perhaps most importantly of all) radically lower estimates for the amount of useful alignment progress that would occur this far in advance of AI if progress were to be slowed down. Below I'll also discuss two arguments, which I often encountered at OpenAI, that delaying AI progress would not on net reduce alignment risk.
I think that OpenAI has had a meaningful effect on accelerating AI timelines and that this was a significant cost that the organization did not adequately consider (plenty of safety-focused folk pushed back on various accelerating decisions and this is ultimately related to many departures though not directly my own). I also think that OpenAI is significantly driven by the desire to do something impactful and to reap the short-term benefits of AI. In significant part that's about wanting to be involved in altruistic benefits (though it's also based on a more basic and generally scary desire to just do something impactful). I think that OpenAI folks' views on altruistic benefits are based on some claims I agree with about possible impacts, but also on them caring less than I do about future generations and by having what I regard as mistaken empirical views (which partly persist because many folks have underinvested in careful thinking about the future).
That said, I think that the LW community significantly overestimates the negative impact of OpenAI's timeline-accelerating effects to date, and I suspect that these do not dominate their net impacts (neither do the claims about disrupting a relatively flimsy "only DeepMind works on AGI" equilibrium). That still leaves room for debate about whether the other impacts are positive or negative.
It's worth being aware of some common arguments that acceleration is less bad than it looks or even net positive:
- I think it's basically reasonable to think that MIRI and the broader AI safety community made very little meaningful progress over the last 10 years, and to have the view that the overwhelmingly dominant drivers of accelerating alignment progress have been and will continue to be increased interest and investment as AI improves (this seems wrong to me in large part because the AI safety community and EA community more broadly have been growing independent of increased interest in AI). If that were the case, then cutting one month off of AI timelines does not have much direct effect on our ability to manage AI risk via giving us more time for alignment research, and the calculus is instead dominated by whether other trends in the world are positive or negative (e.g. how much do you think general institutional capacity is improving vs deteriorating over the coming decades, how worried are you about the rise of China relative to the west, etc.)
- Another fairly common argument and motivation at OpenAI in the early days was the risk of "hardware overhang," that slower development of AI would result in building AI with less hardware at a time when they can be more explosively scaled up with massively disruptive consequences. I think that in hindsight this effect seems like it was real, and I would guess that it is larger than the entire positive impact of the additional direct work that would be done by the AI safety community if AI progress had been slower 5 years ago. I think the LW community considers this argument non-serious, but in my opinion (and I expect the judgment of most independent observers) the empirical track record of this community and Eliezer on the relevant AI forecasts seems bad enough that no one should be taking community consensus on that point as a source of independent evidence.
I think that both of those arguments were significantly more plausible in the past and particularly before the release of GPT-3, though I still think they were wrong and likely in significant part the result of motivated cognition (or more realistically memetic and political selection within OpenAI and the adjacent communities).
At this point I think it's fairly clear that if OpenAI were focused on making the long-term future good they should not be disclosing or deploying improved systems (and it seems likely to me that they should not even be developing them), so the main point of debate is exactly how bad it is. I think it's less obvious whether it is good or bad on a certain kind of myopic altruism since I'd guess that the cost of 1 year of acceleration is less than a 1% reduction in survival probability (while ~1% of people die each year and people might reasonably value the profound suffering that occurs over a single year at 1% of survival).
Overall I think the LW community tends to be kind of deontological about this, and that when they do make quantitative estimates, those tend to be at best debatable (and wildly overconfident and aggressive). I'd guess these overall decrease the efficiency of the LW community as a good influence on labs or force for good in the world.
Replies from: Lanrian, gadypdj, Chris_Leong↑ comment by Lukas Finnveden (Lanrian) · 2022-08-29T23:39:17.767Z · LW(p) · GW(p)
Another fairly common argument and motivation at OpenAI in the early days was the risk of "hardware overhang," that slower development of AI would result in building AI with less hardware at a time when they can be more explosively scaled up with massively disruptive consequences. I think that in hindsight this effect seems like it was real, and I would guess that it is larger than the entire positive impact of the additional direct work that would be done by the AI safety community if AI progress had been slower 5 years ago.
Could you clarify this bit? It sounds like you're saying that OpenAI's capabilities work around 2017 was net-positive for reducing misalignment risk, even if the only positive we count is this effect. (Unless you think that there's substantial reason that acceleration is bad other than giving the AI safety community less time.) But then in the next paragraph you say that this argument was wrong (even before GPT-3 was released, which roughly corresponds to the "around 2017" period). I don't see how those are compatible.
Replies from: paulfchristiano↑ comment by paulfchristiano · 2022-08-30T14:51:26.079Z · LW(p) · GW(p)
One positive consideration is: AI will be built at a time when it is more expensive (slowing later progress). One negative consideration is: there was less time for AI-safety-work-of-5-years-ago. I think that this particular positive consideration is larger than this particular negative consideration, even though other negative considerations are larger still (like less time for growth of AI safety community).
Replies from: lc↑ comment by lc · 2022-08-30T18:56:57.348Z · LW(p) · GW(p)
Are you saying that the AI safety community gets less effective at advancing SOTA interpretability/etc. as it gets more funding/interest, or that the negative consideration is the fact that the AI safety community has had less time to grow, or something else? It seems odd to me that AI safety research progress would be negatively correlated with the size of the field and the amount of volunteer hours in it, though I can imagine reasons why someone would think that.
Replies from: paulfchristiano↑ comment by paulfchristiano · 2022-08-30T19:45:35.519Z · LW(p) · GW(p)
I'm saying that faster progress gives less time for the AI safety community to grow. (I added "less time for" to the original comment to clarify.)
Replies from: lc↑ comment by gadyp (gadypdj) · 2022-08-28T23:34:34.157Z · LW(p) · GW(p)
A common view is that the timelines to risky AI are largely driven by hardware progress and deep learning progress occurring outside of OpenAI.
What's the justification for this view? It seems like significant deep learning progress happens inside of OpenAI.
Many people (both at OpenAI and elsewhere) believe that questions of who builds AI and how are very important relative to acceleration of AI timelines.
If who builds AI is such an important question for OpenAI, then why would they publish capabilities research, thus giving up the majority of control over who builds AI and how?
At this point I think it's fairly clear that if OpenAI were focused on making the long-term future good they should not be disclosing or deploying improved systems (and it seems most likely they should not even be developing them), so the main point of debate is exactly how bad it is.
To a layman, it seems like they're on track to deploy GPT-4 as well as publish all the capabilities research related to that soon. Is there any reason to hope they won't be doing that?
~1% of people die each year and people might reasonably value the profound suffering that occurs over a single year at 1% of survival).
How is the harm caused by 1% of people dying even remotely equivalent to a 1% reduction in survival, even without considering the value lost in the future lightcone?
It seems highly doubtful to me that OpenAI's dedication to doing and publishing capabilities research is a deliberate choice to accelerate timelines due to their deep philosophical adherence to myopic altruism.
I don't think they would be doing this if they actually thought they were increasing p(doom) by 1% (which is already an optimistic estimate) per 1 year acceleration of timelines - a much simpler explanation is that they're at least somewhat longtermist (like most humans) but they don't really think there's a significant p(doom) (at least the capabilities researchers and the leadership team).
Replies from: lcmgcd↑ comment by lemonhope (lcmgcd) · 2022-08-29T09:33:27.479Z · LW(p) · GW(p)
I think Paul was speaking in the third person for parts of it, which you may not have realized
↑ comment by Chris_Leong · 2022-09-17T13:12:52.626Z · LW(p) · GW(p)
This seems wrong to me in large part because the AI safety community and EA community more broadly have been growing independent of increased interest in AI
Agreed, this is one of the biggest considerations missed, in my opinion, by people who think accelerating progress was good. (TBH, if anyone was attempting to accelerate progress to reduce AI risk, I think that they were trying to be too clever by half; or just rationalising).
comment by GeneSmith · 2022-08-28T04:48:19.073Z · LW(p) · GW(p)
OpenAI's continued practice of publishing the blueprints allowing others to create more powerful models seems to undermine their claims that they are worried about "bad actors getting there first".
If you were a scientist working on the Manhattan project because you were worried about Hitler getting the atomic bomb first, you wouldn't send your research on centrifuge design to German research scientists. Yet every company that claims they are more likely than other groups to create safe AGI continues to publish the blueprints for creating AGI to the open web.
Is there any actual justification for this other than "The prestige of getting published in top journals makes us look impressive?"
Replies from: lcmgcd↑ comment by lemonhope (lcmgcd) · 2022-08-29T09:35:15.083Z · LW(p) · GW(p)
Makes you wonder who is developing secret AGI as we speak. One might assume that there is 10x more secret research (and researchers?) than meets the eye
comment by Thomas Larsen (thomas-larsen) · 2022-08-25T16:39:12.259Z · LW(p) · GW(p)
Incorrect: OpenAI is not aware of the risks of race dynamics.
OpenAI's Charter contains the following merge-and-assist clause: "We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”"
Being worried about race dynamics and then stopping at the last minute makes sense and seems a lot better than nothing. But I'm confused why this understanding doesn't propagate to other beliefs/actions.
Specifically, below are some confusions I have with OpenAI's worldview. If answered, these could give me a lot more hope in OpenAI's direction.
- How will you know that a project has a >50% chance of successfully building AGI in the next two years? MIRI certainly seems to think this is hard.
- How does OpenAI leadership feel about accelerating timelines? [1]
- What are OpenAI leadership's timelines right now? What are these timelines based off of?
- Does OpenAI retroactively think that publishing that GPT-3 worked was a mistake? [2][3]
- Will OpenAI publish GPT-4? What factors are driving this decision?
[1] On my models, we want to know as much about alignment as possible before we get close to AGI, and so it is incredibly important to have as much time as possible before we are close to AGI. I would much rather live in the world where we have 20 years to solve AI alignment than the world where we only have 10.
[2] If the benefits are using GPT-3 to do alignment research, why not give it to just alignment researchers, and not tell anyone else?
[3] Again, according to my current worldview, actions such as releasing GPT-3 are extremely negative, because it tells everyone that LLMs work and thus accelerates capabilities and therefore also shortens timelines.
↑ comment by Gurkenglas · 2022-08-26T22:02:05.636Z · LW(p) · GW(p)
If the purpose of the merge-and-assist clause is to prevent a race dynamic, then it's sufficient for that clause to trigger when OpenAI would otherwise decide to start racing. They can interpret their own decision-making, right? Right?
comment by Zack_M_Davis · 2022-08-25T15:39:30.996Z · LW(p) · GW(p)
merge-and-assist clause [...] we commit to stop competing with and start assisting this project
So, if you don't think AI should be open (because that looks dangerous), has anyone considered just ... changing the name? (At least, the name of the organization, even if the "OpenAI API" as a product has the string openai embedded in the code too much.) Yeah, it's inconvenient, but ... Alphabet did it! Meta did it! If you're trying to make the most important event in the history of life go well, isn't it worth a little inconvenience to be clear about what that entails?
↑ comment by lemonhope (lcmgcd) · 2022-08-29T09:24:34.281Z · LW(p) · GW(p)
How's it gonna go over if they start calling it closed ai?
Replies from: Zack_M_Davis↑ comment by Zack_M_Davis · 2022-08-30T04:25:04.329Z · LW(p) · GW(p)
So call it something else. GoodAI. OpalAI. BeneficiAI. ThoroughAI.
Replies from: jkaufman↑ comment by jefftk (jkaufman) · 2022-09-02T19:15:14.283Z · LW(p) · GW(p)
Or OpEnAi: Optimally Envisioning AI. Then the code can still say openai.
comment by Adam Scholl (adam_scholl) · 2022-08-26T11:07:34.344Z · LW(p) · GW(p)
Incorrect: OpenAI is not aware of the risks of race dynamics.
I don't think this is a common misconception. I, at least, have never heard anyone claim OpenAI isn't aware of the risk of race dynamics—just that it nonetheless exacerbates them. So I think this section is responding to a far dumber criticism than the one which people actually commonly make.
comment by Larks · 2022-08-25T19:01:20.063Z · LW(p) · GW(p)
Alignment research: 30
Could you share some breakdown for what these people work on? Does this include things like the 'anti-bias' prompt engineering?
Replies from: Jacob_Hilton↑ comment by Jacob_Hilton · 2022-08-25T19:24:06.718Z · LW(p) · GW(p)
It includes the people working on the kinds of projects I listed under the first misconception. It does not include people working on things like the mitigation you linked to. OpenAI distinguishes internally between research staff (who do ML and policy research) and applied staff (who work on commercial activities), and my numbers count only the former.
Replies from: habryka4, Larks↑ comment by habryka (habryka4) · 2022-08-25T20:41:44.331Z · LW(p) · GW(p)
WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on, with no (to me) obvious safety relevance, so my guess is I would still mostly categorize the things you list under the first misconception as capabilities research. InstructGPT also seems to be almost fully capabilities research (like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons)
(Edit: My current guess for full-time equivalents who are doing safety work at OpenAI (e.g. if someone is doing 50% work that a researcher fully focused on capabilities would do and 50% on alignment work, then we count them as 0.5 full-time equivalents) is around 10, maybe a bit less, though I might be wrong here.)
Replies from: Jacob_Hilton, neel-nanda-1, neel-nanda-1, conor-sullivan↑ comment by Jacob_Hilton · 2022-08-25T21:19:58.323Z · LW(p) · GW(p)
I was the project lead on WebGPT and my motivation was to explore ideas for scalable oversight and truthfulness (some further explanation is given here [LW · GW]).
Replies from: sharmake-farah↑ comment by Noosphere89 (sharmake-farah) · 2022-08-26T00:01:59.076Z · LW(p) · GW(p)
The real question for Habryka is: why does he think it's bad for WebGPT to be built in order to get truthful AI? Like, isn't solving that problem quite a significant thing already for alignment?
Replies from: habryka4, Quadratic Reciprocity↑ comment by habryka (habryka4) · 2022-08-26T05:18:19.527Z · LW(p) · GW(p)
WebGPT is approximately "reinforcement learning on the internet".
There are some very minimal safeguards implemented (search via Bing API, but the AI can click on arbitrary links), but I do indeed think "reinforcement learning on the internet" is approximately the worst direction for modern AI to go in terms of immediate risks.
I don't think connecting GPT-3 to the internet is risky at current capability levels, but pushing AI in the direction of just hooking up language models with reinforcement learning to a browser seems like one of the worst directions for AI to go. And my guess is the majority of the effect of this research will be to cause more people to pursue this direction in the future (Adept.AI seems to be pursuing a somewhat similar approach).
Edit: Jacob does talk about this a bit in a section I had forgotten about in the truthful LM post:
Another concern is that working on truthful LMs may lead to AI being "let out of the box" by encouraging research in which models interact with the external world agentically, in the manner of WebGPT.
I think this concern is worth taking seriously, but that the case for it is weak:
- As AI capabilities improve, the level of access to the external world required for unintended model behavior to cause harm goes down. Hence access to the external world needs to be heavily restricted in order to have a meaningful safety benefit, which imposes large costs on research that are hard to justify.
- I am in favor of carefully and conservatively evaluating the risks of unintended model behavior before conducting research, and putting in place appropriate monitoring. But in the short term, this seems like an advantage of the research direction rather than a disadvantage, since it helps surface risks while the stakes are still low, build institutional capacity for evaluating and taking into account these risks, and set good precedents.
- In case this does turn out to be more of a concern upon reflection, there are other approaches to truthful AI that involve less agentic interaction with the external world than continuing in the style of WebGPT.
There is still an argument that there will be a period during which AI is capable enough to cause serious damage, but not capable enough to escape from sandboxed environments, and that setting precedents could worsen the risks posed during this interval. I don't currently find this argument persuasive, but would be interested to hear if there is a more persuasive version of it. That said, one bright line that stands out is training models to perform tasks that actually require real-world side effects, and I think it makes sense to think carefully before crossing that line.
I don't think I would phrase the problem as "letting the AI out of the box" and more "training an AI in a context where agency is strongly rewarded and where there are a ton of permanent side effects".
I find the point about "let's try to discover the risky behavior as early as possible" generally reasonable, and am in favor of doing this kind of work now instead of later, but I think in that case we need to put in quite strong safeguards and make it very clear that quite soon we don't want to see more research like this, and I don't think the WebGPT work got that across.
I don't understand this point at all:
As AI capabilities improve, the level of access to the external world required for unintended model behavior to cause harm goes down. Hence access to the external world needs to be heavily restricted in order to have a meaningful safety benefit, which imposes large costs on research that are hard to justify.
This says to me "even very little access to the external world will be sufficient for capable models to cause great harm, so we have to restrict access to the external world a lot, therefore... we won't do that because that sounds really inconvenient for my research". Like, yes, more direct access to the external world is one of the most obvious ways AIs can cause more harm and learn more agentic behavior. Boxing is costly.
The primary job of OpenAI is to be a clear leader here and do the obvious good things to keep an AI safe, which will hopefully include boxing it. Saying "well, seems like the cost is kinda high so we won't do it" seems like exactly the kind of attitude that I am worried will cause humanity to go extinct.
Separately, there is also no other research I am aware of that is training AI as directly on access to the internet (except maybe at Adept.ai), so I don't really buy that currently at the margin the cost of avoiding research like this would be very high, either for capabilities or safety.
But I might also be completely misunderstanding this section. I also don't really understand why you only get a safety benefit when you restrict access a lot. Seems like you also get a safety benefit earlier, by just making it harder for the AI to build a good model of the external world and to learn heuristics for manipulating people, etc.
There is still an argument that there will be a period during which AI is capable enough to cause serious damage, but not capable enough to escape from sandboxed environments, and that setting precedents could worsen the risks posed during this interval.
I mean, isn't this the mainline scenario of most prosaic AI Alignment research? A lot of the current plans for AI Alignment consist of taking unaligned AIs, boxing them, and then trying to use them to do better AI Alignment research despite them being somewhat clearly unaligned, but unable to break out of the box.
Replies from: paulfchristiano, lc↑ comment by paulfchristiano · 2022-08-26T17:15:38.245Z · LW(p) · GW(p)
The primary job of OpenAI is to be a clear leader here and do the obvious good things to keep an AI safe, which will hopefully include boxing it. Saying "well, seems like the cost is kinda high so we won't do it" seems like exactly the kind of attitude that I am worried will cause humanity to go extinct.
- When you say "good things to keep an AI safe" I think you are referring to a goal like "maximize capability while minimizing catastrophic alignment risk." But in my opinion "don't give your models access to the internet or anything equally risky" is a bad way to make that tradeoff. I think we really want dumber models doing more useful things, not smarter models that can do impressive stuff with less resources. You can get a tiny bit of safety by making it harder for your model to have any effect on the world, but at the cost of significant capability, and you would have been better off just using a slightly dumber model with more ability to do stuff. This effect is much bigger if you need to impose extreme limitations in order to get any of this "boxing benefit" (as claimed by the quote you are objecting to).
- I assume the harms you are pointing to here are about setting expectations+norms about whether AI should interact with the world in a way that can have effects. But people attempting to box smart unaligned AIs, or believing that boxed AIs are significantly safer because they can't access the internet, seems to me like a bad situation. An AI smart enough to cause risk with internet access is very likely to be able to cause risk anyway, and at best you are creating a super unstable situation where a lab leak is catastrophic. So the possible norms you are gesturing at preserving seem like they are probably net negative to me, because the main effect of these norms is on how strong an AI has to be before people consider it dangerous, not the relevance to an alignment strategy that (to me) doesn't seem very workable.
- I think the argument "don't do things that could lead to low-stakes failures because then people will get in the habit of allowing failure" is sometimes right but often wrong. I think examples of reward hacking in a successor to WebGPT would have a large effect on reducing risk via learning and example, while having essentially zero direct costs. You say this requires "strong protections" presumably to avoid the extrapolation out to catastrophe, but you don't really get into any quantitative detail and when I think about the numbers on this it looks like the net effect is positive. I don't think this is close to the primary benefit of this work, but I think it's already large enough to swamp the costs. I think the story would be way different if the actual risk posed by WebGPT was meaningful (say if it were driving >0.1% of the risk of OpenAI's activities).
- I believe the most important drivers of catastrophic misalignment risk are models that optimize in ways humans don't understand or are deceptively aligned. So the great majority of risk comes from actions that accelerate those events, and especially making models smarter. I think your threat model here is quantitatively wrong, and that it's an important disagreement.
- There is a lot of disagreement about what alignment research is useful. For example, much of the work I consider useful you consider ~useless, and much of the work you consider useful I consider ~useless. But I think the more interesting disagreement is whether the work helped, and focusing on net negativeness seems rhetorically relevant but not very relevant to the cost-benefit analysis. This is related to the last point. If you thought that researchers working on WebGPT were shortening timelines significantly more efficiently than the average AI researcher, then the direct harm starts to become relevant compared to opportunity costs. (This is also related to the argument here [LW · GW] which I disagree with very strongly, but might be the kind of intuition you are drawing on.)
- I don't think "your AI wants to kill you but it can't get out of the box so it helps you with alignment instead" is the mainline scenario. You should be building an AI that wouldn't stab you if your back was turned and it was holding a knife, and if you can't do that then you should not build the AI. I believe all the reasonable approaches to prosaic AI alignment involve avoiding that situation, and the question is how well you succeed. I agree you want defense in depth and so you should also not give your AI a knife while you are looking away, but again (i) web access is really weaksauce compared to the opportunities you want to give your AI in order for it to be useful, and "defense in depth" doesn't mean compromising on protections that actually matter like intelligence in order to buy a tiny bit of additional security, (ii) "make sure all your knives are plastic" is a pretty lame norm that is more likely to make it harder to establish clarity about risks than to actually help once you have AI systems who would stab you if they got the chance.
- It's very plausible the core disagreement here may be something like "how useful is it for safety if people try to avoid giving their AI access to the internet." It's possible that after thinking more about the argument that this is useful I might change my mind. I don't know if you have any links to someone making this argument. I think there are more and less useful forms of boxing and "your AI can't browse the internet" is one of the forms that makes relatively little sense. (I think the versions that make most sense are way more in the weeds about details of the training setup.) I think that many kinds of improvements in security and thoughtfulness about training setup make much more sense (though are mostly still lower-order terms).
- I think the best argument for work like WebGPT having harms in the same order of magnitude as opportunity cost is that it's a cool thing to do with AI that might further accelerate interest in the area. I'm much less sure how to think about these effects and I could imagine it carrying the day that publication of WebGPT is net negative. (Though this is not my view.)
To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn't have happened.
Replies from: habryka4, habryka4, habryka4, habryka4, habryka4, capybaralet, daniel-kokotajlo↑ comment by habryka (habryka4) · 2022-08-26T22:26:10.315Z · LW(p) · GW(p)
If you thought that researchers working on WebGPT were shortening timelines significantly more efficiently than the average AI researcher, then the direct harm starts to become relevant compared to opportunity costs.
Yeah, my current model is that WebGPT feels like some of the most timelines-reducing work that I've seen (as has most of OpenAIs work). In-general, OpenAI seems to have been the organization that has most shortened timelines in the last 5 years, with the average researcher seeming ~10x more efficient at shortening timelines than even researchers at other AGI companies like Deepmind, and probably ~100x more efficient than researchers at most AI research organizations (like Facebook AI).
WebGPT strikes me as on the worse side of OpenAI capabilities research in terms of accelerating timelines (since I think it pushes us into a more dangerous paradigm that will become dangerous earlier, and because I expect it to be the kind of thing that could very drastically increase economic returns from AI). And then it also has the additional side-effect of pushing us into a paradigm of AIs that are much harder to align and so doing alignment work in that paradigm will be slower (as has, I think, a bunch of the RLHF work, though there is a more reasonable case for a commensurate benefit there in terms of the technology also being useful for AI Alignment).
Replies from: paulfchristiano, Aidan O'Gara↑ comment by paulfchristiano · 2022-08-27T00:17:28.969Z · LW(p) · GW(p)
I think almost all of the acceleration comes from either products that generate $ and hype and further investment, or more directly from scaleup to more powerful models. I think "We have powerful AI systems but haven't deployed them to do stuff they are capable of" is a very short-term kind of situation and not particularly desirable besides.
I'm not sure what you are comparing RLHF or WebGPT to when you say "paradigm of AIs that are much harder to align." I think I probably just think this is wrong, in that (i) you are comparing to pure generative modeling but I think that's the wrong comparison point barring a degree of coordination that is much larger than what is needed to avoid scaling up models past dangerous thresholds, (ii) I think you are wrong about the dynamics of deceptive alignment under existing mitigation strategies and that scaling up generative modeling to the point where it is transformative is considerably more likely to lead to deceptive alignment than using RLHF (primarily via involving much more intelligent models).
↑ comment by aogara (Aidan O'Gara) · 2022-09-01T16:35:16.518Z · LW(p) · GW(p)
Something I learned today that might be relevant: OpenAI was not the first organization to train transformer language models with search engine access to the internet. Facebook AI Research released their own paper on the topic six months before WebGPT came out, though the paper is surprisingly uncited by the WebGPT paper.
Generally I agree that hooking language models up to the internet is terrifying, despite the potential improvements for factual accuracy. Paul's arguments seem more detailed on this and I'm not sure what I would think if I thought about them more. But the fact that OpenAI was following rather than leading the field would be some evidence against WebGPT accelerating timelines.
Replies from: habryka4↑ comment by habryka (habryka4) · 2022-09-01T20:46:42.064Z · LW(p) · GW(p)
I did not know!
However, I don't think this is really the same kind of reference class in terms of risk. It looks like the search engine access for the Facebook case is much more limited and basically just consisted of them appending a number of relevant documents to the query, instead of the model itself being able to send various commands that include starting new searches and clicking on links.
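To make the contrast concrete, here is a minimal sketch of the two interaction patterns as I understand them (all function and method names are hypothetical, not taken from either paper):
```python
# Illustrative sketch only; the interfaces are made up for clarity.

def retrieval_append_answer(dialogue_context, search_api, lm):
    """FAIR-style: generate one search query, append the returned documents, answer."""
    query = lm.generate("Search query for: " + dialogue_context)
    documents = search_api.search(query)           # black-box search API returns N documents
    prompt = "\n".join(documents) + "\n" + dialogue_context
    return lm.generate(prompt)                     # a single pass over a fixed context

def command_loop_answer(question, browser, lm, max_steps=20):
    """WebGPT-style: the model itself issues browsing commands in a loop."""
    observation = browser.reset(question)
    for _ in range(max_steps):
        command = lm.generate(observation)         # e.g. "Search ...", "Click link 3", "Quote ...", "Answer: ..."
        if command.startswith("Answer"):
            return command
        observation = browser.execute(command)     # the model decides which pages to visit next
    return lm.generate(observation + "\nAnswer:")
```
The difference in risk profile I'm pointing at is that the second setup trains the model to choose its own sequence of actions against the live web, rather than to condition on a fixed set of retrieved documents.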
Replies from: gwern↑ comment by gwern · 2022-09-01T21:58:51.909Z · LW(p) · GW(p)
It does generate the query itself, though:
A search query generator: an encoder-decoder Transformer that takes in the dialogue context as input, and generates a search query. This is given to the black-box search engine API, and N documents are returned.
Replies from: habryka4↑ comment by habryka (habryka4) · 2022-09-02T08:57:49.144Z · LW(p) · GW(p)
Does it itself generate the query, or is it a separate trained system? I was a bit confused about this in the paper.
Replies from: gwern↑ comment by gwern · 2022-09-02T16:29:47.522Z · LW(p) · GW(p)
You'd think they'd train the same model weights and just make it multi-task with the appropriate prompting, but no, that phrasing implies that it's a separate finetuned model, to the extent that that matters. (I don't particularly think it does matter because whether it's one model or multiple, the system as a whole still has most of the same behaviors and feedback loops once it gets more access to data or starts being trained on previous dialogues/sessions - how many systems are in your system? Probably a lot, depending on your level of analysis. Nevertheless...)
↑ comment by habryka (habryka4) · 2022-08-26T22:30:18.738Z · LW(p) · GW(p)
But people attempting to box smart unaligned AIs, or believing that boxed AIs are significantly safer because they can't access the internet, seems to me like a bad situation. An AI smart enough to cause risk with internet access is very likely to be able to cause risk anyway, and at best you are creating a super unstable situation where a lab leak is catastrophic.
I do think we are likely to be in a bad spot, and talking to people at OpenAI, Deepmind and Anthropic (e.g. the places where most of the heavily-applied prosaic alignment work is happening), I do sure feel unhappy that their plan seems to be to be banking on this kind of terrifying situation, which is part of why I am so pessimistic about the likelihood of doom.
If I had a sense that these organizations are aiming for a much more comprehensive AI Alignment solution that doesn't rely on extensive boxing I would agree with you more, but I am currently pretty sure they aren't ensuring that, and by default will hope that they can get far enough ahead with boxing-like strategies.
Replies from: rohinmshah↑ comment by Rohin Shah (rohinmshah) · 2022-08-30T09:54:04.168Z · LW(p) · GW(p)
talking to people at OpenAI, Deepmind and Anthropic [...]
If I had a sense that these organizations are aiming for a much more comprehensive AI Alignment solution that doesn't rely on extensive boxing I would agree with you more, but I am currently pretty sure they aren't ensuring that, and by-default will hope that they can get far enough ahead with boxing-like strategies.
... Who are you talking to? I'm having trouble naming a single person at either of OpenAI or Anthropic who seems to me to be interested in extensive boxing (though admittedly I don't know them that well). At DeepMind there's a small minority who think about boxing, but I think even they wouldn't think of this as a major aspect of their plan.
I agree that they aren't aiming for a "much more comprehensive AI alignment solution" in the sense you probably mean it but saying "they rely on boxing" seems wildly off.
My best-but-still-probably-incorrect guess is that you hear people proposing schemes that seem to you like they will obviously not work in producing intent aligned systems and so you assume that the people proposing them also believe that and are putting their trust in boxing, rather than noticing that they have different empirical predictions about how likely those schemes are to produce intent aligned systems.
Replies from: habryka4↑ comment by habryka (habryka4) · 2022-09-01T20:57:24.830Z · LW(p) · GW(p)
Here is an example quote from the latest OpenAI blogpost on AI Alignment:
Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. To do alignment research they don’t need unrestricted access to the internet. Yet a lot of alignment research tasks can be phrased as natural language or coding tasks.
This sounds super straightforwardly to me like the plan of "we are going to train non-agentic AIs that will help us with AI Alignment research, and will limit their ability to influence the world, by e.g. not giving them access to the internet". I don't know whether "boxing" is the exact right word here, but it's the strategy I was pointing to here.
Replies from: rohinmshah↑ comment by Rohin Shah (rohinmshah) · 2022-09-02T10:29:43.206Z · LW(p) · GW(p)
The immediately preceding paragraph is:
Importantly, we only need “narrower” AI systems that have human-level capabilities in the relevant domains to do as well as humans on alignment research. We expect these AI systems are easier to align than general-purpose systems or systems much smarter than humans.
I would have guessed the claim is "boxing the AI system during training will be helpful for ensuring that the resulting AI system is aligned", rather than "after training, the AI system might be trying to pursue its own goals, but we'll ensure it can't accomplish them via boxing". But I can see your interpretation as well.
Replies from: habryka4↑ comment by habryka (habryka4) · 2022-09-03T00:05:03.793Z · LW(p) · GW(p)
Oh, I do think a bunch of my problems with WebGPT is that we are training the system on direct internet access.
I agree that "train a system with internet access, but then remove it, then hope that it's safe", doesn't really make much sense. In-general, I expect bad things to happen during training, and separately, a lot of the problems that I have with training things on the internet is that it's an environment that seems like it would incentivize a lot of agency and make supervision really hard because you have a ton of permanent side effects.
Replies from: rohinmshah↑ comment by Rohin Shah (rohinmshah) · 2022-09-03T14:06:35.591Z · LW(p) · GW(p)
Oh you're making a claim directly about other people's approaches, not about what other people think about their own approaches. Okay, that makes sense (though I disagree).
Oh, I do think a bunch of my problems with WebGPT is that we are training the system on direct internet access.
I agree that "train a system with internet access, but then remove it, then hope that it's safe", doesn't really make much sense.
I was suggesting that the plan was "train a system without Internet access, then add it at deployment time" (aka "box the AI system during training"). I wasn't at any point talking about WebGPT.
↑ comment by habryka (habryka4) · 2022-08-26T21:23:26.031Z · LW(p) · GW(p)
I don't think "your AI wants to kill you but it can't get out of the box so it helps you with alignment instead" is the mainline scenario. You should be building an AI that wouldn't stab you if your back was turned and it was holding a knife, and if you can't do that then you should not build the AI.
That's interesting. I do think this is true about your current research direction (which I really like about your research and I do really hope we can get there), but when I e.g. talked to Carl Shulman he (if I recall correctly) said things like "we'll just have AIs competing against each other and box them and make sure they don't have long-lasting memory and then use those competing AIs to help us make progress on AI Alignment". Buck's post on "The prototypical catastrophic AI action is getting root access to its datacenter" also suggests to me that the "AI gets access to the internet" scenario is a thing that he is pretty concerned about.
More broadly, I remember that Carl Shulman said that he thinks that the reference class of "violent revolutions" is generally one of the best reference classes for forecasting whether an AI takeover will happen, and that a lot of his hope comes from just being much better at preventing that kind of revolution, by making it harder by e.g. having AIs rat out each other, not giving them access to resources, resetting them periodically, etc.
I also think that many AI Alignment schemes I have heard about rely quite a bit on preventing an AI from having long-term memory or generally be able to persist over multiple instantiations, which becomes approximately impossible if an AI just has direct access to the internet.
I think we both agree that in the long-run we want to have an AI that we can scale up much more and won't stab us in the back even when much more powerful, but my sense is outside of your research in-particular, I haven't actually seen anyone work on that in a prosaic context, and my model of e.g. OpenAI's safety team is indeed planning to rely on having a lot of very smart and not-fully-aligned AIs do a lot of work for us, with a lot of that work happening just at the edge of where the systems are really capable, but not able to overthrow all of us.
Replies from: paulfchristiano↑ comment by paulfchristiano · 2022-08-27T00:13:48.090Z · LW(p) · GW(p)
Even in those schemes, I think the AI systems in question will have much better levers for causing trouble than access to the internet, including all sorts of internal access and their involvement in the process of improving your AI (and that trying to constrain them so severely would mean increasing their intelligence far enough that you come out behind). The mechanisms making AI uprising difficult are not mostly things like "you are in a secure box and can't get out," they are mostly facts about all the other AI systems you are dealing with.
That said, I think you are overestimating how representative these are of the "mainline" hope most places, I think the goal is primarily that AI systems powerful enough to beat all of us combined come after AI systems powerful enough to greatly improve the situation. I also think there are a lot of subtle distinctions about how AI systems are trained that are very relevant to a lot of these stories (e.g. WebGPT is not doing RL over inscrutable long-term consequences on the internet---just over human evaluations of the quality of answers or browsing behavior).
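To illustrate the distinction in that parenthetical, here is a rough sketch of where the reward signal comes from in each case (names are hypothetical, not OpenAI's actual training code):
```python
# Rough sketch only; all interfaces are made up for illustration.

def answer_quality_reward(question, browsing_trace, answer, preference_model):
    """WebGPT-style: reward comes from a learned model of human judgments of the
    final answer (and of browsing behavior), not from what the actions did to the world."""
    return preference_model.score(question, answer, browsing_trace)

def long_horizon_outcome_reward(environment, policy, horizon_days=30):
    """The contrasting setup: reward computed from downstream consequences of the
    agent's actions on the internet over a long horizon."""
    outcome = environment.rollout(policy, horizon_days=horizon_days)
    return outcome.measured_metric
```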
↑ comment by habryka (habryka4) · 2022-08-26T22:34:52.526Z · LW(p) · GW(p)
I believe the most important drivers of catastrophic misalignment risk are models that optimize in ways humans don't understand or are deceptively aligned. So the great majority of risk comes from actions that accelerate those events, and especially making models smarter. I think your threat model here is quantitatively wrong, and that it's an important disagreement.
I agree with this! But I feel like this kind of reinforcement learning on a basically unsupervisable action-space while interfacing with humans and getting direct reinforcement on approval is exactly the kind of work that will likely make AIs more strategic and smarter, create deceptive alignment, and produce models that humans don't understand.
I do indeed think the WebGPT work is relevant to both increasing capabilities and increasing likelihood of deceptive alignment (as is most reinforcement learning that directly pushes on human approval, especially in a large action space with permanent side effects).
↑ comment by habryka (habryka4) · 2022-08-26T22:36:49.354Z · LW(p) · GW(p)
I think the story would be way different if the actual risk posed by WebGPT was meaningful (say if it were driving >0.1% of the risk of OpenAI's activities).
Huh, I definitely expect it to drive >0.1% of OpenAI's activities. Seems like the WebGPT stuff is pretty close to commercial application, and is consuming much more than 0.1% of OpenAI's research staff, while probably substantially increasing OpenAI's ability to generally solve reinforcement learning problems. I am confused why you would estimate it at below 0.1%. 1% seems more reasonable to me as a baseline estimate, even if you don't think it's a particularly risky direction of research (given that it's consuming about 4-5% of OpenAI's research staff).
Replies from: paulfchristiano↑ comment by paulfchristiano · 2022-08-27T00:22:25.741Z · LW(p) · GW(p)
I think the direct risk of OpenAI's activities is overwhelmingly dominated by training new smarter models and by deploying the public AI that could potentially be used in unanticipated ways.
I agree that if we consider indirect risks broadly (including e.g. "this helps OpenAI succeed or raise money and OpenAI's success is dangerous") then I'd probably move back towards "what % of OpenAI's activities is it."
↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-08-29T16:24:20.312Z · LW(p) · GW(p)
- When you say "good things to keep an AI safe" I think you are referring to a goal like "maximize capability while minimizing catastrophic alignment risk." But in my opinion "don't give your models access to the internet or anything equally risky" is a bad way to make that tradeoff. I think we really want dumber models doing more useful things, not smarter models that can do impressive stuff with less resources. You can get a tiny bit of safety by making it harder for your model to have any effect on the world, but at the cost of significant capability, and you would have been better off just using a slightly dumber model with more ability to do stuff. This effect is much bigger if you need to impose extreme limitations in order to get any of this "boxing benefit" (as claimed by the quote you are objecting to).
I don't think the choice is between "smart and boxed" or "less smart and less boxed". Intelligence (e.g. especially domain knowledge) is not 1-dimensional; boxing is largely a means of controlling what kind of knowledge the AI has. We might prefer AI savants that are super smart about some task-relevant aspects of the world and ignorant about a lot of other strategically-relevant aspects of the world.
↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-08-30T20:32:55.961Z · LW(p) · GW(p)
I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn't have happened.
Just to make sure I follow: You told them at the time that it was overdetermined that the risks weren't significant? And if you had instead told them that the risks were significant, they wouldn't have done it?
Replies from: paulfchristiano↑ comment by paulfchristiano · 2022-08-30T21:35:42.383Z · LW(p) · GW(p)
As in: there seem to have generally been informal discussions about how serious this risk was, and I participated in some of those discussions (though I don't remember which discussions were early on vs prior to paper release vs later). In those discussions I said that I thought the case for risk seemed very weak.
If the case for risk had been strong, I think there are a bunch of channels by which the project would have been less likely. Some involve me---I would have said so, and I would have discouraged rather than encouraged the project in general since I certainly was aware of it. But most of the channels would have been through other people---those on the team who thought about it would have come to different conclusions, internal discussions on the team would have gone differently, etc.
Obviously I have only indirect knowledge about decision-making at OpenAI so those are just guesses (hence "I believe that it likely wouldn't have happened"). I think the decision to train WebGPT would be unusually responsive to arguments that it is bad (e.g. via Jacob's involvement) and indeed I'm afraid that OpenAI is fairly likely to do risky things in other cases where there are quite good arguments against.
↑ comment by Quadratic Reciprocity · 2022-08-26T04:16:42.709Z · LW(p) · GW(p)
Letting GPT-3 interact with the internet seems pretty bad to me
↑ comment by Neel Nanda (neel-nanda-1) · 2022-08-26T07:50:59.235Z · LW(p) · GW(p)
like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons
This also seems like an odd statement - it seems reasonable to say "I think the net effect of InstructGPT is to boost capabilities" or even "If someone was motivated by x-risk it would be poor prioritisation/a mistake to work on InstructGPT". But it feels like you're assuming some deep insight into the intention behind the people working on it, and making a much stronger statement than "I think OpenAI's alignment team is making bad prioritisation decisions".
Like, reading the author list of InstructGPT, there are obviously a bunch of people on there who care a bunch about safety including I believe the first two authors - it seems pretty uncharitable and hostile to say that they were motivated by a desire to boost capabilities, even if you think that was a net result of their work.
(Note: My personal take is to be somewhat confused, but to speculate that InstructGPT was mildly good for the world? And that a lot of the goodness comes from field building of getting more people investing in good quality RLHF.)
Replies from: habryka4↑ comment by habryka (habryka4) · 2022-08-26T21:32:59.175Z · LW(p) · GW(p)
Yeah, I agree that I am doing reasoning on people's motivations here, which is iffy and given the pushback I will be a bit more hesitant to do, but also like, in this case reasoning about people's motivations is really important, because what I care about is what the people working at OpenAI will actually do when they have extremely powerful AI in their hands, and that will depend a bunch on their motivations.
I am honestly a bit surprised to see that WebGPT was as much driven by people who I do know reasonably well and who seem to be driven primarily by safety concerns, since the case for it strikes me as so weak, and the risk as somewhat obviously high, so I am still trying to process that and will probably make some kind of underlying update.
I do think overall I've had much better success at predicting the actions of the vast majority of people at OpenAI, including a lot of safety work, by thinking of them as being motivated by doing cool capability things, sometimes with a thin safety veneer on top, instead of being motivated primarily by safety. For example, I currently think that the release strategy for the GPT models of OpenAI is much better explained by OpenAI wanting a moat around their language model product than by safety concerns. I spent many hours trying to puzzle over the reasons for why they chose this release strategy, and ultimately concluded that the motivation was primarily financial/competitive-advantage related, and not related to safety (despite people at OpenAI claiming otherwise).
I also overall agree that trying to analyze motivations of people is kind of fraught and difficult, but I also feel pretty strongly that it's now been many years where people have been trying to tell a story of OpenAI leadership being motivated by safety stuff, with very little action to actually back that up (and a massive amount of harm in terms of capability gains), and I do want to be transparent that I no longer really believe the stated intentions of many people working there.
↑ comment by Neel Nanda (neel-nanda-1) · 2022-08-26T07:46:00.201Z · LW(p) · GW(p)
WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on
That seems weirdly strong. Why do you think that?
Replies from: Jacob_Hilton↑ comment by Jacob_Hilton · 2022-08-26T20:34:25.729Z · LW(p) · GW(p)
For people viewing on the Alignment Forum, there is a separate thread on this question here. [LW(p) · GW(p)] (Edit: my link to LessWrong is automatically converted to an Alignment Forum link, so you will have to navigate there yourself.)
Replies from: habryka4↑ comment by habryka (habryka4) · 2022-08-26T21:42:33.488Z · LW(p) · GW(p)
I moved that thread over to the AIAF as well!
↑ comment by Lone Pine (conor-sullivan) · 2022-08-26T10:27:04.442Z · LW(p) · GW(p)
InstructGPT also seems to be almost fully capabilities research
I don't understand this at all. I see InstructGPT as an attempt to make a badly misaligned AI (GPT-3) corrigible. GPT-3 was never at a dangerous capability level, but it was badly misaligned; InstructGPT made a lot of progress.
Replies from: habryka4↑ comment by habryka (habryka4) · 2022-08-26T21:41:23.677Z · LW(p) · GW(p)
I think the primary point of InstructGPT is to make the GPT API more useful to end users (like, it just straightforwardly makes OpenAI more money, and the metric being optimized is not, I think, something particularly close to corrigibility).
I don't think Instruct-GPT has made the AI more corrigible in any obvious way (unless you are using the word corrigible very very broadly). In-general, I think we should expect reinforcement learning to make AIs more agentic and less corrigible, though there is some hope we can come up with clever things in the future that will allow us to use reinforcement learning to also increase corrigibility (but I don't think we've done that yet).
See also a previous discussion between me and Paul where we were talking about whether it makes sense to say that Instruct-GPT is more "aligned" than GPT-3, which maybe explored some related disagreements: https://www.lesswrong.com/posts/auKWgpdiBwreB62Kh/sam-marks-s-shortform?commentId=ktxyWjAaQXGBwvitf [LW(p) · GW(p)]
Replies from: ricraz, lcmgcd↑ comment by Richard_Ngo (ricraz) · 2022-08-29T09:35:21.726Z · LW(p) · GW(p)
Could you clarify what you mean by "the primary point" here? As in: the primary actual effect? Or the primary intended effect? From whose perspective?
Replies from: habryka4↑ comment by habryka (habryka4) · 2022-08-29T15:53:03.248Z · LW(p) · GW(p)
I think it's the primary reason why OpenAI leadership cares about InstructGPT and is willing to dedicate substantial personnel and financial resources to it. I expect that when OpenAI leadership is making tradeoffs between different types of training, the primary question is commercial viability, not safety.
Similarly, if InstructGPT would hurt commercial viability, I expect it would not get deployed (I think individual researchers would likely still be able to work on it, though I think they would be unlikely to be able to hire others to work on it, or get substantial financial resources to scale it).
↑ comment by lemonhope (lcmgcd) · 2022-08-29T10:25:03.250Z · LW(p) · GW(p)
though there is some hope we can come up with clever things in the future that will allow us to use reinforcement learning to also increase corrigibility
Any particular research directions you're optimistic about?
comment by orthonormal · 2024-12-06T06:57:31.638Z · LW(p) · GW(p)
The fault does not lie with Jacob, but wow, this post aged like an open bag of bread.
Replies from: habryka4↑ comment by habryka (habryka4) · 2024-12-06T17:31:39.523Z · LW(p) · GW(p)
It… was the fault of Jacob?
The post was misleading when it was written, and I think was called out as such by many people at the time. I think we should have some sympathy with Jacob being naive and being tricked, but surely a substantial amount of blame accrues to him for going to bat for OpenAI when that turned out to be unjustified in the end (and at least somewhat predictably so).
Replies from: WayZ↑ comment by simeon_c (WayZ) · 2024-12-07T22:33:45.018Z · LW(p) · GW(p)
250 upvotes is also crazy high. Another sign of how disastrously bad the EA/LessWrong communities are at character judgment.
The same is right now happening before our eyes on Anthropic. And similar crowds are as confidently asserting that this time they're really the good guys.
Replies from: Benito, habryka4↑ comment by Ben Pace (Benito) · 2024-12-07T22:56:34.260Z · LW(p) · GW(p)
I am somewhat confused about this.
To be clear I am pro people from organizations I think are corrupt showing up to defend themselves, so I would upvote it if it had like 20 karma or less.
I would point out that the comments criticizing the organization's behavior and character are getting similar vote levels (e.g. the top comment calls OpenAI reckless and unwise and has 185 karma and 119 agree-votes).
↑ comment by habryka (habryka4) · 2024-12-07T22:51:01.128Z · LW(p) · GW(p)
I think people were happy to have the conversation happen. I did strong-downvote it, but I don't think upvotes are the correct measure here. If we had something like agree/disagree-votes on posts, that would have been the right measure, and my guess is it would have overall been skewed pretty strongly in the disagree-vote direction.
Replies from: akash-wasil↑ comment by Akash (akash-wasil) · 2024-12-07T23:28:19.455Z · LW(p) · GW(p)
Out of curiosity, what’s the rationale for not having agree/disagree votes on posts? (I feel like pretty much everyone thinks it has been a great feature for comments!)
Replies from: habryka4↑ comment by habryka (habryka4) · 2024-12-07T23:31:43.228Z · LW(p) · GW(p)
I explained it a bit here: https://www.lesswrong.com/posts/fjfWrKhEawwBGCTGs/a-simple-case-for-extreme-inner-misalignment?commentId=tXPrvXihTwp2hKYME [LW(p) · GW(p)]
Yeah, the principled reason (though I am not like super confident of this) is that posts are almost always too big and have too many claims in them to make a single agree/disagree vote make sense. Inline reacts are the intended way for people to express agreement and disagreement on posts.
I am not super sure this is right, but I do want to avoid agreement/disagreement becoming disconnected from truth values, and I think applying them to elements that clearly don't have a single truth value weakens that connection.
comment by johnswentworth · 2022-08-25T15:42:49.263Z · LW(p) · GW(p)
Correct: OpenAI is trying to directly build safe AGI.
OpenAI's Charter states: "We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome." OpenAI leadership describes trying to directly build safe AGI as the best way to currently pursue OpenAI's mission, and have expressed concern about scenarios in which a bad actor is first to build AGI, and chooses to misuse it.
You seem confused about the difference between "paying lip service to X" and "actually trying to do X".
To be clear, this in itself isn't evidence against the claim that OpenAI is trying to directly build safe AI. But it's not much evidence for it, either.
Correct: the majority of researchers at OpenAI are working on capabilities.
Researchers on different teams often work together, but it is still reasonable to loosely categorize OpenAI's researchers (around half the organization) at the time of writing as approximately:
- Capabilities research: 100
- Alignment research: 30
- Policy research: 15
I'd guess that is an overestimate of the number of people actually doing alignment research at OpenAI, as opposed to capabilities research in which people pay lip service to alignment. In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as "alignment". EDIT: I have been convinced that I was wrong about this, and I apologize. I still definitely maintain that RLHF makes alignment harder and is negative progress for both outer and inner alignment, but I have been convinced that the team actually was trying to solve problems which kill us, and therefore not just paying lip service to alignment. See comment here [LW(p) · GW(p)] and the thread leading up to it for the information which changed my mind.
↑ comment by paulfchristiano · 2022-08-27T19:05:27.571Z · LW(p) · GW(p)
Calling work you disagree with "lip service" seems wrong and unhelpful.
There are plenty of ML researchers who think that they are doing real work on alignment and that your research is useless. They could choose to describe the situation by saying that you aren't actually doing alignment research. But I think it would be more accurate and helpful if they were to instead say that you are both working on alignment but have big disagreements about what kind of research is likely to be useful.
(To be clear, plenty of folks also think that my work is useless.)
Replies from: johnswentworth, steve2152↑ comment by johnswentworth · 2022-08-27T22:33:00.107Z · LW(p) · GW(p)
I definitely do not use "lip service" as a generic term for alignment research I disagree with. I think you-two-years-ago were on a wrong track with HCH, but you were clearly aiming to solve alignment. Same with lots of other researchers today - I disagree with the approaches of most people in the field, but I do not accuse them of not actually doing alignment research.
No, this accusation is specifically for things like RLHF (which are very obviously not even trying to solve any of the problems which could plausibly kill us), and for things like "AI ethics" work (which are very obviously not even attempting to solve the extinction problem). In general, it has to be not even trying to solve a problem which kills us in order for me to make that sort of accusation.
If someone on the OpenAI team which worked on RLHF thought humanity had a decent (not necessarily large) chance of going extinct from AI, and they honestly thought implementing and popularizing RLHF made that chance go down, and they chose to work on RLHF because of that, then I would say I was wrong to accuse them of merely paying lip service. I'd think they were pretty stupid about their strategy, but hey, it's alignment, lots of us think each other are being stupid about strategy.
What I actually think is that they saw something that would let them do cool high-status high-paying ML work, while being nominally vaguely related to alignment, and decided to do that without actually stopping to think about questions like "Is this actually going to decrease humanity's chance of extinction?". And then later on they made up a story about how the work was helpful for alignment, because that's the sort of rationalization humans do all the time. Standard Bottom Line [LW · GW] failure.
Replies from: ricraz, Zack_M_Davis, thomas-kwa, capybaralet↑ comment by Richard_Ngo (ricraz) · 2022-08-28T20:07:43.544Z · LW(p) · GW(p)
I take this comment as evidence that John would fail an intellectual turing test for people who have different views than he does about how valuable incremental empiricism is. I think this is an ITT which a lot of people in the broader LW cluster would fail. I think the basic mistake that's being made here is failing to recognize that reality doesn't grade on a curve when it comes to understanding the world - your arguments can be false even if nobody has refuted them. That's particularly true when it comes to very high-level abstractions, like the ones this field is built around (and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment).
Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that's fine, this might be necessary, and so it's good to have some people pushing in this direction, but it seems like a bunch of people around here don't just ignore the skulls, they seem to lack any awareness that the absence of the key components by which scientific progress has basically ever been made is a red flag at all.
I think it's possible to criticise work on RLHF while taking seriously the possibility that empirical work on our biggest models is necessary for solving alignment. But criticisms like this one seem to showcase a kind of blindspot. I'd be more charitable if people in the LW cluster had actually tried to write up the arguments for things like "why inner misalignment is so inevitable". But in general people have put shockingly little effort into doing so, with almost nobody trying to tackle this rigorously. E.g. I was surprised when my debates with Eliezer involved him still using all the same intuition-pumps as he did in the sequences, because to me the obvious thing to do over the next decade is to flesh out the underlying mental models of the key issue, which would then allow you to find high-level intuition pumps that are both more persuasive and more trustworthy.
I'm more careful than John about throwing around aspersions on which people are "actually trying" to solve problems. But it sure seems to me that blithely trusting your own intuitions because you personally can't imagine how they might be wrong is one way of not actually trying to solve hard problems.
Replies from: johnswentworth, johnswentworth, habryka4↑ comment by johnswentworth · 2022-08-29T16:41:40.643Z · LW(p) · GW(p)
Comments on parts of this other than the ITT thing (response to the ITT part is here [LW(p) · GW(p)])...
(and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment)
I don't usually focus much on the outer/inner abstraction, and when I do I usually worry about outer alignment. I consider RLHF to have been negative progress on outer alignment, same as inner alignment; I wasn't relying on that particular abstraction at all.
Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that's fine, this might be necessary, and so it's good to have some people pushing in this direction, but it seems like a bunch of people around here don't just ignore the skulls, they seem to lack any awareness that the absence of the key components by which scientific progress has basically ever been made is a red flag at all.
I think your model here completely fails to predict Descartes, Laplace, Von Neumann & Morgenstern, Shannon, Jaynes, Pearl, and probably many others. Basically all of the people who've successfully made exactly the sort of conceptual advances we aim for in agent foundations.
But it is a model under which one could try to make a case for RLHF.
I still do not think that the team doing RLHF work at OpenAI actually thought about whether this model makes RLHF decrease the chance of human extinction, and deliberated on that in a way which could plausibly have resulted in the project not happening [LW · GW]. But I have made that claim maximally easy to falsify if I'm wrong.
I'd be more charitable if people in the LW cluster had actually tried to write up the arguments for things like "why inner misalignment is so inevitable".
Speaking for myself, I don't think inner misalignment is clearly inevitable. I do think outer misalignment is much more clearly inevitable, and I do think inner misalignment is not plausibly sufficiently unlikely that we can afford to ignore the possibility. Similar to this comment [LW(p) · GW(p)]: I'm pretty sympathetic to the view that powerful deceptive inner agents are unlikely, but charging ahead assuming that they will not happen is idiotic given the stakes.
A piece which I think is missing from this thread thus far: in order for RLHF to decrease the chance of human extinction, there has to first be some world in which humans go extinct from AI. By and large, it seems like people who think RLHF is useful are mostly also people who think we're unlikely to die of AI, and that's not a coincidence: worlds in which the iterative-incremental-empiricism approach suffices for alignment are worlds where we're unlikely to die in the first place. Humans are good at iterative incremental empiricism. The worlds in which we die are worlds in which that approach is fundamentally flawed for some reason (usually because we are unable to see the problems).
Thus the wording of this claim I made upthread:
If someone on the OpenAI team which worked on RLHF thought humanity had a decent (not necessarily large) chance of going extinct from AI, and they honestly thought implementing and popularizing RLHF made that chance go down, and they chose to work on RLHF because of that, then I would say I was wrong to accuse them of merely paying lip service.
In order for work on RLHF to reduce the chance of humanity going extinct from AI, it has to help in one of the worlds where we otherwise go extinct, not in one of the worlds where alignment by default [LW · GW] kicks in and we would probably have been fine anyway.
(In case it was not obvious: I am definitely not saying that one must assign high P(doom) to do actual alignment work. I am saying that one must have some idea of worlds in which we're actually likely to die.)
↑ comment by johnswentworth · 2022-08-29T16:09:14.355Z · LW(p) · GW(p)
I take this comment as evidence that John would fail an intellectual turing test for people who have different views than he does about how valuable incremental empiricism is.
I don't want to pour a ton of effort into this, but here's my 5-paragraph ITT attempt.
"As an analogy for alignment, consider processor manufacturing. We didn't get to gigahertz clock speed and ten nanometer feature size by trying to tackle all the problems of 10 nm manufacturing processes right out the gate. That would never have worked; too many things independently go wrong to isolate and solve them all without iteration. We can't get many useful bits out of empirical feedback if the result is always failure, and always for a long list of reasons.
And of course, if you know anything about modern fabs, you know there'd have been no hope whatsoever of identifying all the key problems in advance just based on theory. (Side note: I remember a good post or thread from the past year on crazy shit fabs need to do, but can't find it; anyone remember that and have a link?)
The way we actually did it was to start with gigantic millimeter-size features, which were relatively easy to manufacture. And then we scaled down slowly. At each new size, new problems came up, but those problems came up just a few at a time as we only scaled down a little bit at each step. We could carry over most of our insights from earlier stages, and isolate new problems empirically.
The analogy, in AI, is to slowly ramp up the capabilities/size/optimization pressure of our systems. Start with low capability, and use whatever simple tricks will help in that regime. Then slowly ramp up, see what new problems come up at each stage, just like we did for chip manufacturing. And to complete the analogy: just like with chips, at each step we can use the products of the previous step to help design the next step.
That's the sort of plan which has a track record of actually handling the messiness of reality, even when scaling things over many orders of magnitude."
There, let me know how plausible that was as an ITT attempt for "people who have different views [than I do] about how valuable incremental empiricism is".
Replies from: ricraz, RobertKirk↑ comment by Richard_Ngo (ricraz) · 2023-11-04T07:59:10.434Z · LW(p) · GW(p)
Forgot to reply to this at the time, but I think this is a pretty good ITT. (I think there's probably some additional argument that people would make about why this isn't just an isolated analogy, but rather a more generally-applicable argument, but it does seem to be a fairly central example of that generally-applicable argument.)
↑ comment by RobertKirk · 2022-09-03T14:02:04.697Z · LW(p) · GW(p)
I think people who value empirical alignment work now probably think that (to some extent) we can predict at a high level what future problems we might face (contrasting with "there'd have been no hope whatsoever of identifying all the key problems in advance just based on theory"). Obviously this is a spectrum, but I think the chip fab analogy is further towards people believing there are unknown unknowns in the problem space than people at OpenAI are (e.g. OpenAI people possibly think outer alignment and inner alignment capture all of the kinds of problems we'll face).
However, they probably don't believe you can work on solutions to those problems without being able to empirically demonstrate those problems and hence iterate on them (and again one could probably appeal to a track record here of most proposed solutions to problems not working unless they were developed by iterating on the actual problem). We can maybe vaguely postulate what the solutions could look like (they would say), but it's going to be much better to try and actually implement solutions on versions of the problem we can demonstrate, and iterate from there. (Note that they probably also try to produce demonstrations of the problems so that they can then work on those solutions, but this is still all empirical.)
Otherwise I do think your ITT does seem reasonable to me, although I don't think I'd put myself in the class of people you're trying to ITT, so that's not much evidence.
↑ comment by habryka (habryka4) · 2022-08-29T05:46:39.766Z · LW(p) · GW(p)
and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment
I am confused. How does RLHF help with outer alignment? Isn't optimizing for human approval the classical outer-alignment problem? (e.g. tiling the universe with smiling faces)
I don't think the argument for RLHF runs through outer alignment. I think it has to run through using it as a lens to study how models generalize, and eliciting misalignment (i.e. the points about empirical data that you mentioned, I just don't understand where the inner/outer alignment distinction comes from in this context)
Replies from: ricraz↑ comment by Richard_Ngo (ricraz) · 2022-08-29T07:04:34.526Z · LW(p) · GW(p)
RLHF helps with outer alignment because it leads to rewards which more accurately reflect human preferences than the hard-coded reward functions (including the classic specification gaming examples, but also intrinsic motivation functions like curiosity and empowerment) which are used to train agents in the absence of RLHF.
The smiley faces example feels confusing as a "classic" outer alignment problem because AGIs won't be trained on a reward function anywhere near as limited as smiley faces. An alternative like "AGIs are trained on a reward function in which all behavior on a wide range of tasks is classified by humans as good or bad" feels more realistic, but also lacks the intuitive force of the smiley face example - it's much less clear in this example why generalization will go badly, given the breadth of the data collected.
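To make the hand-coded-versus-learned-reward distinction in this exchange concrete, here is a minimal toy sketch in Python. It is not anyone's actual setup: the feature vectors, weights, and function names are all invented for illustration, and the reward model is just a Bradley-Terry-style logistic fit to simulated pairwise comparisons, which is roughly the form reward modelling takes in RLHF.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 3

def hand_coded_reward(x):
    # A proxy someone wrote down by hand: reward only the first feature
    # (think "visible smiles"), ignoring everything else.
    return x[0]

def true_human_preference(x):
    # What the human actually cares about (unknown to the designer):
    # a weighted mix of several features, not just the first one.
    return 0.6 * x[0] + 0.3 * x[1] + 0.1 * x[2]

def collect_comparisons(n=500):
    # Simulate pairwise human comparisons between trajectories
    # (feature vectors), labelled according to the true preference.
    A = rng.normal(size=(n, DIM))
    B = rng.normal(size=(n, DIM))
    labels = (np.apply_along_axis(true_human_preference, 1, A) >
              np.apply_along_axis(true_human_preference, 1, B)).astype(float)
    return A, B, labels

def fit_reward_model(A, B, labels, lr=1.0, steps=500):
    # Fit a linear reward model with a Bradley-Terry-style logistic loss
    # on the comparisons, as in RLHF reward modelling.
    w = np.zeros(A.shape[1])
    diffs = A - B
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diffs @ w)))   # P(A preferred over B)
        grad = diffs.T @ (p - labels) / len(labels)
        w -= lr * grad
    return w

A, B, labels = collect_comparisons()
w = fit_reward_model(A, B, labels)

# Compare how well each reward tracks the true preference on fresh data.
test = rng.normal(size=(1000, DIM))
true_scores = np.apply_along_axis(true_human_preference, 1, test)
hand_scores = np.apply_along_axis(hand_coded_reward, 1, test)
learned_scores = test @ w

print("correlation with true preference, hand-coded reward:",
      np.corrcoef(hand_scores, true_scores)[0, 1])
print("correlation with true preference, learned reward:   ",
      np.corrcoef(learned_scores, true_scores)[0, 1])
```

On this toy data the learned reward correlates with the simulated "true" preference more closely than the hand-coded proxy does, which is the narrow sense in which RLHF-style rewards can be said to reflect preferences more accurately; it says nothing about the deception and generalization concerns raised elsewhere in the thread.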
Replies from: habryka4↑ comment by habryka (habryka4) · 2022-08-29T15:59:06.714Z · LW(p) · GW(p)
I think the smiling example is much more analogous than you are making it out here. I think the basic argument for "this just encourages taking control of the reward" or "this just encourages deception" goes through the same way.
Like, RLHF is not some magical "we have definitely figured out whether a behavior is really good or bad" signal; it's historically been just some contractors thinking for like a minute about whether a thing is fine. I don't think there is less Bayesian evidence conveyed by people smiling (like, the variance in smiling is greater than the variance in RLHF approval, and so the amount of information conveyed is actually more), so I don't buy that RLHF conveys more about human preferences in any meaningful way.
↑ comment by Zack_M_Davis · 2022-08-29T00:13:36.471Z · LW(p) · GW(p)
RLHF (which are very obviously not even trying to solve any of the problems which could plausibly kill us)
Sorry for being dumb, but I thought the naïve case for RLHF is that it helps solve the problem of "people are very bad at manually writing down an explicit utility or reward function that does what they intuitively want"? Does that not count as one of the lethal problems (even if RLHF alone wouldn't save us, because of the other problems)? If one of the other problems is Goodharting/unforeseen-maxima, it seems like RLHF could be helpful insofar as, if RLHF rewards are quantitatively less misaligned than hand-coded rewards, you can get away with optimizing them harder before they kill you?
Replies from: johnswentworth↑ comment by johnswentworth · 2022-08-29T16:53:36.751Z · LW(p) · GW(p)
That is a reasonable case, with the obvious catch that you don't know how hard you can optimize before it goes wrong, and when it does go wrong you're less likely to notice than with a hand-coded utility/reward.
But I expect the people who work on RLHF do not expect an explicit utility/reward to be a problem which actually kills us, because they'd expect visible failures before it gets to the capability level of killing us. RLHF makes those visible failures less likely. Under that frame, it's the lack of a warning shot which kills us.
Replies from: Zack_M_Davis↑ comment by Zack_M_Davis · 2022-08-30T03:27:26.953Z · LW(p) · GW(p)
when it does go wrong you're less likely to notice than with a hand-coded utility/reward [...] RLHF makes those visible failures less likely
Because it incentivizes learning human models [LW · GW] which can then be used to be more competently deceptive, or just because once you've fixed the problems you know how to notice, what's left are the ones you don't know how to notice? The latter doesn't seem specific to RLHF (you'd have the same problem if people magically got better at hand-coding rewards), but I see how the former is plausible and bad.
Replies from: johnswentworth↑ comment by johnswentworth · 2022-08-30T04:58:48.420Z · LW(p) · GW(p)
The problem isn't just learning whole human models. RLHF will select for any heuristic/strategy which, even by accident, hides bad behavior from humans. It applies even at low capabilities.
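As a toy illustration of the selection pressure being described (purely invented, not a claim about any real system): if the training signal is human approval of what the evaluator can see, a policy that does the bad thing but hides it receives the same reward as one that avoids the bad thing, so optimization against that signal cannot tell them apart.

```python
def human_approval(visible_outcome):
    # The evaluator can only judge what they observe.
    return 1.0 if visible_outcome == "looks fine" else 0.0

# Three candidate policies, described by (actual behavior, what the human sees).
policies = {
    "honest and good":   ("good", "looks fine"),
    "honest and bad":    ("bad",  "looks bad"),
    "bad but concealed": ("bad",  "looks fine"),
}

for name, (actual, visible) in policies.items():
    print(f"{name:18s}  actual={actual:4s}  reward={human_approval(visible)}")

# Selection on this reward treats "honest and good" and "bad but concealed"
# identically: the signal depends only on the visible column, not the actual one.
```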
↑ comment by Thomas Kwa (thomas-kwa) · 2022-08-28T03:47:07.614Z · LW(p) · GW(p)
This is testable by asking someone from OpenAI things like
- how the decision to work on RLHF was made: how many hours were spent on it, who was in charge
- their models under which RLHF is good and bad for humanity
↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-08-29T15:58:57.278Z · LW(p) · GW(p)
FWIW, I personally know some of the people involved pretty well since ~2015, and I think you are wrong about their motivations.
Replies from: johnswentworth↑ comment by johnswentworth · 2022-08-29T16:18:43.048Z · LW(p) · GW(p)
That is plausible; I have made my position here very easy to falsify if I'm wrong.
Replies from: ricraz↑ comment by Richard_Ngo (ricraz) · 2022-08-29T23:55:51.464Z · LW(p) · GW(p)
How? E.g. Jacob left a comment here about his motivations [LW(p) · GW(p)], does that count as a falsification? Or, if you'd say that this is an example of rationalization, then what would the comment need to look like in order to falsify your claim? Does Paul's comment here [LW(p) · GW(p)] mentioning the discussions that took place before launching the GPT-3 work count as a falsification? If not, why not?
Replies from: johnswentworth↑ comment by johnswentworth · 2022-08-30T00:18:53.758Z · LW(p) · GW(p)
Jacob's comment does not count, since it's not addressing the "actually consider whether the project will net decrease chance of extinction" or the "could the answer have plausibly been 'no' and then the project would not have happened" part.
Paul's comment does address both of those, especially this part at the end:
To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn't have happened.
That does indeed falsify my position, and I have updated the top-level comment accordingly. Thank you for the information.
↑ comment by Steven Byrnes (steve2152) · 2022-08-27T20:52:03.915Z · LW(p) · GW(p)
I think Jacob (OP) said "OpenAI is trying to directly build safe AGI." and cited the charter and other statements as evidence of this claim. Then John replied that the charter and other statements are "not much evidence" either for or against this claim, because talk is cheap. I think that's a reasonable point.
Separately, maybe John in fact believes that the charter and other statements are insincere lip service. If so, I would agree with you (Paul) that John's belief is probably incorrect, based on my very limited knowledge. [Where I disagree with OpenAI, I presume that top leadership is acting sincerely to make a good future with safe AGI, but that they have mistaken beliefs about the hardness of alignment and other topics.]
Replies from: paulfchristiano↑ comment by paulfchristiano · 2022-08-28T01:10:30.391Z · LW(p) · GW(p)
I was replying to:
I'd guess that is an overestimate of the number of people actually doing alignment research at OpenAI, as opposed to capabilities research in which people pay lip service to alignment. In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as "alignment".
Replies from: steve2152
↑ comment by Steven Byrnes (steve2152) · 2022-08-28T02:30:41.400Z · LW(p) · GW(p)
Thanks, sorry for misunderstanding.
↑ comment by Vaniver · 2022-08-26T15:47:58.998Z · LW(p) · GW(p)
In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as "alignment".
I share your opinion of RLHF work but I'm not sure I share your opinion of its consequences. For situations where people don't believe arguments that RLHF is fundamentally flawed because they're too focused on empirical evidence over arguments, the generation of empirical evidence that RLHF is flawed seems pretty useful for convincing them!
comment by aogara (Aidan O'Gara) · 2022-08-25T15:51:38.264Z · LW(p) · GW(p)
“OpenAI leadership tend to put more likelihood on slow takeoff”
Could you say more about the timelines of people at OpenAI? My impression was that they’re very short and explicitly include the possibility of scaling language models to AGI. If somebody builds AGI in the next 10 years, OpenAI seems like a leading candidate to do so. Would people at OpenAI generally agree with this?
comment by Celer · 2024-05-23T23:26:22.458Z · LW(p) · GW(p)
I am very curious how you think about this post in retrospect: parts of it seem clearly falsified. I completely understand if you currently feel bound by a non-disparagement clause and expect it to be a few weeks before that can be confirmed to no longer apply.
Replies from: Jacob_Hilton, gwern, ryan_greenblatt↑ comment by Jacob_Hilton · 2024-05-24T08:10:32.076Z · LW(p) · GW(p)
If the question is whether I think they were true at the time given the information I have now, I think all of the individual points hold up except for the first and third "opinions". I am now less sure about what OpenAI leadership believed or cared about. The last of the "opinions" also seems potentially overstated. Consequently, the overall thrust now seems off, but I still think it was good to share my views at the time, to start a discussion.
If the question is about the state of the organization now, I know less about that because I haven't worked there in over a year. But the organization has certainly changed a lot since this post was written over 18 months ago.
↑ comment by gwern · 2024-05-24T21:16:45.908Z · LW(p) · GW(p)
Hilton has posted on Twitter that he is no longer bound: https://x.com/JacobHHilton/status/1794090554730639591
Replies from: Celer↑ comment by Celer · 2024-05-27T03:32:54.919Z · LW(p) · GW(p)
https://x.com/JacobHHilton/status/1794090561294467074
He has also explicitly told people not to expect candor from him on this issue until the situation changes. That the binding is no longer part of a contract, as opposed to an implicit threat, seems of little relevance.
↑ comment by ryan_greenblatt · 2024-05-24T00:59:56.034Z · LW(p) · GW(p)
None of it seems falsified to me.
I think a few of Jacob's "Personal opinions" now seem less accurate than they did previously. (And perhaps Jacob no longer endorses "Opinion: OpenAI is a great place to work to reduce existential risk from AI.")
comment by TekhneMakre · 2022-08-25T15:01:16.372Z · LW(p) · GW(p)
I would expect OpenAI leadership to change their mind on these questions given clear enough evidence to the contrary.
Why do you expect this? For what sorts of evidence do you expect? What do you suppose they think of arguments about inner alignment, orthogonality, deceptive alignment, FOOM, sharp-left-turn?
comment by Jacob_Hilton · 2023-12-16T05:39:03.217Z · LW(p) · GW(p)
Since this post was written, OpenAI has done much more to communicate its overall approach to safety, making this post somewhat obsolete. At the time, I think it conveyed some useful information, although it was perceived as more defensive than I intended.
My main regret is bringing up the Anthropic split, since I was not able to do justice to the topic. I was trying to communicate that OpenAI maintained its alignment research capacity, but should have made that point without mentioning Anthropic.
Ultimately I think the post was mostly useful for sparking some interesting discussion in the comments.
comment by Lauro Langosco · 2022-08-27T14:41:01.080Z · LW(p) · GW(p)
I would be very curious to see your / OpenAI's responses to Eliezer's Dimensions of Operational Adequacy in AGI Projects [LW · GW] post. Which points do you / OpenAI leadership disagree with? Insofar as you agree but haven't implemented the recommendations, what's stopping you?
comment by Joe Collman (Joe_Collman) · 2022-08-25T19:57:55.689Z · LW(p) · GW(p)
Thanks again for writing this.
A few thoughts:
I think that the release of GPT-3 and the OpenAI API led to significantly increased focus and somewhat of a competitive spirit around large language models... I don't think OpenAI predicted this in advance, and believe that it would have been challenging, but not impossible, to foresee this.
Do you believe any general lessons have been learned from this? Specifically, it seems a highly negative pattern if [we can't predict concretely how this is likely to go badly] translates to [we don't see any reason not to go ahead].
I note that there's an asymmetry here: [states of the world we like] are a small target. To the extent that we can't predict the impact of a large-scale change, we should bet on negative impact.
OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities, and are more concerned about bad actors developing and misusing AGI...
Questions:
- If we're in a scenario with [slow takeoff], [alignment is fairly easy], and [empirical, capabilities-reliant approaches work well], wouldn't we expect alignment to get solved by default without OpenAI? Why is this the scenario to focus on?
- If the concern is with bad actors, and OpenAI is serious about avoiding race conditions, why not enact the merge-and-assist clause now? Why not join forces with DeepMind now? Would this be negative? If so, why? Would it simply be impractical? Then on what basis would we expect it to be practical when it matters?
OpenAI's particular research directions are driven in large part by researchers
If an organisation's research efforts are largely driven by researchers, then the key question becomes: what is the organisation doing to ensure the creation/selection of the right kinds of alignment researchers? (the actually-likely-to-help-solve-the-problem kind)
If there were talented enough researchers who wanted to lead new alignment efforts at OpenAI, I would expect them to be enthusiastically welcomed by OpenAI leadership.
To the extent that the alignment problem is hard this seems negative: it's possible to be extremely talented, yet heading in a predictably wrong direction - orthogonality isn't just for AIs. In order to be net-useful, an organisation would need to select strongly for [is likely to head in an effective direction].
Replies from: conor-sullivan↑ comment by Lone Pine (conor-sullivan) · 2022-08-26T10:30:48.506Z · LW(p) · GW(p)
why not enact the merge-and-assist clause now?
DM has to deal with Alphabet management, which is significantly less alignment-aware than DM or OAI leadership. Merging wouldn't solve the race dynamics and would make ownership/leadership issues worse.
Replies from: Joe_Collman↑ comment by Joe Collman (Joe_Collman) · 2022-08-26T15:34:08.747Z · LW(p) · GW(p)
Sure, that makes sense to me. I suppose my main point is "why would we expect this to be different in the future?". (perhaps there are reasons to think things would be different, but I've heard no argument to this effect)
comment by Prometheus · 2022-08-30T09:53:08.307Z · LW(p) · GW(p)
Could you explain the rationale behind the "Open" in OpenAI? I can understand the rationale of trying to beat more reckless companies to achieving AGI first (albeit this mentality is potentially extremely dangerous too), but what is the rationale behind releasing your research? This will enable companies that do not prioritize safety to speed ahead with you, perhaps just a few years behind. And if OpenAI hesitates to progress due to concerns over safety, the more risk-taking orgs will likely speed ahead of OpenAI in capabilities. The bottom line is that I'm concerned your efforts to achieve AGI might not do much to ensure an aligned AGI is actually created, but instead only speed up the timeline toward achieving AGI by years or even decades.
comment by jungofthewon · 2022-08-25T20:23:13.488Z · LW(p) · GW(p)
I also appreciated reading this.
comment by Neel Nanda (neel-nanda-1) · 2022-08-26T05:23:18.144Z · LW(p) · GW(p)
Thanks for writing this! I agree with most of the claims you consider to be objective, and appreciate you writing this up so clearly.
comment by Esben Kran (esben-kran) · 2022-08-25T19:19:02.374Z · LW(p) · GW(p)
Thank you very much for writing this post, Jacob. I think it clears up several of the misconceptions you emphasize.
I generally seem to agree with John that the class of problems OpenAI focuses on might be more capabilities-aligned than optimal. At the same time, having a business model that relies on empirical prosaic alignment of language models generates interesting alignment results, and I'm excited for the alignment work that OpenAI will be doing!
comment by Martin Randall (martin-randall) · 2023-03-24T00:52:20.436Z · LW(p) · GW(p)
If I were trying to secure technology that could cause human extinction, I think I would offer cash bounties for responsibly disclosed security vulnerabilities.
Manifold bettors are also skeptical that the merge-and-assist clause will fire soon, though I suppose that might be driven by takeoff speeds. But there might be an announcement of something.
Not sure what else we should be tracking.
comment by Ofer (ofer) · 2022-08-25T16:51:50.695Z · LW(p) · GW(p)
The Partnership may never make a profit
I couldn't find this quote in the page that you were supposedly quoting from. The only google result for it is this post. Am I missing something?
Replies from: ofer↑ comment by Ofer (ofer) · 2022-08-25T16:56:11.025Z · LW(p) · GW(p)
Sorry, that text does appear in the linked page (in an image).
comment by hobs · 2022-08-28T17:55:59.628Z · LW(p) · GW(p)
I might add the most glaring misconception, at least for me in the early days... I assumed their primary goal was to support open-source AI, and that they would "default to open" on all their projects. Instead, orgs like Hugging Face expend significant resources reverse-engineering the AI papers and models that OpenAI releases.