The Security Mindset, S-Risk and Publishing Prosaic Alignment Research

post by lukemarks (marc/er) · 2023-04-22T14:36:50.535Z · LW · GW · 7 comments

Contents

  Introduction
  Arguments 1 and 3
  Argument 2
  Conclusion
7 comments

Note: I no longer endorse this post as strongly as I did when publishing it. Although declining to publish blatantly capability-advancing research is of course beneficial, I now agree with Neel Nanda's criticism and endorse this position. My primary reason is exactly as stated in his comment: I was simply too new to the field to correctly judge how potentially exfohazardous a given piece of research could be. Additionally, I was hell-bent on never advancing capabilities in even the slightest or most indirect form, to an extent that I now believe negatively impacted my progress overall.

I still think the analysis in the post could be useful, but I wanted to include this note at the beginning in case someone else in my position happens to find it.

Introduction

When converging on useful alignment ideas, it is natural to want to share them, get feedback, run experiments, and iterate in the hope of improving them, but doing so hastily can contradict the security mindset [LW · GW]. The Waluigi Effect [LW · GW], for example, ended with a rather grim conclusion:

If this Semiotic–Simulation Theory is correct, then RLHF is an irreparably inadequate solution to the AI alignment problem, and RLHF is probably increasing the likelihood of a misalignment catastrophe. Moreover, this Semiotic–Simulation Theory has increased my credence in the absurd science-fiction tropes that the AI Alignment community has tended to reject [? · GW], and thereby increased my credence in s-risks [? · GW].

Could the same apply to other prosaic alignment techniques? What if they do end up scaling to superintelligence?

It's easy to internally justify publishing in spite of this. Reassuring comments always sound reasonable at the time, but they aren't always robust upon reflection. While developing new ideas, one may tell oneself, “This is really interesting! I think this could have an impact on alignment!”, and then, when justifying publishing them, “It’s probably not that good of an idea anyway, just push through and wait until someone points out the obvious flaw in your plan.” The two are completely, and detrimentally, inconsistent.

Arguments 1 and 3

Many have shared their musings about this before. Andrew Saur [LW · GW] does so in his post “The Case Against AI Alignment [LW · GW]”, Andrea Miotti [AF · GW] explains here [? · GW] why they believe RLHF-esque research is a net-negative for existential safety (see Paul Christiano [LW · GW]’s response [AF · GW]), and Christiano provides an alternative perspective in his “Thoughts on the Impact of RLHF Research [LW · GW]”. Generally, these arguments seem to fall into three different categories (or their inversions):

  1. This research will further capabilities more than it will alignment.
  2. This alignment technique, if implemented, could actually elevate s/x-risks.
  3. This alignment technique, if implemented, will increase the commercializability of AI, feeding into the capabilities hype cycle and thereby indirectly contributing more to capabilities than to alignment.

To clarify, I am speaking here in terms of impact, not necessarily in terms of some static progress metric (e.g. one year of alignment progress is not necessarily equivalent to one year of capabilities progress in terms of impact).

In an ideal world, we could trivially measure potential capabilities/alignment advances and make simple comparisons. Sadly, this is not that world, and realistically most decisions are going to be intuition-derived. Worse yet, argument 2 is even murkier than 1 and 3. Trying to respond to argument 2 is quite literally trying to prove that a solution whose implications you don’t know won’t result in an outcome we can hardly begin to imagine, through means we don’t understand.

In reference to 1 and 3 (which seem to me addressable by similar solutions), Christiano has proposed a simple way to model this dilemma [AF · GW]:

Adapting this model to work for single publications instead of the entire field:

This is going to look very different every time it’s performed (e.g. an agent foundations publication is likely to have a larger A value than some agenda that involves turning an LLM into an agent). Reframing as ratios:

The issue is that hard numbers are essentially impossible to conjure, and intuitions attached to important ideas are rarely honest, let alone correct. So how can we use a model like this to make judgements about the safety of publishing alignment research? If your publication being favorable to alignment depends on one of these nigh-unknowable factors turning in your favor [LW · GW], maybe it isn’t safe to share openly [LW · GW]. Also, again referring to Yudkowsky’s security mindset literature: analysis by one individual with a robust security mindset is not evidence that something is safe to publish. If you’ve analyzed your research with the framework above, I urge you to think carefully about who to share it with, what feedback you hope to receive from them, and, as painful as it might be, the worst-case scenario for the malicious use of your ideas.
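To make the ratio framing concrete, here is a minimal sketch of how such a comparison might be operationalized, purely as an illustration: it is not Christiano’s model or the exact formulation pictured above, and the names and numbers (A for expected alignment benefit, C for expected capabilities benefit, and the field-level threshold) are placeholder assumptions.

```python
from dataclasses import dataclass


@dataclass
class PublicationEstimate:
    """Hypothetical, intuition-derived estimates for a single publication."""
    alignment_benefit: float     # A: expected contribution to alignment progress
    capabilities_benefit: float  # C: expected contribution to capabilities progress


def seems_safe_to_publish(pub: PublicationEstimate,
                          field_ratio_threshold: float) -> bool:
    """Return True only if the publication's alignment/capabilities ratio
    clears a chosen threshold. Both inputs are guesses; the point of the
    exercise is to notice when the verdict flips under small changes."""
    if pub.capabilities_benefit <= 0:
        return True  # no plausible capabilities uplift identified
    return pub.alignment_benefit / pub.capabilities_benefit > field_ratio_threshold


# Example: 1.0 / 0.6 ≈ 1.67, below a threshold of 2.0, so this sketch says
# "hold off and get a second opinion" rather than "publish immediately".
estimate = PublicationEstimate(alignment_benefit=1.0, capabilities_benefit=0.6)
print(seems_safe_to_publish(estimate, field_ratio_threshold=2.0))  # False
```

The value of writing something like this down is less the boolean output than that it forces the estimates, and the threshold, onto paper where they can be stress-tested.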

Something along the lines of Conjecture’s infohazard policy seems reasonable to apply to alignment ideas with a high probability of satisfying any of the aforementioned three arguments, and is something I would be interested in drafting.

Argument 2

This failure mode is embodied primarily by this post [LW · GW], which, in short, argues that partial alignment could inflate the likelihood of s-risks. The Waluigi Effect [LW · GW] is one example, and seems to imply that this is true for LLMs aligned using RLHF [? · GW]. I assume this is of greater concern for prosaic approaches, and considerably less so for formal ones.

S-risk seems to me a considerably less intuitive concept to think about than x-risk [? · GW]. As I mentioned earlier, I had been modeling AGI ruin in a binary manner, in which I would consider ‘the good outcome’ one where we all live happily ever after as immortal citizens of Digitopia, and the bad outcome one in which we are all dissolved into some flavor [LW · GW] of goo [LW · GW]. Perhaps this becomes the case as we drift toward the later years of post-AGI existence, but I now see the terrifying and wonderful spectrum of short-term outcomes, including good ones that involve suffering and bad ones that do not. I had consumed s-risk literature before, but my thinking was so far behind my reading that it took actually applying the messages of that literature to a personal scenario to internalize it.

The root of this unintuitiveness can likely be found somewhere North of “Good Outcomes can still Entail Suffering” but East of “Reduction of Extinction Risk can Result in an Increase in S-Risk (and Vice Versa) [LW · GW]”. To overcome it, you do at some point need to quantify the goodness or badness of an increment in extinction risk in a unit that also applies to the same increment in s-risk. This looks analogous to the current capabilities vs. alignment paradigm, whereby it is difficult to affect the position of one without affecting the other. Not just for publishing but also for focusing research efforts, I think this is a critical piece of thinking that needs to be done, and done with great rigor. If this has been done already, please let me know and I will update this section of the post. I presume answers to this conundrum already exist in studies of axiology, and I have a writeup planned for this soon.
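As one purely illustrative sketch of what a common unit could look like, the snippet below scores both kinds of increment as changes in expected future value over three coarse outcomes; the probabilities and value assignments are placeholder assumptions, not claims, and the real axiological work is choosing them.

```python
# Illustrative only: score changes in extinction risk and s-risk in one
# common currency (expected future value), using made-up placeholder numbers.

VALUE_OF_GOOD_FUTURE = 1.0     # normalized value of a flourishing future
DISVALUE_OF_S_OUTCOME = -10.0  # assumed disvalue of a suffering-dominated future
VALUE_OF_EXTINCTION = 0.0      # extinction treated as the zero point


def expected_future_value(p_extinction: float, p_suffering: float) -> float:
    """Expected value over three coarse outcomes: extinction, a
    suffering-dominated future, and a good future."""
    p_good = 1.0 - p_extinction - p_suffering
    return (p_extinction * VALUE_OF_EXTINCTION
            + p_suffering * DISVALUE_OF_S_OUTCOME
            + p_good * VALUE_OF_GOOD_FUTURE)


# An intervention that slightly reduces extinction risk while slightly
# increasing s-risk can still be net negative in this currency:
before = expected_future_value(p_extinction=0.30, p_suffering=0.05)
after = expected_future_value(p_extinction=0.25, p_suffering=0.08)
print(after - before)  # negative under these placeholder numbers
```

Whether extinction really belongs at the zero point, and how large the disvalue of suffering-heavy futures is relative to good ones, are exactly the axiological questions gestured at above.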

Conclusion

As capabilities progress further and more rapidly than ever before, maintaining a security mindset when publishing potentially x/s-risk-inducing research is critical, and doing so doesn’t necessarily tax overall alignment progress that greatly: rigorous but high-value assessments can be done quickly, and need only be deferred to in order to be effective. Considering the following three arguments before publishing could have powerful long-term implications:

  1. This research will further capabilities more than it will alignment.
  2. This alignment technique, if implemented, could actually elevate s/x-risks.
  3. This alignment technique, if implemented, will increase the commercializability of AI, feeding into the capabilities hype cycle and thereby indirectly contributing more to capabilities than to alignment.

7 comments


comment by Neel Nanda (neel-nanda-1) · 2023-04-22T22:00:56.226Z · LW(p) · GW(p)

Some examples of justifications I have given to myself are “You’re so new to this, this is not going to have any real impact anyway”,

I think this argument is just clearly correct among people new to the field - thinking that your work may be relevant to alignment is motivating and exciting and represents the path to eventually doing useful things, but it's also very likely to be wrong. Being repeatedly wrong is what improvement feels like!

People new to the field tend to wildly overthink the harms of publishing, in a way that increases their anxiety and makes them much more likely to bounce off. This is a bad dynamic, and I wish people would stop promoting it

Replies from: habryka4, marc/er
comment by habryka (habryka4) · 2023-04-23T03:34:27.442Z · LW(p) · GW(p)

As someone who is quite concerned about the AI Alignment field having had a major negative impact via accelerating AI capabilities, I also agree with this. It's really quite unlikely for your first pieces of research to make a huge difference. I think the key people who I am worried will drive forward capabilities are people who have been in the field for quite a while and have found traction on the broader AGI problems and questions (as well as people directly aiming towards accelerating capabilities, though the worry there is somewhat different in nature). 

comment by lukemarks (marc/er) · 2023-04-23T01:31:50.766Z · LW(p) · GW(p)

It's fine to make the mistake of publishing something if the mistake you made was assuming "this is great research", but if the mistake was "this is safe to publish because I'm new to research", the consequences can be irreversible. I probably fall into the category of 'wildly overthinking the harms of publishing due to inexperience', but it seems to me like a simple assessment using the ABC model I outlined in the post should take only a few minutes and could quickly inform someone of whether or not they might want to show their research to someone more experienced before publishing.

I am personally having this dilemma. I have something I want to publish, but I'm unsure of whether I should listen to the voice telling me "you’re so new to this, this is not going to have any real impact anyway" or the voice that's telling me "if it does have some impact or was hypothetically implemented in a generally intelligent system this could reduce extinction risk but inflate s-risk". It was a difficult decision, but I decided I would rather show someone more experienced, which is what I am doing currently.  This post was intended to be a summary of why/how I converged upon that decision.

Replies from: neel-nanda-1, habryka4
comment by Neel Nanda (neel-nanda-1) · 2023-04-24T18:42:45.745Z · LW(p) · GW(p)

but it seems to me like a simple assessment using the ABC model I outlined in the post should take only a few minutes

Empirically, many people new to the field get very paralysed and anxious about fears of doing accidental harm, in a way that I believe has significant costs. I haven't fully followed the specific model you outline, but it seems to involve ridiculously hard questions around the downstream consequences of your work, which I struggle to robustly apply to my work (indirect effects are really hard man!). Ditto, telling someone that they need to ask someone more experienced to sanity check can have significant costs in terms of social anxiety (I personally sure would publish fewer blog posts if I felt a need to run each one by someone like Chris Olah first!)

Having significant costs doesn't mean that doing this is bad, per se, but there needs to be major benefits to match these costs, and I'm just incredibly unconvinced that people's first research projects meet these. Maybe if you've gotten a bunch of feedback from more experienced people that your work is awesome? But also, if you're in that situation, then you can probably ask them whether they're concerned.

comment by habryka (habryka4) · 2023-04-23T03:35:34.236Z · LW(p) · GW(p)

It's fine to make the mistake of publishing something if the mistake you made was assuming "this is great research", but if the mistake was "this is safe to publish because I'm new to research", the consequences can be irreversible.

"Irreversible consequences" is not that huge of a deal. The consequences of writing almost any internet comment are irreversible. I feel like you need to argue for also the expected magnitude of the consequences being large, instead of them just being irreversible.

Replies from: marc/er
comment by lukemarks (marc/er) · 2023-04-23T03:41:58.756Z · LW(p) · GW(p)

I agree with this sentiment in response to the question of "will this research impact capabilities more than it will alignment?", but not in response to the question of "will this research (if implemented) elevate s-risks?". Partial alignment inflating s-risk is something I am seriously worried about, and prosaic solutions especially could lead to a situation like this.

If your research not influencing s-risks negatively is dependent on it not being implemented, and you think that your research is good enough to post about, don't you see the dilemma here?

comment by Raymon (raymond-rin) · 2023-04-22T21:12:33.033Z · LW(p) · GW(p)

Extremely important discussion to have.