Solve Corrigibility Week

post by Logan Riggs (elriggs) · 2021-11-28T17:00:29.986Z · LW · GW · 21 comments

A low-hanging fruit for solving alignment is dedicating a chunk of time to actually trying to solve a sub-problem collectively.

To that end, I’ve broken up researching the sub-problem of corrigibility into two categories in this Google Doc (you have suggestion privileges):

  1. Previous Work: let’s not reinvent the wheel. Write out links to any past work on corrigibility. This can range from bare links to links with summaries and analyses. Use comment reactions on others' reviews to provide counter-arguments. Since this is just a Google Doc, low-quality posts, comments, and links are accepted; I want people to lean towards babbling more.
  2. Tasks: what do we actually do this week to make progress? Suggest any research direction you find fruitful, or general research questions or framings. Example: write an example of corrigibility (one could then comment an actual example).

Additionally, I’ll post 3 top-level comments for:

  1. Meetups: want to co-work with others in the community? Comment availability, work preferences, and a way to contact you (e.g. Calendly link, “dm me”, “my email is bob at alice dot com”, etc.). For example [LW(p) · GW(p)], I’m available most times this week with a Calendly link for scheduling 1-on-1 co-working sessions. Additionally, you could message people you know to collaborate on this, or have a nerdy house co-working party.
  2. Potential topics: what other topics besides corrigibility could we collaborate on in future weeks?
  3. Meta: what are different formats this type of group collaboration could take? Comment suggestions with trade-offs, or discuss the costs/benefits of what I’m presenting in this post.

I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better. Additionally, it’s immensely valuable to have an alignment topic post include a literature review, the community's up-to-date thoughts, and possible future research directions to pursue. I also believe a collaborative project like this will put several community members on the same page as far as terminology and gears-level models.

I explicitly commit to 3 weeks of this (so corrigibility this week, plus two more topics over the next two weeks). After that come Christmas and New Year’s, after which I may resume depending on how it goes.

Thanks to Alex Turner for reviewing a draft.
 

21 comments


comment by Zack_M_Davis · 2021-12-04T22:02:14.414Z · LW(p) · GW(p)

As a starting point, it might help to understand exactly where people's naïve intuitions about why corrigibility should be easy clash with the technical argument that it's hard.

For me, the intuition goes like this: if I wanted to spend some fraction of my effort helping dolphins in their own moral reference frame, that seems like something I could do. I could give them gifts that I can predict that they'd like (like tasty fish or a water purifier), and be conservative when I couldn't figure out what dolphins "really wanted", and be eager to accept feedback when the dolphins wanted to change how I was trying to help. If my superior epistemic vantage point let me predict that the way dolphins would respond to gifts would depend on details like what order the gifts were presented in, I might compute an average over possible gift-orderings, or I might try to ask the dolphins to clarify, but I definitely wouldn't tile the lightcone with tiny molecular happy-dolphin sculptures, because I can tell that's not what dolphins want under any sensible notion of "want".

So what I'd like to understand better is, where exactly does the analogy between "humans being corrigible to dolphins (in the fraction of their efforts devoted to helping dolphins)" and "AI being corrigible to humans" break, such that I haven't noticed yet because empathic inference [LW · GW] between mammals still works "well enough", but won't work when scaled to superintelligence? When I try to think of gift ideas for dolphins, am I failing to notice some way in which I'm "selfishly" projecting what I think dolphins should want onto them, or am I violating some coherence axiom?

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2021-12-04T22:47:03.227Z · LW(p) · GW(p)

When I try to think of gift ideas for dolphins, am I failing to notice some way in which I'm "selfishly" projecting what I think dolphins should want onto them, or am I violating some coherence axiom?

I think it's rather that 'it's easy to think of ways to help a dolphin (and a smart AGI would presumably find this easy too), but it's hard to make a general intelligence that robustly wants to just help dolphins, and it's hard to safely coerce an AGI into helping dolphins in any major way if that's not what it really wants'.

I think the argument is two-part, and both parts are important:

  1. A random optimization target won't tend to be 'help dolphins'. More specifically, if you ~gradient-descent your way to the first general intelligence you can find that has the external behavior 'help dolphins in the training environment' (or that is starting to approximate that behavior), you will almost always find an optimizer that has some other goal in general.
    1. E.g.: Humans invented condoms once we left the EEA. In this case, we could imagine that we have instilled some instinct in the AGI that makes it emit dolphin-helping behaviors at low capability levels; but then once it has more options, it will push into extreme parts of the state-space. (Condoms are humans' version of 'tiling the universe with smiley faces'.)
    2. Alternatively: If you tried to get a human prisoner to devote their lives to helping dolphins, you would get 'human who pretends to care about dolphins but is always on the lookout for opportunities to escape' long before you get 'human who has deeply and fully replaced their utility function with helping dolphins'. In this case, we can imagine an AGI that pretends to care about the optimization target as a deliberate strategy.
  2. Given that you haven't instilled exactly the desired 'help dolphins' goal right off the bat, now there are strong coherence-pressures against the AGI allowing its goal to be changed ('improved'), against the AGI allowing something else with a different goal to call the shots, etc.
comment by Koen.Holtman · 2021-11-30T11:02:17.325Z · LW(p) · GW(p)

I don't feel like joining this, but I do wish you luck, and I'll make a high level observation about methodology.

I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better.

I don't consider myself a rationalist or EA, but I do post on this website, so I guess that makes me part of the community of people who post on this site. My high-level observation on solving corrigibility is this: the community of people who post on this site has absolutely no mechanism for agreeing among its members whether a problem has been solved.

This is what you get when a site is in part a philosophy-themed website/forum/blogging platform. In philosophy, problems are never solved to the satisfaction of the community of all philosophers. This is not necessarily a bad thing. But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.

In business, there is the useful terminology that certain meetings will be run as 'decision making meetings', e.g. to make a go/no-go decision on launching a certain product design, even though a degree of uncertainty remains. Other meetings are exploratory meetings only, and are labelled as such. This forum is not a decision making forum.

Replies from: TurnTrout, elriggs
comment by TurnTrout · 2021-11-30T17:46:23.580Z · LW(p) · GW(p)

But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.

Noting that I strongly disagree but don't have time to type out arguments right now, sorry. May or may not type out later.

comment by Logan Riggs (elriggs) · 2021-11-30T20:21:08.320Z · LW(p) · GW(p)

I think we're pretty good at avoiding semantic arguments. The word "corrigible" can (and does) mean different things to different people on this site. Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement. We can taboo [LW · GW] the word corrigible.

This has actually already happened in the document with corrigible either meaning:

  1. Correctable all the time regardless
  2. Correctable up until the point where the agent actually knows how to achieve your values better than you (related to intent alignment and coherent extrapolated volition).

Then we can think "assuming corrigible-definition-1, then yes, this is a solution".  

I don't see a benefit to the exploratory/decision making forum distinction when you can just do the above, but maybe I'm missing something?

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-12-02T11:03:16.267Z · LW(p) · GW(p)

Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement.

Indeed this can resolve disagreement among a small sub-group of active participants. This is an important tool if you want to make any progress.

but maybe I'm missing something?

The point I was trying to make is about what is achievable for the entire community, not what is achievable for a small sub-group of committed participants. The community of people who post on this site has absolutely no mechanism for agreeing among its members whether a problem has been solved, or whether some sub-group has made meaningful progress on it.

To make the same point in another way: the forces which introduce disagreeing viewpoints and linguistic entropy [LW · GW] to this forum are stronger than the forces that push towards agreement and clarity.

My thinking about how strong these forces are has been updated recently, by the posting of a whole sequence of Yudkowsky conversations [? · GW] and also this one [LW · GW]. In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read.

I am way past Denial and Bargaining, I have Accepted that this site is a community of philosophers.

Replies from: elriggs
comment by Logan Riggs (elriggs) · 2021-12-02T18:10:25.920Z · LW(p) · GW(p)

The linguistic entropy point is countered by my previous point, right? Unless you want to say not everyone who posts in this community is capable of doing that? Or can naturally do that?

In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody's AGI Alignment Research Results. Painful to read.

Hahaha, yes. Yudkowsky can easily be interpreted as condescending and annoying in those dialogues (and he could've done a better job of not coming across that way). Though I believe the majority of the comments were in the spirit of understanding and coming to an agreement. Adam Shimi is also working on a post describing the disagreements in the dialogue as different epistemic strategies [LW · GW], meaning the cause of disagreement is non-obvious. Alignment is pre-paradigmatic, so agreeing is more difficult compared to communities that have clear questions and metrics to measure them on. I still think we can succeed at the harder problem.

I am way past Denial and Bargaining, I have Accepted that this site is a community of philosophers.

By "community of philosophers", you mean noone makes any actual progress on anything (or can agree that progress is being made)? 

  • I believe Alex Turner has made progress on formalizing impact and power-seeking, and I'm not aware of parts of the community arguing this isn't progress at all (though I don't read every comment).
  • I also believe Vanessa's and Diffractor's Infra-Bayesianism is progress on thinking about probabilities, and am unaware of parts of the community arguing this isn't progress (though there is a high mathematical bar to clear before you can understand it enough to criticize it).
  • I also believe Evan Hubinger et al.'s work on mesa-optimizers is quite clearly progress on crisply stating an alignment issue, and the community has largely agreed it is progress.

Do you disagree on these examples or disagree that they prove the community makes progress and agrees that progress is being made?

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-12-05T14:13:42.430Z · LW(p) · GW(p)

Yes, by calling this site a "community of philosophers", I roughly mean that at the level of the entire community, nobody can agree that progress is being made. There is no mechanism for creating a community-wide agreement that a problem has been solved.

You give three specific examples of progress above. From his recent writings, it is clear that Yudkowsky does not believe, like you do, that any contributions posted on this site in the last few years have made any meaningful progress towards solving alignment. You and I may agree that some or all of the above three examples represent some form of progress, but you and I are not the entire community here, Yudkowsky is also part of it.

On the last one of your three examples, I feel that 'mesa optimizers' is another regrettable example of the forces of linguistic entropy overwhelming any attempts at developing crisply stated definitions which are then accepted and leveraged by the entire community. It is not like the people posting on this site are incapable of using the tools needed to crisply define things, the problem is that many do not seem very interested in ever using other people's definitions or models as a frame of reference. They'd rather free-associate on the term, and then develop their own strongly held beliefs of what it is all supposed to be about.

I am sensing from your comments that you believe that, with more hard work and further progress on understanding alignment, it will in theory be possible to make this community agree, in future, that certain alignment problems have been solved. I, on the other hand, do not believe that it is possible to ever reach that state of agreement in this community, because the debating rules of philosophy apply here.

Philosophers are always allowed to disagree based on strongly held intuitive beliefs that they cannot be expected to explain any further. The type of agreement you seek is only possible in a sub-community which is willing to use more strict rules of debate.

This has implications for policy-related alignment work. If you want to make a policy proposal that has a chance of being accepted, it is generally required that you can point to some community of subject matter experts who agree on the coherence and effectiveness of your proposal. LW/AF cannot serve as such a community of experts.

Replies from: TAG
comment by TAG · 2021-12-05T19:24:17.919Z · LW(p) · GW(p)

On the last one of your three examples, I feel that ‘mesa optimizers’ is another regrettable example of the forces of linguistic entropy overwhelming any attempts at developing crisply stated definitions which are then accepted and leveraged by the entire community. It is not like the people posting on this site are incapable of using the tools needed to crisply define things, the problem is that many do not seem very interested in ever using other people’s definitions or models as a frame of reference. They’d rather free-associate on the term, and then develop their own strongly held beliefs of what it is all supposed to be about

Yes... clarity isn't optional.

MIRI abandoned the idea of producing technology a long time ago, so what it will offer the people who are working on AI technology is some kind of theory expressed in some kind of document... which will be of no use to them if they can't understand it.

And it takes a constant parallel effort to keep the lines of communication open. It's no use "woodshedding" , spending a lot of time developing your own ideas in your own language.

comment by plex (ete) · 2021-11-30T16:45:35.589Z · LW(p) · GW(p)

I've got a slightly terrifying hail mary "solve alignment with this one weird trick"-style paradigm I've been mulling over for the past few years which seems like it has the potential to solve corrigibility and a few other major problems (notably value loading without Goodharting, using an alternative to CEV which seems drastically easier to specify). There are a handful of challenging things needed to make it work, but they look to me maybe more achievable than other proposals which seem like they could scale to superintelligence I've read.

Realistically I am not going to publish it anytime soon given my track record, but I'd be happy to have a call with anyone who'd like to poke my models and try and turn it into something. I've had mildly positive responses from explaining it to Stuart Armstrong and Rob Miles, and everyone else I've talked to about it at least thought it was creative and interesting.

Replies from: elriggs
comment by Logan Riggs (elriggs) · 2021-11-30T20:25:41.386Z · LW(p) · GW(p)

I've updated my meeting times to meet more this week if you'd like to sign up for a slot (link w/ a pun), and from his comment, I'm sure Diffractor would also be open to meeting.

I will point out that there's a confusion in terms that I noticed in myself, of corrigibility meaning either "always correctable" or "something like CEV", though we can talk that over a call too :)

Replies from: ete
comment by plex (ete) · 2021-12-01T12:30:44.434Z · LW(p) · GW(p)

Cool, booked a call for later today.

comment by Quintin Pope (quintin-pope) · 2021-11-28T18:19:25.698Z · LW(p) · GW(p)

Your Google Docs link leads to Alex's "Corrigibility Can Be VNM-Incoherent [LW · GW]" post. Is this a mistake, or am I misunderstanding something?

Replies from: elriggs
comment by Logan Riggs (elriggs) · 2021-11-28T21:03:49.720Z · LW(p) · GW(p)

Fixed! Thanks:)

comment by Logan Riggs (elriggs) · 2021-11-28T17:02:15.279Z · LW(p) · GW(p)

Potential topics: what other topics besides corrigibility could we collaborate on in future weeks? Also, are we able to poll users for topics on the site?

Replies from: elriggs
comment by Logan Riggs (elriggs) · 2021-11-28T17:02:50.352Z · LW(p) · GW(p)
  • Timelines and forecasting 
  • Goodhart’s law
  • Power-seeking
  • Human values
  • Learning from human feedback
  • Pivotal actions
  • Bootstrapping alignment 
  • Embedded agency 
  • Primer on language models, reinforcement learning, or machine learning basics 
    • This one's not really on-topic, but I do see value in a more “getting up to date” focus where experts can give talks or references to learn from (e.g. “here’s a tutorial for implementing a small GPT-2”). I could just periodically ask LW questions on whatever topic interests me at the moment, or do my own Google search, but I feel there’s some community value here that wouldn’t be gained that way. Learning and teaching together makes it easier for the community to coordinate in the future, plus the bonus of connections.
comment by Logan Riggs (elriggs) · 2021-11-28T17:03:21.370Z · LW(p) · GW(p)

Meta: what are different formats this type of group collaboration could take? Comment suggestions with trade offs or discuss the cost/benefits of what I’m presenting in this post.

Replies from: elriggs
comment by Logan Riggs (elriggs) · 2021-11-28T17:06:33.150Z · LW(p) · GW(p)
  • Google Docs is kind of weird because I have to trust people won't spam suggestions. I also may need to keep up with allowing suggestions on a consistent basis. I would want this hosted on LW/AlignmentForum, but I do really like in-line commenting and feeling like there's less of a quality bar to meet. I'm unsure if this is just me.
  • Walled Garden group discussion block time: have a block of ~4-16 hours using Walled Garden software. There could be a flexible schedule with schelling points to coordinate meeting up. For example, if someone wants to give a talk on a specific corrigibility research direction and get live feedback/discussion, they can schedule a time to do so.
  • Breaking up the task comment. Technically the literature review, summaries, and extra thoughts are “tasks” to do. I do want broken-down tasks that many people could do, though what may end up happening is that whoever wants a specific task done ends up doing it themselves. Could also have “possible research directions” as a high-level comment.
comment by Logan Riggs (elriggs) · 2021-11-28T17:01:32.592Z · LW(p) · GW(p)

Meetups: want to co-work with others in the community? Comment availability, work preferences, and a way to contact you (e.g. Calendly link, “dm me”, “my email is bob at alice dot com”, etc.).

Replies from: Diffractor, elriggs
comment by Diffractor · 2021-11-29T03:10:31.035Z · LW(p) · GW(p)

Availability: Almost all times between 10 AM and PM, California time, regardless of day. Highly flexible hours. Text over voice is preferred, I'm easiest to reach on Discord. The LW Walled Garden can also be nice.

comment by Logan Riggs (elriggs) · 2021-11-28T17:01:52.363Z · LW(p) · GW(p)

Update: I am available this week until Saturday evening at this calendly link (though I will close the openings if a large number of people sign up). I am available all day Saturday, Dec 4th (the calendly link will allow you to see your time zone). We can read and discuss posts, do tasks together, or whatever you want. Previous one-on-one conversations with members of the community have gone really well. There’s not a person here I haven’t enjoyed getting to know, so do feel free to click that link and book a time!