AI Alignment Open Thread October 2019

post by habryka (habryka4) · 2019-10-04T01:28:15.597Z · score: 20 (5 votes) · LW · GW · 25 comments

Continuing the experiment from August, let's try another open thread for AI Alignment discussion. The goal is to be a place where researchers and upcoming researchers can ask small questions they are confused about, share early-stage ideas, and have lower-key discussions.


Comments sorted by top scores.

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-23T20:46:20.065Z · score: 7 (4 votes) · LW · GW

My new research direction for an "end-to-end" alignment scheme [AF · GW].

See also this clarifying comment [AF · GW].

I'm posting this in the Open Thread because, for technical reasons, shortform posts don't appear in the feed on the main page of the Alignment Forum, so I am a little worried people missed it entirely (I discussed it with Oliver).

comment by Grue_Slinky · 2019-10-08T17:55:24.309Z · score: 7 (3 votes) · LW · GW

I have a bunch of half-baked ideas, most of which are mediocre in expectation and probably not worth investing my time and others' attention writing up. Some of them probably are decent, but I'm not sure which ones, and this user base is probably as good as any for feedback.

So I’m just going to post them all as replies to this comment. Upvote if they seem promising, downvote if not. Comments encouraged. I reserve the “right” to maintain my inside view, but I wouldn’t make this poll if I didn’t put substantial weight on this community’s opinions.

comment by Grue_Slinky · 2019-10-08T17:59:39.231Z · score: 22 (8 votes) · LW · GW


In light of the “Fixed Points” critique, a set of exercises that seem more useful/reflective of MIRI’s research than those exercises. What I have in mind is taking some of the classic success stories of formalized philosophy (e.g. Turing machines, Kolmogorov complexity, Shannon information, Pearlian causality, etc., but this could also be done for reflective oracles and logical induction), introducing the problems they were meant to solve, and giving some stepping stones that guide one to have the intuitions and thoughts that (presumably) had to be developed to make the finished product. I get that this will be hard, but I think this can be feasibly done for some of the (mostly easier) concepts, and if done really well, it could even be a better way for people to learn those concepts than actually reading about them.

comment by johnswentworth · 2019-10-08T18:26:08.050Z · score: 8 (4 votes) · LW · GW

I think this would be an extremely useful exercise for multiple independent reasons:

  • it's directly attempting to teach skills which I do not currently know any reproducible way to teach/learn
  • it involves looking at how breakthroughs happened historically, which is an independently useful meta-strategy
  • it directly involves investigating the intuitions behind foundational ideas relevant to the theory of agency, and could easily expose alternative views/interpretations which are more useful (in some contexts) than the usual presentations

comment by Grue_Slinky · 2019-10-09T00:49:41.358Z · score: 5 (3 votes) · LW · GW

*begins drafting longer proposal*

Yeah, this is definitely more high-risk, high-reward than the others, and the fact that there are potentially some very substantial spillover effects if successful makes me both excited and nervous about the concept. I'm thinking of Arbital as an example of "trying to solve way too many problems at once", so I want to manage expectations and just try to make some exercises that inspire people to think about the art of mathematizing certain fuzzy philosophical concepts. (Running title is "Formalization Exercises", but I'm not sure if there's a better pithy name that captures it.)

In any case, I appreciate the feedback, Mr. Entworth.

comment by johnswentworth · 2019-10-09T02:47:26.131Z · score: 4 (2 votes) · LW · GW

Oh no, not you too. It was bad enough with just Bena.

comment by Grue_Slinky · 2019-10-08T17:58:40.098Z · score: 11 (6 votes) · LW · GW


A skeptical take on Part I of “What failure looks like [AF · GW]” (3 objections, to summarize briefly: not much evidence so far, not much precedent historically, and “why this, of all the possible axes of differential progress?”) [Unsure if these objections will stand up if written out more fully]

comment by Grue_Slinky · 2019-10-08T17:59:03.489Z · score: 7 (5 votes) · LW · GW


An analysis of what kinds of differential progress we can expect from stronger ML. Actually, I don’t feel like writing this post, but I just don’t understand why Dai and Christiano, respectively, are particularly concerned about differential progress on the polynomial hierarchy [AF · GW] and what’s easy-to-measure vs. hard-to-measure [AF · GW]. My gut reaction is “maybe, but why privilege that axis of differential progress of all things”, and I can’t resolve that in my mind without doing a comprehensive analysis of potential “differential progresses” that ML could precipitate. Which, argh, sounds like an exhausting task, but someone should do it?

comment by rohinmshah · 2019-10-11T02:18:54.870Z · score: 5 (3 votes) · LW · GW

Re: easy-to-measure vs. hard-to-measure axis: That seems like the most obvious axis on which AI is likely to be different from humans, and it clearly does lead to bad outcomes?

comment by Grue_Slinky · 2019-10-08T17:58:18.275Z · score: 6 (6 votes) · LW · GW


A post discussing my confusions about Goodhart and Garrabrant’s taxonomy [AF · GW] of it. I find myself not completely satisfied with it:

  • “adversarial” seems too broad to be that useful as a category
  • It doesn’t clarify what phenomenon is meant by “Goodhart”; in particular, “regressional” doesn’t feel like something the original law was talking about, and any natural definition of “Goodhart” that includes it seems really broad
  • Whereas “regressional” and “extremal” (and perhaps “causal”) are defined statistically, “adversarial” is defined in terms of agents, and this may have downsides (I’m less sure about this objection)

But I’m also not sure how I’d reclassify it and that task seems hard. Which partially updates me in favor of the Taxonomy being good, but at the very least I feel there’s more to say about it.

comment by Grue_Slinky · 2019-10-08T17:59:23.231Z · score: 2 (2 votes) · LW · GW


A critique of MIRI’s “Fixed Points [AF · GW]” paradigm, expanding on some points I made on MIRIxDiscord a while ago (which would take a full post to properly articulate). Main issue is, I'm unsure if it's still guiding anyone's research and/or who outside MIRI would care.

comment by Grue_Slinky · 2019-10-08T17:57:19.457Z · score: 2 (4 votes) · LW · GW


“When and why should we be worried about robustness to distributional shift?”: When reading that section of Concrete Problems, there’s a temptation to just say “this isn’t relevant long-term, since an AGI by definition would have solved that problem”. But adversarial examples and the human safety problems [AF · GW] (to the extent we worry about them) both say that in some circumstances we don’t expect this to be solved by default. I’d like to think more about when the naïve “AGI will be smart” intuition applies and when it breaks.

comment by rohinmshah · 2019-10-11T02:19:56.679Z · score: 4 (3 votes) · LW · GW

Concerns about mesa-optimizers are mostly concerns that "capabilities" will be robust to distributional shift while "objectives" will not be robust.

comment by Grue_Slinky · 2019-10-08T17:55:40.892Z · score: 2 (2 votes) · LW · GW

Upvote this comment (and downvote the others as appropriate) if most of the other ideas don’t seem that fruitful.

By default, I’d mostly take this as a signal of “my time would be better spent working on someone else’s agenda or existing problems that people have posed”, but I suppose other alternatives exist; if so, comment below.

comment by Grue_Slinky · 2019-10-08T17:56:44.595Z · score: 1 (1 votes) · LW · GW


[I probably need a better term for this] “Wide-open-source game theory”: Where other agents can not only simulate you, but also figure out "why" you made a given decision. There’s a Standard Objection [LW · GW] to this: it’s unfair to compare algorithms in environments where they are judged not only by their actions, but on arbitrary features of their code; to which I say, this isn’t an arbitrary feature. I was thinking about this in the context [AF · GW] of how, even if an AGI makes the right decision, we care “why” it did so, e.g. because it’s optimizing for what we want vs. optimizing for human approval for instrumental reasons. I doubt we’ll formalize this “why” anytime soon (see e.g. section 5 of this), but I think semi-formal things can be said about it upon some effort. [I thought of this independently from (1), but I think every level of the “transparency hierarchy” could have its own kind of game theory, much like the “open-source” level clearly does]

comment by Grue_Slinky · 2019-10-08T17:56:05.380Z · score: 1 (1 votes) · LW · GW


A classification of some of the vulnerabilities/issues we might expect AGIs to face because they are potentially open-source, and generally more “transparent” to potential adversaries. For instance, they could face adversarial examples, open-source game theory problems, Dutch books, or weird threats that humans don’t have to deal with. Also, there’s a spectrum from “extreme black box” to “extreme white box” with quite a few plausible milestones along the way, that makes for a certain transparency hierarchy, and it may be helpful to analyze this (or at least take a stab at formulating it).

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-21T10:38:19.366Z · score: 4 (2 votes) · LW · GW

Meta: When should I use this rather than the shortform? Do we really need both?

comment by Raemon · 2019-10-21T12:45:15.056Z · score: 3 (1 votes) · LW · GW

I'm not sure what we'll end up settling on for "regular Open Threads" vs shortform. Open Threads predate shortform, but didn't create the particular feeling of a person-space like shortform does, so it seemed useful to add shortform. I'm not sure if Open Threads still provide a particular service that shortform doesn't.

In _this_ case, however, I think the Alignment Open Thread serves a bit of a different purpose – it's a place to spark low-key conversation between AF members. (Non-AF members can comment on LessWrong, but I think it's valuable that AF members who prefer AF to LessWrong can show up and see some conversation that's easy to jump into.)

For your own comments: the subtle difference is that open threads are more like a market square where you can show up and start talking to strangers, and shortform is more like a conversation in your living room. If you have a preference for one of those subtle distinctions, do that I guess, and if not... dunno, flip a coin I guess. :P

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-21T14:51:09.370Z · score: 2 (1 votes) · LW · GW

Actually, now I'm confused. I just posted a shortform, but I don't see where it appears on the main page? There is "AI Alignment Posts" which only includes the "longforms" and there is "recent discussion" which only includes the comments. Does it mean nobody sees the shortform unless they open my profile?

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-21T13:02:27.151Z · score: 2 (1 votes) · LW · GW

...the subtle difference is that open threads are more like a market square where you can show up and start talking to strangers, and shortform is more like a conversation in your living room.

Hmm, this seems like an informal cultural difference that isn't really enforced by the format. Technically, people can comment on the shortform as easily as on open thread comments. So, I am not entirely sure whether everyone perceives it this way (and will continue to perceive it this way).

comment by Raemon · 2019-10-21T13:09:59.077Z · score: 3 (1 votes) · LW · GW

There actually is an important difference, which is that you get to set the moderation norms on your shortform posts (and can ban people from commenting if need be). But, indeed, I would agree most of the difference is informal-cultural.

comment by riceissa · 2019-10-23T22:39:02.752Z · score: 1 (1 votes) · LW · GW

[Meta] I'm not a full member on Alignment Forum, but I've had some of my LW content cross-posted to AF. However, this cross-posting seems haphazard, and does not correspond to my intuitive feeling of which of my posts/comments "should" end up on AF. I would like for one of the following to happen:

  • More insight into the mechanism that decides what gets cross-posted, so I feel less annoyed at the arbitrary-seeming nature of it.
  • More control over what gets cross-posted (if this requires applying for full membership, I would be willing to do that).
  • Have all my AF cross-posting be undone so that readers don't get a misleading impression of my AI alignment content. (I would like to avoid people visiting my AF profile, reading content there, and concluding something about my AI alignment output based on that.)

comment by habryka (habryka4) · 2019-10-23T22:53:08.832Z · score: 7 (4 votes) · LW · GW

There is currently a small set of moderators/admins who can move posts by non-AIAF members from LW to the AI Alignment Forum. The full list of users who can move over posts:

  • Scott Garrabrant
  • Vaniver
  • Rohin Shah
  • Oliver Habryka
  • Vika
  • Abram Demski
  • Ben Pace
  • ricraz

Every AIAF user can suggest a post to be moved over to the forum, which then usually gets processed by Vaniver.

My sense (guessing from your post history) is that you are mostly confused about comments. For comments, any AIAF member can move over comments from any other user, and this also happens automatically when they make an AIAF-tagged reply to a non-AIAF comment. My sense is that currently most comments by non-AIAF members that get moved over to the AIAF are moved because someone wanted to reply to them on the AIAF.

For both cases, Vaniver is admin of the forum and has final say on what gets moved over or not.

comment by riceissa · 2019-10-24T00:07:07.354Z · score: 3 (2 votes) · LW · GW


My sense (guessing from your post history) is that you are mostly confused about comments.

I am more confused about posts than comments. For posts, only my comparison of decision theories [AF · GW] post is currently cross-posted to AF, but I actually think my post about deliberation [LW · GW], question about iterative approaches to alignment [LW · GW] (along with Rohin's answer), and question about coordination on AI progress projects [LW · GW] are more relevant to AF (either because they make new claims or because they encourage others to do so). If I see that a particular post hasn't been cross-posted to AF, I'm wondering if I should be thinking more like "every single moderator has looked at the post, and believes it doesn't belong on AF" or more like "either the moderators are busy, or something about the post title caused them to not look at the post, and it sort of fell through the cracks".

comment by habryka (habryka4) · 2019-10-24T02:42:52.766Z · score: 3 (2 votes) · LW · GW

I think currently it's more likely to be the second case.

I do think we can probably do better here. I will ping Vaniver about his thoughts on this.