Review AI Alignment posts to help figure out how to make a proper AI Alignment review

post by habryka (habryka4), Raemon · 2023-01-10T00:19:23.503Z · LW · GW · 31 comments


I've had many conversations over the last few years about the health of the AI Alignment field and one of the things that has come up most frequently (including in conversations with Rohin, Buck and various Open Phil people) is that many people wish there was more of a review process in the AI Alignment field. 

I also think there is a bunch of value in better review processes, but I have felt hesitant to create something very official and central, since AI Alignment is quite a preparadigmatic field, which makes creating shared standards of quality hard, and because I haven't had the time to really commit to maintaining something great here. 

Separately, I am also quite proud of the LessWrong review and am very happy about the overall institution we've created there, and I realized that the LessWrong review might just be a good test bed and band-aid for a better AI Alignment review process. I think the UI we built for it is quite good, the vote has real stakes, and a lot of the people voting are quite active in AI Alignment. 

So this year, I would like to encourage many of the people who expressed a need for better review processes in AI Alignment to try reviewing some AI Alignment posts from 2021 as part of the LessWrong review. I personally got quite a bit of value out of doing that; for example, my review of the MIRI dialogues [LW(p) · GW(p)] helped crystallize some helpful new directions for me to work towards. I am also hoping to write a longer review of Eliciting Latent Knowledge, which I think will help clarify some things for me, and which I will feel comfortable linking to later when people ask me about my takes on ELK-adjacent AI Alignment research. 

I am also interested in comments on this post with takes on better review processes in AI Alignment. I am currently going through a period where I feel quite confused about how to relate to the field at large, so it might be a good time to also have a conversation about what kind of standards we even want to have in the field.

Current AI Alignment post frontrunners in the review

We've had an initial round of preliminary voting, in which people cast non-binding votes that help prioritize posts during the Review Phase. Among Alignment Forum voters, the top Alignment Forum posts were:

  1. ARC's first technical report: Eliciting Latent Knowledge [LW · GW]
  2. What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) [LW · GW]
  3. Another (outer) alignment failure story [LW · GW]
  4. Finite Factored Sets [LW · GW]
  5. Ngo and Yudkowsky on alignment difficulty [LW · GW]
  6. My research methodology [LW · GW]
  7. Fun with +12 OOMs of Compute [LW · GW]
  8. The Plan [LW · GW]
  9. Comments on Carlsmith's “Is power-seeking AI an existential risk?” [LW · GW]
  10. Ngo and Yudkowsky on AI capability gains [LW · GW]

There are also a lot of other great alignment posts in the review (a total of 88 posts were nominated), and I expect things to shift around a bit, but I think all 10 of these top essays deserve serious engagement and a relatively in-depth review, since I expect most of them will be read for many years to come, and people might be basing new research approaches and directions on them.

To review a post, navigate to the post page and click the "Review" button at the top of the page, just under the post title.

31 comments

Comments sorted by top scores.

comment by Rohin Shah (rohinmshah) · 2023-01-10T10:52:36.640Z · LW(p) · GW(p)

one of the things that has come up most frequently (including in conversations with Rohin, Buck and various Open Phil people) is that many people wish there was more of a review process in the AI Alignment field. 

Hmm, I think I've complained a bunch about lots of AI alignment work being conceptually confused, or simply stating points rather than arguing for them, or being otherwise epistemically sketchy. But I also don't particularly feel optimistic about a review process either; for that to fix these problems the reviewers would have to be more epistemically competent than the post authors, and that currently doesn't seem likely to happen.

Also, when I actually imagine what the reviews would look like, I mostly think of people talking about the same old cruxes and disagreements that change whether or not the work is worth doing at all, rather than actually talking about the details, which is what I would usually find useful about reviews.

(Tbc, it's possible I did express optimism about a review process in conversation with you; my opinions could have changed a bunch. I would be a bit surprised though.)

Replies from: akash-wasil, Buck, Raemon, Joe_Collman, habryka4
comment by Akash (akash-wasil) · 2023-01-10T19:44:42.346Z · LW(p) · GW(p)

How would you feel about a review process that had two sections?

Section One: How important do you find this work & to what extent do you think the research is worth doing? (Ex: Does it strike at what you see as core alignment problems?)

Section Two: What do you think of the details of the research? (Ex: Do you see any methodological flaws, do you have any ideas for further work, etc).

My impression is that academic peer-reviewers generally do both of these. Compared to academic peer-review, LW/AF discussions tend to have a lot of Section One and not much Section Two.

My (low-resilience) guess is that the field would benefit from more “Section Two Engagement” from people with different worldviews.

(On the other hand, perhaps people with different worldviews and research priorities won’t have a comparative advantage in offering specific, detailed critiques. Ex: An agent foundations researcher might not be very good at providing detailed critiques of an interpretability paper.)

(But to counter that point, maybe there are certain specific/detailed critiques that are easier for “outsiders” to catch.)

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2023-01-11T08:18:15.930Z · LW(p) · GW(p)

Disclaimer: I'm not particularly confident in any of my views here or in my original comment. I mostly commented on the post because the post implied that I supported the idea of a review process, whereas my actual opinion is mixed (not uniformly negative). If I hadn't been named explicitly I wouldn't have said anything; I don't want to stop people from trying this out if they think it would be good. It's quite plausible that someone thinking about this full time would have a good vision for it that I haven't even considered yet (given how little I've thought about it).

I think how excited I'd be would depend a lot more on the details (e.g. who are the reviewers, how much time do they spend, how are they incentivized, what happens after the reviews are completed). But if we just imagine the LessWrong Review extended to the Alignment Forum, I'm not that excited, because I predict (not confidently) that the reviews just wouldn't be good at engaging with the details. (Mostly because LW / AF comments don't seem very good at engaging with details on existing LW / AF posts, and because typical LW / AF commenters don't seem familiar enough with ML to judge details in ML papers.)

My impression is that academic peer-reviewers generally do both of these.

Academic peer review does do both in principle, but I'd say that typically most of the emphasis is on Section Two. Generally the Section One style review is just "yup, this is in fact trying to make progress on a problem academia has previously deemed important, and is not just regurgitating things that people previously said" (i.e. it is significant and novel).

(It is common for bad reviews to just say "this is not significant / not novel" and then ignore Section One entirely, but this is pretty commonly thought of as explicitly "this was a bad review", unless they actually justified "not significant / not novel" well enough that most others would agree with them.)

comment by Buck · 2023-01-26T05:30:56.836Z · LW(p) · GW(p)

But I also don't particularly feel optimistic about a review process either; for that to fix these problems the reviewers would have to be more epistemically competent than the post authors, and that currently doesn't seem likely to happen.

 

For what it's worth, this is also where I'm at on an Alignment Forum review.

Replies from: Raemon
comment by Raemon · 2023-01-26T20:34:06.652Z · LW(p) · GW(p)

I've been trying to articulate some thoughts since Rohin's original comment, and maybe going to just rant-something-out now.

On one hand: I don't have a confident belief that writing in-depth reviews is worth Buck or Rohin's time (or their immediate colleagues' time, for that matter). It's a lot of work, and there's a lot of other stuff worth doing. And I know at least Buck and Rohin have already spent quite a lot of time arguing about the deep conceptual disagreements behind many of the top-voted posts.

On the other hand, the combination of "there's stuff epistemically wrong or confused or sketchy about LW" and "I don't trust a review process to actually work, because I don't believe it'll get better epistemics than what has already been demonstrated" seems like a combination of "self-defeatingly wrong" and "also just empirically (probably) wrong". 

Presumably Rohin and Buck and similar colleagues think they have at least (locally) better epistemics than the writers they're frustrated by. 

I'm guessing your take is like "I, Buck/Rohin, could write a review that was epistemically adequate, but I'm busy and don't expect it to accomplish anything that useful." Assuming that's a correct characterization, I don't necessarily disagree (at least not confidently). But something about the phrasing feels off.

Some reasons it feels off:

  • Even if there are clusters of research that seem too hopeless to be worth engaging with, I'd be very surprised if there weren't at least some clusters of research that Rohin/Buck/etc are more optimistic about. If what happens is "people write reviews of the stuff that feels real/important enough to be worth engaging with", that still seems valuable to me.
  • It seems like people are sort of treating this like a stag-hunt [LW · GW], and it's not worth participating if a bunch of other effort isn't going in. I do think there are network effects that make it more valuable as more people participate. But I also think "people incrementally do more review work each year as it builds momentum" is pretty realistic, and I think individual thoughtful reviews are useful in isolation for building clarity on individual posts.
  • The LessWrong/Alignment Review process is pretty unopinionated at the moment. If you think a particular type of review is more valuable than other types, there's nothing stopping you from doing that type of review.
  • If the highest review-voted work is controversial, I think it's useful, for the field's orientation, to know that it's controversial. It feels pretty reasonable to me to publish an Alignment Forum journal-ish thing that includes the top-voted content, with short reviews from other researchers saying "FYI, I disagree conceptually with this post being a good intellectual output."
    • (or, stepping out of the LW-review frame: if the alignment field is full of controversy and people who think each other are confused, I think this is a fairly reasonable fact to come out of any kind of review process)
  • I'm skeptical that the actual top-voted posts trigger this reaction. At the time of this post, the top voted posts were:

I do think a proper alignment review should likely have more content that wasn't published on the Alignment Forum. This was technically available this year (we allowed people to submit non-LW content during the nomination phase), but we didn't promote it very heavily, and we didn't frame it to various researchers as a "please submit all the Alignment progress you think was particularly noteworthy" request.

I don't know that the current review process is great, but, again, it's fairly unopinionated and leaves plenty of room to be-the-change-you-want-to-see in the alignment scene meta-reflection.

(aside: I apologize for picking on Rohin and Buck when they bothered to stick their neck out and comment, presumably there are other people who feel similarly who didn't even bother commenting. I appreciate you sharing your take, and if this feels like dragging you into something you don't wanna deal with, no worries. But, I think having concrete people/examples is helpful. I also think a lot of what I'm saying applies to people I'd characterize as "in the MIRI camp", who also haven't done much reviewing, although I'd frame my response a bit differently)

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2023-02-05T10:08:17.794Z · LW(p) · GW(p)

Sorry, didn't see this until now (didn't get a notification, since it was a reply to Buck's comment).

I'm guessing your take is like "I, Buck/Rohin, could write a review that was epistemically adequate, but I'm busy and don't expect it to accomplish anything that useful."

In some sense yes, but also, looking at posts I've commented on in the last ~6 months, I have [LW(p) · GW(p)] written [LW(p) · GW(p)] several [LW(p) · GW(p)] technical [LW(p) · GW(p)] reviews [LW(p) · GW(p)] (and nontechnical [LW(p) · GW(p)] reviews [LW(p) · GW(p)]). And these are only the cases where I wrote a comment that engaged in detail with the main point of the post; many of my other comments review specific claims and arguments within posts.

(I would be interested in quantifications of how valuable those reviews are to people other than the post authors. I'd think it is pretty low.)

I'd be very surprised if there weren't at least some clusters of research that Rohin/Buck/etc are more optimistic about.

Yes, but they're usually papers, not LessWrong posts, and I do give feedback to their authors -- it just doesn't happen publicly.

(And it would be maybe 10x more work to make it public, because (a) I have to now write the review to be understandable by people with wildly different backgrounds and (b) I would hold myself to a higher standard (imo correctly).)

(Indeed if you look at the reviews I linked above one common thread is that they are responding to specific people whose views I have some knowledge of, and the reviews are written with those people in mind as the audience.)

I also think "people incrementally do more review work each year as it builds momentum" is pretty realistic

I broadly agree with this and mostly feel like it is the sort of thing that is happening amongst the folks who are working on prosaic alignment.

If the highest review-voted work is controversial, I think it's useful for the field orienting to know that it's controversial. [...] if the alignment field is full of controversy and people who think each other are confused, I think this is a fairly reasonable fact to come out of any kind of review process

We already know this though? You have to isolate particular subclusters (Nate/Eliezer, shard theory folks, IDA/debate folks, etc) before it's even plausible to find pieces of work that might be uncontroversial. We don't need to go through a review effort to learn that.

(This is different from beliefs / opinions that are uncontroversial; there are lots of those.)

(And when I say that they are controversial, I mean that people will disagree significantly on whether it makes progress on alignment, or what the value of the work is; often the work will make technical claims that are uncontroversial. I do think it could be good to highlight which of the technical claims are controversial.)

I'm skeptical that the actual top-voted posts trigger this reaction.

What is "this reaction"?

If you mean "work being conceptually confused, or simply stating points rather than arguing for them, or being otherwise epistemically sketchy", then I agree there are posts that don't trigger this reaction (but that doesn't seem too relevant to whether it is good to write reviews).

If you mean "reviews of these posts by a randomly selected alignment person would not be very useful", then I do still have that reaction to every single one of those posts.

comment by Raemon · 2023-01-17T01:32:37.353Z · LW(p) · GW(p)

rather than actually talking about the details, which is what I would usually find useful about reviews.

I'm interested in details about what you find useful about the prospect of reviews that talk about the details. I share a sense that it'd be helpful, but I'm not sure I could justify that belief very strongly (when it comes to the opportunity cost of the people qualified to do the job)

In general, I'm legit fairly uncertain whether "effort-reviews" (whether detail-focused or big-picture focused) are worthwhile. It seems plausible to me that detail-focused reviews are more useful soon after a work is published than 2 years later, and that big-picture reviews are more useful in the "two year retrospective" sense (and maybe we should figure out some way to get detail-oriented reviews done more frequently, faster). 

It does seem to me that, by the time a post is being considered as "long-term valuable", I would like someone, at some point, to have done a detail-oriented review examining all of the fiddly pieces. In some cases, that review has been done before the post was even published, in a private Google doc.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2023-01-18T16:26:58.622Z · LW(p) · GW(p)

A couple of reasons:

  1. It's far easier for me to figure out how much to update on evidence when someone else has looked at the details and highlighted ways in which the evidence is stronger or weaker than a reader might naively take away from the paper. (At least, assuming the reviewer did a good job.)
    1. This doesn't apply to big-picture reviews because such reviews are typically a rehash of old arguments I already know.
    2. This is similar to the general idea in AI safety via debate -- when you have access to a review you are more like a judge; without a review you are more like the debate opponent.
  2. Having someone else explain the paper from their perspective can surface other ways of thinking about the paper that can help with understanding it.
    1. This sometimes does happen with big-picture reviews, though I think it's less common.

Tbc, I'm not necessarily saying it is worth the opportunity cost of the reviewer's time; I haven't thought much about it.

comment by Joe Collman (Joe_Collman) · 2023-01-13T12:17:54.422Z · LW(p) · GW(p)

for that to fix these problems the reviewers would have to be more epistemically competent than the post authors

I think this is an overstatement. They'd need to notice issues the post authors missed. That doesn't require greater epistemic competence: they need only tend to make different mistakes, not fewer mistakes.

Certainly there's a point below which the signal-to-noise ratio is too low. I agree that high reviewer quality is important.

On the "same old cruxes and disagreements" I imagine you're right - but to me that suggests we need a more effective mechanism to clarify/resolve them (I think you're correct in implying that review is not that mechanism - I don't think academic review achieves this either). It's otherwise unsurprising that they bubble up everywhere.

I don't have any clear sense of the degree of time and effort that has gone into clarifying/resolving such cruxes, and I'm sure it tends to be a frustrating process. However, my guess is that the answer is "nowhere close to enough". Unless researchers have very high confidence that they're on the right side of such disagreements, it seems appropriate to me to spend ~6 months focusing on purely this (of course this would require coordination, and presumably seems wildly impractical).

My sense is that nothing on this scale happens (right?), and that the reasons have more to do with (entirely understandable) impracticality, coordination difficulties and frustration, than with principled epistemics and EV calculations.
But perhaps I'm way off? My apologies if this is one of the same old cruxes and disagreements :).

Replies from: rohinmshah, Raemon
comment by Rohin Shah (rohinmshah) · 2023-01-13T22:53:33.755Z · LW(p) · GW(p)

That doesn't require greater epistemic competence: they need only tend to make different mistakes, not fewer mistakes.

Yes, that's true, I agree my original comment is overstated for this reason. (But it doesn't change my actual prediction about what would happen; I still don't expect reviewers to catch issues.)

My sense is that nothing on this scale happens (right?)

I'd guess that I've spent around 6 months debating these sorts of cruxes and disagreements (though not with a single person of course). I think the main bottleneck is finding avenues that would actually make progress.

Replies from: Joe_Collman
comment by Joe Collman (Joe_Collman) · 2023-01-14T01:58:04.865Z · LW(p) · GW(p)

Ah, well that's mildly discouraging (encouraging that you've made this scale of effort; discouraging in what it says about the difficulty of progress).

I'd still be interested to know what you'd see as a promising approach here - if such crux resolution were the only problem, and you were able to coordinate things as you wished, what would be a (relatively) promising strategy?
But perhaps you're already pursuing it? I.e. if something like [everyone works on what they see as key problems, increases their own understanding and shares insights] seems most likely to open up paths to progress.

Assuming review wouldn't do much to help on this, have you thought about distributed mechanisms that might? E.g. mapping out core cruxes and linking all available discussions where they seem a fundamental issue (potentially after holding/writing-up a bunch more MIRI Dialogues [? · GW] style interactions [which needn't all involve MIRI]).
Does this kind of thing seem likely to be of little value - e.g. because it ends up clearly highlighting where different intuitions show up, but shedding little light on their roots or potential justification?

I suppose I'd like to know what shape of evidence seems most likely to lead to progress - and whether much/any of it might be unearthed through clarification/distillation/mapping of existing ideas. (where the mapping doesn't require connections that only people with the deepest models will find)

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2023-01-16T11:58:53.957Z · LW(p) · GW(p)

Personally if I were trying to do this I'd probably aim to do a combination of:

  1. Identify what kinds of reasoning people are employing, investigate under what conditions they tend to lead to the truth. E.g. one way that I think I differ from many others is that I am skeptical of analogies as direct evidence about the truth; I see the point of analogies as (a) tools for communicating ideas more effectively and (b) locating hypotheses that you then verify by understanding the underlying mechanism and checking that the mechanism ports (after which you don't need the analogy any more). 
  2. State arguments more precisely and rigorously, to narrow in on more specific claims that people disagree about (note there are a lot of skulls along this road)
comment by Raemon · 2023-01-13T17:51:47.209Z · LW(p) · GW(p)

FWIW I think a fairly substantial amount of effort has gone into resolving longstanding disagreements. I think that effort has resulted in a lot of good writing, and in updates from many people reading the disagreement discussions, but it hasn't really changed the minds of the people doing the arguing. (See: the MIRI Dialogues [? · GW])

And it's totally plausible to me the answer is "10-100x the amount of work that has gone in so far."

I maybe agree that people haven't literally sat and double-cruxed for six months. I don't know that it's fair to describe this as "impracticality, coordination difficulties and frustration" instead of "principled epistemics and EV calculations." Like, if you've done a thing a bunch and it doesn't seem to be working and you feel like you have traction on another thing, it's not crazy to do the other thing.

(That said, I do still have the gut level feeling of 'man it's absolutely bonkers that in the so-called rationality community a lot of prominent thinkers still disagree about such fundamental stuff.')

Replies from: Joe_Collman
comment by Joe Collman (Joe_Collman) · 2023-01-13T19:05:26.949Z · LW(p) · GW(p)

Oh sure, I certainly don't mean to imply that there's been little effort in absolute terms - I'm very encouraged by the MIRI dialogues, and assume there are a bunch of behind-the-scenes conversations going on.
I also assume that everyone is doing what seems best in good faith, and has potentially high-value demands on their time.

However, given the stakes, I think it's a time for extraordinary efforts - and so I worry that [this isn't the kind of thing that is usually done] is doing too much work.

I think the "principled epistemics and EV calculations" could perfectly well be the explanation, if it were the case that most researchers put around a 1% chance on [Eliezer/Nate/John... are largely correct on the cruxy stuff].

That's not the sense I get - more that many put the odds somewhere around 5% to 25%, but don't believe the arguments are sufficiently crisp to allow productive engagement.

If I'm correct on that (and I may well not be), it does not seem a principled justification for the status quo. Granted, the right course isn't obvious - we'd need whoever's on the other side of the double-cruxing to really know their stuff. Perhaps Paul's/Rohin's... time is too valuable for a 6-month cost to pay off. (The more realistic version likely involves not-quite-so-valuable people from each 'side' doing it.)

As for "done a thing a bunch and it doesn't seem to be working", what's the prior on [two experts in a field from very different schools of thought talk for about a week and try to reach agreement]? I'm no expert, but I strongly expect that not to work in most cases.

To have a realistic expectation of its working, you'd need to be doing the kinds of thing that are highly non-standard. Experts having some discussions over a week is standard. Making it your one focus for 6 months is not. (frankly, I'd be over the moon for the one month version [but again, for all I know this may have been tried])

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2023-01-13T21:32:13.659Z · LW(p) · GW(p)

Even more importantly, Aumann's Agreement Theorem demands that both sides eventually agree on something, and the fact that the AI Alignment field hasn't agreed yet is concerning.

Here's the link:

https://www.lesswrong.com/tag/aumann-s-agreement-theorem [? · GW]

comment by habryka (habryka4) · 2023-01-11T15:11:51.221Z · LW(p) · GW(p)

This was quite a while ago, probably over 2 years, though I do feel like I remember it quite distinctly. I guess my model of you has updated somewhat here over the years, and now is more interested in heads-down work.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2023-01-11T15:45:01.342Z · LW(p) · GW(p)

Yeah, that sounds entirely plausible if it was over 2 years ago, just because I'm terrible at remembering my opinions from that long ago.

comment by Roman Leventov · 2023-01-16T04:29:06.649Z · LW(p) · GW(p)

I feel that LW is quite bad as a system for performing AI safety research - most likely worse, in aggregate, than the traditional system of academic publishing. A random list of things that I find unhelpful:
 

  • Very short “attention timespan”. People read posts within a couple of days of publishing unless curated (but curation is not a solution, because it also either happens or doesn’t within a short time window, and is a subjective judgement of a few moderators), or perhaps within a few weeks if hugely upvoted. And a big “wave of upvotes” is also a fairly immediate reaction rather than an appreciation of the research.
    • Contrast: academic papers get read when they have many citations. Citations accumulate over time. Citations are a much more reliable indicator of a paper’s usefulness (exception: controversial takes by high-profile authors that invite a lot of counter-takes). I suspect a paper’s “peak read” is somewhere within a year of the publication date, but definitely not on day 1 after publishing.
    • Given LW/AF’s naturally faster turnover of posts and ideas (also necessitated by the ultra-fast AI progress), it could be more useful for guiding research if the front page instead showed the posts that had accumulated the most citations (backlinks) on these forums in the last few months.
  • No standards or practices of review. There is no expectation that established researchers review the work of novices, thereby disseminating ideas and helping novices fix their conceptual confusions. Unstructured comments are way too much of a “poor man’s” review: they often cherry-pick a small point, and the ensuing argument quickly loses the big picture, making it largely unhelpful.
    • Yann LeCun proposed a review system here that seems to be implementable on LW/AF. Importantly, a full review of a post (a “paper”) could give it a “rating”, but a mere comment can’t. People can subscribe to streams of posts, approved (or highly rated) by specific researchers or groups (“Reviewing Entities”).
    • The cumulative rating of the post could also be used to guide the community’s attention. But note that reading a big post and writing a full review of it and giving it a rating or approval is much harder than just clicking the upvote button, and is in itself an intellectual bar. The “crowd” that just clicks upvotes (usually before even reading the post beyond the title) shouldn’t guide the attention of the research community.
    • The current review system on LW emphasises focusing on specific lines and claims, rather than on the big picture, and doesn’t guide the reviewers to ask the right questions about the work.
  • No community standards (nor native LW support, e.g. for a dedicated “references” section) for referencing prior work. An academic paper in physics, computer science, biology, etc. with zero or only a couple of references to prior work is unfathomable; on LW/AF this is standard practice (including for posts attempting to make research contributions: develop new theories, etc.). Also, these references shouldn’t be solipsistically limited to LW/AF alone.
  • The selectivity of AF doesn’t seem to me to work or to help. (Of course, this take is shaped by my personal experience of posts that IMO are obviously relevant for AF, and of no lower quality than a big portion of the posts that do appear there, not getting published on AF because I’m not a member of the in-group and, perhaps, not completely onboard with the currently most popular paradigms for thinking about alignment.) It seems that AF just reinforces the “echo chamber” effect that is already present on LW (relative to the rest of scientific publishing/thought).

AF as a research medium

I think the following setup would be interesting: all posts with tags like “AI”, “Alignment”, “Agency”, plus an additional tag/label that authors may add, like “Research”, are automatically visible on “AF” (though it wouldn’t make much sense to call that medium a “forum” anymore). On these posts, apart from regular comments, people could also post structured reviews (with many obligatory sections, similar to academic reviews). There is already similar machinery on LW: questions can have “answers” as well as regular comments. 

Finally, the content on the “new AF” is ranked completely differently than on LW, along the lines I described above: regular LW upvotes and downvotes don’t matter at all, but the ratings given out in reviews do (along, perhaps, with the reputation of the reviewers), as do backlink/reference counts. This necessarily implies that the content on this “AF” becomes visible on the front page only slowly (busy researchers are unlikely to review work faster than on roughly a week’s timescale, and reviews themselves take time), but it also potentially stays there longer if it generates a lot of backlinks.
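As a concrete illustration, here is a minimal sketch of what such a ranking might look like - all the names, fields and weights below are my own illustrative assumptions, not a description of any existing LW/AF feature:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Review:
    """A structured review carrying an explicit rating, unlike a plain comment."""
    reviewer: str
    rating: float           # e.g. overall assessment on a 1-10 scale
    reviewer_weight: float  # reputation of the reviewer or "Reviewing Entity"


@dataclass
class Post:
    title: str
    reviews: List[Review] = field(default_factory=list)
    backlinks: int = 0      # references from other posts on the forum


def frontpage_score(post: Post, backlink_weight: float = 0.5) -> float:
    """Hypothetical score: reputation-weighted review ratings plus backlink credit.
    Ordinary upvotes are deliberately ignored."""
    if post.reviews:
        total_weight = sum(r.reviewer_weight for r in post.reviews)
        review_score = sum(r.rating * r.reviewer_weight for r in post.reviews) / total_weight
    else:
        review_score = 0.0
    return review_score + backlink_weight * post.backlinks


posts = [
    Post("Post A", reviews=[Review("alice", 8.0, 2.0)], backlinks=3),
    Post("Post B", backlinks=10),  # many backlinks but not yet reviewed
]
ranked = sorted(posts, key=frontpage_score, reverse=True)
```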

Also, this "new AF" should provide formal academic labelling, similar to how https://distill.pub/ did.


If someone shares these ideas and has more thoughts on why the LW/AF system doesn’t serve AI safety progress well, I would like to collaborate on a shared post (for LW).

PS. We, the AI Safety community, need to “outcompete” capability research, including on this front, so it’s unacceptable that we lag behind, including because our epistemic systems don’t support a more effective accumulation of knowledge and insight.

Replies from: habryka4
comment by habryka (habryka4) · 2023-01-16T04:32:35.060Z · LW(p) · GW(p)

I have definitely been neglecting engineering and mechanism design for the AI Alignment Forum for quite a while, so concrete ideas for how to reform things are quite welcome. I also think things aren't working that well, though my guess is my current top criticisms are quite different from yours.

comment by Seb Farquhar · 2023-01-10T11:49:58.084Z · LW(p) · GW(p)

I've also been thinking about how to boost reviewing in the alignment field. Unsure if AF is the right venue, but it might be. I was more thinking along the lines of academic peer review. Main advantages of reviewing generally I see are:
- Encourages sharper/clearer thinking and writing;
- Makes research more inter-operable between groups;
- Catches some errors;
- Helps filter the most important results.

Obviously peer review is imperfect at all of these. But so is upvoting or not doing review systematically.

I think the main reasons alignment researchers currently don't submit their work to peer reviewed venues are:
- Existing peer reviewed venues are super slow (something like 4 month turnaround is considered good).
- Existing peer reviewed venues have few expert reviewers in alignment, so reviews are low quality and complain about things which are distractions.
- Existing peer reviewed venues often have pretty low-effort reviews.
- Many alignment researchers have not been trained in how to write ML papers that get accepted, so they have bad experiences at ML conferences that turn them off.

One hypothesis I've heard from people is that actually alignment researchers are great at sending out their work for feedback from actual peers, and the AF is good for getting feedback as well, so there's no problem that needs fixing. This seems unlikely. Critical feedback from people who aren't already thinking on your wavelength is uncomfortable to get and effortful to integrate, so I'd expect natural demand to be lower than optimal. Giving careful feedback is also effortful so I'd expect it to be undersupplied.

I've been considering a high-effort 'journal' for alignment research. It would be properly funded and would pay for high-effort reviews, aiming for something like a 1 week desk-reject and a 2 week initial review time. By focusing on AGI safety/Alignment you could maintain a pool of actually relevant expert reviewers. You'd probably want to keep some of the practice of academic review process (e.g., structured fields for feedback from reviewers), ranking or sorting papers for significance and quality; but not others (e.g., allow markdown or google doc submissions).

In my dream version of this, you'd use prediction markets about the ultimate impact of the paper, and then uprate the reviews from profitable impact forecasters.
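A minimal sketch of what that weighting might look like (the function and the numbers are purely illustrative assumptions on my part, not an existing system):

```python
def weighted_review_score(reviews, forecaster_profit):
    """Average the review ratings, weighting each reviewer by their cumulative
    profit in impact prediction markets, floored at a small baseline so that
    unprofitable forecasters still count a little."""
    baseline = 0.1
    weights = [max(forecaster_profit.get(reviewer, 0.0), baseline)
               for reviewer, _rating in reviews]
    return sum(w * rating for w, (_reviewer, rating) in zip(weights, reviews)) / sum(weights)


# reviews are (reviewer, rating) pairs; forecaster_profit maps reviewer -> market profit
reviews = [("alice", 7.0), ("bob", 4.0)]
forecaster_profit = {"alice": 3.2, "bob": 0.0}
print(weighted_review_score(reviews, forecaster_profit))  # lands closer to alice's rating
```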

Would be good to talk with people who are interested in this or variants. I'm pretty uncertain about the right format, but I think we can probably build something better than what we have now and the potential for value is large. I'm especially worried about the alignment community forming cliques that individually feel good about their work and don't engage with concerns from other researchers and people feeling so much urgency that they make sloppy logical mistakes that end up being extremely costly.

Replies from: jacques-thibodeau, Roman Leventov
comment by jacquesthibs (jacques-thibodeau) · 2023-01-16T04:48:10.601Z · LW(p) · GW(p)

I’ve talked to a few people who have suggested journal or conference ideas, but they never happened. I think it was mostly a mix of not knowing how to do it well and (mostly) they were busy with other stuff. Someone probably actually needs to take initiative on this if we want our research to be more ‘academic’.

Regardless of whether a journal is created or not, I’ve certainly wished I had more academic collaborators or someone who could teach me how to publish work that ends up being accepted within the ML community. As an independent researcher, it feels like the gap is too large and causes too much friction to figure things out and get started.

Replies from: Seb Farquhar
comment by Seb Farquhar · 2023-01-16T16:45:18.431Z · LW(p) · GW(p)

Yeah, I think it requires some specialist skills, time, and a bit of initiative. But it's not deeply super hard.

Sadly, I think learning how to write papers for ML conferences is pretty time-consuming. It's one of the main things a PhD student spends time learning in the first year or two of their PhD. I do think there's a lot that's genuinely useful about that practice, though; it's not just jumping through hoops.

comment by Roman Leventov · 2023-01-16T04:36:00.473Z · LW(p) · GW(p)

I strongly agree with most of this.

Did you see LeCun's proposal about how to improve academic review here? It strikes me as very good, and I'd love it if the AI safety/x-risk community had a system like this.

I'm suspicious about creating a separate journal, rather than concentrating efforts around existing institutions: LW/AF. I think it would be better to fund LW exactly for this purpose and add monetary incentives for providing good reviews of research writing on LW/AF (and, of course, the research writing itself could be incentivised in this way, too).

Then, turn AF in exactly the kind of "journal" that you proposed, as I described here [LW(p) · GW(p)].

Replies from: Seb Farquhar
comment by Seb Farquhar · 2023-01-16T16:47:35.004Z · LW(p) · GW(p)

Yeah, LeCun's proposal seems interesting. I was actually involved in an attempt to modify OpenReview to push along those lines a couple years ago. But it became very much a 'perfect is the enemy of the good' situation where the technical complexity grew too fast relative to the amount of engineering effort devoted to it.

What makes you suspicious about a separate journal? Diluting attention? Hard to make new things? Or something else? I'm sympathetic to diluting attention, but bet that making a new thing wouldn't be that hard.

Replies from: Roman Leventov
comment by Roman Leventov · 2023-01-17T09:05:33.760Z · LW(p) · GW(p)

Attention dilution, exactly. Ultimately, I want all relevant work to be syndicated on LW/AF (via Linkposts and review posts), because I think this will be more effective, rather than the other way around, where AI safety researchers have to subscribe to arxiv sanity, the Google AI blog, all relevant standalone blogs such as Bengio's and Scott Aaronson's, etc., all by themselves and separately.

I even think it would be very valuable if LW hired part-time staff dedicated to doing this.

Also, alignment newsletters, which further pre-process information, haven't survived. Shah tried to revive his newsletter in the middle of last year, but it didn't last long. Part-time staff could also curate such an "AF newsletter"; I don't think it takes Shah's level of competence to do this well.

Replies from: Seb Farquhar
comment by Seb Farquhar · 2023-01-17T17:59:17.972Z · LW(p) · GW(p)

FWIW I think doing something like the newsletter well actually does take very rare skills. Summarizing well is really hard. Having relevant/interesting opinions about the papers is even harder.

comment by the gears to ascension (lahwran) · 2023-01-10T07:14:05.021Z · LW(p) · GW(p)

Hmm. Perhaps it could be a good idea. However, what do y'all think about the various takes that much of the problem with science boils down to the peer review process? eg:

and I'm sure even more to-the-point sources could be found, though these are no slouch themselves. Of course, I was quite literally searching for articles that claim peer review is not working, so there's an obvious bias to the articles I found! But that's because I'd recently seen the Dec 2022 one. The full Metaphor results have a wider variety of nuanced takes; I just selected some that seemed like they made the point I wanted to make, so there's an additional layer of selection there. Nonetheless, it seems like they're making an interesting point.

Replies from: TekhneMakre
comment by TekhneMakre · 2023-01-10T07:41:21.502Z · LW(p) · GW(p)

This seems important. Can you crystallize more of the causality, from your reading? E.g. is it because peer review creates cabals and entrenched interests who upvote work that makes their work seem "in the hot areas", or similar? Or because it creates wasteful work for the academics trying to conform to logistical peer review requirements? Or predatory journals select for bad editors? Or it creates an illusion of consensus, obscuring that there are gaping wide open questions? Or...?

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2023-01-10T08:08:18.073Z · LW(p) · GW(p)

I don't feel qualified to distill, which is why I did not. I only have a fuzzy grasp of the issue myself. Your hypotheses all seem plausible to me.

comment by JakubK (jskatt) · 2023-01-10T05:45:31.305Z · LW(p) · GW(p)

many people wish there was more of a review process in the AI Alignment field. 

Could you provide more details about these opinions? I imagine the people you talked to have different ideas about what a review process might look like and why it would be useful.

comment by alkjash · 2023-01-10T23:22:29.450Z · LW(p) · GW(p)

Is it just me or are alignment-related post titles getting longer and longer?