Alignment Gaps 2024-06-08T15:23:16.396Z


Comment by kcyras on Alignment Gaps · 2024-06-10T18:14:21.808Z · LW · GW

Thanks, appreciated. My expectation is not that one should cite all generally relevant research. Rather, one is expected to quickly browse the existing literature, at least by searching similar keywords, for seemingly closely related works. If anything comes up, there are a few options.

  1. Just mention the work. This shows the reader that you've at least browsed around, and also gives somebody else the opportunity to explore that relevant angle. Arguably, interpretability researchers who aim at actually figuring out model internals must be aware of formal verification of DNNs and should position their work accordingly.
  2. Mention the work and argue that it is not applicable. This additionally justifies one's motivation. I've seen it done with respect to mech interp in one of the main papers, where typical heuristic explainability research was discarded as inapplicable, and appropriately so. A similar argument could be made regarding formal verification.
  3. Consider using the existing approach, or show what concrete problem it fails to address. This may even further incentivise proponents of the existing approach to improve their work. In short, it's not so much about citing others' work (that's only the metrics game) as about giving others the opportunity to relate your work to theirs (that's science).

Re preferences. The book is fine (yes, I've read it), but reading it in full is for somebody working deeply on preference-based reasoning. The point is about skimming through it, or at least picking up the conceptual claims about what the research covers. A relatively quick browse through such relevant materials should help one realise that value learning and value-based reasoning have been extensively studied. Again, using one of the three options above, there may be opportunities both to identify new problems and to point one's readers to (or away from) other lines of inquiry.