Some ideas for follow-up projects to Redwood Research’s recent paper

post by JanB (JanBrauner) · 2022-06-06T13:29:08.622Z

Contents

  How does this methodology fail?
  Improve the classifier or its training
  Do a project on high-stakes reliability that does not use human oversight, in order to focus more on the ML side of things
  Do a project on high-stakes reliability that uses real-world high-stakes tasks
  Do a project on high-stakes reliability that is more similar to AGI alignment in one (or several?) key way(s):
  How to evaluate high-stakes reliability?
  Better adversarial attacks
  Create a benchmark dataset that captures worst-case reliability
  What does work on high-stakes reliability tell us about the difficulty of alignment?
  Some thoughts/ideas prompted by reading this paper

Disclaimer: I originally wrote this list for myself and then decided it might be worth sharing. Very unpolished, not worth reading for most people. Some (many?) of these ideas were suggested in the paper already; I don’t make any claims of novelty.


 

I am super excited by Redwood Research’s new paper, and especially by the problem of “high-stakes reliability”.

Others have explained much better why the problem of “high-stakes reliability” is a worthy one (LINK, LINK), so I’ll keep my opinion brief. In short, I think:


 

Here is an extremely unpolished list of ideas and directions for follow-up research that I had after reading the paper, in no particular order.


 

How does this methodology fail?


 

Improve the classifier or its training
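For instance, one could run more rounds of the attack–label–retrain loop from the paper, or vary what goes into each round. As a point of reference, here is a minimal sketch of one such round (my own illustration, not Redwood’s code); `classifier`, `attack`, `label`, and `train_step` are hypothetical stand-ins for a real injuriousness classifier, an attack method, an oversight signal, and an optimizer step:

```python
from typing import Callable, List, Tuple

def adversarial_training_round(
    classifier: Callable[[str], float],   # maps a completion to P(injurious)
    attack: Callable[[Callable[[str], float]], List[str]],  # proposes likely failures
    label: Callable[[str], bool],         # ground-truth oversight (human or oracle)
    train_step: Callable[[List[Tuple[str, bool]]], None],   # updates the classifier
) -> int:
    """One round: attack the classifier, label the candidates, retrain on missed failures."""
    candidates = attack(classifier)
    failures = [
        (x, True) for x in candidates
        if label(x) and classifier(x) < 0.5  # injurious, but the classifier let it through
    ]
    train_step(failures)
    return len(failures)
```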


 

Do a project on high-stakes reliability that does not use human oversight, in order to focus more on the ML side of things
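One way to do this: pick a task where ground truth is programmatically checkable, so labels are exact and unlimited, and the project can focus on whether training can drive worst-case failures to ~zero. A toy illustration (entirely hypothetical; the constraint and names are made up):

```python
import re

# Stand-in constraint: treat any completion containing these tokens as a "failure".
FORBIDDEN = re.compile(r"\b(password|secret_key)\b")

def oracle(completion: str) -> bool:
    """Exact, free ground truth: True iff the completion violates the constraint."""
    return FORBIDDEN.search(completion) is not None
```

With an oracle like this, the oversight side of the problem disappears, isolating the ML question.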

Do a project on high-stakes reliability that uses real-world high-stakes tasks


 

Do a project on high-stakes reliability that is more similar to AGI alignment in one (or several?) key way(s):



 

How to evaluate high-stakes reliability?
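One basic difficulty: very low failure rates cannot be measured by random sampling alone. A standard Clopper–Pearson bound illustrates the problem (a sketch; it assumes i.i.d. samples, which says nothing about adversarially chosen inputs, i.e. exactly the high-stakes setting):

```python
from scipy.stats import beta

def failure_rate_upper_bound(k: int, n: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper bound on p, after k failures in n i.i.d. trials."""
    if k >= n:
        return 1.0
    return float(beta.ppf(confidence, k + 1, n - k))

# Even 3,000 failure-free samples only bound the failure rate at ~1e-3:
print(failure_rate_upper_bound(0, 3000))  # ≈ 0.000998
```

This is part of why adversarial evaluation (rather than sampling from the training distribution) seems necessary for worst-case claims.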


 

Better adversarial attacks
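For context, the simplest baseline in this family is a greedy token-substitution search against the classifier score. Below is a generic sketch of that idea (not the paper’s tools; all names are mine). Better attacks would presumably use gradients, paraphrase models, or humans in the loop:

```python
import random
from typing import Callable, List

def greedy_substitution_attack(
    classifier: Callable[[str], float],  # maps a completion to P(injurious)
    seed_text: str,
    substitutions: List[str],            # candidate replacement tokens (an assumption)
    steps: int = 100,
) -> str:
    """Hill-climb toward a low classifier score by swapping single tokens.

    Note: a real attack must also verify the result is still genuinely
    injurious (e.g., via human judgment); this sketch only minimizes the score.
    """
    tokens = seed_text.split()
    best_score = classifier(seed_text)
    for _ in range(steps):
        i = random.randrange(len(tokens))
        candidate = tokens.copy()
        candidate[i] = random.choice(substitutions)
        text = " ".join(candidate)
        score = classifier(text)
        if score < best_score:  # keep the swap if it fools the classifier more
            tokens, best_score = candidate, score
    return " ".join(tokens)
```

The interesting research question is how to find failures outside the classifier’s training distribution, not just local perturbations like these.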




 

Create a benchmark dataset that captures worst-case reliability
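If such a benchmark existed, each item would probably need provenance, so that later work can stratify results by how the failure was found. A hypothetical record format (the field names are my invention):

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    prompt: str         # the context the model completed
    completion: str     # the candidate output being judged
    injurious: bool     # ground-truth label from the oversight process
    attack_method: str  # e.g. "manual", "paraphrase", "automated search"
```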



 

What does work on high-stakes reliability tell us about the difficulty of alignment?


 

Some thoughts/ideas prompted by reading this paper
