Posts

Reflection Mechanisms as an Alignment Target - Attitudes on “near-term” AI 2023-03-02T04:29:47.741Z
On the Importance of Open Sourcing Reward Models 2023-01-02T19:01:52.993Z
Reflection Mechanisms as an Alignment target: A follow-up survey 2022-10-05T14:03:19.923Z
Reflection Mechanisms as an Alignment target: A survey 2022-06-22T15:05:55.703Z

Comments

Comment by elandgre on On the Importance of Open Sourcing Reward Models · 2023-01-02T20:42:32.179Z · LW · GW

I don't think I completely understand the objection. Is your concern that less competent organizations will over-fit to the reward models during fine-tuning, and so produce worse models than OpenAI/Anthropic are able to train? I think this is a fair objection, and one argument for open-sourcing the full model.

My main goal with this post is to argue that it would be good to "at least" open source the reward models, and that the benefits of doing so would far outweigh the costs, both for society and for the organizations doing the open-sourcing. I tend to think that completely unaligned models will get more backlash than imperfectly aligned ones, but maybe this is incorrect. I haven't thought deeply about whether it is safe or good for everyone to open source the underlying capability model.