Comments

Comment by Kyle O’Brien (kyle-o-brien) on SAEs are highly dataset dependent: a case study on the refusal direction · 2024-11-08T21:25:17.933Z · LW · GW

Great work, folks. This further highlights a challenge that wasn't obvious to me when I first began to study SAEs — which features are learned is just super contingent on the SAE's size, sparsity, and training data. Ablations like this one are important.

Comment by Kyle O’Brien (kyle-o-brien) on Looking for an alignment tutor · 2022-12-18T23:25:19.468Z · LW · GW

I agree with this suggestion. EleutherAI's alignment channels have been invaluable for my understanding of the alignment problem. I typically get insightful responses and explanations on the same day as posting. I've also been able to answer other folks' questions to deepen my inside view.

There is an alignment-beginners channel and an alignment-general channel. Your questions seem similar to what I see in alignment-general. For example, I received helpful answers when I asked this question about inverse reinforcement learning there yesterday.

Question: When I read Human Compatible a while back, I had the takeaway that Stuart Russell was very bullish on Inverse Reinforcement Learning being an important alignment research direction. However, I don't see much mention of IRL on EleutherAI or the Alignment Forum; I see much more content about RLHF. Are IRL and RLHF the same thing? If not, what are folks' thoughts on IRL?