Looking for an alignment tutor

post by JanB (JanBrauner) · 2022-12-17T19:08:10.770Z · LW · GW · 2 comments

Hey, this is me. I’d like to understand AI X-risk better. Is anyone interested in being my “alignment tutor”, for maybe 1 h per week, or 1 h every two weeks? I’m happy to pay.

 

Fields I want to understand better:

 

Fields I’m not interested in (right now):

 

My level of understanding:

 

Example questions I wrestled with recently, and that I might bring up during the tutoring:

You don’t need to have very crisp answers to these to be my tutor, but you should probably have at least some good thoughts.


 

2 comments


comment by Ulisse Mini (ulisse-mini) · 2022-12-18T15:39:58.492Z · LW(p) · GW(p)

EleutherAI's #alignment channels are good places to ask questions. Some specific answers:

I understand that a reward maximiser would wirehead (take control of the reward provision mechanism), but I don’t see why training an RL agent would necessarily produce a reward-maximising agent. TurnTrout’s Reward is Not the Optimisation Target brought some clarity, but I definitely have remaining questions.

Leo Gao's Toward Deconfusing Wireheading and Reward Maximization [AF · GW] sheds some light on this.
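To make the distinction concrete, here is a minimal sketch (mine, not from either post) of a tabular REINFORCE update on a two-armed bandit. The point it illustrates: reward enters training only as a scalar weight on the parameter update, and the resulting policy contains no internal representation of reward to maximise or wirehead on at deployment.

```python
# Minimal sketch (not from the linked posts): tabular REINFORCE on a
# two-armed bandit.  The reward appears only as a scalar weighting on the
# parameter update; the learned policy itself never stores or evaluates
# "reward" when it acts.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)                 # policy parameters for two arms
true_means = np.array([0.2, 0.8])    # hypothetical environment: arm 1 pays more

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)
    reward = rng.normal(true_means[action], 0.1)  # scalar training signal
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0                    # d/d(logits) of log pi(action)
    logits += 0.1 * reward * grad_log_pi          # REINFORCE update

# At "deployment" the policy is just these logits -- learned action
# tendencies, with no reward term left inside to wirehead on.
print("final action probabilities:", softmax(logits))
```

Whether a trained agent nonetheless ends up caring about reward is exactly the open question the linked posts discuss; the sketch only shows that nothing in the basic training setup forces that.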

comment by Kyle O’Brien (kyle-o-brien) · 2022-12-18T23:25:19.468Z · LW(p) · GW(p)

I agree with this suggestion. EleutherAI's alignment channels have been invaluable for my understanding of the alignment problem. I typically get insightful responses and explanations on the same day as posting. I've also been able to answer other folks' questions to deepen my inside view.

There is an alignment-beginners channel and an alignment-general channel. Your questions seem similar to what I see in alignment-general. For example, I received helpful answers when I asked this question about inverse reinforcement learning there yesterday.

Question: When I read Human Compatible a while back, I had the takeaway that Stuart Russell was very bullish on inverse reinforcement learning (IRL) being an important alignment research direction. However, I don’t see much mention of IRL on the EleutherAI Discord or the Alignment Forum; I see much more content about RLHF. Are IRL and RLHF the same thing? If not, what are folks’ thoughts on IRL?
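Not an answer from the thread, but a toy sketch of where the two approaches differ mechanically (the function name is mine, not from any library): RLHF fits a reward model to pairwise human preferences, typically with a Bradley-Terry style loss, and then runs RL against that model; IRL has no comparison step and instead tries to recover a reward function that explains expert demonstrations.

```python
# Hypothetical sketch (names are mine, not from any library) of the loss RLHF
# commonly uses to fit a reward model from pairwise human preferences.  IRL,
# by contrast, starts from expert demonstrations and searches for a reward
# function under which those demonstrations look near-optimal.
import numpy as np

def rlhf_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    The scores are the reward model's scalar outputs for two candidate
    completions, where a human labelled the first one as better.
    """
    return float(-np.log(1.0 / (1.0 + np.exp(-(score_preferred - score_rejected)))))

# Toy usage: a reward model that already ranks the preferred sample higher
# gets a small loss; the mis-ranking case gets a large one.  Fitting the
# model on this loss is the step that precedes any RL against it.
print(rlhf_preference_loss(2.0, 0.5))   # ~0.20
print(rlhf_preference_loss(0.5, 2.0))   # ~1.70
```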