Looking for an alignment tutor

post by JanB (JanBrauner) · 2022-12-17T19:08:10.770Z · LW · GW · 2 comments

Hey, this is me. I’d like to understand AI X-risk better. Is anyone interested in being my “alignment tutor”, for maybe 1 h per week, or 1 h every two weeks? I’m happy to pay.


Fields I want to understand better:


Fields I’m not interested in (right now):


My level of understanding:


Example questions I wrestled with recently and might bring up during tutoring:

You don’t need to have very crisp answers to these to be my tutor, but you should probably have at least some good thoughts.



Comments sorted by top scores.

comment by Ulisse Mini (ulisse-mini) · 2022-12-18T15:39:58.492Z · LW(p) · GW(p)

EleutherAI's #alignment channels are good to ask questions in. Some specific answers:

I understand that a reward maximiser would wire-head (take control over the reward provision mechanism), but I don’t see why training an RL agent would necessarily end up in a reward-maximising agent? Turntrout’s Reward is Not the Optimisation Target shed some clarity on this, but I definitely have remaining questions.

Leo Gao's Toward Deconfusing Wireheading and Reward Maximization [AF · GW] sheds some light on this.

Replies from: kyle-o-brien
comment by Kyle O’Brien (kyle-o-brien) · 2022-12-18T23:25:19.468Z · LW(p) · GW(p)

I agree with this suggestion. EleutherAI's alignment channels have been invaluable for my understanding of the alignment problem. I typically get insightful responses and explanations on the same day as posting. I've also been able to answer other folks' questions to deepen my inside view.

There is an alignment-beginners channel and an alignment-general channel. Your questions seem similar to what I see in alignment-general. For example, I received helpful answers when I asked this question about inverse reinforcement learning there yesterday.

Question: When I read Human Compatible a while back, I had the takeaway that Stuart Russell was very bullish on Inverse Reinforcement Learning being an important alignment research direction. However, I don’t see much mention of IRL on EleutherAI or the Alignment Forum. I see much more content about RLHF. Are IRL and RLHF the same thing? If not, what are folks’ thoughts on IRL?