TASP Ep 3 - Optimal Policies Tend to Seek Power
post by Quinn (quinn-dougherty) · 2021-03-11T01:44:02.814Z
This is a link post for https://technical-ai-safety.libsyn.com/3-optimal-policies-tend-to-seek-power
Welcome to the Technical AI Safety Podcast, the show where I interview computer scientists about their papers. This month I covered Optimal Policies Tend to Seek Power, which is closely related to Seeking Power is Often Robustly Instrumental in MDPs, a post in the Reframing Impact sequence that was recently part of the 2019 review.
The point of the show is to make papers more parsable. The interview features a detailed walkthrough, padded on either side by discussion of where the work came from and where it's going.
I had a lot of fun doing this month's episode. The paper was tricky to wrap my head around, but very rewarding. Do let me know if you have trouble finding it on your favorite podcast app. Thanks!
Show notes:
With Alex Turner
Optimal Policies Tend to Seek Power
by Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli
Abstract:
Some researchers have speculated that capable reinforcement learning agents are often incentivized to seek resources and power in pursuit of their objectives. While seeking power in order to optimize a misspecified objective, agents might be incentivized to behave in undesirable ways, including rationally preventing deactivation and correction. Others have voiced skepticism: human power-seeking instincts seem idiosyncratic, and these urges need not be present in reinforcement learning agents. We formalize a notion of power within the context of Markov decision processes. With respect to a class of neutral reward function distributions, we provide sufficient conditions for when optimal policies tend to seek power over the environment.
What Counts as Defection?
Non-Obstruction