When Should the Fire Alarm Go Off: A model for optimal thresholds 2021-04-28T12:27:20.031Z
Does making unsteady incremental progress work? 2021-03-05T07:23:30.338Z
Summary of AI Research Considerations for Human Existential Safety (ARCHES) 2020-12-09T23:28:43.635Z


Comment by peterbarnett on When Should the Fire Alarm Go Off: A model for optimal thresholds · 2021-04-29T10:00:25.774Z · LW · GW

Oh you're right! Thanks for catching that. I think I was led astray because I wanted there to be a big payoff for averting the bad event, but I guess the benefit is just not having to pay D.
I'll have a look and see how much this changes things.

Edit: Fixed it up now, none of the conclusions seem to change (which is good because they seemed like common sense!). Thanks for reading this and pointing that out!

Comment by peterbarnett on Does making unsteady incremental progress work? · 2021-03-05T10:49:33.635Z · LW · GW

Thanks! Yeah, I definitely think that "it's okay to slack today if I pull up the average later on" is a pretty common way people lose productivity. I think one framing could be that if you do have an off day, that doesn't have to put you off track forever, and you can make up for it in the future. 

I make the graphs using matplotlib's xkcd mode; it's super easy to use, you just put your plotting calls inside a `with plt.xkcd():` block.
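A minimal sketch of what this looks like in practice (the data, labels, and output filename here are just illustrative, not from the original post; the `Agg` backend is used so the script runs headlessly):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so no display is needed
import matplotlib.pyplot as plt

# Everything plotted inside this block gets the hand-drawn xkcd style
with plt.xkcd():
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
    ax.set_xlabel("days")
    ax.set_ylabel("progress")
    fig.savefig("xkcd_plot.png")
```

Note that matplotlib may warn about the "Humor Sans" font being missing; the style still applies, it just falls back to a default font.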

Comment by peterbarnett on Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem · 2020-09-18T01:50:08.741Z · LW · GW

My read of Russell's position is that if we can successfully make the agent uncertain about its model of human preferences, then it will defer to the human when it might otherwise do something bad, which hopefully solves (or at least helps with) making it corrigible.

I do agree that this doesn't seem to help with inner alignment, though; I'm still trying to wrap my head around that area.