Is there a unified way to make sense of AI failure modes?

post by walking_mushroom · 2022-06-17T18:00:58.908Z · LW · GW · No comments

This is a question post.


Status: I'm new to AI alignment. I've gone through the AGI Fundamentals curriculum and am currently reading MIRI's Embedded Agency series, and I feel confused. I'm not familiar with decision theory.

So far, the cases discussed in the series (and AI failure modes in general) feel solid in themselves (if I follow their reasoning, with some proofs that I only half understand, etc.), but I don't yet see the connection between the different cases or any low-complexity structure behind them.

For example: there are many kinds of bridges and many ways bridges can fail (failure modes), but these all boil down to the principles of structural mechanics, and the structural integrity of a bridge can be determined from a few physical measurements, agnostic to the detailed bridge design.

Can we boil down the decision-theoretic failure modes mentioned in the Embedded Agency series (and AI failure modes in general) to a more universal picture of predictive principles, as in the case of bridges and structural mechanics?

Thank you all.

edit: formatting & slight wording; asking for specific recommendations

Answers

answer by Thomas Larsen · 2022-08-27T00:51:31.857Z · LW(p) · GW(p)

Late response, but I think Adam Shimi's Unbounded Atomic Optimization [LW · GW] does a good job of providing a unified frame for alignment failures.

I tend to think that decision theory failures are not a primary reason why we might land in AGI ruin; the AI having a poor decision theory is more of a capabilities problem. As far as I know, the motivation behind studying decision theory [LW · GW] is to get a better understanding of agency (and related concepts like counterfactual reasoning, logical updatelessness, etc.) at a basic level.

No comments
