Is there a unified way to make sense of AI failure modes?
post by walking_mushroom · 2022-06-17T18:00:58.908Z · LW · GW · No comments
This is a question post.
Status: I'm new to AI alignment. I went through the AGI Safety Fundamentals curriculum and am currently reading MIRI's Embedded Agency sequence, and I feel confused. I'm not familiar with decision theory.
So far, the cases discussed in the sequence (and AI failure modes in general) feel solid in themselves (if I follow their reasoning, with some proofs that I half understand, etc.), but I don't yet see the connection between different cases, or any low-complexity structure behind them.
For example: there are many kinds of bridges and many ways bridges can fail (failure modes), but they all boil down to the principles of structural mechanics, and the structural integrity of a bridge can be determined from a few physical measurements, agnostic to the detailed bridge design.
Can we boil down the decision-theoretic failure modes mentioned in the Embedded Agency sequence (and AI failure modes in general) to a more universal picture of predictive principles, as in the case of bridges and structural mechanics?
- If such an understanding exists, what should I learn first to acquire it? Specific reading recommendations are most appreciated!
- If not (or if this isn't a good question to ask), why is that so?
- Clearly I'm ignorant, but am I ignorant in a specific/common way, with a specific/common path out of it?
Thank you all.
edit: formatting & slight wording; asked for specific recommendations
Answers
Answer by Thomas Larsen
Late response, but I think Adam Shimi's Unbounded Atomic Optimization [LW · GW] does a good job of giving a unified frame on alignment failures.
I tend to think that decision-theory failures are not a primary reason we might land in AGI ruin; an AI having a poor decision theory is more of a capabilities problem. AFAIK, the motivation behind studying decision theory [LW · GW] is to gain a better understanding of agency (and related concepts like counterfactual reasoning, logical updatelessness, etc.) at a basic level.