Critiquing Risks From Learned Optimization, and Avoiding Cached Theories

post by ProofBySonnet (redheadbros) · 2023-07-11T11:38:34.436Z · LW · GW

Contents

  What I'm doing with this post and why
      Method:
  Initial Questions
  Summary
  Comments or Questions I had while reading
    The source of this model
    The purpose of this framework
  Results
    What alternate theories/frames would look like
    My internal "preface" for this framework
  Takeaways

What I'm doing with this post and why

I've been told that it's a major problem (this post [? · GW], point 4) that alignment students just accept the frames they're given without question. The usual advice (the intro of this post [LW · GW]) is to first do a bunch of background research and thinking on your own, and to come up with your own frames if at all possible. BUT: I REALLY want to learn about existing alignment frames; it's Fucking Goddamn Interesting.

<sidenote>At the last possible minute while editing this post, I had the realization that "the usual advice" I somehow acquired might not actually be "usual" at all. In fact, upon actually rereading the study guide [LW · GW], it looks like he recommends reading up on existing frames and trying to apply them. So I must've just completely hallucinated that advice! Or, at least, greatly overestimated how important that point from 7 traps [? · GW] was. I'm still going to follow through with this exercise, but holy shit, my internal category of "possible study routes" has expanded DRASTICALLY.</sidenote>

So I ask the question: is it possible to read about and pick apart an existing frame, such that you don't just end up uncritically accepting it? What questions do you need to ask or what thoughts do you need to think, in order to figure out the assumptions and/or problems with an existing framework?

I've chosen Risks From Learned Optimization [? · GW] because I've already been exposed to the basic concept, from various videos by Robert Miles. I shouldn't be losing anything by learning this more deeply and trying to critique it.

Moreover, I want to generate evidence of my competence anyway, so I might as well write this up.

Method:

I'm including here pretty much all the notes I generated while going through this process, though I've reorganized them somewhat to make them more readable and coherent.

Initial Questions

These were all generated before reading the sequence.

Summary

This is high-level: I don't need all the details in order to generate a useful critique.

First post: Introduction

Second post: Conditions for Mesa-Optimization

Third post: The Inner Alignment Problem

Fourth post: Deceptive Alignment

Fifth post: Conclusions and Related Work

Comments or Questions I had while reading

I jotted these down as I went, and during this editing process I grouped them into a few useful categories.

The source of this model

The purpose of this framework

An earlier observation: "X is an optimizer" is a way of thinking about X that lets us better predict X's behavior. If we see X do things, we can ask what objective would fit those actions. And if we know that X has objective A, we can predict which actions X is more likely to take, based on how much each action contributes to A.
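To make that concrete for myself, here's a minimal toy sketch (my own illustration, not anything from the sequence) of using "X optimizes objective A" as a predictive model: score each candidate action by how much we think it contributes to A, and predict that X is more likely to take higher-scoring actions. The action names and contribution numbers are made up.

```python
import math

def predict_action_probabilities(actions, contribution, temperature=1.0):
    """Score each candidate action by its hypothesized contribution to
    objective A, then return a softmax distribution over which action
    the system is most likely to take next."""
    scores = [contribution[a] / temperature for a in actions]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return {a: e / total for a, e in zip(actions, exps)}

# Hypothetical example: if we model X as optimizing "collect coins",
# we predict "grab coin" is far more likely than "wander" or "self-destruct".
contributions = {"grab coin": 1.0, "wander": 0.1, "self-destruct": -5.0}
print(predict_action_probabilities(list(contributions), contributions))
```

The point isn't the softmax itself; it's that attributing an objective to X immediately gives us a ranking over its possible actions, which is exactly the predictive work the "optimizer" frame is doing.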

Our true goal, roughly: predict what our AI agent is going to do, and take actions such that it does the sorts of things we want it to do.

I don't know what other categories or ways of thinking would help with this. Damn.

Results

What alternate theories/frames would look like

My internal "preface" for this framework

(These questions are for my own benefit. Ask them before applying the framework?)

Are the objects or agents in your particular problem full-on optimizers? Or do they break the optimizer/non-optimizer dichotomy? If the latter, consider that fact while reviewing this framework.

Is the optimizer-ness of the objects or agents in your problem the most relevant point? What other factors or categories might be relevant? What is the ACTUAL PROBLEM/QUESTION you are trying to solve/answer, and how might this framework help or hinder?

Takeaways

How I could improve or streamline this process, and what I want to do next.

After finishing this post: I plan to pick a new conceptual alignment tool or alignment framework, and try out this method on it. I'm not sure how to determine if I've "critiqued correctly," though. How can I tell if a framework has permanently skewed my thoughts? What is a test I can run?
