Hypothesis regarding your confusion about agency:

Describing humans using a "utility function" or through "goals" is wrong.

Humans are a bunch of habits (like CFAR TAPs) which have some correlation with working towards goals, but describing them in terms of goals is more of an imperfect rationalization than a reasonable/natural way to describe the situation.

Also yes, we have some part that thinks in goals, but it has a very limited effect on anything (like actions) compared to what we'd naturally think.

Credit to a friend

[I have no idea what I'm talking about, so feel free to ignore this if it doesn't resonate; it just seemed worth a comment]

This has influenced how I write things and how I give disclaimers (I read it a few months ago)


Some people here inspire me to make predictions ;) So here's my attempt:

My guess, mainly based on this image (linked from the post):


is that he'd say it's a subcategory of "getting models to output things based only on their training data, while treating them as a black box and still assuming unexpected outputs will happen sometimes", as well as "this might work well for training, but obviously not for an AGI", "if we're going to talk about limiting a model's output, Redwood Research is more the way to go", and perhaps "this will just advance AI faster".

[AI risk question, not sure where to ask]

Hey, could you (or someone) help me understand how useful this would be? (Or, what would Yudkowsky say about it?)

I'm asking because this might be something that I, or someone that I know, could do

@hnykda this is why I have a vmDaniel. You see, it's even Duncan Approved