Posts

Suffering Is Not Pain 2024-06-18T18:04:43.407Z
Question about Lewis' counterfactual theory of causation 2024-06-07T20:15:27.561Z
Why I stopped working on AI safety 2024-05-02T05:08:56.813Z
Integrating Three Models of (Human) Cognition 2021-11-23T01:06:48.955Z
Grokking the Intentional Stance 2021-08-31T15:49:36.699Z
Discussion: Objective Robustness and Inner Alignment Terminology 2021-06-23T23:25:36.687Z
Empirical Observations of Objective Robustness Failures 2021-06-23T23:23:28.077Z
Old post/writing on optimization daemons? 2021-04-15T18:00:17.923Z
Mapping the Conceptual Territory in AI Existential Safety and Alignment 2021-02-12T07:55:54.438Z

Comments

Comment by jbkjr on MichaelDickens's Shortform · 2024-08-23T00:49:43.174Z · LW · GW

It sounds to me like a problem of not reasoning according to Occam's razor and "overfitting" a model to the available data.

Ceteris paribus, H' isn't more "fishy" than any other hypothesis, but H' is a significantly more complex hypothesis than H or ¬H: instead of asserting H or ¬H, it asserts (A=>H) & (B=>¬H), so it should have been commensurately de-weighted in the prior distribution according to its complexity. The fact that Alice's study supports H and Bob's contradicts it does, in fact, increase the weight given to H' in the posterior relative to its weight in the prior; it's just that H' is prima facie less likely, according to Occam.

Given all the evidence, the ratio of posteriors P(H'|E)/P(H|E) = P(E|H')P(H') / (P(E|H)P(H)). We know P(E|H') > P(E|H) (and P(E|H') > P(E|¬H)), since the results of Alice's and Bob's studies together are more likely given H', but P(H') < P(H) (and P(H') < P(¬H)) according to the complexity prior. Whether H' is more likely than H (or ¬H, respectively) ultimately comes down to whether the likelihood ratio P(E|H')/P(E|H) (or P(E|H')/P(E|¬H)) is larger or smaller than the inverse prior ratio P(H)/P(H') (or P(¬H)/P(H')).
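As a toy illustration of that comparison (every number below is made up purely for the sake of the example):

```python
# Toy sketch: does H' beat H once we penalize its complexity?
# E = (Alice's study supports H, Bob's study contradicts H).
p_E_given_H = 0.20       # H has to explain away Bob's contradictory result
p_E_given_notH = 0.20    # ¬H has to explain away Alice's supportive result
p_E_given_Hprime = 0.90  # H' = (A=>H) & (B=>¬H) predicts both results

# Complexity prior: H' asserts a conjunction, so it starts out less likely.
p_H, p_notH, p_Hprime = 0.45, 0.45, 0.10

posterior_ratio = (p_E_given_Hprime * p_Hprime) / (p_E_given_H * p_H)
print(posterior_ratio)  # ≈ 1.0 here: the better fit exactly offsets the prior penalty
```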

I think it ends up feeling fishy because the people formulating H' just used more features (the circumstances of the experiments) in a more complex model to fit the data only after having observed it, so it seems like, in selecting H' as a hypothesis, they're according it more weight than it deserves under the complexity prior.

Comment by jbkjr on Shortform · 2024-07-22T17:38:52.762Z · LW · GW

Why should I include any non-sentient systems in my moral circle? I haven't seen a case for that before.

Comment by jbkjr on Indecision and internalized authority figures · 2024-07-12T16:44:02.680Z · LW · GW

To me, "indecision results from sub-agential disagreement" seems almost tautological, at least within the context of multi-agent models of mind, since if all the sub-agents were in agreement, there wouldn't be any indecision. So, the question I have is: how often are disagreeing sub-agents "internalized authority figures"? I think I agree with you in that the general answer is "relatively often," although I expect a fair amount of variance between individuals.

Comment by jbkjr on Suffering Is Not Pain · 2024-06-24T16:39:31.779Z · LW · GW

I'd guess it's a problem of translation; I'm pretty confident the original text in Pali would just say "dukkha" there.

The Wikipedia entry for dukkha says it's commonly translated as "pain," but I'm very sure the referent of dukkha in experience is not pain, even if it's mistranslated as such, however commonly.

Comment by jbkjr on Suffering Is Not Pain · 2024-06-24T14:47:56.170Z · LW · GW

Say I have a strong desire to eat pizza, but only a weak craving. I have a hard time imagining what that would be like.

I think this is likely in part due to “desire” connoting both craving and preferring. In the Buddhist context, “desire” is often used more like “craving,” but on the other hand, if I have a pizza for dinner, it seems reasonable to say it was because I desired so (in the sense of having a preference for it), even if there was not any craving for it.

I think people tend to crave what they prefer until they’ve made progress on undoing the habit of craving/aversion, so it’s understandable that it can be hard for such a person to imagine having a strong preference without an associated craving. However, the difference becomes clearer if/when one experiences intentions and preferences in the absence of craving/aversion.

Perhaps it would be informative to examine your experience of preferring in instances other than e.g. eating, where I think there is more of a tendency to crave because “you need food to survive.” For example, if you’re writing and considering two ways of articulating something, you might find you have a strong preference for one way over another, but I imagine there might be less craving in the sense of “I must have it this way, not another.” Perhaps this isn’t the best example possible, but I think careful consideration will reveal the difference in experience between “desire” in the craving sense and “desire” in the preferring sense.

ETA: Another example I thought of is selecting a song to listen to if you're listening to music—you might want to listen to one song vs. others, but not necessarily have a strong craving for it.

Does then craving (rather than desire) frustration, or aversion realization, constitute suffering?

No, because craving something results in suffering, even if you get that which you crave, and being averse to something results in suffering, even if you avoid that to which you’re averse.

But still, it seems to make sense to say I have an aversion to pain because I suffer from it

I think it makes more sense to say there’s an aversion to pain because pain feels bad; since suffering is not a necessary consequence of pain, it doesn’t make sense to say that you’re averse to pain because it results in suffering. The causal chain is aversion->suffering, not the other way around.

Comment by jbkjr on Suffering Is Not Pain · 2024-06-23T19:10:05.462Z · LW · GW

I'd be interested if you have any other ideas for underexplored / underappreciated cause areas / intervention groups that might be worth further investigation when reevaluated via this pain vs suffering distinction?

Unfortunately, I don’t have much to point you toward supporting that I’m aware of already existing in the space. I’d generally be quite interested in studies which better evaluate meditation’s effects on directly reducing suffering in terms of e.g. how difficult it is for how many people to reduce their suffering by how much, but the EA community doesn’t seem to currently be focused on this very much. I am still supportive of existing organizations with a direct focus on reducing suffering; I just wanted to make the point that such organizations would do well to recognize the distinction between suffering and pain in order to ensure their efforts are actually aimed at suffering and not merely pain on the margin.

Comment by jbkjr on Suffering Is Not Pain · 2024-06-23T19:00:13.563Z · LW · GW

Then the question is whether the idiosyncratic words are only ever explained using other idiosyncratic words, or whether at some point it actually connects with the shared reality.

The point is that the words ground out in actual sensations and experiences, not just other words and concepts. What I’m arguing is that it’s not useful to use the English word “suffering” to refer to ordinary pain or displeasure, because there is a distinction in experience between what we refer to as “pain” or “displeasure” and what is referred to by the term “dukkha,” and that “suffering” is best understood as this dukkha. That we commonly say things like “he suffered the pain” is an indication of this distinction already existing in English, even if there is a tendency to messily equivocate between the two.

Comment by jbkjr on Suffering Is Not Pain · 2024-06-23T18:45:42.916Z · LW · GW

My point is that in English "experience such severe pain that one might prefer non-existence to continuing to endure that pain" would be considered an uncontroversial example of "suffering", not as something suffering-neutral to which suffering might or might not be added.

Sure, but I think that’s just because of the usual conflation between pain and suffering which I’m trying to address with this post. If you ask anyone with the relevant experience “does Buddhism teaching me to never suffer again mean that I’ll never experience (severe) pain again?”, they’ll just answer no. I don’t think it’s reasonable to think of this as a “bait-and-switch” because the dhamma never taught the end of pain, only the end of suffering; it’s not the dhamma’s fault if novices think the end of suffering means an end to pain.

Comment by jbkjr on Suffering Is Not Pain · 2024-06-23T18:36:51.983Z · LW · GW

The assumption that these can result in completely dropping the habit is entirely theoretical. The historical Buddha's abilities are lost to history. Modern meditators can perform immense feats of pain tolerance, but I personally haven't heard one claim to have completely eradicated the habit of suffering.

I believe Daniel Ingram makes such a claim by virtue of his claim of arhatship; if he still suffers then he cannot reasonably claim to be an arhat. He also has an anecdote of someone else he considers to be an arhat saying “This one is not suffering!” in response to a question at a retreat. I think it’s often the case that someone who has found the end of suffering doesn’t go around proclaiming it widely for various reasons.

More directly, I know a complete cessation of craving/aversion and therefore suffering is possible because I have experienced it; I do not suffer. I hesitate to make this claim publicly because I’m not interested in getting into debates about whether or not I actually do not suffer—I know so, and that’s enough for me. However, if it’s helpful to know that the complete cessation of suffering is actually attainable by a kind of existence proof, I do not mind speaking simply about what I know in my own experience(s).

Comment by jbkjr on Suffering Is Not Pain · 2024-06-23T18:19:50.082Z · LW · GW

I think you're right about all the claims of fact. The Buddha won't suffer when he feels pain. But unenlightened beings, which is all the rest of us, particularly animals, will.

But the example of the Buddha goes to show that humans have the capacity to not suffer even in painful circumstances, even if right now they do. It’s not like “unenlightenment” is something you’re forever resigned to.

So taking pain as a proxy for suffering is pretty reasonable for thinking about how to reduce suffering

I agree that in most cases where someone suffers in the presence of extreme pain, they’re likely to suffer noticeably less if that pain is alleviated, but I don’t think this means “the best way to alleviate suffering is to reduce pain as a proxy for it,” since what’s actually causing the suffering is not the pain but the aversion to it.

Comment by jbkjr on Suffering Is Not Pain · 2024-06-23T18:10:39.798Z · LW · GW

The message of Buddhism isn’t “in order to not suffer, don’t want anything”; not craving/being averse doesn’t mean not having any intentions or preferences. Sure, if you crave the satisfaction of your preferences, or if you’re averse to their frustration, you will suffer, but intentions and preferences remain when craving/aversion/clinging is gone. It’s like a difference between “I’m not ok unless this preference is satisfied” and “I’d still like this preference to be satisfied, but I’ll ultimately be ok either way.”

Comment by jbkjr on Suffering Is Not Pain · 2024-06-23T18:00:04.619Z · LW · GW

I wouldn’t say suffering is merely preference frustration—if you’re not attached to your preferences and their satisfaction, then you won’t suffer if they’re frustrated. Not craving/being averse doesn’t mean you don’t have preferences, though—see this reply I made to another comment on this post for more discussion of this.

I don’t know if I would say depression isn’t painful, at least in the emotional sense of pain. In either case, it’s certainly unpleasant, and if you want to use “pain” to refer to unpleasantness associated with tissue damage and “displeasure” to refer to a larger category of sensations which includes both pain and other “unpleasant-but-not-‘painful’” experiences such as depression, I don’t really have an objection—my point is still that suffering is distinct from ordinary displeasure.

Comment by jbkjr on Suffering Is Not Pain · 2024-06-22T20:39:16.036Z · LW · GW

My sense is that existing mindfulness studies don't show the sort of impressive results that we'd expect if this were a great solution.

If you have any specific studies in mind which show this, I would be interested to see! I have a sense that mindfulness tends to be studied in the context of “increasing well-being” in a general sense and not specifically to “decrease or eliminate suffering.” I would be quite interested in a study which studies meditation’s effects when directly targeting suffering.

Also, I think people who would benefit most from having less day-to-day suffering often struggle with having no "free room" available for meditation practice, and that seems like an issue that's hard to overcome even if meditation practice would indeed help them a lot.

I really appreciate you raising this point in detail; I think it’s something I haven’t included enough in my own considerations. Having enough free time and energy for meditation practice is indeed a kind of privilege.

I’m going to chew on this some more, but one initial thought I’ve had is that the general quality of life needed as a prerequisite to devoting enough time and energy to meditation practice may be lower than one might expect, at least by Western standards. For example, in a place like India, there seem to be a good number of people in difficult circumstances who nonetheless make time for meditation and spiritual pursuits. However, I agree that in the limit, if all of your waking moments are focused on simply acquiring enough food for the day, it seems much less reasonable to prescribe meditation as the solution to your suffering.

Comment by jbkjr on Suffering Is Not Pain · 2024-06-22T20:17:50.388Z · LW · GW

I want to address a common misconception that I see you having here when you write phrases like:

not many people… are going to remain indifferent about it

“… I can choose to react on them or to ignore them, and I am choosing to ignore this one”

when people feel pain, and a desire to avoid that pain arises…

a person who really has no preference whatsoever

to the level that they actually become indifferent to pain

Importantly, “not being averse to pain,” in the intended sense of the word aversion, does not mean that one is “indifferent to pain,” in the sense of (not) having intentions and preferences. When I speak of “craving” or “aversion,” I am referring to a very specific kind of mental action and experience which results in suffering, not to intentions and preferences. Craving/aversion is the kind of desire which finds the way things presently are fundamentally unacceptable. Craving is like an attempt to grab at sensations and experiences, and aversion is like an attempt to hold them at arm’s length.

So, it isn’t the case that someone who has let go of craving/aversion and therefore suffering is completely indifferent to pain or that they won’t take action to alleviate the pain. If you stab a non-sufferer, they’ll still e.g. go to a hospital and have the wound treated. They’ll still have a general preference for pleasure over pain and take actions in accordance with those preferences; they just won’t seek pleasure or avoid pain in the sense of “I am not ok without this pleasure, or with this pain”—that’s craving and aversion. Pain still serves its important functional role, without the extra mental (re)actions of aversion and suffering.

This is also why I prefer to use terms like craving, aversion, and clinging to terms like “desire.” Sometimes you hear the Buddhist teaching formulated as “the cause of suffering is desire” and “stop desiring in order to stop suffering,” but I think the layperson is likely to misinterpret this as “so I can’t have any intentions or preferences,” due to the common usage of the word “desire” as “intending” or “preferring.” For example, see this LW comment thread which discusses the use of the word “desire” in this context, or Daniel Ingram’s discussion (and eschewal) of “no-preference models” of awakening.

Similarly, one could mean by “aversion” something like “dispreference” and describe someone who simply intends to alleviate pain as being averse to that pain, but this is not “aversion” as used in the Buddhist context to refer to the cause of suffering, tanha. While “aversion” may still be misinterpreted in such a manner, I think it’s less likely to be misunderstood than terms like “desire.” Same for “craving”—I think people are generally already familiar with “craving” in experience in the sense of “I must have that which I crave, I am not ok without having that which I crave.”

Comment by jbkjr on Suffering Is Not Pain · 2024-06-22T19:08:07.810Z · LW · GW

It seems a bit misguided to me to argue “well, even in the absence of suffering, one might experience such severe pain that one might prefer non-existence to continuing to endure that pain, so this ‘not suffering’ can’t be all it’s cracked up to be”—would you rather experience suffering on top of that pain? With or without pain, not suffering is preferable to suffering.

For example, with end-of-life patients, circumstances being so unpleasant doesn’t mean that they may as well suffer, too; nor does “being an end-of-life patient” being a possible experience among the space of all possible non-suffering experiences make not suffering any less valuable.

Acknowledging that not suffering is preferable to suffering, even in the presence of pain, doesn’t trivialize the reality of pain, which still feels bad!

Comment by jbkjr on Suffering Is Not Pain · 2024-06-18T20:52:47.224Z · LW · GW

Which of the following claims would you disagree with?

  1. Craving/aversion causes suffering; there is suffering if and only if there is craving/aversion.
  2. There are practices by which one can untrain the habit of craving/aversion with careful attention and intention.
  3. In the limit, these practices can result in totally dropping the habit of craving/aversion, regardless of circumstance.
  4. The Buddha practiced in such a manner as to totally stop craving/aversion, regardless of circumstance.
  5. Therefore, the Buddha would not be averse to even the most extreme pain and therefore not suffer even in the most painful circumstances possible.

Comment by jbkjr on Suffering Is Not Pain · 2024-06-18T20:28:59.122Z · LW · GW

Interestingly, I had a debate with someone on an earlier draft of this post about whether or not pain could be considered a cause of suffering, which led me to brush up on some of the literature on causation.

What seems clear to me is that suffering causally depends on craving/aversion, but not on pain—there is suffering if and only if there is craving/aversion, but there can be suffering without pain, and there can be pain without suffering.

On Lewis' account, causal dependence implies causation but not vice versa, so this does not itself mean that pain cannot be considered a cause of suffering. However, given that suffering occurs if and only if craving/aversion occurs, I'm pretty comfortable pointing to craving/aversion as "the" cause of suffering.

I'm actually fairly sympathetic to a view which holds pain as an indirect cause of suffering in that it seems pretty natural to say that e.g. aversion arose in response to pain, at least in specific instances where there is pain, aversion (to the pain), and then suffering. However, I'm not sure I would make the claim that "pain causes aversion" in general, as it is quite possible for pain to occur without aversion then occurring.

I'm curious as to what part of my argument you perceived as invalidly inferring from extreme to average situations?

Comment by jbkjr on Suffering Is Not Pain · 2024-06-18T20:10:28.707Z · LW · GW

Re: 2, I disagree—there will be suffering if there is craving/aversion, even in the absence of pain. Craving pleasure results in suffering just as much as aversion to pain does.

Re: 4, While I agree that animals likely "live more in the moment" and have less capacity to make up stories about themselves, I do not think that this precludes them from having the basic mental reaction of craving/aversion and therefore suffering. I think the "stories" you're talking about have much more to do with ego/psyche than the "self" targeted in Buddhism—I think of ego/psyche as "the story/stories a mind tells itself about itself," whereas "self" is more about modeling some sensations as "me or mine" and other sensations as "not me or mine." I think non-human animals do not tell themselves stories about themselves to the same extent humans do, but do think they're quite capable of making the self/other distinction in the relevant sense. I think it's quite possible for craving/aversion to occur without having concocted such a story.

Comment by jbkjr on Suffering Is Not Pain · 2024-06-18T19:53:37.527Z · LW · GW

You're welcome, and thanks for the support! :)

Re: MAPLE, I might have an interest in visiting—I became acquainted with MAPLE because I think Alex Flint spent some time there? Does one need to be actively working on an AI safety project to visit? I am not currently doing so, having stepped away from AI safety work to focus on directly addressing suffering.

Comment by jbkjr on Suffering Is Not Pain · 2024-06-18T19:47:36.241Z · LW · GW

Lukas, thanks for taking the time to read and reply! I appreciate you reminding me of your article on Tranquilism—it's been a couple of years since I read it (during my fellowship with CLR), and I hadn't made a mental note of it making such a distinction when I did, so thanks for the reminder.

While I agree that it's an open question as to how effective meditation is for alleviating suffering at scale (e.g. how easy it is for how many humans to reduce their suffering by how much with how much time/effort), I don't think it would require as much of a commitment as you seem to expect in the median case. Personally, I think it's likely that the median person would be able to make substantial progress in reducing suffering as a layperson, i.e. without becoming a monastic. Even if attaining a total and permanent cessation of suffering is substantially more difficult, or attainable by substantially fewer people, in one lifetime, I don't think it's unreasonable to think that most people could suffer at least 50 percent less with dedicated mindfulness practice. I'm curious as to what might feed an opposing intuition for you! I'd be quite excited about empirical research that investigates the tractability and scalability of meditation for reducing suffering, in either case.

(By the way, would it be alright if I ping you privately to set up a meeting? I've been a fan of your writing since becoming familiar with you during my time at CLR and would love a chance to pick your brain about SFE stuff and hear about what you've been up to lately!)

Comment by jbkjr on Question about Lewis' counterfactual theory of causation · 2024-06-10T18:24:55.218Z · LW · GW

If I understand the example and the commentary from SEP correctly, doesn't this example illustrate a problem with Lewis' definition of causation? I agree that commonsense dictates that Alice throwing the rock caused the window to smash, but I think the problem is that you cannot construct a sequence of stepwise dependences from cause to effect:

Lewis’s theory cannot explain the judgement that Suzy’s throw caused the shattering of the bottle. For there is no causal dependence between Suzy’s throw and the shattering, since even if Suzy had not thrown her rock, the bottle would have shattered due to Billy’s throw. Nor is there a chain of stepwise dependences running cause to effect, because there is no event intermediate between Suzy’s throw and the shattering that links them up into a chain of dependences. Take, for instance, Suzy’s rock in mid-trajectory. This event depends on Suzy’s initial throw, but the problem is that the shattering of the bottle does not depend on it, because even without it the bottle would still have shattered because of Billy’s throw.

Is the example of the two hitmen given in the SEP article (where B does not fire if A does) an instance of causation without causal dependence?
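For what it's worth, here's a minimal sketch of how I'd model the hitmen structure (my own simplification, assuming B's decision depends only on whether A fires):

```python
# Early-preemption structure: backup hitman B fires only if A does not.
# "Causal dependence" here means: had the putative cause not occurred,
# the effect would not have occurred either.

def victim_dies(a_fires: bool) -> bool:
    b_fires = not a_fires      # B is a backup: he fires only if A holds fire
    return a_fires or b_fires  # the victim dies if either hitman fires

actual = victim_dies(a_fires=True)           # A fires; victim dies
counterfactual = victim_dies(a_fires=False)  # A doesn't fire; B does; victim still dies

# The death does not counterfactually depend on A's shot (it happens either way),
# even though A's shot is intuitively what caused it.
print(actual, counterfactual)  # True True
```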

Comment by jbkjr on Question about Lewis' counterfactual theory of causation · 2024-06-10T15:18:41.628Z · LW · GW

Thanks so much for this—it was just the answer I was looking for!

I was able to follow the logic you presented, and in particular, I understand that the first two counterfactual dependences hold but the third does not in the example given.

So, I was correct in my original example of c->d->e that

(1) if c were to not happen, d would not happen

(2) if d were to not happen, e would not happen

BUT it was incorrect to then derive that if c were to not happen, e would not happen? Have I understood you correctly?

I'm still a bit fuzzy on the informal counterexample you presented, possibly because of the introduction of probability. For example, I don't understand how not winning big entails not winning something because winning nothing is more likely than winning small. If you did not win big, you might still have won small (even if it's unlikely); I don't understand how the likelihood comes into account.

I wish I had an intuitive example of a c and e where c is a cause of e without e causally depending on c, but I'm struggling to imagine one.

Comment by jbkjr on What alignment-related concepts should be better known in the broader ML community? · 2021-12-09T20:02:30.218Z · LW · GW

This is kind of vague, but I have this sense that almost everybody doing RL and related research takes the notion of "agent" for granted, as if it's some metaphysical primitive*, as opposed to being a (very) leaky abstraction that exists in the world models of humans. But I don't think the average alignment researcher has much better intuitions about agency, either, to be honest, even though some spend time thinking about things like embedded agency. It's hard to think meaningfully about the illusoriness of the Cartesian boundary when you still live 99% of your life and think 99% of your thoughts as if you were a Cartesian agent, fully "in control" of your choices, thoughts, and actions.

(*Not that "agent" couldn't, in fact, be a metaphysical primitive, just that such "agents" are hardly "agents" in the way most people consider humans to "be agents" [and, equally importantly, other things, like thermostats and quarks, to "not be agents"].)

Comment by jbkjr on Integrating Three Models of (Human) Cognition · 2021-12-06T18:19:46.869Z · LW · GW

To illustrate my reservations: soon after I read the sentence about GNW meaning you can only be conscious of one thing at a time, as I was considering that proposition, I felt my chin was a little itchy and so I scratched it. So now I can remember thinking about the proposition while simultaneously scratching my chin. Trying to recall exactly what I was thinking at the time now also brings up a feeling of a specific body posture.

To me, "thinking about the proposition while simultaneously scratching my chin" sounds like a separate "thing" (complex representation formed in the GNW) than either "think about proposition" or "scratch my chin"... and you experienced this thing after the other ones, right? Like, from the way you described it, it sounds to me like there was actually 1) the proposition 2) the itch 3) the output of a 'summarizer' that effectively says "just now, I was considering this proposition and scratching my chin". [I guess, in this sense, I would say you are ordinarily doing some "weird self-deceptive dance" that prevents you from noticing this, because most people seem to ordinarily experience "themselves" as the locus of/basis for experience, instead of there being a stream of moments of consciousness, some of which apparently refer to an 'I'.]

Also, I have this sense that you're chunking your experience into "things" based on what your metacognitive summarizer-of-mental-activity is outputting back to the workspace, but there are at least 10 representations streaming through the workspace each second, and many of these will be far more primitive than any of the "things" we've already mentioned here (or than would ordinarily be noticed by the summarizer without specific training for it, e.g. in meditation). Like, in your example, there were visual sensations from the reading, mental analyses about its content, the original raw sensation of the itch, the labeling of it as "itchy," the intention to scratch the itch, (definitely lots more...), and, eventually, the thought "I remember thinking about this proposition and scratching my chin 'at the same time'."

Comment by jbkjr on Redwood Research’s current project · 2021-09-23T10:51:28.299Z · LW · GW

I think it's really cool you're posting updates as you go and writing about uncertainties! I also like the fiction continuation as a good first task for experimenting with these things.

My life is a relentless sequence of exercises in importance sampling and counterfactual analysis

This made me laugh out loud :P

Comment by jbkjr on Grokking the Intentional Stance · 2021-09-16T14:45:14.312Z · LW · GW

If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?

Overall, I generally agree with the intentional stance as an explanation of the human concept of agency, but I do not think it can be used as a foundation for AI risk arguments. For that, you need something else, such as mechanistic implementation details, empirical trend extrapolations, analyses of the inductive biases of AI systems, etc.

The requirement for its behavior being "reliably predictable" by the intentional strategy doesn't necessarily limit us to postdiction in already-observed situations; we could require our intentional stance model of the system's behavior to generalize OOD. Obviously, to build such a model that generalizes well, you'll want it to mirror the actual causal dynamics producing the agent's behavior as closely as possible, so you need to make further assumptions about the agent's cognitive architecture, inductive biases, etc. that you hope will hold true in that specific context (e.g. human minds or prosaic AIs). However, these are additional assumptions needed to answer question of why an intentional stance model will generalize OOD, not replacing the intentional stance as the foundation of our concept of agency, because, as you say, it explains the human concept of agency, and we're worried that AI systems will fail catastrophically in ways that look agentic and goal-directed... to us.

You are correct that having only the intentional stance is insufficient to make the case for AI risk from "goal-directed" prosaic systems, but having it as the foundation of what we mean by "agent" clarifies what more is needed to make the sufficient case—what about the mechanics of prosaic systems will allow us to build intentional stance models of their behavior that generalize well OOD?

Comment by jbkjr on Goal-Directedness and Behavior, Redux · 2021-08-11T09:22:09.798Z · LW · GW

What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety.

  • (What I'm not saying) We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
  • That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavior and the circumstances.

I agree pretty strongly with all of this, fwiw. I think Dennett/the intentional stance really gets at the core of what it means for a system to "be an agent"; essentially, a system is one to the extent it makes sense to model it as such, i.e. as having beliefs and preferences, and acting on those beliefs to achieve those preferences, etc. The very reason why we usually consider our selves and other humans to be "agents" is exactly because that's the model over sensory data that the mind finds most reasonable to use, most of the time. In doing so, we actually are ascribing cognition to these systems, and in practice, of course we'll need to understand how such behavior will actually be implemented in our AIs. (And thinking about how "goal-directed behavior" is implemented in humans/biological neural nets seems like a good place to mine for useful insights and analogies for this purpose.)

Comment by jbkjr on How many parameters do self-driving-car neural nets have? · 2021-08-06T11:38:04.147Z · LW · GW

I think they do some sort of distillation-type thing where they train massive models to label data or act as “overseers” for the much smaller models that are actually deployed in cars (as inference has to be fast enough to make decisions in real time)… so I wouldn’t actually expect the models in the actual cars to be that big. More details about this can be found in Karpathy’s recent CVPR talk, iirc, though I don’t recall it covering parameter count/model size.

Comment by jbkjr on Re-Define Intent Alignment? · 2021-08-05T14:18:05.068Z · LW · GW

To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

Probably something like the last one, although I think "even in principle" is probably doing something suspicious in that statement. Like, sure, "in principle," you can pretty much construct any demarcation you could possibly imagine, including the Cartesian one, but what I'm trying to say is something like, "all demarcations, by their very nature, exist only in the map, not the territory." Carving reality is an operation that could only make sense within the context of a map, as reality simply is. Your concept of "agent" is defined in terms of other representations that similarly exist only within your world-model; other humans have a similar concept of "agent" because they have a similar representation built from correspondingly similar parts. If an AI is to understand the human notion of "agency," it will need to also understand plenty of other "things" which are also only abstractions or latent variables within our world models, as well as what those variables "point to" (at least, what variables in the AI's own world model they 'point to,' as by now I hope you're seeing the problem with trying to talk about "things they point to" in external/'objective' reality!).

Comment by jbkjr on Re-Define Intent Alignment? · 2021-08-05T10:49:46.067Z · LW · GW

(Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")

All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.

I totally agree with this; I guess I'm just (very) wary about being able to "clearly demarcate" whichever bit we're asking about and therefore fairly pessimistic we can "meaningfully" ask the question to begin with? Like, if you start asking yourself questions like "what am 'I' optimizing for?," and then try to figure out exactly what the demarcation is between "you" and "everything else" in order to answer that question, you're gonna have a real tough time finding anything close to a satisfactory answer.

Comment by jbkjr on Re-Define Intent Alignment? · 2021-08-04T16:45:52.446Z · LW · GW

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

I'm not saying you can't reason under the assumption of a Cartesian boundary, I'm saying the results you obtain when doing so are of questionable relevance to reality, because "agents" and "environments" can only exist in a map, not the territory. The idea of trying to e.g. separate "your atoms" or whatever from those of "your environment," so that you can drop them into those of "another environment," is only a useful fiction, as in reality they're entangled with everything else. I'm not aware of formal proof of this point that I'm trying to make; it's just a pretty strongly held intuition. Isn't this also kind of one of the key motivations for thinking about embedded agency?

Comment by jbkjr on An Orthodox Case Against Utility Functions · 2021-08-03T15:23:00.045Z · LW · GW

I definitely see it as a shift in that direction, although I'm not ready to really bite the bullets -- I'm still feeling out what I personally see as the implications. Like, I want a realist-but-anti-realist view ;p

You might find Joscha Bach's view interesting...

Comment by jbkjr on Refactoring Alignment (attempt #2) · 2021-08-03T09:38:38.565Z · LW · GW

I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.

This sounds reasonable and similar to the kinds of ideas for understanding agents' goals as cognitively implemented that I've been exploring recently.

However, I think possibly you want a very behavioral definition of mesa-objective. If that's true, I wonder if you should just identify with the generalization-focused path instead. After all, one of the main differences between the two paths is that the generalization-focused path uses behavioral definitions, while the objective-focused path assumes some kind of explicit representation of goal content within a system.

The funny thing is I am actually very unsatisfied with a purely behavioral notion of a model's objective, since a deceptive model would obviously externally appear to be a non-deceptive model in training. I just don't think there will be one part of the network we can point to and clearly interpret as being some objective function that the rest of the system's activity is optimizing. Even though I am partial to the generalization focused approach (in part because it kind of widens the goal posts with the "acceptability" vs. "give the model exactly the correct goal" thing), I still would like to have a more cognitive understanding of a system's "goals" because that seems like one of the best ways to make good predictions about how the system will generalize under distributional shift. I'm not against assuming some kind of explicit representation of goal content within a system (for sufficiently powerful systems); I'm just against assuming that that content will look like a mesa-objective as originally defined.

Comment by jbkjr on Re-Define Intent Alignment? · 2021-08-03T09:28:42.922Z · LW · GW

I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems.

For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function.

Is this related to your post An Orthodox Case Against Utility Functions? It's been on my to-read list for a while; I'll be sure to give it a look now.

Comment by jbkjr on Did they or didn't they learn tool use? · 2021-07-29T14:23:59.488Z · LW · GW

One idea as to the source of the potential discrepancy... did any of the task prompts for the tasks in which it did figure out how to use tools tell it explicitly to "use the objects to reach a higher floor," or something similar? I'm wondering if the cases where it did use tools are examples where doing so was instrumentally useful to achieving a prompted objective that didn't explicitly require tool use.

Comment by jbkjr on Refactoring Alignment (attempt #2) · 2021-07-28T18:36:28.770Z · LW · GW

I'm not too keen on (2) since I don't expect mesa objectives to exist in the relevant sense.

Same, but how optimistic are you that we could figure out how to shape the motivations or internal "goals" (much more loosely defined than "mesa-objective") of our models via influencing the training objective/reward, the inductive biases of the model, the environments they're trained in, some combination of these things, etc.?

These aren't "clean", in the sense that you don't get a nice formal guarantee at the end that your AI system is going to (try to) do what you want in all situations, but I think getting an actual literal guarantee is pretty doomed anyway (among other things, it seems hard to get a definition for "all situations" that avoids the no-free-lunch theorem, though I suppose you could get a probabilistic definition based on the simplicity prior).

Yup, if you want "clean," I agree that you'll have to either assume a distribution over possible inputs, or identify a perturbation set over possible test environments to avoid NFL.

Comment by jbkjr on Refactoring Alignment (attempt #2) · 2021-07-28T18:31:49.620Z · LW · GW

Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don't want to get into exactly what "alignment" means.)

This path apparently implies building goal-oriented systems; all of the subgoals require that there actually is a mesa-objective.

I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless we start to be more inclusive about what we mean by "mesa-optimizer" and "mesa-objective." I don't think those terms as defined in RFLO actually capture humans, but I definitely want to say that we're "goal-oriented" and have "intent."

But the graph structure makes perfect sense, I just am doing the mental substitution of "intent alignment means 'what the model is actually trying to do' is aligned with 'what we want it to do'." (Similar for inner robustness.)

However, I'm not confident that the details of Evan's locutions are quite right. For example, should alignment be tested only in terms of the very best policy?

I also don't think optimality is a useful condition in alignment definitions. (Also, a similarly weird move is pulled with "objective robustness," which is defined in terms of the optimal policy for a model's behavioral objective... so you'd have to get the behavioral objective, which is specific to your actual policy, and find the actually optimal policy for that objective, to determine objective robustness?)

I find myself thinking that objective robustness is actually what I mean by the inner alignment problem. Abergal voiced similar thoughts. But this makes it seem unfortunate that "inner alignment" refers specifically to the thing where there are mesa-optimizers. I'm not sure what to do about this.

Yeah, I think I'd also wish we could collectively agree to redefine inner alignment to be more like objective robustness (or at least be more inclusive of the kinds of inner goals humans have). But I've been careful not to use the term to refer to anything except mesa-optimizers, partially in order to be consistent with Evan's terminology, but primarily not to promote unnecessary confusion with those who strongly associate "inner alignment" with mesa-optimization (although they could also be using a much looser conception of mesa-optimization, if they consider humans to be mesa-optimizers, in which case "inner alignment" pretty much points at the thing I'd want it to point at).

Comment by jbkjr on Re-Define Intent Alignment? · 2021-07-28T17:40:44.174Z · LW · GW

The behavioral objective, meanwhile, would be more like the thing the agent appears to be pursuing under some subset of possible distributional shifts. This is the more realistic case where we can't afford to expose our agent to every possible environment (or data distribution) that could possibly exist, so we make do and expose it to only a subset of them. Then we look at what objectives could be consistent with the agent's behavior under that subset of environments, and those count as valid behavioral objectives.

The key here is that the set of allowed mesa-objectives is a reliable invariant of the agent, while the set of allowed behavioral objectives is contingent on our observations of the agent's behavior under a limited set of environments. In principle, the two sets of objectives won't converge perfectly until we've run our agent in every possible environment that could exist.

This is the right sort of idea; in the OOD robustness literature you try to optimize worst-case performance over a perturbation set of possible environments. The problem I have with what I understand you to be saying is with the assumption that there is any possible reliable invariant of the agent over every possible environment that could be a mesa-objective, which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment, so why shouldn't we talk about the mesa-objective being over a perturbation set, too, just that it has to be some function of the model's internal features?

Comment by jbkjr on Re-Define Intent Alignment? · 2021-07-28T17:33:43.306Z · LW · GW

However, we could instead define "intent alignment" as "the optimal policy of the mesa objective would be good for humans".

I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don't have a good reason to believe this.)

I want to be able to talk about how we can shape goals which may be messier, perhaps somewhat competing, internal representations or heuristics or proxies that determine behavior. If we actually want to understand "intent," we have to understand what the heck intentions and goals actually are in humans and what they might look like in advanced ML systems. However, I do think this is a very good point you raise about intent alignment (that it should correspond to the model's internal goals, objectives, intentions, etc.), and the need to be mindful of which version we're using in a given context.

Also, I'm iffy on including the "all inputs"/optimality thing (I believe Rohin is, too)... it does have the nice property that it lets you reason without considering e.g. training setup, dataset, architecture, but we won't actually have infinite data and optimal models in practice. So, I think it's pretty important to model how different environments or datasets interact with the reward/objective function in producing the intentions and goals of our models.

Evan highlights the assumption that solving inner alignment will solve behavioral alignment: he thinks that the most important cases of catastrophic bad behavior are intentional (ie, come from misaligned objectives, either outer objective or inner objective).

I don't think this is necessarily a crux between the generalization- and objective-driven approaches—if intentional behavior requires a mesa-objective, then humans can't act "intentionally." So we obviously want a notion of intent that applies to the messier middle cases of goal representation (between a literal mesa-objective and a purely implicit behavioral objective).

Comment by jbkjr on Refactoring Alignment (attempt #2) · 2021-07-26T23:21:10.323Z · LW · GW

So, for example, this claims that either intent alignment + objective robustness or outer alignment + robustness would be sufficient for impact alignment.

Shouldn’t this be “intent alignment + capability robustness or outer alignment + robustness”?

Btw, I plan to post more detailed comments in response here and to your other post, just wanted to note this so hopefully there’s no confusion in interpreting your diagram.

Comment by jbkjr on Looking Deeper at Deconfusion · 2021-07-21T11:40:41.890Z · LW · GW

Great post. My one piece of feedback is that not calling the post "Deconfusing 'Deconfusion'" might've been a missed opportunity. :)

I even went to this cooking class once where the chef proposed his own deconfusion of the transformations of food induced by different cooking techniques -- I still use it years later.

Unrelatedly, I would be interested in details on this.

Comment by jbkjr on Why Subagents? · 2021-07-17T18:40:30.901Z · LW · GW

The way I'd think of it, it's not that you literally need unanimous agreement, but that in some situations there may be subagents that are strong enough to block a given decision.

Ah, I think that makes sense. Is this somehow related to the idea that the consciousness is more of a "last stop for a veto from the collective mind system" for already-subconsciously-initiated thoughts and actions? Struggling to remember where I read this, though.

It gets a little handwavy and metaphorical but so does the concept of a subagent.

Yeah, considering the fact that subagents are only "agents" insofar as it makes sense to apply the intentional stance (the thing we'd like to avoid having to apply to the whole system because it seems fundamentally limited) to the individual parts, I'm not surprised. It seems like it's either "agents all the way down" or abandon the concept of agency altogether (although posing that dichotomy feels like a suspicious presumption of agency, itself!).

Comment by jbkjr on Why Subagents? · 2021-07-17T11:54:12.821Z · LW · GW

Wouldn't decisions about e.g. which objects get selected and broadcast to the global workspace be made by a majority or plurality of subagents? "Committee requiring unanimous agreement" feels more like what would be the case in practice for a unified mind, to use a TMI term. I guess the unanimous agreement is only required because we're looking for strict/formal coherence in the overall system, whereas e.g. suboptimally-unified/coherent humans with lots of akrasia can have tug-of-wars between groups of subagents for control.

Comment by jbkjr on Why Subagents? · 2021-07-17T09:55:58.936Z · LW · GW

The arrows show preference: our agent prefers A to B if (and only if) there is a directed path from A to B along the arrows.

Shouldn't this be "iff there is a directed path from B to A"? E.g. the agent prefers pepperoni to cheese, so there is a directed arrow from cheese to pepperoni.
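To illustrate the convention I mean (a throwaway sketch; the third topping and the extra edge are made up, not from the post):

```python
# An arrow from X to Y means "Y is preferred to X", so prefers(A, B) should
# check for a directed path from B to A along the arrows.
edges = {
    "cheese": ["pepperoni"],    # pepperoni preferred to cheese
    "pepperoni": ["mushroom"],  # mushroom preferred to pepperoni (made up)
    "mushroom": [],
}

def prefers(a: str, b: str) -> bool:
    """True iff there is a (nonempty) directed path from b to a along the arrows."""
    stack, seen = list(edges.get(b, [])), set()
    while stack:
        node = stack.pop()
        if node == a:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return False

print(prefers("pepperoni", "cheese"))  # True: path cheese -> pepperoni
print(prefers("cheese", "pepperoni"))  # False: no path pepperoni -> cheese
```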

Comment by jbkjr on Taboo "Outside View" · 2021-06-18T22:50:22.844Z · LW · GW

Great post. That Anakin meme is gold.

“Whenever you notice yourself saying ‘outside view’ or ‘inside view,’ imagine a tiny Daniel Kokotajlo hopping up and down on your shoulder chirping ‘Taboo outside view.’”

Somehow I know this will now happen automatically whenever I hear or read “outside view.” 😂

Comment by jbkjr on The Hard Work of Translation (Buddhism) · 2021-05-24T22:11:16.796Z · LW · GW

The Buddha taught one specific concentration technique and a simple series of insight techniques

Any pointers on where I can find information about the specific techniques as originally taught by the Buddha?

Comment by jbkjr on A non-mystical explanation of "no-self" (three characteristics series) · 2021-05-21T23:37:05.873Z · LW · GW

I've found this interview with Richard Lang about the "headless" method of interrogation helpful and think Sam's discussion provides useful context to bridge the gap to the scientific skeptics as well as to other meditative techniques and traditions (some of which are touched upon in this post). It also includes a pointing out exercise.

Comment by jbkjr on Deducing Impact · 2021-04-29T18:08:54.354Z · LW · GW

Late to the party, but here's my crack at it (ROT13'd since markdown spoilers made it an empty box without my text):

Fbzrguvat srryf yvxr n ovt qrny vs V cerqvpg gung vg unf n (ovt) vzcnpg ba gur cbffvovyvgl bs zl tbnyf/inyhrf/bowrpgvirf orvat ernyvmrq. Nffhzvat sbe n zbzrag gung gur tbnyf/inyhrf ner jryy-pncgherq ol n hgvyvgl shapgvba, vzcnpg jbhyq or fbzrguvat yvxr rkcrpgrq hgvyvgl nsgre gur vzcnpgshy rirag - rkcrpgrq hgvyvgl orsber gur rirag. Boivbhfyl, nf lbh'ir cbvagrq bhg, fbzrguvat orvat vzcnpgshy nppbeqvat gb guvf abgvba qrcraqf obgu ba gur inyhrf naq ba ubj "bowrpgviryl vzcnpgshy" vg vf (v.r. ubj qenfgvpnyyl vg punatrf gur frg bs cbffvoyr shgherf).

Comment by jbkjr on Old post/writing on optimization daemons? · 2021-04-17T18:53:56.793Z · LW · GW

Ah, it was John's post I was thinking of; thanks! (Apologies to John for not remembering it was his writing, although I suppose mistaking someone's visual imagery on a technical topic for Eliezer's might be considered an accidental compliment :).)