Grokking the Intentional Stance 2021-08-31T15:49:36.699Z
Discussion: Objective Robustness and Inner Alignment Terminology 2021-06-23T23:25:36.687Z
Empirical Observations of Objective Robustness Failures 2021-06-23T23:23:28.077Z
Old post/writing on optimization daemons? 2021-04-15T18:00:17.923Z
Mapping the Conceptual Territory in AI Existential Safety and Alignment 2021-02-12T07:55:54.438Z


Comment by jbkjr on Grokking the Intentional Stance · 2021-09-16T14:45:14.312Z · LW · GW

If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?

Overall, I generally agree with the intentional stance as an explanation of the human concept of agency, but I do not think it can be used as a foundation for AI risk arguments. For that, you need something else, such as mechanistic implementation details, empirical trend extrapolations, analyses of the inductive biases of AI systems, etc.

The requirement for its behavior being "reliably predictable" by the intentional strategy doesn't necessarily limit us to postdiction in already-observed situations; we could require our intentional stance model of the system's behavior to generalize OOD. Obviously, to build such a model that generalizes well, you'll want it to mirror the actual causal dynamics producing the agent's behavior as closely as possible, so you need to make further assumptions about the agent's cognitive architecture, inductive biases, etc. that you hope will hold true in that specific context (e.g. human minds or prosaic AIs). However, these are additional assumptions needed to answer question of why an intentional stance model will generalize OOD, not replacing the intentional stance as the foundation of our concept of agency, because, as you say, it explains the human concept of agency, and we're worried that AI systems will fail catastrophically in ways that look agentic and goal-directed... to us.

You are correct that having only the intentional stance is insufficient to make the case for AI risk from "goal-directed" prosaic systems, but having it as the foundation of what we mean by "agent" clarifies what more is needed to make the sufficient case—what about the mechanics of prosaic systems will allow us to build intentional stance models of their behavior that generalize well OOD?

Comment by jbkjr on Goal-Directedness and Behavior, Redux · 2021-08-11T09:22:09.798Z · LW · GW

What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety.

  • (What I'm not saying) We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
  • That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavior and the circumstances.

I agree pretty strongly with all of this, fwiw. I think Dennett/the intentional stance really gets at the core of what it means for a system to "be an agent"; essentially, a system is one to the extent it makes sense to model it as such, i.e. as having beliefs and preferences, and acting on those beliefs to achieve those preferences, etc. The very reason why we usually consider our selves and other humans to be "agents" is exactly because that's the model over sensory data that the mind finds most reasonable to use, most of the time. In doing so, we actually are ascribing cognition to these systems, and in practice, of course we'll need to understand how such behavior will actually be implemented in our AIs. (And thinking about how "goal-directed behavior" is implemented in humans/biological neural nets seems like a good place to mine for useful insights and analogies for this purpose.)

Comment by jbkjr on How many parameters do self-driving-car neural nets have? · 2021-08-06T11:38:04.147Z · LW · GW

I think they do some sort of distillation type thing where they train massive models to label data or act as “overseers” for the much smaller models that actually are deployed in cars (as inference time has to be much better to make decisions in real time)… so I wouldn’t actually expect them to be that big in the actual cars. More details about this can be found in Karpathy’s recent CLVR talk, iirc, but not about parameter count/model size?

Comment by jbkjr on Re-Define Intent Alignment? · 2021-08-05T14:18:05.068Z · LW · GW

To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

Probably something like the last one, although I think "even in principle" is doing some probably doing something suspicious in that statement. Like, sure, "in principle," you can pretty much construct any demarcation you could possibly imagine, including the Cartesian one, but what I'm trying to say is something like, "all demarcations, by their very nature, exist only in the map, not the territory." Carving reality is an operation that could only make sense within the context of a map, as reality simply is. Your concept of "agent" is defined in terms of other representations that similarly exist only within your world-model; other humans have a similar concept of "agent" because they have a similar representation built from correspondingly similar parts. If an AI is to understand the human notion of "agency," it will need to also understand plenty of other "things" which are also only abstractions or latent variables within our world models, as well as what those variables "point to" (at least, what variables in the AI's own world model they 'point to,' as by now I hope you're seeing the problem with trying to talk about "things they point to" in external/'objective' reality!).

Comment by jbkjr on Re-Define Intent Alignment? · 2021-08-05T10:49:46.067Z · LW · GW

(Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")

All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.

I totally agree with this; I guess I'm just (very) wary about being able to "clearly demarcate" whichever bit we're asking about and therefore fairly pessimistic we can "meaningfully" ask the question to begin with? Like, if you start asking yourself questions like "what am 'I' optimizing for?," and then try to figure out exactly what the demarcation is between "you" and "everything else" in order to answer that question, you're gonna have a real tough time finding anything close to a satisfactory answer.

Comment by jbkjr on Re-Define Intent Alignment? · 2021-08-04T16:45:52.446Z · LW · GW

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

I'm not saying you can't reason under the assumption of a Cartesian boundary, I'm saying the results you obtain when doing so are of questionable relevance to reality, because "agents" and "environments" can only exist in a map, not the territory. The idea of trying to e.g. separate "your atoms" or whatever from those of "your environment," so that you can drop them into those of "another environment," is only a useful fiction, as in reality they're entangled with everything else. I'm not aware of formal proof of this point that I'm trying to make; it's just a pretty strongly held intuition. Isn't this also kind of one of the key motivations for thinking about embedded agency?

Comment by jbkjr on An Orthodox Case Against Utility Functions · 2021-08-03T15:23:00.045Z · LW · GW

I definitely see it as a shift in that direction, although I'm not ready to really bite the bullets -- I'm still feeling out what I personally see as the implications. Like, I want a realist-but-anti-realist view ;p

You might find Joscha Bach's view interesting...

Comment by jbkjr on Refactoring Alignment (attempt #2) · 2021-08-03T09:38:38.565Z · LW · GW

I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.

This sounds reasonable and similar to the kinds of ideas for understanding agents' goals as cognitively implemented that I've been exploring recently.

However, I think possibly you want a very behavioral definition of mesa-objective. If that's true, I wonder if you should just identify with the generalization-focused path instead. After all, one of the main differences between the two paths is that the generalization-focused path uses behavioral definitions, while the objective-focused path assumes some kind of explicit representation of goal content within a system.

The funny thing is I am actually very unsatisfied with a purely behavioral notion of a model's objective, since a deceptive model would obviously externally appear to be a non-deceptive model in training. I just don't think there will be one part of the network we can point to and clearly interpret as being some objective function that the rest of the system's activity is optimizing. Even though I am partial to the generalization focused approach (in part because it kind of widens the goal posts with the "acceptability" vs. "give the model exactly the correct goal" thing), I still would like to have a more cognitive understanding of a system's "goals" because that seems like one of the best ways to make good predictions about how the system will generalize under distributional shift. I'm not against assuming some kind of explicit representation of goal content within a system (for sufficiently powerful systems); I'm just against assuming that that content will look like a mesa-objective as originally defined.

Comment by jbkjr on Re-Define Intent Alignment? · 2021-08-03T09:28:42.922Z · LW · GW

I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems.

For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function.

Is this related to your post An Orthodox Case Against Utility Functions? It's been on my to-read list for a while; I'll be sure to give it a look now.

Comment by jbkjr on Did they or didn't they learn tool use? · 2021-07-29T14:23:59.488Z · LW · GW

One idea as to the source of the potential discrepancy... did any of the task prompts for the tasks in which it did figure out how to use tools tell it explicitly to "use the objects to reach a higher floor," or something similar? I'm wondering if the cases where it did use tools are examples where doing so was instrumentally useful to achieving a prompted objective that didn't explicitly require tool use.

Comment by jbkjr on Refactoring Alignment (attempt #2) · 2021-07-28T18:36:28.770Z · LW · GW

I'm not too keen on (2) since I don't expect mesa objectives to exist in the relevant sense.

Same, but how optimistic are you that we could figure out how to shape the motivations or internal "goals" (much more loosely defined than "mesa-objective") of our models via influencing the training objective/reward, the inductive biases of the model, the environments they're trained in, some combination of these things, etc.?

These aren't "clean", in the sense that you don't get a nice formal guarantee at the end that your AI system is going to (try to) do what you want in all situations, but I think getting an actual literal guarantee is pretty doomed anyway (among other things, it seems hard to get a definition for "all situations" that avoids the no-free-lunch theorem, though I suppose you could get a probabilistic definition based on the simplicity prior).

Yup, if you want "clean," I agree that you'll have to either assume a distribution over possible inputs, or identify a perturbation set over possible test environments to avoid NFL.

Comment by jbkjr on Refactoring Alignment (attempt #2) · 2021-07-28T18:31:49.620Z · LW · GW

Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don't want to get into exactly what "alignment" means.)

This path apparently implies building goal-oriented systems; all of the subgoals require that there actually is a mesa-objective.

I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless we start to be more inclusive about what we mean by "mesa-optimizer" and "mesa-objective." I don't think those terms as defined in RFLO actually capture humans, but I definitely want to say that we're "goal-oriented" and have "intent."

But the graph structure makes perfect sense, I just am doing the mental substitution of "intent alignment means 'what the model is actually trying to do' is aligned with 'what we want it to do'." (Similar for inner robustness.)

However, I'm not confident that the details of Evan's locutions are quite right. For example, should alignment be tested only in terms of the very best policy?

I also don't think optimality is a useful condition in alignment definitions. (Also, a similarly weird move is pulled with "objective robustness," which is defined in terms of the optimal policy for a model's behavioral objective... so you'd have to get the behavioral objective, which is specific to your actual policy, and find the actually optimal policy for that objective, to determine objective robustness?)

I find myself thinking that objective robustness is actually what I mean by the inner alignment problem. Abergal voiced similar thoughts. But this makes it seem unfortunate that "inner alignment" refers specifically to the thing where there are mesa-optimizers. I'm not sure what to do about this.

Yeah, I think I'd also wish we could collectively agree to redefine inner alignment to be more like objective robustness (or at least be more inclusive of the kinds of inner goals humans have). But I've been careful not to use the term to refer to anything except mesa-optimizers, partially in order to be consistent with Evan's terminology, but primarily not to promote unnecessary confusion with those who strongly associate "inner alignment" with mesa-optimization (although they could also be using a much looser conception of mesa-optimization, if they consider humans to be mesa-optimizers, in which case "inner alignment" pretty much points at the thing I'd want it to point at).

Comment by jbkjr on Re-Define Intent Alignment? · 2021-07-28T17:40:44.174Z · LW · GW

The behavioral objective, meanwhile, would be more like the thing the agent appears to be pursuing under some subset of possible distributional shifts. This is the more realistic case where we can't afford to expose our agent to every possible environment (or data distribution) that could possibly exist, so we make do and expose it to only a subset of them. Then we look at what objectives could be consistent with the agent's behavior under that subset of environments, and those count as valid behavioral objectives.

The key here is that the set of allowed mesa-objectives is a reliable invariant of the agent, while the set of allowed behavioral objectives is contingent on our observations of the agent's behavior under a limited set of environments In principle, the two sets of objectives won't converge perfectly until we've run our agent in every possible environment that could exist.

This is the right sort of idea; in the OOD robustness literature you try to optimize worst-case performance over a perturbation set of possible environments. The problem I have with what I understand you to be saying is with the assumption that there is any possible reliable invariant of the agent over every possible environment that could be a mesa-objective, which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment, so why shouldn't we talk about the mesa-objective being over a perturbation set, too, just that it has to be some function of the model's internal features?

Comment by jbkjr on Re-Define Intent Alignment? · 2021-07-28T17:33:43.306Z · LW · GW

However, we could instead define "intent alignment" as "the optimal policy of the mesa objective would be good for humans".

I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don't have a good reason to believe this.)

I want to be able to talk about how we can shape goals which may be messier, perhaps somewhat competing, internal representations or heuristics or proxies that determine behavior. If we actually want to understand "intent," we have to understand what the heck intentions and goals actually are in humans and what they might look like in advanced ML systems. However, I do think this is a very good point you raise about intent alignment (that it should correspond to the model's internal goals, objectives, intentions, etc.), and the need to be mindful of which version we're using in a given context.

Also, I'm iffy on including the "all inputs"/optimality thing (I believe Rohin is, too)... it does have the nice property that it lets you reason without considering e.g. training setup, dataset, architecture, but we won't actually have infinite data and optimal models in practice. So, I think it's pretty important to model how different environments or datasets interact with the reward/objective function in producing the intentions and goals of our models.

Evan highlights the assumption that solving inner alignment will solve behavioral alignment: he thinks that the most important cases of catastrophic bad behavior are intentional (ie, come from misaligned objectives, either outer objective or inner objective).

I don't think this is necessarily a crux between the generalization- and objective-driven approaches—if intentional behavior requires a mesa-objective, then humans can't act "intentionally." So we obviously want a notion of intent that applies to the messier middle cases of goal representation (between a literal mesa-objective and a purely implicit behavioral objective).

Comment by jbkjr on Refactoring Alignment (attempt #2) · 2021-07-26T23:21:10.323Z · LW · GW

So, for example, this claims that either intent alignment + objective robustness or outer alignment + robustness would be sufficient for impact alignment.

Shouldn’t this be “intent alignment + capability robustness or outer alignment + robustness”?

Btw, I plan to post more detailed comments in response here and to your other post, just wanted to note this so hopefully there’s no confusion in interpreting your diagram.

Comment by jbkjr on Looking Deeper at Deconfusion · 2021-07-21T11:40:41.890Z · LW · GW

Great post. My one piece of feedback is that not calling the post "Deconfusing 'Deconfusion'" might've been a missed opportunity. :)

I even went to this cooking class once where the chef proposed his own deconfusion of the transformations of food induced by different cooking techniques -- I still use it years later.

Unrelatedly, I would be interested in details on this.

Comment by jbkjr on Why Subagents? · 2021-07-17T18:40:30.901Z · LW · GW

The way I'd think of it, it's not that you literally need unanimous agreement, but that in some situations there may be subagents that are strong enough to block a given decision.

Ah, I think that makes sense. Is this somehow related to the idea that the consciousness is more of a "last stop for a veto from the collective mind system" for already-subconsciously-initiated thoughts and actions? Struggling to remember where I read this, though.

It gets a little handwavy and metaphorical but so does the concept of a subagent.

Yeah, considering the fact that subagents are only "agents" insofar as it makes sense to apply the intentional stance (the thing we'd like to avoid having to apply to the whole system because it seems fundamentally limited) to the individual parts, I'm not surprised. It seems like it's either "agents all the way down" or abandon the concept of agency altogether (although posing that dichotomy feels like a suspicious presumption of agency, itself!).

Comment by jbkjr on Why Subagents? · 2021-07-17T11:54:12.821Z · LW · GW

Wouldn't decisions about e.g. which objects get selected and broadcast to the global workspace be made by a majority or plurality of subagents? "Committee requiring unanimous agreement" feels more like what would be the case in practice for a unified mind, to use a TMI term. I guess the unanimous agreement is only required because we're looking for strict/formal coherence in the overall system, whereas e.g. suboptimally-unified/coherent humans with lots of akrasia can have tug-of-wars between groups of subagents for control.

Comment by jbkjr on Why Subagents? · 2021-07-17T09:55:58.936Z · LW · GW

The arrows show preference: our agent prefers A to B if (and only if) there is a directed path from A to B along the arrows.

Shouldn't this be "iff there is a directed path from B to A"? E.g. the agent prefers pepperoni to cheese, so there is a directed arrow from cheese to pepperoni.

Comment by jbkjr on Taboo "Outside View" · 2021-06-18T22:50:22.844Z · LW · GW

Great post. That Anakin meme is gold.

“Whenever you notice yourself saying ‘outside view’ or ‘inside view,’ imagine a tiny Daniel Kokotajlo hopping up and down on your shoulder chirping ‘Taboo outside view.’”

Somehow I know this will now happen automatically whenever I hear or read “outside view.” 😂

Comment by jbkjr on The Hard Work of Translation (Buddhism) · 2021-05-24T22:11:16.796Z · LW · GW

The Buddha taught one specific concentration technique and a simple series of insight techniques

Any pointers on where I can find information about the specific techniques as originally taught by the Buddha?

Comment by jbkjr on A non-mystical explanation of "no-self" (three characteristics series) · 2021-05-21T23:37:05.873Z · LW · GW

I've found this interview with Richard Lang about the "headless" method of interrogation helpful and think Sam's discussion provides useful context to bridge the gap to the scientific skeptics as well as to other meditative techniques and traditions (some of which are touched upon in this post). It also includes a pointing out exercise.

Comment by jbkjr on Deducing Impact · 2021-04-29T18:08:54.354Z · LW · GW

Late to the party, but here's my crack at it (ROT13'd since markdown spoilers made it an empty box without my text):

Fbzrguvat srryf yvxr n ovt qrny vs V cerqvpg gung vg unf n (ovt) vzcnpg ba gur cbffvovyvgl bs zl tbnyf/inyhrf/bowrpgvirf orvat ernyvmrq. Nffhzvat sbe n zbzrag gung gur tbnyf/inyhrf ner jryy-pncgherq ol n hgvyvgl shapgvba, vzcnpg jbhyq or fbzrguvat yvxr rkcrpgrq hgvyvgl nsgre gur vzcnpgshy rirag - rkcrpgrq hgvyvgl orsber gur rirag. Boivbhfyl, nf lbh'ir cbvagrq bhg, fbzrguvat orvat vzcnpgshy nppbeqvat gb guvf abgvba qrcraqf obgu ba gur inyhrf naq ba ubj "bowrpgviryl vzcnpgshy" vg vf (v.r. ubj qenfgvpnyyl vg punatrf gur frg bs cbffvoyr shgherf).

Comment by jbkjr on Old post/writing on optimization daemons? · 2021-04-17T18:53:56.793Z · LW · GW

Ah, it was John's post I was thinking of; thanks! (Apologies to John for not remembering it was his writing, although I suppose mistaking someone's visual imagery on a technical topic for Eliezer's might be considered an accidental compliment :).)