The Plan

johnswentworth

post by johnswentworth · 2021-12-10T23:41:39.417Z · LW · GW · 78 comments

  What’s your plan for AI alignment?
  That sounds… awfully optimistic. Do you actually think that’s viable?
  Do you just have really long timelines?
  … Wat. Not relevant until we’re down to two years?!?
  But iterative engineering is important!
  But engineering is important for advancing understanding too!
  What do you mean by “fundamentally confused”?
  What are we fundamentally confused about?
  What kinds of “incremental progress” do you have in mind here?
  Ok, the incremental progress makes sense, but the full plan still sounds ridiculously optimistic with 10-15 year timelines. Given how slow progress has been on the foundational theory of agency (especially at MIRI), why do you expect it to go so much faster?
  What’s the roadmap?
  Why do we need formalizations for engineering?
  Why so much focus on abstraction?
  But, like, 10-15 years?!?
  Why ambitious value learning?
  … but why not aim for some easier strategy?

This is a high-level overview of the reasoning behind my research priorities, written as a Q&A.

What’s your plan for AI alignment?

Step 1: sort out our fundamental confusions about agency

Step 2: ambitious value learning (i.e. build an AI which correctly learns human values and optimizes for them)

Step 3: …

Step 4: profit!

… and do all that before AGI kills us all.

That sounds… awfully optimistic. Do you actually think that’s viable?

Better than a 50/50 chance of working in time.

Do you just have really long timelines?

No. My median is maybe 10-15 years, though that’s more a gut estimate based on how surprised I was over the past decade rather than a carefully-considered analysis. (I wouldn’t be shocked by another AI winter, especially on an inside view, but on an outside view the models generating that prediction have lost an awful lot of Bayes Points over the past few years.)

Mostly timelines just aren’t that relevant; they’d have to get down to around 18-24 months before I think it’s time to shift strategy a lot.

… Wat. Not relevant until we’re down to two years?!?

To be clear, I don’t expect to solve the whole problem in the next two years. Rather, I expect that even the incremental gains from partial progress on fundamental understanding will be worth far more than marginal time/effort on anything else, at least given our current state.

At this point, I think we’re mostly just fundamentally confused about agency and alignment. I expect approximately-all of the gains-to-be-had come from becoming less confused. So the optimal strategy is basically to spend as much time as possible sorting out as much of that general confusion as possible, and if the timer starts to run out, then slap something together based on the best understanding we have.

18-24 months is about how long I expect it to take to slap something together based on the best understanding we have. (Well, really I expect it to take <12 months, but planning fallacy [LW · GW] and safety margins and time to iterate a little and all that.)

But iterative engineering is important!

In order for iterative engineering to be useful, we first need to have a strong enough understanding of what we even want to achieve in order to recognize when an iteration has brought us a step closer to the goal. No amount of A/B testing changes to our website will make our company profitable if we’re measuring the wrong metrics. I claim that, for alignment, we do not yet have a strong enough understanding for iteration to produce meaningful progress.

When I say “we’re just fundamentally confused about agency and alignment”, that’s the sort of thing I’m talking about.

To be clear: we can absolutely come up with proxy measures of alignment. The problem is that I don’t expect iteration under those proxy measures to get us meaningfully closer to aligned AGI. No reasonable amount of iterating on gliders’ flight-range will get one to the moon.

But engineering is important for advancing understanding too!

I do still expect some amount of engineering to be central for making progress on fundamental confusion. Engineering is one of the major drivers of science [LW · GW]; failed attempts to build amplifiers drove our first decent understanding of semiconductors, for instance. But this is a very different path-to-impact than directly iterating on “alignment”, and it makes sense to optimize our efforts differently if the path-to-impact is through fundamental understanding. Just take some confusing concept which is fundamental to agency and alignment (like abstraction, or optimization, or knowledge, or …) and try to engineer anything which can robustly do something with that concept. For instance, a lot of my own work is driven by the vision of a “thermometer of abstraction”, a device capable of robustly and generalizably measuring abstractions and presenting them in a standard legible format. It’s not about directly iterating on some alignment scheme, it’s about an engineering goal which drives and grounds the theorizing and can be independently useful for something of value.

Also, the theory-practice gap is a thing, and I generally expect the majority of “understanding” work to go into crossing that gap. I consider such work a fundamental part of sorting out confusions; if the theory doesn’t work in practice, then we’re still confused. But I also expect that the theory-practice gap is only very hard to cross the first few times; once a few applications work, it gets much easier. Once the first field-effect transistor works, it’s a lot easier to come up with more neat solid-state devices, without needing to further update the theory much. That’s why it makes sense to consider the theory-practice gap a part of fundamental understanding in its own right: once we understand it well enough for a few applications, we usually understand it well enough to implement many more with much lower marginal effort.

An analogy: to go from medieval castles to skyscrapers, we don’t just iterate on stone towers; we leverage fundamental scientific advances in both materials and structural engineering. My strategy for building the tallest possible metaphorical skyscraper is to put all my effort into fundamental materials and structural science. That includes testing out structures as-needed to check that the theory actually works, but the goal there is understanding, not just making tall test-towers; tall towers might provide useful data, but they’re probably not the most useful investment until we’re near the end-goal. Most of the iteration is on e.g. metallurgy, not on tower-height directly. Most of the experimentation is on e.g. column or beam loading under controlled conditions, again not on tower-height directly. If the deadline is suddenly 18-24 months, then it’s time to slap together a building with whatever understanding is available, but hopefully we figure things out fast enough that the deadline isn’t that limiting of a constraint.

What do you mean by “fundamentally confused”?

My current best explanation of “fundamental confusion” is that we don’t have the right frames [LW · GW]. When thinking about agency or alignment, we do not know:

What are the most important questions to ask?
What approximations work?
What do we need to pay attention to, and what can we safely ignore?
How can we break the problem/system up into subproblems/subsystems?

For all of these, we can certainly make up some answers. The problem is that we don’t have answers to these questions which seem likely to generalize well. Indeed, for most current answers to these questions, I think there are strong arguments that they will not generalize well. Maybe we have an approximation which works well for a particular class of neural networks, but we wouldn’t expect it to generalize to other kinds of agenty systems (like e.g. a bacterium), and it’s debatable whether it will even apply to future ML architectures. Maybe we know of some possible failure modes for alignment, but we don’t know which of them we need to pay attention to vs which will mostly sort themselves out, especially in future regimes/architectures which we currently can’t test. (Even more important: there’s only so much we can pay attention to at all, and we don’t know what details are safe to ignore.) Maybe we have a factorization of alignment [LW · GW] which helps highlight some particular problems, but the factorization is known to be leaky; there are other problems [LW(p) · GW(p)] which it obscures.

By contrast, consider putting new satellites into orbit. At this point, we generally know what the key subproblems are, what approximations we can make, what to pay attention to, what questions to ask. Most importantly, we are fairly confident that our framing for satellite delivery will generalize to new missions and applications, at least in the near-to-medium-term future. When someone needs to put a new satellite in orbit, it’s not like the whole field needs to worry about their frames failing to generalize.

(Note: there are probably aspects of “fundamental confusion” which this explanation doesn’t capture, but I don’t have a better explanation right now.)

What are we fundamentally confused about?

We’ve already talked about one example: I think we currently do not understand alignment well enough for iterative engineering to get us meaningfully closer to solving the real problem, in the same way that iterating on glider range will not get one meaningfully closer to going to the moon. When iterating, we don’t currently know which questions to ask, we don’t know which things to pay attention to, we don’t know which subproblems are bottlenecks.

Here’s a bunch of other foundational problems/questions where I think we currently don’t know the right framing to answer them in a generalizable way:

Is an E. coli an agent? Does it have a world-model, and if so, what is it? Does it have a utility function, and if so, what is it? Does it have some other kind of “goal”?
What even are "human values"? What’s the type signature of human values?
Given two agents (with potentially completely different world models), how can I tell whether one is "trying to help" the other? What does that even mean?
Given a trained neural network, does it contain any subagents? What are their world-models, and what do they want?
Given an atomically-precise scan of a whole human brain, body, and local environment, and unlimited compute, calculate the human’s goals/wants/values, in a manner legible to an automated optimizer.
Given some physical system, identify any agents in it, and what they’re optimizing for.
Back out the learned objective [LW · GW] of a trained neural net, and compare it to the training objective.

What kinds of “incremental progress” do you have in mind here?

As an example, I’ve spent the last couple years better understanding abstraction [LW · GW] (and I’m currently working to push that across the theory-practice gap). It’s a necessary component for the sorts of questions I want to answer about agency in general (like those above), but in the nearer term I also expect it to provide very strong ML interpretability tools. (This is a technical thing, but if you want to see the rough idea, take a look at the Telephone Theorem [LW · GW] post and imagine that the causal models are computational circuits for neural nets. There are still some nontrivial steps after that to adapt the theorem to neural nets, but it should convey the general idea, and it's a very simple theorem.) If I found out today that AGI was two years away, I’d probably spend a few more months making the algorithms for abstraction-extraction as efficient as I could get them, then focus mainly on applying it to interpretability.

(What I actually expect/hope is that I’ll have efficient algorithms demo-ready in the first half of next year, and then some engineers will come along and apply them to interpretability while I work on other things.)
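As a toy illustration of the flavor of the Telephone Theorem mentioned above (not the theorem itself, and not an abstraction-extraction algorithm), the sketch below passes one uniformly random bit through a chain of noisy relays and tracks how much information about the original bit survives each hop. The setup is an assumption chosen for simplicity: each relay is a binary symmetric channel with a hypothetical flip probability `p`, and all function names are made up for this example. The mutual information can only shrink along the chain (the data processing inequality), and it stays at its full value only when nothing is lost at any hop.

```python
import math


def binary_entropy(q):
    """Entropy (in bits) of a coin that lands heads with probability q."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def effective_flip_prob(p, hops):
    """Flip probability of `hops` independent binary symmetric channels in series.

    Composing channels with flip probability p gives another binary symmetric
    channel whose flip probability is (1 - (1 - 2p)^hops) / 2.
    """
    return 0.5 * (1.0 - (1.0 - 2.0 * p) ** hops)


def mutual_information_bits(p, hops):
    """I(X_0; X_hops) for a uniform input bit X_0, in bits."""
    return 1.0 - binary_entropy(effective_flip_prob(p, hops))


def information_profile(p, max_hops):
    """Mutual information with the original bit after 0, 1, ..., max_hops relays."""
    return [mutual_information_bits(p, n) for n in range(max_hops + 1)]


# Noisy relays: information about the original bit decays toward zero.
noisy = information_profile(0.1, 20)
# Lossless relays: the bit is perfectly conserved, so all of it arrives.
lossless = information_profile(0.0, 20)
print(noisy[0], noisy[5], noisy[20])
print(lossless[20])
```

The contrast between the two profiles is the point of the sketch: over long distances, only what is conserved exactly gets through. Adapting that idea to the computational circuits of a neural net involves the nontrivial steps mentioned above, which this example does not attempt.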

Another example: the next major thing [LW · GW] to sort out after abstraction will be when and why large optimized systems (e.g. neural nets or biological organisms [LW · GW]) are so modular, and how the trained/evolved modularity corresponds to modular structures in the environment. I expect that will yield additional actionable insights into ML interpretability, and especially into what environmental/training features lead to more transparent ML models.

Ok, the incremental progress makes sense, but the full plan still sounds ridiculously optimistic with 10-15 year timelines. Given how slow progress has been on the foundational theory of agency (especially at MIRI), why do you expect it to go so much faster?

Mostly I think MIRI has been asking not-quite-the-right-questions, in not-quite-the-right-ways.

Not-quite-the-right-questions: when I look at MIRI’s past work on agent foundations, it’s clear that the motivating questions were about how to build AGI which satisfies various desiderata (e.g. stable values under self-modification, corrigibility, etc). Trying to understand agency-in-general was mostly secondary, and was not the primary goal guiding choice of research directions. One clear example of this is MIRI’s work on proof-based decision theories: absolutely nobody would choose this as the most-promising research direction for understanding the decision theory used by, say, an E. coli. But plenty of researchers over the years have thought about designing AGI using proof-based internals.

I’m not directly thinking about how to design an AGI with useful properties. I’m trying to understand agenty systems in general - be it humans, ML systems, E. coli, cats, organizations, markets, what have you. My impression is that MIRI’s agent foundations team has started to think more along these lines over time (especially since Embedded Agency [? · GW] came out), but I think they’re still carrying a lot of baggage.

… which brings us to MIRI tackling questions in not-quite-the-right-ways. The work on Tiling Agents is a central example here: the problem is to come up with models for agents which copy themselves, so copies of the agents “tile” across the environment. When I look at that problem through an “understand agency in general” lens, my immediate thought is “ah, this is a baseline model for evolution”. Once we have a good model for agents which “reproduce” (i.e. tile), we can talk about agents which approximately-reproduce with small perturbations (i.e. mutations) and the resulting evolutionary process. Then we can go look at how evolution actually behaves to empirically check our models.

When MIRI looks at the Tiling Agents problem, on the other hand, they set it up in terms of proof systems proving things about “successor” proof systems. Absolutely nobody would choose this as the most natural setup to talk about evolution. It’s a setup which is narrowly chosen for a particular kind of “agent” (i.e. AI with some provable guarantees) and a particular use-case (i.e. maintaining the guarantees when the AI self-modifies).

Main point: it does not look like MIRI has primarily been trying to sort out fundamental confusions about agency-in-general, at least not for very long; that’s not what they were optimizing for. Their work was much more narrow than that. And this is one of those cases where I expect the more-general theory to be both easier to find (because we can use lots of data from existing agenty systems in biology, economics and ML) and more useful (because it will more likely generalize to many use-cases and many kinds of agenty systems).

Side note: contrary to popular perception, MIRI is an extremely heterogeneous org, and the criticisms above apply to different people at different times to very different degrees. That said, I think it’s a reasonable representation of the median past work done at MIRI. Also, MIRI is still the best org at this sort of thing, which is why I’m criticizing them in particular.

What’s the roadmap?

Abstraction is the main foundational piece (more on that below). After that, the next big piece will be selection theorems [LW · GW], and I expect to ride that train most of the way to the destination.

Regarding selection theorems: I think most of the gap between aspects of agency which we understand in theory, and aspects of agenty systems which seem to occur consistently in practice, comes from broad and robust optima. Real search systems (like gradient descent or evolution) don’t find just any optima. They find optima which are “broad”: optima whose basins fill a lot of parameter/genome space. And they find optima which are robust: small changes in the distribution of the environment don’t break them. There are informal arguments that this leads to a lot of key properties:

Modularity of the trained/evolved system (which we do indeed see in practice)
Good generalization properties
Information compression
Goal-directedness

… but we don’t have good formalizations of those arguments, and we’ll need the formalizations in order to properly leverage these properties for engineering.
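The "broad optima" part of the argument above can be made concrete with a toy sketch. Everything here is an assumption chosen for illustration: a one-dimensional loss with two equally good minima, one in a wide basin and one in a thin basin, plain gradient descent, and uniformly random starting points on the interval [0, 10]. With the hypothetical curvatures below, the two basins meet at x = 9, so the wide basin covers 90% of the interval. The sketch covers basin width only; it says nothing about robustness to shifts in the environment.

```python
import random

# Two minima with the same loss value (zero) but very different basin widths.
BROAD_CENTER, BROAD_CURVATURE = 0.0, 1.0      # wide basin
NARROW_CENTER, NARROW_CURVATURE = 10.0, 81.0  # thin basin


def loss_gradient(x):
    """Gradient of min(broad parabola, narrow parabola) at x."""
    broad = BROAD_CURVATURE * (x - BROAD_CENTER) ** 2
    narrow = NARROW_CURVATURE * (x - NARROW_CENTER) ** 2
    if broad <= narrow:
        return 2.0 * BROAD_CURVATURE * (x - BROAD_CENTER)
    return 2.0 * NARROW_CURVATURE * (x - NARROW_CENTER)


def descend(x, learning_rate=0.005, steps=1000):
    """Plain gradient descent; the step size is small enough for both basins."""
    for _ in range(steps):
        x -= learning_rate * loss_gradient(x)
    return x


def fraction_in_broad_basin(initial_points):
    """Share of starting points whose descent ends nearer the broad minimum."""
    hits = 0
    for x0 in initial_points:
        x = descend(x0)
        if abs(x - BROAD_CENTER) < abs(x - NARROW_CENTER):
            hits += 1
    return hits / len(initial_points)


rng = random.Random(0)
inits = [rng.uniform(0.0, 10.0) for _ in range(500)]
# Expected to land near the broad basin's share of the interval (about 0.9).
print(fraction_in_broad_basin(inits))
```

Both minima are equally good by the loss, yet the search lands in the wide one about nine times out of ten, purely because its basin takes up more of the space. Turning that observation into theorems about modularity, generalization, compression or goal-directedness is exactly the part that is still missing.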

Besides that, there’s also some cruft to clean up in existing theorems around agency. For instance, coherence theorems (i.e. the justifications for Bayesian expected utility maximization) have some important shortcomings [LW · GW], and are incomplete in important ways [LW(p) · GW(p)]. And of course there’s also work to be done on the theoretical support structure for all this - for instance, sorting out good models of what optimization even means [LW · GW].

Why do we need formalizations for engineering?

It’s not that we need formalizations per se; it’s that we need gears-level understanding. We need to have some understanding of why e.g. modularity shows up in trained/evolved systems, what precisely makes that happen. The need for gears-level understanding, in turn, stems from the need for generalizability [LW · GW].

Let’s get a bit more concrete with the modularity example. We could try to build some non-gears-level (i.e. black-box) model of modularity in neural networks by training some different architectures in different regimes on different tasks and with different parameters, empirically computing some proxy measure of “modularity” for each trained network, and then fitting a curve to it. This will probably work great right up until somebody tries something well outside of the distribution on which this black-box model was fit. (Those crazy engineers are constantly pushing the damn boundaries; that’s largely why they’re so useful for driving fundamental understanding efforts.)
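To make "some proxy measure of modularity" concrete, here is a minimal sketch of one possible proxy. The choices are assumptions for illustration, not a method taken from this post: the network is a plain MLP given as nested lists of weights, neurons become graph nodes, edge strength is the absolute weight, and the score is Newman's modularity Q for a partition of neurons that the caller supplies. The helper names are hypothetical. The sketch covers only the measurement step of the black-box recipe, not the training runs or the curve fitting.

```python
def mlp_to_adjacency(weight_matrices):
    """Build a symmetric neuron graph from MLP weights.

    weight_matrices[l][j][i] is the weight from neuron i in layer l to
    neuron j in layer l + 1. Edge strength is the absolute weight.
    """
    layer_sizes = [len(weight_matrices[0][0])] + [len(w) for w in weight_matrices]
    offsets = [0]
    for size in layer_sizes:
        offsets.append(offsets[-1] + size)
    n = offsets[-1]
    adjacency = [[0.0] * n for _ in range(n)]
    for layer, weights in enumerate(weight_matrices):
        for j, row in enumerate(weights):
            for i, value in enumerate(row):
                a, b = offsets[layer] + i, offsets[layer + 1] + j
                adjacency[a][b] = adjacency[b][a] = abs(value)
    return adjacency


def newman_modularity(adjacency, communities):
    """Newman's Q: within-community edge weight minus its expectation under
    a degree-preserving random graph, normalized by total edge weight."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += adjacency[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# A 2-input, 2-output layer with identity weights: two independent "wires".
adjacency = mlp_to_adjacency([[[1.0, 0.0], [0.0, 1.0]]])
print(newman_modularity(adjacency, [0, 1, 0, 1]))  # grouping each wire together
print(newman_modularity(adjacency, [0, 0, 1, 1]))  # grouping by layer instead
```

A score like this is easy to compute and easy to fit curves to, which is the appeal of the black-box approach. It also bakes in arbitrary choices (absolute weights as edge strength, a hand-picked partition), and nothing about it explains why a trained network would be modular in the first place.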

On the other hand, if we understand why modularity occurs in trained/evolved systems, then we can follow the gears of our reasoning even on new kinds of systems. More importantly, we can design new systems to leverage those gears without having to guess and check [LW · GW].

Now, gears-level understanding need not involve formal mathematics in general. But for the sorts of things I’m talking about here (like modularity or good generalization or information compression in evolved/trained systems), gears-level understanding mostly looks like mathematical proofs, or at least informal mathematical arguments. A gears-level answer to the question “Why does modularity show up in evolved systems?”, for instance, should have the same rough shape as a proof that modularity shows up in some broad class of evolved systems (for some reasonably-general formalization of “modularity” and “evolution”). It should tell us what the necessary conditions are, and explain why those conditions are necessary in such a way that we can modify the argument to handle different kinds of conditions without restarting from scratch.

Why so much focus on abstraction?

Abstraction is a common bottleneck to a whole bunch of problems in agency and alignment. Questions like:

If I have some system, what’s the right way to carve out a subsystem (which might be an “agent”, or a “world model”, or an “optimizer”, etc)? This should be robust/general enough to let us confidently say things like e.g. “there are no agents embedded in this trained neural net”.
What kinds-of-things show up in world models? For instance, is an AI likely to have internal notions of “tree” or “rock” or “car” which map to the corresponding human notions, and how closely?
How can we empirically measure high-level abstract things (like trees or agents) in the real world, in robustly generalizable ways?
To the extent that humans care about high-level abstract things like trees or cars, rather than quantum fields, how can we formalize that?
How can we translate the internal concepts used by trained ML systems into human-legible concepts, robustly enough that we won’t miss anything important (or at least can tell if we do)?

… and so forth. The important point isn’t any one of these questions; the important point is that understanding abstraction is a blocker for a whole bunch of different things. That’s what makes it an ideal target to focus on. Once it’s worked out, I expect to be unblocked not just on the above questions, but also on other important questions I haven’t even thought of yet - if it’s a blocker for many things already, it’s probably also a blocker for other things which I haven’t noticed.

If I had to pick one central reason why abstraction matters so much, it’s that we don’t currently have a robust, generalizable and legible way to measure high-level abstractions. Once we can do that, it will open up a lot of tricky conceptual questions to empirical investigation, in the same way that robust, generalizable and legible measurement tools usually open up scientific investigation of new conceptual areas.

But, like, 10-15 years?!?

A crucial load-bearing part of my model here is that agency/alignment work will undergo a phase transition in the next ~5 years. We’ll go from a basically-preparadigmatic state, where we don’t even know what questions to ask or what tools to use to answer them, to a basically-paradigmatic state, where we have a general roadmap and toolset. Or at the very least I expect to have a workable paradigm; whether anyone else jumps on board is a more open question.

There’s more than one possible path here, more than one possible future paradigm. My estimate of “~5 years” comes from eyeballing the current rate of progress, plus a gut feel for how close the frames are to where they need to be for progress to take off.

As an example of one path which I currently consider reasonably likely: abstraction provides the key tool for the phase transition. Once we can take a simulated environment or a trained model or the like, and efficiently extract all the natural abstractions from it, that changes everything. It’ll be like introducing the thermometer to the study of thermodynamics. We’ll be able to directly, empirically answer questions like “does this model know what a tree is?” or “does this model have a notion of human values?” or “is ‘human’ a natural abstraction?” or “are the agenty things in this simulation natural abstractions?” or …. (These won’t be yes/no answers, but they’ll be quantifiable in a standardized and robustly-generalizable way.) This isn’t a possibility I expect to be legibly plausible to other people right now, but it’s one I’m working towards.

Another path: once a few big selection theorems [LW · GW] are sorted out (like modularity of evolved systems, for instance) and empirically verified, we’ll have a new class of tools for empirical study of agenty systems. Like abstraction measurement, this has the potential to open up a whole class of tricky conceptual questions to empirical investigation. Things like “what is this bacterium’s world model?” or “are there any subagents in this trained neural network?”. Again, I don’t necessarily expect this possibility to be legibly plausible to other people right now.

To be clear: not all of my “better than 50/50 chance of working in time” comes from just these two paths. I’ve sketched a fair amount of burdensome detail [LW · GW] here, and there are a lot of variations which lead to similar outcomes with different details, as well as entirely different paths. But the general theme is that I don’t think it will take too much longer to get to a point where we can start empirically investigating key questions in robustly-generalizable ways (rather than the ad-hoc methods [LW · GW] used for empirical work today), and get proper feedback loops [LW · GW] going for improving understanding.

Why ambitious value learning?

It’s the best-case outcome. I mean, c’mon, it’s got “ambitious” right there in the name.

… but why not aim for some easier strategy?

The main possibly-easier strategy for which I don’t know of any probably-fatal failure mode is to emulate/simulate humans working on the alignment problem for a long time, i.e. a Simulated Long Reflection. The main selling point of this strategy is that, assuming the emulation/simulation is accurate, it probably performs at least as well as we would actually do if we tackled the problem directly.

This is really a whole class of strategies, with many variations, most of which involve training ML systems to mimic humans. (Yes, that implies we’re already at the point where it can probably FOOM.) In general, the further the variations get from just directly simulating humans working on alignment basically the way we do now (but for longer), the more possibly-fatal failure modes show up. HCH [? · GW] is a central example here: for some reason a structure whose most obvious name is The Infinite Bureaucracy was originally suggested as an approximation of a Long Reflection. Look, guys, there is no way in hell that The Infinite Bureaucracy is even remotely a good approximation of a Long Reflection. Naming it “HCH” does not make it any less of an infinite bureaucracy, and yes it is going to fail in basically the same ways as real bureaucracies and for basically the same underlying reasons (except even worse, because it’s infinite).

… but the failure of variations does not necessarily mean that the basic idea is doomed. The basic idea seems basically-sound to me; the problem is implementing it in such a way that the output accurately mimics a real long reflection, while also making it happen before unfriendly AGI kills us all.

Personally, I’m still not working on that strategy, for a few main reasons:

I expect my current strategy to be more competitive. One big advantage of understanding agency in general is that we can apply that understanding to whatever ML/AI progress comes along, even if it ends up looking very different from e.g. GPT-3.
The Simulated Long Reflection strategy gets more likely to work when we have people for it to mimic who are already far down the road to solving alignment. The further, the better.
On a gut level, I just don’t expect ML to emulate humans accurately enough for a Simulated Long Reflection to work until we’ve already passed doomsday. (This is probably the cruxiest issue.)

I am generally happy that other people are working on strategies in the Simulated Long Reflection family, and hope that such work continues.

78 comments

comment by Scott Garrabrant · 2021-12-11T22:09:43.007Z · LW(p) · GW(p)

I want to disagree about MIRI.

Mostly, I think that MIRI (or at least a significant subset of MIRI) has always been primarily directed at agenty systems in general.

I want to separate agent foundations at MIRI into three eras. The Eliezer Era (2001-2013), the Benya Era (2014-2016), and the Scott Era (2017-).

The transitions between eras had an almost complete overhaul of the people involved. In spite of this, I believe that they have roughly all been directed at the same thing, and that John is directed at the same thing.

The proposed mechanism behind the similarity is not transfer, but instead because agency in general is a convergent/natural topic.

I think throughout time, there has always been a bias in the pipeline from ideas to papers towards being more about AI. I think this bias has gotten smaller over time, as the agent foundations research program both started having stable funding, and started carrying less and less of the weight of all of AI alignment on its back. (Before going through editing with Rob, I believe Embedded Agency had no mention of AI at all.)

I believe that John thinks that the Embedded Agency document is especially close to his agenda, so I will start with that. (I also think that both John and I currently have more focus on abstraction than what is in the Embedded Agency document).

Embedded Agency, more so than anything else I have done, was generated using an IRL-shaped research methodology. I started by taking the stuff that MIRI has already been working on, mostly the artifacts of the Benya Era, and trying to communicate the central justification that would cause one to be interested in these topics. I think that I did not invent a pattern, but instead described a preexisting pattern that originally generated the thoughts.

This is consistent with having the pattern be about agency in general, and so I could find the pattern in ideas that were generated based on agency in AI, but I think this is not the case. I think the use of proof based systems is demonstrating an extreme disregard for the substrate that the agency is made of. I claim that the reason that there was a historic focus on proof-based agents is that it is a system that we could actually say stuff about. The fact that real life agents looked very different on the surface from proof based agents was a shortfall that most people would use to completely reject the system, but MIRI would work in it because what they really cared about was agency in general, and having another system that is easy to say things about that could be used to triangulate agency in general. If MIRI was directed at a specific type of agency, they would have rejected the proof based systems as being too different.

I think that MIRI is often misrepresented as believing in GOFAI because people look at the proof based systems and think that MIRI would only study those if they thought that is what AI might look like. I think in fact, the reason for the proof based systems is that, at the time, these were the most fruitful models we had, and we were just very willing to use any lens that worked when trying to look at something very very general.

(One counterpoint here is that maybe MIRI didn't care about the substrate the agency was running on, but did have a bias towards singleton-like agency, rather than very distributed systems. I think this is slightly true. Today, I think that you need to understand the distributed systems, because realistic singleton-like agents follow many of the same rules, but it is possible that early MIRI did not believe this as much.)

Most of the above was generated by looking at the Benya Era, and trying to justify that it was directed at agency in general at least/almost as much as the Scott Era, which seems like the hardest of the three for me.

For the Scott Era, I have introspection. I sometimes stop thinking in general, and focus on AI. This is usually a bad idea, and doesn't generate as much fruit, and it is usually not what I do.

For the Eliezer Era, just look at the sequences.

I just looked up and reread, and tried to steel man what you originally wrote. My best steel man is that you are saying that MIRI is trying to develop a prescriptive understanding of agency, and you are trying to develop a descriptive understanding of agency. There might be something to this, but it is really complicated. One way to define agency is as the pipeline from the prescriptive to the descriptive, so I am not sure that prescriptive and descriptive agency makes sense as a distinction.

As for the research methodology, I think that we all have pretty different research methodologies. I do not think Benya and Eliezer and I have especially more in common with each other than we do with John, but I might be wrong here. I also don't think Sam and Abram and Tsvi and I have especially more in common in terms of research methodologies, except in so far as we have been practicing working together.

In fact, the thing that might be going on here is that the distinctions in topics are coming from differences in research skills. Maybe proof based systems are the most fruitful model if you are a Benya, but not if you are a Scott or a John. But this is about what is easiest for you to think about, not about a difference in the shared convergent subgoal of understanding agency in general.

Replies from: johnswentworth, johnswentworth, ADifferentAnonymous

↑ comment by johnswentworth · 2021-12-12T20:03:06.556Z · LW(p) · GW(p)

I generally agree with most of this, but I think it misses the main claim I wanted to make. I totally agree that all three eras of MIRI's agent foundations research had some vision of the general theory of agency behind them, driving things. My point of disagreement is that, for most of MIRI's history, elucidating that general theory has not been the primary optimization objective.

Let's go through some examples.

The Sequences: we can definitely see Eliezer's understanding of the general theory of agency in many places, especially when talking about Bayes and utility. (Engines of Cognition [LW · GW] is a central example.) But most of the sequences talk about things like failure modes of human cognition, how to actually change your mind, social failure modes of human cognition, etc. It sure looks like the primary optimization objective is about better human thinking, plus some general philosophical foundations, not the elucidation of the general theory of agency.

Tiling agents and proof-based decision theories: I'm on board with the use of proof-based setups to make minimal assumptions about "the substrate that the agency is made of". That's an entirely reasonable choice, and it does look like that choice was driven (in large part) by a desire for the theory to apply quite generally. But these models don't look like they were ever intended as general models of agency (I doubt they would apply nicely to e-coli); in your words, they provided "another system that is easy to say things about that could be used to triangulate agency in general". That's not necessarily a bad step on the road to general theory, but the general theory itself was not the main thing those models were doing. (Personally, I think we already have enough points to triangulate from for the time being. I think if someone were just directly, explicitly optimizing for a general theory of agency they'd probably come to that same conclusion. On the other hand, I could imagine someone very focused on self-reference barriers in particular might end up hunting for more data points, and it's plausible that someone directly optimizing for a general theory of agency would end up focused mostly on self-reference.)

Grain of truth: similar to tiling agents and proof-based decision theories, this sounds like "another system that is easy to say things about that could be used to triangulate agency in general". It does not sound like a part of the general theory of agency in its own right.

Logical induction: here we see something which probably would apply to an e-coli; it does sound like a part of a general theory of agency. (For the peanut gallery: I'm talking about LI criterion here, not the particular algorithm.) On the other hand, I wouldn't expect it to say much of interest about an e-coli beyond what we already know from older coherence theorems. It's still mainly of interest in problems of reflection. And I totally buy that reflection is an important bottleneck to the general theory of agency, but I wouldn't expect to see such a singular focus on that one bottleneck if someone were directly optimizing for a general theory of agency as their primary objective.

Embedded agents: in your own words, you "started by taking the stuff that MIRI has already been working on, mostly the artifacts of the Benya Era, and trying to communicate the central justification that would cause one to be interested in these topics". You did not start by taking all the different agenty systems you could think of, and trying to communicate the central concept that would cause one to be interested in those systems. I do think embedded agency came closer than any other example on this list to tackling the general theory of agency, but it still wasn't directly optimizing for that as the primary objective.

Going down that list (and looking at your more recent work), it definitely looks like research has been more and more directly pointed at the general theory of agency over time. But it also looks like that was not the primary optimization objective over most of MIRI's history, which is why I don't think slow progress on agent foundations to date provides strong evidence that the field is very difficult. Conversely, I've seen firsthand how tractable things are when I do optimize directly for a general theory of agency, and based on that experience I expect fairly fast progress.

(Addendum for the peanut gallery: I don't mean to bash any of this work; every single thing on the list was at least great work, and a lot of it was downright brilliant. There's a reason I said MIRI is the best org at this kind of work. My argument is just that it doesn't provide especially strong evidence that agent foundations are hard, because the work wasn't directly optimizing for the general theory of agency as its primary objective.)

Replies from: Scott Garrabrant

↑ comment by Scott Garrabrant · 2021-12-12T23:22:49.289Z · LW(p) · GW(p)

Hmm, yeah, we might disagree about how much reflection (self-reference) is a central part of agency in general.

It seems plausible that it is important to distinguish between the e-coli and the human along a reflection axis (or even more so, distinguish between evolution and a human). Then maybe you are more focused on the general class of agents, and MIRI is more focused on the more specific class of "reflective agents."

Then, there is the question of whether reflection is going to be a central part of the path to (F/D)OOM.

Does this seem right to you?

Replies from: Scott Garrabrant, johnswentworth

↑ comment by Scott Garrabrant · 2021-12-12T23:39:20.557Z · LW(p) · GW(p)

To operationalize, I claim that MIRI has been directed at a close enough target to yours that you probably should update on MIRI's lack of progress at least as much as you would if MIRI was doing the same thing as you, but for half as long.

Replies from: Scott Garrabrant

↑ comment by Scott Garrabrant · 2021-12-13T00:25:29.116Z · LW(p) · GW(p)

Which isn't *that* large an update. The average number of agent foundations researchers (that are public facing enough that you can update on their lack of progress) at MIRI over the last decade is like 4.

Figuring out how to factor in researcher quality is hard, but it seems plausible to me that the amount of quality adjusted attention directed at your subgoal over the next decade is significantly larger than the amount of attention directed at your subgoal over the last decade. (Which would not all come from you. I do think that Agent Foundations today is non-trivially closer to John today than Agent Foundations 5 years ago is to John today.)

It seems accurate to me to say that Agent Foundations in 2014 was more focused on reflection, which shifted towards embeddedness, and then shifted towards abstraction, and that these things all flow together in my head, and so Scott thinking about abstraction will have more reflection mixed in than John thinking about abstraction. (Indeed, I think progress on abstraction would have huge consequences on how we think about reflection.)

In case it is not obvious to people reading, I endorse John's research program. (Which can maybe be inferred by the fact that I am arguing that it is similar to my own). I think we disagree about what is the most likely path after becoming less confused about agency, but that part of both our plans is yet to be written, and I think the subgoal is enough of a simple concept that I don't expect disagreements about what to do next to have a strong impact on how to do the first step.

Replies from: johnswentworth

↑ comment by johnswentworth · 2021-12-13T17:54:15.133Z · LW(p) · GW(p)

This all sounds right.

In particular, for folks reading, I symmetrically agree with this part:

In case it is not obvious to people reading, I endorse John's research program. (Which can maybe be inferred by the fact that I am arguing that it is similar to my own). I think we disagree about what is the most likely path after becoming less confused about agency, but that part of both our plans is yet to be written, and I think the subgoal is enough of a simple concept that I don't expect disagreements about what to do next to have a strong impact on how to do the first step.

... i.e. I endorse Scott's research program, mine is indeed similar, I wouldn't be the least bit surprised if we disagree about what comes next but we're pretty aligned on what to do now.

Also, I realize now that I didn't emphasize it in the OP, but a large chunk of my "50/50 chance of success" comes from other people's work playing a central role, and the agent foundations team at MIRI is obviously at the top of the list of people whose work is likely to fit that bill. (There's also the whole topic of producing more such people, which I didn't talk about in the OP at all, but I'm tentatively optimistic on that front too.)

↑ comment by johnswentworth · 2021-12-13T17:40:26.664Z · LW(p) · GW(p)

That does seem right.

I do expect reflection to be a pretty central part of the path to FOOM, but I expect it to be way easier to analyze once the non-reflective foundations of agency are sorted out. There are good reasons to expect otherwise on an outside view - i.e. all the various impossibility results in logic and computing. On the other hand, my inside view says it will make more sense once we understand e.g. how abstraction produces maps smaller than the territory while still allowing robust reasoning, how counterfactuals naturally pop out of such abstractions, how that all leads to something conceptually like a Cartesian boundary, the relationship between abstract "agent" and the physical parts which comprise the agent, etc.

If I imagine what my work would look like if I started out expecting reflection to be the taut constraint, then it does seem like I'd follow a path a lot more like MIRI's. So yeah, this fits.

Replies from: adamShimi

↑ comment by adamShimi · 2021-12-13T18:03:04.221Z · LW(p) · GW(p)

If I imagine what my work would look like if I started out expecting reflection to be the taut constraint, then it does seem like I'd follow a path a lot more like MIRI's. So yeah, this fits.

One thing I'm still not clear about in this thread is whether you (John) would feel that progress has been made for the theory of agency if all the problems on which MIRI has worked were instantaneously solved. Because there's a difference between saying "this is the obvious first step if you believe reflection is the taut constraint" and "solving this problem would help significantly even if reflection wasn't the taut constraint".

Replies from: johnswentworth

↑ comment by johnswentworth · 2021-12-13T19:56:14.643Z · LW(p) · GW(p)

I expect that progress on the general theory of agency is a necessary component of solving all the problems on which MIRI has worked. So, conditional on those problems being instantly solved, I'd expect that a lot of general theory of agency came along with it. But if a "solution" to something like e.g. the Tiling Problem didn't come with a bunch of progress on more foundational general theory of agency, then I'd be very suspicious of that supposed solution, and I'd expect lots of problems to crop up when we try to apply the solution in practice.

(And this is not symmetric: I would not necessarily expect such problems in practice for some more foundational piece of general agency theory which did not already have a solution to the Tiling Problem built into it. Roughly speaking, I expect we can understand e-coli agency without fully understanding human agency, but not vice-versa.)

Replies from: Scott Garrabrant

↑ comment by Scott Garrabrant · 2021-12-13T20:33:05.461Z · LW(p) · GW(p)

I agree with this asymmetry.

One thing I am confused about is whether to think of the e-coli as qualitatively different from the human. The e-coli is taking actions that can be well modeled by an optimization process searching for actions that would be good if this optimization process output them, which has some reflection in it.

It feels like it can behaviorally be well modeled this way, but is mechanistically not shaped like this. I feel like the mechanistic fact is more important, but I feel like we are much closer to having behavioral definitions of agency than mechanistic ones.

Replies from: johnswentworth

↑ comment by johnswentworth · 2021-12-13T20:56:42.603Z · LW(p) · GW(p)

I would say the e-coli's fitness function has some kind of reflection baked into it, as does a human's fitness function. The qualitative difference between the two is that a human's own world model also has an explicit self-model in it, which is separate from the reflection baked into a human's fitness function.

After that, I'd say that deriving the (probable) mechanistic properties from the fitness functions is the name of the game.

... so yeah, I'm on basically the same page as you here.

↑ comment by johnswentworth · 2021-12-12T20:19:33.786Z · LW(p) · GW(p)

Main response is in another comment [LW(p) · GW(p)]; this is a tangential comment about prescriptive vs descriptive viewpoints on agency.

I think viewing agency as "the pipeline from the prescriptive to the descriptive" systematically misses a lot of key pieces. One central example of this: any properties of (inner/mesa) agents which stem from broad optima, rather than merely optima. (For instance, I expect that modularity of trained/evolved systems mostly comes from broad optima.) Such properties are not prescriptive principles; a narrow optimum is still an optimum. Yet we should expect such properties to apply to agenty systems in practice, including humans, other organisms, and trained ML systems.

The Kelly criterion is another good example: Abram has argued [LW · GW] that it's not a prescriptive principle, but it is still a very strong descriptive principle for agents in suitable environments.
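For readers unfamiliar with it, here is a minimal sketch of the Kelly criterion for a simple repeated binary bet. The setup, the function names, and the numbers (`p = 0.6`, even odds) are illustrative assumptions, not taken from the linked argument. The sketch only checks that the closed-form Kelly fraction matches the stake that maximizes expected log growth; the descriptive reading is that, in environments with many repeated bets, bettors staking near this fraction grow fastest in the long run.

```python
import math

def expected_log_growth(f, p, b):
    """Expected log-wealth growth per bet when staking a fraction f of wealth
    on a bet that wins with probability p at net odds b (win b per unit staked)."""
    return p * math.log(1 + b * f) + (1 - p) * math.log(1 - f)

def kelly_fraction(p, b):
    """Closed-form Kelly stake for the binary bet above."""
    return p - (1 - p) / b

# Hypothetical example bet: 60% chance to win at even odds.
p, b = 0.6, 1.0
f_star = kelly_fraction(p, b)

# Brute-force check: search stakes in [0, 1) for the one with the best growth rate.
grid = [i / 1000 for i in range(1000)]
best = max(grid, key=lambda f: expected_log_growth(f, p, b))

print(f"Kelly fraction: {f_star:.3f}, grid-search optimum: {best:.3f}")
```

Staking more or less than this fraction gives a strictly lower expected log growth rate for this bet, which is why the principle can describe the agents that end up dominating such an environment even if nobody prescribes it.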

More importantly, I think starting from prescriptive principles makes it much easier to miss a bunch of the key foundational questions - for instance, things like "what is an optimizer?" or "what are goals?". Questions like these need some kind of answer in order for many prescriptive principles to make sense in the first place.

Also, as far as I can tell to date, there is an asymmetry: a viewpoint starting from prescriptive principles misses key properties, but I have not seen any sign of key principles which would be missed starting from a descriptive viewpoint. (I know of philosophical arguments to the contrary, e.g. this [LW · GW], but I do not expect such things to cash out into any significant technical problem for agency/alignment, any more than I expect arguments about solipsism to cash out into any significant technical problem.)

↑ comment by ADifferentAnonymous · 2021-12-12T03:26:10.022Z · LW(p) · GW(p)

As a long-time LW mostly-lurker, I can confirm I've always had the impression MIRI's proof-based stuff was supposed to be a spherical-cow model of agency that would lead to understanding of the messy real thing.

What I think John might be getting at is that (my outsider's impression of) MIRI has been more focused on "how would I build an agent" as a lens for understanding agency in general—e.g. answering questions about the agency of e-coli is not the type of work I think of. Which maybe maps to 'prescriptive' vs. 'descriptive'?

comment by Adele Lopez (adele-lopez-1) · 2021-12-11T01:01:55.882Z · LW(p) · GW(p)

I think you've really hit the nail on the head on what's wrong (and right) with the MIRI approach. The Cartesian Frames [LW · GW] stuff seems to be the best stuff they've done in this direction.

I've also felt that our lack of understanding of abstraction is one of the key bottlenecks. How concerned are you about insights on this question also applying to unaligned AGI development?

Replies from: johnswentworth

↑ comment by johnswentworth · 2021-12-12T18:44:40.299Z · LW(p) · GW(p)

How concerned are you about insights on this question also applying to unaligned AGI development?

Enough that I have considered keeping it secret, but I think keeping it public is a strong net positive relative to our current state (i.e. giant inscrutable vectors of floating-points). If there were, say, another AI winter, then I could easily imagine changing my mind about that.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-12-12T12:02:50.819Z · LW(p) · GW(p)

I feel like your answer to "Why do we need formalizations for engineering?" just restates the claim rather than arguing for it. It sounds like you are saying "...we need formalizations because we need gears-level understanding, and formalizations are the way you get gears-level understanding in this domain." But why are formalizations the way to gears-level understanding in this domain? There are plenty of domains where one can have gears-level understanding without formalization.

Now, gears-level understanding need not involve formal mathematics in general. But for the sorts of things I’m talking about here (like modularity or good generalization or information compression in evolved/trained systems), gears-level understanding mostly looks like mathematical proofs, or at least informal mathematical arguments. A gears-level answer to the question “Why does modularity show up in evolved systems?”, for instance, should have the same rough shape as a proof that modularity shows up in some broad class of evolved systems (for some reasonably-general formalization of “modularity” and “evolution”). It should tell us what the necessary conditions are, and explain why those conditions are necessary in such a way that we can modify the argument to handle different kinds of conditions without restarting from scratch.

Maybe I'm just not interpreting "same rough shape" loosely enough. If pretty much any reasonable argument counts as the same rough shape as a proof, then I take back what I said.

Replies from: johnswentworth

↑ comment by johnswentworth · 2021-12-12T17:52:06.373Z · LW(p) · GW(p)

I basically agree with this if we're viewing this post as a standalone. I only had so much space to recursively unpack things, and I figure that the claim will make more sense if people go read a few of the posts on gears-level models and then think for themselves a bit about what gears-level models look like for questions like "why does modularity show up in evolved/trained systems?".

When I say "same rough shape as a proof", I don't necessarily mean any reasonable-sounding argument; the key is that we want arguments with enough precision that we can map out the boundaries of their necessary conditions, and enough internal structure to adapt them to particular situations or new models without having to start over from scratch. In short, it's about the ability to tell exactly when the argument applies, and to apply the argument in many ways and in many places.

comment by John_Maxwell (John_Maxwell_IV) · 2021-12-21T11:56:42.823Z · LW(p) · GW(p)

(Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)

There's also red teaming time, and lag in idea uptake/marketing, to account for. It's possible that we'll have the solution to FAI when AGI gets invented, but the inventor won't be connected to our community and won't be aware of/sold on the solution.

Edit: Don't forget to account for the actual engineering effort to implement the safety solution and integrate it with capabilities work. Ideally there is time for extensive testing and/or formal verification.

Replies from: NicholasKross

↑ comment by Nicholas / Heather Kross (NicholasKross) · 2023-04-07T00:35:35.542Z · LW(p) · GW(p)

I fear this too, at least because it's the most "yelling-at-the-people-onscreen-to-act-differently" scenario that still involves the "hard part" getting solved. I wish there was more discussion of this.

comment by Chris_Leong · 2021-12-12T02:07:45.791Z · LW(p) · GW(p)

"Look, guys, there is no way in hell that The Infinite Bureaucracy is even remotely a good approximation of a Long Reflection. Naming it “HCH” does not make it any less of an infinite bureaucracy, and yes it is going to fail in basically the same ways as real bureaucracies and for basically the same underlying reasons"

I guess this isn't immediately obvious for me. Bureaucracies fail because at each level the bosses tell the subordinates what to do and they just have to do it. In HCH, sure each subordinate performs a fixed mental task, but the boss gets to consider the result and make up its own mind, taking into account the reports from the other subordinates. All this extra processing makes me feel as though it isn't exactly the same thing.

Replies from: johnswentworth

↑ comment by johnswentworth · 2021-12-12T18:28:45.052Z · LW(p) · GW(p)

(I'm going to respond here to two different comments about HCH and why bureaucracies fail.)

I think a major reason why people are optimistic about HCH is that they're confused about why bureaucracies fail.

Responding to Chris: if you go look at real bureaucracies, it is not really the case that "at each level the bosses tell the subordinates what to do and they just have to do it". At every bureaucracy I've worked in/around, lower-level decision makers had many de facto degrees of freedom. You can think of this as a generalization of one of the central problems of jurisprudence: in practice, human "bosses" (or legislatures, in the jurisprudence case) are not able to give instructions which unambiguously specify what to do in all the crazy situations which come up in practice. Nor do people at the top have anywhere near the bandwidth needed to decide every ambiguous case themselves; there is far too much ambiguity in the world. So, in practice, lower-level people (i.e. judges at various levels) necessarily make many many judgement calls in the course of their work.

Also, in general, tons of information flows back up the hierarchy for higher-level people to make decisions. There are already bureaucracies whose purpose is very similar to HCH: they exist to support the decision-making of the person at the top. (Government intelligence is a good example.) To my knowledge/experience, such HCH-like bureaucracies are not any less dysfunctional than others, nor do normal bureaucracies behave less dysfunctionally than normal when passing information up to a high-level decision maker.

Responding to Joe: if you go look at real bureaucracies, most people working in them are generally well-meaning and trying to help. There is still a sense in which incentives are a limiting factor: good incentives are information-carriers in their own right (like e.g. prices), and I'll link below to arguments that information-transmission is the problem. But incentives are not the problem in a way which can be fixed just by having everyone share some non-selfish values.

So why do bureaucracies (and large organizations more generally) fail so badly?

My main model for this is that interfaces are a scarce resource [? · GW]. Or, to phrase it in a way more obviously relevant to factorization: it is empirically hard for humans to find good factorizations of problems which have not already been found. Interfaces which neatly split problems are not an abundant resource (at least relative to humans' abilities to find/build such interfaces). If you can solve that problem well, robustly and at scale, then there's an awful lot of money to be made.

Also, one major sub-bottleneck (though not the only sub-bottleneck) of interface scarcity is that it's hard to tell [? · GW] who has done a good job on a domain-specific problem/question without already having some domain-specific background knowledge. This also applies at a more "micro" level: it's hard to tell whose answers are best without knowing lots of context oneself.

I should also mention: these models came out of me working in/around bureaucratic organizations, as they were trying to scale up. I wanted to generally understand the causes of various specific instances of dysfunction. So they are based largely on first-hand knowledge.

Replies from: Chris_Leong, Joe_Collman

↑ comment by Chris_Leong · 2021-12-13T01:29:13.553Z · LW(p) · GW(p)

"At every bureaucracy I've worked in/around, lower-level decision makers had many de facto degrees of freedom." - I wasn't disputing this - just claiming that they had to work within the constraints of the higher-level boss.

It's interesting to hear the rest of your model though.

↑ comment by Joe Collman (Joe_Collman) · 2021-12-14T18:15:37.913Z · LW(p) · GW(p)

Thanks for the elaboration. I agree with most/all of this.

However, for a capable, well-calibrated, cautious H, it mostly seems to argue that HCH won't be efficient, not that it won't be capable and something-like-aligned.

Since the HCH structure itself isn't intended to be efficient, this doesn't seem too significant to me. In particular, the bureaucracy analogy seems to miss that HCH can spend >99% of its time on robustness. (this might look more like science: many parallel teams trying different approaches, critiquing each other and failing more often than succeeding)

I'm not sure whether you're claiming:

1. That an arbitrarily robustness-focused HCH would tend to be incorrect/overconfident/misaligned. (where H might be a team including e.g. you, Eliezer, Paul, Wei Dai, [other people you'd want]...)
2. That any limits-to-HCH system we train would need to make a robustness/training-efficiency trade-off, and that the levels of caution/redundancy/red-teaming... required to achieve robustness would make training uncompetitive.
   Worth noting here that this only needs to be a constant multiplier on human training time - once you're distilling or similar, there's no exponential cost increase. (granted distillation has its own issues)
3. Something else.

To me (2) seems much more plausible than (1), so a perils-of-bureaucracy argument seems more reasonably aimed at IDA etc than at HCH.

I should emphasize that it's not clear to me that HCH could solve any kind of problem. I just don't see strong reasons to expect [wrong/misaligned answer] over [acknowledgement of limitations, and somewhat helpful meta-suggestions] (assuming HCH decides to answer the question).

Replies from: johnswentworth

↑ comment by johnswentworth · 2021-12-14T23:33:13.329Z · LW(p) · GW(p)

This is a capability thing, not just an efficiency thing. If, for instance, I lack enough context to distinguish real expertise from prestigious fakery in some area, then I very likely also lack enough context to distinguish those who do have enough context from those who don't (and so on up the meta-ladder). It's a bottleneck which fundamentally cannot be circumvented by outsourcing cognitive labor.

Similarly, if the interface at the very top level does not successfully convey what I want those one step down to do, then there's no error-correction mechanism for that; there's no way to ground out the top-level question anywhere other than the top-level person. Again, it's a bottleneck which fundamentally cannot be circumvented by outsourcing cognitive labor.

Orthogonal to the "some kinds of cognitive labor cannot be outsourced" problem, there's also the issue that HCH can only spend >99% of its time on robustness if the person being amplified decides to do so, and then the person being amplified needs to figure out the very difficult problem of how to make all that robustness-effort actually useful. HCH could do all sorts of things if the H in question were already superintelligent, could perfectly factor problems, knew exactly the right questions to ask, knew how to deploy lots of copies in such a way that no key pieces fell through the cracks, etc. But actual humans are not perfectly-ideal tool operators who don't miss anything or make any mistakes [LW(p) · GW(p)], and actual humans are also not super-competent managers capable of extracting highly robust performance on complex tasks from giant bureaucracies. Heck, it's a difficult and rare skill just to get robust performance on simple tasks from giant bureaucracies.

In general, if HCH requires some additional assumption that the person being amplified is smart enough to do X, then that should be baked into the whole plan from the start so that we can evaluate it properly. Like, if every time someone says "HCH has problem Y" the answer is "well the humans can just do X", for many different values of Y and X, then that implies there's some giant unstated list of things the humans need to do in order for HCH to actually work. If we're going to rely on the scheme actually working, then we need that whole list in advance, not just some vague hope that the humans operating HCH will figure it all out when the time comes. Humans do not, in practice, reliably ask all the right questions on-the-fly.

And if your answer to that is "ok, the first thing for the HCH operator to do is spin up a bunch of independent HCH instances and ask them what questions we need to ask..." then I want to know why we should expect that to actually generate a list containing all the questions we need to ask. Are we assuming that those subinstances will first ask their subsubinstances (what questions the subinstances need to ask in order to determine (what questions the top instance needs to ask))? Where does that recursion terminate, and when it does terminate, how does the thing it's terminating on actually end up producing a list which doesn't miss any crucial questions?

Replies from: Joe_Collman

↑ comment by Joe Collman (Joe_Collman) · 2021-12-15T07:30:31.565Z · LW(p) · GW(p)

Similarly, if the interface at the very top level does not successfully convey what I want those one step down to do, then there's no error-correction mechanism for that; there's no way to ground out the top-level question anywhere other than the top-level person. Again, it's a bottleneck which fundamentally cannot be circumvented by outsourcing cognitive labor.

For complex questions I don't think you'd have the top-level H immediately divide the question itself: you'd want to avoid this single-point-of-failure. In unbounded HCH, one approach would be to set up a scientific community (or a set of communities...), to which the question would be forwarded unaltered. You'd have many teams taking different approaches to the question, teams distilling and critiquing the work of others, teams evaluating promising approaches... [again, in strong HCH we have pointers for all of this].
For IDA you'd do something vaguely similar, on a less grand scale.

You can set up error-correction by passing pointers, explicitly asking about ambiguity/misunderstanding at every step (with parent pointers to get context), using redundancy....

I agree that H needs to be pretty capable and careful - but I'm assuming a context where H is a team formed of hand-picked humans with carefully selected tools (and access to a lot of data). It's not clear to me that such a team is going to miss required robustness/safety actions (neither is it clear to me that they won't - I just don't buy your case yet). It's not clear they're in an adversarial situation, so some fixed capability level that can see things in terms of process/meta-levels/abstraction/algorithms... may be sufficient.
[once we get into truly adversarial territory, I agree that things are harder - but there we're beyond things failing for the same reasons bureaucracies do]

I agree it's hard to get giant bureaucracies to robustly perform simple tasks - I just don't buy the analogy. Giant bureaucracies don't have uniform values, and do need to pay for error correction mechanisms.

Like, if every time someone says "HCH has problem Y" the answer is "well the humans can just do X", for many different values of Y and X, then that implies there's some giant unstated list of things the humans need to do in order for HCH to actually work. If we're going to rely on the scheme actually working, then we need that whole list in advance...

Here I want to say:
Of course there's a "giant unstated list of things..." - that's why we're putting H into the system. It'd be great if we could precisely specify all the requirements on H ahead of time - but if we could do that, we probably wouldn't need H. (it certainly makes sense to specify and check for some X, but we're not likely to be able to find the full list)

To the extent that for all Y so far we've found an X, I'm pretty confident that my dream-team H would find X-or-better given a couple of weeks and access to their HCH. While we'd want more than "pretty confident", it's not clear to me that we can get it without fooling ourselves: once you're relying on a human, you're squarely in pretty-confident-land. (even if we had a full list of desiderata, we'd only be guessing that our H satisfied the list)

However, I get less clear once we're in IDA territory rather than HCH. Most of the approaches I first consider for HCH are nowhere near the object level of the question. Since IDA can't afford to set up such elaborate structures, I think the case is harder to make there.

Replies from: johnswentworth

↑ comment by johnswentworth · 2021-12-15T18:09:04.095Z · LW(p) · GW(p)

To the extent that for all Y so far we've found an X, I'm pretty confident that my dream-team H would find X-or-better given a couple of weeks and access to their HCH.

It sounds like roughly this is cruxy.

We're trying to decide how reliable <some scheme> is at figuring out the right questions to ask in general, and not letting things slip between the cracks in general, and not overlooking unknown unknowns in general, and so forth. Simply observing <the scheme> in action does not give us a useful feedback signal on these questions, unless we already know the answers to the questions. If <the scheme> is not asking the right questions, and we don't know what the right questions are, then we can't tell it's not asking the right questions. If <the scheme> is letting things slip between the cracks, and we don't know which things to check for crack-slippage, then we can't tell it's letting things slip between the cracks. If <the scheme> is overlooking unknown unknowns, and we don't already know what the unknown unknowns are, then we can't tell it's overlooking unknown unknowns.

So: if the dream team cannot figure out beforehand all the things it needs to do to get HCH to avoid these sorts of problems, we should not expect them to figure it out with access to HCH either. Access to HCH does not provide an informative feedback signal unless we already know the answers. The cognitive labor cannot be delegated.

(Interesting side-point: we can make exactly the same argument as above about our own reasoning processes. In that case, unfortunately, we simply can't do any better; our own reasoning processes are the final line of defense. That's why a Simulated Long Reflection is special, among these sorts of buck-passing schemes: it is the one scheme which does as well as we would do anyway. As soon as we start to diverge from Simulated Long Reflection, we need to ask whether the divergence will make the scheme more likely to ask the wrong questions, let things slip between cracks, overlook unknown unknowns, etc. In general, we cannot answer this kind of question by observing the scheme itself in operation.)

For complex questions I don't think you'd have the top-level H immediately divide the question itself: you'd want to avoid this single-point-of-failure.

(This is less cruxy, but it's a pretty typical/central example of the problems with this whole way of thinking.) By the time the question/problem has been expressed in English, the English expression is already a proxy for the real question/problem.

One of the central skills involved in conceptual research (of the sort I do) is to not accidentally optimize for something we wrote down in English, rather than the concept which that English is trying to express. It's all too easy to think that e.g. we need a nice formalization of "knowledge" or "goal directedness" or "abstraction" or what have you, and then come up with some formalization of the English phrase which does not quite match the thing in our head, and which does not quite fit the use-cases which originally generated the line of inquiry.

This is also a major problem in real bureaucracies: the boss can explain the whole problem to the underlings, in a reasonable amount of detail, without attempting to factor it at all, and the underlings are still prone to misunderstand the goal or the use-cases and end up solving the wrong thing. In software engineering, for instance, this happens all the time and is one of the central challenges of the job.

comment by Ruby · 2021-12-19T04:20:19.895Z · LW(p) · GW(p)

Curated. Not that many people pursue agendas to solve the whole alignment problem and of those even fewer write up their plan clearly. I really appreciate this kind of document and would love to see more like this. Shoutout to the back and forth between John and Scott Garrabrant [LW(p) · GW(p)] about John's characterization of MIRI and its relation to John's work.

comment by Mark Xu (mark-xu) · 2021-12-12T08:40:34.455Z · LW(p) · GW(p)

I want to flag that HCH was never intended to simulate a long reflection. Its main purpose (which it fails at in the worst case) is to let humans be epistemically competitive with the systems you’re trying to train.

Replies from: johnswentworth

↑ comment by johnswentworth · 2021-12-12T17:39:56.471Z · LW(p) · GW(p)

I mean, we have this thread [LW(p) · GW(p)] with Paul directly saying "If all goes well you can think of it like 'a human thinking a long time'", plus Ajeya and Rohin both basically agreeing with that.

Replies from: mark-xu

↑ comment by Mark Xu (mark-xu) · 2021-12-12T18:38:57.536Z · LW(p) · GW(p)

Agreed, but the thing you want to use this for isn’t simulating a long reflection, which will fail (in the worst case) because HCH can’t do certain types of learning efficiently.

Replies from: johnswentworth

↑ comment by johnswentworth · 2021-12-13T19:40:14.096Z · LW(p) · GW(p)

Once we get past Simulated Long Reflection, there's a whole pile of Things To Do With AI which strike me as Probably Doomed on general principles.

You mentioned using HCH to "let humans be epistemically competitive with the systems we're trying to train", which definitely falls in that pile. We have general principles saying that we should definitely not rely on humans being epistemically competitive with AGI; using HCH does not seem to get around those general principles at all. (Unless we buy some very strong hypotheses about humans' skill at factorizing problems, in which case we'd also expect HCH to be able to simulate something long-reflection-like.)

Trying to be epistemically competitive with AGI is, in general, one of the most difficult use-cases one can aim for. For that to be easier than simulating a long reflection, even for architectures other than HCH-emulators, we'd need some really weird assumptions.

comment by interstice · 2021-12-11T03:17:36.255Z · LW(p) · GW(p)

Excellent post! This seems like a highly promising and under-explored line of attack. I've had some vaguely [LW · GW] similar thoughts [LW · GW] over the years, but you've done a far better job articulating and developing a coherent programme. Bravo!

I think my biggest intuitive disagreement might be with whether it is likely to be possible to create some sort of efficient 'abstraction thermometer' or 'agency thermometer'. Searching for possible ways of finding agents or abstractions in a system seems like a prototypical NP-hard search problem. Now in practice it's often possible to solve such problems efficiently, but the setting with agents seems especially problematic in that keeping yourself obfuscated can be instrumentally useful, so I suspect the instances we're confronted with in the real world may be adversarially selected to be inscrutable to fast search methods in general.

Replies from: Charlie Steiner, Joe_Collman

↑ comment by Charlie Steiner · 2021-12-13T17:07:11.479Z · LW(p) · GW(p)

I'm also interested in what goes on the other side of the equation. How are you defining what to search for in the first place? If you point your abstraction detector at an AI and it outputs "this AI has a concept of trees," how do you gain confidence that the "trees" according to the AI (and according to your abstraction detector) are more or less what you mean by trees?

Some ad-hoc methods spring to mind, but I'm not sure what John would say.

↑ comment by Joe Collman (Joe_Collman) · 2021-12-11T22:55:28.759Z · LW(p) · GW(p)

This is my largest concern too: that we might find principled-but-inefficient tools that give guarantees, but be unable to find any efficient approximation that doesn't lose those guarantees.

However, I do think there are reasons to be cautiously optimistic, conditional on gaining a solid theoretical understanding [just my impressions: confusion entirely possible]:

We get to pick the structure we're searching over - the only real constraint being that it has to perform competitively. It wouldn't matter that the 'thermometers' were inefficient in 99% of cases, just so long as we were able to find at least one kind of structure combining thermometer-efficiency and performance. If the required [thermometer-friendly] property can be formally specified, it may be possible to incorporate it as a training constraint.
So long as we can use the tools to prevent adversarial situations from arising in the first place, we don't need to meet the bar of working in the face of super-human adversarial selection (I think it's a good idea to view getting into that situation as a presumed loss condition).
In principle, greater theoretical understanding may give us more than just 'thermometers' - e.g. we might hope to find operators that preserve particular agency-related safety properties. If updates could be applied in terms of such operators, that may reduce the required frequency of slower tests. [the specifics may not look like this, but a solid theoretical understanding would usually be expected to help you avoid problems in various ways, not only to test for them]

comment by Shmi (shminux) · 2022-01-01T23:59:05.848Z · LW(p) · GW(p)

Is an e-coli an agent? Does it have a world-model, and if so, what is it? Does it have a utility function, and if so, what is it? Does it have some other kind of “goal”

That's the part I find puzzling in terms of lack of time devoted to it: how can one talk about agency without figuring out the basics like that? Though I personally argued that it might not even be possible to do in this post, which conjectured that vapor bubbles "maximizing their volume" in a pot of boiling water are not qualitatively different from bacteria going against sugar gradient in search of food.

Replies from: dkirmani

↑ comment by dkirmani · 2022-01-02T00:58:09.931Z · LW(p) · GW(p)

It's hard to articulate exactly why, but I feel like "utility-maximizing agent(s)" is not the right frame to think about AI in. You can fit a utility function to any sequence of 'actions' an 'agent' makes, so the abstraction "utility function" has no real power to predict the 'actions' of an 'agent'. There's also the fundamental human bias of ascribing agency to non-agentic systems (the weather, printers).
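The claim that a utility function can be fitted to any action sequence can be made concrete with a small sketch. The names `fit_utility` and `is_rationalized` are hypothetical, and the construction is the deliberately trivial one: assign utility 1 to exactly the action that was observed at each step and 0 to everything else. It illustrates the "no predictive power" point under that assumption; it says nothing about utility functions restricted to simple or natural forms.

```python
from typing import Callable, Hashable, Iterable, Sequence, Tuple

History = Tuple[Hashable, ...]


def fit_utility(observed: Sequence[Hashable]) -> Callable[[History, Hashable], float]:
    """Build a utility u(history, action) that the observed sequence maximizes at every step."""
    trace = tuple(observed)

    def utility(history: History, action: Hashable) -> float:
        step = len(history)
        # Reward only the action that was actually taken after this exact history.
        on_trace = step < len(trace) and history == trace[:step]
        return 1.0 if on_trace and action == trace[step] else 0.0

    return utility


def is_rationalized(
    observed: Sequence[Hashable],
    utility: Callable[[History, Hashable], float],
    action_space: Iterable[Hashable],
) -> bool:
    """Check that every observed action was utility-maximizing given its history."""
    actions = list(action_space)
    for step, chosen in enumerate(observed):
        history = tuple(observed[:step])
        best = max(utility(history, a) for a in actions)
        if utility(history, chosen) < best:
            return False
    return True


# Any sequence at all is "explained" after the fact.
observed_actions = ["left", "left", "right", "wait"]
u = fit_utility(observed_actions)
print(is_rationalized(observed_actions, u, {"left", "right", "wait"}))
```

Because the fitted function is built from the observations themselves, it assigns utility 0 to every action after any history it has not seen, so it cannot favor one future action over another.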

comment by Joe Collman (Joe_Collman) · 2021-12-12T00:08:54.660Z · LW(p) · GW(p)

Great post. To the extent that progress can be made on this, it seems extremely important.

A question on your HCH scepticism:

going to fail in basically the same ways as real bureaucracies and for basically the same underlying reasons

I'd be interested if you could elaborate on that. To me it seems HCH shares some elements of bureaucracy, but that there are important differences.

My thoughts:

1. They share the property of not reliably optimising for the task they're given (HCH is best considered a sovereign, not an oracle: it's an oracle iff it wants to be).
2. They differ in terms of common purpose: the Hs in HCH have all their non-selfish values in common. To the extent that they're optimising to achieve something in the world, it's the same something.
   1. Internal value conflict is likely a problem here, but perhaps avoidable with the right H?
3. Given (2), and strong HCH, it should be possible to adopt whatever enlightened form of organisational structure is desired. As a standard bureaucracy scales, it's hard to avoid friction, fragmentation, in-fighting, communication failures... - but a lot of this is due to disparate values, assumptions and incentives.

Overall it's not clear to me that HCH will fail to do something useful.
On the other hand, I do agree that long reflection seems to be one of the least HCH-friendly tasks (unless individual Hs have much more than one day). Long reflection would seem to require the Hs to change significantly during the process.

Replies from: johnswentworth, liam-donovan-1

↑ comment by johnswentworth · 2021-12-12T18:30:04.194Z · LW(p) · GW(p)

Response here [LW(p) · GW(p)].

↑ comment by Liam Donovan (liam-donovan-1) · 2021-12-12T01:56:59.699Z · LW(p) · GW(p)

comment by Jon Garcia · 2021-12-11T03:51:09.143Z · LW(p) · GW(p)

I strongly agree with your focus on ambitious value learning, rather than approaches that focus more on control (e.g., myopia). What we want is an AGI that can robustly identify humans (and I would argue, any agentic system), determine their values in an iteratively improving way, and treat these learned values as its own. That is, we should be looking for models where goal alignment and a desire to cooperate with humanity is situated within a broad basin of attraction (like how corrigibility is supposed to work), where any misalignment that the AGI notices (or that humans point out to it) is treated as an error signal that pulls its value model back into the basin. For such a scheme to work, of course, you need some way for it to infer human goals (watching human behavior?, imagining what it would be trying to achieve that would make it behave the same way?), some way for the AGI to represent "human goals" once it has inferred them, some way for it to represent "my own goals" in the same conceptual space (while still using those goal representations to drive its own behavior), and some way for it to take any differences in these representations to make itself more aligned (something like online gradient descent?).

And I think that solutions to this line of research would involve building generative agentic models into the AGI's architecture to give it strong inductive priors for detecting human agency in its world model (using something along the lines of analysis by synthesis or predictive coding). We wouldn't necessarily have to figure out everything about how the human mind works in order to build this (although that would certainly help), just enough so that it has the tools to teach itself how humans think and act, maintain homeostasis, generate new goals, use moral instincts of empathy, fairness, reciprocity, and status-seeking, etc. And as long as it is built to treat its best model of human values and goals as its own values and goals, I think we wouldn't need to worry about it torturing simulated humans [LW · GW], no matter how sophisticated its agentic models get. Of course, this would require figuring out how to detect agentic models in general systems, as you mentioned, so that we can make sure that the only parts of the AGI capable of simulating agents are those that have their preferences routed to the AGI's own preference modules.

Replies from: Koen.Holtman

↑ comment by Koen.Holtman · 2021-12-11T21:12:46.161Z · LW(p) · GW(p)

I strongly agree with your focus on ambitious value learning, rather than approaches that focus more on control (e.g., myopia).

Interesting observation on the above post! Though I do not read it explicitly in John's Plan, I guess you can indeed implicitly read that John's Plan rejects routes to alignment that focus on control/myopia, routes that do not visit step 2 of successfully solving automatic/ambitious value learning first.

John, can you confirm this?

Background: my own default Plan does focus on control/myopia. I feel that this line of attack for solving AGI alignment (if we ever get weak or strong AGI) is reaching the stage where all the major points of 'fundamental confusion' have been solved. So for me this approach represents the true 'easier strategy'.

Replies from: Jon Garcia

↑ comment by Jon Garcia · 2021-12-12T00:41:03.676Z · LW(p) · GW(p)

It's quite possible that control is easier than ambitious value learning, but I doubt that it's as sustainable. Approaches like myopia, IDA, or HCH would probably get you an AGI that is aligned to much higher levels of intelligence than doing without them, all else being equal. But if there is nothing pulling its motivations explicitly back toward a basin of value alignment, then I feel like these approaches would be prone to diverging from alignment at some level beyond where any human could tell what's going on with the system.

I do think that methods of control are worthwhile to pursue over the short term, but we had better be simultaneously working on ambitious value learning in the meantime for when an ASI inevitably escapes our control anyway. Even if myopia, for instance, worked perfectly to constrain what some AGI is able to conspire, it still seems likely that someone, somewhere, will try fiddling around with another AGI's time horizon parameters and cause a disaster. It would be better if AGI models, from the beginning, had at least some value learning system built in by default to act as an extra safeguard.

Replies from: Koen.Holtman

↑ comment by Koen.Holtman · 2021-12-15T22:39:59.913Z · LW(p) · GW(p)

I agree in general that pursuing multiple alternative alignment approaches (and using them all together to create higher levels of safety) is valuable. I am more optimistic than you that we can design control systems (different from time horizon based myopia) which will be stable and understandable even at higher levels of AGI competence.

it still seems likely that someone, somewhere, will try fiddling around with another AGI's time horizon parameters and cause a disaster.

Well, if you worry about people fiddling with control system tuning parameters, you also need to worry about someone fiddling with value learning parameters so that the AGI will only learn the values of a single group of people who would like to rule the rest of the world. Assuming that AGI is possible, I believe it is most likely that Bostrom's orthogonality hypothesis will hold for it. I am not optimistic about designing an AGI system which is inherently fiddle-proof.

comment by Ben Pace (Benito) · 2023-01-07T00:33:36.658Z · LW(p) · GW(p)

This post is one of the LW posts a younger version of myself would have been most excited to read. Building on what I got from the Embedded Agency sequence, this post lays out a broad-strokes research plan for getting the alignment problem right. It points to areas of confusion, it lists questions we should be able to answer if we got this right, it explains the reasoning behind some of the specific tactics the author is pursuing, and it answers multiple common questions and objections. It leaves me with a feeling of "Yeah, I could pursue that too if I wanted, and I expect I could make some progress" which is a shockingly high bar for a purported plan to solve the alignment problem. I give this post +9.

comment by Gunnar_Zarncke · 2021-12-13T20:25:36.447Z · LW(p) · GW(p)

If you are looking for a very general yet simple model of agency or at least decision making you might want to have a look at The geometry of decision-making in individuals and collectives.

While capturing known, generic features of neural integration, our model is deliberately minimal. This serves multiple purposes. First, following principles of maximum parsimony, we seek to find a simple model that can both predict and explain the observed phenomena. Second, we aim to reveal general principles and thus, consider features that are known to be valid across organisms irrespective of inevitable difference in structural organization of the brain.

I think it is coming from the other direction but still relevant.

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-12-31T22:08:38.582Z · LW(p) · GW(p)

This is big if true! I skimmed that paper and didn't understand its generality. It seems to be a model of how dumb animals and groups of dumb animals make decisions between desired places to be, as they approach a cluster of different desired places to be. The interesting upshot is that instead of picking one option as the best and heading straight for it, they make a series of binary choices.

Can you perhaps help me understand -- is this supposed to generalize to humans and AGIs also? And is it supposed to generalize to choices that aren't about where to travel when travelling fast towards a cluster of desirable destinations? If so, do you think you see how, and would you be willing to explain it to me?

Replies from: Gunnar_Zarncke, Jon Garcia

↑ comment by Gunnar_Zarncke · 2021-12-31T23:33:06.916Z · LW(p) · GW(p)

Happy New Year.

Based on the paper I would predict that it applies to human subconscious decision making. I'm unsure if it applies to conscious decisions. For AI it depends on the approach chosen.

↑ comment by Jon Garcia · 2022-01-01T02:03:55.642Z · LW(p) · GW(p)

This looks really interesting. The first thought that jumped to mind was how this geometric principle might extend to abstract goal space in general. There is research suggesting that savannah-like environments may have provided human evolution ideal selective pressures for developing the cognitive tools necessary for making complex plans. Becoming adept at navigating physical scenes with obstacles, predators, refuges, and prey gave humans the right kind of brain architecture for also navigating abstract spaces full of abstract goals, anti-goals (bad outcomes to avoid), obstacles, and paths (plans).

The "geometric decision making" in the paper was studied for physical spaces, but I could imagine that animal minds (including humans) use such a bifurcation method in other goal spaces as well. In other words, agents would start out traversing state space toward the average of multiple, moderately distant goals (seeking a state from which multiple goals are still achievable), then would switch to choosing a sub-cluster of the goals to pursue once they get close enough (the binary decision / bifurcation point). This would iterate until the agent has only one easily achievable goal in front of it.

My guess is that this strategy would be safer than choosing a single goal among many at the outset of planning (e.g., the one goal with the highest expected utility upon achievement). If the situation changes while the agent is in the middle of pursuing a goal, it might find itself too far away from any other goal to make up for the sunk cost. If instead it had been pursuing some sort of multi-goal-centroid state, it could still achieve a decent alternative goal even when what would have been its first choice ceases to be an option. As it gets closer to the multi-goal-centroid, it can afford to focus on just a subset (or just a single goal), since it knows that other decent options are still nearby in state space.
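The centroid-then-commit strategy described above can be sketched as a toy 2D simulation. This is not the model from the paper: `pursue`, `commit_radius`, and the rule of dropping the farthest remaining goal (a crude stand-in for "choosing a sub-cluster") are all assumptions made purely for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def centroid(points: Sequence[Point]) -> Point:
    """Arithmetic mean of a non-empty collection of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def step_toward(pos: Point, target: Point, step: float) -> Point:
    """Move at most `step` from pos toward target, landing exactly on target if close enough."""
    dx, dy = target[0] - pos[0], target[1] - pos[1]
    dist = math.hypot(dx, dy)
    if dist <= step:
        return target
    return (pos[0] + step * dx / dist, pos[1] + step * dy / dist)


def pursue(
    start: Point,
    goals: Sequence[Point],
    step: float = 0.1,
    commit_radius: float = 1.0,
    max_steps: int = 10_000,
) -> Tuple[Point, List[Point]]:
    """Head for the centroid of the candidate goals, narrowing the candidates on approach.

    Returns the last remaining goal and the path taken. If max_steps runs out first,
    the returned goal is simply the first candidate still in the list.
    """
    pos = start
    candidates = list(goals)
    path = [pos]
    for _ in range(max_steps):
        if len(candidates) == 1 and pos == candidates[0]:
            break
        target = centroid(candidates)
        if len(candidates) > 1 and math.dist(pos, target) <= commit_radius:
            # Decision point: give up on the goal that is currently farthest away.
            candidates.remove(max(candidates, key=lambda g: math.dist(pos, g)))
            continue
        pos = step_toward(pos, target, step)
        path.append(pos)
    return candidates[0], path


final_goal, path = pursue((0.0, 0.0), [(10.0, 1.0), (10.0, -1.0), (10.0, 5.0)])
print(final_goal, len(path))
```

In this toy the agent keeps every goal reachable for most of the trip and only commits near the end, which is the property the comment argues makes the strategy safer when the situation changes mid-route.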

comment by awenonian · 2021-12-19T20:59:10.264Z · LW(p) · GW(p)

"Real search systems (like gradient descent or evolution) don’t find just any optima. They find optima which are [broad and robust]"

I understand why you think that broad is true. But I'm not sure I get robust. In fact, robust seems to make intuitive dis-sense to me. Your examples are gradient descent and evolution, neither of which has memory, so how would they be able to know how "robust" an optimum is? Part of me thinks that the idea comes from how, if a system optimized for a non-robust optimum, it wouldn't internally be doing anything different, but we probably would say it failed to optimize, so it looks like successful optimizers optimize for robust optima. Plus, broad optima are more likely to be robust. I'm not sure, but I do notice I'm confused about the inclusion of "robust". My current intuition is kinda like "Broadness and robustness of optima are very coupled. But, given that, optimization for robust optima only happens insofar as it is really optimization for broad optima. Optimization for robust but not broad optima does not happen, and optimization for statically broad but more robust optima does not happen better."

Replies from: johnswentworth

↑ comment by johnswentworth · 2021-12-19T21:40:54.097Z · LW(p) · GW(p)

If we're just optimizing some function, then indeed breadth is the only relevant part. But for something like evolution or SGD, we're optimizing over random samples, and it's the use of many different random samples which I'd expect to select for robustness.

Replies from: awenonian

↑ comment by awenonian · 2021-12-24T04:07:35.725Z · LW(p) · GW(p)

Maybe I misunderstand your use of robust, but this still seems to me to be breadth. If an optimum is broader, samples are more likely to fall within it. I took broad to mean "has a lot of (hyper)volume in the optimization space", and robust to mean "stable over time/perturbation". I still contend that those optimization processes are unaware of time, or any environmental variation, and can only select for it in so far as it is expressed as breadth.

The example I have in my head is that if you had an environment, and committed to changing some aspect of it after some period of time, evolution or SGD would optimize the same as if you had committed to a different change. Which change you do would affect the robustness of the environmental optima, but the state of the environment alone determines their breadth. The processes cannot optimize based on your committed change before it happens, so they cannot optimize for robustness.

Given what you said about random samples, I think you might be working under definitions along the lines of "robust optima are ones that work in a range of environments, so you can be put in a variety of random circumstances, and still have them work" and (at this point I struggled a bit to figure out what a "broad" optimum would be that's different, and this is what I came up with?) "broad optima are those that you can do approximately and still get a significant chunk of the benefit." I feel like these can still be unified into one thing, because I think approximate strategies in fixed environments are similar to fixed strategies in approximate environments? Like moving a little to the left is similar to the environment being a little to the right?
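A toy 1D example may help pin down the disagreement. It assumes that environmental variation acts as a pure translation of the loss landscape, so that `loss(x, env_shift) = base_loss(x - env_shift)`; the function names, well depths, and widths are all made up for illustration. Under that assumption, moving the strategy and moving the environment are interchangeable, and averaging over sampled environments favors the broad well even though the narrow well is deeper in any single fixed environment.

```python
import math
from typing import Sequence


def base_loss(x: float) -> float:
    """Loss landscape with a deep narrow well at x = -2 and a shallower broad well at x = +2."""
    narrow = -1.0 * math.exp(-((x + 2.0) ** 2) / (2 * 0.05 ** 2))
    broad = -0.8 * math.exp(-((x - 2.0) ** 2) / (2 * 1.0 ** 2))
    return narrow + broad


def loss(x: float, env_shift: float) -> float:
    """Assumed environment model: the environment only translates the landscape."""
    return base_loss(x - env_shift)


def average_loss(x: float, env_shifts: Sequence[float]) -> float:
    """Mean loss of a fixed strategy x across a set of sampled environments."""
    return sum(loss(x, s) for s in env_shifts) / len(env_shifts)


# A fixed grid of environments stands in for random samples, to keep the result deterministic.
env_shifts = [i / 10 for i in range(-5, 6)]

print("fixed environment:  narrow", loss(-2.0, 0.0), " broad", loss(2.0, 0.0))
print("averaged over envs: narrow", average_loss(-2.0, env_shifts),
      " broad", average_loss(2.0, env_shifts))
```

Within this model, robustness to environment samples and breadth in strategy space are the same property, which matches the unification suggested above. The sketch does not address environment changes that are not equivalent to any perturbation of the strategy, which is where the two notions could in principle come apart.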

comment by M. Y. Zuo · 2021-12-13T06:57:48.311Z · LW(p) · GW(p)

“It’s not that we need formalizations per se; it’s that we need gears-level understanding. We need to have some understanding of why e.g. modularity shows up in trained/evolved systems, what precisely makes that happen. The need for gears-level understanding, in turn, stems from the need for generalizability.”

In this case the most straightforward approach would be to simply derive e.coli behaviour from basic quantum chemistry, as that is the closest field where fully deterministic simulations are possible, and verifiable.

The gap between simulating hydrogen-oxygen reactions and simulating e.coli moving around in a petri dish is about the same order of magnitude as the gap between simulating e.coli moving around in a petri dish and simulating human level intelligences in a complex society.

However the clear difficulty here is that it will likely take more energy than is available in the Milky Way to carry out such simulations for a reasonable time span (from the quantum level on up).

Thus any plausible approach cannot be an entirely straightforward derivation from first principles; it will invariably produce some quantity of ‘hand-waving’, ‘fudge factors’, preconceptions, biases, etc., that everyone would carry into, or desire from, such endeavours.

i.e. the Gears will only be so at a distance, and will fall apart upon sufficiently motivated close inspection.

Furthermore, possible courses of action and the resultant consequences are intertwined to an extent, due to the human condition, that it does not seem likely we can presume a greater intelligence will not be able to find some loophole to exploit as if it were a black box anyway, because ultimately all human understanding is reliant on ‘black boxes’ somewhere along the way.

The reason I am writing this is because it would be wise to not get your hopes up that all possible schemes of an ‘unfriendly AI’ can be foiled through such methods, or a combination of such methods.

comment by adamShimi · 2021-12-12T01:07:49.238Z · LW(p) · GW(p)

I think having that post on the AF would be very good. ;)

Replies from: johnswentworth

↑ comment by johnswentworth · 2021-12-12T01:43:39.195Z · LW(p) · GW(p)

Didn't want to scare people away with the "may contain technical blah de blah" header. I'll crosspost it to AF in a few days.

comment by Esben Kran (esben-kran) · 2022-05-31T15:58:09.645Z · LW(p) · GW(p)

In general a great piece. One thing that I found quite relatable is the point about the preparadigmatic stage of AI safety going into later stages soon. It feels like this is already happening to some degree where there are more and more projects readily available, more prosaic alignment and interpretability projects at large scale, more work done in multiple directions and bigger organizations having explicit safety staff and better funding in general.

With these facts, it seems like there's bound to be a relatively big phase shift in research and action within the field that I'm quite excited about.

comment by Oliver Sourbut · 2021-12-20T09:58:54.865Z · LW(p) · GW(p)

Regarding modularity - you might be interested in my Motivations, Natural Selection, and Curriculum Engineering > Modularity of Capability Accumulation [AF · GW] from last week - it has a few speculations and (probably more usefully) a couple of references you might like (including one I stole from you).

comment by Astynax · 2021-12-11T02:21:38.525Z · LW(p) · GW(p)

To me the biggest parallel I see in this to existing work is to that of program correctness. It is as hard IMHO to prove program correctness (as in: this program is supposed to sort records/extract every record with inconsistent ID numbers/whatever, and actually does) as it is to write the program correctly; actually, I think it's harder. So I never pursued it. Now we see a really good reason to pursue it. And even w/ conventional, non-AI programs, we have the problem of precisely defining what we want done.
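The point about precisely defining what we want done can be illustrated with the sorting example. The sketch below writes the specification as an executable check, separate from any implementation; `meets_sort_spec` and `lossy_sort` are hypothetical names. Running such a check on a few inputs is testing, not a correctness proof, which would have to cover all inputs.

```python
from collections import Counter
from typing import List, Sequence


def meets_sort_spec(original: Sequence[int], result: Sequence[int]) -> bool:
    """Specification of sorting: the result is ordered AND is a permutation of the input."""
    is_ordered = all(a <= b for a, b in zip(result, result[1:]))
    is_permutation = Counter(original) == Counter(result)
    return is_ordered and is_permutation


def lossy_sort(records: Sequence[int]) -> List[int]:
    """Plausible-looking but wrong implementation: it silently drops duplicate records."""
    return sorted(set(records))


records = [3, 1, 2, 1]
print(meets_sort_spec(records, sorted(records)))
print(meets_sort_spec(records, lossy_sort(records)))
```

Dropping the permutation clause from the specification would let `lossy_sort` pass, which is one small instance of how an underspecified goal admits programs that do the wrong thing.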

Replies from: tailcalled

↑ comment by tailcalled · 2021-12-11T08:48:00.268Z · LW(p) · GW(p)

Proving program correctness seems closer to the MIRI approach to me.

comment by Yonatan Cale (yonatan-cale-1) · 2021-12-29T16:12:30.099Z · LW(p) · GW(p)

Hypothesis regarding your confusion about agency:

Describing humans using a "utility function" or through "goals" is wrong.

Humans are a bunch of habits (like CFAR TAPs) which have some correlation with working towards goals, but this is more of an imperfect rationalization than a reasonable/natural way to describe the situation.

Also yes, we have some part that thinks in goals, but it has a very limited effect on anything (like actions) compared to what we'd naturally think.

Credit to a friend

[I have no idea what I'm talking about, feel free to ignore if this doesn't resonate of course, seemed worth a comment]

comment by Blake H. (blake-h) · 2021-12-21T20:57:00.959Z · LW(p) · GW(p)

I'm perpetually surprised by the amount of thought that goes into this sort of thing coupled with the lack of attention to the philosophical literature on theories of mind and agency in the past, let's just say 50 years. I mean look at the entire debate around whether or not it's possible to naturalize normativity - most of the philosophical profession has given up on this or accepts the question was at best too hard to answer, at worst, ill-conceived from the start.

These literatures are very aware of, and conversant with, the latest and greatest in cogsci and stats. They're not just rehashing old stuff. There is a lot of good work done there on how to make those fundamental questions around agency tractable. There's also an important strain in that literature which claims there are in-principle problems for the very idea of a generalized theory of mind or agency (cf. Putnam, McDowell, Wittgenstein, the entire University of Chicago philosophy department, etc.).

I entered a philosophy PhD program convinced that there were genuine worries here about AGI, machine ethics, etc. I sat in the back of MIRI conferences quietly nodding along. Then I really started absorbing Wittgenstein and what's sometimes called the "resolute reading" of his corpus and I have become convinced that what we call cognition, intelligence, agency, these are all a family of concepts which have a really unique foothold in biological life - that naturalizing even basic concepts like life turns out to be notoriously tricky (because of their normativity). And that the intelligence we recognize in human beings and other organisms is so bound up in our biological forms of life that it becomes very difficult to imagine something without the desire to evade death, nourish itself, and protect a physical body having any of the core agential concepts required to even be recognized as intelligent. Light dawns gradually over the whole. Semantic and meaning holism. Embedded biology. If a lion could speak, we couldn't understand it. All that stuff.

A great place to start is with Jim Conant's "The Search for the Logical Alien" and then get into Wittgenstein's discussions of rule following and ontogenesis. Then have a look at some of the challenges naturalizing normativity in biology. This issue runs deep.

In the end, this idea that intelligence is kind of an isolatable property, independent from the particular forms of life in which it is manifest, is a really, really old idea. Goes back at least to the Gnostics. Every generation recapitulates it in some way. AGI just is that worry re-wrought for the software age.

If anything, this kind of thing may be worth studying just because it calls into question the assumptions of programs like MIRI and their earnest hand-wringing over AGI. At a minimum, it's convinced me that we under-estimate by many orders of magnitude the volume of inputs needed to shape our "models." It starts before we're even born, and we can't discount the centrality of e.g. experience touching things, having fragile bodies, having hunger, etc. in shaping the overall web of beliefs and desires that constitute our agential understanding.

Basically, if you're looking for the foundational issues confronting any attempt to form a gears-level understanding of the kind of goal-directed organization that all life-forms exhibit (e.g. much of biological theory) you would do well to read some philosophy of biology. Peter Godfrey-Smith has an excellent introduction that's pretty even-handed. Natural selection really doesn't bear as much weight as folks in other disciplines would like it to - especially when you realize that the possibility of evolution by drift confounds any attempt at a statistical reduction of biological function.

Hope something in there is interesting for you.

Replies from: Mitchell_Porter, TAG, rsaarelm

↑ comment by Mitchell_Porter · 2021-12-21T23:18:31.589Z · LW(p) · GW(p)

Do you have any thoughts on chess computers, guided missiles, computer viruses, etc, and whether they make a case for worries about AGI, even if you consider them something alien to the human kind of intelligence?

Replies from: blake-h

↑ comment by Blake H. (blake-h) · 2021-12-22T15:27:59.065Z · LW(p) · GW(p)

No - but perhaps I'm not seeing how they would make the case. Is the idea that somehow their existence augurs a future in which tech gets more autonomous to a point where we can no longer control it? I guess I'd say, why should we believe that's true? It's probably uncontroversial to believe many of our tools will get more autonomous - but why should we think that'll lead to the kind of autonomy we enjoy?

Even if you believe that the intelligence and autonomy we enjoy exist on a kind of continuum, from like single celled organisms through chess-playing computers, to us - we'd still need reason to believe that the progress along this continuum will continue at a rate necessary to close the gap between where we sit on the continuum and where our best artifacts currently sit on the continuum. I don't doubt that progress will continue; but even if the continuum view were right, I think we sit way further out on the continuum than most people with the continuum view think. Also, the continuum view itself is very, very controversial. I happen to accept the arguments which aim to show that it faces insurmountable obstacles. The alternate view which I accept is that there's a difference in kind between the intelligence and autonomy we enjoy, and the kind enjoyed by non-human animals and chess-playing computers. Many people think that if we accept that, we have to reject a certain form of metaphysical naturalism (e.g. the view that all natural phenomena can be explained in terms of the basic conceptual tools of physics, maths, and logic).

Some people think that this form of metaphysical naturalism is bedrock stuff; that if we don't accept it, the theists win, blah blah blah, so we must naturalize mentality and agency, it must exist on a continuum, we just need a theory which shows us how. Other people think we can have a non-reductive naturalism which takes as primitive the normative concepts found in biology and psychology. That's the view I hold. So no, I don't think the existence of those things makes a case for worries about AGI. Things which enjoy the kind of mentality and autonomy we enjoy must be like us in many, many ways - that is after all, what enables us to recognize them as having mentality and autonomy like ours. They probably need to have bodies, be mortal, have finite resources, have an ontogenesis period where they go from not like-minded to like-minded (as all children do), have some language, etc.

Also, I think we have to think really carefully about what we mean when we say "human kind of intelligence" - if you read Jim Conant's logically alien thought paper you come to understand why that extra bit of qualification amounts to plainly nonsensical language. There's only intelligence simpliciter; insofar as we're justified in recognizing it as such, it's precisely in virtue of its bearing some resemblance to ours. The very idea of other kinds of intelligence which we might not be able to recognize is conceptually confused (if it bears no resemblance to ours, in virtue of what are we supposed to call it intelligent? Ex hypothesi? If so, I don't know what I'm supposed to be imagining).

The person who wrote this post rightfully calls attention to the conceptual confusions surrounding most casual pre-formal thinking about agency and mentality. I applaud that, and am urging that the most rigorous, well-trodden paths exploring these confusions are to be found in philosophy as practiced (mostly but not exclusively) in the Anglophone tradition over the last 50 years.

That this should be ignored or overlooked out of pretension by very smart people who came up in cogsci, stats, or compsci is intelligible to me; that it should be ignored on a blog that is purportedly about investigating all the available evidence to find quicker pathways to understanding is less intelligible. I would commend everyone with an interest in this stuff to read Stanford Encyclopedia of Philosophy entries on different topics in philosophy of action and philosophy of mind, then go off their bibliographies for more detailed treatments. This stuff is all explored by philosophers really sympathetic to - even involved in - the projects of creating AGI. But more importantly, it is equally explored by those who either think the project is theoretically and practically possible but prudentially mistaken, or by those who think it is theoretically and practically impossible, let alone a prudential possibility.

Most mistakes here are made in the pre-formal thinking. Philosophy is the discipline of making that thinking more rigorous.

Replies from: Mitchell_Porter

↑ comment by Mitchell_Porter · 2021-12-23T23:37:25.092Z · LW(p) · GW(p)

I don't know... If I try to think of Anglophone philosophers of mind who I respect, I think of "Australian materialists" like Armstrong and Chalmers. No doubt there are plenty of worthwhile thoughts among the British, Americans, etc too, but you seem to be promoting something I deplore, the attempt to rule out various hard problems and unwelcome possibilities, by insisting that words shouldn't be used that way. Celia Green even suggested that this 1984-like tactic could be the philosophy of a new dark age in which inquiry was stifled, not by belief in religion, but by "belief in society"; but perhaps technology has averted that future. Head-in-the-sand anthropocentrism is hardly tenable in a world where, already, someone could hook up a GPT3 chatbot to a Boston Dynamics chassis, and create an entity from deep within the uncanny valley.

Replies from: blake-h

↑ comment by Blake H. (blake-h) · 2021-12-24T14:52:16.039Z · LW(p) · GW(p)

Totally get it. There are lots of folks practicing philosophy of mind and technology today in that Aussie tradition who I think take these questions seriously and try to cash out what we mean when we talk about agency, mentality, etc. as part of their broader projects.

I'd resist your characterization that I'm insisting words shouldn't be used a particular way, though I can understand why it might seem that way. I'm rather hoping to shed more light on the idea raised by this post that we don't actually know what many of these words even mean when they're used in certain ways (hence the author's totally correct point about the need to clarify confusions about agency while working on the alignment problem). My whole point in wading in here is just to point out to a thoughtful community that there's a really long rich history of doing just this, and even if you prefer the answers given by Aussie materialists, it's even better to understand those positions vis-a-vis their present and past interlocutors. If you understand those who disagree with them, and can articulate those positions in terms they'd accept, you understand your preferred positions even better. I wouldn't say I deplore it, but I am always mildly amused when cogsci, compsci, and stats people start wading into plainly philosophical waters ("sort out our fundamental confusions about agency") and talk as if they're the first ones to get there - or the only ones presently splashing around. I guess I would have thought (perhaps naively) that on a site like this people would be at least curious to see what work has already been done on the questions so they can accelerate their inquiry.

Re: ruling out hard problems - lots of philosophy is the attempt to better understand the problem's framing such that it either reduces to a different problem, or disappears altogether. I'd urge you to see this as an example of that kind of thing, rather than ruling out certain questions from the gun.

And on anthropocentrism - not sure what the point is supposed to be here, but perhaps it's directed at the "difference in kind" statements I made above. If so, I'd hope we can see light between treating humans as if they were the center of the universe and recognizing that there are at least putatively qualitative differences between the type of agency rational animals enjoy and the type of agency enjoyed by non-rational animals and artifacts. Even the aussie materialists do that - and then set about trying to form a theory of mind and agency in physical terms because they rightly see those putatively qualitative differences as a challenge to their particular form of metaphysical naturalism.

So look, if the author of this post is really serious about (1) they will almost certainly have to talk about what we mean when we use agential words. There will almost certainly be disagreements about whether their characterizations (A) fit the facts, and (B) are coherent with the rest of our beliefs. I don't want to come even close to implying that folks in compsci, cogsci, stats, etc. can't do this - they certainly can. I'm just saying that it's really, really conspicuous to not do so in dialogue with those whose entire discipline is devoted to that task. Philosophers are really good at testing our accounts of an agential concept by saying things like, "okay let's run with this idea of yours that we can define agency and mentality in terms of some bayesian predictive processing, or in terms of planning states, or whatever, but to see if that view really holds up, we have to be able to use your terms or some innocent others to account for all the distinctions we recognize in our thought and talk about minds and agents." That's the bulk of what philosophers of mind and action do nowadays - they take someone's proposal about a theory of mind or action and test whether it can give an account of some region of our thought and talk about minds and agents. If it can't, they either propose addenda, push the burden back to the theorist, or point out structural reasons why the theory faces general obstacles that seem difficult to overcome.

Here's some recent work on the topic, just to make it plain that there are philosophers working on these questions:

https://link.springer.com/article/10.1007%2Fs10676-021-09611-0

https://link.springer.com/article/10.1007/s11023-020-09539-2

And a great article by a favorite philosopher of action on three competing theories of human agency

https://onlinelibrary.wiley.com/doi/10.1111/nous.12178

Hope some of that is interesting, and appreciate the response.

Cheers

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-12-31T22:31:09.789Z · LW(p) · GW(p)

Those articles are all paywalled; got free versions? I tried Sci-Hub, no luck.

Replies from: gwern

↑ comment by gwern · 2022-01-01T01:26:53.291Z · LW(p) · GW(p)

? The second is already open-access, and the third both works in SH & GS (with 2 different PDF links). Only the first link fails in SH. (But what an abstract: "I also argue that if future generally intelligent AI possess a predictive processing cognitive architecture, then they will come to share our pro-moral motivations (of valuing humanity as an end; avoiding maleficent actions; etc.), regardless of their initial motivation set." Wow.)

Replies from: daniel-kokotajlo, Jon Garcia

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-01-01T03:56:22.350Z · LW(p) · GW(p)

Huh, I tried the first and third in SH, maybe I messed up somehow. My bad. Thanks!

I still am interested in the first (on the principle that maybe, just maybe, it's the solution to all our problems instead of being yet another terrible argument made by philosophers about why AIs will be ethical by default if only we do X... I think I've seen two already) and would like to have access.

↑ comment by Jon Garcia · 2022-01-01T14:01:57.988Z · LW(p) · GW(p)

I can see how that would work. The author needs to be careful, though. Predictive processing may be a necessary condition for robust AGI alignment, but it is not per se a sufficient condition.

First of all, that only works if you give the AGI strong inductive priors for detecting and predicting human needs, goals, and values. Otherwise, it will tend to predict humans as though we are just "physical" systems (we are, but I mean modeling us without taking our sentience and values into account), no more worthy of special care than rocks or streams.

Second of all, this only works if the AGI has a structural bias toward treating the needs, goals, and values that it infers from predictive processing as its own. Otherwise, it may understand how to align with us, but it won't care by default.

↑ comment by TAG · 2021-12-24T19:14:11.857Z · LW(p) · GW(p)

Why was this downvoted? Sheesh!

↑ comment by rsaarelm · 2021-12-24T09:39:43.445Z · LW(p) · GW(p)

What do you mean by "naturalize" as a verb? What is "naturalizing normativity"?

Some people think that this form of metaphysical naturalism is bedrock stuff; that if we don’t accept it, the theists win, blah blah blah, so we must naturalize mentality and agency, it must exist on a continuum, we just need a theory which shows us how. Other people think we can have a non-reductive naturalism which takes as primitive the normative concepts found in biology and psychology.

Does this amount to you thinking that humans are humans because of some influence from outside of fundamental physics, which computers and non-human animals don't share?

Replies from: TAG, blake-h

↑ comment by TAG · 2021-12-24T19:26:36.129Z · LW(p) · GW(p)

Like, yeah. People can be really impressive, but unless you want to make an explicit case for the contrary, people here still think people are made of parts and there exists some way to go from a large cloud of hydrogen to people.

What's important is that it means coming up with a detailed, step-by-step explanation of some high-level concepts like life, shouldness, and intelligence. Just believing that they are natural is not the required explanation. Believing they are unnatural is not the only reason to disbelieve in the possibility of a reduction.

Reductionism is not just the claim that things are made out of parts. It's a claim about explanation, and humans might not be smart enough to perform certain reductions.

Replies from: rsaarelm

↑ comment by rsaarelm · 2021-12-25T07:39:47.024Z · LW(p) · GW(p)

Reductionism is not just the claim that things are made out of parts. It’s a claim about explanation, and humans might not be smart enough to perform certain reductions.

So basically the problem is that we haven't got the explanation yet and can't seem to find it with a philosopher's toolkit? People have figured out a lot of things (electromagnetism, quantum physics, airplanes, semiconductors, DNA, visual cortex neuroscience) by mucking with physical things while having very little idea of them beforehand by just being smart and thinking hard. Figuring out how human concepts ground out in physics seems to have a similar blocker: we still don't have good enough neuroscience to do a simulation of how the brain goes from neurons to high-level thoughts (where you could observe a simulated brain-critter doing human-like things in a VR environment to tell you're getting somewhere even when you haven't reverse-engineered the semantics of the opaque processes yet). People having that kind of model to look at and trying to make sense of it could come up with all sorts of new unobvious useful concepts, just like people trying to figure out quantum mechanics came up with all sorts of new unobvious useful concepts.

But this doesn't sound like a fun project for professional philosophers, a research project like that would need many neuroscientists and computer scientists and not very many philosophers. So if philosophers show up, look at a project like that, and go "this is stupid and you are stupid, go read more philosophy", I'm not sure they're doing it out of purely dispassionate pursuit of wisdom.

Replies from: TAG

↑ comment by TAG · 2021-12-25T14:21:50.308Z · LW(p) · GW(p)

Philosophers are not of a single mind. Some are reductionists, some are illusionists, and so on.

Replies from: blake-h

↑ comment by Blake H. (blake-h) · 2021-12-27T21:41:07.428Z · LW(p) · GW(p)

Good - though I'd want to clarify that there are some reductionists who think that there must be a reductive explanation for all natural phenomena, even if some will remain unknowable to us (for practical or theoretical reasons).

Other non-reductionists believe that the idea of giving a causal explanation of certain facts is actually confused - it's not that there is no such explanation, it's that the very idea of giving certain kinds of explanation means we don't fully understand the propositions involved. E.g. if someone were to ask why certain mathematical facts are true, hoping for a causal explanation in terms of brain-facts or historical-evolutionary facts, we might wonder whether they understood what math is about.

↑ comment by Blake H. (blake-h) · 2021-12-24T14:11:30.307Z · LW(p) · GW(p)

Naturalizing normativity just means explaining normative phenomena in terms of other natural phenomena whose existence we accept as part of our broader metaphysics. E.g. explaining biological function in terms of evolution by natural selection, where natural selection is explained by differential survival rates and other statistical facts. Or explaining facts about minds, beliefs, attitudes, etc., in terms of non-homuncular goings-on in the brain. The project is typically aimed at humans, but shows up as soon as you get to biology and the handful of normative concepts (life, function, health, fitness, etc.) that constitute its core subject matter.

Hope that helps.

Replies from: rsaarelm

↑ comment by rsaarelm · 2021-12-24T16:44:42.902Z · LW(p) · GW(p)

I don't think I've seen the term "normative phenomena" before. So basically normative concepts are concepts in everyday language ("life", "health"), which get messy if you try to push them too hard? But what are normative phenomena then? We don't see or touch "life" or "health", we see something closer to the actual stuff going on in the world and then we come up with everyday word-concepts for it that sort of work until they don't.

It's not really helping in that I still have no real intuition about what you're going on about, and your AI critique seems to be aimed at something from 30 years ago instead of contemporary stuff like Omohundro's Basic AI Drives paper (you describe AIs as being "without the desire to evade death, nourish itself, and protect a physical body", the paper's point is that AGIs operating in the physical world would have exactly that) or the whole deep learning explosion with massive datasets of the last few years ("we under-estimate by many orders of magnitude the volume of inputs needed to shape our “models.”", right now people are in a race to feed ginormous input sets to deep learning systems and probably aren't stopping anytime soon).

Like, yeah. People can be really impressive, but unless you want to make an explicit case for the contrary, people here still think people are made of parts and there exists some way to go from a large cloud of hydrogen to people. If you think there's some impossible gap between the human and the nonhuman worlds, then how do you think actual humans got here? Right now you seem to be just giving some sort of smug shrug of someone who on one hand doesn't want to ask that question themselves because it's corrosive to dignified pre-Darwin liberal arts sensibilities, and on the other hand tries to hint at people genuinely interested in the question that it's a stupid question to ask and they should have read better scholarship to convince themselves of that.

Replies from: blake-h

↑ comment by Blake H. (blake-h) · 2021-12-27T21:32:11.977Z · LW(p) · GW(p)

If you think there's some impossible gap between the human and the nonhuman worlds, then how do you think actual humans got here?

There are many types of explanatory claims in our language. Some are causal (how did something come to be), others are constitutive (what is it to be something), others still are normative (why is something good or right). Most mathematical explanation is constitutive, most action explanation is rational, and most material explanation is causal. It's totally possible to think there's a plain causal explanation about how humans evolved (through a combination of drift and natural selection, in which proportion we will likely never know) - while still thinking that the prospects for coming up with a constitutive explanation of normativity are dim (at best) or outright confused (at worst).

A common project shape for reductive naturalists is to try and use causal explanations to form a constitutive explanation for the normative aspects of biological life. If you spend enough time studying the many historical attempts that have been made at these explanations, you begin to see this pattern emerge where a would-be reductive theorist will either smuggle in a normative concept to fill out their causal story (thereby begging the question), or fail to deliver a theory which has the explanatory power to make basic normative distinctions which we intuitively recognize and that the theory should be able to account for (there are several really good tests out there for this - see the various takes on rule-following problems developed by Wittgenstein). Terms like "information" "structure" "fitness" "processing" "innateness" and the like all are subject to this sort of dilemma if you really put them under scrutiny. Magic non-natural stuff (like souls or spirit or that kind of thing) is often a device that people have reached for when forced onto this dilemma. Postulating that kind of thing is just the other side of the coin, and makes exactly the same error.

So I guess I'd say, I find it totally plausible that normative phenomena could be sui generis in much the same way that mathematical phenomena are, without finding it problematic that natural creatures can come to understand those phenomena through their upbringing and education. Some people get wrapped up in bewilderment about how this could even be possible, and I think there's good reason to believe that bewilderment reflects deep misunderstandings about the phenomena themselves, the recourse for which is sometimes called philosophical therapy.

Another point I want to be clear on:

right now people are in a race to feed ginormous input sets to deep learning systems and probably aren't stopping anytime soon

I don't think it's in-principle impossible to get from non-intelligent physical stuff to intelligent physical stuff by doing this - and I'm actually sympathetic to the biological anchors approach described here which was recently discussed on this site [LW(p) · GW(p)]. I just think that the training runs will need to pay the computational costs for evolution to arrive at human brains, and for human brains to develop to maturity. I tend to think - and I think good research in child development backs this up - that the structure of our thought is inextricably linked to our physicality. If anything, I think that'd push the development point out past Karnofsky's 2093 estimate. Again, it's clearly not in-principle impossible for a natural thing to get the right amount of inputs to become intelligent (it clearly is possible, every human does it when they go from babies to adults); I just often think we underestimate how deeply important our biological histories (evolutionary and ontogenetic) are in this process. So I hope my urgings don't come across as advocating for a return to some kind of pre-Darwinian darkness; if anything I hope they can be seen as advocating for an even more thorough-going biological understanding. That must start with taking very seriously the problems introduced by drift, and the problems with the attempts to derive the normative aspects of life from a concept like genetic information (one which is notoriously subject to the dilemma above).

Thanks for the tip on the Basic AI Drives paper. I'll give it a read. My suspicion is that once the "basic drives" are specified comprehensively enough to yield an intelligible picture of the agent in question, we'll find that they're so much like us that the alignment problem disappears; they can only be aligned. That's what someone argues in one of the papers I linked above. A separate question I've wondered about, and please point me to any good discussion of this, is how our thinking about AI alignment compares with intelligent alien alignment.

Finally, to answer this:

So basically normative concepts are concepts in everyday language ("life", "health"), which get messy if you try to push them too hard?

No - normative concepts are a narrower class than the messy ones, though many find them messy. Normative concepts are those which structure our evaluative thought and talk (about the good, the bad, the ugly, etc.).

Anyway, good stuff. Keep the questions coming, happy to answer.

Replies from: rsaarelm

↑ comment by rsaarelm · 2021-12-28T07:44:12.936Z · LW(p) · GW(p)

It’s totally possible to think there’s a plain causal explanation about how humans evolved (through a combination of drift and natural selection, in which proportion we will likely never know) - while still thinking that the prospects for coming up with a constitutive explanation of normativity are dim (at best) or outright confused (at worst).

If we believe there is a plain causal explanation, that rules out some explanations we could imagine. It shouldn't now be possible for humans to have been created by a supernatural agency (as was widely thought in Antiquity, the Middle Ages or Renaissance when most of the canon of philosophy was developed), and basic human functioning probably shouldn't involve processes wildly contrary to known physics (still believed by some smart people like Roger Penrose).

The other aspect is computational complexity. If we assume the causal explanation, we also get quantifiable limits [? · GW] for how much evolutionary work and complexity can have gone into humans. People are generally aware that there's a lot of it, and a lot less aware that it's quantifiably finite. The size of the human genome, which we can measure, creates one hard limit on how complex a human being can be. The limited amount of sensory information a human can pick up growing to adulthood and the limited amount of computation the human brain can do during that time creates another. Evolutionary theory also gives us a very interesting extra hint that everything you see in nature should be reachable by a very gradual ascent of slightly different forms, all of which need to be viable and competitive, all the way from the simplest chemical replicators. So that's another limit to the bin, whatever is going on with humans is probably not something that has to drop out of nowhere as a ball of intractable complexity, but can be reached by some series of small enough to be understandable improvements to a small enough to be understandable initial lifeform.
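The genome bound mentioned above can be made concrete with a rough back-of-envelope sketch. The figure of about 3.1 billion base pairs for the haploid human genome and the encoding of 2 bits per base are assumptions added here for illustration, not numbers taken from the comment, and the result is only a crude upper bound on raw sequence information, not a measure of how complex a human "really" is.

```python
# Back-of-envelope upper bound on the raw information in the human genome.
# Assumptions (illustrative only, not from the original comment):
#   - roughly 3.1e9 base pairs in the haploid genome
#   - 4 possible bases per position, so at most log2(4) = 2 bits per base pair
import math

BASE_PAIRS = 3.1e9
BITS_PER_BASE = math.log2(4)


def genome_upper_bound_bytes(base_pairs: float = BASE_PAIRS) -> float:
    """Return the maximum number of bytes needed to write out the sequence."""
    return base_pairs * BITS_PER_BASE / 8


if __name__ == "__main__":
    megabytes = genome_upper_bound_bytes() / 1e6
    print(f"Upper bound: roughly {megabytes:.0f} MB of raw sequence")
```

Under these assumptions the bound comes out to somewhat under a gigabyte, which is the sense in which the complexity is large but quantifiably finite.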

The entire sphere of complex but finite computational processes has been a blind spot for philosophy. Nobody really understood it until computers had become reasonably common. (Dennett talks about this in Darwin's Dangerous Idea when discussing Conway's Game of Life.) Actually figuring things out from the opaque blobs of computation like human DNA is another problem of course. If you want to have some fun, you can reach for Rice's theorem (basically following from Turing's halting problem) which shows that no general procedure can decide any non-trivial semantic property from the code of an arbitrary undocumented computer program. Various existing property inferrer groups like hackers and biologists will nod along and then go back to poking the opaque mystery blobs with various clever implements and taking copious notes of what they do when poked, even though full logical closure is not available.

So coming back to the problem,

If you spend enough time studying the many historical attempts that have been made at these explanations, you begin to see this pattern emerge where a would-be reductive theorist will either smuggle in a normative concept to fill out their causal story (thereby begging the question), or fail to deliver a theory which has the explanatory power to make basic normative distinctions which we intuitively recognize and that the theory should be able to account for (there are several really good tests out there for this—see the various takes on rule-following problems developed by Wittgenstein). Terms like “information” “structure” “fitness” “processing” “innateness” and the like all are subject to this sort of dilemma if you really put them under scrutiny.

Okay, two thoughts about this. First, yes. This sounds like pretty much the inadequacy of mainstream philosophy argument that was being made on LessWrong back in the Sequences days. The lack of satisfactory descriptions of human-level concepts that actually bottom out in reductive gears is real, but the inability to come up with the descriptions might be pretty much equivalent to the inability to write an understandable human-level AI architecture. That might be impossible, or it might be doable, but it doesn't seem like we'll find out by watching philosophers keep doing things with present-day philosopher toolkits. The people poking at the stuff are neuroscientists and computer scientists, and there's a new kind of looking at a "mechanized" mind from the outside aspect to that work (see for instance the predictive coding stuff on the neuroscience side) that seems very foreign to how philosophy operates.

Second thing is, I read this and I'm asking "so, what's the actual problem we're trying to solve?" You seem to be talking from the point of general methodological unhappiness with philosophy, where the problem is something like "you want to do philosophy as it's always been done and you want it to get traction at the cutting edge of intellectual problems of the present day". Concrete problems might be "understand how humans came to be and how they are able to do all the complex human thinking stuff", which is a lot of neuroscience plus some evolutionary biology, "build a human-level artificial intelligence that will act in human interests no matter how powerful it is", which, well, the second part is looking pretty difficult so the ideal answer might be "don't", but the first part seems to be coming along with a whole lot of computer science and not having needed a lot of input from philosophy so far. "Help people understand their place in the world, themselves and find life satisfaction" is a different goal again, and something a lot of philosophy used to be about. Taking the high-level human concepts that we don't have satisfactory reductions for yet as granted could work fine at this level. But there seems to be a sense of philosophers becoming glorified talk therapists here, which doesn't really feel like a satisfactory answer either.

Replies from: blake-h

↑ comment by Blake H. (blake-h) · 2021-12-28T16:29:32.975Z · LW(p) · GW(p)

Yeah, I agree with a lot of this. Especially:

If you want to have some fun, you can reach for Rice's theorem (basically following from Turing's halting problem) which shows that you can't logically infer any semantic properties whatsoever from the code of an undocumented computer program. Various existing property inferrer groups like hackers and biologists will nod along and then go back to poking the opaque mystery blobs with various clever implements and taking copious notes of what they do when poked, even though full logical closure is not available.

I take it that this is how most progress in artificial intelligence, neuroscience, and cogsci has proceeded (and will continue to proceed). My caution - and whole point in wading in here - is just that we shouldn't expect progress by trying to come up with a better theory of mind or agency, even with more sophisticated explanatory tools.

I think it's totally coherent, and even likely, that future artificial agents (generally intelligent or not) will be created without a general theory of mind or action.

In this scenario, you get a complete causal understanding of the mechanisms that enable agents to become minded and intentionally active, but you still don't know what that agency or intelligence consists in beyond our simple, non-reductive folk-psychological explanations. A lot of folks in this scenario would be inclined to say, "who cares, we got the gears-level understanding" and I guess the only people who would care would be those who wanted to use the reductive causal story to tell us what it means to be minded. The philosophers I admire (John McDowell is the best example) appreciate the difference between causal and constitutive explanations when it comes to facts about minds and agents, and urge that progress in the sciences is hindered by running these together. They see no obstacle to technical progress in neuroscientific understanding or artificial intelligence; they just see themselves as sorting out what these disciplines are and are not about. They don't think they're in the business of giving constitutive explanations of what minds and agents are; rather, they're in the business of discovering what enables minds and agents to do their minded and agential work. I think this distinction is apparent even with basic biological concepts like life. Biology can give us a complete account of the gears that enable life to work as it does without shedding any light on what makes it the case that something is alive, functioning, fit, goal-directed, successful, etc. But that's not a problem at all if you think the purpose of biology is just to enable better medicine and engineering (like making artificial life forms or agents). For a task like "given a region of physical space, identify whether there's an agent there", I don't think we should expect any theory, philosophical or otherwise, to yield a solution. I'm sure we can build artificial systems that can do it reliably (probably already have some), but it won't come by way of understanding what makes an agent an agent.

Insofar as one hopes to advance certain engineering projects by "sorting out fundamental confusions about agency" I just wanted to offer that (1) there's a rich literature in contemporary philosophy, continuous with the sciences, about different approaches to doing just that; and (2) that there are interesting arguments in this literature which aim to demonstrate that any causal-historical theory of these things will face an apparently intractable dilemma: either beg the question or be unable to make the distinctions needed to explain what agency and mentality consist in.

To summarize the points I've been trying to make (meanderingly, I'll admit): On the one hand, I applaud the author for prioritizing that confusion-resolution; on the other hand, I'd urge them not to fall into the trap of thinking that confusion-resolution must take the form of stating an alternative theory of action or mind. The best kind of confusion-resolution is the kind that Wittgenstein introduced into philosophy, the kind where the problems themselves disappear - not because we realize they're impossible to solve with present tools and so we give up, but because we realize we weren't even clear about what we were asking in the first place (so the problems fail to even arise). In this case, the problem that's supposed to disappear is the felt need to give a reductive causal account of minds and agents in terms of the non-normative explanatory tools available from maths and physics. So, go ahead and sort out those confusions, but be warned about what that project involves, who has gone down the road before, and the structural obstacles they've encountered both in and outside of philosophy so that you can be clear-headed about what the inquiry can reasonably be expected to yield.

That's all I'll say on the matter. Great back and forth, I don't think there's really much distance between us here. And for what it's worth, mine is a pretty niche view in philosophy, because taken to its conclusion it means that the whole pursuit of trying to explain what minds and agents are is just confused from the start - not limited by the particular set of explanatory tools presently available - just conceptually confused. Once that's understood, one stops practicing or funding that sort of work. It is totally possible and advisable to keep studying the enabling gears so we can do better medicine and engineering, but we should get clear on how that medical or engineering understanding will advance and what those advances mean for those fundamental questions about what makes life, agents, and minds what they are. Good philosophy helps to dislodge us from the grip of expecting anything non-circular and illuminating in answer to those questions.
