Internal Interfaces Are a High-Priority Interpretability Target

post by Thane Ruthenis · 2022-12-29T17:49:27.450Z · LW · GW · 6 comments

Contents

  1. A Cryptic Analogy
  2. Interface Theory
  3. Internal Interfaces
  4. What Internal Interfaces Can We Expect?
  5. Research Priorities

tl;dr: ML models, like all software, and like the NAH [LW · GW] would predict, must consist of several specialized "modules". Such modules would form interfaces between each other, and exchange consistently-formatted messages through these interfaces. Understanding the internal data formats of a given ML model should let us comprehend an outsized amount of its cognition, and allow us to flexibly intervene on it as well.


1. A Cryptic Analogy

Let's consider three scenarios. In each of them, you're given the source code of a set of unknown programs, and you're tasked with figuring out their exact functionality. Details vary:

  1. In the first scenario, the programs are written in some known programming language, e. g. Python.
  2. In the second scenario, the programs were randomly generated by perturbing machine code until it happened to end up in a configuration that, when run, instantiates a process externally indistinguishable from a useful intelligently-written program.
  3. In the third scenario, the programs are written in a programming language that's completely unfamiliar to you (or to anyone else).

In the first scenario, the task is all but trivial. You read the source code, make notes on it, run parts of it, and comprehend it. It may not be quick, but it's straightforward.

The second scenario is a nightmare. There must be some structure to every program's implementation — something like the natural abstraction hypothesis [LW · GW] must still apply, there must be modules in this mess of a code that can be understood separately, etc. There is some high-level structure that you can parcel out into tiny pieces that can fit into a human mind [LW(p) · GW(p)]. The task is not impossible.

But suppose you've painstakingly reverse-engineered one of the programs this way. You move on to the next one, and... Yup, you're essentially starting from zero. Well, you've probably figured out something about natural abstractions when working on your first reverse-engineering, so it's somewhat easier, but still a nightmare. Every program is structured in a completely different way, so you have to start from the fundamentals every time.

The third scenario is a much milder nightmare. You don't focus on reverse-engineering the programs here; first, you reverse-engineer the programming language. It's a task that may be as complex as reverse-engineering one of the individual programs from (2), but once you've solved it, you're essentially facing the same problem as in (1): a problem that's comparatively trivial.

The difference between (2) and (3) is: 

(2) is a parallel to the general problem of interpretability, different programs being different ML models. Is there some interpretability problem that's isomorphic to (3), however?

I argue there is.


2. Interface Theory

Suppose that we have two separate entities with different specializations: they both can do some useful "work", but there are some types of work that only one of them can perform. Suppose that they want to collaborate: combine their specializations to do tasks neither entity can carry out alone. How can they do so?

For concreteness, imagine that the two entities are a Customer, which can provide any resource from the set of resources $R$, and an Artist, which can make any sculpture from some set $S$ given some resource budget. They want to "trade": there's some sculpture $s \in S$ the Customer wants made, and there are some resources $r \in R$ the Artist needs to make it. How can they carry out such exchanges?

The Customer needs to send some message $m_s$ to the Artist, such that it would uniquely identify $s$ among the members of $S$. The Artist, in turn, would need to respond with some message $m_r$, which uniquely identifies $r$ in the set $R$.

That implies an interface: a data channel between the two entities, such that every message passed along this channel from one entity to another uniquely identifies some action the receiving entity must take in response.

This, in turn, implies the existence of interpreters: some protocols that take in a message received by an entity, and map it onto the actions available to that entity (in this case, making a particular sculpture or passing particular resources).

For example, if the sets of statues and resources are sufficiently small, both entities can agree on some orderings of these sets, and then just pass numbers. "1" would refer to "statue 1" and "resource package 1", and so on. These protocols would need to be obeyed consistently: they'd have to always use the same ordering, such that the Customer saying "1" always means the same thing. Otherwise, this system falls apart.
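
As a minimal sketch of this numbering protocol (the catalogues, the function names, and the "sculpture i costs package i" pricing rule are all made up for illustration):

```python
# Hypothetical catalogues; both parties must agree on these orderings in advance.
SCULPTURES = ["horse", "dancer", "fountain"]
RESOURCE_PACKAGES = ["marble + 10 days", "bronze + 20 days", "granite + 30 days"]


def encode(item, ordering):
    """Sender side: the message is the item's 1-based position in the shared ordering."""
    return ordering.index(item) + 1


def decode(message, ordering):
    """Receiver's interpreter: map a number back onto an item."""
    return ordering[message - 1]


# Customer -> Artist: "I want sculpture number 2".
request = encode("dancer", SCULPTURES)
wanted = decode(request, SCULPTURES)

# Artist -> Customer. Assumed pricing rule, for illustration only:
# sculpture i costs resource package i.
quote = encode(RESOURCE_PACKAGES[request - 1], RESOURCE_PACKAGES)
payment = decode(quote, RESOURCE_PACKAGES)

# If the Artist silently switches to a different ordering, the same message
# now picks out a different sculpture, and the system falls apart.
rotated = SCULPTURES[1:] + SCULPTURES[:1]
misread = decode(request, rotated)
```

Here `wanted` recovers the Customer's intent only because both sides decode against the same list; `misread` shows what happens when they don't.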

Now, suppose that a) the sets are really quite large, such that sorting them would take a very long time, yet b) members of sets are modular, in the sense that each member of a set is a composition of members of smaller sets. E. g., each sculpture can be broken down into the number of people to be depicted in it, the facial expressions each person has, what height each person is, etc.

In this case, we might see this modularity reflected in the messages. Instead of specifying statues/resources holistically, they'd specify modules: "number of people: N, expression on person 1: type 3...", et cetera.

To make use of this, the Artist's interpreter would need to know what "expression on person 1" translates to, in terms of specifications over the set of sculptures. And as we've already noted, this meaning would need to be stable across time: "expression on person 1" would need to always mean the same thing, from request to request.

That, in turn, would imply data formats. Messages would have some consistent structure, such that, if you knew the rules by which these structures are generated, you'd be able to understand any message exchanged between the Customer and the Artist at any point.
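
A toy version of such a data format might look like the following. The field names, the `key=value;...` wire format, and the expression/height vocabularies are assumptions chosen for illustration; the point is only that anyone who knows the generating rules can read every message.

```python
from dataclasses import dataclass

# Hypothetical component vocabularies shared by Customer and Artist.
EXPRESSIONS = {1: "neutral", 2: "smiling", 3: "stern"}
HEIGHTS = {1: "short", 2: "average", 3: "tall"}


@dataclass(frozen=True)
class Person:
    expression: int  # key into EXPRESSIONS
    height: int      # key into HEIGHTS


def serialize(people):
    """Customer side: describe a sculpture module by module, not holistically."""
    fields = [f"n={len(people)}"]
    for i, person in enumerate(people, start=1):
        fields.append(f"p{i}.expr={person.expression}")
        fields.append(f"p{i}.height={person.height}")
    return ";".join(fields)


def interpret(message):
    """Artist's interpreter: works on any message that follows the format."""
    fields = dict(field.split("=") for field in message.split(";"))
    count = int(fields["n"])
    return [
        {
            "expression": EXPRESSIONS[int(fields[f"p{i}.expr"])],
            "height": HEIGHTS[int(fields[f"p{i}.height"])],
        }
        for i in range(1, count + 1)
    ]


message = serialize([Person(expression=3, height=2), Person(expression=1, height=3)])
spec = interpret(message)
```

Note that `interpret` never needs an enumeration of all possible sculptures: it only needs the small per-field vocabularies, which is what makes the modular format cheap.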


3. Internal Interfaces

Suppose that we have an ML model with two different modules, each of them specialized for performing different computations. Inasmuch as they'll work together, we'll see the same dynamics between them as between the Customer and the Artist: they'll arrive at some consistent data formats for exchanging information. They'll form an internal interface.

Connecting this with the example from Section 1, I think this has interesting implications for interpretability. Namely: internal interfaces are the most important interpretability targets.

Why? Two reasons:

Cost-efficiency. Understanding a data format would allow us to understand everything that can be specified using this data format, which may shed light on entire swathes of a model's cognition. And as per the third example in Section 1, it should be significantly easier than understanding an arbitrary feature, since every message would follow the same high-level structure.

Editability. Changing an ML model's weights while causing the effects you want and only the effects you want is notoriously difficult: often, a given feature is only responsible for a part of the effect you're ascribing to it, or it has completely unexpected side-effects. An interface, however, is a channel specifically built for communication between two components. It should have minimal side-effects, and intervening on it should have predictable results.
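
To illustrate the editability point with a deliberately simplified analogy (in a real network the "message" would be some set of activations rather than a dictionary, and all names below are hypothetical), consider two modules that communicate only through a fixed-format message:

```python
def perception(observation):
    """Module A: summarise a raw observation into the interface format.

    Assumed format: a dict with exactly the keys "object" and "distance".
    """
    return {"object": observation["label"], "distance": round(observation["range_m"])}


def planner(message):
    """Module B: acts only on what arrives through the interface."""
    if message["object"] == "obstacle" and message["distance"] < 5:
        return "stop"
    return "advance"


def run(observation, intervene=None):
    """Run A then B, optionally editing the message in transit."""
    message = perception(observation)
    if intervene is not None:
        message = intervene(message)
    return planner(message)


obs = {"label": "obstacle", "range_m": 3.2}
baseline = run(obs)

# Intervening on the interface: overwrite one field, leave both modules untouched.
patched = run(obs, intervene=lambda m: {**m, "distance": 50})
```

Because the planner's behaviour depends on the observation only via the message, editing one field of the message changes the outcome in a way we can predict from the format alone, without touching either module's internals.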

As far as controlling the AI's cognition goes, internal interfaces are the most high-impact points.


4. What Internal Interfaces Can We Expect?

This question is basically synonymous with "what modules can we expect?". There are currently good arguments for the following: a world-model (WM), a set of shards, and a general-purpose search/planning module (GPS).

Here's a breakdown of the potential interfaces as I see them:

  1. WM → Shards. Each shard would learn to understand some part of the world-model, and that's what would provide the "context" for their activations.
    • However, I expect this is less of an interface, and more of a set of interfaces. I expect that each shard learns to interface with the WM in its own way, so they wouldn't follow consistent input formats.
  2. Shards → Shards: a "shard coordination mechanism" (SCM). Shards necessarily conflict as part of their activity, bidding for mutually incompatible plans/actions, and there needs to be some mechanism for resolving such conflicts. That mechanism would need to know about any possible conflicts, however, meaning all the shards would need to interface with it and signal to it how much each of them wants to fire in a given case.
  3. SCM → GPS. Shards influence and shape plan-making, and I suspect [LW · GW] that GPS would appear after the SCM, so the GPS would just be plugged into an already-existing mechanism for summarizing shard activity.
  4. WM → GPS. The GPS is, in a sense, just a principled method for drawing upon the knowledge in the world-model, so interfacing with the WM is its primary function.
  5. GPS → WM. I suspect that another primary function of the GPS is to organize/modify the world-model at runtime, such as by discovering new abstractions or writing new heuristics to it. As such, it'd need write-access as well.
  6. GPS → GPS. The planner would need to understand its own past thoughts, for the purposes of long-term planning. (Though part/most of them might be just written to the world-model directly, i. e. there's no separate format for GPS' plans.)
  7. World → Agent. In a sense, the world-model isn't just a module, it's itself an interface between the ground truth of the world and the decision-making modules of the agent (shards and the GPS)! Thus, we can expect all elements of the world-model (concepts, abstractions...) to be consistently-formatted.

"Cracking" any of these interfaces should give us a lot of control over the ML model: the ability to identify specific shards, spoof or inhibit their activations, literally read off the AI's plans, et cetera.
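
As a rough sketch of what interface (2) could look like if we had decoded it: every shard reports to the SCM in the same format, and the SCM resolves conflicts from those reports alone. The `Bid` format, the shard names, and the "sum strengths, pick the maximum" rule are all assumptions made for illustration, not a claim about how a trained model actually does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    """Hypothetical consistent format every shard uses to talk to the SCM."""
    shard: str
    action: str
    strength: float


def coordinate(bids):
    """Toy SCM: sum bid strengths per action and pick the strongest action."""
    totals = {}
    for bid in bids:
        totals[bid.action] = totals.get(bid.action, 0.0) + bid.strength
    return max(totals, key=totals.get)


def inhibit(bids, shard):
    """An intervention on the interface: drop one shard's messages."""
    return [bid for bid in bids if bid.shard != shard]


bids = [
    Bid("hunger", "eat", 0.9),
    Bid("curiosity", "explore", 0.5),
    Bid("fatigue", "explore", 0.25),
]

chosen = coordinate(bids)
after_inhibition = coordinate(inhibit(bids, "hunger"))
```

Knowing the bid format is what makes `inhibit` (or, symmetrically, spoofing an extra bid) a one-line operation: we identify a shard by its messages and edit the channel rather than the shard.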

I'm particularly excited about (7) and (5), though: if the world-model is itself an interface, it must be consistently-formatted in its entirety, and getting the ability to arbitrarily navigate and edit it would, perhaps, allow us to solve the entire alignment problem on the spot. (A post elaborating on this is upcoming.)

(6) is also extremely high-value, but might be much more difficult to decode.


5. Research Priorities

6 comments

Comments sorted by top scores.

comment by Erik Jenner (ejenner) · 2022-12-29T21:20:51.018Z · LW(p) · GW(p)

I think this is an interesting direction and I've been thinking about pretty similar things (or more generally, "quotient [LW · GW]" interpretability research). I'm planning to write much more in the future, but not sure when that will be, so here are some unorganized quick thoughts in the meantime:

  • Considering the internal interfaces of a program/neural net/circuit/... is a special case of the more general idea of describing how a program/... works at a higher level of abstraction. For example, for circuits (and in particular neural networks), we could think of the "interface abstraction" as a quotient on vertices. I.e. we partition vertices into submodules, and then throw away all the information about how each submodule computes its outputs, considering only the interfaces. This corresponds to using a quotient graph of the original computational graph. From this perspective, interfaces are one very sensible abstraction of a computational graph one could consider, but not the only one. So besides interfaces, I'm also interested in what abstractions of programs/computational graphs can look like more generally.
  • This also highlights that you can have submodules and thus interfaces at different levels of abstractions. In programs, you might have small helper functions, which are composed to more complicated methods, which are part of classes, which are part of larger modules. In a computational graph, you could have refinements of partitions of vertices. I'd consider the examples from section 4 pretty high-level submodules, and I think somewhat lower-level ones would also be interesting.
  • You mention that the interfaces themselves have structure ("data formats"). Perhaps this could be modeled by looking at interfaces at different levels of abstractions as mentioned in the previous bullet point. I.e. one high-level interface would be made up of several low-level interfaces. This is pure guesswork though, I haven't tried to work out anything like that yet.
  • When talking about what "good interfaces" or "good submodules" are, a common approach is that we want interfaces to be comparatively sparse. An argument for why this might be desirable is that it makes the abstracted computational graph consisting only of interfaces easier to understand. But if I imagine a high-level description of how a neural network works that's actually human-understandable, the key aspect seems to be that the high-level description should be in terms of human concepts. This suggests that the important thing is that the information at the interfaces can be well approximated using compact descriptions in terms of human concepts. In some hand-wavy way, it seems that (some versions of) NAH should imply that these two desiderata are the same: if we look for submodules as things with sparse interfaces, we also get human-understandable concepts represented at the interfaces. I think formalizing this claim could be a good milestone for conceptual research on modularity/abstractions/...
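
The "quotient on vertices" idea from the first bullet can be sketched in a few lines. The toy graph, the vertex names, and the module assignment below are hypothetical; the sketch only shows the operation of collapsing vertices into submodules and keeping the edges that cross module boundaries.

```python
def quotient_graph(edges, module_of):
    """Collapse vertices into modules; keep only edges between different modules.

    `edges` is an iterable of (source, target) vertex pairs, and `module_of`
    maps each vertex to the name of the submodule it belongs to.
    """
    interfaces = set()
    for source, target in edges:
        a, b = module_of[source], module_of[target]
        if a != b:  # edges inside a module are thrown away
            interfaces.add((a, b))
    return interfaces


# A tiny made-up computational graph.
edges = [
    ("x1", "h1"), ("x2", "h1"),  # inputs feed module A
    ("h1", "h2"),                # internal to A
    ("h2", "g1"),                # A -> B interface
    ("g1", "g2"),                # internal to B
    ("g2", "y"),                 # B feeds the output
]
module_of = {
    "x1": "input", "x2": "input",
    "h1": "A", "h2": "A",
    "g1": "B", "g2": "B",
    "y": "output",
}

interfaces = quotient_graph(edges, module_of)
```

Six edges collapse to three interface edges here; a sparser quotient graph corresponds to the "comparatively sparse interfaces" desideratum in the last bullet.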

comment by Ben Amitay (unicode-70) · 2022-12-30T20:32:42.385Z · LW(p) · GW(p)

Hi, great post. Modularity was for a while at the top of my list of things-missing-for-capabilities, and it is interesting to see its relevance to safety too.

Some points about the hypothesised modules:

  • I did some introspection lately to learn about how I walk, and now suspect that the SCM is not so much a separate module, but more like synchronization locks in computing - as if, for example, there were a lock for each of my muscles, and when two usually-unconscious movements try to use the same one, one of them seems to be blocked - notifying my consciousness when it is important enough. I dare to hypothesize further that there is a similar lock for every variable that my movement may control - moving my car is treated the same as moving my finger. Actually, it may be even easier than that - the first shard may write into the world model a prediction about my future movement. Then, when the second shard tries to do the same, the result will be an "inconsistency notification" - same as if it was a surprising observation.
  • In general, I think that the modules will automatically have an incentive to use the same protocols as long as their subject of communication is not too far removed.

comment by qbolec · 2022-12-29T21:45:49.847Z · LW(p) · GW(p)

ML models, like all software, and like the NAH [LW · GW] would predict, must consist of several specialized "modules".

After reading the source code of MySQL InnoDB for 5 years, I doubt it. I think it is perfectly possible - and actually, what I would expect to happen by default - to have a huge working piece of software with no clear module boundaries.

Take a look at this case in point: the row_search_mvcc() function https://github.com/mysql/mysql-server/blob/8.0/storage/innobase/row/row0sel.cc#L4377-L6019 which has 1500+ lines of code and references hundreds of variables. This function is called in the inner loop of almost every SELECT query you run, so on the one hand it probably works quite correctly, and on the other it was subject to "optimization pressure" over 20 years, and this is what you get. I think this is because Evolution is not Intelligent Design and it simply uses the cheapest locally available hack to get things done, and that is usually to reuse the existing variable #135, or more realistically a combination of variables #135 and #167, to do the trick - see how many of the if statements have conditions which use more than a single atom, for example:

      if (set_also_gap_locks && !trx->skip_gap_locks() &&
          prebuilt->select_lock_type != LOCK_NONE &&
          !dict_index_is_spatial(index)) {

(Speculation: I suspect that unless you chisel your neural network architecture to explicitly disallow connecting a neuron in question directly to neurons #145 and #167, it will do so as soon as it discovers they provide useful bits of information. I suspect this is why figuring out what layers and connectivity between them you need is difficult. Also, I suspect this is why simply ensuring the right high-level wiring between parts of the brain, and how to wire them to input/output channels, might be the most important part to encode in DNA, as the inner connections and weights can be later figured out relatively easily.)

Replies from: ejenner
comment by Erik Jenner (ejenner) · 2022-12-29T22:26:10.486Z · LW(p) · GW(p)

I'm very interested in examples of non-modular systems, but I'm not convinced by this one, for multiple reasons:

  • Even a 1,500 line function is a pretty small part of the entire codebase. So the existence of that function already means that the codebase as a whole seems somewhat modular.
  • My guess is that the function itself is in fact also modular (in the way I'd use the term). I only glanced at the function you link very quickly, but one thing that jumped out are the comments that divide it into "Phase 1" to "Phase 5". So even though it's not explicitly decomposed into submodules in the sense of e.g. helper functions, it does seem that programmers find it a useful abstraction to think of this huge function as a composition of five "submodules" that perform different subtasks. I would guess that this abstraction is reflected somehow in the structure of the function itself, as opposed to being completely arbitrary (i.e. putting the boundaries between phases at random line numbers instead would be worse in some pretty objective sense). So to me, the existence of 1500 line functions is not strong evidence against the ubiquity of modularity, since modularity properly defined should be more general than just functions. I do agree this would be a good counterexample to certain definitions of modularity that are too narrow. (To be clear, I don't think anyone has a good definition yet for how this "natural submodule structure" could be read off from a program.)
  • Regarding the if statements: arguably, "truly non-modular" code would have lots of if statements that use a big fraction or even almost all of the variables in scope (or if we're being strict, in the entire program, since smaller scopes already imply submodules). So I think if an if statement in a function with hundreds of variables contains just 4-8 terms depending on how we count, that's not a lot.

I wouldn't be surprised to learn we just have somewhat different notions of what "modularity" should mean. For sufficiently narrow definitions, I agree that lots of computations are non-modular, I'm just less interested in those.

comment by Ben Amitay (unicode-70) · 2022-12-30T20:50:42.899Z · LW(p) · GW(p)

Last point: if we change the name of "world model" to "long-term memory", we may notice the possibility that much of what you think of as shard-work may be programs stored in memory, and executed by a general-program-executor or a bunch of modules that specialize in specific sorts of programs, functioning as modern CPUs/interpreters (hopefully, stored in an organised way that preserves modularity). What will be in the general memory and what is in the weights themselves is non-obvious, and we may want to intervene at this point too (not sure in which direction).

comment by Ben Amitay (unicode-70) · 2022-12-30T20:37:16.956Z · LW(p) · GW(p)

(continuation of the same comment - submitted by mistake and cannot edit...) Assuming modules A and B are already "talking" and module C tries to talk to B, C would probably find it easier to learn a similar protocol than to invent a new one and teach it to B.