Ruling Out Lookup Tables
post by Alfred Harwood · 2025-02-04T10:39:34.899Z · LW · GW
This post was written during Alex Altair's agent foundations fellowship [LW · GW] funded by the LTFF.
This is not particularly surprising or complex, but I wanted it written up somewhere. In discussions of the Agent-like Structure problem, lookup tables often come up as a useful counter-example or intuition pump for how a system could exhibit agent-like behaviour without agent-like structure. It is fairly intuitive that, in the limit of a large number of entries, a lookup table requires a longer program to implement than a program which 'just' computes a function. A lookup table's size can be reduced using efficient encoding, but even a lookup table whose entries are encoded using a maximally efficient entropy code still grows in size with the number of entries. This post makes that claim quantitative.
Introduction
Suppose we have a black box which takes in strings and spits out other strings. Based on the inputs and outputs of the black box, it seems like the box is calculating some computable function $f$. One possibility is that the black box contains a program which computes $f(x)$ on the input $x$, and then outputs it. Another possibility is that the black box contains a big table with lots of possible inputs and their corresponding outputs. The program inside the black box then looks through this table until it finds an entry matching its input and prints the corresponding output. This is known as a lookup table.
Suppose we say that the box is 'actually calculating the function' if the box contains a finite program which will return $f(x)$ for every possible input $x$. Can we be sure that the box is actually calculating the function rather than using a lookup table?
Just by looking at a finite number of instances of input-output behaviour, we can't. The input-output behaviour of any function can be mimicked by a lookup table. But we can rule out a lookup table if we know the size of the program being executed inside the black box. This is because a program which computes a function has a fixed length. The program which calculates the function is the same length whether we test its behaviour using just one string of length two or one hundred strings of length one trillion (though other implementation requirements, such as memory, may differ). A lookup table must, at minimum, contain entries corresponding to all of the inputs and outputs on which we have tested it. This means that the minimum possible size of a lookup-table-based program grows as the number of inputs and outputs on which it is tested grows.
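The contrast between the two possibilities can be sketched in Python (a toy illustration; the function and the table contents below are invented for the example):

```python
# Two black boxes with identical input-output behaviour on tested inputs.

def f_direct(x: str) -> str:
    """Computes the function directly: the program has a fixed size,
    no matter how many inputs we test it on."""
    return x + "0" * 4  # toy function: append four zeros

# A lookup table mimicking f_direct on the inputs tested so far.
# Its size grows with every input-output pair it must store.
TABLE = {
    "1": "10000",
    "11": "110000",
    "101": "1010000",
}

def f_table(x: str) -> str:
    return TABLE[x]  # fails on any untested input

# On the tested inputs, the two boxes are indistinguishable.
for x in TABLE:
    assert f_direct(x) == f_table(x)
```

The point of the sketch is only that `f_direct` stays the same size however many pairs we test, while `TABLE` must gain a row per tested input.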
Suppose we know the length of the program in the black box. If the box shows the correct input-output behaviour for a number of inputs which is too large to be stored in a lookup table, then we can rule out a lookup table and assume that the box is implementing some other program. We will now make that claim quantitative.
The Smallest Possible Lookup Table
In order to rule out a lookup table on the basis of length, we must first identify the smallest program which constitutes a lookup table and which can mimic the input-output behaviour of some function. We will call something a lookup table if it stores a representation of each input-output pair and, when presented with an input, searches through this list to find an appropriate output.
Let us consider the example above, where it seems like the black box is computing the function $f$. Imagine that we have tested it on one hundred bitstrings, each of length one billion. One naive way to capture this behaviour in a lookup table would be to explicitly represent each input and output in a table (along with a fixed-length program which searches through the entries of the table and outputs the correct entry).
Since each of the 100 inputs and 100 outputs has length 1 billion, the length of such a program would be at least 200 billion bits, just to accommodate the entries to the table, plus a constant to accommodate the program which searches through the table. On top of this, the program would need some way of delimiting the 'rows' and 'columns' of the table. Can we compress the size of this program further, without losing the right to call it a 'lookup table'?
For a start, the table itself doesn't have to contain explicit representations of all of the inputs and outputs. We can add a small program which encodes the inputs and outputs as smaller strings. After searching through the table which contains these small encoded strings, the program then decodes the entry in the table corresponding to the output.
The total program for this procedure will be shorter than the program which explicitly represents each entry in the lookup table, provided that the reduction in length from compression is greater than the extra length from the encoding/decoding programs. A program like this is still spiritually a lookup table since it represents each input and output.
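A toy sketch of this encoded-table idea, with an invented two-entry codebook standing in for a real entropy code:

```python
# A lookup table whose entries are stored in encoded (compressed) form.
# The codebook and table contents are invented for illustration.
codebook = {"a": "0" * 1000, "b": "1" * 1000}  # short code -> long string
encode = {v: k for k, v in codebook.items()}   # long string -> short code

# The table stores only the short codes, not the long strings themselves.
table = [("a", "b"), ("b", "a")]  # (encoded input, encoded output) pairs

def lookup(x: str) -> str:
    cx = encode[x]                  # encode the query
    for cin, cout in table:         # search the table of codes
        if cin == cx:
            return codebook[cout]   # decode the stored output
    raise KeyError(x)

assert lookup("0" * 1000) == "1" * 1000
```

The program still stores one (encoded) entry per input-output pair and searches through them, so it remains a lookup table in the sense used here.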
Optimal Compression
Suppose we have a lookup table containing $2N$ entries in total which correspond to encodings of the inputs and outputs (so there are $N$ inputs and $N$ outputs). Let the length of each encoded entry be denoted by $l_i$ with $1 \le i \le n$, where $n$ is the number of unique entries in the table. Let the number of times each entry appears in the table be denoted by $m_i$, so that $\sum_{i=1}^n m_i = 2N$. In a lookup table without redundancy, each input will only be represented once, but multiple inputs might map to the same output, or some input strings may 'double up' as outputs. This would lead to $m_i > 1$ for the codewords representing those strings. In order for a program to be able to search through the table and identify each input and output separately, the encoding must be uniquely decodable. Ensuring that the code is uniquely decodable means that we don't need to include extra bits to demarcate where one entry ends and another begins.
The total length of the part of the program constituting the lookup table is:

$$L(\vec{l}) = \sum_{i=1}^n m_i l_i$$

where we have treated the $l_i$'s as a vector $\vec{l}$.
Since the codewords form a uniquely decodable code, they must satisfy the Kraft-McMillan inequality:

$$\sum_{i=1}^n 2^{-l_i} \leq 1.$$
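The Kraft-McMillan condition is easy to check for a concrete code. Prefix codes are one familiar class of uniquely decodable codes; the code below is an invented example:

```python
# Kraft-McMillan: a uniquely decodable binary code with codeword
# lengths l_i must satisfy sum(2**-l_i) <= 1.
prefix_code = ["0", "10", "110", "111"]  # no codeword is a prefix of another
lengths = [len(w) for w in prefix_code]

kraft_sum = sum(2 ** -l for l in lengths)  # 1/2 + 1/4 + 1/8 + 1/8
assert kraft_sum <= 1  # this code saturates the inequality exactly
```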
Note that the left hand side of this constraint monotonically decreases with each $l_i$. Recall that we are interested in finding the shortest possible encoding. Any encoding with $\sum_i 2^{-l_i} < 1$ can have its codeword lengths shortened until it satisfies $\sum_i 2^{-l_i} = 1$. This means that we can treat the constraint as an equality and use Lagrange multipliers to minimize $L(\vec{l})$ subject to the constraint $\sum_{i=1}^n 2^{-l_i} = 1$.
The Lagrange multiplier method involves solving the system of equations:

$$\frac{\partial}{\partial l_i}\left[\sum_{j=1}^n m_j l_j + \lambda\left(\sum_{j=1}^n 2^{-l_j} - 1\right)\right] = 0$$

for $1 \le i \le n$ and a parameter $\lambda$, along with the condition

$$\sum_{i=1}^n 2^{-l_i} = 1.$$

This leads to the following set of equations:

$$m_i - \lambda \ln 2 \cdot 2^{-l_i} = 0 \quad\Longleftrightarrow\quad 2^{-l_i} = \frac{m_i}{\lambda \ln 2}.$$
Combined with the constraint, this is a system of $n+1$ equations with $n+1$ unknowns (the $n$ values of $l_i$ plus $\lambda$) and can be solved. Details of the solution can be found in this footnote[1]. The set of lengths obtained by minimizing $L(\vec{l})$ subject to the constraint is

$$l_i = \log_2\!\left(\frac{2N}{m_i}\right).$$
Note that we haven't actually specified what encoding we are using; we have just obtained the minimum lengths of the table entries. For a concrete comparison, note that this encoding has properties similar to the Shannon-Fano code, Huffman code, or other entropy codes. These codes assign codewords of length roughly $\log_2(1/p)$, where $p$ is the probability that a particular word will appear. Our code assigns codewords similarly if the ratios $m_i/2N$ are interpreted as 'probabilities' that each word appears in any given cell of the lookup table. The most common entries in the table (those with the largest $m_i$) are assigned the shortest codewords, while entries which only appear once in the table are assigned codewords of length $\log_2(2N)$.
This means that the total length of the smallest lookup table is

$$L_{\min} = \sum_{i=1}^n m_i \log_2\!\left(\frac{2N}{m_i}\right).$$
The complete program which uses the lookup table will have length equal to $L_{\min}$ plus a value capturing the encoding/decoding program and the procedure for searching through the table. Therefore, our expression for $L_{\min}$ is a lower bound on the total length of the program.
If we write $p_i = \frac{m_i}{2N}$ for the probability that a randomly selected element of the table is the $i$-th codeword, we can write $L_{\min}$ explicitly as an entropy of the probability distribution $\vec{p}$:

$$L_{\min} = 2N \sum_{i=1}^n p_i \log_2\!\left(\frac{1}{p_i}\right) = 2N \, H(\vec{p}).$$
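These formulas can be sanity-checked numerically; the multiplicities below are made up for illustration:

```python
from math import log2

# Made-up multiplicities: n = 3 unique entries filling a table of 2N = 8 cells.
m = [4, 2, 2]
two_N = sum(m)  # 2N = 8

# Optimal codeword lengths l_i = log2(2N / m_i).
lengths = [log2(two_N / mi) for mi in m]
assert abs(sum(2 ** -l for l in lengths) - 1) < 1e-12  # Kraft equality holds

# L_min computed two ways: sum_i m_i * l_i, and 2N * H(p) with p_i = m_i / 2N.
L_min = sum(mi * li for mi, li in zip(m, lengths))
p = [mi / two_N for mi in m]
entropy = -sum(pi * log2(pi) for pi in p)
assert abs(L_min - two_N * entropy) < 1e-12  # the two expressions agree
```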
Unlike a program which computes the function directly, this program grows with the number of inputs for which it outputs the correct string. Thus, if we test our black box on a number of inputs and outputs such that $L_{\min}$ is larger than the length of the program in the black box, we can be sure that the program is not implementing a lookup table. (This doesn't necessarily mean that it is computing the function $f$; the black box may contain a different function which happens to agree with the candidate function on a subset of inputs and outputs.) In practice, the length of a lookup table program will be larger than this bound (since the bound does not include the length of the encoding/decoding program or the program which searches through the table). Nonetheless, it is a hard bound. If the length of a program is smaller than $L_{\min}$, we can't guarantee that it is computing the function we hoped, but we can guarantee that it is not a lookup table (for our reasonable definition of 'lookup table').
Some Obvious Consequences
- Many-to-one functions are easier to store in a small lookup table than bijections.
- If you want to rule out a lookup table quickly, choose inputs which are expected to map to diverse outputs, since these take up more room in the table. You can strategically query the black box with these inputs.
- If the black box only needs to produce an output on a small number of inputs, it may actually be shorter to store the program as a table, if the program for the function is long.
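The first consequence can be illustrated with the $L_{\min}$ formula; the table sizes below are arbitrary choices:

```python
from math import log2

def L_min(multiplicities):
    """Minimum table length: sum_i m_i * log2(2N / m_i)."""
    two_N = sum(multiplicities)
    return sum(m * log2(two_N / m) for m in multiplicities)

N = 100  # 100 input-output pairs, so 2N = 200 table cells

# Bijection: every input and every output is distinct, so all
# 200 entries appear exactly once.
bijection = [1] * (2 * N)

# Many-to-one: 100 distinct inputs all map to one shared output,
# which therefore appears 100 times in the table.
many_to_one = [1] * N + [N]

# The many-to-one table compresses to a strictly shorter encoding.
assert L_min(many_to_one) < L_min(bijection)
```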
An Example
Suppose $f$ is a function which takes an input string $x$ and outputs $x$ concatenated with one thousand 0's. Suppose a black box containing a program of length $\ell$ bits seems to be calculating this function. Assuming we test it in such a way that all inputs and outputs are unique (for example, we only use inputs which do not end in one thousand zeros) and it correctly outputs $f(x)$ for each input $x$, how many inputs do we need to test to rule out a lookup table?
Since each entry in the table is unique, we have $m_i = 1$ for all $i$ and $n = 2N$. Plugging this into our equation gives $L_{\min} = 2N \log_2(2N)$. Solving $2N \log_2(2N) = \ell$ for $N$ gives the threshold: if the black box gives the correct output for more than $N$ inputs, we can rule out a lookup table. We could improve on this bound if we had an estimate $c$ for a lower bound on the size of the encoding/decoding and search parts of the lookup table program. We could then solve $2N \log_2(2N) + c = \ell$ to obtain a smaller bound.
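For a concrete (invented) program length, the threshold $N$ can be found numerically; the one-million-bit figure below is an assumption for illustration, not a figure from the post:

```python
from math import log2

def inputs_to_rule_out_table(program_bits: float) -> int:
    """Smallest N with 2N * log2(2N) > program_bits, found by bisection."""
    lo, hi = 1, 1
    while 2 * hi * log2(2 * hi) <= program_bits:
        hi *= 2  # grow until the bound is exceeded
    while lo < hi:
        mid = (lo + hi) // 2
        if 2 * mid * log2(2 * mid) > program_bits:
            hi = mid
        else:
            lo = mid + 1
    return lo

# With an assumed program length of one million bits:
N = inputs_to_rule_out_table(1e6)
assert 2 * N * log2(2 * N) > 1e6          # N correct answers exceed the bound
assert 2 * (N - 1) * log2(2 * (N - 1)) <= 1e6  # N-1 do not
```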
Why do we care?
When studying the agent-like structure problem [LW · GW] and selection theorems [LW · GW], there is an important distinction between behaviour and structure. Just because a program exhibits some behaviour does not imply that it has a particular structure. A wide variety of structures can produce the same input-output behaviour, so behaviour alone licenses almost no claims about structure. But by making some limited structural assumptions (such as limiting the length of the program) and combining these with behaviours, we can make stronger claims about the kind of structure implied by behaviours. In this sense, the result in this post can be thought of as a (very weak!) selection theorem. Given a basic structural assumption (the limited length of the program) and some behaviour (computing some function for a large number of inputs/outputs), we can say something further about the structure (that it is not a lookup table).
- ^
We have $2^{-l_i} = \frac{m_i}{\lambda \ln 2}$. Summing both sides over $i$ gives:

$$\sum_{i=1}^n 2^{-l_i} = \frac{1}{\lambda \ln 2}\sum_{i=1}^n m_i.$$

Now, we use the fact that $\sum_{i=1}^n m_i = 2N$ and $\sum_{i=1}^n 2^{-l_i} = 1$ to write

$$1 = \frac{2N}{\lambda \ln 2},$$

so $\lambda \ln 2 = 2N$. Substituting this into $2^{-l_i} = \frac{m_i}{\lambda \ln 2}$ gives

$$2^{-l_i} = \frac{m_i}{2N}.$$

This allows us to solve for the lengths and obtain $l_i = \log_2\!\left(\frac{2N}{m_i}\right)$.
comment by Anon User (anon-user) · 2025-02-04T15:09:53.092Z · LW(p) · GW(p)
This seems to be making a somewhat arbitrary distinction - specifically between a program that computes f(x) in some sort of a direct way, and a program that computes it in some less direct way (you call it a "lookup table", but you seem to actually allow combining that with arbitrary decompression/decoding algorithms). But realistically, this is a spectrum - e.g. if you have a program computing a predicate P(x, y) that is only true when y = f(x), and then the program just tries all possible y - is that more like a function, or more like a lookup? What about if you first compute some simple function of the input (e.g. x mod N), then do a lookup?