I am actually currently working on developing these ideas further, and I expect to relatively soon be able to put out some material on this (modulo the fact that I have to finish my PhD thesis first).
I also think that you in practice probably would have to allow some uninterpretable components to maintain competitive performance, at least in some domains. One reason for this is of course that there simply might not be any interpretable computer program which solves the given task (*). Moreover, even if such a program does exist, it may plausibly be infeasibly difficult to find (even with the help of powerful AI systems). However, some black-box components might be acceptable (depending on how the AI is used, etc), and it seems like partial successes would be useful even if the full version of the problem isn't solved (at least under the assumption that interpretability is useful, even if the full version of interpretability isn't solved).
I also think there is good reason to believe that quite a lot of the cognition that humans are capable of can be carried out by interpretable programs. For example, any problem where you can "explain your thought process" or "justify your answer" is probably (mostly) in this category. I also don't think that operations of the form "do X, because on average, this works well" necessarily are problematic, provided that "X" itself can be understood. Humans give each other advice like this all the time. For example, consider a recommendation like "when solving a maze, it's often a good idea to start from the end". I would say that this is interpretable, even without a deeper justification for why this is a good thing to do. At the end of the day, all knowledge must (in some way) be grounded in statistical regularities. If you ask a sequence of "why"-questions, you must eventually hit a point where you are no longer able to answer. As long as the resulting model itself can be understood and reasoned about, I think we should consider this to be a success. This also means that problems that can be solved by a large ensemble of simple heuristics arguably are fine, provided that the heuristics themselves are intelligible.
(*) It is also not fully clear to me if it even makes sense to say that a task can't be solved by an interpretable program. On an intuitive level, this seems to make sense. However, I'm not able to map this statement onto any kind of formal claim. Would it imply that there are things which are outside the reach of science? I consider it to at least be a live possibility that anything can be made interpretable.
What I had in mind is a situation where we have access to the latent variables during training, and only use the model to prove safety properties in situations that are within the range of the training distribution in some sense (eg, situations where we have some learning-theoretic guarantees). As for treacherous turns, I am implicitly assuming that we don't have to worry about a treacherous turn from the world model, but that we may have to worry about it from the AI policy that we're verifying.
However, note that even this is meaningfully different from just using RLHF, especially in settings with some adversarial component. In particular, a situation that is OOD for the policy need not be OOD for the world model. For example, learning a model of the rules of chess is much easier than learning a policy that is good at playing chess. It would also be much easier to prove a learning-theoretic guarantee for the former than the latter.
So, suppose we're training a chess-playing AI, and we want to be sure that it cannot be defeated in n moves or less. The RLHF strategy would, in this scenario, essentially amount to letting a bunch of human red-teamers play against the AI a large number of times, trying to find cases where they can beat the AI in n moves or less, and then training against those cases. This would only give us very weak quantitative guarantees, because there might be strategies that the red-teamers didn't think of.
Alternatively, we could also train a network to model the rules of chess (in this particular example, we could of course also specify this model manually, but let's ignore that for the sake of the argument). It seems fairly likely that we could train this model to be highly accurate. Moreover, using normal statistical methods, we could estimate a bound on the fraction of the state-action space on which this model makes an incorrect prediction, and derive other learning-theoretic guarantees (depending on how the training data is collected, etc). We could then formally verify that the chess AI cannot be beaten in n moves or less, relative to this world model. This would produce a much stronger quantitative guarantee, and the assumptions behind this guarantee would be much easier to audit. The guarantee would of course still not be an absolute proof, because there will likely be some errors in the world model, and the chess AI might be targeting these errors, etc, but it is substantially better than what you get in the RLHF case. Also note that, as we run the chess AI, we could track the predictions of the world model online. If the world model ever makes a prediction that contradicts what in fact happens, we could shut down the chess AI, or transition to a safe mode. This gives us even stronger guarantees.
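To make the run-time monitoring idea concrete, here is a minimal sketch; the `WorldModel`, `ChessAI`, and environment interfaces are hypothetical placeholders, not part of any existing implementation:

```python
class SafeModeTriggered(Exception):
    """Raised when the world model's prediction is contradicted by reality."""

def run_with_monitoring(chess_ai, world_model, env):
    """Run the policy, but fall back to a safe mode as soon as the world
    model mispredicts a transition, since the formal guarantee was only
    proven relative to that world model."""
    state = env.reset()
    while not env.done():
        action = chess_ai.act(state)
        predicted_next = world_model.predict(state, action)  # the model's claim
        actual_next = env.step(action)                        # what actually happens
        if predicted_next != actual_next:
            # The verified property no longer applies; stop or switch modes.
            raise SafeModeTriggered(
                f"world model mispredicted the transition from {state!r}"
            )
        state = actual_next
    return state
```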
This is of course a toy example, because it is a case where we can design a perfect world model manually (excluding the opponent, of course). We can also design a chess-playing AI manually, etc. However, I think it illustrates that there is a meaningful difference between RLHF and formal verification relative to even a black-box world model. The complexity of the space of all policies grows much faster than the complexity of the environment, and this fact can be exploited.
In practice, I think you are unlikely to end up with a schemer unless you train your model to solve some agentic task (or train it to model a system that may itself be a schemer, such as a human). However, in order to guarantee that, I agree we need some additional property (such as interpretability, or some learning-theoretic guarantee).
I'm not so convinced of this. Yes, for some complex safety properties, the world model will probably have to be very smart. However, this does not mean that you have to model everything -- depending on your safety specification and use case, you may be able to factor out a huge amount of complexity. We know from existing cases that this is true on a small scale -- why should it not also be true on a medium or large scale?
For example, with a detailed model of the human body, you may be able to prove whether or not a given chemical could be harmful to ingest. This cannot be done with current tools, because we don't have a detailed computational model of the human body (and even if we did, we would not be able to use it for scalable inference). However, this seems like the kind of thing that could plausibly be created in the not-so-long term using AI tools. And if we had such a model, we could prove many interesting safety properties for e.g. pharmaceutical development AIs (even if these AIs know many things that are not covered by this world model).
Suppose you had a world model which was as smart as GPT-3 (but magically interpretable). Do you think this would be useful for something?
I think that would be extremely useful, because it would tell us many things about how to implement cognitive algorithms. But I don't think it would be very useful for proving safety properties (which I assume was your actual question). GPT-3's capabilities are wide but shallow, but in most cases, what we would need are capabilities that are narrow but deep.
I think the distinction between these two cases often can be somewhat vague.
Why do you think that the adversarial case is very different?
I think you're perhaps reading me as being more bullish on Bayesian methods than I in fact am -- I am not necessarily saying that Bayesian methods in fact can solve OOD generalisation in practice, nor am I saying that other methods could not also do this. In fact, I was until recently very skeptical of Bayesian methods, before talking about it with Yoshua Bengio. Rather, my original reply was meant to explain why the Bayesian aspect of Bengio's research agenda is a core part of its motivation, in response to your remark that "from my understanding, the bayesian aspect of [Bengio's] agenda doesn't add much value".
I agree that if a Bayesian learner uses the NN prior, then its behaviour should -- in the limit -- be very similar to training a large ensemble of NNs. However, there could still be advantages to an explicitly Bayesian method. For example, off the top of my head:
- It may be that you need an extremely large ensemble to approximate the posterior well, and that the Bayesian learner can approximate it much better with far fewer resources.
- It may be that you can more easily prove learning-theoretic guarantees for the Bayesian learner.
- It may be that a Bayesian learner makes it easier to condition on events that have a very small probability in your posterior (such as, for example, the event that a particular complex plan is executed).
- It may be that the Bayesian learner has a more interpretable prior, or that you can reprogram it more easily.
And so on; these are just some examples. Of course, whether you get these benefits in practice is a matter of speculation until we have a concrete algorithm to analyse. All I'm saying is that there are valid and well-motivated reasons to explore this particular direction.
But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.)
It sounds to me like we agree here, I don't want to put too much weight on "most".
Is this true?
It is true in the sense that you don't have any theoretical guarantees, and in the sense that it also often fails to work in practice.
Aren't NN implicitly ensembles of vast number of models?
They probably are, to some extent. However, in practice, you often find that neural networks make very confident (and wrong) predictions for out-of-distribution inputs, in a way that seems to be caused by them latching onto some spurious correlation. For example, you train a network to distinguish different types of tanks, but it learns to distinguish night from day. You train an agent to collect coins, but it learns to go to the right. You train a network to detect criminality, but it learns to detect smiles. Adversarial examples could also be cast as an instance of this phenomenon. In all of these cases, we have a situation where there are multiple patterns in the data that fit a given training objective, but where a neural network ends up giving an unreasonably large weight to some of these patterns at the expense of other plausible patterns. I thus think it's fair to say that -- empirically -- neural networks do not robustly quantify uncertainty in a reliable way when out-of-distribution. It may be that this problem mostly goes away with a sufficiently large amount of sufficiently varied data, but it seems hard to get high confidence in that.
Also, does ensembling 5 NNs help?
In practice, this does not seem to help very much.
If we're conservative over a million models, how will we ever do anything?
I mean, there can easily be cases where we assign a very low probability to any given "complete" model of a situation, but where we are still able to assign a high probability to many different partial hypotheses. For example, if you randomly sample a building somewhere on earth, then your credence that the building has a particular floor plan might be less than 1 in 1,000,000 for each given floor plan. However, you could still assign a credence of near-1 that the building has a door, and a roof, etc. To give a less contrived example, there are many ways for the stock market to evolve over time. It would be very foolish to assume that it will evolve according to, e.g., your maximum-likelihood model. However, you can still assign a high credence to the hypothesis that it will grow on average. In many (if not all?) cases, such partial hypotheses are sufficient.
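As a toy numerical illustration of this point (the probabilities and the `sample_building` model below are invented purely for the example):

```python
import random
from collections import Counter

random.seed(0)

# Each "complete model" of a randomly sampled building specifies a full floor
# plan (here just an integer in a large space) plus whether it has a door.
# Any single complete model gets a tiny posterior probability, but the
# partial hypothesis "the building has a door" is close to certain.
def sample_building():
    floor_plan = random.randrange(10**6)    # one of a million possible layouts
    has_door = random.random() < 0.999      # almost every building has a door
    return floor_plan, has_door

samples = [sample_building() for _ in range(100_000)]

plan_counts = Counter(plan for plan, _ in samples)
p_best_complete_model = plan_counts.most_common(1)[0][1] / len(samples)
p_has_door = sum(1 for _, door in samples if door) / len(samples)

print(f"P(most probable complete floor plan) ~ {p_best_complete_model:.5f}")
print(f"P(partial hypothesis: has a door)    ~ {p_has_door:.3f}")
```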
If our prior is importantly different in a way which we think will help, why can't we regularize to train a NN in a normal way which will vaguely reasonably approximate this prior?
In the examples I gave above, the issue is less about the prior and more about the need to keep track of all plausible alternatives (which neural networks do not seem to do, robustly). Using ensembles might help, but in practice this does not seem to work that well.
I again don't see how bayes ensures you have some non-schemers while ensembling doesn't.
I also don't see a good reason to think that a Bayesian posterior over agents should give a large weight to non-schemers. However, that isn't the use-case here (the world model is not meant to be agentic).
Another way to put this, is that all the interesting action was happening at the point where you solved the ELK problem.
So, this depends on how you attempt to create the world model. If you try to do this by training a black-box model to do raw sensory prediction, and then attempt to either extract latent variables from that model, or turn it into an interpretable model, or something like that, then yes, you would probably have to solve ELK, or solve interpretability, or something like that. I would not be very optimistic about that working. However, this is in no way the only option. As a very simple example, you could simply train a black-box model to directly predict the values of all latent variables that you need for your definition of harm. This would not produce an interpretable model, and so you may not trust it very much (unless you have some learning-theoretic guarantee, perhaps), but it would not be difficult to determine if such a model "thinks" that harm would occur in a given scenario. As another example, you could build a world model "manually" (with humans and LLMs). Such a model may be interpretable by default. And so on.
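As a rough sketch of the "predict the latent variables directly" option (the latent variable names and interfaces below are invented placeholders, not part of any actual proposal):

```python
from typing import Callable, Mapping

Scenario = str                  # placeholder for however scenarios are encoded
Latents = Mapping[str, float]   # latent variable name -> probability it is true

# Invented examples of safety-relevant latent variables; a real specification
# would define its own.
LATENT_NAMES = ["human_in_danger_zone", "sensor_tampered", "self_copy_attempted"]

def harm_predicate(latents: Latents, threshold: float = 0.5) -> bool:
    """The harm specification is a simple, auditable function of the latents,
    even though the predictor that produces them may be a black box."""
    return any(latents[name] > threshold for name in LATENT_NAMES)

def scenario_is_safe(predict_latents: Callable[[Scenario], Latents],
                     scenario: Scenario) -> bool:
    """Reading off whether the black-box model 'thinks' harm occurs is easy:
    just apply the harm predicate to its predicted latent variables."""
    return not harm_predicate(predict_latents(scenario))
```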
I thought the plan was to build it with either AI labor or human labor so that it will be sufficiently interpretable. Not to e.g. build it with SGD. If the plan is to build it with SGD and not to ensure that it is interpretable, then why does it provide any safety guarantee? How can we use the world model to define a harm predicate?
The general strategy may include either of these two approaches. I'm just saying that the plan does not definitionally rely on the assumption that the world model is built manually.
Won't predicting safety specific variables contain all of the difficulty of predicting the world?
That very much depends on what the safety specifications are, and how you want to use your AI system. For example, think about the situations where similar things are already done today. You can prove that a cryptographic protocol is unbreakable, given some assumptions, without needing to have a complete predictive model of the humans that use that protocol. You can prove that a computer program will terminate using a bounded amount of time and memory, without needing a complete predictive model of all inputs to that computer program. You can prove that a bridge can withstand an earthquake of such-and-such magnitude, without having to model everything about the earth's climate. And so on. If you want to prove something like "the AI will never copy itself to an external computer", or "the AI will never communicate outside this trusted channel", or "the AI will never tamper with this sensor", or something like that, then your world model might not need to be all that detailed. For more ambitious safety specifications, you might of course need a more detailed world model. However, I don't think there is any good reason to believe that the world model categorically would need to be a "complete" world model in order to prove interesting safety properties.
From my understanding, the bayesian aspect of this agenda doesn't add much value.
A core advantage of Bayesian methods is the ability to handle out-of-distribution situations more gracefully, and this is arguably as important as (if not more important than) interpretability. In general, most (?) AI safety problems can be cast as an instance of a case where a model behaves as intended on a training distribution, but generalises in an unintended way when placed in a novel situation. Traditional ML has no straightforward way of dealing with such cases, since it only maintains a single hypothesis at any given time. However, Bayesian methods may make it less likely that a model will misgeneralise, or should at least give you a way of detecting when this is the case.
I also don't agree with the characterisation that "almost all the interesting work is in the step where we need to know whether a hypothesis implies harm" (if I understand you correctly). Of course, creating a formal definition or model of "harm" is difficult, and creating a world model is difficult, but once this has been done, it may not be very hard to detect if a given action would result in harm. But I may not understand your point in the intended way.
manually build an (interpretable) infra-bayesian world model which is sufficiently predictive of the world (as smart as our AI)
I don't think this is an accurate description of davidad's plan. Specifically, the world model does not necessarily have to be built manually, and it does not have to be as good at prediction as our AI. The world model only needs to be good at predicting the variables that are important for the safety specification(s), within the range of outputs that the AI system may produce.
Proof checking on this world model also seems likely to be unworkable
I agree that this is likely to be hard, but not necessarily to the point of being unworkable. Similar things are already done for other kinds of software deployed in complex contexts, and ASL-2/3 AI may make this substantially easier.
You can imagine different types of world models, going from very simple ones to very detailed ones. In a sense, you could perhaps think of the assumption that the input distribution is i.i.d. as a "world model". However, what is imagined is generally something that is much more detailed than this. More useful safety specifications would require world models that (to some extent) describe the physics of the environment of the AI (perhaps including human behaviour, though it would probably be better if this can be avoided). More detail about what the world model would need to do, and how such a world model may be created, is discussed in Section 3.2. My personal opinion is that the creation of such a world model probably would be challenging, but not more challenging than the problems encountered in other alignment research paths (such as mechanistic interpretability, etc). Also note that you can obtain guarantees without assuming that the world model is entirely accurate. For example, consider the guarantees that are derived in cryptography, or the guarantees derived from formal verification of airplane controllers, etc. You could also monitor the environment of the AI at runtime to look for signs that the world model is inaccurate in a certain situation, and if such signs are detected, transition the AI to a safe mode where it can be disabled.
If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.
I suppose this depends on what you mean by "most". DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function maps. This is also true of LSTMs vs transformers, and so on. I think it's fairly likely that these kinds of differences could have a large impact on the probability that a given type of model will learn to exhibit goal-directed behaviour in a given training setup, for example.
The ambitious statement here might be that all the relevant information you might care about (in terms of understanding universality) are already contained in the loss landscape.
Do you mean the loss landscape in the limit of infinite data, or the loss landscape for a "small" amount of data? In the former case, the loss landscape determines the parameter-function map over the data distribution. In the latter case, my guess would be that the statement probably is false (though I'm not sure).
You're right, I put the parameters the wrong way around. I have fixed it now, thanks!
I could have changed it to Why Neural Networks can obey Occam's Razor, but I think this obscures the main point.
I think even this would be somewhat inaccurate (in my opinion). If a given parametric Bayesian learning machine does obey (some version of) Occam's razor, then this must be because of some facts related to its prior, and because of some facts related to its parameter-function map. SLT does not say very much about either of these two things. What the post is about is primarily the relationship between the RLCT and posterior probability, and how this relationship can be used to reason about training dynamics. To connect this to Occam's razor (or inductive bias more broadly), further assumptions and claims would be required.
At the time of writing, basically nobody knew anything about SLT
Yes, thank you so much for taking the time to write those posts! They were very helpful for me to learn the basics of SLT.
As we discussed at Berkeley, I do like the polynomial example you give and this whole discussion has made me think more carefully about various aspects of the story, so thanks for that.
I'm very glad to hear that! :)
My inclination is that the polynomial example is actually quite pathological and that there is a reasonable correlation between the RLCT and Kolmogorov complexity in practice
Yes, I also believe that! The polynomial example is definitely pathological, and I do think that a low RLCT almost certainly is correlated with simplicity in the case of neural networks. My point is more that the mathematics of SLT does not explain generalisation, and that additional assumptions definitely will be needed to derive specific claims about the inductive bias of neural networks.
Well neural networks do obey Occam's razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor).
Would that not imply that my polynomial example also obeys Occam's razor?
However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) from my other post, and the use of terms like "generalisation" in broad terms in the more expository parts (as opposed to the technical parts) arguably doesn't make enough effort to prevent them from drawing these inferences.
Yes, I think this probably is the case. I also think the vast majority of readers won't go deep enough into the mathematical details to get a fine-grained understanding of what the maths is actually saying.
I'm often critical of the folklore-driven nature of the ML literature and what I view as its low scientific standards, and especially in the context of technical AI safety I think we need to aim higher, in both our technical and more public-facing work.
Yes, I very much agree with this too.
Does that sound reasonable?
Yes, absolutely!
At least right now, the value proposition I see of SLT lies not in explaining the "generalisation puzzle" but in understanding phase transitions and emergent structure; that might end up circling back to say something about generalisation, eventually.
I also think that SLT probably will be useful for understanding phase shifts and training dynamics (as I also noted in my post above), so we have no disagreements there either.
I think I recall reading that, but I'm not completely sure.
Note that the activation function affects the parameter-function map, and so the influence of the activation function is subsumed by the general question of what the parameter-function map looks like.
I'm not sure, but I think this example is pathological.
Yes, it's artificial and cherry-picked to make a certain rhetorical point as simply as possible.
This is the more relevant and interesting kind of symmetry, and it's easier to see what this kind of symmetry has to do with functional simplicity: simpler functions have more local degeneracies.
This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true.
You can make the same critique of Kolmogorov complexity.
Yes, I have been using "Kolmogorov complexity" in a somewhat loose way here.
Wild conjecture: [...]
Is this not satisfied trivially due to the fact that the RLCT has a certain maximum and minimum value within each model class? (If we stick to the assumption that the parameter space is compact, etc.)
Will do, thank you for the reference!
Yes, I completely agree. The theorems that have been proven by Watanabe are of course true and non-trivial facts of mathematics; I do not mean to dispute this. What I do criticise is the magnitude of the significance of these results for the problem of understanding the behaviour of deep learning systems.
Thank you for this -- I agree with what you are saying here. In the post, I went with a somewhat loose equivocation between "good priors" and "a prior towards low Kolmogorov complexity", but this does skim past a lot of nuance. I also very much do not want to say that the DNN prior is exactly a bias towards low Kolmogorov complexity (this would be uncomputable), but only that it is mostly correlated with Kolmogorov complexity for typical problems.
Yes, I mostly just mean "low test error". I'm assuming that real-world problems follow a distribution that is similar to the Solomonoff prior (i.e., that data generating functions are more likely to have low Kolmogorov complexity than high Kolmogorov complexity) -- this is where the link is coming from. This is an assumption about the real world, and not something that can be established mathematically.
I think that it gives us an adequate account of generalisation in the limit of infinite data (or, more specifically, in the case where we have enough data to wash out the influence of the inductive bias); this is what my original remark was about. I don't think classical statistical learning theory gives us an adequate account of generalisation in the setting where the training data is small enough for our inductive bias to still matter, and it only has very limited things to say about out-of-distribution generalisation.
The assumption that small neural networks are a good match for the actual data generating process of the world is equivalent to the assumption that neural networks have an inductive bias that gives large weight to the actual data generating process of the world, if we also append the claim that neural networks have an inductive bias that gives large weight to functions which can be described by small neural networks (and this latter claim is not too difficult to justify, I think).
I think the second one by Carroll is quite careful to say things like "we can now understand why singular models have the capacity to generalise well" which seems to me uncontroversial, given the definitions of the terms involved and the surrounding discussion.
The title of the post is Why Neural Networks obey Occam's Razor! It also cites Zhang et al, 2017, and immediately after this says that SLT can help explain why neural networks have the capacity to generalise well. This gives the impression that the post is intended to give a solution to problem (ii) in your other comment, rather than a solution to problem (i).
Jesse's post includes the following expression:
I think this also suggests an equivocation between the RLCT measure and practical generalisation behaviour. Moreover, neither post contains any discussion of the difference between (i) and (ii).
Anyway I'm guessing you're probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).
Yes, absolutely. However, I also don't think that (i) is very mysterious, if we view things from a Bayesian perspective. Indeed, it seems natural to say that an ideal Bayesian reasoner should assign non-zero prior probability to all computable models, or something along those lines, and in that case, notions like "overparameterised" no longer seem very significant.
Maybe that has significant overlap with the critique of SLT you're making?
Yes, this is basically exactly what my criticism of SLT is -- I could not have described it better myself!
Again, I think this reduction is not trivial since the link between the RLCT and generalisation error is nontrivial.
I agree that this reduction is relevant and non-trivial. I don't have any objections to this per se. However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map).
A few things:
1. Neural networks do typically learn functions with low Kolmogorov complexity (otherwise they would not be able to generalise well).
2. It is a type error to describe a function as having low RLCT. A given function may have a high RLCT or a low RLCT, depending on the architecture of the learning machine.
3. The critique is against the supposition that we can use SLT to explain why neural networks generalise well in the small-data regime. The example provides a learning machine which would not generalise well, but which does fit all assumptions made by SLT. Hence, the SLT theorems which appear to prove that learning machines will generalise well when they are subject to the assumptions of SLT must in fact be showing something else.
My point is precisely that SLT does not give us a predictive account of how neural networks behave, in terms of generalisation and inductive bias, because it abstracts away from factors which are necessary to understand generalisation and inductive bias.
To say that neural networks are empirical risk minimisers is just to say that they find functions with globally optimal training loss (and, if they find functions with a loss close to the global optimum, then they are approximate empirical risk minimisers, etc).
I think SLT effectively assumes that neural networks are (close to being) empirical risk minimisers, via the assumption that they are trained by Bayesian induction.
The bounds are not exactly vacuous -- in fact, they are (in a sense) tight. However, they concern a somewhat adversarial setting, where the data distribution may be selected arbitrarily (including by making it maximally opposed to the inductive bias of the learning algorithm). This means that the bounds end up being much larger than what you would typically observe in practice, if you give typical problems to a learning algorithm whose inductive bias is attuned to the structure of "typical" problems.
If you move from this adversarial setting to a more probabilistic setting, where you assume a fixed distribution over target functions or data distributions, then you may be able to prove tighter probabilistic bounds. However, I do not have any references to places where this actually has been done (and as far as I know, it has not been done before).
I already posted this in response to Daniel Murfet, but I will copy it over here:
For example, the agnostic PAC-learning theorem says that if a learning machine A (for binary classification) is an empirical risk minimiser with VC dimension d, then for any distribution D over X × {0,1}, if A is given access to at least O((d + log(1/δ))/ε²) data points sampled from D, then it will with probability at least 1 − δ learn a function whose (true) generalisation error (under D) is at most ε worse than the best function which A is able to express (in terms of its true generalisation error under D). If we assume that D corresponds to a function which A can express, then the generalisation error of A will with probability at least 1 − δ be at most ε.
This means that, in the limit of infinite data, A will with probability arbitrarily close to 1 learn a function whose error is arbitrarily close to the optimal value (among all functions which A is able to express). Thus, any empirical risk minimiser with a finite VC dimension will generalise well in the limit of infinite data.
For a bit more detail, see this post.
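To get a feel for the magnitudes involved, here is a small calculation using the standard form of the agnostic PAC sample-complexity bound, m ≥ C(d + ln(1/δ))/ε²; the constant C depends on the exact statement of the theorem and is left as an explicit parameter:

```python
import math

def agnostic_pac_sample_bound(vc_dim: int, epsilon: float, delta: float,
                              C: float = 1.0) -> int:
    """Sample size sufficient for an empirical risk minimiser with VC
    dimension `vc_dim` to be, with probability at least 1 - delta, within
    epsilon of the best function in its class (up to the constant C)."""
    return math.ceil(C * (vc_dim + math.log(1 / delta)) / epsilon ** 2)

# Example: VC dimension 1000, accuracy 0.01, confidence 0.99.
print(agnostic_pac_sample_bound(vc_dim=1000, epsilon=0.01, delta=0.01))
# ~10 million samples (times C) -- illustrating why such worst-case bounds
# are far larger than the sample sizes that work in practice.
```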
Does this not essentially amount to just assuming that the inductive bias of neural networks in fact matches the prior that we (as humans) have about the world?
This is basically a justification of something like your point 1, but AFAICT it's closer to a proof in the SLT setting than in your setting.
I think it could probably be turned into a proof in either setting, at least if we are allowed to help ourselves to assumptions like "the ground truth function is generated by a small neural net" and "learning is done in a Bayesian way", etc.
In your example there are many values of the parameters that encode the zero function
Ah, yes, I should have made the training data be (1,1), rather than (0,0). I've fixed the example now!
Is that a fair characterisation of the argument you want to make?
Yes, that is exactly right!
Assuming it is, my response is as follows. I'm guessing you think f(x) = 0 is simpler than the alternative function because the former can be encoded by a shorter code on a UTM than the latter.
The notion of complexity that I have in mind is even more pre-theoretic than that; it's something like "the alternative function looks like an intuitively less plausible guess than f(x) = 0". However, if we want to keep things strictly mathematical, then we can substitute this for the definition in terms of UTM codes.
But this isn't the kind of complexity that SLT talks about
I'm well aware of that -- that is what my example attempts to show! My point is that the kind of complexity which SLT talks about does not allow us to make inferences about inductive bias or generalisation behaviour, contra what is claimed e.g. here and here.
So we agree that Kolmogorov complexity and the local learning coefficient are potentially measuring different things. I want to dig deeper into where our disagreement lies, but I think I'll just post this as-is and make sure I'm not confused about your views up to this point.
As far as I can tell, we don't disagree about any object-level technical claims. Insofar as we do disagree about something, it is probably about methodological meta-questions. I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don't think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?
For example, the agnostic PAC-learning theorem says that if a learning machine A (for binary classification) is an empirical risk minimiser with VC dimension d, then for any distribution D over X × {0,1}, if A is given access to at least O((d + log(1/δ))/ε²) data points sampled from D, then it will with probability at least 1 − δ learn a function whose (true) generalisation error (under D) is at most ε worse than the best function which A is able to express (in terms of its true generalisation error under D). If we assume that D corresponds to a function which A can express, then the generalisation error of A will with probability at least 1 − δ be at most ε.
This means that, in the limit of infinite data, A will with probability arbitrarily close to 1 learn a function whose error is arbitrarily close to the optimal value (among all functions which A is able to express). Thus, any empirical risk minimiser with a finite VC dimension will generalise well in the limit of infinite data.
I'm going to make a few comments as I read through this, but first I'd like to thank you for taking the time to write this down, since it gives me an opportunity to think through your arguments in a way I wouldn't have done otherwise.
Thank you for the detailed responses! I very much enjoy discussing these topics :)
My impression is that you tend to see this as a statement about flatness, holding over macroscopic regions of parameter space
My intuitions around the RLCT are very much geometrically informed, and I do think of it as being a kind of flatness measure. However, I don't think of it as being a "macroscopic" quantity, but rather, a local quantity.
I think the rest of what you say coheres with my current picture, but I will have to think about it for a bit, and come back later!
I have often said that SLT is not yet a theory of deep learning, this question of whether the infinite data limit is really the right one being among one of the main question marks I currently see.
Yes, I agree with this. I think my main objections are (1) the fact that it mostly abstracts away from the parameter-function map, and (2) the infinite-data limit.
My view is that the validity of asymptotics is an empirical question, not something that is settled at the blackboard.
I largely agree, though it depends somewhat on what your aims are. My point there was mainly that theorems about generalisation in the infinite-data limit are likely to end up being weaker versions of more general results from statistical and computational learning theory.
That's interesting, thank you for this!
Yes, I meant specifically on LW and in the AI Safety community! In academia, it remains fairly obscure.
I think this is precisely what SLT is saying, and this is nontrivial!
It is certainly non-trivial, in the sense that it takes many lines to prove, but I don't think it tells you very much about the actual behaviour of neural networks.
Note that loss landscape considerations are more important than parameter-function considerations in the context of learning.
One of my core points is, precisely, to deny this claim. Without assumptions about the parameter-function map, you cannot make inferences from the characteristics of the loss landscape to conclusions about generalisation behaviour, and understanding generalisation behaviour is crucial for understanding learning. (Unless you mean something like "convergence behaviour" when you say "in the context of learning", in which case I agree, but then you would consider generalisation to be outside the scope of learning.)
For example it's not clear in your example why f(x) = 0 is likely to be learned
My point is precisely that it is not likely to be learned, given the setup I provided, even though it should be learned.
Learning bias in a NN should most fundamentally be understood relative to the weights, not higher-order concepts like Kolmogorov complexity (though as you point out, there might be a relationship between the two).
There is a relationship between the two, and I claim that this relationship underlies the mechanism behind why neural networks work well compared to other learning machines.
The thing is, the "complexity of f" (your K(f)) is not a very meaningful concept from the point of view of a neural net's learning
If we want to explain generalisation in neural networks, then we must explain if and how their inductive bias aligns with our (human) priors. Moreover, our human priors are (in most contexts) largely captured by computational complexity. Therefore, we must somewhere, in some way, connect neural networks to computational complexity.
indeed, there is no way to explain why generalizable networks like modular addition still sometimes memorize without understanding that the two are very distinct
Why not? The memorising solution has some nonzero "posterior" weight, so you would expect it to be found with some frequency. Does the empirical frequency of this solution deviate far from the theoretical prediction?
including stuff Joar has worked on
That is right! See this paper.
which animals cannot do at all, they can't write computer code or a mathematical paper
This is not obvious to me (at least not for some senses of the word "could"). Animals cannot be motivated into attempting to solve these tasks, and they cannot study maths or programming. If they could do those things, then it is not at all clear to me that they wouldn't be able to write code or maths papers. To make this more specific; insofar as humans rely on a capacity for general problem-solving in order to do maths and programming, it would not surprise me if many animals also have this capacity to a sufficient extent, but that it cannot be directed in the right way. Note that animals even outperform humans at some general cognitive tasks. For example, chimps have a much better short-term memory than humans.
Moreover, we know a lot about human performance at those tasks, and it's abysmal, even for top humans, and for AI research as a field.
Abysmal, compared to what? Yes, we can see that it is abysmal compared to what would in principle be information-theoretically possible. However, this doesn't tell us very much about whether or not it is abysmal compared to what is computationally possible.
The problem of finding the minimal complexity hypothesis for a given set of data is not computationally tractable. For Kolmogorov complexity, it is uncomputable, but even for Boolean complexity, it is at least exponentially difficult (depending a bit on how exactly the problem is formalised). This means that in order to reason effectively about large amounts of data, it is (presumably) necessary to model most of it using low-fidelity methods, and then (potentially) use various heuristics in order to determine what pieces of information deserve more attention. I would therefore expect a "saturated" AI system to also frequently miss things that look obvious in hindsight.
So it seems that, at least, there is quite a bit of room for a large initial boost over the current human-equivalent capacity.
I agree that AI systems have many clear and obvious advantages, and that e.g. simply running them at a higher clock speed will give you a clear boost regardless of what assumptions we make about the "quality" of their cognition compared to that of humans. The question I'm concerned with is whether or not a takeoff scenario is better modeled as "AI quickly bootstraps to incomprehensible, Godlike intelligence through recursive self-improvement", or whether it is better modeled as "economic growth suddenly goes up by a lot". All the obvious advantages of AI systems are compatible with the latter.
So, the claim is (of course) not that intelligence is zero-one. We know that this is not the case, from the fact that some people are smarter than other people.
As for the other two points, see this comment and this comment.
So, this model of a takeoff scenario makes certain assumptions about how intelligence works, and these assumptions may or may not be correct. In particular, it assumes that the initial AI systems are very far from being algorithmically optimal. We don't know whether or not this will be the case; that is what I am trying to highlight.
The task of extracting knowledge from data is a computational task, which has a certain complexity-theoretic hardness. We don't know what that hardness is, but there is a lower bound on how efficiently this task can be done. Similarly for all the other tasks of intelligence (such as planning, etc).
Strong recursive self-improvement (given a fixed amount of resources) is only possible if the first AI systems are very far from being algorithmically optimal at all the relevant computational tasks. This is not a given; it could be true or false. For example, while you can optimise a SAT-solver in many ways, it will at the end of the day necessarily have a worst-case exponential runtime complexity (unless P = NP).
Therefore, the question of how much more intelligent AI systems will end up being compared to humans, depends on how close the human brain algorithm is to being (roughly) Pareto-optimal for all the relevant computational tasks. We don't know the answer to this question. Strong, sustained recursive self-improvement is only possible if our initial AGI algorithm, and the human brain, both are very far from being Pareto-optimal.
Is this the case? You could point to the discrepancy between humans and animals, and argue that this demonstrates that there are cognitive algorithms that yield vastly different results given similar amounts of resources (in terms of data and parameters). However, the argument I've written casts doubt on whether or not this evidence is reliable. Given that, I think the case is no longer so clear; perhaps the human neural architecture is (within some small-ish polynomial factor of being) Pareto optimal for most relevant cognitive tasks.
Now, even if the human brain algorithm is roughly optimal, AI systems will almost certainly still end up with vastly more cognitive force (because they can be run faster, and given more memory and more data). However, I think that this scenario is different in relevant ways. In particular, without (strong) recursive self-improvement, you probably don't get an uncontrollable, exponential growth in intelligence, but rather a growth that is bottle-necked by resources which you could control (such as memory, data, CPU cycles, etc).
I don't have any good evidence that humans raised without language per se are less intelligent (if we understand "intelligence" to refer to a general ability to solve new problems). For example, Genie was raised in isolation for the first 13 years of her life, and never developed a first language. Some researchers have, for various reasons, guessed that she was born with average intelligence, but that she, as a 14-year old, had a mental age "between a 5- and 8-year-old". However, here we have the confounding factor that she also was severely abused, and that she got very little mental stimulus in general for the first 13 years of her life, which would presumably obstruct mental development independently of a lack of language. This makes it hard to draw any strong conclusions (and we would regardless have a very small number of data points).
However, just to clarify, the argument I'm making doesn't crucially rely on the assumption that a human with language is significantly more intelligent than a human without language, but rather on the (presumably much less controversial) assumption that language is a significant advantage regardless of whether or not it is also paired with an increase in intelligence. For example, it would not surprise me if orangutans with language (but orangutan-level intelligence) over time would outcompete humans without language (but otherwise human-level intelligence). This, in turn, makes it difficult to infer how intelligent humans are compared to animals, based on what we have achieved compared to animals.
For example, one might say
"Humans have gone to space, but no other species is anywhere close to being able to do that. This proves that humans are vastly more intelligent than all other species."
However, without access to language, humans can't go to space either. Moreover, we don't know if orangutans would eventually be able to go to space if they did have access to language. This makes it quite non-trivial to make a direct comparison.
I think the broad strokes are mostly similar, but that a bunch of relevant details are different.
Yes, a large collective of near-human AI that is allowed to interact freely over a (subjectively) long period of time is presumably roughly as hard to understand and control as a Bostrom/Yudkowsky-esque God in a box. However, in this scenario, we have the option to not allow free interaction between multiple instances, while still being able to extract useful work from them. It is also probably much easier to align a system that is not of overwhelming intelligence, and this could be done before the AIs are allowed to interact. We might also be able to significantly influence their collective behaviour by controlling the initial conditions of their interactions (similarly to how institutions and cultural norms have a substantial long-term impact on the trajectory of a country, for example). It is also more plausible that humans (or human simulations or emulations) could be kept in the loop for a long time period in this scenario. Moreover, if intelligence is bottle-necked by external resources (such as memory, data, CPU cycles, etc) rather than internal algorithmic efficiency, then you can exert more control over the resulting intelligence explosion by controlling those resources. Etc etc.
Note that this proposal is not about automating interpretability.
The point is that you (in theory) don't need to know whether or not the uninterpretable AGI is safe, if you are able to independently verify its output (similarly to how I can trust a mathematical proof, without trusting the mathematician).
Of course, in practice, the uninterpretable AGI presumably needs to be reasonably aligned for this to work. You must at the very least be able to motivate it to write code for you, without hiding any trojans or backdoors that you are not able to detect.
However, I think that this is likely to be much easier than solving the full alignment problem for sovereign agents. Writing software is a myopic task that can be accomplished without persistent, agentic preferences, which means that the base system could be much more tool-like than the system which it produces.
But regardless of that point, many arguments for why interpretability research will be helpful also apply to the strategy I outline above.
- This is obviously true; any AI complete problem can be trivially reduced to the problem of writing an AI program that solves the problem. That isn't really a problem for the proposal here. The point isn't that we could avoid making AGI by doing this, the point is that we can do this in order to get AI systems that we can trust without having to solve interpretability.
- This is probably true, but the extent to which it is true is unclear. Moreover, if the inner workings of intelligence are fundamentally uninterpretable, then strong interpretability must also fail. I already commented on this in the last two paragraphs of the top-level post.
Yes, I agree with this. I mean, even if we assume that the AIs are basically equivalent to human simulations, they still get obvious advantages from the ability to be copy-pasted, the ability to be restored to a checkpoint, the ability to be run at higher clock speeds, and the ability to make credible pre-commitments, etc etc. I therefore certainly don't think there is any plausible scenario in which unchecked AI systems wouldn't end up with most of the power on earth. However, there is a meaningful difference between the scenario where their advantages mainly come from overwhelmingly great intelligence, and the scenario where their advantages mainly (or at least in large part) come from other sources. For example, scaleable oversight is a more realistic possibility in the latter scenario than it is in the former scenario. Boxing methods are also more realistic in the latter scenario than the former scenario, etc.
No, I don't have any explicit examples of that. However, I don't think that the main issue with GOFAI systems necessarily is that they have bad performance. Rather, I think the main problem is that they are very difficult and laborious to create. Consider, for example, IBM Watson. I consider this system to be very impressive. However, it took a large team of experts four years of intense engineering to create Watson, whereas you probably could get similar performance in an afternoon by simply fine-tuning GPT-2. However, this is less of a problem if you can use a fleet of LLM software engineers and have them spend 1,000 subjective years on the problem over the course of a weekend.
I also want to note that:
1. Some trade-off between performance and transparency is acceptable, as long as it is not too large.
2. The system doesn't have to be an expert system: the important thing is just that it's transparent.
3. If it is impossible to create interpretable software for solving a particular task, then strong interpretability must also fail.
To clarify, the proposal is not (necessarily) to use an LLM to create an interpretable AI system that is isomorphic to the LLM -- their internal structure could be completely different. The key points are that the generated program is interpretable and trustworthy, and that it can solve some problem we are interested in.
What is the exact derivation that gives you claim (1)?
Empirically, the inductive bias that you get when you train with SGD, and similar optimisers, is in fact quite similar to the inductive bias that you would get, if you were to repeatedly re-initialise a neural network until you randomly get a set of weights that yield a low loss. Which optimiser you use does have an effect as well, but this is very small by comparison. See this paper.
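As a toy illustration of the "repeatedly re-initialise until the training data is fit" procedure (the baseline that SGD is claimed to approximate in the referenced paper), here is a sketch; the task and architecture are arbitrary choices, and this is only a cartoon of the actual experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-4-1 network is asked to fit three points of a boolean function,
# and we look at the distribution of its guesses on the one unseen input.
X_train = np.array([[0, 0], [0, 1], [1, 0]], dtype=float)
y_train = np.array([0, 1, 1], dtype=float)
x_test = np.array([[1, 1]], dtype=float)   # the held-out input

def net(w, x):
    # tanh hidden layer, thresholded linear output
    h = np.tanh(x @ w["W1"] + w["b1"])
    return (h @ w["W2"] + w["b2"] > 0).astype(float)

def random_weights():
    return {"W1": rng.normal(size=(2, 4)), "b1": rng.normal(size=4),
            "W2": rng.normal(size=4), "b2": rng.normal()}

def sample_function_that_fits(max_tries=100_000):
    # Rejection sampling: re-initialise until the training set is fit exactly.
    for _ in range(max_tries):
        w = random_weights()
        if np.all(net(w, X_train) == y_train):
            return net(w, x_test)[0]
    raise RuntimeError("no fitting initialisation found")

guesses = [sample_function_that_fits() for _ in range(200)]
print("P(f(1, 1) = 1 | network fits the training data) ~", np.mean(guesses))
```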
""
The kinds of humans that we are worried about are the kinds of humans that can do original scientific research and autonomously form plans for taking over the world. Human brains learn to take actions and plans that previously led to high rewards (outcomes like eating food when hungry, having sex, etc)*. These two things are fundamentally not the same thing. Why, exactly, would we expect that a system that is good at the latter necessarily would be able to do the former?
""
This feels like a bit of a digression, but we do have concrete examples of systems that are good at eating food when hungry, having sex, and so on, without being able to do original scientific research and autonomously form plans for taking over the world: animals. And the difference between humans and animals isn't just that humans have more training data (or even that we are that much better at survival and reproduction in the environment of evolutionary adaptation). But I should also note that I consider argument 6 to be one of the weaker arguments I know of.
""
We know, from computer science, that it is very powerful to be able to reason in terms of variables and operations on variables. It seems hard to see how you could have human-level intelligence without this ability. However, humans do not typically have this ability, with most human brains instead being more analogous to Boolean circuits, given their finite size and architecture of neuron connections.
""
The fact that human brains have a finite size and architecture of neuron connections does not mean that they are well-modelled as Boolean circuits. For example, a (real-world) computer is better modelled as a Turing machine than as a finite-state automaton, even though there is a sense in which they actually are finite-state automata.
The brain is made out of neurons, yes, but it matters a great deal how those neurons are connected. Depending on the answer to that question, you could end up with a system that behaves more like Boolean circuits, or more like a Turing machine, or more like something else.
With neural networks, the training algorithm and the architecture together determine how the neurons end up connected, and therefore whether the resulting system is better thought of as a Boolean circuit, a Turing machine, or something else. If the wiring of the brain is determined by a different mechanism than what determines the wiring of a deep learning system, then the two systems could end up with very different properties, even if they are made out of similar kinds of parts.
With the brain, we don't know what determines the wiring. This makes it difficult to draw strong conclusions about the high-level behaviour of brains from their low-level physiology. With deep learning, it is easier to do this.
""
I find it hard to make the argument here because there is no argument -- it's just flatly asserted that neural networks don't use such representations, so all I can do is flatly assert that humans don't use such representations. If I had to guess, you would say something like "matrix multiplications don't seem like they can be discrete and combinatorial", to which I would say "the strength of brain neuron synapse firings doesn't seem like it can be discrete and combinatorial".
""
What representations you end up with does not just depend on the model space, it also depends on the learning algorithm. Matrix multiplications can be discrete and combinatorial. The question is if those are the kinds of representations that you in fact would end up with when you train a neural network by gradient descent, which to me seems unlikely. The brain does (most likely) not use gradient descent, so the argument does not apply to the brain.
""
Do you perhaps agree that you would have a hard time navigating in a 10-D space? Clearly you have simply memorized a bunch of heuristics that together are barely sufficient for navigating 3-D space, rather than truly understanding the underlying algorithm for navigating spaces.
""
It would obviously be harder for me to do this, and narrow heuristics are obviously an important part of intelligence. But I could do it, which suggests a stronger transfer ability than what would be suggested if I couldn't do this.
""
In some other parts, I feel like in many places you are being one-sidedly skeptical.
""
Yes, as I said, my goal with this post is not to present a balanced view of the issue. Rather, my goal is just to summarise as many arguments as possible for being skeptical of strong scaling. This makes the skepticism one-sided in some places.