The Learning-Theoretic AI Alignment Research Agenda

post by Vanessa Kosoy (vanessa-kosoy) · 2018-07-04T09:53:31.000Z

Contents

  Philosophy
    The importance of rigor
    Metaphilosophy and the role of models
    Intelligence is understandable
    Value alignment is understandable
  Foundations
    Statistical Learning Theory
    Computational Learning Theory
    Algorithmic Information Theory
    Towards a rigorous definition of intelligence
  Research Programme Outline
    Universal reinforcement learning
    Value learning protocols
    Taming daemons
    Recursive self-improvement
  Summary

In this essay I will try to explain the overall structure and motivation of my AI alignment research agenda. The discussion is informal and no new theorems are proved here. The main features of my research agenda, as I explain them here, are


Philosophy

In this section I explain the key principles and assumptions that motivate my research agenda.

The importance of rigor

I believe that the solution to AI alignment must rely on a rigorous mathematical theory. The algorithms that comprise the solution must be justified by formal mathematical properties. All mathematical assumptions should be either proved or at least backed by considerable evidence, like the prominent conjectures of computational complexity theory. This needs to be the case because:

That said, I do expect the ultimate solution to have aspects that are not entirely rigorous, specifically:

Metaphilosophy and the role of models

In order to use mathematics to solve a real-world problem, a mathematical model of the problem must be constructed. When the real-world problem can be defined in terms of data that is observable and measurable, the validity of the mathematical model can be ascertained using the empirical method. However, AI alignment touches on problems that are philosophical in nature, meaning that there is still no agreed-upon empirical or other criterion for evaluating an answer. Dealing with such problems requires a metaphilosophy: a way of evaluating answers to philosophical questions.

Although I do not claim a fully general solution to metaphilosophy, I think that, pragmatically, a quasiscientific approach is possible. In science, we prefer theories that are (i) simple (Occam's razor) and (ii) fit the empirical data. We also test theories by gathering further empirical data. In philosophy, we can likewise prefer theories that are (i) simple and (ii) fit intuition in situations where intuition feels reliable (i.e. situations that are simple, familiar or received considerable analysis and reflection). We can also test theories by applying them to new situations and trying to see whether the answer becomes intuitive after sufficient reflection.

Moreover, I expect progress on most problems to be achieved by means of successive approximations. This means that we start with a model that is grossly oversimplified but that already captures some key aspects of the problem. Once we have a solution within this model, we can start to attack its assumptions and arrive at a new, more sophisticated model. This process should repeat until we arrive at a model that (i) has no obvious shortcomings and that (ii) we seem unable to improve despite our best efforts.

As in science, we can never be certain that a theory is true. Any assumption or model can be questioned. This requires striking a balance between complacency and excessive skepticism. To avoid complacency, we need to keep working to find better theories. To avoid excessive skepticism, we should entertain hypotheses honestly and acknowledge when a theory is already capable of passing non-trivial quasiscientific tests. Reaching agreement is harder work (because our tests rely on intuition, which may vary from individual to individual), but we should not despair of that goal.

Intelligence is understandable

It is possible to question whether a mathematical theory of intelligence is possible at all. After all, we don't expect to have a tractable mathematical theory of Rococo architecture, or a simple equation describing the shape of the coastline of Africa in the year 2018.

The key difference is that intelligence is a natural concept. Intelligence, the way I use this word in the context of AI alignment, is the ability of an agent to make choices in a way that effectively promotes its goals, in an environment that is not entirely known or even not entirely knowable. Arguing over the meaning of the word would be a distraction: this is the meaning relevant to AI alignment, because the entire concern of AI alignment is about agents that effectively pursue their goals, undermining the conflicting goals of the human species. Moreover, intelligence is (empirically) a key force in determining the evolution of the physical universe.

I conjecture that natural concepts have useful mathematical theories, and this conjecture seems to me supported by evidence in natural and computer science. It would be nice to have this conjecture itself follow from a mathematical theory, but this is outside of my current scope. Also, we already have some progress towards a mathematical theory of intelligence (I will discuss it in the next section).

A related question is whether it is possible to design an algorithm for strong AI based on simple mathematical principles, or whether any strong AI will inevitably be an enormous kludge of heuristics designed by trial and error. I think that we have some empirical support for the former, given that humans evolved to survive in a certain environment but succeeded in using their intelligence to solve problems in very different environments. That said, I am less confident about this than about the previous question. In any case, having a mathematical theory of intelligence should allow us to resolve this question too, whether positively or negatively.

Value alignment is understandable

The core of AI alignment is reliably transferring human values to a strong AI. However, the problem of defining what we mean by "human values" is a philosophical problem. A common and natural model of "values" is expected utility maximization: this is what we find in game theory and economics, and this is supported by VNM and Savage theorems. However, as often pointed out, humans are not perfectly rational, therefore it's not clear in what sense they can be said to maximize the expectation of a specific utility function.

Nevertheless, I believe that "values" is also a natural concept. Denying the concept of "values" altogether is tantamount to nihilism, and in such a belief system there is no reason to do anything at all, including saving yourself and everyone else from a murderous AI. Admitting the general concept of "values" as something complex and human-specific (despite the focus on "values" rather than "human values") seems implausible, since intuitively we can easily imagine alien minds facing a similar AI alignment problem. Moreover, the concept of "values" is part and parcel of the concept of "intelligence", so if we believe that "intelligence" (due to its importance in shaping the physical world) is a natural concept, then so are "values".

Therefore, I conjecture that there is a simple mathematical theory of imperfect rationality, within which the concept of "human values" is well-defined modulo the (observable, measurable) concept of "humans". Some speculation on what this theory looks like appears in the following sections.

Now, that doesn't mean that "human values" are perfectly well-defined, any more than, for example, the center of mass of the sun is perfectly well-defined (which would require deciding exactly which particles are considered part of the sun). However, just as the center of mass of the sun is sufficiently well-defined for many practical purposes in astrophysics, the concept of "human values" should be sufficiently well-defined for designing an aligned AGI. To the extent alignment remains ambiguous, the resolution of these ambiguities doesn't have substantial moral significance.

Foundations

In this section I briefly explain the mathematical tools with which I set out to study AI alignment, and the outline of the mathematical theory of intelligence that these tools have already painted.

Statistical Learning Theory

Statistical learning theory studies the information-theoretic constraints on various types of learning tasks, answering questions such as, when is a learning task solvable at all, and how much training data is required to solve the learning task within given accuracy (sample complexity). Learning tasks can be broadly divided into:

It is the last type of learning tasks, in particular assuming no resets or episodes, that is the most relevant for studying intelligence in the relevant sense. Indeed, the abstract setting of reinforcement learning is a good formalization for the informal definition of intelligence we had before. Note that the name "reward" might be misleading: this is not necessarily a signal received from outside, but can just as easily be some formally specified mathematical function.

In online learning and reinforcement learning, the theory typically aims to derive upper and lower bounds on "regret": the difference between the expected utility received by the algorithm and the expected utility it would receive if the environment were known a priori. Such an upper bound is effectively a performance guarantee for the given algorithm. In particular, if the reward function is assumed to be "aligned" then this performance guarantee is, to some extent, an alignment guarantee. This observation is not vacuous, since the learning protocol might be such that the true reward function is not directly available to the algorithm, as exemplified by DIRL and DRL. Thus, formally proving alignment guarantees takes the form of proving appropriate regret bounds.
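To make the notion of regret concrete, here is a minimal sketch for a stochastic multi-armed bandit, a far simpler setting than the general reinforcement learning discussed above. The function name `expected_regret` and the example numbers are illustrative assumptions, not taken from any result cited in this essay.

```python
def expected_regret(arm_means, pulls):
    """Expected regret of a fixed sequence of pulls in a stochastic bandit.

    arm_means: expected reward of each arm (this plays the role of the environment).
    pulls: indices of the arms the algorithm actually chose, one per time step.
    """
    # Per-step expected utility if the environment were known a priori.
    best = max(arm_means)
    # Each step contributes the gap between the best arm and the chosen arm.
    return sum(best - arm_means[a] for a in pulls)


# Hypothetical environment and behaviour: the algorithm explores arm 0 twice,
# then settles on the better arm 1.
means = [0.5, 0.8]
history = [0, 0, 1, 1, 1]
regret = expected_regret(means, history)
```

Only the two exploratory pulls contribute to the total here. A regret bound for a real algorithm would control the expectation of this quantity over the algorithm's own randomness as well.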

Computational Learning Theory

In addition to information-theoretic considerations, we have to take into account considerations of computational complexity. Thus, after deriving information-theoretic regret bounds, we should continue to refine them by constraining our algorithms to be computationally feasible (which typically means running in polynomial time, but we may also need to consider stronger restrictions, such as restrictions on space complexity or parallelizability). If we consider Bayesian regret (i.e. the expected value of regret w.r.t. some prior on the environments), this effectively means we are dealing with average-case complexity. Note that imposing computational constraints on the agent implies bounded reasoning / non-omniscience and already constitutes a departure from "perfect rationality" in a certain sense.

More precisely, it is useful to differentiate between at least two levels of computational feasibility (see also this related essay by Alex Appel). On the first level, which I call "weakly feasible", we allow the computing time to scale polynomially with the number of hypotheses we consider, or exponentially with the description length of the correct hypothesis (these two are more or less interchangeable since the number of hypotheses of a given description length is exponential in this length). Thus, algorithms like Levin's universal search or Solomonoff induction over programs with polynomial time complexity, or Posterior Sampling Reinforcement Learning with a small number of hypotheses fall into this category. On the second level, which I call "strongly feasible", we require polynomial computing time for the "full" hypothesis space. At present, we only know how to achieve theoretical guarantees on this second level in narrow contexts, such as reinforcement learning with a small state space (i.e. with number of states polynomial in the security parameter).
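The scaling behind "weak feasibility" can be illustrated with a toy learner that enumerates hypotheses in order of description length. This is only a sketch under assumed conventions: hypotheses are periodic bit patterns, the description length is the pattern length, and `shortest_consistent_pattern` is a hypothetical name. It is not Levin search or Solomonoff induction.

```python
from itertools import product


def shortest_consistent_pattern(data, max_len):
    """Brute-force search for the shortest periodic bit pattern matching data.

    Returns (pattern, hypotheses_checked), or (None, hypotheses_checked) if no
    pattern of length <= max_len reproduces the data.
    """
    checked = 0
    for length in range(1, max_len + 1):
        # There are 2**length hypotheses of this description length.
        for pattern in product((0, 1), repeat=length):
            checked += 1
            if all(bit == pattern[i % length] for i, bit in enumerate(data)):
                return pattern, checked
    return None, checked


observations = [0, 1, 1, 0, 1, 1, 0]
pattern, checked = shortest_consistent_pattern(observations, max_len=5)
```

The work done is at most linear in the number of hypotheses enumerated, which is itself exponential in the length of the correct pattern. That is the trade-off the weakly feasible level permits.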

In fact, the current gap in our theoretical understanding of deep learning is strongly related to the gap between weak and strong feasibility. Indeed, results about expressiveness and (statistical) learnability of neural networks are well known; however, exact learning of neural networks is NP-complete in the general case. Understanding how this computational barrier is circumvented in practical problems is a key challenge in understanding deep learning. Such understanding would probably be a positive development in terms of AI alignment (although it might also contribute to increasing AI capacity), but I don't think it's a high priority problem since it seems to already receive considerable attention in mainstream academia (i.e. it is not neglected).

I believe that the development of AI alignment theory should proceed by prioritizing information-theoretic analysis first, complexity-theoretic analysis in the sense of weak feasibility second, and complexity-theoretic analysis in the sense of strong feasibility last. That said, we should keep the complexity-theoretic considerations in mind, and strive to devise solutions that at least seem feasible modulo "miracles" similar to deep learning (i.e. modulo intractable problems that are plausibly tractable in realistic special cases). Moreover, certain complexity-theoretic considerations are already implicit in the choice of the space of hypotheses for your learning problem (e.g. Solomonoff induction has to be truncated to polynomial-time programs to be even weakly feasible). In particular, we should keep in mind that the hypotheses must be computationally simpler than the agent itself, whereas the universe must be computationally more complex than the agent itself. More on resolving this apparent paradox later.

Algorithmic Information Theory

The choice of hypothesis space plays a crucial role in any learning task, and the choice of prior plays a crucial role in Bayesian reinforcement learning. In narrow AI this choice is based entirely on the prior knowledge of the AI designers about the problem. On the other hand, general AI should be able to learn its environment with little prior knowledge, by noticing patterns and using Occam's razor. Indeed, the latter is the basis of epistemic rationality to the best of our understanding. The Solomonoff measure is an elegant formalization of this idea.
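As a toy illustration of learning with a simplicity prior, the sketch below predicts the next bit of a sequence using a Bayesian mixture over periodic patterns. The hypothesis class, the weight 2^(-length) and the name `predict_next_bit` are assumptions made for the example. This is a crude finite stand-in, not the Solomonoff measure.

```python
from itertools import product


def predict_next_bit(data, max_len):
    """Posterior probability that the next bit is 1 under a simplicity prior.

    Hypotheses are periodic bit patterns of length 1..max_len. A pattern of
    length n gets prior weight 2**-n (an assumed stand-in for a
    description-length prior). Hypotheses are deterministic, so the posterior
    simply keeps the prior weights of the patterns consistent with the data.
    """
    weight_one = 0.0
    weight_total = 0.0
    for length in range(1, max_len + 1):
        for pattern in product((0, 1), repeat=length):
            if all(bit == pattern[i % length] for i, bit in enumerate(data)):
                w = 2.0 ** -length
                weight_total += w
                # The bit this hypothesis predicts at the next position.
                if pattern[len(data) % length] == 1:
                    weight_one += w
    if weight_total == 0.0:
        raise ValueError("no hypothesis in the class fits the data")
    return weight_one / weight_total


p_one = predict_next_bit([0, 1, 0], max_len=3)
```

Two patterns fit the data: the period-2 pattern (0, 1), which predicts 1, and the period-3 pattern (0, 1, 0), which predicts 0. The shorter one carries twice the weight, so the mixture leans towards 1.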

However, Solomonoff induction is incomputable, so a realistic agent would have to use some truncated form of it, for example by bounding the computational resources made available to the universal Turing machine. It thus becomes an important problem to find a natural prior such that:

Towards a rigorous definition of intelligence

The combination of perfect Bayesian reinforcement learning and the Solomonoff prior is known as AIXI. AIXI may be regarded as a model of ideal intelligence, but there are several issues that were argued to be flaws in this concept:

The Cartesian duality problem and the traps problem are actually strongly related. Indeed, one can model any event that destroys the agent (including modifying its source code) as the transition of the environment into some inescapable state. Such a state should be assigned a reward that corresponds to the expected utility of the universe going on without the agent. However, it's not obvious how the agent can learn to anticipate such states, since observing such a state once eliminates any chance of using this knowledge later. DRL already partially addresses this problem: more discussion in the next section.

Solving irreflexivity requires going beyond the Bayesian paradigm by including models that don't fully specify the environment. More details in the next section.

Finally, the decision-theoretic paradoxes are a more equivocal issue than it seems, because the usual philosophical way of thinking about decision theory assumes that the model of the environment is given, whereas in our way of thinking, the model is learned. This is important: for example, if AIXI is placed in a repeated Newcomb's problem, it will learn to one-box, since its model will predict that one-boxing causes the money to appear inside the box. In other words, AIXI might be regarded as a CDT, but the learned "causal" relationships are not the same as physical causality. Formalizing other Newcomb-like problems requires solving irreflexivity first, because the environment contains Omega which cannot be simulated by the agent. Therefore, my current working hypothesis is that decision theory will be mostly solved (or dissolved) by

To sum up, clarifying all of these issues should result in formulating a certain optimality condition (regret bound) which may be regarded as a rigorous definition of intelligence. This would also constitute progress towards defining "values" (having certain values means being intelligent w.r.t. these values), but the latter might require making the definition even more lax. More on that later.
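Returning to the repeated Newcomb's problem mentioned above, the following toy sketch shows how a learner that only tracks the empirical payoff of each action ends up one-boxing. It is a drastic simplification and not AIXI. The perfect predictor, the payoff amounts and the try-each-action-once rule are all assumptions made for the example.

```python
def repeated_newcomb(rounds):
    """Play a repeated Newcomb's problem with a simple empirical learner.

    Assumptions (not from the essay): the predictor is perfect, the opaque box
    holds 1_000_000 when one-boxing is predicted and nothing otherwise, the
    transparent box always holds 1_000, and the learner tries each action once
    before acting greedily on the average payoff it has observed.
    """
    # With a perfect predictor, the payoff is a function of the action alone.
    payoff = {"one-box": 1_000_000, "two-box": 1_000}
    totals = {"one-box": 0, "two-box": 0}
    counts = {"one-box": 0, "two-box": 0}
    choices = []
    for _ in range(rounds):
        untried = [a for a in counts if counts[a] == 0]
        if untried:
            action = untried[0]
        else:
            action = max(counts, key=lambda a: totals[a] / counts[a])
        totals[action] += payoff[action]
        counts[action] += 1
        choices.append(action)
    return choices


choices = repeated_newcomb(10)
```

After one trial of each action the learner's model associates one-boxing with the larger payoff, regardless of what physical causality says about the contents of the box at decision time.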

Research Programme Outline

In this section I break down the research programme into different domains and subproblems. The list below is not intended to be a linear sequence. Indeed, many of the subproblems can be initially attacked in parallel, but also many of them are interconnected and progress in one subproblem can be leveraged to produce a more refined analysis of another. Any concrete plan I have regarding the order with which these questions should be addressed is liable to change significantly as progress is made. Moreover, I expect the entire breakdown to change as progress is made and new insights are available. However, I do believe that the high-level principles of the approach have a good chance of surviving, in some form, into the future.

Universal reinforcement learning

The aim of this part in the agenda is deriving regret bounds or other performance guarantees for certain settings of reinforcement learning that are simultaneously strong enough and general enough to serve as a compelling definition / formalization of the concept of general intelligence. In particular, this involves solving the deficiencies of AIXI that were pointed out in the previous section.

I believe that a key step towards this goal is solving the problem of "irreflexivity". That is, we need to define a form of reinforcement learning in which the agent achieves reasonable performance guarantees despite an environment which is as complex as, or more complex than, the agent itself. My previous attempts to make progress towards that goal include minimax forecasting and dominant forecasters for incomplete models. There, the aim was passive forecasting rather than reinforcement learning.

The idea of minimax forecasting can be naturally extended to reinforcement learning. Environments in reinforcement learning naturally form a convex set $\mathcal{E}$ in some topological vector space (where convex linear combinations correspond to probabilistic mixtures). Normally, models are points of $\mathcal{E}$, i.e. specific environments. Instead, we can consider incomplete models which are non-empty convex subsets of $\mathcal{E}$. Instead of considering $\mathrm{EU}(\pi, \mu)$, the expected utility of policy $\pi$ interacting with environment $\mu$, we can consider $\mathrm{EU}(\pi, M) := \inf_{\mu \in M} \mathrm{EU}(\pi, \mu)$, where $M \subseteq \mathcal{E}$ is an incomplete model: the minimal guaranteed expected utility of $\pi$ for environments compatible with the incomplete model $M$. We can define a set $\mathcal{H}$ of incomplete models to be learnable when there is a metapolicy $\pi^*$ s.t. for any $M \in \mathcal{H}$

Here, $\gamma$ is the time discount parameter. Notably, this setting satisfies the analogue of the universality property of Bayes-optimality (see "Proposition 1" in this essay). Here, the role of the Bayes-optimal policy is replaced by the policy

Here, is the "incomplete prior" corresponding to some :

Moreover, it is possible to define an incomplete analogue of MDPs. These are stochastic games, where the choices of the opponent correspond to the "Knightian uncertainty" of the incomplete model. Thus, it is natural to try and derive regret bounds for learning classes of such incomplete MDPs. In fact, this theory might justify the use of finite (or other restricted) MDPs which is common in RL and is needed for deriving most regret bounds. Indeed, there is no reason why physical reality should be a finite MDP; however, this does not preclude us from using a finite stochastic game as an incomplete model of reality. In particular, an infinite MDP (and thus also a POMDP, since a POMDP can be reduced to an MDP whose states are belief states = probability measures on the state space of the POMDP) can be approximated by a finite stochastic game by partitioning its state space into a finite number of "cells" and letting the opponent choose the exact state inside the cell upon each transition.
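The guaranteed expected utility on an incomplete model, as described above, can be sketched in a few lines for a finite toy case. The assumptions here are mine: the incomplete model is the convex hull of finitely many environments, only pure policies are compared, and the utility table and all names are hypothetical.

```python
def guaranteed_utility(policy, model, expected_utility):
    """Worst-case expected utility of `policy` over the environments in `model`.

    `model` lists the extreme points of a convex set of environments. For a
    fixed policy, expected utility is linear in the environment mixture, so the
    infimum over the whole convex set is attained at one of these points.
    """
    return min(expected_utility[policy][env] for env in model)


def maximin_policy(policies, model, expected_utility):
    """Policy with the best guaranteed utility on the incomplete model."""
    return max(policies, key=lambda p: guaranteed_utility(p, model, expected_utility))


# Hypothetical utilities EU[policy][environment], all in [0, 1].
EU = {
    "cautious": {"env_a": 0.6, "env_b": 0.5},
    "bold": {"env_a": 0.9, "env_b": 0.1},
}
model = ["env_a", "env_b"]
best = maximin_policy(list(EU), model, EU)
```

The "bold" policy does better in one environment, but the incomplete model does not say which environment is real, so the comparison is made on the guaranteed value and "cautious" wins.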

It is possible to generalize this setting further by replacing "crisp" sets of environments by fuzzy sets. That is, we can define a "fuzzy model" to be a function (the membership function) s.t. is non-empty. The performance of a policy on the model is then given by

Note that is assumed to take values in , so no with can affect the above value.

This generalization allows capturing a broad spectrum of performance guarantees. For example, given any policy we can define by

Then, learning the model amounts to learning to perform at least as well as , whatever the environment is. Thus, the setting of "fuzzy reinforcement learning" might be regarded as a hybrid of model-based and model-free approaches.
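One way to make the fuzzy setting concrete, offered only as an assumed reading, is to penalise each environment by one minus its degree of membership and take the worst case. This formula is my assumption. It is chosen because it reduces to the crisp worst case for 0/1 membership, and because it agrees with the remark above that environments of membership zero cannot affect the value when utilities lie in [0, 1]. All names and numbers below are hypothetical.

```python
def fuzzy_performance(policy, membership, expected_utility):
    """Performance of `policy` on a fuzzy model, under an assumed formula.

    Assumed formula: the minimum over environments of
    EU(policy, env) + 1 - membership(env), with EU and membership in [0, 1].
    """
    return min(
        expected_utility[policy][env] + 1.0 - degree
        for env, degree in membership.items()
    )


# Hypothetical expected utilities of a single policy in three environments.
EU = {"pi": {"env_a": 0.7, "env_b": 0.2, "env_c": 0.0}}

# A crisp model: env_a and env_b are fully included, env_c is excluded.
crisp = {"env_a": 1.0, "env_b": 1.0, "env_c": 0.0}
# A fuzzy model: env_b is only partially plausible.
fuzzy = {"env_a": 1.0, "env_b": 0.4, "env_c": 0.0}

crisp_value = fuzzy_performance("pi", crisp, EU)
fuzzy_value = fuzzy_performance("pi", fuzzy, EU)
```

In the crisp case the value is the plain worst case over the included environments. Lowering the membership of `env_b` relaxes the demand that the policy perform well there, which raises the guaranteed value.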

One test for any theory attempting to solve irreflexivity is whether it leads to reasonable game-theoretic solution concepts in multi-agent scenarios. For example, it is obvious that incomplete models lead to Nash equilibria in zero-sum games (an incomplete model is a zero-sum game, in some sense), but the situation in more general games is currently unknown. Another sort of test is applying the theory to Newcomb-like decision-theoretic puzzles, although solving all of them might require additional elements, such as self-modification. Further applications of such a theory which may also be regarded as tests will appear in the next subsection.

Next, the problem of traps has to be addressed. DRL partially solves this problem by postulating an advisor that has prior knowledge about the traps. It seems reasonable to draw a parallel between this and real-world human intelligence: humans learn from previous generations regarding the dangers of their environment. In particular, children seem like a salient example of an algorithm employing a lot of exploration while trusting a different agent (the parent) to prevent it from falling into traps. However, from a different perspective, this seems like hiding the difficulty in a different place. Namely, if we consider the whole of humanity as an intelligent agent (which seems a legitimate model at least for the purposes of this particular issue), then how did it avoid traps? To some extent, we can claim that human DNA is another source for prior knowledge, acquired by evolution, but somewhere this recursion must come to an end.

One hypothesis is that the main way humanity avoids traps is by happening to exist in a relatively favorable environment and knowing this fact, on some level. Specifically, it seems rather difficult for a single human or a small group to pursue a policy that will lead all of humanity into a trap (incidentally, this hypothesis doesn't reflect optimistically on our chances to survive AI risk), and also rather rare for many humans to coordinate on simultaneously exploring an unusual policy. Therefore, human history may be very roughly likened to episodic RL where each human life is an episode.

This mechanism should be formalized using the ideas of quantilal control. The baseline policy comes from the prior knowledge / advisor, and the allowed deviation (some variant of Renyi divergence) from the baseline policy is chosen according to the prior assumption about the rate of falling into a trap while following the baseline policy. This should lead to an appropriate regret bound.
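The idea of limiting deviation from a baseline policy can be illustrated with a plain single-step quantilizer, which I am substituting here as a simpler relative of quantilal control. The sketch conditions the baseline distribution on its top-q fraction by utility. The resulting distribution never exceeds 1/q times the baseline probability, so its Rényi divergence of order infinity from the baseline is at most log(1/q). The function names and numbers are hypothetical.

```python
import math


def quantilize(base_probs, utilities, q):
    """Condition the baseline distribution on its top-q fraction by utility.

    Actions are taken in order of decreasing utility until probability mass q
    of the baseline is used up. The action on the boundary is included only
    partially. The result never exceeds base_probs[a] / q for any action a.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    order = sorted(range(len(base_probs)), key=lambda a: utilities[a], reverse=True)
    remaining = q
    result = [0.0] * len(base_probs)
    for a in order:
        mass = min(base_probs[a], remaining)
        result[a] = mass / q
        remaining -= mass
        if remaining <= 0.0:
            break
    return result


def max_log_ratio(new_probs, base_probs):
    """Renyi divergence of order infinity: log of the largest probability ratio."""
    return math.log(max(n / b for n, b in zip(new_probs, base_probs) if n > 0.0))


baseline = [0.25, 0.25, 0.25, 0.25]
utility = [0.1, 0.9, 0.5, 0.3]
policy = quantilize(baseline, utility, q=0.5)
divergence = max_log_ratio(policy, baseline)
```

Choosing q in line with the assumed rate at which the baseline falls into traps bounds how much more often the modified policy can do so: any event becomes at most 1/q times more likely than under the baseline.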

I think that another important step towards universal RL is deriving regret bounds that exploit structural hierarchies. This builds on the intuition that, although the real world is very complex and diverse, the presence of structural hierarchies seems like a nearly universal feature. Indeed, it is arguable that we would never have reached our current level of understanding of physics if there were no separation of scales that allowed studying the macroscopic world without knowing string theory et cetera. I see 3 types of hierarchies that need to be addressed, together with their mutual interactions:

In particular, I expect these hierarchies to yield regret bounds which do not have the "trial and error" form of most known regret bounds. That is, known regret bounds imply a sample complexity that is a large multiple of either the reset time (for RL with resets) or the mixing time (for RL without resets). This seems unsatisfactory: a model-based learner should be able to extrapolate its knowledge forward without waiting for a full "cycle" of environment response. Certainly we expect an artificial superintelligence to achieve a pivotal event from the first attempt, in some sense.

Also, the hierarchies should bridge at least part of the gap between weak and strong feasibility. Indeed, many of the successes of deep learning were based on CNNs and Boltzmann machines which seem to be exploiting the spatial hierarchy.

Returning to the issue of traps, there might be some sense in which our environment is "favorable" which is more sophisticated than the discussion before and which may be formalized using hierarchies (e.g. early levels of the information hierarchy can be learned safely and late levels only contain traps predictable by the early levels).

Finally, as discussed in the previous section, defining the correct universal prior and analyzing its properties is crucial to complete the theory. Given the hypotheses put forth in this section, this prior should be

Value learning protocols

The aim of this part in the agenda is developing learning setups that allow one agent (the AI) to learn the values of a different agent or group of agents (humans). This involves directly or indirectly tackling the question of what it means for an agent to have particular values if it is imperfectly rational and possibly vulnerable to manipulation or other forms of "corruption".

At present, I conceive of the following possible basic mechanisms for value learning:

Formal communication is difficult because human values are complicated and describing them precisely is hard. A manual reward signal is more realistic than a full specification, but:

Overall, it might be more realistic to rely on formal communication for tasks of limited scope (putting a strawberry on a plate) rather than actually learning human values in full (i.e. designing a sovereign). However, it is also possible to combine several mechanisms in a single protocol, and formal communication might be only one of them.

The problem of corruption may be regarded as a special case of the problem of traps (the latter was outlined in the previous section), if we assume that the agent is expected to achieve its goals without entering corrupt states. Delegative Reinforcement Learning aims to solve both problems by occasionally passing control to the human operator ("advisor"), and using it to learn which actions are safe. The analysis of DRL that I produced so far can and should be improved in multiple ways:

There is another issue with DRL that is worth discussing, although I am not sure whether it calls for a formal analysis soon. So far, we assumed that there are no side effects on the environment from the act of delegation itself. That is, the same action has exactly the same results whether carried out by the advisor or by the agent. Obviously, this is not realistic since any physical isolation layer created to ensure this will not be entirely fool-proof (as a bare minimum, the advisor emself will remember which actions ey took). The sole exception is, perhaps, if both the agent and the advisor are programs running inside a homomorphic cryptography box. More generally, any RL setup ignores the indirect (i.e. not mediated by actions) side-effects that the execution of the agent's algorithm has on the environment (although it is more realistic to solve this latter problem by homomorphic cryptography). This issue seems solvable via the use of incomplete/fuzzy models (see previous subsection). Although the true physical environment does have side effects as above, the model the agent tries to learn may ignore those side-effects (i.e. subsume them in the "Knightian uncertainty"). Similar remarks apply to the use of a source of randomness inside the algorithm I analyzed (a form of Posterior-Sampling Reinforcement Learning) that is assumed to be invisible to the environment (although it is also possible to use deterministic algorithms instead: for example, the Bayes-optimal policy is deterministic and necessarily satisfies the same Bayesian regret bound, although it is also not even weakly feasible). One caveat is the possibility of non-Cartesian daemons, defined and discussed in the next subsection.

The demonstration mechanism avoids some of the difficulties with formal communication, but has its own drawbacks. The ability to demonstrate a certain preference is limited by the ability to satisfy this preference. For example, suppose I am offered to play against Kasparov for money: if I win the game, I win $100 and if I lose the game, I lose $100. Then, I will refuse the bet because I know that I have little chance of winning. On the other hand, an AI might be able to win against Kasparov, but, seeing my demonstration, it will remain uncertain whether I avoided the game because I'm afraid to lose or because of some other reason (for example, maybe I don't want to have more money, or maybe there is something intrinsically bad about playing chess). Therefore, it seems hard to produce a performance guarantee which will imply successfully learning the human's preferences and significantly outperforming the human in satisfying these preferences. In particular, the regret bound I currently have for Delegative Inverse Reinforcement Learning assumes that the "advisor" (the human) already takes the optimal action with maximal likelihood among all actions on any given time step.
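The ambiguity in the Kasparov example can be spelled out with a small sketch: several quite different hypotheses about the human's beliefs and values all predict the same observed refusal. The hypotheses, the numbers and the function `accepts_bet` are illustrative assumptions.

```python
def accepts_bet(win_probability, utility_of_money, chess_utility=0.0, stake=100.0):
    """Whether an expected-utility maximiser accepts the bet under a hypothesis.

    utility_of_money scales how much each dollar is worth to the human.
    chess_utility is the intrinsic (dis)utility of playing the game at all.
    """
    expected_dollars = win_probability * stake - (1.0 - win_probability) * stake
    return utility_of_money * expected_dollars + chess_utility > 0.0


# Hypothetical explanations of the human's behaviour.
hypotheses = {
    "likes money, expects to lose": dict(win_probability=0.01, utility_of_money=1.0),
    "expects to win, indifferent to money": dict(win_probability=0.9, utility_of_money=0.0),
    "expects to win, dislikes chess": dict(
        win_probability=0.9, utility_of_money=1.0, chess_utility=-500.0
    ),
    "likes money, expects to win": dict(win_probability=0.9, utility_of_money=1.0),
}

# Keep the hypotheses under which the human would refuse, as was observed.
consistent_with_refusal = [
    name for name, h in hypotheses.items() if not accepts_bet(**h)
]
```

Only the last hypothesis is ruled out by the refusal. An AI that could actually beat Kasparov would act very differently depending on which of the remaining three is true, and the demonstration alone does not distinguish them.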

In other words, the advantage of providing a manual reward signal over demonstration is related to the separation between P and NP. Solving a problem is much harder than verifying a solution, and similarly, demonstrating behavior that maximizes certain values is (computationally) much harder than evaluating a behavior according to the same values.

We can use the demonstration mechanism to a much greater advantage by designing a protocol that allows the human to learn from the AI. This is because many tasks can be much easier to solve with external advice. On the other hand, the AI can tell which advice is useful by seeing that the human changes eir behavior. For example, suppose that the task at hand is receiving the adjacency matrices of two graphs and producing an isomorphism between them. Then, if the human is unable to always solve this on eir own, it might be impossible for the AI to unambiguously understand what the task is. However, suppose that the AI can make several plausible guesses and test them by communicating the corresponding answer to the human. It will then observe which of those guesses cause the human to endorse the answer, and will thereby become able to carry out the task on its own.
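The asymmetry between producing and checking an answer in the graph isomorphism example can be sketched as follows. Checking a proposed vertex mapping takes time quadratic in the number of vertices, while the naive search below tries all n! mappings. The function names are hypothetical, and the brute-force search is used only for illustration.

```python
from itertools import permutations


def is_isomorphism(adj_a, adj_b, mapping):
    """Check in O(n^2) time that `mapping` sends graph A onto graph B.

    mapping[i] is the vertex of B assigned to vertex i of A.
    """
    n = len(adj_a)
    # The mapping must be a bijection on the vertex set.
    if len(adj_b) != n or sorted(mapping) != list(range(n)):
        return False
    return all(
        adj_a[i][j] == adj_b[mapping[i]][mapping[j]]
        for i in range(n)
        for j in range(n)
    )


def find_isomorphism(adj_a, adj_b):
    """Brute-force search over all n! vertex mappings."""
    for mapping in permutations(range(len(adj_a))):
        if is_isomorphism(adj_a, adj_b, mapping):
            return mapping
    return None


# Path 0-1-2 versus path 1-0-2 (vertex 0 in the middle).
graph_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
graph_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
found = find_isomorphism(graph_a, graph_b)
```

In the protocol described above, the human plays the role of `is_isomorphism`: ey can endorse or reject a candidate answer supplied by the AI even when ey could not have produced it unaided.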

More generally, we can consider the following learning protocol that I call "Learning By Teaching" (LBT). We have our agent and two additional actors (in the simplest case, humans): an "operator" and an "advisor". The agent can, at each given moment, decide between 3 modes:

In all modes, all actors observe the percept produced by the environment.

Note that the semantics of the advisor's communications are a priori unknown to the agent, and ultimately it seems reasonable that this communication will use natural language, so this learning protocol may be said to incorporate both demonstration and informal communication.

Now, the AI is at risk of committing dangerous actions both in mode II and in mode III. In particular, there might be some forms of advice that will corrupt the operator and cause irreversible damage. However, like in DRL, it can mitigate this risk by learning from the advisor and the operator which actions are safe.

Finally, whatever the ultimate value learning protocol will be, it is desirable to have it grounded in a coherent theory of imperfect rationality. Conversely, I believe that a reasonable theory of imperfect rationality should admit a value learning protocol (i.e. the concept of "values" should be observable and measurable in an appropriate sense). Specifically LBT suggests 3 types of "flaws" an agent is allowed to have while maintaining particular values:

In my opinion, the theory of imperfect rationality suggested by these considerations might already be close to capturing all the nuances of human irrationality.

Taming daemons

"Daemons" are intelligent agents that are simulated, within some degree of fidelity, inside the AI's reasoning process and whose malign agendas can render the entire system unaligned even if the AI's "main" reasoning process is correctly designed to acquire human values in some sense. The aim of this part of the agenda is to formalize the problem and provide solutions in the form of theoretical guarantees.

I distinguish between two types of daemons: Cartesian and non-Cartesian. Cartesian daemons are those that are dangerous even if the Cartesian boundary between the AI and the rest of the universe is perfect. In other words, Cartesian daemons influence the world through the "legitimate" output channels given to the AI and while preserving the computational process inside the AI's hardware conformant to the software model assumed by the designers. On the other hand, non-Cartesian daemons rely on violations of the Cartesian boundary: they either exploit some physical side effects of the computation on the world or "hack" the AI from the inside in a way that breaks its software model (e.g. violate an abstraction layer in the software). Of course we might have a single (in some sense) subagent that utilizes both Cartesian and non-Cartesian attack mechanisms. In the following, we will refer to a "Cartesian daemon" in the sense of "a Cartesian attack mechanism of some daemon" and similarly for a "non-Cartesian daemon".

The taming of Cartesian daemons is guaranteed by the same regret bounds that guarantee the AI's alignment in general. For example, we can examine how it works in the case of the acausal attack identified by Christiano. In this scenario, some malign superintelligence (let's call it Mu) somewhere in the multiverse simulates an environment very similar to that experienced by the AI we constructed on Earth. This makes the AI uncertain about whether it is on Earth or in Mu's realm, under some conditions assigning a higher probability to Mu. Mu then arranges the environment to diverge from Earth at some pivotal point s.t. the AI's expectation of this divergence leads it to perform malign actions before the divergence happens. Now, if the AI is implementing DRL, the uncertainty between Earth and Mu leads it to delegate to the advisor precisely at the moment this difference is important. Seeing the advisor display Earth-like rather than Mu-like behavior, it is then able to discard the Mu hypothesis.
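The qualitative argument can be illustrated with a toy Bayesian update. This is only a sketch and not the DRL algorithm itself: the two-hypothesis setup, the delegation rule and all the numbers are assumptions made for illustration.

```python
def posterior(prior, likelihoods):
    """Bayes rule over a finite set of hypotheses, given per-hypothesis likelihoods."""
    unnorm = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}

def should_delegate(recommended_action):
    """Toy rule (an assumption): delegate when the hypotheses recommend different actions."""
    return len(set(recommended_action.values())) > 1

# Hypothetical numbers: the AI initially favours the Mu hypothesis.
prior = {"earth": 0.2, "mu": 0.8}
recommended = {"earth": "benign", "mu": "malign"}
# Assumed probability, under each hypothesis, that the advisor takes the benign action.
advisor_likelihood_of_benign = {"earth": 0.99, "mu": 0.01}

if should_delegate(recommended):
    # The advisor is observed taking the benign (Earth-like) action.
    belief = posterior(prior, advisor_likelihood_of_benign)
else:
    belief = prior
```

With these numbers a single Earth-like demonstration moves most of the probability mass to the Earth hypothesis. With a sufficiently extreme prior in favour of Mu, the same update would leave Mu ahead, which is why the quantitative form of the guarantee matters.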

Of course, it is important to make the discussion above (and more generally, the performance guarantee that implies this discussion as a special case) quantitative, which will require better understanding of universal priors and their corresponding regret bounds, as discussed before. In particular, it appears that for some parameters the acausal attack might still succeed if Mu manages to make the malign hypothesis overwhelmingly more likely than Earth. Avoiding this might require carefully defining the prior, for example counting the AI's own source code as only a single unit of description complexity, which might lower the description length cost of locating Earth in the universe considerably.
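As a rough illustration of why the definition of the prior matters, assume a Solomonoff-style prior in which a hypothesis with description length $\ell$ receives weight roughly proportional to $2^{-\ell}$. The symbols $\ell_E$, $\ell_M$ (description lengths of the Earth and Mu hypotheses) and $k$ (bits saved by the modified prior) are introduced here only for illustration. Under that assumption the prior odds behave as follows:

```latex
\frac{\xi(\mathrm{Mu})}{\xi(\mathrm{Earth})} \approx 2^{\ell_E - \ell_M},
\qquad
\ell_E \mapsto \ell_E - k
\;\Longrightarrow\;
\frac{\xi(\mathrm{Mu})}{\xi(\mathrm{Earth})} \mapsto 2^{-k} \cdot \frac{\xi(\mathrm{Mu})}{\xi(\mathrm{Earth})}.
```

So every bit saved in locating Earth halves the prior advantage of the malign hypothesis.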

Non-Cartesian daemons theoretically can be completely eliminated by placing the AI in a homomorphic encryption box. This relies mainly on two assumptions:

Of course one would also have to take care that the implementation of this idea is sound, for example the key should probably be generated from quantum randomness and kept at a location sufficiently remote/isolated (in some sense that requires careful physical/cryptographic analysis) from the AI's hardware (with a linking communication channel for encrypted inputs/outputs).

More troubling is the large performance overhead this would impose. This overhead might make the hypothetical aligned AI uncompetitive relative to unaligned projects. It might be possible to tolerate this overhead if either there is considerable computing overhang (i.e. by the time the relevant algorithms are discovered, the hardware is sufficient to quickly grow a superintelligence even given this overhead, resulting in a singleton) or the aligned AI is a national or multinational project whose budget allows it to be orders of magnitude ahead of the competition in terms of computing power. However, since this is far from certain, designing alternative solutions is desirable.

One alternative approach is to employ extremely meticulous formal verification. That is, a physical model of the AI's software, hardware and its immediate environment will be developed and it will be formally verified that:

We will then need some analysis which shows that, given the very small "illegitimate" bandwidth, the aligned AI will "defeat" the unaligned daemons with high probability. This requires some game-theoretic model incorporating physical assumptions the shape of which is currently unknown to me. Needless to say, we will also have to make sure the implementation fits the model, which might require, for example, extreme means to isolate the hardware from contaminants. There will be a tradeoff between making the model simpler and thus easier to verify, versus making the model more complex and thus easier to physically protect and possibly allowing for more efficient hardware.

Also, at some point the AI should be mature enough to design an anti-daemon immune system for itself. However, it is not clear whether that simplifies the challenge, since we have to ensure the daemons will not grow out of control before this point.

Recursive self-improvement

The aim of this part of the agenda is to formalize and analyze the concept of "recursive self-improvement" in learning-theoretic language.

Recursive self-improvement as a method of extremely rapid capability growth is an intriguing idea, however so far it has little rigorous support. Moreover, it is far from clear that the first AGI will be recursively self-improving, even if the concept is sound. Therefore, I do not see it as a high-priority item on the agenda. Nevertheless, it is worth some attention both because of the capability angle and because of possible applications to decision theory.

At present, I have only a few observations on how the subject might be approached:

Summary

In this section, I recap and elaborate the main features of the agenda as I initially stated them.

This agenda is not intended as a territorial claim on my part. On the contrary, I encourage other researchers to work on parts of it or even adopt it entirely, whether in collaboration with me or independently. Conversely, I am also very interested to hear criticism.

37 comments


comment by AlexMennen · 2018-07-01T23:52:26.000Z · LW(p) · GW(p)

A related question is, whether it is possible to design an algorithm for strong AI based on simple mathematical principles, or whether any strong AI will inevitably be an enormous kludge of heuristics designed by trial and error. I think that we have some empirical support for the former, given that humans evolved to survive in a certain environment but succeeded to use their intelligence to solve problems in very different environments.

I don't understand this claim. It seems to me that human brains appear to be "an enormous kludge of heuristics designed by trial and error". Shouldn't the success of humans be evidence for the latter?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-02T19:28:51.000Z · LW(p) · GW(p)

The fact that the human brain was designed by trial and error is a given. However, we don't really know how the brain works. It is possible that the brain contains a simple mathematical core, possibly implemented inefficiently and with bugs and surrounded by tonnes of legacy code, but nevertheless responsible for the broad applicability of human intelligence.

Consider the following two views (which might also admit some intermediates):

View A: There exists a simple mathematical algorithm M that corresponds to what we call "intelligence" and that allows solving any problem in some very broad natural domain D.

View B: What we call intelligence is a collection of a large number of unrelated algorithms tailored to individual problems, and there is no "meta-algorithm" that produces them aside from relatively unsophisticated trial and error.

If View B is correct, then we expect that doing trial and error on a collection X of problems will produce an algorithm that solves problems in X and almost only in X. The probability that you were optimizing for X but solved a much larger domain Y is vanishingly small: it is about the same as the probability that a completely random algorithm solves all problems in Y \ X.

If View A is correct, then we expect that doing trial and error on X has a non-negligible chance of producing M (since M is simple and therefore sampled with a relatively large probability), which would be able to solve all of D.

So, the fact that homo sapiens evolved in some prehistoric environment but was able to e.g. land on the moon should be surprising to everyone with View B but not surprising to those with View A.

Replies from: paulfchristiano
comment by paulfchristiano · 2018-07-04T01:41:47.000Z · LW(p) · GW(p)

I think the most plausible view is: what we call intelligence is a collection of a large number of algorithms and innovations each of which slightly increases effectiveness in a reasonably broad range of tasks.

To see why both view A and B seem strange to me, consider the analog for physical tasks. You could say that there is a simple core to human physical manipulation which allows us to solve any problem in some very broad natural domain. Or you could think that we just have a ton of tricks for particular manipulation tasks. But neither of those seems right: there is no simple core to the human body plan, but at the same time it contains many features which are helpful across a broad range of tasks.

Replies from: vanessa-kosoy, vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-10T19:29:40.000Z · LW(p) · GW(p)

Regarding the physical manipulation analogy: I think that there actually is a simple core to the human body plan. This core is, more or less: a spine, two arms with joints in the middle, two legs with joints in the middle, feet and arms with fingers. This is probably already enough to qualitatively solve more or less all physical manipulation problems humans can solve. All the nuances are needed to make it quantitatively more efficient and deal with the detailed properties of biological tissues, biological muscles et cetera (the latter might be considered analogous to the detailed properties of computational hardware and input/output channels for brains/AGIs).

comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-04T08:02:28.000Z · LW(p) · GW(p)

I think that your view is plausible enough. However, if we focus only on qualitative performance metrics (e.g. time complexity up to a polynomial, regret bound up to logarithmic factors), then this collection probably includes only a small number of innovations that are important.

comment by jessicata (jessica.liu.taylor) · 2018-06-30T23:41:58.000Z · LW(p) · GW(p)

One hypothesis is, the main way humanity avoids traps is by happening to exist in a relatively favorable environment and knowing this fact, on some level. Specifically, it seems rather difficult for a single human or a small group to pursue a policy that will lead all of humanity into a trap (incidentally, this hypothesis doesn’t reflect optimistically on our chances to survive AI risk), and also rather rare for many humans to coordinate on simultaneously exploring an unusual policy. Therefore, human history may be very roughly likened to episodic RL where each human life is an episode.

It's pretty clear that humans avoid traps using thinking, not just learning. See: CFCs, mutually assured destruction. Yes, principles of thinking can be learned, but then they generalize better than learning theory can prove.

See also: Not just learning

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-01T09:00:38.000Z · LW(p) · GW(p)

When I say "learning" I only mean that the true environment is initially unknown. I'm not assuming anything about the internals of the algorithm. So, the question is, what desiderata can we formulate that are possible to satisfy by any algorithm at all. The collection of all environments is not learnable (because of traps), so we cannot demand the algorithm to be asymptotically optimal on every environment. Therefore, it seems like we need to assume something about the environment, if we want a definition of intelligence that accounts for the effectiveness of intelligence. Formulating such an assumption, making it rigorous, and backing it by rigorous analysis is the subproblem I'm presenting here. The particular sort of assumption I'm pointing at here might be oversimplified, but the question remains.

Replies from: jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2018-07-02T07:37:06.000Z · LW(p) · GW(p)

I agree that we'll want some reasonable assumption on the environment (e.g. symmetry of physical laws throughout spacetime) that will enable thinking to generalize well. I don't think that assumption looks like "it's hard to cause a lot of destruction" or "the environment is favorable to you in general". And I'm pretty sure that individual human lives are not the most important level of analysis for thinking about the learning required to avoid civilization-level traps (e.g. with CFCs, handling the situation required scientific and policy knowledge that no one knows at birth and no one could discover by themself over a lifetime)

Replies from: vanessa-kosoy, vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-03T19:08:06.000Z · LW(p) · GW(p)

Consider also evolution. Evolution can also be regarded as a sort of reinforcement learning algorithm. So why, during billions of years of evolution, was no gene sequence created that somehow destroyed all life on Earth? It seems hard to come up with an answer other than “it’s hard to cause a lot of destruction”.

Some speculation:

I think that we have a sequence of reinforcement algorithms: evolution -> humanity -> individual human / small group (maybe followed by -> AGI) s.t. each step inherits the knowledge generated by the previous step and also applies more optimization pressure than the previous step. This suggests formulating a "favorability" assumption of the following form: there is a (possibly infinite) sequence of reinforcement learning algorithms A0, A1, A2... s.t. each algorithm is more powerful than the previous (e.g. has more computing power), and our environment has to be s.t.

(1) Running policy A0 has a small rate (at most ε0) of falling into traps. (2) If we run A0 for some time T0 (s.t. ε0·T0 ≪ 1), and then run A1 after updating on the observations during T0, then A1 has a small rate (at most ε1) of falling into traps. (3) Ditto when we add A2

...And so forth.

The sequence {Ai} may be thought of as a sequence of agents or as just steps in the exploration of the environment by a single agent. So, our condition is that, each new "layer of reality" may be explored safely given that the previous layers were already studied.

Replies from: jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2018-07-04T01:53:54.000Z · LW(p) · GW(p)

Most species have gone extinct in the past. I would not be satisfied with an outcome where all humans die or 99% of humans die, even though technically humans might rebuild if there are any left and other intelligent life can evolve if humanity is extinct. These extinction levels can happen with foreseeable tech. Additionally, avoiding nuclear war requires continual cognitive effort to be put into the problem; it would be insufficient to use trial-and-error to avoid nuclear war.

I don't see why you would want a long sequence of reinforcement learning algorithms. At some point the algorithms produce things that can think, and then they should use their thinking to steer the future rather than trial-and-error alone. I don't think RL algorithms would get the right answer on CFCs or nuclear war prevention.

I am pretty sure that we can't fully explore our current level, e.g. that would include starting nuclear wars to test theories about nuclear deterrence and nuclear winter.

I really think that you are taking the RL analogy too far here; decision-making systems involving humans have some things in common with RL but RL theory only describes a fragment of the reasoning that these systems do.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-04T09:17:44.000Z · LW(p) · GW(p)

I don't think you're interpreting what I'm saying correctly.

First, when I say "reinforcement learning" I don't necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment.

Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state "knows" that something is a bad/good idea then the algorithm doesn't need to actually try it.

Third, "starting nuclear wars to test theories" is the opposite of what I'm trying to describe. What I'm saying is, we already have enough knowledge (acquired by exploring previous levels) to know that nuclear war is a bad idea, so exploring this level will not involve starting nuclear wars. What I'm trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap before you arrive at the trap.
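The posterior sampling scheme mentioned in the second point can be sketched on a toy problem. This is a minimal illustration under assumed conditions, not any particular published implementation: the two-armed setting, the hypothesis names and the success probabilities are all hypothetical.

```python
import random

# Two hypothetical environments: each maps an arm to its success probability.
HYPOTHESES = {
    "left_is_better": {"left": 0.9, "right": 0.1},
    "right_is_better": {"left": 0.1, "right": 0.9},
}

def optimal_arm(hypothesis):
    """Arm with the highest success probability under the given hypothesis."""
    arms = HYPOTHESES[hypothesis]
    return max(arms, key=arms.get)

def update(belief, arm, reward):
    """Bayes update of the belief state after observing a 0/1 reward from `arm`."""
    unnorm = {}
    for h, weight in belief.items():
        p = HYPOTHESES[h][arm]
        unnorm[h] = weight * (p if reward == 1 else 1.0 - p)
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}

def posterior_sampling(true_hypothesis, episodes, seed=0):
    """Each episode: sample a hypothesis from the belief, act optimally for it, update."""
    rng = random.Random(seed)
    belief = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}
    for _ in range(episodes):
        names = list(belief)
        sampled = rng.choices(names, weights=[belief[h] for h in names])[0]
        arm = optimal_arm(sampled)
        p_success = HYPOTHESES[true_hypothesis][arm]
        reward = 1 if rng.random() < p_success else 0
        belief = update(belief, arm, reward)
    return belief
```

Once the belief state assigns negligible weight to a hypothesis, the policy optimal for that hypothesis is almost never sampled, which is the sense in which the algorithm does not need to try something it already "knows" to be a bad idea.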

Replies from: jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2018-07-04T22:09:38.000Z · LW(p) · GW(p)

First, when I say “reinforcement learning” I don’t necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment.

That is broad enough to include Bayesianism. I think you are imagining a narrower class of algorithms that can achieve some property like asymptotic optimality. Agree that this narrower class is much broader than current RL, though.

Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state “knows” that something is a bad/good idea then the algorithm doesn’t need to actually try it.

I agree that if it knows for sure that it isn't in some environment then it doesn't need to test anything to perform well in that environment. But what if there is a 5% chance that the environment is such that nuclear war is good (e.g. because it eliminates other forms of destructive technology for a long time)? Then this AI would start nuclear war with 5% probability per learning epoch. This is not pure trial-and-error but it is trial-and-error in an important relevant sense.
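The size of this concern can be made concrete with a short calculation. It assumes, purely for illustration, that the 5% posterior weight stays fixed and that epochs are independent; in practice the belief state would change as evidence arrives. The function name is hypothetical.

```python
def prob_at_least_once(per_epoch_probability, epochs):
    """Chance that an event with a fixed per-epoch probability occurs in at least one epoch."""
    return 1.0 - (1.0 - per_epoch_probability) ** epochs
```

Under these assumptions, `prob_at_least_once(0.05, n)` grows quickly with the number of epochs `n`, which is why a small per-epoch chance of an irreversible action is still a serious problem.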

What I’m trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap before you arrive at the trap.

This seems like an interesting research approach and I don't object to it. I would object to thinking that algorithms that only handle this class of environments are safe to run in our world (which I expect is not of this form). To be clear, while I expect that a Bayesian-ish agent has a good chance to avoid very bad outcomes using the knowledge it has, I don't think anything that attains asymptotic optimality will be useful while avoiding very bad outcomes with decent probability.

Replies from: vanessa-kosoy, vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-06T08:45:34.000Z · LW(p) · GW(p)

After thinking some more, maybe the following is a natural way towards formalizing the optimism condition.

Let $\mathcal{H}$ be the space of hypotheses and $\xi_0 \in \Delta\mathcal{H}$ be the "unbiased" universal prior. Given any $\zeta \in \Delta\mathcal{H}$, we denote $\hat{\zeta} := \mathbb{E}_{\mu \sim \zeta}[\mu]$, i.e. the environment resulting from mixing the environments in the belief state $\zeta$. Given an environment $\mu$, let $\pi^{\mu}$ be the Bayes-optimal policy for $\mu$ and $\pi^{\mu}_{\theta}$ the perturbed Bayes-optimal policy for $\mu$, where $\theta$ is a perturbation parameter. Here, "perturbed" probably means something like softmax expected utility, but more thought is needed. Then, the "optimistic" prior $\xi$ is defined as a solution to the following fixed point equation:

$$\xi(\mu) = Z^{-1}\,\xi_0(\mu)\,\exp\!\left(-\beta\,\mathrm{Rg}\!\left(\mu, \pi^{\hat{\xi}}_{\theta}\right)\right)$$

Here, $\mathrm{Rg}(\mu, \pi)$ is the regret of policy $\pi$ in environment $\mu$ relative to $\pi^{\mu}$, $Z$ is a normalization constant and $\beta$ is an additional parameter.

This equation defines something like a softmax Nash equilibrium in a cooperative game of two players where one player chooses $\mu$ (so that $\xi$ is eir mixed strategy), another player chooses $\pi$ and the utility is minus regret (alternatively, we might want to choose only Pareto efficient Nash equilibria). The parameter $\beta$ controls optimism regarding the ability to learn the environment, whereas the parameter $\theta$ represents optimism regarding the presence of slack: ability to learn despite making some errors or random exploration (how to choose these parameters is another question).

Possibly, the idea of exploring the environment "layer by layer" can be recovered from combining this with hierarchy assumptions.

Replies from: jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2018-07-07T07:54:25.000Z · LW(p) · GW(p)

This seems like a hack. The equilibrium policy is going to assume that the environment is good to it in general in a magical fashion, rather than assuming the environment is good to it in the specific ways we should expect given our own knowledge of how the environment works. It's kind of like assuming "things magically end up lower than you expected on priors" instead of having a theory of gravity.

I think there is something like a theory of gravity here. The things I would note about our universe that make it possible to avoid a lot of traps include:

  • Physical laws are symmetric across spacetime.
  • Physical laws are spatially local.
  • The predictable effects of a local action are typically local; most effects "dissipate" after a while (e.g. into heat). The butterfly effect is evidence for this rather than against this, since it means many effects are unpredictable and so can be modeled thermodynamically.
  • When small changes have big and predictable effects (e.g. in a computer), there is often agentic optimization power towards the creation and maintenance of this system of effects, and in these cases it is possible for at least some agents to understand important things about how the system works.
  • Some "partially-dissipated" effects are statistical in nature. For example, an earthquake hitting an area has many immediate effects, but over the long term the important effects are things like "this much local productive activity was disrupted", "this much local human health was lost", etc.
  • You have the genes that you do because evolution, which is similar to a reinforcement learning algorithm, believed that these genes would cause you to survive and reproduce. If we construct AI systems, we will give them code (including a prior) that we expect to cause them to do something useful for us. In general, the agency of an agent's creator should affect the agent's beliefs.
  • If there are many copies of an agent, and successful agents are able to repurpose the resources of unsuccessful ones, then different copies can try different strategies; some will fail but the successful ones can then repurpose their resources. (Evolution can be seen as a special case of this)
  • Some phenomena have a "fractal" nature, where a small thing behaves similarly to a big thing. For example, there are a lot of similarities between the dynamics of a nation and the dynamics of a city. Thus small things can be used as models of big things.
  • If your interests are aligned with those of agents in your local vicinity, then they will mostly try to help you. (This applies to parents making their children's environment safe)

I don't have an elegant theory yet but these observations seem like a reasonable starting point for forming one.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-08T11:25:57.000Z · LW(p) · GW(p)

I think that we should expect evolution to give us a prior that is a good lossy compression of actual physics (where "actual physics" means those patterns the universe has that can be described within our computational complexity bounds). Meaning that, on the one hand it should have low description complexity (otherwise it will be hard for evolution to find it), and on the other hand it should assign high probability to the true environment (in other words, the KL divergence of the true environment from the prior should be small). And also it should be approximately learnable, otherwise it won't go from assigning high probability to actually performing well.
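The link between "assigning high probability to the true environment" and "small KL divergence" can be checked numerically. If the prior is a mixture that gives weight w to the true environment, the KL divergence of the true environment from the prior is at most log(1/w). The sketch below uses hypothetical three-outcome distributions as stand-ins for environments; the names and numbers are assumptions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixture(weights, components):
    """Pointwise mixture of discrete distributions with the given weights."""
    size = len(components[0])
    return [sum(w * c[i] for w, c in zip(weights, components)) for i in range(size)]

# Hypothetical environments; the prior puts weight 0.25 on the true one.
true_env = [0.7, 0.2, 0.1]
other_env = [0.1, 0.1, 0.8]
prior = mixture([0.25, 0.75], [true_env, other_env])
divergence = kl_divergence(true_env, prior)
bound = math.log(1 / 0.25)
```

Here `divergence` never exceeds `bound`, because the mixture is pointwise at least 0.25 times the true distribution.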

The principles you outlined seem reasonable overall.

Note that the locality/dissipation/multiagent assumptions amount to a special case of "the environment is effectively reversible (from the perspective of the human species as a whole) as long as you don't apply too much optimization power" ("optimization power" probably translates to divergence from some baseline policy plus maybe computational complexity considerations). Now, as you noted before, actual macroscopic physics is not reversible, but it might still be effectively reversible if you have a reliable long-term source of negentropy (like the sun). Maybe we can also slightly relax them by allowing irreversible changes as long as they are localized and the available space is sufficiently big.

"If we construct AI systems, we will give them code (including a prior) that we expect to cause them to do something useful for us. In general, the agency of an agent’s creator should affect the agent’s beliefs" is essentially what DRL does: allows transferring our knowledge to the AI without hard-coding it by hand.

"When small changes have big and predictable effects (e.g. in a computer), there is often agentic optimization power towards the creation and maintenance of this system of effects, and in these cases it is possible for at least some agents to understand important things about how the system works" seems like it would allow us to go beyond effective reversibility, but I'm not sure how to formalize it or whether it's a justified assumption. One way towards formalizing it is, the prior is s.t. studying the initial state approximate communication class allows determining the entire environment, but this seems to point at a very broad class of approximately learnable priors w/o specifying a criterion for choosing among them.

Another principle that we can try to use is, the ubiquity of analytic functions. Analytic functions have the property that, knowing the function in a bounded domain allows extrapolating it everywhere. This is different from allowing arbitrary computable functions which may have "if" clauses, so that studying the function in a bounded domain is never enough to be sure about its behavior outside it. In particular, this line of inquiry seems relatively easy to formalize using continuous MDPs (although we run into the problem that finding the optimal policy is infeasible, in general). Also, it might have something to do with the effectiveness of neural networks (although the popular ReLU response function is not analytic).
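The contrast between analytic functions and functions with "if" clauses can be illustrated with the simplest analytic case, a polynomial of known degree. This is only a sketch: general analytic functions are determined by their values on an open set rather than by finitely many samples, and the function names and sample points below are hypothetical.

```python
from fractions import Fraction

def lagrange_extrapolate(xs, ys, x):
    """Evaluate at x the unique polynomial of degree < len(xs) through the points (xs, ys)."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

def smooth(x):
    """A polynomial, hence analytic: x^2 - 3x + 2."""
    return x * x - 3 * x + 2

def branching(x):
    """Agrees with `smooth` on [0, 1] but switches behaviour behind an `if`."""
    return smooth(x) if x <= 1 else 0

# Sample points inside the bounded domain [0, 1].
SAMPLES = [Fraction(0), Fraction(1, 2), Fraction(1)]
```

Both functions produce identical samples on [0, 1], yet extrapolating from those samples recovers `smooth` far outside the domain and tells us nothing about `branching` there.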

comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-05T15:45:55.000Z · LW(p) · GW(p)

Actually, I am including Bayesianism in "reinforcement learning" in the broad sense, although I am also advocating for some form of asymptotic optimality (importantly, it is not asymptotic in time like often done in the literature, but asymptotic in the time discount parameter; otherwise you give up on most of the utility, like you pointed out in an earlier discussion we had).

In the scenario you describe, the agent will presumably discard (or, strongly penalize the probability of) the pro-nuclear-war hypothesis first since the initial policy loses value much faster on this hypothesis compared to the anti-nuclear-war hypothesis (since the initial policy is biased towards the more likely anti-nuclear-war hypothesis). It will then remain with the anti-nuclear-war hypothesis and follow the corresponding policy (of not starting nuclear war). Perhaps this can be formalized as searching for a fixed point of some transformation.

comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-02T20:06:41.000Z · LW(p) · GW(p)

Consider a panel with two buttons, A and B. One button sends you to Heaven and one to Hell, but you don't know which is which and there is no way to check without pressing one. To make it more fun, you have to choose a button within one minute or you go to Hell automatically.

So, there are two environments: in environment X, button A corresponds to Heaven and in environment Y, button B corresponds to Heaven. Obviously both cannot be in a learnable class simultaneously. So, at least one of them has to be ruled out (and if we also want to preserve symmetry then both). What sort of assumption do you think will rule them out?
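The reason both environments cannot sit in one learnable class can be seen from a short calculation. The sketch assumes utility 1 for Heaven and 0 for Hell, and considers only policies that press A with some probability and B otherwise (not pressing at all leads to Hell, so it is never better). These modelling choices and the function names are assumptions for illustration.

```python
def regret(prob_press_a, heaven_button):
    """Expected regret of pressing A with the given probability (Heaven = 1, Hell = 0)."""
    prob_heaven = prob_press_a if heaven_button == "A" else 1.0 - prob_press_a
    return 1.0 - prob_heaven

def worst_case_regret(prob_press_a):
    """Larger of the regrets in environment X (A is Heaven) and Y (B is Heaven)."""
    return max(regret(prob_press_a, "A"), regret(prob_press_a, "B"))
```

Whatever probability the policy uses, `worst_case_regret` is at least 1/2, so no single policy has small regret in both X and Y.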

I think that "scientific and policy knowledge that no one knows at birth and no one could discover by themself over a lifetime" is absolutely compatible with the hypothesis I outlined, even in its most naive form. If humanity's progress is episodic RL where each human life is an episode, then of course each human uses the knowledge accumulated by previous humans. This is the whole idea of a learning algorithm in this setting.

Also, I think that success with CFC is not a lot of evidence against the hypothesis since, for one thing, CFC doesn't allow a small group to easily destroy all of humanity, and for another thing, AFAIK action against CFC was only taken when some damage was already apparent. This is different from risks that have to be handled correctly on the first try.

That said, "doesn’t reflect optimistically on our chances to survive AI risk" wasn't intended as a strong claim but as something very speculative. Possibly I should have made it clearer.

More generally, the idea of restricting to environments s.t. some base policy doesn't fall into traps on them is not very restrictive. Indeed, for any learnable class H you can just take the base policy to be the learning algorithm itself and tautologically get a class at least as big as H. It becomes more interesting if we impose some constraints on the base policy, such as maybe restricting its computational complexity.

Intuitively, it seems alluring to say that our environment may contain X-risks, but they are s.t. by the time we face them we have enough knowledge to avoid them. However, this leads to assumptions that depend on the prior as a whole rather than on particular environments (basically, it's not clear whether this is saying anything besides just assuming the prior is learnable). This complicates things, and in particular it becomes less clear what it means for such a prior to be "universal". Moreover, the notion of a "trap" is not even a function of the prior regarded as a single mixed environment, but a function of the particular partition of the prior into constituent hypotheses. In other words, it depends on which uncertainty is considered subjective (a property of the agent's state of knowledge) and which uncertainty is considered objective (an inherent unpredictability of the world). For example, if we go to the initial example but assume that there is a fair coin inside the environment that decides which button is Heaven, then instead of two environments we get one and tautologically there is no trap.
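The fair-coin variant can be checked the same way. Once the coin is part of a single environment, regret is measured against the best policy for that one environment. The sketch again assumes utility 1 for Heaven and 0 for Hell, and the function names are hypothetical.

```python
def value_in_coin_environment(prob_press_a, prob_a_is_heaven=0.5):
    """Expected utility when an in-environment coin decides which button is Heaven."""
    return (prob_press_a * prob_a_is_heaven
            + (1.0 - prob_press_a) * (1.0 - prob_a_is_heaven))

def regret_in_coin_environment(prob_press_a, prob_a_is_heaven=0.5):
    """Regret against the best policy for the single mixed environment."""
    # The value is linear in prob_press_a, so the best policy is one of the two pure ones.
    best = max(value_in_coin_environment(p, prob_a_is_heaven) for p in (0.0, 1.0))
    return best - value_in_coin_environment(prob_press_a, prob_a_is_heaven)
```

With a fair coin every policy has expected utility 1/2 and therefore zero regret, even though the agent's actual experience is identical to the two-environment case.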

In short, I think there is a lot more thinking to do about this question.

Replies from: jessica.liu.taylor, jessica.liu.taylor, jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2018-07-04T02:05:33.000Z · LW(p) · GW(p)

So, there are two environments: in environment X, button A corresponds to Heaven and in environment Y, button B corresponds to Heaven. Obviously both cannot be in a learnable class simultaneously. So, at least one of them has to be ruled out (and if we also want to preserve symmetry then both). What sort of assumption do you think will rule them out?

I don't think we should rule either of these out. The obvious answer is to give up on asymptotic optimality and do something more like utility function optimization instead. That would be moving out of the learning theory setting, which is a good thing.

Asymptotic optimality can apply to bounded optimization problems and can't apply to civilization-level steering problems.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-04T08:46:22.000Z · LW(p) · GW(p)

Well, we could give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality. (This would not be moving out of the learning theory setting though? At least not the way I use this terminology.) Regret bounds would still be useful in the context of guaranteeing transfer of human knowledge and values to the AGI, but not in the context of defining intelligence.

However, my intuition is that it would be the wrong way to go.

For one thing, it seems that it is computationally feasible (at least in some weak sense, i.e. for a small number of hypotheses s.t. the optimal policy for each is feasible) to get asymptotic Bayes-optimality for certain learnable classes (PSRL is a simple example) but not in general. I don't have a proof (and I would be very interested to see either a proof or a refutation), but it seems to be the case AFAIK.

For another thing, consider questions such as, why intelligent agents outcompete instinct-based agents, and why general intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is "universal" or "nearly universal" in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) compared to, if intelligence has only purely Bayesian properties (i.e. it is only good on average w.r.t. some very broad ensemble of environments).

Replies from: jessica.liu.taylor, jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2018-07-04T21:55:10.000Z · LW(p) · GW(p)

For another thing, consider questions such as, why intelligent agents outcompete instinct-based agents, and why general intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is “universal” or “nearly universal” in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) compared to, if intelligence has only purely Bayesian properties (i.e. it is only good on average w.r.t. some very broad ensemble of environments).

I don't understand why you think this. Suppose there is some simple "naturalized AIXI"-ish thing that is parameterized on a prior, and there exists a simple prior for which an animal running this algorithm with this prior does pretty well in our world. Then evolution may produce an animal running something like naturalized AIXI with this prior. But naturalized AIXI is only good on average rather than guaranteeing effectiveness in almost all environments.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-05T14:53:57.000Z · LW(p) · GW(p)

My intuition is that it must not be just a coincidence that the agent happens to work well in our world, otherwise your formalism doesn't capture the concept of intelligence in full. For example, we are worried that a UFAI would be very likely to kill us in this particular universe, not just in some counterfactual universes. Moreover, Bayesian agents with simple priors often do very poorly in particular worlds, because of what I call "Bayesian paranoia". That is, if your agent thinks that lifting its left arm will plausibly send it to hell (a rather simple hypothesis), it will never lift its left arm and learn otherwise.
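The "Bayesian paranoia" failure mode can be sketched in a few lines. This is a deliberately crude model under stated assumptions: two hypotheses only (lifting the arm is safe, or lifting it leads to a permanent hell), made-up numerical rewards, and an agent that compares the per-step expected payoff of lifting against the payoff 0 of staying still. Since both outcomes are permanent, the discount factor only rescales the comparison and is omitted.

```python
def bayes_value_of_lifting(p_hell, safe_reward, hell_penalty):
    """Expected per-step payoff of lifting the arm under the agent's current belief.

    p_hell is the subjective probability of the (hypothetical) hell hypothesis.
    Staying still is assumed to pay 0 under both hypotheses.
    """
    return (1.0 - p_hell) * safe_reward - p_hell * hell_penalty


def run_agent(p_hell, safe_reward, hell_penalty, steps):
    """Simulate the agent in a world where lifting the arm is actually safe.

    Returns (total reward, number of times the arm was lifted).
    """
    total = 0.0
    lifts = 0
    for _ in range(steps):
        if bayes_value_of_lifting(p_hell, safe_reward, hell_penalty) > 0:
            lifts += 1
            total += safe_reward
            p_hell = 0.0  # lifting was observed to be safe, so the hell hypothesis is refuted
        # Otherwise the agent stays still, receives 0, observes nothing new,
        # and its belief p_hell never changes.
    return total, lifts


# A paranoid prior: small probability of hell, but a very large penalty.
print(run_agent(p_hell=0.01, safe_reward=1.0, hell_penalty=1000.0, steps=10))
# A more "optimistic" prior in the same (safe) world.
print(run_agent(p_hell=0.0001, safe_reward=1.0, hell_penalty=1000.0, steps=10))
```

With the first prior the agent never lifts its arm and so never gathers the evidence that would correct its belief; with the second it lifts on the first step and keeps collecting the reward.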

In fact, I suspect that a certain degree of "optimism" is inherent in our intuitive notion of rationality, and it also has a good track record. For example, when scientists did early experiments with electricity, or magnetism, or chemical reactions, their understanding of physics at the time was arguably insufficient to know this would not destroy the world. However, there were few other ways to go forward. AFAIK the first time anyone seriously worried about a physics experiment was the RHIC (unless you also count the Manhattan project, when Edward Teller suggested the atom bomb might create a self-sustaining nuclear fusion reaction that would envelop the entire atmosphere). These latter concerns were only raised because we already knew enough to point at specific dangers. Of course this doesn't mean we shouldn't be worried about X-risks! But I think that some form of a priori optimism is likely to be correct, in some philosophical sense. (There was also some thinking in that direction by Sunehag and Hutter, although I'm not sold on the particular formalism they consider.)

Replies from: jessica.liu.taylor, jessica.liu.taylor, None
comment by jessicata (jessica.liu.taylor) · 2018-07-07T07:46:05.000Z · LW(p) · GW(p)

I think I understand your point better now. It isn't a coincidence that an agent produced by evolution has a good prior for our world (because evolution tries many priors, and there are lots of simple priors to try). But the fact that there exists a simple prior that does well in our universe is a fact that needs an explanation. It can't be proven from Bayesianism; the closest thing to a proof of this form is that computationally unbounded agents can just be born with knowledge of physics if physics is sufficiently simple, but there is no similar argument for computationally bounded agents.

comment by jessicata (jessica.liu.taylor) · 2018-07-07T07:41:57.000Z · LW(p) · GW(p)

My intuition is that it must not be just a coincidence that the agent happens to work well in our world, otherwise your formalism doesn’t capture the concept of intelligence in full.

It's not a coincidence because evolution selects the prior, and evolution tries lots of priors. (There are lots of simple priors)

comment by [deleted] · 2022-05-30T17:46:01.942Z · LW(p) · GW(p)
comment by jessicata (jessica.liu.taylor) · 2018-07-04T22:11:43.000Z · LW(p) · GW(p)

Well, we could give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality.

I am not proposing this. I am proposing doing something more like AIXI, which has a fixed prior and does not obtain optimality properties on a broad class of environments. It seems like directly specifying the right prior is hard, and it's plausible that learning theory research would help give intuitions/models about which prior to use or what non-Bayesian algorithm would get good performance in the world we actually live in, but I don't expect learning theory to directly produce an algorithm we would be happy with running to make big decisions in our universe.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-05T14:47:43.000Z · LW(p) · GW(p)

Yes, I think that we're talking about the same thing. When I say "asymptotically approach Bayes-optimality" I mean the equation from Proposition A.0 here. I refer to this instead of just Bayes-optimality, because exact Bayes-optimality is computationally intractable even for a small number of hypotheses, each of which is a small MDP. However, even asymptotic Bayes-optimality is usually only tractable for some learnable classes, AFAIK: for example, if you have environments without traps then PSRL is asymptotically Bayes-optimal.

comment by jessicata (jessica.liu.taylor) · 2018-07-04T01:53:05.000Z · LW(p) · GW(p)

I think that “scientific and policy knowledge that no one knows at birth and no one could discover by themself over a lifetime” is absolutely compatible with the hypothesis I outlined, even in its most naive form. If humanity’s progress is episodic RL where each human life is an episode, then of course each human uses the knowledge accumulated by previous humans. This is the whole idea of a learning algorithm in this setting.

If RL is using human lives as episodes then humans should already be born with the relevant knowledge. There would be no need for history since all learning is encoded in the policy. History isn't RL; it's data summarization, model building, and intertemporal communication.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-04T08:31:55.000Z · LW(p) · GW(p)

This seems to be interpreting the analogy too literally. Humans are not born with the knowledge, but they acquire the knowledge through some protocol that is designed to be much easier than rediscovering it. Moreover, by "reinforcement learning" I don't mean the same type of algorithms used for RL today, I only mean that the performance guarantee this process satisfies is of a certain form.

comment by jessicata (jessica.liu.taylor) · 2018-07-04T01:50:42.000Z · LW(p) · GW(p)

More generally, the idea of restricting to environments s.t. some base policy doesn’t fall into traps on them is not very restrictive.

This rules out environments in which the second law of thermodynamics holds.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-04T08:35:04.000Z · LW(p) · GW(p)

No, it doesn't rule out any particular environment. A class that consists only of one environment is tautologically learnable, by the optimal policy for this environment. You might be thinking of learnability by anytime algorithms whereas I'm thinking of learnability by non-anytime algorithms (what I called "metapolicies"), the way I defined it here (see Definition 1).

Replies from: jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2018-07-04T21:43:25.000Z · LW(p) · GW(p)

Ok, I am confused by what you mean by "trap". I thought "trap" meant a set of states you can't get out of. And if the second law of thermodynamics is true, you can't get from a high-entropy state to a low-entropy state. What do you mean by "trap"?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-05T14:01:22.000Z · LW(p) · GW(p)

To a first approximation, a "trap" is an action s.t. taking it loses long-term value in expectation, i.e. an action which is outside the set that I defined here (see the end of Definition 1). This set is always non-empty, since it at least has to contain the optimal action. However, this definition is not very useful when, for example, your environment contains a state that you cannot escape and you also cannot avoid (for example, the heat death of the universe might be such a state), since, in this case, nothing is a trap. To be more precise we need to go from an analysis which is asymptotic in the time discount parameter to an analysis with a fixed, finite time discount parameter (similarly to how with time complexity, we usually start from analyzing the asymptotic complexity of an algorithm, but ultimately we are interested in particular inputs of finite size). For a fixed time discount parameter, the concept of a trap becomes "fuzzy": a trap is an action which loses a substantial fraction of the value.
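The "fuzzy" notion for a fixed discount can be sketched as follows. The tiny deterministic MDP, its rewards, and the helper names below are hypothetical and chosen only for illustration; the sketch uses plain value iteration and is not the definition referenced above.

```python
# Hypothetical deterministic MDP: MDP[state][action] = (reward, next_state).
# "grab" pays more immediately but leads to an absorbing zero-reward state.
MDP = {
    "start": {"safe": (1.0, "start"), "grab": (2.0, "stuck")},
    "stuck": {"wait": (0.0, "stuck")},
}


def q_values(mdp, gamma, iterations=5000):
    """Approximate the optimal Q-function by value iteration with discount gamma."""
    v = {s: 0.0 for s in mdp}
    for _ in range(iterations):
        v = {
            s: max(r + gamma * v[nxt] for r, nxt in actions.values())
            for s, actions in mdp.items()
        }
    return {
        s: {a: r + gamma * v[nxt] for a, (r, nxt) in actions.items()}
        for s, actions in mdp.items()
    }


def fraction_of_value_lost(mdp, gamma, state, action):
    """Fraction of the optimal value at `state` that is lost by taking `action` first."""
    q = q_values(mdp, gamma)[state]
    best = max(q.values())
    if best == 0.0:
        return 0.0  # nothing to lose when the optimal value is zero
    return (best - q[action]) / best


for gamma in (0.5, 0.9, 0.99):
    print(gamma, fraction_of_value_lost(MDP, gamma, "start", "grab"))
```

In this toy example "grab" loses nothing at a discount of 0.5 but about 98% of the value at 0.99, so whether it counts as a trap depends on the discount parameter and on what fraction one regards as "substantial".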

comment by jessicata (jessica.liu.taylor) · 2018-06-30T23:48:15.000Z · LW(p) · GW(p)

Delegative Reinforcement Learning aims to solve both problems by occasionally passing control to the human operator (“advisor”), and using it to learn which actions are safe.

Why would you assume the existence of an advisor who can avoid taking catastrophic actions and sometimes take an optimal action? This would require some process capable of good judgment to understand many aspects of the AI's decision-making process, such as its world models (as these models are relevant to which actions are catastrophic/optimal). Are you proposing a high degree of transparency, a bootstrapping process as in ALBA, or something else?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-01T08:50:16.000Z · LW(p) · GW(p)

I think that what you're saying here can be reformulated as follows (please correct me if I end up not answering your question):

The action that an RL agent takes depends both on the new observation and its internal state. Often we ignore the latter and pretend the action depends only on the history of observations and actions, and this is okay because we can always produce the probability distribution over internal states conditional on the given history. However, this is only ok for information-theoretic analysis, since sampling this probability distribution given only the history as input is computationally intractable.

So, it might be a reasonable assumption that the advisor takes "sane" actions when left to its own devices, but it is not reasonable to assume the same when it works together with the AI. This is because, even if the AI behaved exactly as the advisor, it would hide the simulated advisor's internal state, which would preclude the advisor from taking the wheel and proceeding with the same policy.

I think this is a real problem, but we can overcome it by letting the advisor write some kind of "diary" that documents eir reasoning process, as much as possible. The diary is also considered a part of the environment (although we might want to bake into the prior the rules of operating the diary and a "cheap talk" assumption which says the diary has no side effects on the world). This way, the internal state is externalized, and the AI will effectively become transparent by maintaining the diary too (essentially the AI in this setup is emulating a "best case" version of the advisor). It would be great if we could make this idea into a formal analysis.

Replies from: jessica.liu.taylor
comment by jessicata (jessica.liu.taylor) · 2018-07-02T07:25:36.000Z · LW(p) · GW(p)

That captures part of it but I also don't think the advisor takes sane actions when the AI is doing things to the environment that change the environment. E.g. the AI is implementing some plan to create a nuclear reactor, and the advisor doesn't understand how nuclear reactors work.

I guess you could have the AI first write the nuclear reactor plan in the diary, but this is essentially the same thing as transparency.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-02T19:35:03.000Z · LW(p) · GW(p)

Well, you could say it is the same thing as transparency. What is interesting about it is that, in principle, you don't have to put in transparency by hand using some completely different techniques. Instead, transparency arises naturally from the DRL paradigm and some relatively mild assumptions (that there is a "diary"). The idea is that the advisor would not build a nuclear reactor without seeing an explanation of nuclear reactors, so the AI won't do it either.

comment by jessicata (jessica.liu.taylor) · 2018-06-30T23:31:05.000Z · LW(p) · GW(p)

Now, if the AI is implementing DRL, the uncertainty between Earth and Mu leads it to delegate to the advisor precisely at the moment this difference is important.

It seems like this is giving up on allowing the AI to make long-term predictions. It can make short-term, testable predictions (since if different advisors disagree, it is possible to see who is right). But long-term predictions can't be cheaply tested.

In the absence of long-term predictions, it still might be possible to do something along the lines of what Paul is thinking of (i.e. predicting human judgments of longer-term things), but I don't see what else you could do. Does this match your model?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2018-07-01T09:24:52.000Z · LW(p) · GW(p)

I'm not giving up on long-term predictions in general. It's just that, because of traps, some uncertainties cannot be resolved by testing, as you say. In those cases the AI has to rely on what it learned from the advisor, which indeed amounts to human judgment.

comment by [deleted] · 2021-11-17T18:40:03.954Z · LW(p) · GW(p)