Wait. Does it mean that, given that I prefer instrumentalism over realism in metaphysics and Copenhagen over MWI in QM (up to some nuances), I am a postrationalist now? That doesn't feel right. I don't believe that the rationalist project is "misguided or impossible", unless you use a very narrow definition of the "rationalist project". Here and here I defended what is arguably the core of the rationalist project.
I see the problem of counterfactuals as essentially solved by quasi-Bayesianism, which behaves like UDT in all Newcomb-like situations. The source code in your presentation of the problem is more or less equivalent to Omega in Newcomb-like problems. A TRL agent can also reason about arbitrary programs, and learn that a certain program acts as a predictor for its own actions.
This approach has some similarity with material implication and proof-based decision theory, in the sense that out of several hypotheses about counterfactuals that are consistent with observations, the decisive role is played by the most optimistic hypothesis (the one that can be exploited for the most expected utility). However, it has no problem with global accounting and indeed it solves counterfactual mugging successfully.
Regarding regret bounds, I don't think regret bounds are realistic for an AGI, unless it queried an optimal teacher for every action (which would make it useless). In the real world, no actions are recoverable, and any time it picks an action on its own, we cannot be sure it is acting optimally.
First, you can have a subjective regret bound which doesn't require all actions to be recoverable (it does require some actions to be approximately recoverable, which is indeed the case in the real world).
Second, dealing rationally with non-recoverable actions should still translate into mathematical conditions, some of which might still look like a sort of regret bound, and in any case finite MDPs are a natural starting point for analyzing them.
Third, checking regret bounds for priors in which all actions are recoverable serves as a sanity test for candidate AGI algorithms. It is not a sufficient desideratum, but I do think it is necessary.
But I think many of the difficulties with general intelligence are not captured in the simple setting
I agree that some of the difficulties are not captured. I am curious whether you have more concrete examples in mind than what you wrote in the post?
I don't quite know what to think of continuous MDPs. I'll wildly and informally conjecture that if the state space is compact, and if the transitions are Lipschitz continuous with respect to the state, it's not a whole lot more powerful than the finite-state MDP formalism.
This seems wrong to me. Can you elaborate on what you mean by "powerful" in this context? Continuous MDPs definitely describe a large variety of environments that cannot be captured by a finite-state MDP, at least not without approximations. Solving continuous MDPs can also be much more difficult than solving finite-state MDPs. For example, any POMDP can be made into a continuous MDP by treating beliefs as states, and finding the optimal policy for a POMDP is PSPACE-hard (as opposed to the case of finite-state MDPs, which is P-easy).
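To make the "beliefs as states" construction concrete, here is a minimal sketch of the belief update that defines the dynamics of the resulting continuous MDP. It assumes a finite POMDP given by nested lists; the names `belief_update`, `transition` and `obs_model`, and the two-state numbers, are hypothetical and chosen only for illustration.

```python
def belief_update(belief, action, observation, transition, obs_model):
    """One Bayes-filter step: returns the posterior belief over hidden states.

    belief[s]                  : current probability of hidden state s
    transition[action][s][s2]  : P(s2 | s, action)
    obs_model[action][s2][o]   : P(o | s2, action)
    """
    n = len(belief)
    # Predict: push the belief through the transition kernel.
    predicted = [
        sum(belief[s] * transition[action][s][s2] for s in range(n))
        for s2 in range(n)
    ]
    # Correct: weight each state by the likelihood of the observation.
    unnormalized = [
        predicted[s2] * obs_model[action][s2][observation] for s2 in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [w / total for w in unnormalized]


# Hypothetical two-state example with a single action (index 0).
transition = [[[0.9, 0.1], [0.2, 0.8]]]
obs_model = [[[0.8, 0.2], [0.3, 0.7]]]
belief = belief_update([0.5, 0.5], 0, 1, transition, obs_model)
```

The belief vector lives in a simplex, so even a two-state POMDP yields a continuum of MDP states, which is why finite-state techniques do not apply directly.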
But the upshot of those MDP techniques is mainly to not search through same plans twice, and if we have an advanced agent that is managing to not evaluate many plans even once, I think there's a good chance that we'll get for free the don't-evaluate-plans-twice behavior.
I guess that you might be thinking exclusively of algorithms that have something like a uniform prior over transition kernels. In this case there is obviously no way to learn about a state without visiting it. But we can also consider algorithms with more sophisticated priors and get much faster learning rates (if the environment is truly sampled from this prior ofc). The best example is, I think, the work of Osband and Van Roy where a regret bound is derived that scales with a certain dimension parameter of the hypothesis space (that can be much smaller than the number of states and actions), work on which I continued to build.
The problem is not in one of the conditions separately but in their conjunction: see my follow-up comment. You could argue that learning an exact model of Carol doesn't really imply condition 2 since, although the model does imply everything Carol is ever going to say, Alice is not capable of extracting this information from the model. But then it becomes a philosophical question of what it means to "believe" something. I think there is value in the "behaviorist" interpretation that "believing X" means "behaving optimally given X". In this sense, Alice can separately believe the two facts described by conditions 1 and 2, but cannot believe their conjunction.
IMO there are two reasons why finite-state MDPs are useful.
First, proving regret bounds for finite-state MDPs is just easier than for infinite-state MDPs (of course any environment can be thought of as an infinite-state MDP), so it serves as a good warm-up even if you want to go beyond it. Certainly many problems can be captured already within this simple setting. Moreover, some algorithms and proof techniques for finite-state MDPs can be generalized to e.g. continuous MDPs (which is already a far more general setting).
Second, we may be able to combine finite-state MDP techniques with an algorithm that learns the relevant features, where "features" in this case corresponds to a mapping from histories to states. Now, of course there needn't be any projection into a finite state space that preserves the exact dynamics of the environment. However, if your algorithm can work with approximate models (as it must anyway), for example using my quasi-Bayesian approach, then such MDP models can be powerful.
I think there is some confusion here coming from the unclear notion of a Bayesian agent with beliefs about theorems of PA. The reformulation I gave with Alice, Bob and Carol makes the problem clearer, I think.
Well, being surprised by Omega seems rational. If I found myself in a real life Newcomb problem I would also be very surprised and suspect a trick for a while.
Moreover, we need to unpack "learns that causality exists". A quasi-Bayesian agent will eventually learn that it is part of a universe ruled by the laws of physics. The laws of physics are the ultimate "Omega": they predict the agent and everything else. Given this understanding, it is not more difficult than it should be to understand Newcomb!Omega as a special case of Physics!Omega. (I don't really have an understanding of quasi-Bayesian learning algorithms and how learning one hypothesis affects the learning of further hypotheses, but it seems plausible that things can work this way.)
Learning theory distinguishes between two types of settings: realizable and agnostic (nonrealizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)
In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know is Shalizi 2009, however it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.
Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. But, so far I haven't studied quasi-Bayesian learning algorithms much, so how do we know it is actually easier than the agnostic setting? Here is a simple example to demonstrate that it is.
Consider a multi-armed bandit, where the arm space is . First, consider the following realizable setting: the reward is a deterministic function which is known to be a polynomial of degree at most. In this setting, learning is fairly easy: it is enough to sample arms in order to recover the reward function and find the optimal arm. It is a special case of the general observation that learning is tractable when the hypothesis space is low-dimensional in the appropriate sense.
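The realizable case can be sketched in a few lines. The original symbols did not survive in the text above, so everything specific here is an assumption made for illustration: the arm space is taken to be the interval [0, 1], the degree bound is called `degree`, and `degree + 1` distinct samples are used (which is enough to pin down any polynomial of degree at most `degree`). The function names and the example reward are hypothetical.

```python
def lagrange_interpolate(xs, ys):
    """Return the unique polynomial of degree < len(xs) through the points."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly


def best_arm(reward, degree, grid_size=1001):
    """Recover the reward polynomial from degree + 1 pulls, then maximize it."""
    if degree < 1:
        raise ValueError("this sketch assumes degree >= 1")
    # Sample degree + 1 distinct, evenly spaced arms in [0, 1] (assumed arm space).
    xs = [i / degree for i in range(degree + 1)]
    ys = [reward(x) for x in xs]
    model = lagrange_interpolate(xs, ys)
    # Maximize the recovered polynomial on a grid (an approximation).
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=model)


# Hypothetical reward: a degree-2 polynomial peaking at arm 0.3.
def true_reward(x):
    return -(x - 0.3) ** 2


arm = best_arm(true_reward, degree=2)
```

The number of pulls depends only on the degree bound, not on how finely the arm space is discretized, which is the sense in which a low-dimensional hypothesis space makes learning tractable.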
Now, consider a closely related agnostic setting. We can still assume the reward function is deterministic, but nothing is known about its shape and we are still expected to find the optimal arm. The arms form a low-dimensional space (one-dimensional actually) but this helps little. It is impossible to predict anything about any arm except those we already tested, and guaranteeing convergence to the optimal arm is therefore also impossible.
Finally, consider the following quasi-realizable setting: each incomplete hypothesis in our class states that the reward function is lower-bounded by a particular polynomial of degree at most. Our algorithm needs to converge to a reward which is at least the maximum of maxima of correct lower bounds. So, the desideratum is weaker than in the agnostic case, but we still impose no hard constraint on the reward function. In this setting, we can use the following algorithm. On each step, fit the most optimistic lower bound to those arms that were already sampled, find its maximum and sample this arm next. I haven't derived the convergence rate, but it seems probable the algorithm will converge rapidly (for low ). This is likely to be a special case of some general result on quasi-Bayesian learning with low-dimensional priors.
Here's another perspective. Suppose that now Bob and Carol have symmetrical roles: each one asks a question, allows Alice to answer, and then reveals the right answer. Alice gets a reward when ey answer correctly. We can now see that perfect honesty actually is tractable. It corresponds to an incomplete hypothesis. If Alice learns this hypothesis, ey answer correctly any question ey already heard before (no matter who asks now and who asked before). We can also consider a different incomplete hypothesis that allows real-time simulation of Carol. If Alice learns this hypothesis, ey answer correctly any question asked by Carol. However, the conjunction of both hypotheses is already intractable. There's no impediment for Alice to learn both hypotheses: ey can both memorize previous answers and answer all questions by Carol. But, this doesn't automatically imply learning the conjunction.
From my perspective, the trouble here comes from the honesty condition. This condition hides an unbounded quantifier: "if the speaker will ever say something, then it is true". So it's no surprise we run into computational complexity and even computability issues.
Consider the following setting. The agent Alice repeatedly interacts with two other entities: Bob and Carol. When Alice interacts with Bob, Bob asks Alice a yes/no question, Alice answers it and receives either +1 or -1 reward depending on whether the answer is correct. When Alice interacts with Carol, Carol tells Alice some question and the answer to that question.
Suppose that Alice starts with some low-information prior and learns over time about Bob and Carol both. The honesty condition becomes "if Carol will ever say and Bob asks the question , then the correct answer is ". But, this condition might be computationally intractable so it is not in the prior and cannot be learned. However, weaker versions of this condition might be tractable, for example "if Carol says at time step between and , and Bob asks at time , then the correct answer is ". Since simulating Bob is still intractable, this condition cannot be expressed as a vanilla Bayesian hypothesis. However, it can be expressed as an incomplete hypothesis. We can also have an incomplete hypothesis that is the conjunction of this weak honesty condition with a full simulation of Carol. Once Alice has learned this incomplete hypothesis, ey answer correctly at least those questions which Carol has already taught em or will teach em within 1000 time steps.
I think that your reasoning here is essentially the same thing I was talking about before:
...the usual philosophical way of thinking about decision theory assumes that the model of the environment is given, whereas in our way of thinking, the model is learned. This is important: for example, if AIXI is placed in a repeated Newcomb's problem, it will learn to one-box, since its model will predict that one-boxing causes the money to appear inside the box. In other words, AIXI might be regarded as a CDT, but the learned "causal" relationships are not the same as physical causality
Since then I evolved this idea into something that wins in counterfactual mugging as well, using quasi-Bayesianism.
I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact that Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!
Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has nonnegative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.
A possible counterargument is, we don't need to depart far from Bayesianism to win here. We only need to somehow justify randomization, perhaps by something like infinitesimal random perturbations of the belief state (like with reflective oracles). But, in a way, this is exactly what quasi-Bayesianism does: a quasi-Bayes-optimal policy is in particular Bayes-optimal when the prior is taken to be in Nash equilibrium of the associated zero-sum game. However, Bayes-optimality underspecifies the policy: not every optimal reply to a Nash equilibrium is a Nash equilibrium.
This argument is not entirely novel: it is just a special case of an environment that the agent cannot simulate, which is the original motivation for quasi-Bayesianism. In some sense, any Bayesian agent is dogmatic: it dogmatically believes that the environment is computationally simple, since it cannot consider a hypothesis which is not. Here, Omega exploits this false dogmatic belief.
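The betting scenario can be illustrated with a toy simulation. This is only a sketch under strong simplifying assumptions that are mine, not part of the argument above: Alice holds a fixed belief of 0.5 in each event and does not treat the offered price as evidence, each bet pays 1 to the buyer if the event occurs, and Omega simply knows the outcome in advance. All function names are hypothetical.

```python
import random


def agent_side(belief, price):
    """Deterministic expected-utility maximizer: buy iff the price looks fair or better."""
    return "buy" if belief >= price else "sell"


def payoff(side, price, outcome):
    # The buyer pays `price` and receives 1 if the event happens; the seller gets the opposite.
    buyer_gain = (1.0 if outcome else 0.0) - price
    return buyer_gain if side == "buy" else -buyer_gain


def omega_price(belief, outcome):
    """Omega knows the outcome and the agent's belief, so it can steer the choice."""
    # If the event will happen, price above the belief so the agent sells;
    # otherwise price below the belief so the agent buys.
    return (belief + 1.0) / 2.0 if outcome else belief / 2.0


def simulate(rounds=1000, belief=0.5, seed=0):
    rng = random.Random(seed)
    deterministic_total = 0.0
    randomized_total = 0.0
    for _ in range(rounds):
        outcome = rng.random() < 0.5
        price = omega_price(belief, outcome)
        deterministic_total += payoff(agent_side(belief, price), price, outcome)
        # Contrast: an agent whose choice of side Omega cannot predict.
        randomized_total += payoff(rng.choice(["buy", "sell"]), price, outcome)
    return deterministic_total, randomized_total


det_total, rand_total = simulate()
```

In this toy setup the deterministic agent loses 0.25 on every single round, while the coin-flipping agent has zero expected loss per round because the two sides of each bet have opposite payoffs. That is the sense in which randomization, however it is justified, is what lets the agent escape.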
I remember reading some speculation that zinc supplements and (separate speculation) garlic supplements might have some beneficial effect against COVID-19, but can't find the source. Does anyone know what the status of that is?
Probably stupid question, but why electrolyte drinks rather than just water?
I don't want this. There's a field of alignment outside of the community that uses the Alignment Forum, with very different ideas about how progress is made; it seems bad to have an evaluation of work they produce according to metrics that they don't endorse.
This seems like a very strange claim to me. If the proponents of the MIRI-rationalist view think that (say) a paper by DeepMind has valuable insights from the perspective of the MIRI-rationalist paradigm, and should be featured in "best [according to MIRI-rationalists] of AI alignment work in 2018", how is it bad? On the contrary, it is very valuable that the MIRI-rationalist community is able to draw each other's attention to this important paper.
So, such a rating seems to have not much upside, and does have downside, in that non-experts who look at these ratings and believe them will get wrong beliefs about which work is useful.
Anything anyone says publicly can be read by a non-expert, and if something wrong was said, and the non-expert believes it, then the non-expert gets wrong beliefs. This is a general problem with non-experts, and I don't see how it is worse here. Of course if the MIRI-rationalist viewpoint is true then the resulting beliefs will not be wrong at all. But this just brings us back to the object-level question.
(I already see people interested in working on CHAI-style stuff who say things that MIRI-rationalist viewpoint says where my internal response is something like "I wish you hadn't internalized these ideas before coming here".)
So, not only is the MIRI-rationalist viewpoint wrong, it is so wrong that it irreversibly poisons the mind of anyone exposed to it? Isn't it a good idea to let people evaluate ideas on their own merits? If someone endorses a wrong idea, shouldn't you be able to convince em by presenting counterarguments? If you cannot present counterarguments, how are you so sure the idea is actually wrong? If the person in question cannot understand the counterargument, doesn't it make em much less valuable for your style of work anyway? Finally, if you actually believe this, doesn't it undermine the entire principle of AI debate? ;)
If I naively imagine using something close to the 2019 review for alignment (even within a single paradigm), I expect my concerns about "sort by prestige" to be much worse, because there are greater political consequences that one could screw up (and, lack of common knowledge about how large those consequences are and how bad they might be might make everyone too anxious to get buy-in).
I don't think so.
Your main example for the prestige problem with the LW review was "affordance widths". I admit that I was one of the people who assigned a lot of negative points to "affordance widths", and also that I did it not purely on abstract epistemic grounds (in those terms the essay is merely mediocre) but because of the added context about the author. When I voted, the question I was answering was "should this be included in Best of 2018", including all considerations. If I wasn't supposed to do this then I'm sorry, I haven't noticed before.
The main reason I think it would be terrible to include "affordance widths" is not exactly prestige. The argument I used before is prestige-based, but that's because I expected this part to be more broadly accepted, and wished to avoid the more charged debate I anticipated if I ventured closer to the core. The main reason is, I think it would send a really bad message to women and other vulnerable populations who are interested in LessWrong: not because of the identity of the author, but because the essay was obviously designed to justify the author's behavior. Some of the reputational ramifications of that would be well-earned (although I also expect the response to be disproportional).
On the other hand, it is hard for me to imagine anything of the sort applying to the Alignment Forum. It would be much more tricky to somehow justify sexual abuse through discussion about AI risk, and if someone accomplished it then surely the AI-alignment-qua-AI-alignment value of that work would be very low. The sort of political considerations that do apply here are not considerations that would affect my vote, and I suspect (although ofc I cannot be sure) the same is true about most other voters.
Also, next time I will adjust my behavior in the LW vote as well, since clearly it is against the intent of the organizers. However, I suggest that some process is created in parallel to the main vote, where context-dependent considerations can be brought up, either for public discussion or for the attention of the moderator team specifically.
I wonder whether Korzybski was indeed a "memetic ancestor" of LessWrong or more like a slightly crazy elder sibling? In other words, were Yudkowsky or other prominent rationalists significantly influenced by Korzybski, or they just came up with similarish ideas independently?
I decided that the answer deserves its own post.
As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is "doesn't manipulate the user" or something like that.) I'm not sure what 9, 11 and 13 are about. For the others, I'd say they're all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.
Hmm. I appreciate the effort, but I don't understand this answer. Maybe discussing this point further is not productive in this format.
I am not an expert but I expect that bridges are constructed so that they don't enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation).
This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven't yet seen a bridge fall down from resonance, and so you don't think about it.
Yes, and in that perspective, the mathematical model can tell me about resonance. It's actually incredibly easy: resonance appears already in simple harmonic oscillators. Moreover, even if I did not explicitly understand resonance, if I proved that the bridge is stable under certain assumptions about the magnitudes of external forces and their spacetime spectrum, it automatically guarantees that resonance will not crash the bridge (as long as the assumptions are realistic). Obviously people have not been so cautious over history, but that doesn't mean we should be careless about AGI as well.
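To show how little machinery is needed, here is a minimal sketch using the textbook steady-state amplitude of a damped oscillator driven by a sinusoidal force. It is a toy model and not a bridge model; the parameter values are arbitrary assumptions chosen to make the peak visible.

```python
import math


def steady_state_amplitude(omega, mass=1.0, stiffness=1.0, damping=0.05, force=1.0):
    """Amplitude of m*x'' + c*x' + k*x = F*cos(omega*t) after transients die out."""
    return force / math.sqrt(
        (stiffness - mass * omega ** 2) ** 2 + (damping * omega) ** 2
    )


natural = math.sqrt(1.0 / 1.0)  # sqrt(k / m) for the default parameters
for omega in (0.5 * natural, natural, 2.0 * natural):
    print(f"omega={omega:.2f}  amplitude={steady_state_amplitude(omega):.2f}")
```

With light damping the response at the natural frequency is far larger than at half or double that frequency. A stability proof that bounds the forcing spectrum away from this peak therefore covers resonance without naming it.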
I understand the argument that sometimes creating and analyzing a realistic mathematical model is difficult. I agree that under time pressure it might be better to compromise on a combination of unrealistic mathematical models, empirical data and informal reasoning. But I don't understand why should we give up so soon? We can work towards realistic mathematical models and prepare fallbacks, and even if we don't arrive at a realistic mathematical model it is likely that the effort will produce valuable insights.
Maybe I'm falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant.
Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a "mathematical theory of how to write code".
First, if I am asked to check whether an element is in an array, or some other easy manipulation of data structures, I obviously don't literally start proving a theorem with pencil and paper. However, my not-fully-formal reasoning is such that I could prove a theorem if I wanted to. My model is not exactly "intuitive": I could explicitly explain every step. And, this is exactly how all of mathematics works! Mathematicians don't write proofs that are machine verifiable (some people do that today, but it's a novel and tiny fraction of mathematics). They write proofs that are good enough so that all the informal steps can be easily made formal by anyone with reasonable background in the field (but actually doing that would be very labor intensive).
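As an illustration of reasoning that is informal but could be made formal, here is the array example with the loop invariant written out as comments. The function name `contains` is hypothetical, and this is just one way such an argument might be spelled out.

```python
def contains(items, target):
    """Return True iff target occurs in items."""
    for index, item in enumerate(items):
        # Invariant: target does not occur in items[:index].
        if item == target:
            # items[index] == target, so target occurs in items.
            return True
    # The loop finished, so the invariant now covers the whole list:
    # target does not occur anywhere in items.
    return False
```

Nobody writes the comments in practice, but each one is a step of an induction on `index` that could be turned into a proof.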
Second, what I actually meant is examples like, I am using an algorithm to solve a system of linear equations, or find the maximal matching in a graph, or find a rotation matrix that minimizes the sum of square distances between two sets, because I have a proof that this algorithm works (or, in some cases, a proof that it at least produces the right answer when it converges). Moreover, this applies to problems that explicitly involve the physical world as well, such as Kalman filters or control loops.
Of course, in the latter case we need to make some assumptions about the physical world in order to prove anything. It's true that in applications the assumptions are often false, and we merely hope that they are good enough approximations. But, when the extra effort is justified, we can do better: we can perform a mathematical analysis of how much the violation of these assumptions affects the result. Then, we can use outside knowledge to verify that the violations are within the permissible margin.
Third, we could also literally prove machine-verifiable theorems about the code. This is called formal verification, and people do that sometimes when the stakes are high (as they definitely are with AGI), although in this case I have no personal experience. But, this is just a "side benefit" of what I was talking about. We need the mathematical theory to know that our algorithms are safe. Formal verification "merely" tells us that the implementation doesn't have bugs (which is something we should definitely worry about too, when it becomes relevant).
I'm curious what you think doesn't require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don't have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that's how I interpret the rocket alignment and security mindset posts.)
I'm not sure about the scope of your question? I made a sandwich this morning without building mathematical theory :) I think that the AI safety community definitely produced some important arguments about AI risk, and these arguments are valid evidence. But, I consider most of the big questions to be far from settled, and I don't see how they could be settled only with this kind of reasoning.
First, if we take PSRL as our model algorithm, then at any given time we follow a policy optimal for some hypothesis sampled out of the belief state. Since our prior favors simple hypotheses, the hypothesis we sampled is likely to be simple. But, given a hypothesis of description complexity , the corresponding optimal policy has description complexity , since the operation "find the optimal policy" has description complexity.
Taking computational resource bounds into account makes things more complicated. For some computing might be intractable, even though itself is "efficiently computable" in some sense. For example we can imagine an that has exponentially many states plus some succinct description of the transition kernel.
One way to deal with it is using some heuristic for optimization. But, then the description complexity is still .
Another way to deal with it is restricting ourselves to the kind of hypotheses for which is tractable, but allowing incomplete/fuzzy hypotheses, so that we can still deal with environments whose complete description falls outside this class. For example, this can take the form of looking for some small subset of features that has predictable behavior that can be exploited. In this approach, the description complexity is probably still something like , where this time is incomplete/fuzzy (but I don't yet know how PSRL for incomplete/fuzzy hypothesis should work).
Moreover, using incomplete models we can in some sense go in the other direction, from policy to model. This might be a good way to think of model-based RL. In actor-critic algorithms, our network learns a pair consisting of a value function and a policy . We can think of such a pair as an incomplete model that is defined by the Bellman inequality interpreted as a constraint on the transition kernel (or and the reward function ):
Assuming that our incomplete prior assigns weight to this incomplete hypothesis, we get a sort of Occam's razor for policies.
I'm claiming that intent alignment captures a large proportion of possible failure modes, that seem particularly amenable to a solution.
Imagine that a fair coin was going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:
 Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.
 Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.
In this situation, machine A is a much better plan.
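The comparison in the quoted example can be checked with a short exact calculation. This is a sketch under two assumptions that the example leaves implicit: with machine A you answer according to the majority of the reported flips, and if machine B is not built in time you fall back to a blind 50/50 guess.

```python
from math import comb


def binom_pmf(n, k, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def machine_a_success(flips=21, error=0.01):
    """P(majority of reported flips matches majority of true flips)."""
    success = 0.0
    for heads in range(flips + 1):
        tails = flips - heads
        p_heads = binom_pmf(flips, heads, 0.5)
        truth_is_heads = heads > tails
        for wrong_h in range(heads + 1):      # heads misreported as tails
            for wrong_t in range(tails + 1):  # tails misreported as heads
                reported_heads = heads - wrong_h + wrong_t
                report_is_heads = reported_heads > flips - reported_heads
                if report_is_heads == truth_is_heads:
                    success += (
                        p_heads
                        * binom_pmf(heads, wrong_h, error)
                        * binom_pmf(tails, wrong_t, error)
                    )
    return success


def machine_b_success(build_probability=0.5):
    # If B is built the answer is certain; otherwise fall back to a blind guess.
    return build_probability * 1.0 + (1 - build_probability) * 0.5


p_a = machine_a_success()
p_b = machine_b_success()
```

Under these assumptions machine B succeeds with probability 0.75, and machine A does noticeably better, since its errors only matter when the true count is close to a tie.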
I am struggling to understand how it works in practice. For example, consider dialogic RL. It is a scheme intended to solve AI alignment in the strong sense. The intent-alignment thesis seems to say that I should be able to find some proper subset of the features in the scheme which is sufficient for alignment in practice. I can approximately list the set of features as:
1. Basic question-answer protocol
2. Natural language annotation
3. Quantilization of questions
4. Debate over annotations
5. Dealing with no user answer
6. Dealing with inconsistent user answers
7. Dealing with changing user beliefs
8. Dealing with changing user preferences
9. Self-reference in user beliefs
10. Quantilization of computations (to combat non-Cartesian daemons, this is not in the original proposal)
11. Reverse questions
12. Translation of counterfactuals from user frame to AI frame
13. User beliefs about computations
EDIT: 14. Confidence threshold for risky actions
Which of these features are necessary for intent-alignment and which are only necessary for strong alignment? I can't tell.
I certainly agree with that. My motivation in choosing this example is that empirically we should not be able to prove that bridges are safe w.r.t resonance, because in fact they are not safe and do fall when resonance occurs.
I am not an expert but I expect that bridges are constructed so that they don't enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation). We want bridges that don't fall, don't we?
I don't build mathematical theories of how to write code, and usually don't prove my code correct
On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.
It also sounds like you're making a normative claim for proofs; I'm more interested in the empirical claim.
I am making both claims to some degree. I can imagine a universe in which the empirical claim is true, and I consider it plausible (but far from certain) that we live in such a universe. But, even just understanding whether we live in such a universe requires building a mathematical theory.
The problem with constructive halting oracles is, they assume the ability to output an arbitrary natural number. But, realistic agents can observe only a finite number of bits per unit of time. Therefore, there is no way to directly observe a constructive halting oracle. We can consider a realization of a constructive halting oracle in which the oracle outputs a natural number one digit at a time. The problem is, since you don't know how long the number is, a candidate oracle might never stop producing digits. In particular, take any nonstandard model of PA and consider an oracle that behaves accordingly. On some machines that don't halt, such an oracle will claim they do halt, but when asked for the time it will produce an infinite stream of digits. There is no way to distinguish such an oracle from the real thing (without assuming axioms beyond PA).
Moreover, this is a special case of a general theorem: if there is any computable procedure that asymptotically tests a complete hypothesis about the environment, then this hypothesis must describe a computable environment. This still allows for testable incomplete hypotheses that postulate hypercomputation in the environment, for example "this box is a halting oracle in some model of PA". But, there is a sense in which the hypothesis itself is computable.
(Btw I made this observation before)
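A minimal sketch of the digit-by-digit argument above, under simplifying assumptions of my own: the nonstandard-model oracle is replaced by a hypothetical stand-in that simply emits the digit 1 forever, and the observer reads a fixed budget of digits. All function names are illustrative. The point it tries to show is that whatever finite prefix the observer has seen from the endless stream, some genuine finite answer begins with the same digits.

```python
from itertools import count, islice

def honest_oracle(halting_time):
    """Yield the decimal digits of a finite halting time, then stop."""
    for digit in str(halting_time):
        yield int(digit)

def endless_oracle():
    """Yield digits forever (hypothetical stand-in for the nonstandard-model oracle)."""
    for _ in count():
        yield 1

def observe(oracle, budget):
    """Read at most `budget` digits, as an agent with bounded observation would."""
    return list(islice(oracle, budget))

def consistent_finite_answer(prefix):
    """A finite number whose digit stream starts with the observed prefix."""
    return int("".join(map(str, prefix)) + "1")

for budget in (1, 10, 100):
    seen = observe(endless_oracle(), budget)
    candidate = consistent_finite_answer(seen)
    # The honest oracle answering `candidate` looks identical within the budget.
    print(budget, observe(honest_oracle(candidate), budget) == seen)
```

This only illustrates the finite-observation part of the argument; it says nothing about which machines a real nonstandard model would misreport.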
One method of solving the problem is looking at empirical performance on objective metrics. For example, we can test different healing methods using RCTs and see which actually work. Or, if "beauty" is defined as "something most people consider beautiful", we can compare designers by measuring how their designs are rated by large groups of people. Of course, if such evidence is not already available, then producing it is usually expensive. But, this is something money can buy, at least in principle. Then again, it requires the arbiter to at least understand how empirical evidence works. Louis XV, to eir misfortune, did not know about RCTs.
Okay. What's the argument that the risk is great (I assume this means "very bad" and not "very likely" since by hypothesis it is unlikely), or that we need a lot of time to solve it?
The reasons the risk is great are standard arguments, so I am a little confused why you ask about this. The setup effectively allows a superintelligent malicious agent (Beta) access to our universe, which can result in extreme optimization of our universe towards inhuman values and tremendous loss of value-according-to-humans. The reason we need a lot of time to solve it is simply that (i) it doesn't seem to be an instance of some standard problem type which we have standard tools to solve and (ii) some people have been thinking about these questions for a while by now and did not come up with an easy solution.
It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.
Yup, I agree (with the caveat that it doesn't have to be a human's interpretation). Nonetheless, an interpretation of what the algorithm does can give you a lot of evidence about whether or not something is actually safe.
Then, I don't understand why you believe that work on anything other than intent-alignment is much less urgent?
The point is just that it seems very plausible that someone might design a theoretical model of the environment in which the bridge is safe, but that model neglects to include resonance because the designer didn't think of it.
"Resonance" is not something you need to explicitly include in your model, it is just a consequence of the equations of motion for an oscillator. This is actually an important lesson about why we need theory: to construct a useful theoretical model you don't need to know all possible failure modes, you only need a reasonable set of assumptions.
I think in practice our confidence in safety often comes from empirical tests.
I think that in practice our confidence in safety comes from a combination of theory and empirical tests. And, the higher the stakes and the more unusual the endeavor, the more theory you need. If you're doing something low stakes or something very similar to things that have been tried many times before, you can rely on trial and error. But if you're sending a spaceship to Mars (or making a superintelligent AI), trial and error is too expensive. Yes, you will test the modules on Earth in conditions as similar to the real environment as you can (respectively, you will do experiments with narrow AI). But ultimately, you need theoretical knowledge to know what can be safely inferred from these experiments. Without theory you cannot extrapolate.
I can quite easily imagine how "human thinking for a day is safe" can be a mathematical assumption.
Agreed, but if you want to eventually talk about neural nets so that you are talking about the AI system you are actually building, you need to use the neural net ontology, and then "human thinking for a day" is not something you can express.
I disagree. For example, suppose that we have a theorem saying that an ANN with particular architecture and learning algorithm can learn any function inside some space with given accuracy. And, suppose that "human thinking for a day" is represented by a mathematical function that we assume to be inside this space and that we assume to be "safe" in some formal sense (for example, it computes an action that doesn't lose much long-term value). Then, your model can prove that imitation learning applied to human thinking for a day is safe. Of course, this example is trivial (modulo the theorem about ANNs), but for more complex settings we can get results that are nontrivial.
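A sketch of how this argument might be written down, under assumptions that are mine and not part of the comment: $\mathcal{F}$ denotes the function space, $f_H$ the function representing human thinking for a day, $V$ a long-term value functional with optimum $V^*$, and $d$ a metric in which $V$ is $L$-Lipschitz. The Lipschitz condition is an extra hypothetical ingredient, added only so that the last step goes through.

```latex
% Assumptions (all hypothetical, for illustration only):
% (A1) Realizability: f_H \in \mathcal{F}.
% (A2) Learning theorem: with probability at least 1 - \delta,
%      the learned function \hat{f} satisfies d(\hat{f}, f_H) \le \epsilon.
% (A3) Safety of the human: V(f_H) \ge V^* - \eta.
% (A4) Regularity: |V(f) - V(g)| \le L \, d(f, g) for all f, g in the domain of V.
\begin{align*}
V(\hat{f})
  &\ge V(f_H) - L \, d(\hat{f}, f_H) && \text{by (A4)} \\
  &\ge V(f_H) - L \epsilon           && \text{by (A2), with probability} \ge 1 - \delta \\
  &\ge V^* - \eta - L \epsilon       && \text{by (A3)}
\end{align*}
```

Everything of substance is hidden in (A1) and (A3), which are outside assumptions rather than things the theory proves. That is what makes the example trivial.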
For the second one, I don't know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don't see the argument that it is very likely.
First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long enough compared to the timeline until the risk. Second, I think this example shows that it is far from straightforward to even informally define what intent-alignment is. Hence, I am skeptical about the usefulness of intent-alignment.
For a more "mundane" example, take IRL. Is IRL intent-aligned? What if its assumptions about human behavior are inadequate and it ends up inferring an entirely wrong reward function? Is it still intent-aligned since it is trying to do what the user wants, it is just wrong about what the user wants? Where is the line between "being wrong about what the user wants" and optimizing something completely unrelated to what the user wants?
It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.
For example, imagine that we prove that stochastic gradient descent on a neural network with particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural "smoothness" condition (an example motivated by already known results). This is a strong feasibility result.
My guess is that any such result will either require samples exponential in the dimensionality of the input space (prohibitively expensive) or the simple and natural condition won't hold for the vast majority of cases that neural networks have been applied to today.
I don't know why you think so, but at least this is a good crux since it seems entirely falsifiable. In any case, exponential sample complexity definitely doesn't count as "strong feasibility".
I don't find smoothness conditions in particular very compelling, because many important functions are not smooth (e.g. most things involving an if condition).
Smoothness is just an example, it is not necessarily the final answer. But also, in classification problems smoothness usually translates to a margin requirement (the classes have to be separated with sufficient distance). So, in some sense smoothness allows for "if conditions" as long as you're not too sensitive to the threshold.
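One way to see the link between smoothness and margin, as a sketch under assumptions of my own: take "smooth" to mean that a scoring function $g$ is $L$-Lipschitz, and suppose the label is the sign of $g$ with $|g(x)| \ge \gamma$ on every input that can actually occur. The symbols $g$, $L$ and $\gamma$ are illustrative and do not come from the comment.

```latex
% Assumptions (hypothetical): g is L-Lipschitz, labels are sign(g(x)),
% and |g(x)| \ge \gamma > 0 on every point that can occur.
% For x with g(x) \ge \gamma and y with g(y) \le -\gamma:
\[
2\gamma \;\le\; g(x) - g(y) \;\le\; L \, \lVert x - y \rVert
\quad\Longrightarrow\quad
\lVert x - y \rVert \;\ge\; \frac{2\gamma}{L}.
\]
```

So an "if condition" is compatible with smoothness as long as inputs stay at least $\gamma / L$ away from the threshold, which is the sense in which the classes must be separated.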
You are a bridge designer. You make the assumption that forces on the bridge will never exceed some value K (necessary because you can't be robust against unbounded forces). You prove your design will never collapse given this assumption. Your bridge collapses anyway because of resonance.
I don't understand this example. If the bridge can never collapse as long as the outside forces don't exceed K, then resonance is covered as well (as long as it is produced by forces below K). Maybe you meant that the outside forces are also assumed to be stationary.
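To illustrate why the distinction between stationary and time-varying forces matters, here is a minimal sketch using the textbook steady-state amplitude of a driven, damped linear oscillator. The numbers are hypothetical and chosen only for illustration; a real bridge is of course not a single oscillator.

```python
import math

def steady_state_amplitude(force, omega, mass, damping, stiffness):
    """Steady-state amplitude of m*x'' + c*x' + k*x = F0*cos(omega*t)."""
    return force / math.sqrt(
        (stiffness - mass * omega**2) ** 2 + (damping * omega) ** 2
    )

# Hypothetical parameters, for illustration only.
mass, damping, stiffness = 1.0, 0.1, 100.0
force = 1.0  # the force magnitude stays fixed, so a bound K on it is never exceeded

natural_frequency = math.sqrt(stiffness / mass)
static = steady_state_amplitude(force, 0.0, mass, damping, stiffness)
resonant = steady_state_amplitude(force, natural_frequency, mass, damping, stiffness)

# Same force magnitude, very different displacement.
print(static, resonant, resonant / static)
```

A proof that covers every force history bounded by K has to cover the resonant case too, whereas a proof that only considers stationary forces of magnitude at most K would miss it.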
The broader point is that when the environment has lots of complicated interaction effects, and you must make assumptions, it is very hard to find assumptions that actually hold.
Nevertheless most engineering projects make heavy use of theory. I don't understand why you think that AGI must be different?
The issue of assumptions in strong feasibility is equivalent to the question of whether powerful agents require highly informed priors. If you need complex assumptions then effectively you have a highly informed prior, whereas if your prior is uninformed then it corresponds to simple assumptions. I think that Hanson (for example) believes that it is indeed necessary to have a highly informed prior, which is why powerful AI algorithms will be complex (since they have to encode this prior) and progress in AI will be slow (since the prior needs to be manually constructed brick by brick). I find this scenario unlikely (for example because humans successfully solve tasks far outside the ancestral environment, so they can't be relying on genetically built-in priors that much), but not ruled out.
However, I assumed that your position is not Hansonian: correct me if I'm wrong, but I assumed that you believed deep learning or something similar is likely to lead to AGI relatively soon. Even if not, you were skeptical about strong feasibility results even for deep learning, regardless of hypothetical future AI technology. But, it doesn't look like deep learning relies on highly informed priors. What we have is relatively simple algorithms that can, with relatively small (or even no) adaptations, solve problems in completely different domains (image processing, audio processing, NLP, playing many very different games, protein folding...). So, how is it possible that all of these domains have some highly complex property that they share, and that is somehow encoded in the deep learning algorithm?
It's really the assumptions that make me pessimistic, which is why it would be a significant update if I saw a mathematical definition of safety that I thought actually captured "safety" without "passing the buck"
I'm curious whether proving a weakly feasible subjective regret bound under assumptions that you agree are otherwise realistic qualifies or not?
...but the theory-based approach requires you to limit your assumptions to things that can be written down in math (e.g. this function is K-Lipschitz) whereas a non-theory-based approach can use "handwavy" assumptions (e.g. a human thinking for a day is safe), which drastically opens up the space of options and makes it more likely that you can find an assumption that is actually mostly true.
I can quite easily imagine how "human thinking for a day is safe" can be a mathematical assumption. In general, which assumptions are formalizable depends on the ontology of your mathematical model (that is, which real-world concepts correspond to the "atomic" ingredients of your model). The choice of ontology is part of drawing the line between what you want your mathematical theory to prove and what you want to bring in as outside assumptions. Like I said before, this line definitely has to be drawn somewhere, but it doesn't at all follow that the entire approach is useless.
...first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work.
I completely agree with this, but isn't this also true of subjective regret bounds / definition-optimization?
The idea is, we will solve the alignment problem by (i) formulating a suitable learning protocol (ii) formalizing a set of assumptions about reality and (iii) proving that under these assumptions, this learning protocol has a reasonable subjective regret bound. So, the role of the subjective regret bound is making sure that what we came up with in i+ii is sufficient, and also guiding the search there. The subjective regret bound does not tell us whether particular assumptions are realistic: for this we need to use common sense and knowledge outside of theoretical computer science (such as: physics, cognitive science, experimental ML research, evolutionary biology...)
Maybe your point is that there are failure modes that aren't covered by intent alignment, in which case I agree, but also it seems like the OP very explicitly said this in many places.
I disagree with the OP that (emphasis mine):
I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.
I think that intent alignment is too ill-defined, and to the extent it is well-defined it is a very weak condition, that is not sufficient to address the urgent core of the problem.
And meanwhile I think very messy real world domains almost always limit strong feasibility results. To the extent that you want your algorithms to do vision or NLP, I think strong feasibility results will have to talk about the environment; it seems quite infeasible to do this with the real world.
I don't think strong feasibility results will have to talk about the environment, or rather, they will have to talk about it on a very high level of abstraction. For example, imagine that we prove that stochastic gradient descent on a neural network with particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural "smoothness" condition (an example motivated by already known results). This is a strong feasibility result. We can then debate whether using such a smooth approximation is sufficient for superhuman performance, but establishing this requires different tools, like I said above.
The way I imagine it, AGI theory should ultimately arrive at some class of priors that are on the one hand rich enough to deserve to be called "general" (or, practically speaking, rich enough to produce superhuman agents) and on the other hand narrow enough to allow for efficient algorithms. For example the Solomonoff prior is too rich, whereas a prior that (say) describes everything in terms of an MDP with a small number of states is too narrow. Finding the golden path in between is one of the big open problems.
That said, most of this belief comes from the fact that empirically it seems like theory often breaks down when it hits the real world.
Does it? I am not sure why you have this impression. Certainly there are phenomena in the real world that we don't yet have enough theory to understand, and certainly a given theory will fail in domains where its assumptions are not justified (where "fail" and "justified" can be a matter of degree). And yet, theory obviously played and plays a central role in science, so I don't understand whence the fatalism.
However, I specifically don't want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability.
Idk, you could have a non-disclosure-by-default policy if you were worried about this. Maybe this can't work for you though.
That seems like it would be an extremely not cost-effective way of making progress. I would invest a lot of time and effort into something that would only be disclosed to the select few, for the sole purpose of convincing them of something (assuming they are even interested to understand it). I imagine that solving AI risk will require collaboration among many people, including sharing ideas and building on other people's ideas, and that's not realistic without publishing. Certainly I am not going to write a Friendly AI on my home laptop :)
It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.
Why doesn't this also apply to subjective regret bounds?
In order to get a subjective regret bound you need to consider an appropriate prior. The way I expect it to work is, the prior guarantees that some actions are safe in the short-term: for example, doing nothing to the environment and asking only sufficiently quantilized queries from the user (see this for one toy model of how "safe in the short-term" can be formalized). Therefore, Beta cannot attack with a hypothesis that will force Alpha to act without consulting the user, since that hypothesis would fall outside the prior.
Now, you can say "with the right prior intent-alignment also works". To which I answer, sure, but first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work. Indeed, we can imagine that the ontology on which the prior is defined includes a "true reward" symbol s.t., by definition, the semantics is whatever the user truly wants. An agent that maximizes expected true reward then can be said to be intent-aligned. If it's doing something bad from the user's perspective, then it is just an "innocent" mistake. But, unless we bake some specific assumptions about the true reward into the prior, such an agent can be anything at all.
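For readers unfamiliar with quantilization, here is a minimal sketch of the generic idea as I understand it: instead of picking the highest-utility query, pick at random among the top q-fraction of queries drawn from some trusted base distribution. This is not the toy model referred to above; the function name, the finite sample of base queries, and the utility function are all hypothetical.

```python
import math
import random

def quantilize(base_samples, utility, q, rng):
    """Pick uniformly at random among the top q-fraction of base_samples, ranked by utility."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base_samples, key=utility, reverse=True)
    top_count = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:top_count])

# Hypothetical usage: 100 candidate queries, scored by an illustrative utility.
rng = random.Random(0)
candidates = list(range(100))
choice = quantilize(candidates, utility=lambda query: query, q=0.1, rng=rng)
print(choice)
```

With q = 1 this just samples from the base distribution, and as q shrinks it approaches plain maximization, so q trades off usefulness against how far the choice can stray from what the base distribution would do.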
Most existing mathematical results do not seem to be competitive, as they get their guarantees by doing something that involves a search over the entire hypothesis space.
This is related to what I call the distinction between "weak" and "strong feasibility". Weak feasibility means algorithms that are polynomial time in the number of states and actions, or the number of hypotheses. Strong feasibility is supposed to be something like, polynomial time in the description length of the hypothesis.
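A rough way to write the distinction down, with notation that is mine and not fixed by the comment: $S$ and $A$ are the state and action sets, $\mathcal{H}$ the hypothesis class, $\ell(h)$ the description length of a hypothesis in bits, and $T$ the running time.

```latex
% Hypothetical notation: S states, A actions, \mathcal{H} hypothesis class,
% \ell(h) the description length of hypothesis h in bits, T running time.
\begin{align*}
\text{weak feasibility:}   \quad & T = \mathrm{poly}(|S|, |A|) \ \text{or}\ T = \mathrm{poly}(|\mathcal{H}|) \\
\text{strong feasibility:} \quad & T = \mathrm{poly}(\ell(h)) \\
\text{gap:} \quad & |\{h : \ell(h) \le n\}| \;\le\; \sum_{i=0}^{n} 2^{i} \;=\; 2^{n+1} - 1
\end{align*}
```

The last line is why the two notions can come apart: a class containing every hypothesis of description length at most $n$ can have exponentially many members, so a running time polynomial in $|\mathcal{H}|$ may still be exponential in $n$.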
It is true that currently we only have strong feasibility results for relatively simple hypothesis spaces (such as, support vector machines). But, this seems to me just a symptom of advances in heuristics outpacing the theory. I don't see any reason of principle that significantly limits the strong feasibility results we can expect. Indeed, we already have some advances in providing a theoretical basis for deep learning.
However, I specifically don't want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability. Instead, I prefer studying safety on the weak feasibility level until we understand everything important on this level, and only then trying to extend it to strong feasibility. This creates somewhat of a conundrum where apparently the one thing that can convince you (and other people?) is the thing I don't think should be done soon.
I could also imagine being pretty interested in a mathematical definition of safety that I thought actually captured "safety" without "passing the buck". I think subjective regret bounds and CIRL both make some progress on this, but somewhat "pass the buck" by requiring a well-specified hypothesis space for rewards / beliefs / observation models.
Can you explain what you mean here? I agree that just saying "subjective regret bound" is not enough, we need to understand all the assumptions the prior should satisfy, reflecting considerations such as, what kind of queries can or cannot manipulate the user. Hence the use of quantilization and debate in Dialogic RL, for example.
There is some similarity, but there are also major differences. They don't even have the same type signature. The dangerousness bound is a desideratum that any given algorithm can either satisfy or not. On the other hand, AUP is a specific heuristic for how to tweak Q-learning. I guess you can consider some kind of regret bound w.r.t. the AUP reward function, but they will still be very different conditions.
The reason I pointed out the relation to corrigibility is not because I think that's the main justification for the dangerousness bound. The motivation for the dangerousness bound is quite straightforward and self-contained: it is a formalization of the condition that "if you run this AI, this won't make things worse than not running the AI", no more and no less. Rather, I pointed the relation out to help readers compare it with other ways of thinking they might be familiar with.
From my perspective, the main question is whether satisfying this desideratum is feasible. I gave some arguments why it might be, but there are also opposite arguments. Specifically, if you believe that debate is a necessary component of Dialogic RL then it seems like the dangerousness bound is infeasible. The AI can become certain that the user would respond in a particular way to a query, but it cannot become (worst-case) certain that the user would not change eir response when faced with some rebuttal. You can't (empirically and in the worst-case) prove a negative.
This opens the possibility of agents that make "well intentioned" mistakes that take the form of sophisticated plans that are catastrophic for the user.
Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly.
The acausal attack is an example of how it can happen for systematic reasons. As for the other part, that seems like conceding that intent-alignment is insufficient and you need "corrigibility" as another condition (also it is not so clear to me what this condition means).
If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn't corrigible, and it would be rather hard to argue that it is intent aligned.
It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.
Now, I do believe that if you set up the prior correctly then it won't happen, thanks to a mechanism like: Alpha knows that in case of dangerous uncertainty it is safe to fall back on some "neutral" course of action plus query the user (in specific, safe, ways). But this exactly shows that intent-alignment is not enough and you need further assumptions.
Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK).
I guess you wouldn't count universality. Overall I agree.
Besides the fact that ascription universality is not formalized, why is it equivalent to intent-alignment? Maybe I'm missing something.
I'm relatively pessimistic about mathematical formalization.
I am curious whether you can specify, as concretely as possible, what type of mathematical result you would have to see in order to significantly update away from this opinion.
I do want to note that all of these require you to make assumptions of the form, "if there are traps, either the user or the agent already knows about them" and so on, in order to avoid no-free-lunch theorems.
No, I make no such assumption. A bound on subjective regret ensures that running the AI is a nearly-optimal strategy from the user's subjective perspective. It is neither needed nor possible to prove that the AI can never enter a trap. For example, the AI is immune to acausal attacks to the extent that the user believes that the AI is not inside Beta's simulation. On the other hand, if the user believes that the simulation hypothesis needs to be taken into account, then the scenario amounts to legitimate acausal bargaining (which has its own complications to do with decision/game theory, but that's mostly a separate concern).
In this essay Paul Christiano proposes a definition of "AI alignment" which is more narrow than other definitions that are often employed. Specifically, Paul suggests defining alignment in terms of the motivation of the agent (which should be, helping the user), rather than what the agent actually does. That is, as long as the agent "means well", it is aligned, even if errors in its assumptions about the user's preferences or about the world at large lead it to actions that are bad for the user.
Rohin Shah's comment on the essay (which I believe is endorsed by Paul) reframes it as a particular way to decompose the AI safety problem. An often used decomposition is "definition-optimization": first we define what it means for an AI to be safe, then we understand how to implement a safe AI. In contrast, Paul's definition of alignment decomposes the AI safety problem as "motivation-competence": first we learn how to design AIs with good motivations, then we learn how to make them competent. Both Paul and Rohin argue that the "motivation" is the urgent part of the problem, the part on which technical AI safety research should focus.
In contrast, I will argue that the "motivation-competence" decomposition is not as useful as Paul and Rohin believe, and the "definition-optimization" decomposition is more useful.
The thesis behind the "motivation-competence" decomposition implicitly assumes a linear, one-dimensional scale of competence. Agents with good motivations and subhuman competence might make silly mistakes but are not catastrophically dangerous (since they are subhuman). Agents with good motivations and superhuman competence will only make mistakes that are "forgivable" in the sense that, our own mistakes would be as bad or worse. Ergo (the thesis concludes), good motivations are sufficient to solve AI safety.
However, in reality competence is multidimensional. AI systems can have subhuman skills in some domains and superhuman skills in other domains, as AI history showed time and time again. This opens the possibility of agents that make "well intentioned" mistakes that take the form of sophisticated plans that are catastrophic for the user. Moreover, there might be limits to the agent's knowledge about certain questions (such as, the user's preferences) that are inherent in the agent's epistemology (more on this below). Given such limits, the agent's competence becomes systematically lopsided. Furthermore, the elimination of such limits is a large part of the "definition" part in the "definition-optimization" framing that the thesis rejects.
As a consequence of the multidimensional nature of competence, the difference between "well intentioned mistake" and "malicious sabotage" is much less clear than naively assumed, and I'm not convinced there is a natural way to remove the ambiguity. For example, consider a superhuman AI Alpha subject to an acausal attack. In this scenario, some agent Beta in the "multiverse" (= prior) convinces Alpha that Alpha exists in a simulation controlled by Beta. The simulation is set up to look like the real Earth for a while, making it a plausible hypothesis. Then, a "treacherous turn" moment arrives in which the simulation diverges from Earth, in a way calculated to make Alpha take irreversible actions that are beneficial for Beta and disastrous for the user.
In the above scenario, is Alpha "motivation-aligned"? We could argue it is not, because it is running the malicious agent Beta. But we could also argue it is motivation-aligned, it just makes the innocent mistake of falling for Beta's trick. Perhaps it is possible to clarify the concept of "motivation" such that in this case, Alpha's motivations are considered bad. But, such a concept would depend in complicated ways on the agent's internals. I think that this is a difficult and unnatural approach, compared to "definition-optimization" where the focus is not on the internals but on what the agent actually does (more on this later).
The possibility of acausal attacks is a symptom of the fact that, environments with irreversible transitions are usually not learnable (this is the problem of traps in reinforcement learning, that I discussed for example here and here), i.e. it is impossible to guarantee convergence to optimal expected utility without further assumptions. When we add preference learning to the mix, the problem gets worse because now even if there are no irreversible transitions, it is not clear the agent will converge to optimal utility. Indeed, depending on the value learning protocol, there might be uncertainties about the user's preferences that the agent can never resolve (this is an example of what I meant by "inherent limits" before). For example, this happens in CIRL (even if the user is perfectly rational, this happens because the user and the AI have different action sets).
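The problem of traps can be illustrated with a deliberately tiny toy model. Everything in it is a hypothetical simplification of mine: there is one reversible "safe" action paying 0.5 per step, one irreversible "risky" action, and two environments that look identical until the agent commits (in one the risky action pays 0 forever, in the other it pays 1 forever). A policy is just the time at which the agent switches to the risky action.

```python
SAFE_REWARD = 0.5  # hypothetical per-step reward for the reversible action

def total_reward(risky_reward, switch_time, horizon):
    """Play safe for `switch_time` steps, then commit irreversibly to the risky action."""
    return SAFE_REWARD * switch_time + risky_reward * (horizon - switch_time)

def regret(risky_reward, switch_time, horizon):
    """Shortfall relative to the best switch time for this environment."""
    best = max(total_reward(risky_reward, t, horizon) for t in range(horizon + 1))
    return best - total_reward(risky_reward, switch_time, horizon)

def worst_case_regret(switch_time, horizon):
    """Maximum regret over the two environments: trap (0.0) and jackpot (1.0)."""
    return max(regret(r, switch_time, horizon) for r in (0.0, 1.0))

horizon = 100
# switch_time == horizon means the agent never takes the risky action.
best_guarantee = min(worst_case_regret(t, horizon) for t in range(horizon + 1))
print(best_guarantee / horizon)
```

In this toy model no switch time does better than regret proportional to the horizon in the worse of the two environments, so average regret cannot be driven to zero without further assumptions about which environment the agent is in.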
These difficulties with the "motivation-competence" framing are much more natural to handle in the "definition-optimization" framing. Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK). Specifically, the mathematical criteria of alignment I proposed are the "dynamic subjective regret bound" and the "dangerousness bound". The former is a criterion which simultaneously guarantees motivation-alignment and competence (as evidence that this criterion can be satisfied, I have the Dialogic Reinforcement Learning proposal). The latter is a criterion that doesn't guarantee competence in general, but guarantees specifically avoiding catastrophic mistakes. This makes it closer to motivation-alignment compared to subjective regret, but different in important ways: it refers to the actual things that the agent does, and the ways in which these things might have catastrophic consequences.
In summary, I am skeptical that "motivation" and "competence" can be cleanly separated in a way that is useful for AI safety, whereas "definition" and "optimization" can be so separated: for example the dynamic subjective regret bound is a "definition" whereas dialogic RL and putative more concrete implementations thereof are "optimizations". My specific proposals might have fatal flaws that weren't discovered yet, but I believe that the general principle of "definition-optimization" is sound, while "motivation-competence" is not.
Thank you for writing this retrospective, it is really interesting! Although I never attended a winter solstice, it sounds like amazing work. I sometimes toy with the idea of organizing a rationalist solstice here in Israel, but after reading this I am rather awed and intimidated, since it's obvious I can't pull off anything that is even close.
This might be a little off-topic, but are there any ideas / resources about how to organize a rationalist winter solstice if (i) you don't have a lot of time or resources (ii) you expect at most a small core of people who are well-familiar with the rationalist memeplex and buy into the rationalist ethos, plus some number of people who are only partway there or are merely curious. Or, is it not worth trying under these conditions?
Btw, one thing that sounded strange was the remark that some people felt lonely but it's okay. I understand that winter solstice is supposed to be dark, but isn't the main point of it to amplify the sense of community? Shouldn't the message be something along the lines of "things are hard, but we're in this together"? Which is the antithesis of loneliness?
Of course you can predict some properties of what an agent will do. In particular, I hope that we will eventually have AGI algorithms that satisfy provable safety guarantees. But, you can't make exact predictions. In fact, there probably is a mathematical law that limits how accurate predictions you can get.
An optimization algorithm is, by definition, something that transforms computational resources into utility. So, if your prediction is so close to the real output that it has similar utility, then it means the way you produced this prediction involved the same product of "optimization power per unit of resources" by "amount of resources invested" (roughly speaking, I don't claim to already know the correct formalism for this). So you would need to either (i) run a similar algorithm with similar resources or (ii) run a dumber algorithm but with more resources or (iii) use fewer resources but an even smarter algorithm.
So, if you want to accurately predict the output of a powerful optimization algorithm, your prediction algorithm would usually have to be either a powerful optimization algorithm in itself (cases i and iii) or prohibitively costly to run (case ii). The exception is cases when the optimization problem is easy, so a dumb algorithm can solve it without much resources (or a human can figure out the answer by emself).
It seems almost tautologically true that you can't accurately predict what an agent will do without actually running the agent. Because, any algorithm that accurately predicts an agent can itself be regarded as an instance of the same agent.
What I expect the abstract theory of intelligence to do is something like producing a categorization of agents in terms of qualitative properties. Whether that's closer to "momentum" or "fitness", I'm not sure the question is even meaningful.
I think the closest analogy is: abstract theory of intelligence is to AI engineering as complexity theory is to algorithmic design. Knowing the complexity class of a problem doesn't tell you the best practical way to solve it, but it does give you important hints. (For example, if the problem is of exponential time complexity then you can only expect to solve it either for small inputs or in some special cases, and average-case complexity tells you just whether these cases need to be very special or not. If the problem is in NC then you know that it's possible to gain a lot from parallelization. If the problem is in NP then at least you can test solutions, et cetera.)
And also, abstract theory of alignment should be to AI safety as complexity theory is to cryptography. Once again, many practical considerations are not covered by the abstract theory, but the abstract theory does tell you what kind of guarantees you can expect and when. (For example, in cryptography we can (sort of) know that a certain protocol has theoretical guarantees, but there is engineering work finding a practical implementation and ensuring that the assumptions of the theory hold in the real system.)
I think that ricraz claims that it's impossible to create a mathematical theory of rationality or intelligence, and that this is a crux, not so? On the other hand, the "momentum vs. fitness" comparison doesn't make sense to me. Specifically, a concept doesn't have to be crisply well-defined in order to use it in mathematical models. Even momentum, which is truly one of the "crisper" concepts in science, is no longer well-defined when spacetime is not asymptotically flat (which it isn't). Much less so are concepts such as "atom", "fitness" or "demand". Nevertheless, physicists, biologists and economists continue to successfully construct and apply mathematical models grounded in such fuzzy concepts. Although in some sense I also endorse the "strawman" that rationality is more like momentum than like fitness (at least some aspects of rationality).
In this essay, ricraz argues that we shouldn't expect a clean mathematical theory of rationality and intelligence to exist. I have debated em about this, and I continue to endorse more or less everything I said in that debate. Here I want to restate some of my (critical) position by building it from the ground up, instead of responding to ricraz point by point.
When should we expect a domain to be "clean" or "messy"? Let's look at everything we know about science. The "cleanest" domains are mathematics and fundamental physics. There, we have crisply defined concepts and elegant, parsimonious theories. We can then "move up the ladder" from fundamental to emergent phenomena, going through high energy physics, molecular physics, condensed matter physics, biology, geophysics / astrophysics, psychology, sociology, economics... On each level more "mess" appears. Why? Occam's razor tells us that we should prioritize simple theories over complex theories. But, we shouldn't expect a theory to be more simple than the specification of the domain. The general theory of planets should be simpler than a detailed description of planet Earth, the general theory of atomic matter should be simpler than the theory of planets, the general theory of everything should be simpler than the theory of atomic matter. That's because when we're "moving up the ladder", we are actually zooming in on particular phenomena, and the information we need to specify "where to zoom in" is translated to the description complexity of theory.
What does it mean in practice about understanding messy domains? The way science solves this problem is by building a tower of knowledge. In this tower, each floor benefits from the interactions both with the floor above it and the floor beneath it. Without understanding macroscopic physics we wouldn't figure out atomic physics, and without figuring out atomic physics we wouldn't figure out high energy physics. This is knowledge "flowing down". But knowledge also "flows up": knowledge of high energy physics allows understanding particular phenomena in atomic physics, knowledge of atomic physics allows predicting the properties of materials and chemical reactions. (Admittedly, some floors in the tower we have now are rather ramshackle, but I think that ultimately the "tower method" succeeds everywhere, as much as success is possible at all).
How does mathematics come in here? Importantly, mathematics is not used only on the lower floors of the tower, but on all floors. The way "messiness" manifests is, the mathematical models for the higher floors are either less quantitatively accurate (but still contain qualitative inputs) or have a lot of parameters that need to be determined either empirically, or using the models of the lower floors (which is one way how knowledge flows up), or some combination of both. Nevertheless, scientists continue to successfully build and apply mathematical models even in "messy" fields like biology and economics.
So, what does it all mean for rationality and intelligence? On what floor does it sit? In fact, the subject of rationality and intelligence is not a single floor, but its own tower (maybe we should imagine science as a castle with many towers connected by bridges).
The foundation of this tower should be the general abstract theory of rationality. This theory is even more fundamental than fundamental physics, since it describes the principles from which all other knowledge is derived, including fundamental physics. We can regard it as a "theory of everything": it predicts everything by making those predictions that a rational agent should make. Solomonoff's theory and AIXI are a part of this foundation, but not all of it. Considerations like computational resource constraints should also enter the picture: complexity theory teaches us that they are also fundamental, they don't require "zooming in" a lot.
But, computational resource constraints are only entirely natural when they are not tied to a particular model of computation. This only covers constraints such as "polynomial time" but not constraints such as a specific bound like $O(n^3)$ time, and even less so a bound with an explicit constant factor. Therefore, once we introduce a particular model of computation (such as a RAM machine), we need to build another floor in the tower, one that will necessarily be "messier". Considering even more detailed properties of the hardware we have, the input/output channels we have, the goal system, the physical environment and the software tools we employ will correspond to adding more and more floors.
Once we agree that it should be possible to create a clean mathematical theory of rationality and intelligence, we can still debate whether it's useful. If we consider the problem of creating aligned AGI from an engineering perspective, it might seem for a moment that we don't really need the bottom layers. After all, when designing an airplane you don't need high energy physics. Well, high energy physics might help indirectly: perhaps it allowed predicting some exotic condensed matter phenomenon which we used to make a better power source, or better materials from which to build the aircraft. But often we can make do without those.
Such an approach might be fine, except that we also need to remember the risks. Now, safety is part of most engineering, and is definitely a part of airplane design. What level of the tower does it require? It depends on the kind of risks you face. If you're afraid the aircraft will not handle the stress and break apart, then you need mechanics and aerodynamics. If you're afraid the fuel will combust and explode, you better know chemistry. If you're afraid lightning will strike the aircraft, you need knowledge of meteorology and electromagnetism, possibly plasma physics as well. The relevant domain of knowledge, and the relevant floor in the tower, is a function of the nature of the risk.
What level of the tower do we need to understand AI risk? What is the source of AI risk? It is not in any detailed peculiarities of the world we inhabit. It is not in the details of the hardware used by the AI. It is not even related to a particular model of computation. AI risk is the result of Goodhart's curse, an extremely general property of optimization systems and intelligent agents. Therefore, addressing AI risk requires understanding the general abstract theory of rationality and intelligence. The upper floors will be needed as well, since the technology itself requires the upper floors (and since we're aligning with humans, who are messy). But, without the lower floors the aircraft will crash.
Some thoughts about embedded agency.
From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and in what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it by a different agent^{[1]}. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.
First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology of the external world. Formally, such an ontology can be an incomplete^{[2]} Markov chain, the reward function being a function of the state. Examples:

The Markov chain is a representation of known physics (or some sector of known physics). The reward corresponds to the total mass of diamond in the world. To make this example work, we only need enough physics to be able to define diamonds. For example, we can make do with quantum electrodynamics + classical gravity and have the Knightian uncertainty account for all nuclear and high-energy phenomena.

The Markov chain is a representation of people and social interactions. The reward corresponds to concepts like "happiness" or "friendship" et cetera. Everything that falls outside the domain of human interactions is accounted for by Knightian uncertainty.

The Markov chain is Botworld with some of the rules left unspecified. The reward is the total number of a particular type of item.
Now we need to somehow connect the agent to the ontology. Essentially we need a way of drawing Cartesian boundaries inside the (a priori non-Cartesian) world. We can accomplish this by specifying a function that assigns an observation and projected action to every state out of some subset of states. Entering this subset corresponds to agent creation, and leaving it corresponds to agent destruction. For example, we can take the ontology to be Botworld + marked robot and the observations and actions be the observations and actions of that robot. If we don't want marking a particular robot as part of the ontology, we can use a more complicated definition of Cartesian boundary that specifies a set of agents at each state plus the data needed to track these agents across time (in this case, the observation and action depend to some extent on the history and not only the current state). I will leave out the details for now.
Finally, we need to define the prior. To do this, we start by choosing some prior over refinements of the ontology. By "refinement", I mean removing part of the Knightian uncertainty, i.e. considering incomplete hypotheses which are subsets of the "ontological belief". For example, if the ontology is underspecified Botworld, the hypotheses will specify some of what was left underspecified. Given such an "objective" prior and a Cartesian boundary, we can construct a "subjective" prior for the corresponding agent. We transform each hypothesis via postulating that taking an action that differs from the projected action leads to a "Nirvana" state. Alternatively, we can allow for stochastic action selection and use the gambler construction.
Does this framework guarantee effective planning for death? A positive answer would correspond to some kind of learnability result (regret bound). To get learnability, we will first need that the reward is either directly or indirectly observable. By "indirectly observable" I mean something like with semi-instrumental reward functions, but accounting for agent mortality. I am not ready to formulate the precise condition atm. Second, we need to consider an asymptotic in which the agent is long lived (in addition to time discount being long-term), otherwise it won't have enough time to learn. Third (this is the trickiest part), we need the Cartesian boundary to flow with the asymptotic as well, making the agent "unspecial". For example, consider Botworld with some kind of simplicity prior. If I am a robot born at cell zero and time zero, then my death is an event of low description complexity. It is impossible to be confident about what happens after such a simple event, since there will always be competing hypotheses with different predictions and a probability that is only lower by a constant factor. On the other hand, if I am a robot born at cell 2439495 at time 9653302 then it would be surprising if the outcome of my death would be qualitatively different from the outcome of the death of any other robot I observed. Finding some natural, rigorous and general way to formalize this condition is a very interesting problem. Of course, even without learnability we can strive for Bayes-optimality or some approximation thereof. But, it is still important to prove learnability under certain conditions to test that this framework truly models rational reasoning about death.
Additionally, there is an intriguing connection between some of these ideas and UDT, if we consider TRL agents. Specifically, a TRL agent can have a reward function that is defined in terms of computations, exactly like UDT is often conceived. For example, we can consider an agent whose reward is defined in terms of a simulation of Botworld, or in terms of taking expected value over a simplicity prior over many versions of Botworld. Such an agent would be searching for copies of itself inside the computations it cares about, which may also be regarded as a form of "embeddedness". It seems like this can be naturally considered a special case of the previous construction, if we allow the "ontological belief" to include beliefs pertaining to computations.
Unless it's some kind of modification that we treat explicitly in our model of the agent, for example a TRL agent reprogramming its own envelope. ↩︎
"Incomplete" in the sense of Knightian uncertainty, like in quasi-Bayesian RL. ↩︎
The addition of the word "honest" doesn't come from an awareness of how the model is flawed. It is one of the explicit assumptions in the model. So, I'm still not sure what point you are going for here.
I think that applying Aumann's theorem to people is mostly interesting in the prescriptive rather than descriptive sense. That is, the theorem tells us that our ability to converge can serve as a test of our rationality, to the extent that we are honest and share the same prior, and all of this is common knowledge. (This last assumption might be the hardest to make sense of. Hanson tried to justify it but IMO not quite convincingly.) Btw, you don't need to compute uncomputable things, much less instantly. Scott Aaronson derived a version of the theorem with explicit computational complexity and query complexity bounds that don't seem prohibitive.
Given all the difficulties, I am not sure how to apply it in the real world and whether that's even possible. I do think it's interesting to think about it. But, to the extent it is possible, it definitely requires honesty.
If an agent is not honest, ey can decide to say only things that provide no evidence regarding the question in hand to the other agent. In this case convergence is not guaranteed. For example, Alice assigns probability 35% to "will it rain tomorrow" but, when asked, says the probability is 21% regardless of what the actual evidence is. Bob assigns probability 89% to "will it rain tomorrow" but, when asked, says the probability is 42% regardless of what the actual evidence is. Alice knows Bob always answers 42%. Bob knows Alice always answers 21%. If they talk to each other, their probabilities will not converge (they won't change at all).
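The example above can be checked with a few lines of Bayes' rule. This is only a minimal sketch: the function name `posterior` and the likelihood values in the "honest" contrast case are illustrative assumptions, not part of the original argument. The point it shows is that a report which is emitted regardless of the evidence has the same likelihood under both hypotheses, so it leaves the listener's probability where it was.

```python
def posterior(prior, p_report_if_rain, p_report_if_dry):
    """Bayes' rule for the binary question "will it rain tomorrow" after hearing a report."""
    evidence = prior * p_report_if_rain + (1.0 - prior) * p_report_if_dry
    return prior * p_report_if_rain / evidence

# Bob says "42%" whatever he has seen, so his report is equally likely in both worlds.
alice_after = posterior(0.35, p_report_if_rain=1.0, p_report_if_dry=1.0)

# Alice says "21%" whatever she has seen, so Bob learns nothing either.
bob_after = posterior(0.89, p_report_if_rain=1.0, p_report_if_dry=1.0)

# Hypothetical contrast (made-up likelihoods): a report that is more likely
# to be heard when rain is in fact coming does move the listener.
alice_after_honest = posterior(0.35, p_report_if_rain=0.9, p_report_if_dry=0.2)

print(alice_after, bob_after, alice_after_honest)
```

With uninformative reports the two posteriors stay at the two priors, so no amount of further exchange of the same kind brings them together.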
Yes, it can luckily happen that the lies still contain enough information for them to converge, but I'm not sure why you seem to think it is an important or natural situation?
The usual formalization of "Occam's prior" is the Solomonoff prior, which still depends on the choice of a Universal Turing Machine, so such agents can still disagree because of different priors.
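To make the dependence concrete, here is the standard invariance bound, stated as a sketch. The symbols $M_U$, $M_V$ and $c_{UV}$ are notation introduced here for illustration: $M_U$ and $M_V$ are the Solomonoff priors defined relative to two universal machines $U$ and $V$, and $c_{UV}$ is the length of a program by which $U$ simulates $V$.

```latex
M_U(x) \;\ge\; 2^{-c_{UV}} \, M_V(x) \qquad \text{for all finite strings } x .
```

So the two priors agree only up to a multiplicative constant. The constant does not depend on $x$, but it can be large, which is why two agents using different machines can still hold noticeably different beliefs on any finite amount of data.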
I think that although the new outlook is more pessimistic, it is also more uncertain. So, yes, maybe we will become extinct, but maybe we will build a utopia.
I sometimes have euphoric experiences accompanied by images and sensations hard to put into words. Everything around becomes magical, the sky fills with images of unimaginable scale in space and time, light is flowing through my body and soul. Usually I also see Elua that appears to me as the image of a woman in the sky: the Mother of all humans, the sad and wise Goddess of limitless Compassion and Love, smiling at me but also crying for all the sorrows of the world. I form a connection with Em, thinking of myself as a priestess or otherwise someone in service of the goddess, enacting Eir will in the world, praying to Em to give me the wisdom and courage to do what needs to be done. In earlier stages of life the symbols were different according to my different worldview (once I was a theist and saw the Abrahamic god).
Sometimes the experience is completely spontaneous (but usually when I'm outside), but sometimes I feel that my mind is in a state amenable to it and I push myself towards it intentionally. I also had a related experience during a circling session and once even during sex.
To be clear, I'm an atheist, I don't believe in anything supernatural, I know it is my own mind producing it. But I do find these experiences valuable on some mental and emotional level.
This essay provides some fascinating case studies and insights about coordination problems and their solutions, from a book by Elinor Ostrom. Coordination problems are a major theme in LessWrongian thinking (for good reasons) and the essay is a valuable addition to the discussion. I especially liked the 8 features of sustainable governance systems (although I wish we got a little more explanation for "nested enterprises").
However, I think that the dichotomy between "absolutism (bad)" and "organically grown institutions (good)" that the essay creates needs more nuance or more explanation. What is the difference between "organic" and "inorganic" institutions? All institutions "grew" somehow. The relevant questions are e.g. how democratic is the institution, whether the scope of the institution is the right scope for this problem, whether the stakeholders have skin in the game (feature 3) et cetera. The 8 features address some of that, but I wish it was more explicit.
Also, it's notable that all examples focus on relatively small scale problems. While it makes perfect sense to start by studying small problems before trying to understand the big problems, it does make me wonder whether going to higher scales brings in qualitatively new issues and difficulties. Paying officials with parcels in the tail end works for water conflicts, but what is the analogous approach to global warming or multinational arms races?
Much of the orthodox LessWrongian approach to rationality (as it is expounded in Yudkowsky's Sequences and onwards) is grounded in Bayesian probability theory. However, I now realize that pure Bayesianism is wrong, instead the right thing is quasi-Bayesianism. This leads me to ask, what are the implications of quasi-Bayesianism on human rationality? What are the right replacements for (the Bayesian approach to) bets, calibration, proper scoring rules et cetera? Does quasi-Bayesianism clarify important confusing issues in regular Bayesianism such as the proper use of inside and outside view? Is there rigorous justification to the intuition that we should have more Knightian uncertainty about questions with less empirical evidence? Does any of it influence various effective altruism calculations in surprising ways? What common LessWrongian wisdom does it undermine, if any?
Thank you for writing this impressive review!
Some comments on MIRI's nondisclosure policy.
First, some disclosure :) My research is funded by MIRI. On the other hand, all of my opinions are my own and do not represent MIRI or anyone else associated with MIRI.
The nondisclosure policy has no direct effect on me, but naturally, both before and after it was promulgated, I used my own judgement to decide what should or should not be made public. The vast majority of my work I do make public (subject only to the cost of time and effort to write and explain it), because if I think something would increase risk rather than reduce it^{[1]}, then I don't pursue this line of inquiry in the first place. Things I don't make public are mostly early stage ideas that I don't develop.
I think it is fair enough to judge AI alignment orgs only by the public output they produce. However, it doesn't at all follow that a nondisclosure policy leads to immediate disqualification, like you seem to imply. You can judge an org by its public output whether or not all of its output is public. This is somewhat similar to the observation that management overhead is a bad metric. Yes, some of your money goes into something that doesn't immediately and directly translate to benefit. All else equal, you want that not to happen. But all else is not equal, and can never be equal.
This is completely tangential, but I think we need more public discussion on how we decide whether making something public is beneficial vs. detrimental. ↩︎
One idea how this formalism can be improved, maybe. Consider a random directed graph, sampled from some "reasonable" (in some sense that needs to be defined) distribution. We can then define "powerful" vertices as vertices from which there are paths to most other vertices. Claim: With high probability over graphs, powerful vertices are connected "robustly" to most vertices. By "robustly" I mean that small changes in the graph don't disrupt the connection. This is because, if your vertex is connected to everything, then disconnecting some edges should still leave plenty of room for rerouting through other vertices. We can then interpret it as saying, gaining power is more robust to inaccuracies of the model or changes in the circumstances than pursuing more "direct" paths to objectives.
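The claim could be probed numerically along the following lines. This is a rough sketch under stated assumptions: the "reasonable" distribution is taken to be a simple Erdős–Rényi-style digraph, "small changes" are modelled as independent random edge deletions, and all function names and parameter values are hypothetical choices made here. It only measures robustness for the most powerful vertex; it does not by itself compare against more "direct" paths.

```python
import random
from collections import deque

def random_digraph(n, p, rng):
    """Each ordered pair (u, v) with u != v becomes an edge with probability p."""
    return {u: {v for v in range(n) if v != u and rng.random() < p} for u in range(n)}

def reachable(graph, source):
    """Vertices reachable from source by breadth-first search (source included)."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def perturb(graph, drop_prob, rng):
    """Copy of the graph in which each edge is deleted with probability drop_prob."""
    return {u: {v for v in nbrs if rng.random() >= drop_prob} for u, nbrs in graph.items()}

def robustness(graph, source, drop_prob, trials, rng):
    """Average fraction of the originally reachable vertices that stay reachable."""
    base = reachable(graph, source)
    total = 0.0
    for _ in range(trials):
        still = reachable(perturb(graph, drop_prob, rng), source)
        total += len(still & base) / len(base)
    return total / trials

rng = random.Random(0)
g = random_digraph(60, 0.08, rng)
# "Power" of a vertex: fraction of the graph it can reach.
power = {v: len(reachable(g, v)) / len(g) for v in g}
strongest = max(power, key=power.get)
score = robustness(g, strongest, drop_prob=0.1, trials=50, rng=rng)
print(strongest, power[strongest], score)
```

A fuller experiment would repeat this over many sampled graphs and compare `score` for powerful vertices against the survival rate of a single fixed path to a target.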
One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.
In general, a prior that contains traps will be unlearnable, meaning that no algorithm has Bayesian regret going to zero in the limit. The only obvious natural requirement for RL agents in this case is approximating Bayes-optimality. However, Bayes-optimality is not even "weakly feasible": it is NP-hard w.r.t. using the number of states and number of hypotheses as security parameters. IMO, the central question is: what kind of natural tractable approximations are there?
Although a generic prior with traps is unlearnable, some priors with traps are learnable. Indeed, it can happen that it's possible to study the environment in a predictably safe way that is guaranteed to produce enough information about the irreversible transitions. Intuitively, as humans we do often use this kind of strategy. But, it is NP-hard to even check whether a given prior is learnable. Therefore, it seems natural to look for particular types of learnable priors that are efficiently decidable.
In particular, consider the following setting, that I call "expanding safety envelope" (XSE). Assume that each hypothesis in the prior is "decorated" by a set $F$ of state-action pairs s.t. (i) any $(s,a) \in F$ is safe, i.e. the leading term of $Q(s,a,\gamma)$ in the $\gamma \to 1$ expansion is maximal (ii) for each $s \in S$, there is $(s,a) \in F$ s.t. $a$ is Blackwell-optimal for $s$ (as a special case we can let $F$ contain all safe actions). Imagine an agent that takes random actions among those a priori known to be in $F$. If there is no such action, it explodes. Then, it is weakly feasible to check (i) whether the agent will explode (ii) for each hypothesis, to which sets of states it can converge. Now, let the agent update on the transition kernel of the set of actions it converged to. This may lead to new actions becoming certainly known to be in $F$. We can then let the agent continue exploring using this new set. Iterating this procedure, the agent either discovers enough safe actions to find an optimal policy, or not. Importantly, deciding this is weakly feasible. This is because, for each hypothesis (i) on the first iteration the possible asymptotic state sets are disjoint (ii) on subsequent iterations we might as well assume they are disjoint, since it's possible to see that if you reach a particular state of an asymptotic state set, then you can add the entire state set (this modification will not create new final outcomes and will only eliminate final outcomes that are better than those remaining). Therefore the number of asymptotic state sets you have to store on each iteration is bounded by the total number of states.
The next questions are (i) what kind of regret bounds we can prove for decorated priors that are XSE-learnable? (ii) given an arbitrary decorated prior, is it possible to find the maximal-probability-mass set of hypotheses, which is XSE-learnable? I speculate that the second question might turn out to be related to the unique games conjecture. By analogy with other optimization problems that are feasible only when maximal score can be achieved, maybe the UGC implies that we cannot find the maximal set but we can find a set that is approximately maximal, with an optimal approximation ratio (using a sum-of-squares algorithm). Also, it might make sense to formulate stronger desiderata which reflect that, if the agent assumes a particular subset of the prior but discovers that it was wrong, it will still do its best in the following. That is, in this case the agent might fall into a trap but at least it will try to avoid further traps.
This has implications even for learning without traps. Indeed, most known theoretical regret bounds involve a parameter that has to do with how costly a mistake it is possible to make. This parameter can manifest as the MDP diameter, the bias span or the mixing time. Such regret bounds seem unsatisfactory since the worst-case mistake determines the entire guarantee. We can take the perspective that such costly but reversible mistakes are "quasi-traps": not actual traps, but trap-like on short timescales. This suggests that applying an approach like XSE to quasi-traps should lead to qualitatively stronger regret bounds. Such regret bounds would imply learning faster on less data, and in episodic learning they would imply learning inside each episode, something that is notoriously absent in modern episodic RL systems like AlphaStar.
Moreover, we can also use this to do away with ergodicity assumptions. Ergodicity assumptions require the agent to "not wander too far" in state space, in the simplest case because the entire state space is small. But, instead of "wandering far" from a fixed place in state space, we can constrain "wandering far" w.r.t. the optimal trajectory. Combining this with XSE, this should lead to guarantees that depend on the prevalence of irreversible and quasi-irreversible departures from this trajectory.
In multi-armed bandits and RL theory, there is a principle known as "optimism in the face of uncertainty". This principle says, you should always make optimistic assumptions: if you are wrong, you will find out (because you will get less reward than you expected). It explicitly underlies UCB algorithms and is implicit in other algorithms, like Thompson sampling. But, this fails miserably in the presence of traps. I think that approaches like XSE point at a more nuanced principle: "optimism in the face of cheap-to-resolve uncertainty, pessimism in the face of expensive-to-resolve uncertainty". Following this principle doesn’t lead to actual Bayes-optimality, but perhaps it is in some sense a good enough approximation.
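For readers who have not seen the optimistic principle written out, here is a minimal sketch of the textbook UCB1 rule on Bernoulli arms. The function name `ucb1`, the arm means and the horizon are illustrative assumptions. Each arm's index is its empirical mean plus an exploration bonus that shrinks as the arm is pulled more often, so an arm that turns out to be bad is abandoned after a modest number of pulls.

```python
import math
import random

def ucb1(arm_means, horizon, rng):
    """Run UCB1 on Bernoulli arms and return how many times each arm was pulled."""
    k = len(arm_means)
    pulls = [0] * k
    totals = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            # Initial round: pull every arm once so each empirical mean is defined.
            arm = t - 1
        else:
            # Optimistic index: empirical mean plus a confidence bonus.
            arm = max(
                range(k),
                key=lambda i: totals[i] / pulls[i] + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
    return pulls

pulls = ucb1([0.2, 0.5, 0.8], horizon=2000, rng=random.Random(0))
print(pulls)
```

Note that the rule tries every arm at least once. That is harmless here because every mistake is reversible, but if one arm were a trap, that single optimistic pull would already be the irreversible mistake.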
Google Maps is not a relevant example. I am talking about "generally intelligent" agents. Meaning that, these agents construct sophisticated models of the world starting from a relatively uninformed prior (comparably to humans or more so)(fn1)(fn2). This is in sharp contrast to Google Maps that operates strictly within the model it was given a priori. General intelligence is important, since without it I doubt it will be feasible to create a reliable defense system. Given general intelligence, convergent instrumental goals follow: any sufficiently sophisticated model of the world implies that achieving convergent instrumental goals is instrumentally valuable.
I don't think it makes that much difference whether a human executes the plan or the AI itself. If the AI produces a plan that is not human comprehensible and the human follows it blindly, the human effectively becomes just an extension of the AI. On the other hand, if the AI produces a plan which is human comprehensible, then after reviewing the plan the human can just as well delegate its execution to the AI.
I am not sure what is the significance in this context of "one true algorithm for planning"? My guess is, there is a relatively simple qualitatively optimal AGI algorithm(fn3), and then there are various increasingly complex quantitative improvements of it, which take into account specifics of computing hardware and maybe our priors about humans and/or the environment. Which is the way algorithms for most natural problems behave, I think. But also improvements probably stop mattering beyond the point where the AGI can come up with them on its own within a reasonable time frame. And, I dispute Richard's position. But then again, I don't understand the relevance.
(fn1) When I say "construct models" I am mostly talking about the properties of the agent rather than the structure of the algorithm. That is, the agent can effectively adapt to a large class of different environments or exploit a large class of different properties the environment can have. In this sense, model-free RL is also constructing models. Although I'm also leaning towards the position that explicitly model-based approaches are more likely to scale to AGI.
(fn2) Even if you wanted to make a superhuman AI that only solves mathematical problems, I suspect that the only way it could work is by having the AI generate models of "mathematical behaviors".
(fn3) As an analogy, a "qualitatively optimal" algorithm for a problem in P is just any polynomial time algorithm. In the case of AGI, I imagine a similar computational complexity bound plus some (also qualitative) guarantee(s) about sample complexity and/or query complexity. By "relatively simple" I mean something like: can be described within 20 pages, given that we can use algorithms for other natural problems.
I propose a counterexample. Suppose we are playing a series of games with another agent. To play effectively, we train a circuit to predict the opponent's moves. At this point the circuit already contains an adversarial agent. However, one could object that it's unfair: we asked for an adversarial agent so we got an adversarial agent (nevertheless for AI alignment it's still a problem). To remove the objection, let's make some further assumptions. The training is done on some set of games, but distributional shift happens and later games are different. The opponent knows this, so on the training games it simulates a different agent. Specifically, it simulates an agent who searches for a strategy s.t. the best response to this strategy has the strongest counter-response. The minimal circuit hence contains the same agent. On the training data we win, but on the shifted distribution the daemon deceives us and we lose.
I certainly agree that we will want AI systems that can find good actions, where "good" is based on long-term consequences. However, I think counterfactual oracles and recursive amplification also meet this criterion; I'm not sure why you think they are counterarguments. Perhaps you think that the AI system also needs to autonomously execute the actions it finds, whereas I find that plausible but not necessary?
Maybe we need to further refine the terminology. We could say that counterfactual oracles are not intrinsically goal-directed, meaning that the algorithm doesn't start with all the necessary components to produce good plans, but instead tries to learn these components by emulating humans. This approach comes with costs that I think will make it uncompetitive compared to intrinsically goal-directed agents, for the reasons I mentioned before. Moreover, I think that any agent which is "extrinsically goal-directed" rather than intrinsically goal-directed will have such penalties.
In order for an agent to gain strategic advantage it is probably not necessary for it to be powerful enough to emulate humans accurately, reliably and significantly faster than real-time. We can consider three possible worlds:
World A: Agents that aren't powerful enough for even a limited scope short-term emulation of humans can gain strategic advantage. This world is a problem even for Dialogic RL, but I am not sure whether it's a fatal problem.
World B: Agents that aren't powerful enough for a short-term emulation of humans cannot gain strategic advantage. Agents that aren't powerful enough for a long-term emulation of humans (i.e. high bandwidth and faster than real-time) can gain strategic advantage. This world is good for Dialogic RL but bad for extrinsically goal-directed approaches.
World C: Agents that aren't powerful enough for a long-term emulation of humans cannot gain strategic advantage. In this world delegating the remaining part of the AI safety problem to extrinsically goal-directed agents is viable. However, if unaligned intrinsically goal-directed agents are deployed before a defense system is implemented, they will probably still win because of their more efficient use of computing resources, because of their lower risk-aversiveness, because even a sped-up version of the human algorithm might still have suboptimal sample complexity, and because of attacks from the future. Dialogic RL will also be disadvantaged compared to unaligned AI (because of risk-aversiveness) but at least the defense system will be constructed faster.
Allowing the AI to execute the actions it finds is also advantageous because of higher bandwidths and shorter reaction times. But this concerns me less.
I think that the discussion might be missing a distinction between different types or degrees of goal-directedness. For example, consider Dialogic Reinforcement Learning. Does it describe a goal-directed agent? On the one hand, you could argue it doesn't, because this agent doesn't have fixed preferences and doesn't have consistent beliefs over time. On the other hand, you could argue it does, because this agent is still doing long-term planning in the physical world. So, I definitely agree that aligned AI systems will only be goal-directed in the weaker sense that I alluded to, rather than in the stronger sense, and this is because the user is only goal-directed in the weak sense emself.
If we're aiming at "weak" goal-directedness (which might be consistent with your position?), does it mean studying strong goal-directedness is redundant? I think the answer is clearly no. Strong goal-directed systems are a simpler special case on which to hone our theories of intelligence. Trying to understand weak goal-directed agents without understanding strong goal-directed agents seems to me like trying to understand molecules without understanding atoms.
On the other hand, I am skeptical about solutions to AI safety that require the user doing a sizable fraction of the actual planning. I think that planning does not decompose into an easy part and a hard part (which is not essentially planning in itself) in a way which would enable such systems to be competitive with fully autonomous planners. The strongest counterargument to this position, IMO, is the proposal to use counterfactual oracles or recursively amplified versions thereof in the style of IDA. However, I believe that such systems will still fail to be simultaneously safe and competitive because (i) forecasting is hard if you don't know which features are important to forecast, and becomes doubly hard if you need to impose a confidence threshold to avoid catastrophic errors and in particular malign hypotheses (thresholds of the sort used in delegative RL), (ii) it seems plausible that competitive AI would have to be recursively self-improving (I updated towards this position after coming up with Turing RL) and that might already necessitate long-term planning, and (iii) such systems are vulnerable to attacks from the future and to attacks from counterfactual scenarios.
I think when I wrote the sequence, I thought the "just do deep RL" approach to AGI wouldn't work, and now I think it has more of a chance, and this has updated me towards powerful AI systems being goal-directed. (However, I do not think it is clear that "just do deep RL" approaches lead to goal-directed systems.)
To be clear, my own position is not strongly correlated with whether deep RL leads to AGI (i.e. I think it's true even if deep RL doesn't lead to AGI). But also, the question seems somewhat underspecified, since it's not clear which algorithmic innovation would count as still "just deep RL" and which wouldn't.