There is some similarity, but there are also major differences. They don't even have the same type signature. The dangerousness bound is a desideratum that any given algorithm can either satisfy or not. On the other hand, AUP is a specific heuristic for tweaking Q-learning. I guess you can consider some kind of regret bound w.r.t. the AUP reward function, but that would still be a very different condition.
The reason I pointed out the relation to corrigibility is not because I think that's the main justification for the dangerousness bound. The motivation for the dangerousness bound is quite straightforward and self-contained: it is a formalization of the condition that "if you run this AI, this won't make things worse than not running the AI", no more and no less. Rather, I pointed the relation out to help readers compare it with other ways of thinking they might be familiar with.
From my perspective, the main question is whether satisfying this desideratum is feasible. I gave some arguments why it might be, but there are also opposite arguments. Specifically, if you believe that debate is a necessary component of Dialogic RL, then it seems like the dangerousness bound is infeasible. The AI can become certain that the user would respond in a particular way to a query, but it cannot become (worst-case) certain that the user would not change eir response when faced with some rebuttal. You can't (empirically and in the worst case) prove a negative.
This opens the possibility of agents that make "well-intentioned" mistakes which take the form of sophisticated plans that are catastrophic for the user.
Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly.
The acausal attack is an example of how it can happen for systematic reasons. As for the other part, that seems like conceding that intent alignment is insufficient and you need "corrigibility" as another condition (also, it is not so clear to me what this condition means).
If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn't corrigible, and it would be rather hard to argue that it is intent aligned.
It is possible that Alpha cannot predict it, because in Beta's simulation world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.
Now, I do believe that if you set up the prior correctly then it won't happen, thanks to a mechanism like: Alpha knows that in case of dangerous uncertainty it is safe to fall back on some "neutral" course of action plus query the user (in specific, safe ways). But this exactly shows that intent alignment is not enough and you need further assumptions.
Moreover, the latter has already produced viable directions for mathematical formalization, and the former has not (AFAIK).
I guess you wouldn't count universality. Overall I agree.
Besides the fact that ascription universality is not formalized, why is it equivalent to intent alignment? Maybe I'm missing something.
I'm relatively pessimistic about mathematical formalization.
I am curious whether you can specify, as concretely as possible, what type of mathematical result you would have to see in order to significantly update away from this opinion.
I do want to note that all of these require you to make assumptions of the form, "if there are traps, either the user or the agent already knows about them" and so on, in order to avoid no-free-lunch theorems.
No, I make no such assumption. A bound on subjective regret ensures that running the AI is a nearly optimal strategy from the user's subjective perspective. It is neither needed nor possible to prove that the AI can never enter a trap. For example, the AI is immune to acausal attacks to the extent that the user believes that the AI is not inside Beta's simulation. On the other hand, if the user believes that the simulation hypothesis needs to be taken into account, then the scenario amounts to legitimate acausal bargaining (which has its own complications to do with decision/game theory, but that's mostly a separate concern).
In this essay Paul Christiano proposes a definition of "AI alignment" which is more narrow than other definitions that are often employed. Specifically, Paul suggests defining alignment in terms of the motivation of the agent (which should be, helping the user), rather than what the agent actually does. That is, as long as the agent "means well", it is aligned, even if errors in its assumptions about the user's preferences or about the world at large lead it to actions that are bad for the user.
Rohin Shah's comment on the essay (which I believe is endorsed by Paul) reframes it as a particular way to decompose the AI safety problem. An often-used decomposition is "definition-optimization": first we define what it means for an AI to be safe, then we understand how to implement a safe AI. In contrast, Paul's definition of alignment decomposes the AI safety problem as "motivation-competence": first we learn how to design AIs with good motivations, then we learn how to make them competent. Both Paul and Rohin argue that "motivation" is the urgent part of the problem, the part on which technical AI safety research should focus.
In contrast, I will argue that the "motivation-competence" decomposition is not as useful as Paul and Rohin believe, and the "definition-optimization" decomposition is more useful.
The thesis behind the "motivation-competence" decomposition implicitly assumes a linear, one-dimensional scale of competence. Agents with good motivations and subhuman competence might make silly mistakes but are not catastrophically dangerous (since they are subhuman). Agents with good motivations and superhuman competence will make only mistakes that are "forgivable" in the sense that our own mistakes would be as bad or worse. Ergo (the thesis concludes), good motivations are sufficient to solve AI safety.
However, in reality competence is multidimensional. AI systems can have subhuman skills in some domains and superhuman skills in other domains, as AI history has shown time and time again. This opens the possibility of agents that make "well-intentioned" mistakes which take the form of sophisticated plans that are catastrophic for the user. Moreover, there might be limits to the agent's knowledge about certain questions (such as the user's preferences) that are inherent in the agent's epistemology (more on this below). Given such limits, the agent's competence becomes systematically lopsided. Furthermore, the elimination of such limits is a large part of the "definition" side of the "definition-optimization" framing that the thesis rejects.
As a consequence of the multidimensional nature of competence, the difference between "well-intentioned mistake" and "malicious sabotage" is much less clear than naively assumed, and I'm not convinced there is a natural way to remove the ambiguity. For example, consider a superhuman AI Alpha subject to an acausal attack. In this scenario, some agent Beta in the "multiverse" (= prior) convinces Alpha that Alpha exists in a simulation controlled by Beta. The simulation is set up to look like the real Earth for a while, making it a plausible hypothesis. Then, a "treacherous turn" moment arrives in which the simulation diverges from Earth, in a way calculated to make Alpha take irreversible actions that are beneficial for Beta and disastrous for the user.
In the above scenario, is Alpha "motivation-aligned"? We could argue it is not, because it is running the malicious agent Beta. But we could also argue it is motivation-aligned: it just makes the innocent mistake of falling for Beta's trick. Perhaps it is possible to clarify the concept of "motivation" such that in this case, Alpha's motivations are considered bad. But such a concept would depend in complicated ways on the agent's internals. I think that this is a difficult and unnatural approach, compared to "definition-optimization" where the focus is not on the internals but on what the agent actually does (more on this later).
The possibility of acausal attacks is a symptom of the fact that environments with irreversible transitions are usually not learnable (this is the problem of traps in reinforcement learning, which I discussed for example here and here), i.e. it is impossible to guarantee convergence to optimal expected utility without further assumptions. When we add preference learning to the mix, the problem gets worse because now, even if there are no irreversible transitions, it is not clear that the agent will converge to optimal utility. Indeed, depending on the value learning protocol, there might be uncertainties about the user's preferences that the agent can never resolve (this is an example of what I meant by "inherent limits" before). For example, this happens in CIRL (even if the user is perfectly rational, this happens because the user and the AI have different action sets).
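To make the trap problem concrete, here is a toy illustration (my own hypothetical example, not from the original discussion): an environment where one action is irreversibly bad in some worlds and optimal in others, so no learner can guarantee low regret in both worlds without prior knowledge.

```python
# Toy illustration (hypothetical example): why traps block learnability.
# The "safe" action yields reward 0.5 every step. The "risky" action
# yields reward 1.0 every step in the "jackpot" world, but in the
# "trap" world it leads to an absorbing state with reward 0 forever.

def run_agent(try_risky, env_is_trap, horizon=100):
    """Total reward of an agent that either always plays safe or
    commits to the risky action."""
    total, trapped = 0.0, False
    for _ in range(horizon):
        if trapped:
            continue  # absorbing trap state: reward 0, no recovery
        if try_risky:
            if env_is_trap:
                trapped = True  # irreversible transition
            else:
                total += 1.0
        else:
            total += 0.5
    return total

# The explorer is optimal in the jackpot world and catastrophic in the
# trap world; the cautious agent is mediocre in both. Distinguishing
# the worlds requires entering the (possibly irreversible) state, so no
# algorithm achieves low regret in both without extra assumptions.
print(run_agent(True, env_is_trap=False))   # 100.0
print(run_agent(True, env_is_trap=True))    # 0.0
print(run_agent(False, env_is_trap=False))  # 50.0
```

This is of course a caricature of the real no-free-lunch obstruction, but it shows why regret guarantees must be stated relative to assumptions (or relative to the user's subjective beliefs) rather than against all environments.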
These difficulties with the "motivation-competence" framing are much more natural to handle in the "definition-optimization" framing. Moreover, the latter has already produced viable directions for mathematical formalization, and the former has not (AFAIK). Specifically, the mathematical criteria of alignment I proposed are the "dynamic subjective regret bound" and the "dangerousness bound". The former is a criterion which simultaneously guarantees motivation-alignment and competence (as evidence that this criterion can be satisfied, I have the Dialogic Reinforcement Learning proposal). The latter is a criterion that doesn't guarantee competence in general, but guarantees specifically avoiding catastrophic mistakes. This makes it closer to motivation-alignment compared to subjective regret, but different in important ways: it refers to the actual things the agent does, and the ways in which these things might have catastrophic consequences.
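For readers unfamiliar with regret bounds, the following schematic (my notation, not the author's exact formalism) conveys the general shape of such a criterion: the agent's policy should, in expectation over the user's own belief state, come close to the best achievable utility.

```latex
% Schematic only (my notation): a subjective regret bound requires
% that, relative to the user's prior \zeta over environments \mu,
% running the agent's policy \pi loses little utility compared to
% the best policy in hindsight:
\mathrm{Reg}(\pi) \;=\; \mathbb{E}_{\mu \sim \zeta}\!\left[\, \sup_{\pi^*} U^{\mu}(\pi^*) - U^{\mu}(\pi) \,\right] \;\le\; \epsilon
% A dangerousness bound, by contrast, only constrains catastrophic,
% irreversible losses rather than overall suboptimality.
```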
In summary, I am skeptical that "motivation" and "competence" can be cleanly separated in a way that is useful for AI safety, whereas "definition" and "optimization" can be so separated: for example, the dynamic subjective regret bound is a "definition" whereas dialogic RL and putative more concrete implementations thereof are "optimizations". My specific proposals might have fatal flaws that weren't discovered yet, but I believe that the general principle of "definition-optimization" is sound, while "motivation-competence" is not.
Thank you for writing this retrospective, it is really interesting! Although I never attended a winter solstice, it sounds like amazing work. I sometimes toy with the idea of organizing a rationalist solstice here in Israel, but after reading this I am rather awed and intimidated, since it's obvious I can't pull off anything that is even close.
This might be a little off-topic, but are there any ideas / resources about how to organize a rationalist winter solstice if (i) you don't have a lot of time or resources, (ii) you expect at most a small core of people who are well-familiar with the rationalist memeplex and buy into the rationalist ethos, plus some number of people who are only partway there or merely curious? Or is it not worth trying under these conditions?
Btw, one thing that sounded strange was the remark that some people felt lonely but it's okay. I understand that winter solstice is supposed to be dark, but isn't the main point of it to amplify the sense of community? Shouldn't the message be something along the lines of "things are hard, but we're in this together"? Which is the antithesis of loneliness?
Of course you can predict some properties of what an agent will do. In particular, I hope that we will eventually have AGI algorithms that satisfy provable safety guarantees. But, you can't make exact predictions. In fact, there probably is a mathematical law that limits how accurate your predictions can be.
An optimization algorithm is, by definition, something that transforms computational resources into utility. So, if your prediction is so close to the real output that it has similar utility, then it means the way you produced this prediction involved the same product of "optimization power per unit of resources" times "amount of resources invested" (roughly speaking; I don't claim to already know the correct formalism for this). So you would need to either (i) run a similar algorithm with similar resources, or (ii) run a dumber algorithm but with more resources, or (iii) use fewer resources but an even smarter algorithm.
So, if you want to accurately predict the output of a powerful optimization algorithm, your prediction algorithm would usually have to be either a powerful optimization algorithm in itself (cases i and iii) or prohibitively costly to run (case ii). The exception is cases when the optimization problem is easy, so a dumb algorithm can solve it without many resources (or a human can figure out the answer by emself).
It seems almost tautologically true that you can't accurately predict what an agent will do without actually running the agent. Because, any algorithm that accurately predicts an agent can itself be regarded as an instance of the same agent.
What I expect the abstract theory of intelligence to do is something like producing a categorization of agents in terms of qualitative properties. Whether that's closer to "momentum" or "fitness", I'm not sure the question is even meaningful.
I think the closest analogy is: abstract theory of intelligence is to AI engineering as complexity theory is to algorithm design. Knowing the complexity class of a problem doesn't tell you the best practical way to solve it, but it does give you important hints. (For example, if the problem has exponential time complexity then you can only expect to solve it either for small inputs or in some special cases, and average-case complexity tells you just whether these cases need to be very special or not. If the problem is in NC then you know that it's possible to gain a lot from parallelization. If the problem is in NP then at least you can test solutions, et cetera.)
And also, abstract theory of alignment should be to AI safety as complexity theory is to cryptography. Once again, many practical considerations are not covered by the abstract theory, but the abstract theory does tell you what kind of guarantees you can expect and when. (For example, in cryptography we can (sort of) know that a certain protocol has theoretical guarantees, but there is engineering work finding a practical implementation and ensuring that the assumptions of the theory hold in the real system.)
I think that ricraz claims that it's impossible to create a mathematical theory of rationality or intelligence, and that this is a crux, not so? On the other hand, the "momentum vs. fitness" comparison doesn't make sense to me. Specifically, a concept doesn't have to be crisply well-defined in order to use it in mathematical models. Even momentum, which is truly one of the "crisper" concepts in science, is no longer well-defined when spacetime is not asymptotically flat (which it isn't). Much less so are concepts such as "atom", "fitness" or "demand". Nevertheless, physicists, biologists and economists continue to successfully construct and apply mathematical models grounded in such fuzzy concepts. Although in some sense I also endorse the "strawman" that rationality is more like momentum than like fitness (at least some aspects of rationality).
In this essay, ricraz argues that we shouldn't expect a clean mathematical theory of rationality and intelligence to exist. I have debated em about this, and I continue to endorse more or less everything I said in that debate. Here I want to restate some of my (critical) position by building it from the ground up, instead of responding to ricraz point by point.
When should we expect a domain to be "clean" or "messy"? Let's look at everything we know about science. The "cleanest" domains are mathematics and fundamental physics. There, we have crisply defined concepts and elegant, parsimonious theories. We can then "move up the ladder" from fundamental to emergent phenomena, going through high energy physics, molecular physics, condensed matter physics, biology, geophysics / astrophysics, psychology, sociology, economics... On each level more "mess" appears. Why? Occam's razor tells us that we should prioritize simple theories over complex theories. But, we shouldn't expect a theory to be simpler than the specification of its domain. The general theory of planets should be simpler than a detailed description of planet Earth, the general theory of atomic matter should be simpler than the theory of planets, the general theory of everything should be simpler than the theory of atomic matter. That's because when we're "moving up the ladder", we are actually zooming in on particular phenomena, and the information we need to specify "where to zoom in" translates into the description complexity of the theory.
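The "zooming in" argument can be stated loosely in terms of description (Kolmogorov) complexity. This schematic is my own gloss, not anything from the original:

```latex
% Rough schematic (my notation): if T_D denotes the best theory of a
% sub-domain D, its description length is roughly that of the general
% theory plus the cost of pointing at D:
K(T_D) \;\approx\; K(T_{\text{everything}}) + K(D \mid T_{\text{everything}})
% So theories of more specific ("higher-floor") domains cannot be
% expected to be simpler than the specification of the domain itself.
```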
What does it mean in practice about understanding messy domains? The way science solves this problem is by building a tower of knowledge. In this tower, each floor benefits from the interactions both with the floor above it and the floor beneath it. Without understanding macroscopic physics we wouldn't figure out atomic physics, and without figuring out atomic physics we wouldn't figure out high energy physics. This is knowledge "flowing down". But knowledge also "flows up": knowledge of high energy physics allows understanding particular phenomena in atomic physics, knowledge of atomic physics allows predicting the properties of materials and chemical reactions. (Admittedly, some floors in the tower we have now are rather ramshackle, but I think that ultimately the "tower method" succeeds everywhere, as much as success is possible at all).
How does mathematics come in here? Importantly, mathematics is not used only on the lower floors of the tower, but on all floors. The way "messiness" manifests is, the mathematical models for the higher floors are either less quantitatively accurate (but still contain qualitative inputs) or have a lot of parameters that need to be determined either empirically, or using the models of the lower floors (which is one way how knowledge flows up), or some combination of both. Nevertheless, scientists continue to successfully build and apply mathematical models even in "messy" fields like biology and economics.
So, what does it all mean for rationality and intelligence? On what floor does it sit? In fact, the subject of rationality and intelligence is not a single floor, but its own tower (maybe we should imagine science as a castle with many towers connected by bridges).
The foundation of this tower should be the general abstract theory of rationality. This theory is even more fundamental than fundamental physics, since it describes the principles from which all other knowledge is derived, including fundamental physics. We can regard it as a "theory of everything": it predicts everything by making those predictions that a rational agent should make. Solomonoff's theory and AIXI are a part of this foundation, but not all of it. Considerations like computational resource constraints should also enter the picture: complexity theory teaches us that they are also fundamental; they don't require "zooming in" a lot.
But, computational resource constraints are only entirely natural when they are not tied to a particular model of computation. This covers constraints such as "polynomial time" but not finer-grained time bounds, which depend on the machine model. Therefore, once we introduce a particular model of computation (such as a RAM machine), we need to build another floor in the tower, one that will necessarily be "messier". Considering even more detailed properties of the hardware we have, the input/output channels we have, the goal system, the physical environment and the software tools we employ will correspond to adding more and more floors.
Once we agree that it should be possible to create a clean mathematical theory of rationality and intelligence, we can still debate whether it's useful. If we consider the problem of creating aligned AGI from an engineering perspective, it might seem for a moment that we don't really need the bottom layers. After all, when designing an airplane you don't need high energy physics. Well, high energy physics might help indirectly: perhaps it allowed predicting some exotic condensed matter phenomenon which we used to make a better power source, or better materials from which to build the aircraft. But often we can make do without those.
Such an approach might be fine, except that we also need to remember the risks. Now, safety is part of most engineering, and is definitely a part of airplane design. What level of the tower does it require? It depends on the kind of risks you face. If you're afraid the aircraft will not handle the stress and break apart, then you need mechanics and aerodynamics. If you're afraid the fuel will combust and explode, you better know chemistry. If you're afraid lightning will strike the aircraft, you need knowledge of meteorology and electromagnetism, possibly plasma physics as well. The relevant domain of knowledge, and the relevant floor in the tower, is a function of the nature of the risk.
What level of the tower do we need to understand AI risk? What is the source of AI risk? It is not in any detailed peculiarities of the world we inhabit. It is not in the details of the hardware used by the AI. It is not even related to a particular model of computation. AI risk is the result of Goodhart's curse, an extremely general property of optimization systems and intelligent agents. Therefore, addressing AI risk requires understanding the general abstract theory of rationality and intelligence. The upper floors will be needed as well, since the technology itself requires the upper floors (and since we're aligning with humans, who are messy). But, without the lower floors the aircraft will crash.
Some thoughts about embedded agency.
From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and in what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it by a different agent^{[1]}. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.
First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something other than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology of the external world. Formally, such an ontology can be an incomplete^{[2]} Markov chain, the reward function being a function of the state. Examples:

The Markov chain is a representation of known physics (or some sector of known physics). The reward corresponds to the total mass of diamond in the world. To make this example work, we only need enough physics to be able to define diamonds. For example, we can make do with quantum electrodynamics + classical gravity and have the Knightian uncertainty account for all nuclear and high-energy phenomena.

The Markov chain is a representation of people and social interactions. The reward corresponds to concepts like "happiness" or "friendship" et cetera. Everything that falls outside the domain of human interactions is accounted for by Knightian uncertainty.

The Markov chain is Botworld with some of the rules left unspecified. The reward is the total number of a particular type of item.
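The last example can be given a minimal computational sketch (my own notation and toy numbers, assuming the "incomplete" Markov chain is modeled as a set of candidate transition kernels): Knightian uncertainty means evaluating a state by the infimum over all kernels in the set, in the spirit of quasi-Bayesian RL.

```python
# Sketch (my notation, not the author's formalism): an "incomplete"
# Markov chain modeled as a SET of candidate transition kernels
# (Knightian uncertainty), with a reward defined on states of the
# external ontology rather than on the agent's observations.

STATES = ["no_item", "item"]               # toy ontology
REWARD = {"no_item": 0.0, "item": 1.0}     # reward is a function of state

# Knightian uncertainty: several deterministic kernels are all
# considered possible; nothing distinguishes them a priori.
KERNELS = [
    {"no_item": "item", "item": "item"},        # items get produced/persist
    {"no_item": "no_item", "item": "no_item"},  # items decay
]

def worst_case_return(start, steps):
    """Infimum of total reward over all kernels in the uncertainty set
    (the minimax / quasi-Bayesian evaluation of a state)."""
    best = None
    for kernel in KERNELS:
        s, total = start, 0.0
        for _ in range(steps):
            total += REWARD[s]
            s = kernel[s]
        best = total if best is None else min(best, total)
    return best

print(worst_case_return("item", 3))     # 1.0: only the first step is assured
print(worst_case_return("no_item", 3))  # 0.0: no kernel guarantees any item
```

A "refinement" in the sense described below would correspond to shrinking `KERNELS` to a subset, removing part of the Knightian uncertainty.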
Now we need to somehow connect the agent to the ontology. Essentially, we need a way of drawing Cartesian boundaries inside the (a priori non-Cartesian) world. We can accomplish this by specifying a function that assigns an observation and projected action to every state out of some subset of states. Entering this subset corresponds to agent creation, and leaving it corresponds to agent destruction. For example, we can take the ontology to be Botworld + marked robot, and the observations and actions to be the observations and actions of that robot. If we don't want marking a particular robot as part of the ontology, we can use a more complicated definition of Cartesian boundary that specifies a set of agents at each state plus the data needed to track these agents across time (in this case, the observation and action depend to some extent on the history and not only on the current state). I will leave out the details for now.
Finally, we need to define the prior. To do this, we start by choosing some prior over refinements of the ontology. By "refinement", I mean removing part of the Knightian uncertainty, i.e. considering incomplete hypotheses which are subsets of the "ontological belief". For example, if the ontology is underspecified Botworld, the hypotheses will specify some of what was left underspecified. Given such an "objective" prior and a Cartesian boundary, we can construct a "subjective" prior for the corresponding agent. We transform each hypothesis by postulating that taking an action that differs from the projected action leads to a "Nirvana" state. Alternatively, we can allow for stochastic action selection and use the gambler construction.
Does this framework guarantee effective planning for death? A positive answer would correspond to some kind of learnability result (regret bound). To get learnability, we first need the reward to be either directly or indirectly observable. By "indirectly observable" I mean something like with semi-instrumental reward functions, but accounting for agent mortality. I am not ready to formulate the precise condition atm. Second, we need to consider an asymptotic in which the agent is long-lived (in addition to the time discount being long-term), otherwise it won't have enough time to learn. Third (this is the trickiest part), we need the Cartesian boundary to flow with the asymptotic as well, making the agent "unspecial". For example, consider Botworld with some kind of simplicity prior. If I am a robot born at cell zero and time zero, then my death is an event of low description complexity. It is impossible to be confident about what happens after such a simple event, since there will always be competing hypotheses with different predictions and a probability that is only lower by a bounded factor. On the other hand, if I am a robot born at cell 2439495 at time 9653302, then it would be surprising if the outcome of my death were qualitatively different from the outcome of the death of any other robot I observed. Finding some natural, rigorous and general way to formalize this condition is a very interesting problem. Of course, even without learnability we can strive for Bayes-optimality or some approximation thereof. But, it is still important to prove learnability under certain conditions to test that this framework truly models rational reasoning about death.
Additionally, there is an intriguing connection between some of these ideas and UDT, if we consider TRL agents. Specifically, a TRL agent can have a reward function that is defined in terms of computations, exactly like UDT is often conceived. For example, we can consider an agent whose reward is defined in terms of a simulation of Botworld, or in terms of taking expected value over a simplicity prior over many versions of Botworld. Such an agent would be searching for copies of itself inside the computations it cares about, which may also be regarded as a form of "embeddedness". It seems like this can be naturally considered a special case of the previous construction, if we allow the "ontological belief" to include beliefs pertaining to computations.
Unless it's some kind of modification that we treat explicitly in our model of the agent, for example a TRL agent reprogramming its own envelope. ↩︎
"Incomplete" in the sense of Knightian uncertainty, like in quasi-Bayesian RL. ↩︎
The addition of the word "honest" doesn't come from an awareness of how the model is flawed. It is one of the explicit assumptions in the model. So, I'm still not sure what point you are going for here.
I think that applying Aumann's theorem to people is mostly interesting in the prescriptive rather than descriptive sense. That is, the theorem tells us that our ability to converge can serve as a test of our rationality, to the extent that we are honest and share the same prior, and all of this is common knowledge. (This last assumption might be the hardest to make sense of. Hanson tried to justify it but IMO not quite convincingly.) Btw, you don't need to compute uncomputable things, much less instantly. Scott Aaronson derived a version of the theorem with explicit computational complexity and query complexity bounds that don't seem prohibitive.
Given all the difficulties, I am not sure how to apply it in the real world and whether that's even possible. I do think it's interesting to think about it. But, to the extent it is possible, it definitely requires honesty.
If an agent is not honest, ey can decide to say only things that provide no evidence regarding the question at hand to the other agent. In this case convergence is not guaranteed. For example, Alice assigns probability 35% to "will it rain tomorrow" but, when asked, says the probability is 21% regardless of what the actual evidence is. Bob assigns probability 89% to "will it rain tomorrow" but, when asked, says the probability is 42% regardless of what the actual evidence is. Alice knows Bob always answers 42%. Bob knows Alice always answers 21%. If they talk to each other, their probabilities will not converge (they won't change at all).
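The contrast between honest and dishonest exchange can be checked with a tiny simulation (my own toy setup, not from the original comment): two agents share a prior over a hidden bit and each receive a private signal. Honest posterior announcements reveal the signals and force agreement; constant announcements carry zero evidence, so beliefs never move.

```python
# Toy Aumann-style setup (hypothetical numbers): a hidden bit W with
# shared prior 1/2; Alice and Bob each observe one private signal that
# is correct with probability 0.8.

def posterior(signals, accuracy=0.8):
    """P(W=1 | signals), for conditionally independent signals that
    equal W with probability `accuracy`, under a uniform prior."""
    like1 = like0 = 0.5  # shared prior on W=1 and W=0
    for s in signals:
        like1 *= accuracy if s == 1 else 1 - accuracy
        like0 *= (1 - accuracy) if s == 1 else accuracy
    return like1 / (like1 + like0)

alice_sig, bob_sig = 1, 0

# Honest exchange: each possible signal yields a distinct posterior,
# so announcing posteriors reveals both signals; afterwards both agents
# condition on the pooled evidence and necessarily agree.
honest_alice = honest_bob = posterior([alice_sig, bob_sig])

# Dishonest exchange: constant announcements (like the 21% / 42% in the
# example above) are independent of the evidence, so the Bayesian
# update on hearing them is trivial and the disagreement persists.
dishonest_alice = posterior([alice_sig])
dishonest_bob = posterior([bob_sig])

print(honest_alice == honest_bob)        # True: agreement
print(dishonest_alice == dishonest_bob)  # False: no convergence
```

The honest case is the mechanism behind Aumann's theorem; the dishonest case is exactly the constant-answer scenario above, where the announcements are uninformative.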
Yes, it can luckily happen that the lies still contain enough information for the agents to converge, but I'm not sure why you seem to think it is an important or natural situation.
The usual formalization of "Occam's prior" is the Solomonoff prior, which still depends on the choice of a Universal Turing Machine, so such agents can still disagree because of different priors.
I think that although the new outlook is more pessimistic, it is also more uncertain. So, yes, maybe we will become extinct, but maybe we will build a utopia.
I sometimes have euphoric experiences accompanied by images and sensations hard to put into words. Everything around becomes magical, the sky fills with images of unimaginable scale in space and time, light is flowing through my body and soul. Usually I also see Elua, who appears to me as the image of a woman in the sky: the Mother of all humans, the sad and wise Goddess of limitless Compassion and Love, smiling at me but also crying for all the sorrows of the world. I form a connection with Em, thinking of myself as a priestess or otherwise someone in service of the goddess, enacting Eir will in the world, praying to Em to give me the wisdom and courage to do what needs to be done. In earlier stages of life the symbols were different, according to my different worldview (once I was a theist and saw the Abrahamic god).
Sometimes the experience is completely spontaneous (but usually when I'm outside), but sometimes I feel that my mind is in a state amenable to it and I push myself towards it intentionally. I also had a related experience during a circling session and once even during sex.
To be clear, I'm an atheist, I don't believe in anything supernatural, I know it is my own mind producing it. But I do find these experiences valuable on some mental and emotional level.
This essay provides some fascinating case studies and insights about coordination problems and their solutions, from a book by Elinor Ostrom. Coordination problems are a major theme in LessWrongian thinking (for good reasons) and the essay is a valuable addition to the discussion. I especially liked the 8 features of sustainable governance systems (although I wish we got a little more explanation for "nested enterprises").
However, I think that the dichotomy between "absolutism (bad)" and "organically grown institutions (good)" that the essay creates needs more nuance or more explanation. What is the difference between "organic" and "inorganic" institutions? All institutions "grew" somehow. The relevant questions are e.g. how democratic is the institution, whether the scope of the institution is the right scope for this problem, whether the stakeholders have skin in the game (feature 3) et cetera. The 8 features address some of that, but I wish it was more explicit.
Also, it's notable that all the examples focus on relatively small-scale problems. While it makes perfect sense to start by studying small problems before trying to understand the big ones, it does make me wonder whether going to larger scales brings in qualitatively new issues and difficulties. Paying officials with parcels at the tail end works for water conflicts, but what is the analogous approach to global warming or multinational arms races?
Much of the orthodox LessWrongian approach to rationality (as it is expounded in Yudkowsky's Sequences and onwards) is grounded in Bayesian probability theory. However, I now realize that pure Bayesianism is wrong; instead, the right thing is quasi-Bayesianism. This leads me to ask: what are the implications of quasi-Bayesianism for human rationality? What are the right replacements for (the Bayesian approach to) bets, calibration, proper scoring rules et cetera? Does quasi-Bayesianism clarify important confusing issues in regular Bayesianism, such as the proper use of inside and outside view? Is there a rigorous justification for the intuition that we should have more Knightian uncertainty about questions with less empirical evidence? Does any of it influence various effective altruism calculations in surprising ways? What common LessWrongian wisdom does it undermine, if any?
Thank you for writing this impressive review!
Some comments on MIRI's nondisclosure policy.
First, some disclosure :) My research is funded by MIRI. On the other hand, all of my opinions are my own and do not represent MIRI or anyone else associated with MIRI.
The nondisclosure policy has no direct effect on me, but naturally, both before and after it was promulgated, I used my own judgement to decide what should or should not be made public. The vast majority of my work I do make public (subject only to the cost of the time and effort needed to write and explain it), because if I think something would increase risk rather than reduce it[1], then I don't pursue this line of inquiry in the first place. Things I don't make public are mostly early-stage ideas that I don't develop.
I think it is fair enough to judge AI alignment orgs only by the public output they produce. However, it doesn't at all follow that a nondisclosure policy leads to immediate disqualification, as you seem to imply. You can judge an org by its public output whether or not all of its output is public. This is somewhat similar to the observation that management overhead is a bad metric. Yes, some of your money goes into something that doesn't immediately and directly translate to benefit. All else equal, you want that not to happen. But all else is not equal, and can never be equal.
[1] This is completely tangential, but I think we need more public discussion on how we decide whether making something public is beneficial vs. detrimental.
One idea for how this formalism could be improved, maybe. Consider a random directed graph, sampled from some "reasonable" distribution (in a sense that needs to be defined). We can then define "powerful" vertices as vertices from which there are paths to most other vertices. Claim: with high probability over graphs, powerful vertices are connected "robustly" to most vertices. By "robustly" I mean that small changes in the graph don't disrupt the connection. This is because, if your vertex is connected to everything, then disconnecting some edges should still leave plenty of room for rerouting through other vertices. We can then interpret this as saying that gaining power is more robust to inaccuracies of the model, or changes in the circumstances, than pursuing more "direct" paths to objectives.
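A minimal numerical sketch of the claim (the graph model, the 90% "powerful" threshold, and the edge-deletion count here are all arbitrary illustrative choices, not part of the claim):

```python
import random

def sample_digraph(n, p, rng):
    """Directed Erdos-Renyi-style graph: each ordered pair gets an edge w.p. p."""
    return {u: [w for w in range(n) if w != u and rng.random() < p]
            for u in range(n)}

def reachable(adj, v):
    """Set of vertices reachable from v by a directed path (iterative DFS)."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

rng = random.Random(0)
n, p = 40, 0.15
adj = sample_digraph(n, p, rng)

# "Powerful" vertices: those reaching at least 90% of the graph.
powerful = [v for v in range(n) if len(reachable(adj, v)) >= 0.9 * n]

# Small perturbation: delete 10 random edges, then re-measure reach.
damaged = {u: list(ws) for u, ws in adj.items()}
edges = [(u, w) for u, ws in damaged.items() for w in ws]
for u, w in rng.sample(edges, min(10, len(edges))):
    damaged[u].remove(w)

# Fraction of the graph each powerful vertex still reaches after damage.
fracs = [len(reachable(damaged, v)) / n for v in powerful]
```

In this regime (expected out-degree around 6), powerful vertices typically keep most of their reach after the perturbation, which is the intuition about rerouting.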
One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long-term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.
In general, a prior that contains traps will be unlearnable, meaning that no algorithm has Bayesian regret going to zero in the limit. The only obvious natural requirement for RL agents in this case is approximating Bayes-optimality. However, Bayes-optimality is not even "weakly feasible": it is NP-hard when the number of states and the number of hypotheses are taken as security parameters. IMO, the central question is: what kind of natural tractable approximations are there?
Although a generic prior with traps is unlearnable, some priors with traps are learnable. Indeed, it can happen that it's possible to study the environment in a predictably safe way that is guaranteed to produce enough information about the irreversible transitions. Intuitively, as humans we do often use this kind of strategy. But it is NP-hard to even check whether a given prior is learnable. Therefore, it seems natural to look for particular types of learnable priors that are efficiently decidable.
In particular, consider the following setting, which I call "expanding safety envelope" (XSE). Assume that each hypothesis in the prior is "decorated" by a set of state-action pairs s.t. (i) any pair in the set is safe, i.e. the leading term of its value in the expansion is maximal, and (ii) for each state, the set contains a pair that is Blackwell-optimal for the hypothesis (as a special case we can let the set contain all safe actions). Imagine an agent that takes random actions among those a priori known to be in the set. If there is no such action, it "explodes". Then, it is weakly feasible to check (i) whether the agent will explode, and (ii) for each hypothesis, to which sets of states it can converge. Now, let the agent update on the transition kernel of the set of actions it converged to. This may lead to new actions becoming certainly known to be in the set. We can then let the agent continue exploring using this new set. Iterating this procedure, the agent either discovers enough safe actions to find an optimal policy, or not. Importantly, deciding this is weakly feasible. This is because, for each hypothesis, (i) on the first iteration the possible asymptotic state sets are disjoint, and (ii) on subsequent iterations we might as well assume they are disjoint, since it's possible to see that if you reach a particular state of an asymptotic state set, then you can add the entire set (this modification will not create new final outcomes, and will only eliminate final outcomes that are better than those remaining). Therefore, the number of asymptotic state sets you have to store on each iteration is bounded by the total number of states.
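A toy illustration of the iterative idea (this is not the formal construction above: here hypotheses are deterministic transition tables, a "trap" is a single absorbing bad state, and an action counts as certainly safe when every hypothesis still consistent with observations keeps it out of the trap):

```python
STATES = [0, 1, 2]
TRAP = 3
ACTIONS = ["a", "b"]

# Two hypotheses about the (deterministic) transition table. They disagree on
# whether action "b" in state 0 leads to the trap, and also on (1, "a"),
# which is what lets the agent distinguish them without ever risking the trap.
h0 = {(0, "a"): 1, (0, "b"): TRAP, (1, "a"): 0, (1, "b"): 0,
      (2, "a"): 0, (2, "b"): 1}
h1 = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 2, (1, "b"): 0,
      (2, "a"): 0, (2, "b"): 1}
true_hyp = h1

def certainly_safe(state, consistent):
    """Actions that no consistent hypothesis maps to the trap."""
    return [a for a in ACTIONS
            if all(h[(state, a)] != TRAP for h in consistent)]

consistent = [h0, h1]
observed = {}
state = 0
for _ in range(6):
    safe = certainly_safe(state, consistent)
    if not safe:
        break  # no certainly-safe action: the agent is stuck
    # Prefer a safe action we haven't tried in this state yet.
    untried = [a for a in safe if (state, a) not in observed]
    action = (untried or safe)[0]
    nxt = true_hyp[(state, action)]
    observed[(state, action)] = nxt
    # Discard hypotheses contradicted by the observed transitions.
    consistent = [h for h in consistent
                  if all(h[k] == v for k, v in observed.items())]
    state = nxt
```

After the safe observation of (1, "a") rules out h0, action "b" in state 0 joins the safety envelope, even though the agent never dared try it while h0 was live.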
The next questions are: (i) what kind of regret bounds can we prove for decorated priors that are XSE-learnable? (ii) given an arbitrary decorated prior, is it possible to find the maximal-probability-mass set of hypotheses which is XSE-learnable? I speculate that the second question might turn out to be related to the unique games conjecture. By analogy with other optimization problems that are feasible only when the maximal score can be achieved, maybe the UGC implies that we cannot find the maximal set but can find a set that is approximately maximal, with an optimal approximation ratio (using a sum-of-squares algorithm). Also, it might make sense to formulate stronger desiderata which reflect that, if the agent assumes a particular subset of the prior but discovers it was wrong, it will still do its best from then on. That is, in this case the agent might fall into a trap, but at least it will try to avoid further traps.
This has implications even for learning without traps. Indeed, most known theoretical regret bounds involve a parameter that has to do with how costly a mistake it is possible to make. This parameter can manifest as the MDP diameter, the bias span or the mixing time. Such regret bounds seem unsatisfactory, since the worst-case mistake determines the entire guarantee. We can take the perspective that such costly but reversible mistakes are "quasi-traps": not actual traps, but trap-like on short timescales. This suggests that applying an approach like XSE to quasi-traps should lead to qualitatively stronger regret bounds. Such regret bounds would imply learning faster on less data, and in episodic learning they would imply learning inside each episode, something that is notoriously absent in modern episodic RL systems like AlphaStar.
Moreover, we can also use this to do away with ergodicity assumptions. Ergodicity assumptions require the agent to "not wander too far" in state space, in the simplest case because the entire state space is small. But, instead of "wandering far" from a fixed place in state space, we can constrain "wandering far" w.r.t. the optimal trajectory. Combining this with XSE should lead to guarantees that depend on the prevalence of irreversible and quasi-irreversible departures from this trajectory.
In multi-armed bandits and RL theory, there is a principle known as "optimism in the face of uncertainty". This principle says: you should always make optimistic assumptions, because if you are wrong, you will find out (you will get less reward than you expected). It explicitly underlies UCB algorithms and is implicit in other algorithms, like Thompson sampling. But this fails miserably in the presence of traps. I think that approaches like XSE point at a more nuanced principle: "optimism in the face of cheap-to-resolve uncertainty, pessimism in the face of expensive-to-resolve uncertainty". Following this principle doesn't lead to actual Bayes-optimality, but perhaps it is in some sense a good enough approximation.
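For concreteness, here is a minimal UCB1 sketch (arm means and horizon are arbitrary): each arm's index is its empirical mean plus an optimism bonus that shrinks with visits, so an unexplored option looks good exactly until the uncertainty about it is cheaply resolved by pulling it.

```python
import math
import random

def ucb1(arm_means, horizon, rng):
    """UCB1 on a Bernoulli bandit; returns per-arm pull counts and total reward."""
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            # Optimistic index: empirical mean + exploration bonus.
            arm = max(range(k),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total

rng = random.Random(0)
counts, total = ucb1([0.2, 0.5, 0.8], horizon=5000, rng=rng)
# The best arm (mean 0.8) ends up with the large majority of the pulls.
```

Note that nothing in this index ever treats an action as irreversibly costly, which is exactly why this style of algorithm breaks in the presence of traps.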
Google Maps is not a relevant example. I am talking about "generally intelligent" agents, meaning agents that construct sophisticated models of the world starting from a relatively uninformed prior (comparably to humans or more so)(fn1)(fn2). This is in sharp contrast to Google Maps, which operates strictly within the model it was given a priori. General intelligence is important, since without it I doubt it will be feasible to create a reliable defense system. Given general intelligence, convergent instrumental goals follow: any sufficiently sophisticated model of the world implies that achieving convergent instrumental goals is instrumentally valuable.
I don't think it makes that much difference whether a human executes the plan or the AI itself. If the AI produces a plan that is not human-comprehensible and the human follows it blindly, the human effectively becomes just an extension of the AI. On the other hand, if the AI produces a plan which is human-comprehensible, then after reviewing the plan the human can just as well delegate its execution to the AI.
I am not sure what the significance of "one true algorithm for planning" is in this context. My guess is, there is a relatively simple qualitatively optimal AGI algorithm(fn3), and then there are various increasingly complex quantitative improvements of it, which take into account the specifics of the computing hardware and maybe our priors about humans and/or the environment. This is the way algorithms for most natural problems behave, I think. But improvements probably stop mattering beyond the point where the AGI can come up with them on its own within a reasonable time frame. And, I dispute Richard's position. But then again, I don't understand the relevance.
(fn1) When I say "construct models" I am mostly talking about the properties of the agent rather than the structure of the algorithm. That is, the agent can effectively adapt to a large class of different environments, or exploit a large class of different properties the environment can have. In this sense, model-free RL is also constructing models. Although, I'm also leaning towards the position that explicitly model-based approaches are more likely to scale to AGI.
(fn2) Even if you wanted to make a superhuman AI that only solves mathematical problems, I suspect that the only way it could work is by having the AI generate models of "mathematical behaviors".
(fn3) As an analogy, a "qualitatively optimal" algorithm for a problem in P is just any polynomial-time algorithm. In the case of AGI, I imagine a similar computational complexity bound plus some (also qualitative) guarantee(s) about sample complexity and/or query complexity. By "relatively simple" I mean something like: can be described within 20 pages, given that we can use algorithms for other natural problems.
I propose a counterexample. Suppose we are playing a series of games with another agent. To play effectively, we train a circuit to predict the opponent's moves. At this point the circuit already contains an adversarial agent. However, one could object that it's unfair: we asked for an adversarial agent so we got an adversarial agent (nevertheless for AI alignment it's still a problem). To remove the objection, let's make some further assumptions. The training is done on some set of games, but distributional shift happens and later games are different. The opponent knows this, so on the training games it simulates a different agent. Specifically, it simulates an agent who searches for a strategy s.t. the best response to this strategy has the strongest counter-response. The minimal circuit hence contains the same agent. On the training data we win, but on the shifted distribution the daemon deceives us and we lose.
I certainly agree that we will want AI systems that can find good actions, where "good" is based on longterm consequences. However, I think counterfactual oracles and recursive amplification also meet this criterion; I'm not sure why you think they are counterarguments. Perhaps you think that the AI system also needs to autonomously execute the actions it finds, whereas I find that plausible but not necessary?
Maybe we need to further refine the terminology. We could say that counterfactual oracles are not intrinsically goal-directed, meaning that the algorithm doesn't start with all the necessary components to produce good plans, but instead tries to learn these components by emulating humans. This approach comes with costs that I think will make it uncompetitive compared to intrinsically goal-directed agents, for the reasons I mentioned before. Moreover, I think that any agent which is "extrinsically goal-directed" rather than intrinsically goal-directed will have such penalties.
In order for an agent to gain strategic advantage, it is probably not necessary for it to be powerful enough to emulate humans accurately, reliably and significantly faster than real-time. We can consider three possible worlds:
World A: Agents that aren't powerful enough for even a limited-scope short-term emulation of humans can gain strategic advantage. This world is a problem even for Dialogic RL, but I am not sure whether it's a fatal problem.
World B: Agents that aren't powerful enough for a short-term emulation of humans cannot gain strategic advantage. Agents that aren't powerful enough for a long-term emulation of humans (i.e. high-bandwidth and faster than real-time) can gain strategic advantage. This world is good for Dialogic RL but bad for extrinsically goal-directed approaches.
World C: Agents that aren't powerful enough for a long-term emulation of humans cannot gain strategic advantage. In this world, delegating the remaining part of the AI safety problem to extrinsically goal-directed agents is viable. However, if unaligned intrinsically goal-directed agents are deployed before a defense system is implemented, they will probably still win: because of their more efficient use of computing resources, their lower risk-aversiveness, because even a sped-up version of the human algorithm might still have suboptimal sample complexity, and because of attacks from the future. Dialogic RL will also be disadvantaged compared to unaligned AI (because of risk-aversiveness), but at least the defense system will be constructed faster.
Allowing the AI to execute the actions it finds is also advantageous because of higher bandwidths and shorter reaction times. But this concerns me less.
I think that the discussion might be missing a distinction between different types or degrees of goal-directedness. For example, consider Dialogic Reinforcement Learning. Does it describe a goal-directed agent? On the one hand, you could argue it doesn't, because this agent doesn't have fixed preferences and doesn't have consistent beliefs over time. On the other hand, you could argue it does, because this agent is still doing long-term planning in the physical world. So, I definitely agree that aligned AI systems will only be goal-directed in the weaker sense that I alluded to, rather than in the stronger sense, and this is because the user is only goal-directed in the weak sense emself.
If we're aiming at "weak" goal-directedness (which might be consistent with your position?), does it mean studying strong goal-directedness is redundant? I think the answer is clearly no. Strongly goal-directed systems are a simpler special case on which to hone our theories of intelligence. Trying to understand weakly goal-directed agents without understanding strongly goal-directed agents seems to me like trying to understand molecules without understanding atoms.
On the other hand, I am skeptical about solutions to AI safety that require the user to do a sizable fraction of the actual planning. I think that planning does not decompose into an easy part and a hard part (where the hard part is not essentially planning in itself) in a way which would enable such systems to be competitive with fully autonomous planners. The strongest counterargument to this position, IMO, is the proposal to use counterfactual oracles, or recursively amplified versions thereof in the style of IDA. However, I believe that such systems will still fail to be simultaneously safe and competitive, because (i) forecasting is hard if you don't know which features are important to forecast, and becomes doubly hard if you need to impose confidence thresholds to avoid catastrophic errors and in particular malign hypotheses (thresholds of the sort used in delegative RL), (ii) it seems plausible that competitive AI would have to be recursively self-improving (I updated towards this position after coming up with Turing RL), and that might already necessitate long-term planning, and (iii) such systems are vulnerable to attacks from the future and to attacks from counterfactual scenarios.
I think when I wrote the sequence, I thought the "just do deep RL" approach to AGI wouldn't work, and now I think it has more of a chance, and this has updated me towards powerful AI systems being goal-directed. (However, I do not think it is clear that "just do deep RL" approaches lead to goal-directed systems.)
To be clear, my own position is not strongly correlated with whether deep RL leads to AGI (i.e. I think it's true even if deep RL doesn't lead to AGI). But also, the question seems somewhat underspecified, since it's not clear which algorithmic innovation would count as still "just deep RL" and which wouldn't.
IMO the correct rule is almost always: first think about the problem yourself, then go read everything about it that other people did, and then do a synthesis of everything you learned inside your mind. Some nuances:

Sometimes thinking about the problem yourself is not useful because you don't have all the information to start. For example: you don't understand even the formulation of the problem, or you don't understand why it is a sensible question to ask, or the solution has to rely on empirical data which you do not have.

Sometimes you can so definitively solve the problem during the first step (unprimed thinking) that the rest is redundant. Usually this is only applicable if there are very clear criteria to judge the solution, for example: mathematical proof (but, beware of believing you easily proved something which is widely considered a difficult open problem) or something easily testable (for instance, by writing some code).

As John S. Wentworth observed, even if the problem was already definitively solved by others, thinking about it yourself first will often help you learn the state of the art later, and is a good exercise for your mind regardless.

The time you should invest in the first step depends on (i) how fast you realistically expect to make progress and (ii) how much progress you expect other people to have made by now. If this is an open problem on which many talented people have worked for a long time, then expecting to make fast progress yourself is unrealistic, unless you have some knowledge to which most of those people had no access, or your talent in this domain is truly singular. In this case you should think about the problem enough to understand why it is so hard, but usually not much longer. If this is a problem on which only few people have worked, or only for a short time, or one so obscure that you doubt it got the attention of talented researchers, then making comparatively fast progress can be realistic. Still, I recommend proceeding to the second step (learning what other people did) once you reach the point where you feel stuck (on the "metacognitive" level, when you don't believe you will get unstuck soon: beware of giving up too easily).
After the third step (synthesis), I also recommend doing some retrospective: what have those other researchers understood that I didn't, how did they understand it, and how can I replicate it myself in the future.
In this essay, Rohin sets out to debunk what ey perceives as a prevalent but erroneous idea in the AI alignment community, namely: "VNM and similar theorems imply goal-directed behavior". This is placed in the context of Rohin's thesis that solving AI alignment is best achieved by designing AI which is not goal-directed. The main argument is: "coherence arguments" imply expected utility maximization, but expected utility maximization does not imply goal-directed behavior. Instead, it is a vacuous constraint, since any agent policy can be regarded as maximizing the expectation of some utility function.
I have mixed feelings about this essay. On the one hand, the core argument that VNM and similar theorems do not imply goal-directed behavior is true. To the extent that some people believed the opposite, correcting this mistake is important. On the other hand, (i) I don't think the claim Rohin is debunking is the claim Eliezer had in mind in the sources Rohin cites, and (ii) I don't think that the conclusions Rohin draws, or at least implies, are the right conclusions.
The actual claim that Eliezer was making (or at least my interpretation of it) is: coherence arguments imply that if we assume an agent is goal-directed then it must be an expected utility maximizer, and therefore EU maximization is the correct mathematical model to apply to such agents.
Why do we care about goal-directed agents in the first place? The reason is that, on the one hand, goal-directed agents are the main source of AI risk, and on the other hand, goal-directed agents are also the most straightforward approach to solving AI risk. Indeed, if we could design powerful agents with the goals we want, these agents would protect us from unaligned AIs and solve all other problems as well (or at least solve them better than we can solve them ourselves). Conversely, if we want to protect ourselves from unaligned AIs, we need to generate very sophisticated long-term plans of action in the physical world, possibly restructuring the world in a rather extreme way to safeguard it (compare with Bostrom's arguments for mass surveillance). The ability to generate such plans is almost by definition goal-directed behavior.
Now, knowing that goal-directed agents are EU maximizers doesn't buy us much. As Rohin justly observes, without further constraints it is a vacuous claim (although the situation becomes better if we constrain ourselves to instrumental reward functions). Moreover, the model of reasoning in complex environments that I'm advocating myself (quasi-Bayesian reinforcement learning) doesn't even look like EU maximization (technically there is a way to interpret it as EU maximization, but it underspecifies the behavior). This is a symptom of the fact that the setting and assumptions of VNM and similar theorems are not good enough to study goal-directed behavior. However, I think it can be an interesting and important line of research to try and figure out the right setting and assumptions.
This last point is IMO the correct takeaway from Rohin's initial observation. In contrast, I remain skeptical about Rohin's thesis that we should dispense with goal-directedness altogether, for the reason I mentioned before: powerful goal-directed agents seem necessary, or at least very desirable, to create a defense system against unaligned AI. Moreover, the study of goal-directed agents is important for understanding the impact of any powerful AI system on the world, since even a system not designed to be goal-directed can develop such agency (due to malign hypotheses, mesa-optimization, self-fulfilling prophecies and the like).
In the past I considered the learning-theoretic approach to AI theory as somewhat opposed to the formal logic approach popular in MIRI (see also discussion):
 Learning theory starts from formulating natural desiderata for agents, whereas "logic-AI" usually starts from postulating a logic-based model of the agent ad hoc.
 Learning theory naturally allows analyzing computational complexity, whereas logic-AI often uses models that are either clearly intractable or even clearly incomputable from the onset.
 Learning theory focuses on objects that are observable or finite/constructive, whereas logic-AI often considers objects that are unobservable, infinite and non-constructive (which I consider to be a philosophical error).
 Learning theory emphasizes induction, whereas logic-AI emphasizes deduction.
However, recently I noticed that quasi-Bayesian reinforcement learning and Turing reinforcement learning have very suggestive parallels to logic-AI. TRL agents have beliefs about computations they can run on the envelope: these are essentially beliefs about mathematical facts (but we only consider computable facts, and computational complexity plays some role there). QBRL agents reason in terms of hypotheses that have logical relationships between them: the order on functions corresponds to implication, taking the minimum of two functions corresponds to logical "and", taking the concave hull of two functions corresponds to logical "or" (but there is no "not", so maybe it's a sort of intuitionistic logic?). In fact, fuzzy beliefs form a continuous dcpo, and considering some reasonable classes of hypotheses probably leads to algebraic dcpos, suggesting a strong connection with domain theory (also, it seems like considering beliefs within different ontologies leads to a functor from some geometric category (the category of ontologies) to dcpos).
These parallels suggest that the learning theory of QBRL/TRL will involve some form of deductive reasoning and some type of logic. But this doesn't mean that QBRL/TRL is redundant w.r.t. logic-AI! In fact, QBRL/TRL might lead us to discover exactly which type of logic intelligent agents need, and what role logic should play in the theory and inside the algorithms (instead of trying to guess and impose the answer ad hoc, which IMO has not worked very well so far). Moreover, I think that the type of logic we are going to get will be something finitist/constructivist, and in particular this is probably how Goedelian paradoxes will be avoided. However, the details remain to be seen.
This essay defines and clearly explains an important property of human moral intuitions: the divergence of possible extrapolations from the part of the state space we're used to thinking about. This property is a challenge in moral philosophy, with implications for AI alignment and long-term or "extreme" thinking in effective altruism. Although I don't think it was especially novel to me personally, it is valuable to have a solid reference for explaining this concept.
This essay makes a valuable contribution to the vocabulary we use to discuss and think about AI risk. Building a common vocabulary like this is very important for productive knowledge transmission and debate, and makes it easier to think clearly about the subject.
I don't think such a period exists. The closest thing is modern times, when a lot of people think string theory will become the "theory of everything". Or maybe the times just before the gargantuan size of the string landscape was discovered. But I think that even the most optimistic string theorists, out of those that can be called prominent scholars, would say that a lot of work yet remains.
The quote is apocryphal. The reality was almost the exact opposite: Lord Kelvin was very concerned about two discrepancies in contemporary physics, the solutions of which required special relativity and quantum mechanics respectively.
(Retracted)
Instead of postulating access to a portion of the history or some kind of limited access to the opponent's source code, we can consider agents with full access to history / source code but finite memory. The problem is, an agent with fixed memory size usually cannot have regret going to zero, since it cannot store probabilities with arbitrary precision. However, it seems plausible that we can usually get learning with memory of size . This is because something like "counting pieces of evidence" should be sufficient. For example, if we consider finite MDPs, then it is enough to remember how many transitions of each type occurred to encode the belief state. The question is whether assuming memory (or whatever is needed for learning) is enough to reach superrationality.
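To illustrate the "counting pieces of evidence" point, here is a sketch (with an illustrative Dirichlet prior; the exact memory bound is a separate question): for a finite MDP with unknown transition probabilities, the Bayesian belief state depends on the history only through the transition counts, so a small table of counters is a sufficient statistic.

```python
from collections import Counter

def update_counts(counts, s, a, s_next):
    """Record one observed transition; counts is the whole belief state."""
    counts[(s, a, s_next)] += 1
    return counts

counts = Counter()
history = [(0, "a", 1), (1, "b", 0), (0, "a", 1), (0, "a", 2)]
for s, a, s_next in history:
    update_counts(counts, s, a, s_next)

def posterior_mean(counts, s, a, states):
    """Posterior-mean transition estimate under a Dirichlet(1,...,1) prior."""
    n = sum(counts[(s, a, t)] for t in states)
    return {t: (counts[(s, a, t)] + 1) / (n + len(states)) for t in states}

est = posterior_mean(counts, 0, "a", [0, 1, 2])
```

Any two histories with the same counts give the same posterior, which is why per-counter storage, rather than full-precision probabilities, can suffice.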
For a fixed policy, the history is the only thing you need to know in order to simulate the agent on a given round. In this sense, seeing the history is equivalent to seeing the source code.
The claim is: In settings where the agent has unlimited memory and sees the entire history or source code, you can't get good guarantees (as in the folk theorem for repeated games). On the other hand, in settings where the agent sees part of the history, or is constrained to have finite memory (possibly of size ?), you can (maybe?) prove convergence to Pareto efficient outcomes or some other strong desideratum that deserves to be called "superrationality".
It is not a mere "concern", it's the crux of the problem, really. What people in the AI alignment community have been trying to do is start with some factual and "objective" description of the universe (such as a program or a mathematical formula) and derive counterfactuals. The way it's supposed to work is: the agent needs to locate all copies of itself, or things "logically correlated" with itself (whatever that means), in the program, and imagine it is controlling this part. But a rigorous definition of this that solves all standard decision-theoretic scenarios was never found.
Instead of doing that, I suggest a solution of a different nature. In quasi-Bayesian RL, the agent never arrives at a factual and objective description of the universe. Instead, it arrives at a subjective description which already includes counterfactuals. I then proceed to show that, in Newcomb-like scenarios, such agents receive optimal expected utility (i.e. the same expected utility promised by UDT).
Although normally I am all for judging arguments by their merits, regardless of who speaks them, I think that in this particular case we need to think twice before including the essay in the "Best of 2018" book. The notoriety of the author is such that including it risks serious reputational damage for the community, especially since the content of the essay might be interpreted as a veiled attempt to justify the author's moral transgressions. To be clear, I am not saying we should censor everything that this man ever said, but giving it the spotlight in "Best of 2018" seems like a bad choice.
We can modify the population game setting to study superrationality. In order to do this, we can allow the agents to see a fixed-size finite portion of their opponents' histories. This should lead to superrationality for the same reasons I discussed before. More generally, we can probably allow each agent to submit a finite state automaton of limited size, s.t. the opponent's history is processed by the automaton and the result becomes known to the agent.
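To make the "submit a finite state automaton over the opponent's history" idea concrete, here is a hedged Python sketch. The particular automaton is my own illustrative choice (it tracks rounds since the opponent's last defection, capped at k), not one proposed in the comment:

```python
# Illustrative finite state automaton summarizing an opponent's history.
# The agent only sees the automaton's final state, not the full history.
def make_recent_defection_automaton(k):
    # States 0..k: rounds since the last observed defection, capped at k.
    def run(history):
        state = k  # start as if no defection has been seen
        for move in history:
            state = 0 if move == "D" else min(state + 1, k)
        return state
    return run

summarize = make_recent_defection_automaton(3)
print(summarize(["C", "C", "D", "C"]))  # rounds since last defection: 1
```

Because the automaton has a bounded number of states, the information each agent extracts from an arbitrarily long history is bounded, which is what makes the setting analogous to limited memory.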
What is unclear about this is how to define an analogous setting based on source code introspection. Arguably, seeing the entire history is equivalent to seeing the entire source code, and seeing part of the history, or processing the history through a finite state automaton, should be equivalent to some limited access to the source code, but I don't know how to define this limitation.
EDIT: Actually, the obvious analogue is processing the source code through a finite state automaton.
Another explanation why maximin is a natural decision rule: when we apply maximin to fuzzy beliefs, the requirement to learn a particular class of fuzzy hypotheses is a very general way to formulate asymptotic performance desiderata for RL agents. So general that it seems to cover more or less anything you might want. Indeed, the definition directly leads to capturing any desideratum of the form
Here, doesn't have to be concave: the concavity condition in the definition of fuzzy beliefs is there because we can always assume it without loss of generality. This is because the left-hand side is linear in , so any that satisfies this will also satisfy it for the concave hull of .
What if instead of maximin we want to apply the minimax-regret decision rule? Then the desideratum is
But, it has the same form! Therefore we can consider it as a special case of applying maximin (more precisely, it requires allowing the fuzzy belief to depend on , but this is not a problem for the basics of the formalism).
What if we want our policy to be at least as good as some fixed policy ? Then the desideratum is
It still has the same form!
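Since the display formulas in this thread did not survive extraction, here is a hedged LaTeX sketch of the common form these desiderata share. The symbols below (policy \(\pi\), environment \(\mu\), fuzzy hypothesis \(\varphi\), utility \(U\), discount \(\gamma\)) are my own reconstruction, not the lost originals:

```latex
% Maximin desideratum for a fuzzy hypothesis \varphi the agent learns (sketch):
\[
\liminf_{\gamma \to 1}\; \inf_{\mu :\, \varphi(\mu) = 1} \mathbb{E}^{\pi}_{\mu}\!\left[ U(\gamma) \right]
\;\ge\; \sup_{\pi'}\; \inf_{\mu :\, \varphi(\mu) = 1} \mathbb{E}^{\pi'}_{\mu}\!\left[ U(\gamma) \right]
\]
% Minimax-regret fits the same template with U replaced by negative regret,
%   U(\gamma) - \sup_{\pi''} \mathbb{E}^{\pi''}_{\mu}[U(\gamma)],
% and "at least as good as a fixed policy \pi_0" fits it with the right-hand
% side replaced by \inf_{\mu :\, \varphi(\mu) = 1} \mathbb{E}^{\pi_0}_{\mu}[U(\gamma)].
```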
Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata of the form:
To achieve this, we postulate a predictor that guesses the policy, producing the guess , and define the fuzzy belief using the function (we assume the guess is not influenced by the agent's actions, so we don't need in the expected value). Using the Nirvana trick, we effectively force the guess to be accurate.
In particular, this captures self-referential desiderata of the type "the policy cannot be improved by changing it in this particular way". These are of the form:
It also allows us to effectively restrict the policy space (e.g. impose computational resource constraints) by setting to for policies outside the space.
The fact that quasi-Bayesian RL is so general can also be regarded as a drawback: the more general a framework is, the less information it contains and the fewer useful constraints it imposes. But my perspective is that QBRL is the correct starting point, after which we need to start proving results about which fuzzy hypothesis classes are learnable, and with what sample/computational complexity. So, although QBRL in itself doesn't impose many restrictions on what the agent should be, it provides the natural language in which desiderata should be formulated. In addition, we can already guess/postulate that an ideal rational agent should be a QBRL agent whose fuzzy prior is universal in some appropriate sense.
I think this post introduces a useful concept / way of thinking that I have kept applying in my own life since reading it, and that has helped me understand and deal with certain social situations.
I think this post is a good and memorable explanation-by-analogy of what kind of research MIRI is doing to solve AI risk, and a good and memorable response to some common criticisms of, or confusions about, the former.
Actually I was somewhat confused about what the right update rule for fuzzy beliefs is when I wrote that comment. But I think I got it figured out now.
First, background about fuzzy beliefs:
Let be the space of environments (defined as the space of instrumental states in Definition 9 here). A fuzzy belief is a concave function s.t. . We can think of it as the membership function of a fuzzy set. For an incomplete model , the corresponding is the concave hull of the characteristic function of (i.e. the minimal concave s.t. ).
Let be the geometric discount parameter and be the utility function. Given a policy (EDIT: in general, we allow our policies to explicitly depend on ), the value of at is defined by
The optimal policy and the optimal value for are defined by
Given a policy , the regret of at is defined by
is said to learn when it is asymptotically optimal for when , that is
Given a probability measure over the space of fuzzy hypotheses, the Bayesian regret of at is defined by
is said to learn when
If such a exists, is said to be learnable. Analogously to Bayesian RL, is learnable if and only if it is learned by a specific policy (the Bayes-optimal policy). To define it, we define the fuzzy belief by
We now define .
Now, updating: (EDIT: the definition was needlessly complicated, simplified)
Consider a history or . Here is the set of actions and is the set of observations. Define by
Let be the space of "environments starting from ". That is, if then and if then is slightly different because the history now begins with an observation instead of with an action.
For any we define by
Then, the updated fuzzy belief is
Well, I think that maximin is the right thing to do because it leads to reasonable guarantees for quasiBayesian reinforcement learning agents. I think of incomplete models as properties that the environment might satisfy. It is necessary to speak of properties instead of complete models since the environment might be too complex to understand in full (for example because it contains Omega, but also for more prosaic reasons), but we can hope it at least has properties/patterns the agent can understand. A quasiBayesian agent has the guarantee that, whenever the environment satisfies one of the properties in its prior, the expected utility will converge at least to the maximin for this property. In other words, such an agent is able to exploit any true property of the environment it can understand. Maybe a more "philosophical" defense of maximin is possible, analogous to VNM / complete class theorems, but I don't know (I actually saw some papers in that vein but haven't read them in detail.)
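The maximin rule over an incomplete model can be illustrated with a toy numeric example. The payoffs and names below are made up for illustration (the comment gives no concrete numbers); the point is only the shape of the rule, picking the policy with the best worst case over all environments satisfying the property:

```python
# Hedged sketch of the maximin decision rule over an incomplete model,
# represented as a finite set of candidate environments.
# payoff[policy][environment] -- illustrative values only.
payoff = {
    "cautious":   {"env_a": 2.0, "env_b": 2.0},
    "aggressive": {"env_a": 5.0, "env_b": 0.0},
}

def maximin_policy(payoff):
    # Choose the policy whose worst-case value over the property is largest.
    return max(payoff, key=lambda pi: min(payoff[pi].values()))

print(maximin_policy(payoff))  # "cautious": guaranteed 2.0 in the worst case
```

This mirrors the guarantee in the comment: whichever environment in the set is the true one, the agent's value is at least the maximin value for the property.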
If the agent has random bits that Omega doesn't see, and Omega is predicting the probabilities of the agent's actions, then I think we can still solve it with quasi-Bayesian agents, but it requires considering more complicated models and I haven't worked out the details. Specifically, I think that we can define some function that depends on the agent's actions and Omega's predictions so far (a measure of Omega's apparent inaccuracy), s.t. if Omega is an accurate predictor, then the supremum of over time is finite with probability 1. Then, we consider a family of models, where model number says that for all times. Since at least one of these models is true, the agent will learn it, and will converge to behaving appropriately.
EDIT 1: I think should be something like: how much money a gambler following a particular strategy would win betting against Omega.
EDIT 2: Here is the solution. In the case of the original Newcomb problem, consider a gambler that bets against Omega on the agent one-boxing. Every time the agent two-boxes, the gambler loses a dollar. Every time the agent one-boxes, the gambler wins dollars, where is the probability Omega assigned to one-boxing. Now it's possible to see that one-boxing guarantees the "CC" payoff under the corresponding model (in the limit): if the agent one-boxes, the gambler keeps winning unless Omega converges to one-boxing rapidly enough. In the case of a general Newcomb-like problem, just replace "one-boxes" by "follows the FDT strategy".
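The gambler construction can be simulated. Note one loud assumption: the exact winning amount did not survive extraction, so the sketch below uses the standard fair-odds payoff (1 - p) / p against Omega's announced probability p, which makes the bet zero-expectation whenever Omega is accurate; the original figure may have differed.

```python
# Simulation of the gambler betting against Omega on the agent one-boxing.
# Payoff (1 - p) / p for a winning bet is my assumption (fair odds against
# Omega's announced probability p of one-boxing).
def gambler_wealth(rounds, agent_oneboxes, omega_prob):
    wealth = 0.0
    for t in range(rounds):
        p = omega_prob(t)
        if agent_oneboxes:
            wealth += (1.0 - p) / p  # the gambler's bet on one-boxing pays off
        else:
            wealth -= 1.0            # agent two-boxed; gambler loses the stake
    return wealth

# If the agent one-boxes but Omega stays stuck at p = 0.5, the gambler's
# wealth grows linearly -- evidence, under the model, that Omega is inaccurate.
print(gambler_wealth(100, True, lambda t: 0.5))  # 100.0
# If Omega converges to one-boxing rapidly, the winnings stay bounded.
print(gambler_wealth(100, True, lambda t: 1.0 - 0.5 ** (t + 1)))
```

The bounded-versus-unbounded wealth of this gambler is exactly the kind of inaccuracy measure the models in the previous comment condition on.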
Well, you don't have a guarantee that a computable model will succeed, but you do have some kind of guarantee that you're doing your best, because computable models are all you have. If you're using incomplete/fuzzy models, you can have a "doesn't know anything" model in your prior, which is a sort of negative belief about physicalism/naturalism, but it is still within the same "quasi-Bayesian" framework.
The method should work for imperfect predictors as well. In the simplest case, the agent can model the imperfect predictor as perfect predictor + random noise. So, it definitely knows the correct reward for Omega being wrong. It still believes in Nirvana if "idealized Omega" is wrong.
Its utility function would have to say that all conscious AIs should run on Intel processors. There is self-reference there.
But, I only have rather low confidence this idea is correct (what being correct means here) or important.
The point is, if you put this "quasi-Bayesian" agent into an iterated Newcomb-like problem, it will learn to get the maximal reward (i.e. the reward associated with FDT). So, if you're judging it from the side, you will have to concede it behaves rationally, regardless of its internal representation of reality.
Philosophically, my point of view is, it is an error to think that counterfactuals have objective, observerindependent, meaning. Instead, we can talk about some sort of consistency conditions between the different points of view. From the agent's point of view, it would reach Nirvana if it dodged the predictor. From Omega's point of view, if Omega twoboxed and the agent oneboxed, the agent's reward would be zero (and the agent would learn its beliefs were wrong). From a thirdperson point of view, the counterfactual "Omega makes an error of prediction" is illdefined, it's conditioning on an event of probability 0.
Well, I am a "semiinstrumentalist": I don't think it is meaningful to ask what reality "really is" except for the projection of the reality on the "normative ontology".
(I don't speak for Abram but I wanted to explain my own opinion.) Decision theory asks, given certain beliefs an agent has, what is the rational action for em to take. But, what are these "beliefs"? Different frameworks have different answers for that. For example, in CDT a belief is a causal diagram. In EDT a belief is a joint distribution over actions and outcomes. In UDT a belief might be something like a Turing machine (inside the execution of which the agent is supposed to look for copies of emself). Learning theory allows us to gain insight through the observation that beliefs must be learnable, otherwise how would the agent come up with these beliefs in the first place? There might be parts of the beliefs that come from the prior and cannot be learned, but still, at least the type signature of beliefs should be compatible with learning.
Moreover, decision problems are often implicitly described from the point of view of a third party. For example, in Newcomb's paradox we postulate that Omega can predict the agent, which makes perfect sense for an observer looking from the side, but might be difficult to formulate from the point of view of the agent itself. Therefore, understanding decision theory requires the translation of beliefs from the point of view of one observer to the point of view of another. Here also learning theory can help us: we can ask, what are the beliefs Alice should expect Bob to learn given particular beliefs of Alice about the world? From a slightly different angle, the central source of difficulty in decision theory is the notion of counterfactuals, and the attempt to prescribe particular meaning to them, which different decision theories do differently. Instead, we can just postulate that, from the subjective point of view of the agent, counterfactuals are ontologically basic. The agent believes emself to have free will, so to speak. Then, the interesting question is, what kind of counterfactuals are produced by the translation of beliefs from the perspective of a third party to the perspective of the given agent.
Indeed, thinking about learning theory led me to the notion of quasi-Bayesian agents (agents that use incomplete/fuzzy models), and quasi-Bayesian agents automatically solve all Newcomb-like decision problems. In other words, quasi-Bayesian agents are effectively a rigorous version of UDT.
Incidentally, to align AI we literally need to translate beliefs from the user's point of view to the AI's point of view. This is also solved via the same quasi-Bayesian approach. In particular, this translation process preserves the "point of updatelessness", which, in my opinion, is the desired result (the choice of this point is subjective).
If the universe was hypercomputational, that would manifest as failures of computable physics.
Well, it would manifest as a failure to create a complete and deterministic theory of computable physics. If your physics doesn't describe absolutely everything, hypercomputation can hide in places it doesn't describe. If your physics is stochastic (like quantum mechanics for example) then the random bits can secretly follow a hypercomputable pattern. Sort of "hypercomputer of the gaps". Like I wrote before, there actually can be situations in which we gradually become confident that something is a hypercomputer (although certainty would grow very slowly), but we will never know precisely what kind of hypercomputer it is.
If true, that is a form of neoKantian idealism. Is that what you really wanted to say?
Unfortunately I am not sufficiently versed in philosophy to say. I do not make any strong claims to novelty or originality.
I don't think hypercomputation is an issue for algorithmic information theory as foundation for metaphysics/induction. The relevant question is, not whether the world contains hypercomputation, but whether our mind is capable of hypercomputation. And here it seems to me like the answer is "no". Even if the answer was "yes", we could probably treat the hypercomputing part of the mind as part of the environment. I wrote a little about it here.
From the perspective of full agency (ie, the negation of partial agency), a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile. You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents!
For reinforcement learning with incomplete/fuzzy hypotheses, this separation doesn't exist, because the update rule for fuzzy beliefs depends on the utility function and in some sense even on the actual policy.