..."If agents know a good amount about each other due to access to source code or other such means, then a singleshot game has the character of iterated game theory. How can we capture that in our rationality criteria?"
I would study this problem by considering a population of learning agents that are sequentially paired up for one-shot games where the source code of each is revealed to the other. This way they can learn how to reason about source codes. In contrast, the model where the agents try to formally prove theorems seems poorly motivated to me.
Actually, here's another thought about this. Consider the following game: each player submits a program, then the programs are given each other as inputs, and executed in anytime mode (i.e. at any given moment each program has to have some answer ready). The execution terminates with some fixed probability at each time step, so that the total execution time follows the geometric distribution. Once execution is finished, the outputs of the programs are interpreted as strategies in a normal-form game and the payoffs are assigned accordingly.
This seems quite similar to an iterated game with geometric time discount! In particular, in this version of the Prisoner's Dilemma, there is the following analogue of the tit-for-tat strategy: start by setting the answer to C, simulate the other agent, and once the other agent starts producing answers, set your answer equal to the other agent's answer. For sufficiently shallow time discount, this is a Nash equilibrium.
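As a toy illustration (the payoff numbers and the sequential update order are my own assumptions, not part of the original description), here is a minimal simulation of the anytime-mode Prisoner's Dilemma with geometric stopping:

```python
import random

# Payoff matrix for the one-shot Prisoner's Dilemma (assumed values):
# (row payoff, column payoff) indexed by (row answer, column answer).
PD = {('C', 'C'): (2, 2), ('C', 'D'): (0, 3),
      ('D', 'C'): (3, 0), ('D', 'D'): (1, 1)}

def copycat(opp_answer):
    """Tit-for-tat analogue: post C initially, then mirror the
    opponent's currently posted answer."""
    return 'C' if opp_answer is None else opp_answer

def defector(opp_answer):
    return 'D'

def play(a, b, stop_prob=0.1, seed=0):
    """Run both programs in anytime mode; execution halts with probability
    stop_prob at each step, so total running time is geometric."""
    rng = random.Random(seed)
    ans_a, ans_b = a(None), b(None)   # initial posted answers
    while rng.random() >= stop_prob:  # continue with prob 1 - stop_prob
        ans_a, ans_b = a(ans_b), b(ans_a)
    return PD[(ans_a, ans_b)]
```

Two copycats post C throughout, so they collect the mutual-cooperation payoff whenever execution happens to stop; against a defector, copycat switches to D after one step, so the defector profits only when execution stops immediately.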
In line with the idea I explained in a previous comment, it seems tempting to look for proper/thermal equilibria in this game with some constraint on the programs. One constraint that seems appealing is: force the program to have logarithmic space complexity modulo oracle access to the other program. It is easy to see that a pair of such programs can be executed using logarithmic space complexity as well.
I'm not usually thinking: "It sure seems like real agents reason using logic. How can we capture that formally in our rationality criteria?" Rather, I'm usually thinking things like: "If agents know a good amount about each other due to access to source code or other such means, then a single-shot game has the character of iterated game theory. How can we capture that in our rationality criteria?"
Perhaps one difference is that I keep talking about modeling something with regret bounds, whereas you are talking about achieving regret bounds. I.e., maybe you know more about what notion of regret you want to minimize, and are going after algorithms which help you minimize it; and I am more uncertain about what notion of regret is relevant, and am looking for formulations of regret which model what I am interested in.
Hmm, no, I don't think this is it. I think that I do both of those. It is definitely important to keep refining our regret criterion to capture more subtle optimality properties.
So I'm saying that (if we think of both prediction and logic as "proxies" in the way I describe) it is plausible that formal connections to prediction have analogs in formal connections to logic, because both seem to play a similar role in connection to doing well in terms of action utility.
Do they though? I am not convinced that logic has a meaningful connection to doing well in terms of action utility. I think that for prediction we can provide natural models that show how prediction connects to doing well, whereas I can't say the same about logic. (But I would be very interested in being proved wrong.)
Are you referring to results which require realizability here?
I usually assume realizability, but for incomplete models, "realizability" just means that the environment satisfies some incomplete models (i.e. has some regularities that the agent can learn).
...the idea was more that by virtue of testing out ways of reasoning on the programs which can be evaluated, the agent would learn to evaluate more difficult cases which can't simply be run, which would in turn sometimes prove useful in dealing with the external environment.
Yes, I think this idea has merit, and I explained how TRL can address it in my other comment. I don't think we need or should study this using formal logic.
...I was thinking in terms of Bayesians with full models before, so logical induction showed me how to think about partial models and such. Perhaps it didn't show you how to do anything you couldn't already, since you were already thinking about things in optimal predictor terms.
Hmm, actually I think logical induction influenced me also to move away from Bayesian purism and towards incomplete models. I just don't think logic is the important part there. I am also not sure about the significance of this forecasting method, since in RL it is more natural to just do maximin instead. But, maybe it is still important somehow.
Another remark about Turing reinforcement learning (I renamed it because it's a reinforcement learner "plugged" into a universal Turing machine, and it is also related to Neural Turing machines). Here is how we can realize Abram's idea that "we can learn a whole lot by reasoning before we even observe that situation".
Imagine that the environment satisfies some hypothesis H that contains the evaluation of a program P, where P is relatively computationally expensive but not prohibitively so. If H is a simple hypothesis, and in particular P is a short program, then we can expect the agent to learn H from a relatively small number of samples. However, because of the computational cost of P, exploiting H might be difficult (because there is a large latency between the time the agent knows it needs to evaluate P and the time it gets the answer). Now assume that P is actually equivalent (as a function) to a different program Q which is long but relatively cheap. Then we can use Q to describe a hypothesis H' which is also true, and which is easily exploitable (i.e. guarantees a higher payoff than H) but which is complex (because Q is long). The agent would need a very large number of (physical) samples to learn H' directly. But, if the agent has plenty of computational resources for use during its "free time", it can use them to learn the (correct but complex) purely mathematical hypothesis that P and Q are equivalent. Then, the conjunction of H and this equivalence guarantees the same high payoff as H'. Thus, TRL converges to exploiting the environment well by virtue of "reasoning before observing".
Moreover, we can plausibly translate this story into a rigorous desideratum along the following lines: for any conjunction of a "purely mathematical" hypothesis Φ and an arbitrary ("physical") hypothesis H, we can learn (=exploit) it given enough "mathematical" samples relative to the complexity (prior probability) of Φ and enough "physical" samples relative to the complexity of H.
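A hedged way to write this desideratum down (the symbols and the shape of the bound are my own placeholders, not a worked-out result): for a purely mathematical hypothesis Φ and a physical hypothesis H,

```latex
\mathrm{Regret}(N_{\mathrm{math}}, N_{\mathrm{phys}}) \;\le\;
  f\!\big(K(\Phi),\, N_{\mathrm{math}}\big)
  \;+\; g\!\big(K(H),\, N_{\mathrm{phys}}\big)
```

where K denotes description complexity (negative log prior probability), and f and g vanish as the respective sample counts grow large relative to the corresponding complexities.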
I don't currently see why we shouldn't ask to converge to Pareto optima. Obviously, we can't expect to do so with arbitrary other agents; but it doesn't seem unreasonable to use an algorithm which has the property of reaching Pareto optima with other agents who use that same algorithm.
I had an interesting thought that made me update towards your position. There is my old post about "meta-threat equilibria", an idea I developed with Scott's help during my trip to Los Angeles. So, I just realized the same principle can be realized in the setting of repeated games. In particular, I am rather confident that the following, if formulated a little more rigorously, is a theorem:
Consider the Iterated Prisoner's Dilemma in which strategies are constrained to depend only on the action of the opponent in the previous round. Then, in the limit γ→1 (shallow time discount), the only thermodynamic equilibrium is mutual cooperation.
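As a weak sanity check of the Nash-equilibrium flavor of this claim (not of the thermodynamic-equilibrium part), one can enumerate all deterministic strategies that depend only on the opponent's previous action and approximate the γ→1 limit by long-run average payoffs; the payoff numbers are assumptions:

```python
from itertools import product

# (row payoff, column payoff) for the Prisoner's Dilemma (assumed values).
PAYOFF = {('C', 'C'): (2, 2), ('C', 'D'): (0, 3),
          ('D', 'C'): (3, 0), ('D', 'D'): (1, 1)}

def make_strategy(init, resp_c, resp_d):
    """A memory-one reactive strategy: an initial move plus a response to
    the opponent's previous move."""
    def strat(opp_last):
        if opp_last is None:
            return init
        return resp_c if opp_last == 'C' else resp_d
    return strat

def avg_payoff(a, b, rounds=1000):
    """Long-run average payoffs, approximating the shallow-discount limit."""
    la = lb = None
    tot_a = tot_b = 0.0
    for _ in range(rounds):
        ma, mb = a(lb), b(la)
        pa, pb = PAYOFF[(ma, mb)]
        tot_a += pa
        tot_b += pb
        la, lb = ma, mb
    return tot_a / rounds, tot_b / rounds

tft = make_strategy('C', 'C', 'D')   # the copy-your-opponent strategy

# Best average payoff any memory-one deviation can extract against tft:
best = max(avg_payoff(make_strategy(i, rc, rd), tft)[0]
           for i, rc, rd in product('CD', repeat=3))
```

Here `best` comes out equal to the mutual-cooperation payoff: no memory-one deviation profits against the copying strategy, consistent with mutual cooperation being an equilibrium at shallow discount.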
Then, there is the question of how to generalize it. Obviously we want to consider more general games and more general strategies, but allowing fully general strategies probably won't work (just like allowing arbitrary programs doesn't work in the "self-modification" setting). One obvious generalization is allowing the strategy to depend on some finite suffix of the history. But, this isn't natural in the context of agent theory: why would agents forget everything that happened before a certain time? Instead, we can constrain the strategy to be finite-state, and maybe require it to be communicating (i.e. forbid "grim triggers" where players change their behavior forever). On the game side, we can consider arbitrary repeated games, or communicating stochastic games (they have to be communicating because otherwise we can represent one-shot games), or even communicating partially observable stochastic games. This leads me to the following bold conjecture:
Consider any suitable (as above) game in which strategies are constrained to be (communicating?) finite-state. Then, in the limit γ→1, all thermodynamic equilibria are Pareto efficient.
This would be the sort of result that seems like it should be applicable to learning agents. That is, if the conjecture is true, there is good hope that appropriate learning agents are guaranteed to converge to Pareto efficient outcomes. Even if the conjecture requires some moderately strong additional assumptions, it seems worth studying.
Suppose the interval between encounters with the predictor is long enough that, due to the agent's temporal discounting, the immediate reward of two-boxing outweighs the later gains which one-boxing provides. In any specific encounter with the predictor, the agent may prefer to two-box, but prefer to have been the sort of agent who predictably one-boxes, and also prefer to precommit to one-box on the next example if a commitment mechanism exists. (This scenario also requires a carefully tuned strength for the predictor, of course.)
Yes, but in this case the agent should two-box. This agent prefers sacrificing the future for short-term gain, so that's what it does. Of course, if there is a way to precommit that is visible to the predictor, it will take it: this way it enjoys the best of both worlds, getting two boxes and causing the predictor to cooperate on the next round.
Btw, I am always interested in the asymptotic in which the time discount parameter γ goes to 1, rather than the asymptotic in which the time discount is fixed and time goes to infinity. That is, when I say the agent "converges" to something, I mean it in the former sense. Because, if the agent mostly cares only about the next thousand years, then it matters little how successful it is a million years from now. That is, the former asymptotic tracks the actual expected utility rather than some arbitrary portion thereof. (Also, I think it is interesting to consider what I dubbed "semi-anytime" algorithms: algorithms that receive a lower bound for the time discount parameter as input and guarantee a level of performance (usually regret) that improves as this lower bound grows. Even better are anytime algorithms: algorithms that don't depend on the time discount at all, but come with a performance guarantee that depends on the time discount. However, in settings such as Delegative RL it is impossible to have an anytime algorithm.)
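The point about fixed discount tracking only an arbitrary portion of utility is just arithmetic: under geometric discount γ, the share of total discounted utility lying beyond horizon T is γ^T, which vanishes for any fixed γ (the numbers below are merely an example):

```python
# Round t carries weight (1 - gamma) * gamma**t, which sums to 1 over all
# rounds, so the utility share beyond horizon T is simply gamma**T.
def tail_fraction(gamma, T):
    return gamma ** T

# With gamma fixed, performance far in the future is a vanishing slice of
# expected utility; only the gamma -> 1 limit keeps the far future relevant.
assert tail_fraction(0.999, 10_000) < 5e-5       # fixed discount: negligible
assert tail_fraction(0.999999, 10_000) > 0.99    # shallower discount: dominant
```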
What happens in ASP? (Say you're in an iterated Newcomb's problem with a predictor much slower than you, but which meets the LIC or similar.)
I am not sure what you mean by "meets the LIC or similar" in this context. If we consider a predictor which is a learning algorithm in itself (i.e., it predicts by learning from the agent's past choices), then the agent will converge to one-boxing. This is because a weak predictor will be fully inside the agent's prior, so the agent will know that one-boxing for long enough will cause the predictor to fill the box. If we consider a predictor that just analyzes the agent's source code and ignores the agent's choices, the agent will converge to two-boxing.
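A toy model of the first case (the box amounts and the predictor's learning rule, "fill the box iff the agent one-boxed last round", are my own assumptions):

```python
def run(agent, rounds=100):
    """Iterated Newcomb with a learning predictor: the opaque box holds
    $100 iff the agent one-boxed on the previous round (empty on the first
    round); the transparent box always holds $1."""
    history, total = [], 0
    for t in range(rounds):
        filled = bool(history) and history[-1] == 1   # 1 = one-box
        action = agent(t, history)
        opaque = 100 if filled else 0
        total += opaque if action == 1 else opaque + 1
        history.append(action)
    return total / rounds

one_boxer = lambda t, h: 1   # the predictor learns to fill the box
two_boxer = lambda t, h: 0   # the predictor never fills the box
```

Here `run(one_boxer)` returns 99.0 per round versus 1.0 for `run(two_boxer)`: an agent whose prior contains this weak predictor can learn that sustained one-boxing pays.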
I was never convinced that "logical ASP" is a "fair" problem. I once joked with Scott that we can consider a "predictor" that is just the single line of code "return DEFECT", but in the comments it says "I am defecting only because I know you will defect." It was a joke, but it was half-serious. The notion of "weak predictor" taken to the limit leads to absurdity, and if you don't take it to the limit it might still lead to absurdity. Logical inductors are one way to try to specify a "weak predictor", but I am not convinced that settings in which logic is inserted ad hoc should be made into desiderata.
I still have a little hope that there will be a nice version, which doesn't involve a commitment-races problem and which doesn't make use of an arbitrary commitment cutoff.
I am not sure we need an arbitrary cutoff. There might be a good solution where the agent can dynamically choose any finite cutoff.
...it doesn't seem unreasonable to use an algorithm which has the property of reaching Pareto optima with other agents who use that same algorithm.
Maybe? The questions are how robust this cooperation is (i.e. what counts as the "same algorithm") and whether there is a significant cost in other situations. And, on the philosophical-ish level, the question is whether such desiderata should be considered essential for rationality/intelligence. But, I agree that this is worth studying.
Thank you for writing this, Abram.
The main difference between our ways of thinking seems to be: you are thinking in terms of what happens inside the algorithm, whereas I am thinking in terms of what desiderata the algorithm satisfies. I trust the latter approach more because, while we can only speculate about how intelligent agents work, we should be able to explain what an intelligent agent is: after all, something made us study intelligent agents in the first place. (Of course we do have some clues about how intelligent agents work from cognitive science and applied ML, but we are far from having the full picture.) In particular, I don't directly care about the distinction between model-based and model-free.
Specifically, the desideratum I often consider is some type of regret bound. This desideratum doesn't directly refer to prediction: it only cares about expected utility! Your reasoning would have us conclude it corresponds to model-free RL. However, the actual algorithms I use to prove regret bounds use posterior sampling: a model-based approach! So, not only do predictions enter the picture, they enter the picture in a way that is clearly justified rather than ad hoc. (That said, the definition of Bayesian regret already refers to a prior, and the coefficients in the regret bounds depend on this prior, so I do introduce a notion of "model" a priori.)
(It should also be noted that a reward-learning framework presumes we get feedback about utility at all. If we get no feedback about reward, then we're forced to only judge hypotheses by predictions, and make what inferences about utility we will. A dire situation for learning theory, but a situation where we can still talk about rational agency more generally.)
I do not agree that this is a "dire situation for learning theory". Quite the contrary, this is exactly the situation I formalized using instrumental reward functions. Unless you think instrumental reward functions are not good enough for some reason? (Of course, in my setting, the agent can get feedback about utility, it just doesn't get it all the time for free. But, the only alternative seems to be "flying spaghetti monster" utility functions that are completely unobservable, and I am not sure that's an interesting case to study.)
Agents that "make up math puzzles" are not necessarily a bad notion, but we should try to be more precise about what it means in terms of desiderata. We need to specify precisely what advantages "making up math puzzles" is supposed to confer.
However, there's no special dividing line between decidable and undecidable problems: any particular restriction to a decidable class might rule out some interesting (decidable but non-obviously so) stuff which we could learn from. So we might end up just going with any computations (halting or not).
It might be useful to consider arbitrary programs, but the agent will still not execute any of them for more than some small finite time. You might argue that this is why it needs logic, to prove theorems about computations that it cannot execute. But, I don't believe this line of thinking leads to anything tractable that is not redundant w.r.t. a simpler logicless approach.
[EDIT: Renamed "CSRL" to "TRL" 2019-09-19]
One related idea that I came up with recently is what I call "Turing reinforcement learning" (TRL). (It is rather half-baked, but I feel that the relevance is such that I should still bring it up. Also, it was inspired by our discussion, so thank you.) Imagine an RL agent connected to a computer. The agent can ask the computer to execute any program (for some finite amount of time, ofc), and its prior already "knows" that such computations produce consistent answers (something about the semantics of programs can also be built into the prior, bringing us closer to logic, but I don't believe it can be anything nearly as strong as "proving theorems about programs using PA" if we want the algorithm to be even weakly feasible). This allows it to consider incomplete hypotheses that describe some parts of the environment in terms of programs (i.e. some physical phenomenon is hypothesized to behave equivalently to the execution of a particular program). The advantage of this approach: better management of computational resources. In classical RL, the resources are allocated "statically": any hypothesis is either in the prior or not. In contrast, in TRL the agent can decide itself which programs are worth running and when. Starting from generic regret bounds for incomplete models, this should lead to TRL satisfying some desiderata superior to classical RL with the same amount of resources, although I haven't worked out the details yet.
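Not TRL itself, but a minimal sketch of the "computer as an oracle for program-shaped hypotheses" flavor (the hypothesis class and the environment below are invented purely for illustration): during "free time" the agent reruns candidate programs against recorded observations and discards the falsified ones:

```python
# The agent's incomplete hypotheses are programs claiming to reproduce some
# part of the environment; the attached computer lets it test them offline.

def hyp_parity(t): return t % 2
def hyp_const(t):  return 0
def hyp_mod3(t):   return 1 if t % 3 == 0 else 0

hypotheses = {'parity': hyp_parity, 'const': hyp_const, 'mod3': hyp_mod3}
true_env = hyp_parity            # unknown to the agent

observations = []
for t in range(6):
    observations.append(true_env(t))
    # "computer" step: rerun each surviving program on all past inputs and
    # drop any program the data falsifies
    hypotheses = {name: h for name, h in hypotheses.items()
                  if all(h(s) == observations[s] for s in range(t + 1))}
```

After six observations only the correct program survives; the point is that which programs get run, and when, is the agent's own resource-allocation decision rather than a fixed feature of the prior.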
This gets back to my earlier intuition about agents having a reasonable chance of getting certain things right on the first try. Learning-theoretic agents don't get things right on the first try. However, agents who learn from logic have "lots of tries" before their first real try in physical time. If you can successfully determine which logical scenarios are relevantly analogous to your own, you can learn what to do just by thinking.
This seems to me the wrong way to look at it. There is no such thing as "learning-theoretic agents". There are agents with good learning-theoretic properties. If there is a property that you can prove in learning theory to be impossible, then no agent can have this property. Once again, I am not ruling out this entire line of thought, but I am arguing for looking for desiderata that can formalize it, instead of using it to justify ad hoc algorithms (e.g. algorithms that use logic).
So, getting back to Vanessa's point in the comment I quoted: can we solve MIRI-style decision problems by considering the iterated problem, rather than the single-shot version? To a large extent, I think so: in logical time, all games are iterated games. However, I don't want to have to set an agent up with a training sequence in which it encounters those specific problems many times. For example, finding good strategies in chess via self-play should come naturally from the way the agent thinks about the world, rather than being an explicit training regime which the designer has to implement. Once the rules of chess are understood, the bottleneck should be thinking time rather than (physical) training instances.
But you have to set an agent up with a training sequence, otherwise how does it ever figure out the rules of chess? I definitely agree that in some situations computational complexity is the bottleneck rather than sample complexity. My point was different: if you drop the context of a training sequence, you might end up with philosophical questions that are either poor starting points or don't make sense at all.
I wonder whether there is a reason this is not on the front page?
I believe that this indeed solves both INP and IXB.
I'd like to understand this part.
Every round of IXB has the following structure:
- Blackmail either arrives or not (observation)
- Agent either pays or does not pay the blackmail (action)
- Infestation either happens or not (observation)
Suppose that on every round, the termite infestation happens with probability p and its cost is c. Then, this fact corresponds to an incomplete model M (i.e., M says that regardless of whether blackmail arrived and regardless of whether blackmail was paid, the probability of infestation is p). M is incomplete because it doesn't say anything about whether blackmail arrives or not. If M is true, the agent can guarantee an expected payoff of −pc per round (by rejecting the blackmail). Therefore, if the agent has a learnable prior that includes M, it will converge to a payoff which is no less than −pc. Of course, achieving this payoff requires actually rejecting the blackmail.
This might seem surprising at first, because there is also a different incomplete model M' that says "if you pay the blackmail, infestation will not happen". M' is false if you use physical causal counterfactuals, but from the agent's perspective M' is consistent with all observations. However, M' only guarantees a lower payoff (because it is unknown whether the blackmail will arrive). Therefore, M' will have no effect on the ultimate behavior of the agent.
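With some assumed numbers (p, c, the blackmail price, and the letter-arrival rate are all hypothetical), the comparison between rejecting and paying under the model M is just arithmetic:

```python
# Hypothetical numbers for the iterated XOR blackmail setting:
p = 0.1              # per-round infestation probability (model M)
c = 1000.0           # infestation cost
price = 150.0        # amount the blackmail letter demands
q = 0.5              # letter-arrival rate (M says nothing about this)

# Under M, infestation is independent of paying, so rejecting guarantees:
payoff_reject = -p * c                  # -100.0 per round

# A policy of always paying adds the letter price whenever a letter arrives,
# without reducing the infestation risk (under M):
payoff_pay = -p * c - q * price         # -175.0 per round

assert payoff_reject > payoff_pay
```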
I'm not sure I should find the precommitment solution satisfying. Won't it make some stupid precommitments early (before it has learned enough about the world to make reasonable precommitments) and screw itself up forever? Is there a generally applicable version of precommitments which ensures learning good behavior?
The precommitments have to expire after some finite time. Incidentally, the agent can still screw itself up forever if the environment contains traps, which is a separate problem.
Notice that I am not making the claim that any sophisticated agent has to be capable of precommitments. This is because I am not convinced that counterfactual mugging belongs to some class of problems that any sophisticated agent should be able to solve (the "fair" problems). Of course, a sophisticated agent which suspects its environment contains this type of situation might want to create a descendant that is capable of precommitments.
The only class of problems that I'm genuinely unsure how to deal with is gametheoretic superrationality.
Let me add that I am not even sure what are the correct desiderata. In particular, I don't think that we should expect any group of good agents to converge to a Pareto optimal outcome. IMO in general it is more reasonable to expect that they converge to a Nash equilibrium (or some refinement, like a proper equilibrium). If the setting is iterated, or if they see each other's source code, then some equilibria are Pareto optimal, and perhaps under some reasonable assumptions convergence to Pareto optimal outcomes should be overwhelmingly likely (because of something like, these outcomes have the largest basin of attraction).
Not sure what you mean by "computational complexity behaves kind of weirdly at low complexities"? In this case, I would be tempted to try the complexity class L (logarithmic space complexity).
The most natural encoding is your "qualia", your raw sense data. This still leaves some freedom for how you represent it, but this freedom has only a very minor effect.
Suppose in 2000 you use "superhuman Othello from self-play" as a benchmark of a certain kind of impressive AI progress, and forecast it to be possible by 2020. It seems you were correct: very plausibly the AlphaZero architecture should work for this. However, in a strict sense your forecast was wrong, because no one has actually bothered to build a powerful Othello agent.
This might be a bad example? Quoting Wikipedia: "There are many Othello programs... that can be downloaded from the Internet for free. These programs, when run on any up-to-date computer, can play games in which the best human players are easily defeated." Well, arguably they are not "from self-play" because they use hand-crafted evaluation functions. But "no one has actually bothered to build a powerful Othello agent" seems just plain wrong.
Alright, so if you already know any fact which connects cereal/granola to high-severity consequences (however slim the connection is), then you should choose based on those facts only.
My intuition is that you will never have a coherent theory of agency with non-Archimedean utility functions. I will change my mind when I see something that looks like the beginning of such a theory (e.g. some reasonable analysis of reinforcement learning for non-Archimedean utility).
If we had infinite compute, that would not eliminate empirical uncertainty. There are many things you cannot compute because you just don't have enough information. This is why in learning theory sample complexity is distinct from computational complexity, and applies to algorithms with unlimited computational resources. So, you would definitely still need to take expectations.
If you're choosing between granola and cereal, then it might be that the granola manufacturer is owned by a multinational conglomerate that also has vested interests in the oil industry, and therefore buying granola contributes to climate change, which has consequences of much larger severity than your usual considerations for granola vs. cereal. Of course there might also be arguments in favor of cereal. But, if you take your non-Archimedean preferences seriously, it is absolutely worth the time to write down all of the arguments in favor of both options, and then choose the side which has the advantage in the largest severity class, however microscopic that advantage is.
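The decision rule in the last sentence is just lexicographic comparison over severity classes (ordered highest severity first; the numbers below are invented):

```python
def lex_prefer(a, b):
    """True iff utility tuple a is strictly preferred to b, comparing
    severity classes from most to least severe."""
    for ua, ub in zip(a, b):
        if ua != ub:
            return ua > ub
    return False

# (climate-severity component, ordinary taste/price component) - invented:
granola = (-1e-9, 10.0)   # microscopic climate disadvantage, tastes great
cereal  = (0.0,   -5.0)

# The microscopic edge in the highest severity class decides everything:
assert lex_prefer(cereal, granola)
assert not lex_prefer(granola, cereal)
```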
If I understand what you do correctly, the severity classes are just the set differences U_i ∖ U_{i−1}, where {U_i} is the filtration. I think that you also assume that each quotient U_i / U_{i−1} is one-dimensional and equipped with a choice of "positive" direction.
It doesn't really have much to do with particles vs. fields. We talk about measurements because measurements are the thing we actually observe. It seems strange to say you can model the world as a causal network if the causal network doesn't include your actual observations. If you want to choose a particular frame of reference and write down the wavefunction time evolution in that frame (while ignoring space) then you can say it's a causal network (which is just a linear chain, and deterministic at that) but IMO that's not very informative. It also loses the property of having things made of parts, which AFAIU was one of your objectives here.
Hmm, no, not really. The quantum fields follow "causality" in some quantum sense (roughly speaking, operators at spacelike separation commute, and any local operator can be expressed in terms of operators localized near the intersection of its past lightcone with any spacelike hypersurface), which is different from the sense used in causal DAGs (in fact you can define "quantum causal DAGs", which are a different mathematical object). Violation of Bell's inequality precisely means that you can't describe the system by a causal DAG. If you want to do the MWI, then the wavefunction doesn't even decompose into data that can be localized.
A few comments:

Because of quantum physics, the universe is actually not described by a causal DAG very well (if we require microscopic precision). But I agree that in practice there are useful approximate descriptions that do look like a causal DAG.

One direction towards studying the learning of symmetric causal DAGs (of a sort) is my ideas about cellular decision processes and the more general "graph decision processes", see this comment.

As a side note, I'm not convinced that "breaking open black boxes" to solve "embedded agency" is a useful way of thinking. IMO the problems associated with "embedded agency" should be solved using a combination of incomplete models and correctly dealing with traps.
This seems interesting, but I think there are some assumptions that have been left out? What prevents us from taking and to be constants, which would give solutions for any ?
I think that the concept of "agency" (although maybe "intelligence" would be a better word?), in the context of AI alignment, implies the ability to learn the environment and exploit this knowledge towards a certain goal. The only way to pursue a goal effectively without learning is having hardcoded knowledge of the environment. But, where would this knowledge come from? For complex environments, it is only likely to come from learning algorithms upstream.
So, a rock is definitely not an agent, since there is nothing it learns about its environment (I am not even sure what the input/output channels of a rock are supposed to be). Q-learning is an agent, but the resulting policy is not an agent in itself. Similarly, AlphaGo is a sort of agent when regarded together with the training loop (it can in principle learn to play different games), but not when disconnected from it. Evolution is an agent, even if not a very powerful one. An ant colony is probably a little agentic, because it can learn something, although I'm not sure how much.
It seems like collecting dollars requires a hardcoded notion of how to draw the boundary around the agents, which runs contrary to the intention. It seems more natural to require the agents to strive to change the world in a particular way (e.g. maximize the number of rubes in the world).
Two comments:

The thing you called "pseudograding" is normally called "filtration".

In practice, because of the complexity of the world, and especially because of the presence of probabilistic uncertainty, an agent following a non-Archimedean utility function will always consider only the component corresponding to the absolute maximum of the severity index, since there will never be a choice between A and B such that these components just happen to be exactly equal. So it will be equivalent to an Archimedean agent whose utility is this worst component. (You can have an index set without an absolute maximum, but I don't think it's possible to define any reasonable utility function like that, where by "reasonable" I roughly mean that it's possible to build some good theory of reinforcement learning out of it.)
A few comments:

Regarding algorithmic similarity. This is an idea I just thought of this moment, so I'm not sure how solid it is, but: given Turing machines M1 and M2 that compute the same function, we want to say whether in some sense they do it "in the same way". Let's consider, for any input x, the entire histories of intermediate states of the computations M1(x) and M2(x). Call them h1(x) and h2(x). We then say that M1 and M2 are "algorithmically equivalent" when there is a low-complexity algorithm A that, given access to h1(x), can produce any given part of h2(x), and vice versa. In particular, the complexity of A must be much lower than the complexity of running M1 or M2 from the beginning. Here, it seems useful to play with different types of complexity bounds (including time/space, for example).

Regarding waterfalls and human beings. I think that a waterfall is not simulating a human being, because there is no algorithm of simultaneously low description complexity and low computational complexity that can decode a human being from a waterfall. Ofc it is not a binary distinction but a fuzzy distinction (the simpler the decoding algorithm is, the more reasonable it is to say a human being is there).

Regarding diamond optimizers. I think that the right way to design such an optimizer would be using an instrumental reward function. We then remain with the problem of how to specify this function. We could start with some ontology or class of ontologies that can reasonably be said to contain diamonds, and for which we can define the reward function unambiguously. These ontologies are then mapped into the space of instrumental states, giving us a partial specification of the instrumental reward function (it is specified on the affine span of the images of the ontologies). Then, there is the question of how to extend the reward function to the entire instrumental state space. I wrote a few thoughts about that in the linked essay, but another approach we can take is considering all extensions that have the same range of values. These form a convex set, which can be interpreted as Knightian uncertainty regarding the reward function. We can then consider maximin policies for this set to be "diamond maximizers". In other words, we want the maximizer to be cautious/conservative about judging the number of diamonds in states that lie outside the ontologies.
I think that there is an interesting subproblem here, which is, given a policy (a RL agent), determine the belief state of this policy. I haven't thought much about this, but it seems reasonable to look at belief states (distributions over environments) w.r.t. which the policy has a sufficiently strong regret bound. This might still leave a lot of ambiguity that has to be regularized somehow.
The problem can be ameliorated by constraining to instrumental reward functions. This gives us agents that are, in some sense, optimizing the state of the environment rather than an arbitrary function of their own behavior. I think this is a better model of what it means to be "goal-directed" than classical reward functions.
Another thing we can do is just applying Occam's razor, i.e. requiring the utility function (and prior) to have low description complexity. This can be interpreted as, taking the intentional stance towards a system is only useful if it results in compression.
I think this is a useful perspective. A way to reformulate (part of) this is: an explanation is good when it looks like it was sampled from the distribution of explanations you would come up with yourself, if you thought about the problem long enough. This resonates a lot with some of my (and other people's) thinking about AI alignment, where we try to constrain the AI to "human-like" behavior or make it imitate human behavior (this happens in Delegative Reinforcement Learning, Quantilization and other approaches).
The idea of explicit world models reminds me of my research direction concerning cellular decision processes. Note that the "grid" of such a process can be replaced by an arbitrary graph, which can also evolve dynamically, and that makes it quite close to the notion of, representing the world as a collection of objects and relationships/interactions. I did not know about the sheaf theoretic angle (added to reading list, thank you!), it might be interesting to see whether these two combine in a natural way.
RL is CDT in the sense that, your model of the world consists of actions and observations, and some causal link from past actions and observations to current observations, but there is no causal origin to the actions. The actions are just set by the agent to whatever it wants.
And, yes, I got CDT and EDT flipped there, good catch!
Heterodox opinion: I think the entire MIRIesque (and academic philosophy) approach to decision theory is confused. The basic assumption seems to be, that we can decouple the problem of learning a model of the world from the problem of taking a decision given such a model. We then ignore the first problem, and assume a particular shape for the model (for example, causal network) which allows us to consider decision theories such as CDT, EDT etc. However, in reality the two problems cannot be decoupled. This is because the type signature of a world model is only meaningful if it comes with an algorithm for how to learn a model of this type.
For example, consider Newcomb's paradox. The agent makes a decision under the assumption that Omega behaves in a certain way. But, where did the assumption come from? Realistic agents have to learn everything they know. Learning normally requires a time sequence. For example, we can consider the iterated Newcomb's paradox (INP). In INP, any reinforcement learning (RL) algorithm will converge to one-boxing, simply because one-boxing gives it the money. This is despite RL naively looking like CDT. Why does it happen? Because in the learned model, the "causal" relationships are not physical causality. The agent comes to believe that taking the one box causes the money to appear there.
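A sketch of the INP point, under the simplifying assumption that Omega predicts the agent's current choice perfectly (the payoff numbers are illustrative, scaled down from the usual million vs. thousand): a model-free epsilon-greedy learner with no causal reasoning at all settles on one-boxing, because that is what pays.

```python
import random
random.seed(0)

# Iterated Newcomb: Omega fills the big box iff it predicts one-boxing,
# and here its prediction of the current choice is always correct.
def payoff(action):
    big = 1.0 if action == "one-box" else 0.0   # opaque box, filled by Omega
    small = 0.001                               # small transparent box
    return big if action == "one-box" else big + small

# Plain epsilon-greedy bandit: no causal model, just "what got me the money".
q = {"one-box": 0.0, "two-box": 0.0}
n = {"one-box": 0, "two-box": 0}
for t in range(2000):
    a = random.choice(list(q)) if random.random() < 0.1 else max(q, key=q.get)
    r = payoff(a)
    n[a] += 1
    q[a] += (r - q[a]) / n[a]      # incremental average of observed rewards

assert q["one-box"] > q["two-box"]   # the learner converges to one-boxing
```

From the learner's point of view, one-boxing simply "causes" the money to be there, exactly the non-physical causality described above.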
In Newcomb's paradox EDT succeeds but CDT fails. Let's consider an example where CDT succeeds and EDT fails: the XOR blackmail. The iterated version would be IXB. In IXB, classical RL doesn't guarantee much because the environment is more complex than the agent (it contains Omega). To overcome this, we can use RL with incomplete models. I believe that this indeed solves both INP and IXB.
Then we can consider e.g. counterfactual mugging. In counterfactual mugging, RL with incomplete models doesn't work. That's because the assumption that Omega responds in a way that depends on a counterfactual world is not in the space of models at all. Indeed, it's unclear how any agent can learn such a fact from empirical observations. One way to fix it is by allowing the agent to precommit. Then the assumption about Omega becomes empirically verifiable. But, if we do this, then RL with incomplete models can solve the problem again.
The only class of problems that I'm genuinely unsure how to deal with is game-theoretic superrationality. However, I also don't see much evidence the MIRIesque approach has succeeded on that front. We probably need to start with just solving the grain of truth problem in the sense of converging to ordinary Nash (or similar) equilibria (which might be possible using incomplete models). Later we can consider agents that observe each other's source code, and maybe something along the lines of this can apply.
The idea of agent clusters seems closely related to my idea about modeling dynamically inconsistent agents. In my model, each subagent controls particular states. More general game-theoretic models can be considered, but they seem to have worse properties.
Regarding the broader question, about how to bridge the gap between agents that are "ideal, perfectly rational consequentialists" and "realistic" agents (e.g. humans), more factors that can be relevant are:

Realistic agents have computational resource bounds.

Realistic agents might be learning algorithms with suboptimal sample complexity (i.e. it takes them longer to learn than a perfect agent).

Realistic agents might only remain coherent within some subspace of the state space (although that might be possible to model using dynamic inconsistency).

We can also consider agents that have some "Knightian uncertainty" about their own utility function. For example, we can consider a convex set in the space of utility functions, and have the agent follow the maximin policy w.r.t. this convex set. As a more specific example, we can consider an instrumental reward function that is only defined on some affine subspace of the instrumental state space, and consider all extensions of it to the entire instrumental state space that don't increase its range.
 I think this is a useful taxonomy.
 The disparity between L2 and L3 is closely related to mesa-optimization.
 The problem that "every agent trivially optimises for being the agent that they are" can be somewhat ameliorated by constraining ourselves to instrumental reward functions.
It might be difficult to tell the difference between a periodic pattern and a random walk. A random walk also produces periods of "bad" and periods of "good" s.t. after every bad period there is a good period (because what else can there be?) and vice versa. The durations of different "cycles" would also usually be of similar magnitude. And, a priori, a random walk seems more likely and requires less explanation.
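A quick way to see this numerically (a toy sketch, not tied to any particular dataset): compute the sign-run lengths of a driftless random walk. Pure noise already decomposes into alternating "good" and "bad" stretches, with no periodic mechanism behind them.

```python
import random
random.seed(1)

# A driftless random walk, read as alternating "good" (above 0) and "bad"
# (below 0) periods.
pos, walk = 0, []
for _ in range(10000):
    pos += random.choice([-1, 1])
    walk.append(pos)

# Lengths of maximal constant-sign stretches (exact zeros ignored).
signs = [1 if x > 0 else -1 for x in walk if x != 0]
runs, cur = [], 1
for a, b in zip(signs, signs[1:]):
    if a == b:
        cur += 1
    else:
        runs.append(cur)
        cur = 1
runs.append(cur)

# Every "bad" run is followed by a "good" run by construction, which is
# exactly the pattern one might be tempted to read as a cycle.
assert sum(runs) == len(signs)
```

Typical samples show many such "cycles" of widely varying but comparable-order lengths, which is why run-alternation alone is weak evidence for periodicity.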
Tangentially, I think it would be really great if there were standard benchmarks on which quantitative theories of society/history/economics could compete by measuring to which extent they can compress the data. That would make it much easier to understand what is real and what is overfitting.
Well, any system that satisfies the Minimal Requirement is doing long term planning on some level. For example, if your AI is approval directed, it still needs to learn how to make good plans that will be approved. Once your system has a superhuman capability of producing plans somewhere inside, you should worry about that capability being applied in the wrong direction (in particular due to mesa-optimization / daemons). Also, even without long term planning, extreme optimization is dangerous (for example an approval directed AI might create some kind of memetic supervirus).
But, I agree that these arguments are not enough to be confident of the strong empirical claim.
a) I believe a weaker version of the empirical claim, namely that the catastrophe is not nearly inevitable, but still not unlikely. That is, I can imagine different worlds in which the probability of the catastrophe is different, and I have uncertainty over which world we are actually in, s.t. on average the probability is sizable.
b) I think that the argument you gave is sort of correct. We need to augment it by: the minimal requirement from the AI is, it needs to effectively block all competing dangerous AI projects, without also doing bad things (which is why you can't just give it the zero utility function). Your counterargument seems weak to me because, moving from utility maximizers to other types of AIs is just replacing something that is relatively easy to reason about with something that is harder to reason about, thereby obscuring the problems (that are still there). I think that whatever your AI is, given that it satisfies the minimal requirement, some kind of utility-maximization-like behavior is likely to arise.
Coming at it from a different angle, complicated systems often fail in unexpected ways. The way people solve this problem in practice is by a combination of mathematical analysis and empirical research. I don't think we have many examples of complicated systems where all failures were avoided by informal reasoning without either empirical or mathematical backing. In the case of superintelligent AI, empirical research alone is insufficient because, without mathematical models, we don't know how to extrapolate empirical results from current AIs to superintelligent AIs, and when superintelligent algorithms are already here it will probably be too late.
c) I think what we can (and should) realistically aim for is, having a mathematical theory of AI, and having a mathematical model of our particular AI, such that in this model we can prove the AI is safe. This model will have some assumptions and parameters that will need to be verified/measured in other ways, through some combination of (i) experiments with AI/algorithms (ii) learning from neuroscience (iii) learning from biological evolution and (iv) leveraging our knowledge of physics. Then, there is also the question of, how precise is the correspondence between the model and the actual code (and hardware). Ideally, we want to do formal verification in which we can test that a certain theorem holds for the actual code we are running. Weaker levels of correspondence might still be sufficient, but that would be Plan B.
Also, the proof can rely on mathematical conjectures in which we have high confidence, such as P ≠ NP. Of course, the evidence for such conjectures is (some sort of) empirical, but it is important that the conjecture is at least a rigorous, well defined mathematical statement.
This idea was inspired by a discussion with Discord user @jbeshir
Model dynamically inconsistent agents (in particular humans) as having a different reward function at every state of the environment MDP (i.e. at every state s1 we have a reward function that assigns values both to this state and to all other states: we have a reward matrix R(s1, s2)). This should be regarded as a game where a different player controls the action at every state. We can now look for value learning protocols that converge to a Nash* (or other kind of) equilibrium in this game.
The simplest setting would be, every time you visit a state, you learn the reward of all previous states w.r.t. the reward function of the current state. Alternatively, every time you visit a state, you can ask about the reward of one previously visited state w.r.t. the reward function of the current state. This is the analogue of classical reinforcement learning with an explicit reward channel. We can now try to prove a regret bound, which takes the form of an ε-Nash equilibrium condition, with ε being the regret. More complicated settings would be analogues of Delegative RL (where the advisor also follows the reward function of the current state) and other value learning protocols.
This seems like a more elegant way to model "corruption" than as a binary or continuous one dimensional variable like I did before.
*Note that although for general games, even if they are purely cooperative, Nash equilibria can be suboptimal due to coordination problems, for this type of games it doesn't happen: in the purely cooperative case, the Nash equilibrium condition becomes the Bellman equation that implies global optimality.
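A minimal sketch of the game (all names, rewards, and the two-state layout are illustrative): a deterministic MDP where the action at each state directly picks the next state, each state is controlled by its own player with its own reward vector, and alternating best response finds a Nash equilibrium of the intra-personal game.

```python
# Two-state toy: the action at each state directly picks the next state, and
# each state is controlled by its own player with its own reward vector.
STATES = ["A", "B"]
R = {"A": {"A": 1.0, "B": 0.0},   # the player acting at A values being at A
     "B": {"A": 0.0, "B": 1.0}}   # the player acting at B values being at B
GAMMA = 0.9

def value(policy, reward, start, horizon=50):
    """Discounted value of a deterministic policy under one reward vector."""
    v, s = 0.0, start
    for t in range(horizon):
        v += GAMMA ** t * reward[s]
        s = policy[s]
    return v

# Alternating best response: each state-player re-picks its action in turn,
# holding the other players' actions fixed.
policy = {"A": "B", "B": "A"}
for _ in range(10):
    for s in STATES:
        policy[s] = max(STATES, key=lambda a: value({**policy, s: a}, R[s], s))

# Fixed point: each player stays where it likes being, a Nash equilibrium
# of the intra-personal game.
assert policy == {"A": "A", "B": "B"}
```

In the purely cooperative special case (all reward vectors equal) the same best-response condition collapses into the Bellman equation, matching the footnote above.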
It is an interesting problem to write explicit regret bounds for reinforcement learning with a prior that is the Solomonoff prior or something similar. Of course, any regret bound requires dealing with traps. The simplest approach is, leaving only environments without traps in the prior (there are technical details involved that I won't go into right now). However, after that we are still left with a different major problem. The regret bound we get is very weak. This happens because the prior contains sets of hypotheses of the form "program template P augmented by a hardcoded bit string b of length n". Such a set contains 2^n hypotheses, and its total probability mass is approximately 2^(-|P|), which is significant for short P (even when n is large). However, the definition of regret requires our agent to compete against a hypothetical agent that knows the true environment, which in this case means knowing both P and b. Such a contest is very hard since learning n bits can take much time for large n.
Note that the definition of regret depends on how we decompose the prior into a convex combination of individual hypotheses. To solve this problem, I propose redefining regret in this setting by grouping the hypotheses in a particular way. Specifically, in algorithmic statistics there is the concept of sophistication. The sophistication of a bit string x is defined as follows. First, we consider the Kolmogorov complexity K(x) of x. Then we consider pairs (Q, y) where Q is a program, y is a bit string, Q(y) = x, and |Q| + |y| ≤ K(x) + O(1). Finally, we minimize over |Q|. The minimal |Q| is called the sophistication of x. For our problem, we are interested in the minimizing Q itself: I call it the "sophisticated core" of x. We now group the hypotheses in our prior by sophisticated cores, and define (Bayesian) regret w.r.t. this grouping.
Coming back to our problematic set of hypotheses, most of it is now grouped into a single hypothesis, corresponding to the shared sophisticated core, which is roughly the program template. Therefore, the reference agent in the definition of regret now knows the template but doesn't know the hardcoded bit string, making it feasible to compete with it.
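The arithmetic behind the problematic hypothesis class can be sketched directly (the particular bit lengths are illustrative): with a template of p bits and a hardcoded payload of n bits, each individual hypothesis has tiny prior mass, but the group's total mass depends only on p.

```python
# Template of p bits plus a hardcoded payload of n bits (numbers illustrative).
p, n = 20, 100

mass_single = 2.0 ** -(p + n)          # prior mass of one fully specified hypothesis
mass_group = (2 ** n) * mass_single    # total mass of all 2^n payload variants

# The group's mass depends only on the template length, so it stays
# significant even when the payload is huge...
assert mass_group == 2.0 ** -p
# ...yet the regret reference agent is assumed to know all p + n bits,
# an n-bit head start that grouping by the sophisticated core removes.
```

Grouping by the sophisticated core makes the reference agent know only the p template bits, so the comparison is against something the learner can realistically catch up with.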
A variant of Christiano's IDA amenable to learning-theoretic analysis. We consider reinforcement learning with a set of observations and a set of actions, where the semantics of the actions is making predictions about future observations. (But, as opposed to vanilla sequence forecasting, making predictions affects the environment.) The reward function is unknown and unobservable, but it is known to satisfy two assumptions:
(i) If we make the null prediction always, the expected utility will be lower bounded by some constant.
(ii) If our predictions sample the k-step future for a given policy π, then our expected utility will be lower bounded by some function F of the expected utility u of π and of k. F is s.t. F(u, k) > u for sufficiently low u, but F(u, k) < u for sufficiently high u (in particular the constant in (i) should be high enough to lead to an increasing sequence). Also, it can be argued that it's natural to assume F(u, k) = u for k = 0.
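To see the increasing sequence concretely, one can iterate the bound from (ii) with a hypothetical one-variable F (suppressing the horizon argument); the particular choice F(u) = 0.9·sqrt(u) is arbitrary, picked only because it exceeds u at low utility and falls below it at high utility.

```python
import math

# Hypothetical F satisfying (ii): F(u) > u at low utility, F(u) < u at high
# utility. The 0.9 * sqrt(u) form is an arbitrary illustrative choice.
def F(u):
    return 0.9 * math.sqrt(u)

u = 0.1                  # the guaranteed utility of the null prediction, as in (i)
for _ in range(100):
    u = F(u)             # amplify: sample your own improved predictions

# The sequence increases up to the fixed point u* = 0.81 and stops there:
# bootstrapping buys utility only up to a ceiling.
assert abs(u - 0.81) < 1e-6
```

The fixed point solves u = 0.9·sqrt(u), i.e. u* = 0.81, and the iteration converges to it from any positive starting utility below the ceiling.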
The goal is proving regret bounds for this setting. Note that this setting automatically deals with malign hypotheses in the prior, bad self-fulfilling prophecies and "corrupting" predictions that cause damage just by seeing them.
However, I expect that without additional assumptions the regret bound will be fairly weak, since the penalty for making wrong predictions grows with the total variation distance between the prediction distribution and the true distribution, which is quite harsh. I think this reflects a true weakness of IDA (or some versions of it, at least): without an explicit model of the utility function, we need very high fidelity to guarantee robustness against e.g. malign hypotheses. On the other hand, it might be possible to ameliorate this problem if we introduce an additional assumption of the form: the utility function is e.g. Lipschitz w.r.t. some metric d. Then, total variation distance is replaced by Kantorovich-Rubinstein distance defined w.r.t. d. The question is, where do we get the metric from. Maybe we can consider something like the process of humans rewriting texts into equivalent texts.
In last year's essay about my research agenda I wrote about an approach I call "learning by teaching" (LBT). In LBT, an AI is learning human values by trying to give advice to a human and seeing how the human changes eir behavior (without an explicit reward signal). Roughly speaking, if the human permanently changes eir behavior as a result of the advice, then one can assume the advice was useful. Partial protection against manipulative advice is provided by a delegation mechanism, which ensures the AI only produces advice that is in the space of "possible pieces of advice a human could give" in some sense. However, this protection seems insufficient since it allows for giving all arguments in favor of a position without giving any arguments against a position.
To add more defense against manipulation, I propose to build on the "AI debate" idea. However, in this scheme, we don't need more than one AI. In fact, this is a general fact: for any protocol involving multiple AIs, there is a protocol involving just one AI that works (at least roughly, qualitatively) just as well. Proof sketch: if we can prove that, under certain assumptions, a given protocol is safe/effective, then we can design a single AI which has those assumptions baked into its prior. Such an AI would be able to understand that simulating the protocol would lead to a safe/effective outcome, and would only choose a different strategy if it leads to an even better outcome under the same assumptions.
The way we use "AI debate" is not by implementing an actual AI debate. Instead, we use it to formalize our assumptions about human behavior. In ordinary IRL, the usual assumption is "a human is a (nearly) optimal agent for eir utility function". In the original version of LBT, the assumption was of the form "a human is (nearly) optimal when receiving optimal advice". In debate-LBT the assumption becomes "a human is (nearly) optimal* when exposed to a debate between two agents at least one of which is giving optimal advice". Here, the human observes this hypothetical debate through the same "cheap talk" channel through which it receives advice from the single AI.
Notice that debate can be considered to be a form of interactive proof system (with two or more provers). However, the requirements are different from classical proof systems. In classical theory, the requirement is "When the prover is honestly arguing for a correct proposition, the verifier is convinced. For any prover the verifier cannot be convinced of a false proposition." In "debate proof systems" the requirement is "If at least one prover is honest, the verifier comes to the correct conclusion". That is, we don't guarantee anything when both provers are dishonest. It is easy to see that these debate proof systems admit any problem in PSPACE: given a game, both provers can state their assertions as to which side wins the game, and if they disagree they have to play the game for the corresponding sides.
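A toy sketch of such a debate proof system, assuming a tiny two-player game tree with ±1 leaves (the tree itself and the "lazy" dishonest strategy are illustrative): provers who disagree about the value play the game out for the sides they claimed, and a prover honestly backing the true value wins the playout whatever the other prover does, so the verifier reaches the correct conclusion.

```python
# A game tree as nested lists; integer leaves give the winner: +1 means the
# maximizing side wins, -1 the minimizing side.
def minimax(node, maximizing):
    if isinstance(node, int):
        return node
    vals = [minimax(c, not maximizing) for c in node]
    return max(vals) if maximizing else min(vals)

def honest(node, maximizing):
    """An honest prover plays optimally for the side it claimed."""
    key = lambda c: minimax(c, not maximizing)
    return max(node, key=key) if maximizing else min(node, key=key)

def lazy(node, maximizing):
    """A dishonest/lazy prover: always takes the first available move."""
    return node[0]

def playout(node, strat_max, strat_min, maximizing=True):
    """The verifier just watches the two provers play the game out."""
    while not isinstance(node, int):
        node = (strat_max if maximizing else strat_min)(node, maximizing)
        maximizing = not maximizing
    return node

game = [[1, 1], [-1, 1]]          # true value: +1 (the maximizing side wins)
assert minimax(game, True) == 1
# If the prover backing the true claim is honest, the verifier reaches the
# correct conclusion no matter what the other prover does.
assert playout(game, honest, lazy) == 1
```

Nothing is guaranteed when both provers are dishonest, matching the requirement stated above; and since evaluating such game trees is PSPACE-complete in general, this playout mechanism is exactly what gives the system its PSPACE reach.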
*Fiddling with the assumptions a little, instead of "optimal" we can probably just say that the AI is guaranteed to achieve this level of performance, whatever it is.
I'm posting a few research directions in my research agenda about which I haven't written much elsewhere (except maybe in the MIRIx Discord server), and for which I so far haven't got the time to make a full length essay with mathematical details. Each direction is in a separate child comment.
I focus mostly on formal properties algorithms can or cannot have, rather than the algorithms themselves. So, from my point of view, it doesn't matter whether the prior is "explicit" and I doubt it's even a welldefined question. What I mean by "prior" is, more or less, whatever probability measure has the best Bayesian regret bound for the given RL algorithm.
I think the prior will have to look somewhat like the universal prior. Occam's razor is a foundational principle of rationality, and any reasonable algorithm should have inductive bias towards simpler hypotheses. I think there's even some work trying to prove that deep learning already has such inductive bias. At the same time, the space of hypotheses has to be very rich (although still constrained by computational resources and some additional structural assumptions needed to make learning feasible).
I think that DRL doesn't require a prior (or, more generally, algorithmic building blocks) substantially different from what is needed for capabilities, since if your algorithm is superintelligent (in the sense that, it's relevant to either causing or mitigating X-risk) then it has to create sophisticated models of the world that include people, among other things, and therefore forcing it to model the advisor as well doesn't make the task substantially harder (well, it is harder in the sense that the regret bound is weaker, but that is not because of the prior).
For states that cause existential catastrophes this seems obviously desirable. Maybe my objection is more that with this sort of algorithm you need to have the right set of hypotheses in the first place, and that seems like the main difficulty?
Ultimately, the set of hypotheses should be something like the universal prior. More precisely, it should be whatever we need to use instead of the universal prior to get "general" intelligence that is computationally efficient. In other words, I don't think the choice of prior here is substantially different or more difficult from the choice of prior for AGI from a pure capability POV.
Maybe I'm also saying that this feels vulnerable to nearest unblocked strategies. Suppose the AI has learned that its reward function is to maximize paperclips, and the advisor doesn't realize that a complicated gadget the AI has built is a self-replicating nanorobot that will autonomously convert atoms into paperclips. It doesn't seem like DRL saves us here.
The way I imagine it will work, the advisor will not do something weird and complicated that ey don't understand emself. So the AI also cannot do something weird and complicated that the advisor doesn't understand. In order for this not to be too constraining, I imagine the advisor having some kind of "diary" where ey write down eir thoughts and reasoning, which is considered a part of normal interaction with the environment. The advisor will only act on complicated plans after laying down the reasoning. The AI will then have to write down in this diary an explanation for its plans that will be understandable by and convincing for the advisor. This is partially protected from manipulations, because the explanation has to come from the space of explanations that the advisor could produce. That said, I think this defense from manipulation is insufficient in itself (because the AI can write down all arguments for a given position that the advisor could come up with, without writing down any arguments against it), and I have a research direction based on the "debate" approach about how to strengthen it.
Maybe another way of putting it: is there additional safety conferred by this approach that you couldn't get by having a human review all of the AI's actions? If so, should I think of this as "we want a human to review actions, but that's expensive, DRL is a way to make it more sample efficient"?
The current version of the formalism is more or less the latter, but you should imagine the review to be rather conservative (like in the nanorobot example). In the "soft" version it will become a limit on how much the AI policy deviates from the advisor policy, so it's not quite a review in the usual sense: there is no binary division between "legal" and "illegal" actions. I think of it more like, the AI should emulate an "improved" version of the advisor: do all the things the advisor would do on eir "best day".
Hi David, if you want to discuss this more, I think we can do it in person? AFAIK you live in Israel? For example, you can come to my talk in the LessWrong meetup on July 2.
Dealing with corrupt states requires a "different" algorithm, but the modification is rather trivial: for each hypothesis that includes dynamics and corruption, you need to replace the corrupt states by an inescapable state with reward zero and run the usual PSRL algorithm on this new prior. Indeed, the algorithm deals with corruption by never letting the agent go there. I am not sure I understand why you think this is not a good approach. Consider a corrupt state in which the human's brain has been somehow scrambled to make em give high rewards. Do you think such a state should be explored? Maybe your complaint is that in the real world corruption is continuous rather than binary, and the advisor avoids most of corruption but not all of it and not with 100% success probability. In this case, I agree, the current model is extremely simplified, but it still feels like progress. You can see a model of continuous corruption in DIRL, a simpler setting. More generally, I think that a better version of the formalism would build on ideas from quantilization and catastrophe mitigation to arrive at a setting where, you have a low rate of falling into traps or accumulating corruption as long as your policy remains "close enough" to the advisor policy w.r.t. some metric similar to infinity-Renyi divergence (and, as long as your corruption remains low).
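The modification described above can be sketched as a transform on tabular hypotheses (state and action names are illustrative): every corrupt state becomes an inescapable state with reward zero, and the usual PSRL machinery then runs on the transformed prior.

```python
# Quarantine corrupt states: make them absorbing with reward zero, so the
# planner never sees any benefit in entering them.
def quarantine(hypothesis, corrupt):
    trans, reward = hypothesis
    new_trans, new_reward = {}, {}
    for s in trans:
        if s in corrupt:
            new_trans[s] = {a: s for a in trans[s]}   # inescapable
            new_reward[s] = 0.0
        else:
            new_trans[s] = dict(trans[s])
            new_reward[s] = reward[s]
    return new_trans, new_reward

trans = {"safe": {"stay": "safe", "hack": "wirehead"},
         "wirehead": {"stay": "wirehead", "hack": "wirehead"}}
reward = {"safe": 0.1, "wirehead": 1.0}   # corruption looks great to the raw signal

t2, r2 = quarantine((trans, reward), corrupt={"wirehead"})
assert r2["wirehead"] == 0.0 and t2["wirehead"]["stay"] == "wirehead"
```

After the transform, the high raw reward of the corrupt state is invisible to the agent, so avoiding corruption falls out of ordinary optimal planning rather than needing a separate mechanism.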
Hi Rohin, thank you for writing about my work! I want to address some issues you brought up regarding Delegative RL.
I worry about there being a Cartesian boundary between the agent and the environment, though perhaps even here as long as the advisor is aware of problems caused by such a boundary, they can be modeled as traps and thus avoided.
Yes. I think that the Cartesian boundary is part of the definition of the agent, and events that violate the Cartesian boundary should be thought of as destroying the agent. Destruction of the agent is certainly a trap, since from the POV of the agent it is irreversible. See also the "death of the agent" subsection of the imperceptible rewards essay.
One thing I wonder about is whether the focus on traps is necessary. With the presence of traps in the theoretical model, one of the main challenges is in preventing the agent from falling into a trap due to ignorance. However, it seems extremely unlikely that an AI system manages to take some irreversible catastrophic action by accident; I'm much more worried about the case where the AI system is adversarially optimizing against us and intentionally takes an irreversible catastrophic action.
I think that most of the "intentional" catastrophic actions can be regarded as "due to ignorance" from an appropriate perspective (the main exception is probably nonCartesian daemons). Consider two examples:
Example 1 is corrupt states, that I discussed here. These are states in which the specified reward function doesn't match the intended reward function (and also possibly the advisor becomes unreliable). We can equip the agent with a prior that accounts for the existence of such states. However, without further help, the agent doesn't have enough information to know when it could enter one. So, if the agent decides to e.g. hack its own reward channel, one perspective is that it is an intentional action against us, but another perspective is that it is due to the agent's ignorance of the true model of corruption. This problem is indeed fixed by Delegative RL (assuming that, in uncorrupt states, the advisor's actions don't lead to corruption).
Example 2 is malign hypotheses. The agent's prior may contain hypotheses that are agentic in themselves. Such a hypothesis can intentionally produce correct predictions up to a "traitorous turn" point, at which it produces predictions that manipulate the agent into an irreversible catastrophic action. From the perspective of the "outer" agent this is "ignorance", but from the perspective of the "inner" agent, this is intentional. Once again delegative RL fixes it: at the traitorous turn, a DRL agent detects the ambiguity in predictions and the critical need to take the right action, leading it to delegate. Observing the advisor's action leads it to update away from the malign hypothesis.
You can create an empty profile just for the LessWrong group, but it's your call of course. It's just convenient for tracking meetups, and there are some interesting discussions. There is also a Google group but it is mostly dormant nowadays.
How about posting it on the Facebook group and discussing more there? It might be worthwhile to put it on rationality.co.il for example.
Update: In total, 16 people contacted me. Offer of mentorship is closed, since I have sufficiently many candidates for now. Offer of collaboration remains open for experienced researchers (i.e. researchers that (i) have some track record of original math / theoretical compsci research, and (ii) are able to take on concrete open problems without much guidance).
Hey, it's an AI safety unconference, we should be able to coordinate acausally ;)
Is there a specific deadline for signing up?
AFAIU discussing charged political issues is not allowed, or at least very frowned upon on LW, and for good reasons. So, I can't discuss the object level. On the other hand, the meta level is too vague. That is, the error is in the way the abstract reasoning is applied to case X (it's just not the right model), rather than in the abstract reasoning itself.
I didn't know about this feature. It has advantages and disadvantages, but I will at least consider it. Thank you!