## Posts

Slate Star Codex Tel Aviv 2019 2019-09-05T18:29:53.039Z · score: 6 (1 votes)
Offer of collaboration and/or mentorship 2019-05-16T14:16:20.684Z · score: 109 (36 votes)
Reinforcement learning with imperceptible rewards 2019-04-07T10:27:34.127Z · score: 18 (9 votes)
Dimensional regret without resets 2018-11-16T19:22:32.551Z · score: 9 (4 votes)
Computational complexity of RL with traps 2018-08-29T09:17:08.655Z · score: 14 (5 votes)
Entropic Regret I: Deterministic MDPs 2018-08-16T13:08:15.570Z · score: 12 (7 votes)
Algo trading is a central example of AI risk 2018-07-28T20:31:55.422Z · score: 27 (16 votes)
The Learning-Theoretic AI Alignment Research Agenda 2018-07-04T09:53:31.000Z · score: 27 (10 votes)
Meta: IAFF vs LessWrong 2018-06-30T21:15:56.000Z · score: 1 (1 votes)
Computing an exact quantilal policy 2018-04-12T09:23:27.000Z · score: 10 (2 votes)
Quantilal control for finite MDPs 2018-04-12T09:21:10.000Z · score: 3 (3 votes)
Improved regret bound for DRL 2018-03-02T12:49:27.000Z · score: 0 (0 votes)
More precise regret bound for DRL 2018-02-14T11:58:31.000Z · score: 1 (1 votes)
Catastrophe Mitigation Using DRL (Appendices) 2018-02-14T11:57:47.000Z · score: 0 (0 votes)
Bugs? 2018-01-21T21:32:10.492Z · score: 4 (1 votes)
The Behavioral Economics of Welfare 2017-12-22T11:35:09.617Z · score: 28 (12 votes)
Improved formalism for corruption in DIRL 2017-11-30T16:52:42.000Z · score: 0 (0 votes)
Why DRL doesn't work for arbitrary environments 2017-11-30T12:22:37.000Z · score: 0 (0 votes)
Catastrophe Mitigation Using DRL 2017-11-22T05:54:42.000Z · score: 2 (1 votes)
Catastrophe Mitigation Using DRL 2017-11-17T15:38:18.000Z · score: 0 (0 votes)
Delegative Reinforcement Learning with a Merely Sane Advisor 2017-10-05T14:15:45.000Z · score: 1 (1 votes)
On the computational feasibility of forecasting using gamblers 2017-07-18T14:00:00.000Z · score: 0 (0 votes)
Delegative Inverse Reinforcement Learning 2017-07-12T12:18:22.000Z · score: 11 (3 votes)
Learning incomplete models using dominant markets 2017-04-28T09:57:16.000Z · score: 1 (1 votes)
Dominant stochastic markets 2017-03-17T12:16:55.000Z · score: 0 (0 votes)
A measure-theoretic generalization of logical induction 2017-01-18T13:56:20.000Z · score: 3 (3 votes)
Towards learning incomplete models using inner prediction markets 2017-01-08T13:37:53.000Z · score: 2 (2 votes)
Subagent perfect minimax 2017-01-06T13:47:12.000Z · score: 0 (0 votes)
Minimax forecasting 2016-12-14T08:22:13.000Z · score: 0 (0 votes)
Minimax and dynamic (in)consistency 2016-12-11T10:42:08.000Z · score: 0 (0 votes)
Attacking the grain of truth problem using Bayes-Savage agents 2016-10-20T14:41:56.000Z · score: 1 (1 votes)
IRL is hard 2016-09-13T14:55:26.000Z · score: 0 (0 votes)
Stabilizing logical counterfactuals by pseudorandomization 2016-05-25T12:05:07.000Z · score: 1 (1 votes)
Stability of optimal predictor schemes under a broader class of reductions 2016-04-30T14:17:35.000Z · score: 0 (0 votes)
Reflection with optimal predictors 2016-03-22T17:20:37.000Z · score: 1 (1 votes)
Logical counterfactuals for random algorithms 2016-01-06T13:29:52.000Z · score: 3 (3 votes)
Quasi-optimal predictors 2015-12-25T14:17:05.000Z · score: 2 (2 votes)
Implementing CDT with optimal predictor systems 2015-12-20T12:58:44.000Z · score: 1 (1 votes)
Bounded Solomonoff induction using optimal predictor schemes 2015-11-10T13:59:29.000Z · score: 1 (1 votes)
Superrationality in arbitrary games 2015-11-04T18:20:41.000Z · score: 7 (6 votes)
Optimal predictor schemes 2015-11-01T17:28:46.000Z · score: 2 (2 votes)
Optimal predictors for global probability measures 2015-10-06T17:40:19.000Z · score: 0 (0 votes)
Logical counterfactuals using optimal predictor schemes 2015-10-04T19:48:23.000Z · score: 0 (0 votes)
Towards reflection with relative optimal predictor schemes 2015-09-30T15:44:21.000Z · score: 1 (1 votes)
Improved error space for universal optimal predictor schemes 2015-09-30T15:08:53.000Z · score: 0 (0 votes)
Optimal predictor schemes pass a Benford test 2015-08-30T13:25:59.000Z · score: 3 (3 votes)
Optimal predictors and propositional calculus 2015-07-04T09:51:38.000Z · score: 0 (0 votes)
Optimal predictors and conditional probability 2015-06-30T18:01:31.000Z · score: 2 (2 votes)
A complexity theoretic approach to logical uncertainty (Draft) 2015-05-11T20:04:28.000Z · score: 5 (5 votes)

Comment by vanessa-kosoy on Do Sufficiently Advanced Agents Use Logic? · 2019-09-21T13:26:50.643Z · score: 2 (1 votes) · LW · GW

..."If agents know a good amount about each other due to access to source code or other such means, then a single-shot game has the character of iterated game theory. How can we capture that in our rationality criteria?"

I would study this problem by considering a population of learning agents that are sequentially paired up for one-shot games where the source code of each is revealed to the other. This way they can learn how to reason about source codes. In contrast, the model where the agents try to formally prove theorems seems poorly motivated to me.

Actually, here's another thought about this. Consider the following game: each player submits a program, then the programs are given each other as inputs, and executed in anytime mode (i.e. at given moment each program has to have some answer ready). The execution has some probability to terminate on each time step, so that the total execution time follows the geometric distribution. Once execution is finished, the output of the programs is interpreted as strategies in a normal-form game and the payoffs are accordingly.

This seems quite similar to an iterated game with geometric time discount! In particular, in this version of the Prisoner's Dilemma, there is the following analogue of the tit-for-tat strategy: start by setting the answer to C, simulate the other agent, once the other agents starts producing answers, set your answer to equal to the other agent's answer. For sufficiently shallow time discount, this is a Nash equilibrium.

In line with the idea I explained in a previous comment, it seems tempting to look for proper/thermal equilibria in this game with some constraint on the programs. One constraint that seems appealing is: force the program to have space complexity modulo oracle access to the other program. It is easy to see that a pair of such programs can be executed using space complexity as well.

Comment by vanessa-kosoy on Do Sufficiently Advanced Agents Use Logic? · 2019-09-21T11:49:57.794Z · score: 2 (1 votes) · LW · GW

I'm not usually thinking: "It sure seems like real agents reason using logic. How can we capture that formally in our rationality criteria?" Rather, I'm usually thinking things like: "If agents know a good amount about each other due to access to source code or other such means, then a single-shot game has the character of iterated game theory. How can we capture that in our rationality criteria?"

I would study this problem by considering a population of learning agents that are sequentially paired up for one-shot games where the source code of each is revealed to the other. This way they can learn how to reason about source codes. In contrast, the model where the agents try to formally prove theorems seems poorly motivated to me.

Perhaps one difference is that I keep talking about modeling something with regret bounds, whereas you are talking about achieving regret bounds. IE, maybe you more know what notion of regret you want to minimize, and are going after algorithms which help you minimize it; and I am more uncertain about what notion of regret is relevant, and am looking for formulations of regret which model what I am interested in.

Hmm, no, I don't think this is it. I think that I do both of those. It is definitely important to keep refining our regret criterion to capture more subtle optimality properties.

So I'm saying that (if we think of both prediction and logic as "proxies" in the way I describe) it is plausible that formal connections to prediction have analogs in formal connections to logic, because both seem to play a similar role in connection to doing well in terms of action utility.

Do they though? I am not convinced the logic has a meaningful connection to doing well in terms of action utility. I think that for prediction we can provide natural models that show how prediction connects to doing well, whereas I can't say the same about logic. (But I would be very interested in being proved wrong.)

Are you referring to results which require realizability here?

I usually assume realizability, but for incomplete model "realizability" just means that the environment satisfies some incomplete models (i.e. has some regularities that the agent can learn).

...the idea was more that by virtue of testing out ways of reasoning on the programs which can be evaluated, the agent would learn to evaluate more difficult cases which can't simply be run, which would in turn sometimes prove useful in dealing with the external environment.

Yes, I think this idea has merit, and I explained how TRL can address it in my other comment. I don't think we need or should study this using formal logic.

...I was thinking in terms of Bayesians with full models before -- so logical induction showed me how to think about partial models and such. Perhaps it didn't show you how to do anything you couldn't already, since you were already thinking about things in optimal predictor terms.

Hmm, actually I think logical induction influenced me also to move away from Bayesian purism and towards incomplete models. I just don't think logic is the important part there. I am also not sure about the significance of this forecasting method, since in RL it is more natural to just do maximin instead. But, maybe it is still important somehow.

Comment by vanessa-kosoy on Do Sufficiently Advanced Agents Use Logic? · 2019-09-19T09:54:02.796Z · score: 2 (1 votes) · LW · GW

Another remark about Turing reinforcement learning (I renamed it so because, it's a reinforcement learner "plugged" into a universal Turing machine, and it is also related to Neural Turing machines). Here is how we can realize Abram's idea that "we can learn a whole lot by reasoning before we even observe that situation".

Imagine that the environment satisfies some hypothesis that contains the evaluation of a program , and that is relatively computationally expensive but not prohibitively so. If is a simple hypothesis, and in particular is a short program, then we can expect the agent to learn from a relatively small number of samples. However, because of the computational cost of , exploiting might be difficult (because there is a large latency between the time the agent knows it needs to evaluate and the time it gets the answer). Now assume that is actually equivalent (as a function) to a different program which is long but relatively cheap. Then we can use to describe hypothesis which is also true, and which is easily exploitable (i.e. guarantees a higher payoff than ) but which is complex (because is long). The agent would need a very large number of (physical) samples to learn directly. But, if the agent has plenty of computational resources for use during its "free time", it can use them to learn the (correct but complex) hypothesis . Then, the conjunction of and guarantees the same high payoff as . Thus, TRL converges to exploiting the environment well by the virtue of "reasoning before observing".

Moreover, we can plausibly translate this story into a rigorous desideratum along the following lines: for any conjunction of a "purely mathematical" hypothesis and arbitrary ("physical") hypothesis , we can learn (=exploit) it given enough "mathematical" samples relatively to the complexity (prior probability) of and enough "physical" samples relatively to the complexity of .

Comment by vanessa-kosoy on Formalising decision theory is hard · 2019-09-16T08:23:42.151Z · score: 2 (1 votes) · LW · GW

I don't currently see why we shouldn't ask to converge to pareto optima. Obviously, we can't expect to do so with arbitrary other agents; but it doesn't seem unreasonable to use an algorithm which has the property of reaching pareto-optima with other agents who use that same algorithm.

I had an interesting thought that made me update towards your position. There is my old post about "metathreat equilibria", an idea I developed with Scott's help during my trip to Los Angeles. So, I just realized the same principle can be realized in the setting of repeated games. In particular, I am rather confident that the following, if formulated a little more rigorously, is a theorem:

Consider the Iterated Prisoner's Dilemma in which strategies are constrained to depend only on the action of the opponent in the previous round. Then, at the limit (shallow time discount), the only thermodynamic equilibrium is mutual cooperation.

Then, there is the question of how to generalize it. Obviously we want to consider more general games and more general strategies, but allowing fully general strategies probably won't work (just like allowing arbitrary programs doesn't work in the "self-modification" setting). One obvious generalization is allowing the strategy to depend on some finite suffix of the history. But, this isn't natural in the context of agent theory: why would agents forget everything that happened before a certain time? Instead, we can constraint the strategy to be finite-state, and maybe require it to be communicating (i.e. forbid "grim triggers" where players change their behavior forever). On the game side, we can consider arbitrary repeated games, or communicating stochastic games (they have to be communicating because otherwise we can represent one-shot games), or even communicating partially observable stochastic games. This leads me to the following bold conjecture:

Consider any suitable (as above) game in which strategies are constrained to be (communicating?) finite-state. Then, at the limit , all thermodynamic equilibria are Pareto efficient.

This would be the sort of result that seems like it should be applicable to learning agents. That is, if the conjecture is true, there is good hope that appropriate learning agents are guaranteed to converge to Pareto efficient outcomes. Even if it the conjecture requires some moderately strong additional assumptions, it seems worth studying.

Comment by vanessa-kosoy on Formalising decision theory is hard · 2019-09-15T16:26:01.831Z · score: 2 (1 votes) · LW · GW

Suppose the interval between encounters with the predictor is long enough that, due to the agent's temporal discounting, the immediate reward of two-boxing outweighs the later gains which one-boxing provides. In any specific encounter with the predictor, the agent may prefer to two-box, but prefer to have been the sort of agent who predictably one-boxes, and also preferring to pre-commit to one-box on the next example if a commitment mechanism exists. (This scenario also requires a carefully tuned strength for the predictor, of course.)

Yes, but in this case the agent should two-box. This agent prefers sacrificing the future for short term gain, so that's what it does. Ofc if there a way to precommit that is visible to the predictor it will take it: this way it enjoys the best of both worlds, getting two boxes and causing the predictor to cooperate on the next round.

Btw, I am always interested in the asymptotic when the time discount parameter goes to , rather than the asymptotic when time discount is fixed and time goes to . That is, when I say the agent "converges" to something, I mean it in the former sense. Because, if the agent mostly cares only about the next thousands years, then it matters little how successful it is a million years from now. That is, the former asymptotic tracks the actual expected utility rather than some arbitrary portion thereof. (Also, I think it is interesting to consider what I dubbed "semi-anytime" algorithms: algorithms that receive a lower bound for the time discount parameter as input and guarantee a level of performance (usually regret) that improves as this lower bound grows. Even better are anytime algorithms: algorithms that don't depend on the time discount at all, but come with a performance guarantee that depends on the time discount. However, in settings such as Delegative RL it is impossible to have an anytime algorithm.)

Comment by vanessa-kosoy on Formalising decision theory is hard · 2019-09-14T10:27:22.687Z · score: 2 (1 votes) · LW · GW

What happens in ASP? (Say you're in an iterated Newcomb's problem with a predictor much slower than you, but which meets the LIC or similar.)

I am not sure what you mean by "meets the LIC or similar" in this context. If we consider a predictor which is a learning algorithm in itself (i.e., it predicts by learning from the agent's past choices), then the agent will converge to one-boxing. This is because a weak predictor will be fully inside the agent's prior, so the agent will know that one-boxing for long enough will cause the predictor to fill the box. If we consider a predictor that just analyzes the agent's source code and ignores the agent's choices, the agent will converge to two-boxing.

I was never convinced that "logical ASP" is a "fair" problem. I once joked with Scott that we can consider a "predictor" that is just the single line of code "return DEFECT" but in the comments it says "I am defecting only because I know you will defect." It was a joke, but it was half-serious. The notion of "weak predictor" taken to the limit leads to absurdity, and if you don't take it to the limit it might still lead to absurdity. Logical inductors in one way to try specifying a "weak predictor", but I am not convinced that settings in which logic is inserted ad hoc should be made into desiderata.

I still have a little hope that there will be a nice version, which doesn't involve a commitment-races problem and which doesn't make use of an arbitrary commitment cutoff.

I am not sure we need an arbitrary cutoff. There might be a good solution where the agent can dynamically choose any finite cutoff.

...it doesn't seem unreasonable to use an algorithm which has the property of reaching pareto-optima with other agents who use that same algorithm.

Maybe? The questions are, how robust is this cooperation (i.e. what counts as "same algorithm"), and whether there is a significant cost in other situations. And, on the philosophical-ish level, the question is whether such desiderata should be considered essential for rationality/intelligence. But, I agree that this is worth studying.

Comment by vanessa-kosoy on Do Sufficiently Advanced Agents Use Logic? · 2019-09-14T09:37:42.547Z · score: 15 (5 votes) · LW · GW

Thank you for writing this, Abram.

The main difference between our ways of thinking seems to be: You are thinking in terms of, what happens inside the algorithm. I am thinking in terms of, what desiderata does the algorithm satisfy. I trust the latter approach more, because, while we can only speculate how intelligent agents work, we should be able to explain what an intelligent agent is: after all, something made us study intelligent agents in the first place. (Of course we do have some clues about how intelligent agents work from cognitive science and applied ML, but we are far from having the full picture.) In particular, I don't directly care about the distinction between model-based and model-free.

Specifically, the desiderata I often consider is some type of regret bound. This desiderata doesn't directly refer to prediction: it only cares about expected utility! Your reasoning would have us conclude it corresponds to model-free RL. However, the actual algorithms I use to prove regret bounds use posterior sampling: a model-based approach! So, not only that predictions enter the picture, but they enter the picture in a way that is clearly justified rather than ad hoc. (That said, the definition of Bayesian regret already refers to a prior, and the coefficients in the regret bounds depend on this prior, so I do introduce a notion of "model" a priori.)

(It should also be noted that a reward-learning framework presumes we get feedback about utility at all. If we get no feedback about reward, then we're forced to only judge hypotheses by predictions, and make what inferences about utility we will. A dire situation for learning theory, but a situation where we can still talk about rational agency more generally.)

I do not agree that this is a "dire situation for learning theory". Quite the contrary, this is exactly the situation I formalized using instrumental reward functions. Unless you think instrumental reward functions are not good enough for some reason? (Of course, in my setting, the agent can get feedback about utility, it just doesn't get it all the time for free. But, the only alternative seems to be "flying spaghetti monster" utility functions that are completely unobservable, and I am not sure that's an interesting case to study.)

Agents that "make up math puzzles" is not necessarily a bad notion, but we should try to be more precise about what it means in terms of desiderata. We need to specify precisely what advantages "making up math puzzles" is supposed to confer.

However, there's no special dividing line between decidable and undecidable problems -- any particular restriction to a decidable class might rule out some interesting (decidable but non-obviously so) stuff which we could learn from. So we might end up just going with any computations (halting or no).

It might be useful to consider arbitrary programs, but the agent will still not execute any of them for more than some small finite time. You might argue that this is why it needs logic, to prove theorems about computations that it cannot execute. But, I don't believe this line of thinking leads to anything tractable that is not redundant w.r.t. a simpler logic-less approach.

[EDIT: Renamed "CSRL" to "TRL" 2019-09-19]

One related idea that I came up with recently is what I call "Turing reinforcement learning" (TRL). (It is rather half-baked, but I feel that the relevance is such that I should still bring it up. Also, it was inspired by our discussion, so thank you.) Imagine an RL agent connected to a computer. The agent can ask the computer to execute any program (for some finite amount of time ofc), and its prior already "knows" that such computations produce consistent answers (something about the semantics of programs can also be built into the prior, bringing us closer to logic, but I don't believe it can be anything nearly as strong as "proving theorems about programs using PA" if we want the algorithm to be even weakly feasible). This allows it to consider incomplete hypotheses that describe some parts of the environment in terms of programs (i.e. some physical phenomenon is hypotheses to behave equivalently to the execution of a particular program). The advantage of this approach: better management of computational resources. In classical RL, the resources are allocated "statically": any hypothesis is either in the prior or not. In contrast, in TRL the agent can decide itself which programs are worth running and when. Starting from generic regret bounds for incomplete models, this should lead to TRL satisfying some desiderata superior to classical RL with the same amount of resources, although I haven't worked out the details yet.

This gets back to my earlier intuition about agents having a reasonable chance of getting certain things right on the first try. Learning-theoretic agents don't get things right on the first try. However, agents who learn from logic have "lots of tries" before their first real try in physical time. If you can successfully determine which logical scenarios are relevantly analogous to your own, you can learn what to do just by thinking.

This seems to me the wrong way to look at it. There is no such thing as "learning-theoretic agents". There are agents with good learning-theoretic properties. If there is a property that you can prove in learning-theory to be impossible then no agent can have this property. Once again, I am not ruling out this entire line of thought, but I am arguing for looking for desiderata that can formalize it, instead of using it to justify ad hoc algorithms (e.g. algorithms that use logic.)

So, getting back to Vanessa's point in the comment I quoted: can we solve MIRI-style decision problems by considering the iterated problem, rather than the single-shot version? To a large extent, I think so: in logical time, all games are iterated games. However, I don't want to have to set an agent up with a training sequence in which it encounters those specific problems many times. For example, finding good strategies in chess via self-play should come naturally from the way the agent thinks about the world, rather than being an explicit training regime which the designer has to implement. Once the rules for chess are understood, the bottleneck should be thinking time rather than (physical) training instances.

But, you have to set an agent with a training sequence, otherwise how does it ever figure out the rules of chess? I definitely agree that in some situations computational complexity is the bottleneck rather the sample complexity. My point was different: if you don't add the context of a training sequences, you might end up with philosophical questions that are either poor starting points or don't make sense at all.

Comment by vanessa-kosoy on Russian x-risks newsletter, summer 2019 · 2019-09-07T19:39:25.904Z · score: 6 (3 votes) · LW · GW

I wonder whether there is a reason this is not on the front page?

Comment by vanessa-kosoy on Formalising decision theory is hard · 2019-09-06T16:24:15.768Z · score: 4 (2 votes) · LW · GW

I believe that this indeed solves both INP and IXB.

I'd like to understand this part.

Every round of IXB has the following structure:

1. Blackmail either arrives or not (observation)
2. Agent either pays or not pays the blackmail (action)
3. Infestation either happens or not (observation)

Suppose that on every round, the termite infestation happens with probability and its cost is . Then, this fact corresponds on an incomplete model (i.e., says that regardless of whether blackmailed arrived and regardless of whether blackmails was paid, the probability is ). is incomplete because it doesn't say anything about whether blackmail arrives or not. If is true, the agent can guarantee a payoff of per round (by rejecting the blackmail). Therefore, if the agent has a learnable prior that includes , it will converge to a payoff which is no less than . Of course achieving this payoff requires actually rejecting the blackmail.

This might seem surprising at first, because there is also a different incomplete model that says "if you pays the blackmail, infestation will not happen". is false if you use physical causal counterfactuals, but from the agent's perspective is consistent with all observations. However, only guarantees the payoff (because it is unknown whether the blackmail will arrive). Therefore, will have no effect on the ultimate behavior of the agent.

I'm not sure I should find the precommitment solution satisfying. Won't it make some stupid precommitments early (before it has learned enough about the world to make reasonable precommitments) and screw itself up forever? Is there a generally applicable version of precommitments which ensures learning good behavior?

The precommitments have to expire after some finite time. Incidentally, the agent can still screw itself up forever if the environment contains trap, which is a separate problem.

Notice that I am not making the claim that any sophisticated agent has to be capable of precommitments. This is because, I am not convinced that counterfactual mugging belongs to some class of problems that any sophisticated agent should be able to solve (the "fair" problems). Of course a sophisticated agent which suspects its environment contains this type of situation might want to create a descendant that is capable of precommitments.

The only class of problems that I'm genuinely unsure how to deal with is game-theoretic superrationality.

Let me add that I am not even sure what are the correct desiderata. In particular, I don't think that we should expect any group of good agents to converge to a Pareto optimal outcome. IMO in general it is more reasonable to expect that they converge to a Nash equilibrium (or some refinement, like a proper equilibrium). If the setting is iterated, or if they see each other's source code, then some equilibria are Pareto optimal, and perhaps under some reasonable assumptions convergence to Pareto optimal outcomes should be overwhelmingly likely (because of something like, these outcomes have the largest basin of attraction).

Comment by vanessa-kosoy on Algorithmic Similarity · 2019-09-06T15:35:43.394Z · score: 2 (1 votes) · LW · GW

Not sure what do you mean by "computational complexity behaves kind of weirdly at low complexities"? In this case, I would be tempted to try the complexity class (logarithmic space complexity).

The most natural encoding is your "qualia", your raw sense data. This still leaves some freedom for how do you represent it, but this freedom has only a very minor effect.

Comment by vanessa-kosoy on AI Forecasting Resolution Council (Forecasting infrastructure, part 2) · 2019-08-30T18:46:55.009Z · score: 7 (3 votes) · LW · GW

Suppose in 2000 you use “superhuman Othello from self-play” as a benchmark of a certain kind of impressive AI progress, and forecast it to be possible by 2020. It seems you were correct -- very plausibly the AlphaZero architecture should work for this. However, in a strict sense your forecast was wrong -- because no one has actually bothered to build a powerful Othello agent.

This might be a bad example? Quoting the Wikipedia: "There are many Othello programs... that can be downloaded from the Internet for free. These programs, when run on any up-to-date computer, can play games in which the best human players are easily defeated." Well, arguably they are not "from self-play" because they use hand-craft evaluation functions. But, "no one has actually bothered to build a powerful Othello agent" seems just plain wrong.

Comment by vanessa-kosoy on Torture and Dust Specks and Joy--Oh my! or: Non-Archimedean Utility Functions as Pseudograded Vector Spaces · 2019-08-29T16:39:03.355Z · score: 4 (2 votes) · LW · GW

Alright, so if you already know any fact which connects cereal/granola to high severity consequences (however slim the connection is) then you should choose based on those facts only.

My intuition is, you will never have a coherent theory of agency with non-Archimedean utility functions. I will change my mind when I see something that looks like such the beginning of such a theory (e.g. some reasonable analysis of reinforcement learning for non-Archimedean utility).

If we had infinite compute, that would not eliminate empirical uncertainty. There are many things you cannot compute because you just don't have enough information. This is why in learning theory sample complexity is distinct from computational complexity, and applies to algorithms with unlimited computational resources. So, you would definitely still need to take expectations.

Comment by vanessa-kosoy on Torture and Dust Specks and Joy--Oh my! or: Non-Archimedean Utility Functions as Pseudograded Vector Spaces · 2019-08-29T15:16:05.990Z · score: 2 (1 votes) · LW · GW

If you're choosing between granola and cereal, then it might be that the granola manufacturer is owned by a multinational conglomerate that also has vested interests in the oil industry, and therefore, buying granola contributes to climate change that has consequences of much larger severity than your usual considerations for granola vs. cereal. Of course there might also be arguments in favor of cereal. But, if you take your non-Archimedean preferences seriously, it is absolutely worth the time to write down all of the arguments in favor of both options, and then choose the side which has the advantage in the largest severity class, however microscopic that advantage is.

Comment by vanessa-kosoy on Torture and Dust Specks and Joy--Oh my! or: Non-Archimedean Utility Functions as Pseudograded Vector Spaces · 2019-08-29T15:09:50.245Z · score: 3 (2 votes) · LW · GW

If I understand what you do correctly, the severity classes are just the set differences , where is the filtration. I think that you also assume that the quotient is one-dimensional and equipped with a choice of "positive" direction.

Comment by vanessa-kosoy on Computational Model: Causal Diagrams with Symmetry · 2019-08-24T22:47:08.438Z · score: 2 (1 votes) · LW · GW

It doesn't really have much to do with particles vs. fields. We talk about measurements because measurements are the thing we actually observe. It seems strange to say you can model the world as a causal network if the causal network doesn't include your actual observations. If you want to choose a particular frame of reference and write down the wavefunction time evolution in that frame (while ignoring space) then you can say it's a causal network (which is just a linear chain, and deterministic at that) but IMO that's not very informative. It also loses the property of having things made of parts, which AFAIU was one of your objectives here.

Comment by vanessa-kosoy on Computational Model: Causal Diagrams with Symmetry · 2019-08-24T21:16:24.780Z · score: 2 (1 votes) · LW · GW

Hmm, no, not really. The quantum fields follow "causality" in some quantum sense (roughly speaking, operators in spacelike separations commute, and any local operator can be expressed in terms of operators localized near the intersection of its past light-cone with any spacelike hypersurface), which is different from the sense used in causal DAGsa (in fact you can define "quantum causal DAGs" which is a different mathematical object). Violation of Bell's inequality precisely means that you can't describe the system by a causal DAG. If you want to do the MWI, then the wavefunction doesn't even decompose into data that can be localized.

Comment by vanessa-kosoy on Computational Model: Causal Diagrams with Symmetry · 2019-08-24T16:40:02.424Z · score: 6 (3 votes) · LW · GW

• Because of quantum physics, the universe is actually not described by a causal DAG very well (if we require microscopic precision). But I agree that in practice there are useful approximate descriptions that do look like a causal DAG.

• One direction towards studying the learning of symmetric causal DAGs (of a sort) is my ideas about cellular decision processes and the more general "graph decision processes", see this comment.

• As a side note, I'm not convinced that "breaking open black boxes" to solve "embedded agency" is a useful way of thinking. IMO the problems associated with "embedded agency" should be solved using a combination of incomplete models and correctly dealing with traps.

Comment by vanessa-kosoy on Embedded Naive Bayes · 2019-08-24T16:25:42.160Z · score: 6 (3 votes) · LW · GW

This seems interesting, but I think there are some assumptions that have been left out? What prevents us from taking and to be constants, which would give solutions for any ?

Comment by vanessa-kosoy on Deconfuse Yourself about Agency · 2019-08-24T15:11:38.426Z · score: 8 (4 votes) · LW · GW

I think that the concept of "agency" (although maybe "intelligence" would be a better word?), in the context of AI alignment, implies the ability to learn the environment and exploit this knowledge towards a certain goal. The only way to pursue a goal effectively without learning is having hard-coded knowledge of the environment. But, where would this knowledge come from? For complex environments, it is only likely to come from learning algorithms upstream.

So, a rock is definitely not an agent since there is nothing it learns about its environment (I am not even sure what the input/output channels of a rock are supposed to be). Q-learning is an agent, but the resulting policy is not an agent in itself. Similarly, AlphaGo is a sort of agent when regarded together with the training loop (it can in principle learn to play different games), but not when disconnected from it. Evolution is an agent, even if not a very powerful one. An ant colony is probably a little agentic because it can learn something, although I'm not sure how much.

Comment by vanessa-kosoy on Creating Environments to Design and Test Embedded Agents · 2019-08-24T13:50:17.559Z · score: 14 (5 votes) · LW · GW

It seems like collecting dollars requires a hard-coded notion of how to draw the boundary around the agents, which runs contrary to the intention. It seems more natural for require the agents to strive to change the world in a particular way (e.g. maximize the number of rubes in the world).

Comment by vanessa-kosoy on Torture and Dust Specks and Joy--Oh my! or: Non-Archimedean Utility Functions as Pseudograded Vector Spaces · 2019-08-24T13:20:22.409Z · score: 5 (3 votes) · LW · GW

• The thing you called "pseudograding" is normally called "filtration".

• In practice, because of the complexity of the world, and especially because of the presence of probabilistic uncertainty, an agent following a non-Archimedean utility function will always consider only the component corresponding to the absolute maximum of , since there will never be a choice between A and B such that these components just happen to be exactly equal. So it will be equivalent to an Archimedean agent whose utility is this worst component. (You can have an without an absolute maximum but I don't think it's possible to define any reasonable utility function like that, where by "reasonable" I roughly mean that, it's possible to build some good theory of reinforcement learning out of it.)

Comment by vanessa-kosoy on Algorithmic Similarity · 2019-08-24T12:29:33.147Z · score: 9 (5 votes) · LW · GW

• Regarding algorithmic similarity. This is an idea I just thought of this moment, so I'm not sure how solid it is, but. Given Turing machines and that compute the same functions, we want to say whether in some sense they do it "in the same way". Let's consider, for any input , the entire histories of intermediate states of the computations and . Call them and . We then say that and are "algorithmically equivalent" when there is a low complexity algorithm that, given access to , can produce any given part of , and vice versa. In particular, the complexity of must be much lower than the complexity of running from the beginning. Here, it seems useful to play with different types of complexity bounds (including time/space for example).

• Regarding waterfalls and human beings. I think that a waterfall is not simulating a human being, because there is no algorithm of simultaneously low description complexity and low computational complexity that can decode a human being from a waterfall. Ofc it is not a binary distinction but a fuzzy distinction (the simpler the decoding algorithm is, the more reasonable it is to say a human being is there).

• Regarding diamond optimizers. I think that the right way to design such an optimizer would be using an instrumental reward function. We then remain with the problem of how to specify this function. We could start with some ontology or class of ontologies that can be reasonably said to contain diamonds, and for which we can define the reward function unambiguously. These ontologies are then mapped into the space of instrumental states, giving us a partial specification of the instrumental reward function (it is specified on the affine span of the images of the ontologies). Then, there is the quesiton of how to extend the reward function to the entire instrumental state space. I wrote a few thoughts about that in the linked essay, but another approach we can take is, considering all extensions that have same range of values. These form a convex set, that can be interpreted as Knightian uncertainty regarding the reward function. We can then consider maximin policies for this set to be "diamond maximizers". In other words, we want the maximizer to be cautious/conservative about judging the number of diamonds on states that lie outside the ontologies.

Comment by vanessa-kosoy on Thoughts on Retrieving Knowledge from Neural Networks · 2019-08-24T11:53:04.141Z · score: 6 (3 votes) · LW · GW

I think that there is an interesting subproblem here, which is, given a policy (a RL agent), determine the belief state of this policy. I haven't thought much about this, but it seems reasonable to look at belief states (distributions over environments) w.r.t. which the policy has a sufficiently strong regret bound. This might still leave a lot of ambiguity that has to be regularized somehow.

Comment by vanessa-kosoy on When do utility functions constrain? · 2019-08-24T11:10:52.705Z · score: 7 (5 votes) · LW · GW

The problem can be ameliorated by constraining to instrumental reward functions. This gives us agents that are, in some sense, optimizing the state of the environment rather than an arbitrary function of their own behavior. I think this is a better model of what it means to be "goal-directed" than classical reward functions.

Another thing we can do is just applying Occam's razor, i.e requiring the utility function (and prior) to have low description complexity. This can be interpreted as, taking the intentional stance towards a system is only useful if it results in compression.

Comment by vanessa-kosoy on Understanding understanding · 2019-08-24T10:56:23.558Z · score: 4 (2 votes) · LW · GW

I think this is a useful perspective. A way to reformulate (part of) this is: an explanation is good when it looks like it was sampled from the distribution of explanations you would come up with yourself, if you thought about the problem long enough. This resonates a lot with some of my (and other people's) thought about AI alignment, where we try to constrain the AI to "human-like" behavior or make it imitate human behavior (this happens in Delegative Reinforcement Learning, Quantilization and other approaches).

Comment by vanessa-kosoy on Optimization Provenance · 2019-08-24T10:36:08.919Z · score: 5 (4 votes) · LW · GW

The idea of explicit world models reminds me of my research direction concerning cellular decision processes. Note that the "grid" of such a process can be replaced by an arbitrary graph, which can also evolve dynamically, and that makes it quite close to the notion of, representing the world as a collection of objects and relationships/interactions. I did not know about the sheaf theoretic angle (added to reading list, thank you!), it might be interesting to see whether these two combine in a natural way.

Comment by vanessa-kosoy on Formalising decision theory is hard · 2019-08-23T19:13:48.192Z · score: 2 (1 votes) · LW · GW

RL is CDT in the sense that, your model of the world consists of actions and observations, and some causal link from past actions and observations to current observations, but there is no causal origin to the actions. The actions are just set by the agent to whatever it wants.

And, yes, I got CDT and EDT flipped there, good catch!

Comment by vanessa-kosoy on Formalising decision theory is hard · 2019-08-23T17:17:08.413Z · score: 17 (10 votes) · LW · GW

Heterodox opinion: I think the entire MIRIesque (and academic philosophy) approach to decision theory is confused. The basic assumption seems to be, that we can decouple the problem of learning a model of the world from the problem of taking a decision given such a model. We then ignore the first problem, and assume a particular shape for the model (for example, causal network) which allows us to consider decision theories such as CDT, EDT etc. However, in reality the two problems cannot be decoupled. This is because the type signature of a world model is only meaningful if it comes with an algorithm for how to learn a model of this type.

For example, consider Newcomb's paradox. The agent makes a decision under the assumption that Omega behaves in a certain way. But, where did the assumption come from? Realistic agents have to learn everything they know. Learning normally requires a time sequence. For example, we can consider the iterated Newcomb's paradox (INP). In INP, any reinforcement learning (RL) algorithm will converge to one-boxing, simply because one-boxing gives it the money. This is despite RL naively looking like CDT. Why does it happen? Because in the learned model, the "causal" relationships are not physical causality. The agent comes to believe that taking the one box causes the money to appear there.

In Newcomb's paradox EDT succeeds but CDT fails. Let's consider an example where CDT succeeds and EDT fails: the XOR blackmail. The iterated version would be IXB. In IXB, classical RL doesn't guarantee much because the environment is more complex than the agent (it contains Omega). To overcome this, we can use RL with incomplete models. I believe that this indeed solves both INP and IXB.

Then we can consider e.g. counterfactual mugging. In counterfactual mugging, RL with incomplete models doesn't work. That's because the assumption that Omega responds in a way that depends on a counterfactual world is not in the space of models at all. Indeed, it's unclear how can any agent learn such a fact from empirical observations. One way to fix it is by allowing the agent to precommit. Then the assumption about Omega becomes empirically verifiable. But, if we do this, then RL with incomplete models can solve the problem again.

The only class of problems that I'm genuinely unsure how to deal with is game-theoretic superrationality. However, I also don't see much evidence the MIRIesque approach has succeeded on that front. We probably need to start with just solving the grain of truth problem in the sense of converging to ordinary Nash (or similar) equilibria (which might be possible using incomplete models). Later we can consider agents that observe each other's source code, and maybe something along the lines of this can apply.

Comment by vanessa-kosoy on Vague Thoughts and Questions about Agent Structures · 2019-08-23T16:39:01.502Z · score: 6 (3 votes) · LW · GW

The idea of agent clusters seems closely related to my idea about modeling dynamically inconsistent agents. In my model, each subagent controls particular states. More general game-theoretic models can be considered, by they seem to have worse properties.

Regarding the broader question, about how to bridge the gap between agents that are "ideal, perfectly rational consequentialists" and "realistic" agents (e.g. humans), more factors that can be relevant are:

• Realistic agents have computational resource bounds.

• Realistic agents might be learning algorithms with suboptimal sample complexity (i.e. it takes them longer to learn that a perfect agent).

• Realistic agents might only remain coherent within some subspace of the state space (although that might be possible to model using dynamic inconsistency).

• We can also consider agents that have some "Knightian uncertainty" about their own utility function. For example, we can consider a convex set in the space of utility functions, and have the agent follow the maximin policy w.r.t. this convex set. As a more specific example, we can consider an instrumental reward function that is only defined on some affine subspace of the instrumental state space, and consider all extensions of it to the entire instrumental state space that don't increase its range.

Comment by vanessa-kosoy on Three Levels of Motivation · 2019-08-23T16:21:15.758Z · score: 3 (2 votes) · LW · GW
• I think this is a useful taxonomy.
• The disparity between L2 and L3 is closely related to mesa-optimization.
• The problem that "every agent trivially optimises for being the agent that they are" can be somewhat ameliorated by constraining ourselves to instrumental reward functions.
Comment by vanessa-kosoy on Book Review: Secular Cycles · 2019-08-15T20:27:24.136Z · score: 17 (11 votes) · LW · GW

It might be difficult to tell the difference between a periodic pattern and a random walk. A random walk also produces periods of "bad" and periods of "good" s.t. after every bad period there is a good period (because what else can there be?) and vice versa. The durations of different "cycles" would also usually be of similar magnitude. And, a priori, a random walk seems more likely and require less explanation.

Tangentially, I think it would be really great if there were standard benchmarks on which quantitative theories of society/history/economics could compete by measuring to which extent they can compress the data. That would make it much easier to understand what is real and what is overfitting.

Comment by vanessa-kosoy on AI Alignment Open Thread August 2019 · 2019-08-10T10:52:17.795Z · score: 4 (2 votes) · LW · GW

Well, any system that satisfies the Minimal Requirement is doing long term planning on some level. For example, if your AI is approval directed, it still needs to learn how to make good plans that will be approved. Once your system has a superhuman capability of producing plans somewhere inside, you should worry about that capability being applied in the wrong direction (in particular due to mesa-optimization / daemons). Also, even without long term planning, extreme optimization is dangerous (for example an approval directed AI might create some kind of memetic supervirus).

But, I agree that these arguments are not enough to be confident of the strong empirical claim.

Comment by vanessa-kosoy on AI Alignment Open Thread August 2019 · 2019-08-09T17:02:34.977Z · score: 6 (3 votes) · LW · GW

a) I believe a weaker version of the empirical claim, namely that the catastrophe is not nearly inevitable but not unlikely. That is, I can imagine different worlds in which the probability of the catastrophe is different, and I have uncertainty over in which world we actually are, s.t. in average the probability is sizable.

b) I think that the argument you gave is sort of correct. We need to augment it by: the minimal requirement from the AI is, it needs to effectively block all competing dangerous AI projects, without also doing bad things (which is why you can't just give it the zero utility function). Your counterargument seems weak to me because, moving from utility maximizes to other types of AIs is just replacing something that is relatively easy to reason about with something that it is harder to reason about, thereby obscuring the problems (that are still there). I think that whatever your AI is, given that is satisfies the minimal requirement, some kind of utility-maximization-like behavior is likely to arise.

Coming at it from a different angle, complicated systems often fail in unexpected ways. The way people solve this problem in practice is by a combination of mathematical analysis and empirical research. I don't think we have many examples of complicated systems where all failures were avoided by informal reasoning without either empirical or mathematical backing. In the case of superintelligent AI, empirical research alone is insufficient because, without mathematical models, we don't know how to extrapolate empirical results from current AIs to superintelligent AIs, and when superintelligent algorithms are already here it will probably be too late.

c) I think what we can (and should) realistically aim for is, having a mathematical theory of AI, and having a mathematical model of our particular AI, such that in this model we can prove the AI is safe. This model will have some assumptions and parameters that will need to be verified/measured in other ways, through some combination of (i) experiments with AI/algorithms (ii) learning from neuroscience (iii) learning from biological evolution and (iv) leveraging our knowledge of physics. Then, there is also the question of, how precise is the correspondence between the model and the actual code (and hardware). Ideally, we want to do formal verification in which we can test that a certain theorem holds for the actual code we are running. Weaker levels of correspondence might still be sufficient, but that would be Plan B.

Also, the proof can rely on mathematical conjectures in which we have high confidence, such as . Of course, the evidence for such conjectures is (some sort of) empirical, but it is important that the conjecture is at least a rigorous, well defined mathematical statement.

Comment by vanessa-kosoy on AI Alignment Open Thread August 2019 · 2019-08-09T16:19:44.298Z · score: 4 (2 votes) · LW · GW

This idea was inspired by a discussion with Discord user @jbeshir

Model dynamically inconsistent agents (in particular humans) as having a different reward function at every state of the environment MDP (i.e. at every state we have a reward function that assigns values both to this state and to all other states: we have a reward matrix ). This should be regarded as a game where a different player controls the action at every state. We can now look for value learning protocols that converge to Nash* (or other kind of) equilibrium in this game.

The simplest setting would be, every time you visit a state, you learn the reward of all previous states w.r.t. the reward function of the current state. Alternatively, every time you visit a state, you can ask about the reward of one previously visited state w.r.t. the reward function of the current state. This is the analogue of classical reinforcement learning with an explicit reward channel. We can now try to prove a regret bound, which takes the form of an -Nash equilibrium condition, with being the regret. More complicated settings would be analogues of Delegative RL (where the advisor also follows the reward function of the current state) and other value learning protocols.

This seems like a more elegant way to model "corruption" than as a binary or continuous one dimensional variable like I did before.

*Note that although for general games, even if they are purely coorperative, Nash equilibria can be suboptimal due to coordination problems, for this type of games it doesn't happen: in the purely cooperative case, the Nash equilibrium condition becomes the Bellman equation that implies global optimality.

Comment by vanessa-kosoy on AI Alignment Open Thread August 2019 · 2019-08-09T15:52:37.446Z · score: 4 (2 votes) · LW · GW

It is an interesting problem to write explicit regret bounds for reinforcement learning with a prior that is the Solomonoff prior or something similar. Of course, any regret bound requires dealing with traps. The simplest approach is, leaving only environments without traps in the prior (there are technical details involved that I won't go into right now). However, after that we are still left with a different major problem. The regret bound we get is very weak. This happens because the prior contains sets of hypotheses of the form "program template augmented by a hard-coded bit string of length ". Such a set contains hypotheses, and its total probability mass is approximately , which is significant for short (even when is large). However, the definition of regret requires out agent to compete against a hypothetical agent that knows the true environment, which in this case means knowing both and . Such a contest is very hard since learning bits can take much time for large .

Note that the definition of regret depends on how we decompose the prior into a convex combination of individual hypotheses. To solve this problem, I propose redefining regret in this setting by grouping the hypotheses in a particular way. Specifically, in algorithmic statistics there is the concept of sophistication. The sophistication of a bit string is defined as follows. First, we consider the Kolmogorov complexity of . Then we consider pairs where is a program, is a bit string, and . Finally, we minimize over . The minimal is called the sophistication of . For our problem, we are interested in the minimal itself: I call it the "sophisticated core" of . We now group the hypotheses in our prior by sophisticated cores, and define (Bayesian) regret w.r.t. this grouping.

Coming back to our problematic set of hypotheses, most of it is now grouped into a single hypothesis, corresponding to the sophisticated core of . Therefore, the reference agent in the definition of regret now knows but doesn't know , making it feasible to compete with it.

Comment by vanessa-kosoy on AI Alignment Open Thread August 2019 · 2019-08-09T15:32:19.195Z · score: 4 (2 votes) · LW · GW

A variant of Christiano's IDA amenable to learning-theoretic analysis. We consider reinforcement learning with a set of observations and a set of actions, where the semantics of the actions is making predictions about future observations. (But, as opposed to vanilla sequence forecasting, making predictions affects the environment.) The reward function is unknown and unobservable, but it is known to satisfy two assumptions:

(i) If we make the null prediction always, the expected utility will be lower bounded by some constant.

(ii) If our predictions sample the -step future for a given policy , then our expected utility will be lower bounded by some function of the the expected utility of and . is s.t. for sufficiently low , but for sufficiently high , (in particular the constant in (i) should be high enough to lead to an increasing sequence). Also, it can be argued that it's natural to assume for .

The goal is proving regret bounds for this setting. Note that this setting automatically deals with malign hypotheses in the prior, bad self-fulfilling prophecies and "corrupting" predictions that cause damage just by seeing them.

However, I expect that without additional assumptions the regret bound will be fairly weak, since the penalty for making wrong predictions grows with the total variation distance between the prediction distribution and the true distribution, which is quite harsh. I think this reflects a true weakness of IDA (or some versions of it, at least): without an explicit model of the utility function, we need very high fidelity to guarantee robustness against e.g. malign hypotheses. On the other hand, it might be possible to ameliorate this problem if we introduce an additional assumption of the form: the utility function is e.g. Lipschitz w.r.t some metric . Then, total variation distance is replaced by Kantorovich-Rubinstein distance defined w.r.t. . The question is, where do we get the metric from. Maybe we can consider something like the process of humans rewriting texts into equivalent texts.

Comment by vanessa-kosoy on AI Alignment Open Thread August 2019 · 2019-08-09T15:10:24.484Z · score: 6 (4 votes) · LW · GW

In last year's essay about my research agenda I wrote about an approach I call "learning by teaching" (LBT). In LBT, an AI is learning human values by trying to give advice to a human and seeing how the human changes eir behavior (without an explicit reward signal). Roughly speaking, if the human permanently changes eir behavior as a result of the advice, then one can assume the advice was useful. Partial protection against manipulative advice is provided by a delegation mechanism, which ensures the AI only produces advice that is in the space of "possible pieces of advice a human could give" in some sense. However, this protection seems insufficient since it allows for giving all arguments in favor of a position without giving any arguments against a position.

To add more defense against manipulation, I propose to build on the "AI debate" idea. However, in this scheme, we don't need more than one AI. In fact, this is a general fact: for any protocol involving multiple AIs, there is a protocol involving just one AI that works (at least roughly, qualitatively) just as well. Proof sketch: If we can prove that under assumptions , the protocol is safe/effective, then we can design a single AI which has assumptions baked into its prior. Such an AI would be able to understand that simulating protocol would lead to a safe/effective outcome, and would only choose a different strategy if it leads to an even better outcome under the same assumptions.

The way we use "AI debate" is not by implementing an actual AI debate. Instead, we use it to formalize our assumptions about human behavior. In ordinary IRL, the usual assumption is "a human is a (nearly) optimal agent for eir utility function". In the original version of LBT, the assumption was of the form "a human is (nearly) optimal when receiving optimal advice". In debate-LBT the assumption becomes "a human is (nearly) optimal* when exposed to a debate between two agents at least one of which is giving optimal advice". Here, the human observes this hypothetical debate through the same "cheap talk" channel through which it receives advice from the single AI.

Notice that debate can be considered to be a form of interactive proof system (with two or more provers). However, the requirements are different from classical proof systems. In classical theory, the requirement is "When the prover is honestly arguing for a correct proposition, the verifier is convinced. For any prover the verifier cannot be convinced of a false proposition." In "debate proof systems" the requirement is "If at least one prover is honest, the verifier comes to the correct conclusion". That is, we don't guarantee anything when both provers are dishonest. It is easy to see that these debate proof systems admit any problem in PSPACE: given a game, both provers can state their assertions as to which side wins the game, and if they disagree they have to play the game for the corresponding sides.

*Fiddling with the assumptions a little, instead of "optimal" we can probably just say that the AI is guaranteed to achieve this level of performance, what it is.

Comment by vanessa-kosoy on AI Alignment Open Thread August 2019 · 2019-08-09T13:09:59.068Z · score: 10 (4 votes) · LW · GW

I'm posting a few research directions in my research agenda about which I haven't written much elsewhere (except maybe in the MIRIx Discord server), and for which I so far haven't got the time to make a full length essay with mathematical details. Each direction is in a separate child comment.

Comment by vanessa-kosoy on [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming · 2019-06-26T11:18:33.520Z · score: 10 (3 votes) · LW · GW

I focus mostly on formal properties algorithms can or cannot have, rather than the algorithms themselves. So, from my point of view, it doesn't matter whether the prior is "explicit" and I doubt it's even a well-defined question. What I mean by "prior" is, more or less, whatever probability measure has the best Bayesian regret bound for the given RL algorithm.

I think the prior will have to look somewhat like the universal prior. Occam's razor is a foundational principle of rationality, and any reasonable algorithm should have inductive bias towards simpler hypotheses. I think there's even some work trying to prove that deep learning already has such inductive bias. At the same time, the space of hypotheses has to be very rich (although still constrained by computational resources and some additional structural assumptions needed to make learning feasible).

I think that DRL doesn't require a prior (or, more generally, algorithmic building blocks) substantially different from what is needed for capabilities, since if your algorithm is superintelligent (in the sense that, it's relevant to either causing or mitigating X-risk) then it has to create sophisticated models of the world that include people, among other things, and therefore forcing it to model the advisor as well doesn't make the task substantially harder (well, it is harder in the sense that the regret bound is weaker, but that is not because of the prior).

Comment by vanessa-kosoy on [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming · 2019-06-23T14:02:39.740Z · score: 2 (1 votes) · LW · GW

For states that cause existential catastrophes this seems obviously desirable. Maybe my objection is more that with this sort of algorithm you need to have the right set of hypotheses in the first place, and that seems like the main difficulty?

Ultimately, the set of hypotheses should be something like the universal prior. More precisely, it should be whatever we need to use instead of the universal prior to get "general" intelligence that is computationally efficient. In other words, I don't think the choice of prior here is substantially different or more difficult from the choice of prior for AGI from a pure capability POV.

Maybe I'm also saying that this feels vulnerable to nearest unblocked strategies. Suppose the AI has learned that its reward function is to maximize paperclips, and the advisor doesn't realize that a complicated gadget the AI has built is a self-replicating nanorobot that will autonomously convert atoms into paperclips. It doesn't seem like DRL saves us here.

The way I imagine it will work, the advisor will not do something weird and complicated that ey don't understand emself. So the AI also cannot do something weird and complicated that the advisor doesn't understand. In order for this not to be too constraining, I imagine the advisor having some kind of "diary" where ey write down eir thoughts and reasoning, which is considered a part of normal interaction with the environment. The advisor will only act on complicated plans after laying down the reasoning. The AI will then have to write down in this diary an explanation for its plans that will be understandable by and convincing for the advisor. This is partially protected from manipulations, because the explanation has to come from the space of explanations that the advisor could produce. That said, I think this defense from manipulation is insufficient in itself (because the AI can write down all arguments for a given position that the advisor could come up with, without writing down any arguments against it), and I have a research direction based on the "debate" approach about how to strengthen it.

Maybe another way of putting it -- is there additional safety conferred by this approach that you couldn't get by having a human review all of the AI's actions? If so, should I think of this as "we want a human to review actions, but that's expensive, DRL is a way to make it more sample efficient"?

The current version of the formalism is more or less the latter, but you should imagine the review to be rather conservative (like in the nonorobot example). In the "soft" version it will become a limit on how much the AI policy deviates from the advisor policy, so it's not quite a review in the usual sense: there is no binary division between "legal" and "illegal" actions. I think of it more like, the AI should emulate an "improved" version of the advisor: do all the things the advisor would do on eir "best day".

Comment by vanessa-kosoy on How does Gradient Descent Interact with Goodhart? · 2019-06-19T14:18:10.365Z · score: 4 (2 votes) · LW · GW

Hi David, if you want to discuss this more, I think we can do it in person? AFAIK you live in Israel? For example, you can come to my talk in the LessWrong meetup on July 2.

Comment by vanessa-kosoy on [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming · 2019-06-13T12:41:57.268Z · score: 5 (2 votes) · LW · GW

Dealing with corrupt states requires a "different" algorithm, but the modification is rather trivial: for each hypothesis that includes dynamics and corruption, you need to replace the corrupt states by an inescapable state with reward zero and run the usual PSRL algorithm on this new prior. Indeed, the algorithm deals with corruption by never letting the agent go there. I am not sure I understand why you think this is not a good approach. Consider a corrupt state in which the human's brain has been somehow scrambled to make em give high rewards. Do you think such a state should be explored? Maybe your complaint is that in the real world corruption is continuous rather than binary, and the advisor avoids most of corruption but not all of it and not with 100% success probability. In this case, I agree, the current model is extremely simplified, but it still feels like progress. You can see this for a model of continuous corruption in DIRL, a simpler setting. More generally, I think that a better version of the formalism would build on ideas from quantilization and catastrophe mitigation to arrive at a setting where, you have a low rate of falling into traps or accumulating corruption as long as your policy remains "close enough" to the advisor policy w.r.t. some metric similar to infinity-Renyi divergence (and, as long as your corruption remains low).

Comment by vanessa-kosoy on [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming · 2019-06-08T08:43:57.952Z · score: 6 (3 votes) · LW · GW

Hi Rohin, thank you for writing about my work! I want to address some issues you brought up regarding Delegative RL.

I worry about there being a Cartesian boundary between the agent and the environment, though perhaps even here as long as the advisor is aware of problems caused by such a boundary, they can be modeled as traps and thus avoided.

Yes. I think that the Cartesian boundary is part of the definition of the agent, and events that violate the Cartesian boundary should be thought of as destroying the agent. Destruction of the agent is a certainly a trap since from the POV of the agent it is irreversible. See also the "death of the agent" subsection of the imperceptible rewards essay.

One thing I wonder about is whether the focus on traps is necessary. With the presence of traps in the theoretical model, one of the main challenges is in preventing the agent from falling into a trap due to ignorance. However, it seems extremely unlikely that an AI system manages to take some irreversible catastrophic action by accident -- I'm much more worried about the case where the AI system is adversarially optimizing against us and intentionally takes an irreversible catastrophic action.

I think that most of the "intentional" catastrophic actions can be regarded as "due to ignorance" from an appropriate perspective (the main exception is probably non-Cartesian daemons). Consider two examples:

Example 1 is corrupt states, that I discussed here. These are states in which the specified reward function doesn't match the intended reward function (and also possibly the advisor becomes unreliable). We can equip the agent with a prior that accounts for the existence of such states. However, without further help, the agent doesn't have enough information to know when it could enter one. So, if the agent decides to e.g. hack its own reward channel, one perspective is that it is an intentional action against us, but another perspective is that it is due to the agent's ignorance of the true model of corruption. This problem is indeed fixed by Delegative RL (assuming that, in uncorrupt states, the advisor's actions don't lead to corruption).

Example 2 is malign hypotheses. The agent's prior may contain hypotheses that are agentic in themselves. Such a hypothesis can intentionally produce correct predictions up to a "traitorous turn" point, at which it produces predictions that manipulate the agent into an irreversible catastrophic action. From the perspective of the "outer" agent this is "ignorance", but from the perspective of the "inner" agent, this is intentional. Once again delegative RL fixes it: at the traitorous turn, a DRL agent detects the ambiguity in predictions and the critical need to take the right action, leading it delegate. Observing the advisor's action leads it to update away from the malign hypothesis.

Comment by vanessa-kosoy on I translated 'Twelve Virtues of Rationality' into Hebrew. · 2019-06-02T07:24:02.595Z · score: 2 (1 votes) · LW · GW

You can create an empty profile just for the LessWrong group, but it's your call of course. It's just convenient for tracking meetups, and there are some interesting discussions. There is also a Google group but it is mostly dormant nowadays.

Comment by vanessa-kosoy on I translated 'Twelve Virtues of Rationality' into Hebrew. · 2019-06-01T20:38:26.213Z · score: 3 (2 votes) · LW · GW

How about posting it on the Facebook group and discussing more there? It might be worthwhile to put in on rationality.co.il for example.

Comment by vanessa-kosoy on Offer of collaboration and/or mentorship · 2019-05-26T18:00:14.443Z · score: 20 (8 votes) · LW · GW

Update: In total, 16 people contacted me. Offer of mentorship is closed, since I have sufficiently many candidates for now. Offer of collaboration remains open for experienced researchers (i.e. researchers that (i) have some track record of original math / theoretical compsci research, and (ii) are able to take on concrete open problems without much guidance).

Comment by vanessa-kosoy on TAISU - Technical AI Safety Unconference · 2019-05-24T21:59:50.766Z · score: 8 (4 votes) · LW · GW

Hey, it's an AI safety unconference, we should be able to coordinate acausally ;)

Comment by vanessa-kosoy on TAISU - Technical AI Safety Unconference · 2019-05-24T21:50:13.965Z · score: 2 (1 votes) · LW · GW

Is there a specific deadline for signing up?

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-21T21:37:48.717Z · score: 8 (4 votes) · LW · GW

AFAIU discussing charged political issues is not allowed, or at least very frowned upon on LW, and for good reasons. So, I can't discuss the object level. On the other hand, the meta level is too vague. That is, the error is in the way the abstract reasoning is applied to case X (it's just not the right model), rather than in the abstract reasoning itself.

Comment by vanessa-kosoy on Comment section from 05/19/2019 · 2019-05-21T21:31:21.392Z · score: 2 (1 votes) · LW · GW