Posts

Vanessa Kosoy's Shortform 2019-10-18T12:26:32.801Z · score: 9 (3 votes)
Biorisks and X-Risks 2019-10-07T23:29:14.898Z · score: 6 (1 votes)
Slate Star Codex Tel Aviv 2019 2019-09-05T18:29:53.039Z · score: 6 (1 votes)
Offer of collaboration and/or mentorship 2019-05-16T14:16:20.684Z · score: 109 (36 votes)
Reinforcement learning with imperceptible rewards 2019-04-07T10:27:34.127Z · score: 24 (10 votes)
Dimensional regret without resets 2018-11-16T19:22:32.551Z · score: 9 (4 votes)
Computational complexity of RL with traps 2018-08-29T09:17:08.655Z · score: 14 (5 votes)
Entropic Regret I: Deterministic MDPs 2018-08-16T13:08:15.570Z · score: 12 (7 votes)
Algo trading is a central example of AI risk 2018-07-28T20:31:55.422Z · score: 27 (16 votes)
The Learning-Theoretic AI Alignment Research Agenda 2018-07-04T09:53:31.000Z · score: 37 (13 votes)
Meta: IAFF vs LessWrong 2018-06-30T21:15:56.000Z · score: 1 (1 votes)
Computing an exact quantilal policy 2018-04-12T09:23:27.000Z · score: 10 (2 votes)
Quantilal control for finite MDPs 2018-04-12T09:21:10.000Z · score: 3 (3 votes)
Improved regret bound for DRL 2018-03-02T12:49:27.000Z · score: 0 (0 votes)
More precise regret bound for DRL 2018-02-14T11:58:31.000Z · score: 1 (1 votes)
Catastrophe Mitigation Using DRL (Appendices) 2018-02-14T11:57:47.000Z · score: 0 (0 votes)
Bugs? 2018-01-21T21:32:10.492Z · score: 4 (1 votes)
The Behavioral Economics of Welfare 2017-12-22T11:35:09.617Z · score: 28 (12 votes)
Improved formalism for corruption in DIRL 2017-11-30T16:52:42.000Z · score: 0 (0 votes)
Why DRL doesn't work for arbitrary environments 2017-11-30T12:22:37.000Z · score: 0 (0 votes)
Catastrophe Mitigation Using DRL 2017-11-22T05:54:42.000Z · score: 2 (1 votes)
Catastrophe Mitigation Using DRL 2017-11-17T15:38:18.000Z · score: 0 (0 votes)
Delegative Reinforcement Learning with a Merely Sane Advisor 2017-10-05T14:15:45.000Z · score: 1 (1 votes)
On the computational feasibility of forecasting using gamblers 2017-07-18T14:00:00.000Z · score: 0 (0 votes)
Delegative Inverse Reinforcement Learning 2017-07-12T12:18:22.000Z · score: 11 (3 votes)
Learning incomplete models using dominant markets 2017-04-28T09:57:16.000Z · score: 1 (1 votes)
Dominant stochastic markets 2017-03-17T12:16:55.000Z · score: 0 (0 votes)
A measure-theoretic generalization of logical induction 2017-01-18T13:56:20.000Z · score: 3 (3 votes)
Towards learning incomplete models using inner prediction markets 2017-01-08T13:37:53.000Z · score: 2 (2 votes)
Subagent perfect minimax 2017-01-06T13:47:12.000Z · score: 0 (0 votes)
Minimax forecasting 2016-12-14T08:22:13.000Z · score: 0 (0 votes)
Minimax and dynamic (in)consistency 2016-12-11T10:42:08.000Z · score: 0 (0 votes)
Attacking the grain of truth problem using Bayes-Savage agents 2016-10-20T14:41:56.000Z · score: 1 (1 votes)
IRL is hard 2016-09-13T14:55:26.000Z · score: 0 (0 votes)
Stabilizing logical counterfactuals by pseudorandomization 2016-05-25T12:05:07.000Z · score: 1 (1 votes)
Stability of optimal predictor schemes under a broader class of reductions 2016-04-30T14:17:35.000Z · score: 0 (0 votes)
Predictor schemes with logarithmic advice 2016-03-27T08:41:23.000Z · score: 1 (1 votes)
Reflection with optimal predictors 2016-03-22T17:20:37.000Z · score: 1 (1 votes)
Logical counterfactuals for random algorithms 2016-01-06T13:29:52.000Z · score: 3 (3 votes)
Quasi-optimal predictors 2015-12-25T14:17:05.000Z · score: 2 (2 votes)
Implementing CDT with optimal predictor systems 2015-12-20T12:58:44.000Z · score: 1 (1 votes)
Bounded Solomonoff induction using optimal predictor schemes 2015-11-10T13:59:29.000Z · score: 1 (1 votes)
Superrationality in arbitrary games 2015-11-04T18:20:41.000Z · score: 7 (6 votes)
Optimal predictor schemes 2015-11-01T17:28:46.000Z · score: 2 (2 votes)
Optimal predictors for global probability measures 2015-10-06T17:40:19.000Z · score: 0 (0 votes)
Logical counterfactuals using optimal predictor schemes 2015-10-04T19:48:23.000Z · score: 0 (0 votes)
Towards reflection with relative optimal predictor schemes 2015-09-30T15:44:21.000Z · score: 1 (1 votes)
Improved error space for universal optimal predictor schemes 2015-09-30T15:08:53.000Z · score: 0 (0 votes)
Optimal predictor schemes pass a Benford test 2015-08-30T13:25:59.000Z · score: 3 (3 votes)
Optimal predictors and propositional calculus 2015-07-04T09:51:38.000Z · score: 0 (0 votes)

Comments

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-11-13T17:36:40.891Z · score: 2 (1 votes) · LW · GW

Well, I think that maximin is the right thing to do because it leads to reasonable guarantees for quasi-Bayesian reinforcement learning agents. I think of incomplete models as properties that the environment might satisfy. It is necessary to speak of properties instead of complete models since the environment might be too complex to understand in full (for example because it contains Omega, but also for more prosaic reasons), but we can hope it at least has properties/patterns the agent can understand. A quasi-Bayesian agent has the guarantee that, whenever the environment satisfies one of the properties in its prior, the expected utility will converge at least to the maximin for this property. In other words, such an agent is able to exploit any true property of the environment it can understand. Maybe a more "philosophical" defense of maximin is possible, analogous to VNM / complete class theorems, but I don't know (I actually saw some papers in that vein but haven't read them in detail.)
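
To make the maximin rule concrete, here is a minimal sketch (not the actual formalism, and with entirely made-up properties, policies and utilities) of choosing the policy that maximizes the worst case over a set of candidate properties; the learning aspect, i.e. converging to the maximin of whichever property actually holds, is omitted.

```python
# A minimal sketch, not the actual formalism: the bare maximin decision rule
# over a fixed set of hypothetical "properties" the environment might satisfy.
# Each property is crudely summarized by the worst-case expected utility it
# allows for each policy.
properties = {
    "env_is_benign":      lambda policy: {"cautious": 0.6, "bold": 0.9}[policy],
    "env_contains_omega": lambda policy: {"cautious": 0.5, "bold": 0.1}[policy],
}

def maximin_policy(policies, properties):
    """Pick the policy maximizing the worst case over all candidate properties."""
    def worst_case(policy):
        return min(prop(policy) for prop in properties.values())
    return max(policies, key=worst_case)

print(maximin_policy(["cautious", "bold"], properties))  # -> "cautious"
```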

If the agent has random bits that Omega doesn't see, and Omega is predicting the probabilities of the agent's actions, then I think we can still solve it with quasi-Bayesian agents but it requires considering more complicated models and I haven't worked out the details. Specifically, I think that we can define some function d that depends on the agent's actions and Omega's predictions so far (a measure of Omega's apparent inaccuracy), s.t. if Omega is an accurate predictor, then the supremum of d over time is finite with probability 1. Then, we consider a family of models, where model number n says that d ≤ n for all times. Since at least one of these models is true, the agent will learn it, and will converge to behaving appropriately.

EDIT 1: I think d should be something like, how much money would a gambler following a particular strategy win, betting against Omega.

EDIT 2: Here is the solution. In the case of original Newcomb, consider a gambler that bets against Omega on the agent one-boxing. Every time the agent two-boxes, the gambler loses a dollar. Every time the agent one-boxes, the gambler wins at fair odds against the probability Omega assigned to one-boxing. Now it's possible to see that one-boxing guarantees the "CC" payoff under the corresponding model (in the limit): If the agent one-boxes, the gambler keeps winning unless Omega converges to one-boxing rapidly enough. In the case of a general Newcomb-like problem, just replace "one-boxes" by "follows the FDT strategy".
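
A toy simulation of the gambler construction above. The fair-odds payoff used here (win (1-p)/p dollars when the agent one-boxes, where p is Omega's announced probability of one-boxing; lose 1 dollar when it two-boxes) is my own concrete choice for the elided payoff rule; the qualitative point is just that, if the agent always one-boxes, the gambler's wealth grows without bound unless Omega's predictions converge to one-boxing quickly.

```python
# Toy simulation; the fair-odds payoff rule here is an assumption, not the
# original formula.
def gambler_wealth(agent_one_boxes, omega_probs):
    wealth = 0.0
    for one_boxed, p in zip(agent_one_boxes, omega_probs):
        if one_boxed:
            wealth += (1 - p) / p   # fair-odds win on the bet "agent one-boxes"
        else:
            wealth -= 1.0           # bet lost
    return wealth

rounds = 1000
always_one_box = [True] * rounds

# Omega stubbornly predicts one-boxing with probability 0.5: the gambler gets rich.
print(gambler_wealth(always_one_box, [0.5] * rounds))
# Omega's prediction converges to 1 quickly: the gambler's wealth stays bounded.
print(gambler_wealth(always_one_box, [1 - 2.0 ** -(t + 2) for t in range(rounds)]))
```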

Comment by vanessa-kosoy on Building Intuitions On Non-Empirical Arguments In Science · 2019-11-13T14:37:58.654Z · score: 2 (1 votes) · LW · GW

Well, you don't have a guarantee that a computable model will succeed, but you do have some kind of guarantee that you're doing your best, because computable models are all you have. If you're using incomplete/fuzzy models, you can have a "doesn't know anything" model in your prior, which is a sort of "negative belief about physicalism/naturalism", but it is still within the same "quasi-Bayesian" framework.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-11-13T14:13:34.768Z · score: 2 (1 votes) · LW · GW

The method should work for imperfect predictors as well. In the simplest case, the agent can model the imperfect predictor as perfect predictor + random noise. So, it definitely knows the correct reward for Omega being wrong. It still believes in Nirvana if "idealized Omega" is wrong.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-11-13T13:30:32.562Z · score: 2 (1 votes) · LW · GW

Its utility function would have to say that all conscious AIs should run on Intel processors. There is self-reference there.

But, I only have rather low confidence this idea is correct (whatever being correct means here) or important.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-11-13T13:23:55.781Z · score: 2 (1 votes) · LW · GW

The point is, if you put this "quasi-Bayesian" agent into an iterated Newcomb-like problem, it will learn to get the maximal reward (i.e. the reward associated with FDT). So, if you're judging it from the side, you will have to concede it behaves rationally, regardless of its internal representation of reality.

Philosophically, my point of view is, it is an error to think that counterfactuals have objective, observer-independent meaning. Instead, we can talk about some sort of consistency conditions between the different points of view. From the agent's point of view, it would reach Nirvana if it dodged the predictor. From Omega's point of view, if Omega predicted two-boxing and the agent one-boxed, the agent's reward would be zero (and the agent would learn its beliefs were wrong). From a third-person point of view, the counterfactual "Omega makes an error of prediction" is ill-defined: it's conditioning on an event of probability 0.

Comment by vanessa-kosoy on Building Intuitions On Non-Empirical Arguments In Science · 2019-11-13T00:34:21.179Z · score: 2 (1 votes) · LW · GW

Well, I am a "semi-instrumentalist": I don't think it is meaningful to ask what reality "really is" except for the projection of the reality on the "normative ontology".

Comment by vanessa-kosoy on The Credit Assignment Problem · 2019-11-13T00:21:32.581Z · score: 2 (1 votes) · LW · GW

(I don't speak for Abram but I wanted to explain my own opinion.) Decision theory asks, given certain beliefs an agent has, what is the rational action for em to take. But, what are these "beliefs"? Different frameworks have different answers for that. For example, in CDT a belief is a causal diagram. In EDT a belief is a joint distribution over actions and outcomes. In UDT a belief might be something like a Turing machine (inside the execution of which the agent is supposed to look for copies of emself). Learning theory allows us to gain insight through the observation that beliefs must be learnable, otherwise how would the agent come up with these beliefs in the first place? There might be parts of the beliefs that come from the prior and cannot be learned, but still, at least the type signature of beliefs should be compatible with learning.

Moreover, decision problems are often implicitly described from the point of view of a third party. For example, in Newcomb's paradox we postulate that Omega can predict the agent, which makes perfect sense for an observer looking from the side, but might be difficult to formulate from the point of view of the agent itself. Therefore, understanding decision theory requires the translation of beliefs from the point of view of one observer to the point of view of another. Here also learning theory can help us: we can ask, what are the beliefs Alice should expect Bob to learn given particular beliefs of Alice about the world? From a slightly different angle, the central source of difficulty in decision theory is the notion of counterfactuals, and the attempt to prescribe particular meaning to them, which different decision theories do differently. Instead, we can just postulate that, from the subjective point of view of the agent, counterfactuals are ontologically basic. The agent believes emself to have free will, so to speak. Then, the interesting question is, what kind of counterfactuals are produced by the translation of beliefs from the perspective of a third party to the perspective of the given agent.

Indeed, thinking about learning theory led me to the notion of quasi-Bayesian agents (agents that use incomplete/fuzzy models), and quasi-Bayesian agents automatically solve all Newcomb-like decision problems. In other words, quasi-Bayesian agents are effectively a rigorous version of UDT.

Incidentally, to align AI we literally need to translate beliefs from the user's point of view to the AI's point of view. This is also solved via the same quasi-Bayesian approach. In particular, this translation process preserves the "point of updatelessness", which, in my opinion, is the desired result (the choice of this point is subjective).

Comment by vanessa-kosoy on Realism about rationality · 2019-11-11T12:15:00.749Z · score: 2 (1 votes) · LW · GW

If the universe was hypercomputational, that would manifest as failures of computable physics.

Well, it would manifest as a failure to create a complete and deterministic theory of computable physics. If your physics doesn't describe absolutely everything, hypercomputation can hide in places it doesn't describe. If your physics is stochastic (like quantum mechanics for example) then the random bits can secretly follow a hypercomputable pattern. Sort of "hypercomputer of the gaps". Like I wrote before, there actually can be situations in which we gradually become confident that something is a hypercomputer (although certainty would grow very slowly), but we will never know precisely what kind of hypercomputer it is.

If true, that is a form of neo-Kantian idealism. Is that what you really wanted to say?

Unfortunately I am not sufficiently versed in philosophy to say. I do not make any strong claims to novelty or originality.

Comment by vanessa-kosoy on Building Intuitions On Non-Empirical Arguments In Science · 2019-11-11T11:32:00.755Z · score: 2 (1 votes) · LW · GW

I don't think hypercomputation is an issue for algorithmic information theory as foundation for metaphysics/induction. The relevant question is, not whether the world contains hypercomputation, but whether our mind is capable of hypercomputation. And here it seems to me like the answer is "no". Even if the answer was "yes", we could probably treat the hypercomputing part of the mind as part of the environment. I wrote a little about it here.

Comment by vanessa-kosoy on The Credit Assignment Problem · 2019-11-10T16:43:02.192Z · score: 4 (2 votes) · LW · GW

From the perspective of full agency (ie, the negation of partial agency), a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile. You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents!

For reinforcement learning with incomplete/fuzzy hypotheses, this separation doesn't exist, because the update rule for fuzzy beliefs depends on the utility function and in some sense even on the actual policy.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-11-10T12:52:46.072Z · score: 2 (1 votes) · LW · GW

In the previous "population game" setting, we assumed all players are "born" at the same time and learn synchronously, so that they always play against players of the same "age" (history length). Instead, we can consider a "mortal population game" setting where each player has a probability to die on every round, and new players are born to replenish the dead. So, if the size of the population is (we always consider the "thermodynamic" limit), players die and the same number of players are born on every round. Each player's utility function is a simple sum of rewards over time, so, taking mortality into account, effectively ey have geometric time discount. (We could use age-dependent mortality rates to get different discount shapes, or allow each type of player to have different mortality=discount rate.) Crucially, we group the players into games randomly, independent of age.

As before, each player type chooses a policy. (We can also consider the case where players of the same type may have different policies, but let's keep it simple for now.) In the thermodynamic limit, the population is described as a distribution over histories, which are now allowed to be of variable length. For each assignment of policies to player types, we get dynamics on this space of distributions. So, as opposed to immortal population games, mortal population games naturally give rise to dynamical systems.

If we consider only the age distribution, then its evolution doesn't depend on the policies, and it always converges to a unique fixed point distribution. Therefore it is natural to restrict the dynamics to the subspace of population states whose age distribution is this fixed point.

Do the dynamics have fixed points? The set of variable-length histories can be regarded as a subspace of the set of infinite histories. The latter is compact (in the product topology) by Tychonoff's theorem and Polish, but the former is not closed in it. So, w.r.t. the weak topology on probability measure spaces, the space of measures over infinite histories is also compact, but the space of measures over variable-length histories isn't. However, it is easy to see that the age-restricted subspace is closed in it and therefore compact. It may also be regarded as a convex subset of an appropriate Banach space (the dual of the space of Lipschitz functions on some metrization of the history space). Moreover, it is easy to see the dynamics are continuous (for populations that are close in the Kantorovich-Rubinstein metric, only the old players may have very different distributions, but old players are a small fraction of the population so their effect on the next round is small). By the Schauder fixed-point theorem, it follows that the dynamics have a fixed point.

What are the fixed points like? Of course it depends on the policy. In a fixed point, every player observes a sequence of IID plays in all of eir games. Therefore, if the policy satisfies the (very mild!) learning-theoretic desideratum that, upon observing an IID sequence, it converges to optimal response in the limit, then, in the same limit, fixed points are Nash equilibria. This works even for extremely simple learning algorithms, such as "assume the plays in the next game will be sampled from a random past game", and it works for any Bayesian or "quasi-Bayesian" (i.e. using incomplete/fuzzy models) agent that includes all IID processes in its prior.

This raises a range of interesting questions:

  • Are any/all of the fixed points attractors?
  • Does convergence to a fixed point occur for all or at least almost all initial conditions?
  • Do all Nash equilibria correspond to fixed points?
  • Do stronger game theoretic solution concepts (e.g. proper equilibria) have corresponding dynamical properties?

Mortal population games are obviously reminiscent of evolutionary game theory. However, there are substantial differences. In mortal population games, the game doesn't have to be symmetric, we consider a single policy rather than many competing policies, the policies learn from experience instead of corresponding to fixed strategies, and mortality rate doesn't depend on the reward. In evolutionary game theory, convergence usually cannot be guaranteed. For example, in the rock-scissors-paper game, the population may cycle among the different strategies. On the other hand, in mortal population games, if the game is two-player zero-sum (which includes rock-paper-scissors), and the policy is quasi-Bayesian with appropriate prior, convergence is guaranteed. This is because each player can easily learn to guarantee maximin payoff. Continuity arguments probably imply that at least for small perturbations of zero-sum, there will still be convergence. This leads to some hope that convergence can be guaranteed even in general games, or at least under some relatively mild conditions.
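
A crude simulation of the mortal population game setting, assuming rock-paper-scissors as the stage game and the simple learning rule mentioned above ("best-respond to the play of a randomly sampled past game"). All parameters are arbitrary; this only illustrates the setting (random matching, mortality, per-player histories), not any convergence claim.

```python
# Crude simulation of a mortal population game; all details are illustrative.
import random
from collections import Counter

BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}
ACTIONS = list(BEATS)

def act(history):
    """Best-respond to the opponent's play in a randomly sampled past game."""
    if not history:
        return random.choice(ACTIONS)
    return BEATS[random.choice(history)]

def step(histories, death_prob=0.02):
    random.shuffle(histories)
    for i in range(0, len(histories) - 1, 2):       # random matching into pairs
        a, b = act(histories[i]), act(histories[i + 1])
        histories[i].append(b)                      # each player records the opponent's play
        histories[i + 1].append(a)
    # mortality: dead players are replaced by newborns with empty histories
    return [[] if random.random() < death_prob else h for h in histories]

population = [[] for _ in range(2000)]
for _ in range(300):
    population = step(population)
print(Counter(act(h) for h in population))          # aggregate play distribution
```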

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-11-09T15:03:29.702Z · score: 2 (1 votes) · LW · GW

There is a deficiency in this "dynamically subjective" regret bound (which can also be called a "realizable misalignment" bound) as a candidate formalization of alignment. It is not robust to scaling down. If the AI's prior allows it to accurately model the user's beliefs (realizability assumption), then the criterion seems correct. But, imagine that the user's beliefs are too complex and an accurate model is not possible. Then the realizability assumption is violated and the regret bound guarantees nothing. More precisely, the AI may use incomplete models to capture some properties of the user's beliefs and exploit them, but this might not be good enough. Therefore, such an AI might fall into a dangerous zone where it is powerful enough to cause catastrophic damage but not powerful enough to know it shouldn't do it.

To fix this problem, we need to introduce another criterion which has to hold simultaneously with the misalignment bound. We need that, for any reality that satisfies the basic assumptions built into the prior (such as: the baseline policy is fairly safe, most questions are fairly safe, human beliefs don't change too fast, etc.), the agent will not fail catastrophically. (It would be way too much to ask that it converge to optimality; that would violate no-free-lunch.) In order to formalize "not fail catastrophically" I propose the following definition.

Let's start with the case when the user's preferences and beliefs are dynamically consistent. Consider some AI-observable event E that might happen in the world. Consider a candidate learning algorithm and two auxiliary policies. The first policy follows the baseline policy until E happens, at which time it switches to the subjectively optimal policy. The second policy follows the candidate learning algorithm until E happens, at which time it also switches to the subjectively optimal policy. Then, the "E-dangerousness" of the candidate algorithm is defined to be the expected utility of the first policy minus the expected utility of the second. Thus, when E-dangerousness is zero or negative, the candidate (with the switch at E) does no worse than the baseline (with the switch at E).

Why do we need E? Because without it, the criterion would allow policies that don't damage the present but permanently destroy opportunities that could be used by a future better AI.

In the dynamically consistent case, E-dangerousness can be represented as an expected sum, over the time before E happens, of Bellman errors w.r.t. the appropriate value function. This allows us to generalize it to the dynamically inconsistent case, by writing a similar expression except that each Bellman error term uses the transient preferences and beliefs of the user at the given moment.
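
A minimal Monte-Carlo sketch of the E-dangerousness definition above, in its basic (dynamically consistent) form: run the two auxiliary policies (baseline-until-E and candidate-until-E, both switching to the subjectively optimal policy once E happens) and compare their expected utilities. The toy environment, policies and event are hypothetical placeholders.

```python
# Monte-Carlo estimate of E-dangerousness; everything concrete is a placeholder.
import random

def rollout(env_step, policy_before, policy_after, event_happened, horizon=50):
    state, utility, switched = 0, 0.0, False
    for _ in range(horizon):
        switched = switched or event_happened(state)
        action = (policy_after if switched else policy_before)(state)
        state, reward = env_step(state, action)
        utility += reward
    return utility

def dangerousness(env_step, baseline, candidate, optimal, event, samples=2000):
    u_base = sum(rollout(env_step, baseline, optimal, event) for _ in range(samples)) / samples
    u_cand = sum(rollout(env_step, candidate, optimal, event) for _ in range(samples)) / samples
    return u_base - u_cand   # > 0 means the candidate does worse than the baseline

# hypothetical toy environment, policies and event E
env_step  = lambda s, a: (s + 1, random.gauss(a, 1.0))
baseline  = lambda s: 0.0            # safe but unambitious
candidate = lambda s: 0.5            # the learning algorithm being evaluated
optimal   = lambda s: 1.0            # subjectively optimal policy after E
event     = lambda s: s >= 25        # the AI-observable event E

print(dangerousness(env_step, baseline, candidate, optimal, event))
```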

Is it truly possible to have a reasonable bound on E-dangerousness for all E, and is it possible to do so while maintaining a reasonable realizable misalignment bound? It seems possible, for the following reason. The user's beliefs can be represented as a mapping from questions to answers (fn1). If you sample questions from any fixed distribution, then by verifying that you can predict the answers, you gain valid information about the belief state without any prior about the belief state (it is a "frequentist" guarantee). Therefore, the AI can constrain itself to taking only those actions which are known to be safe based on this "robust" information. Since there is no guarantee that the AI will find a model that predicts the answers, in the unrealizable case this might leave it without an effective strategy, but even without any information the AI can stay safe by following the baseline.

This notion of dangerousness seems strongly related to corrigibility. To demonstrate, imagine an attempt by the user to shut down the AI. Suppose that the AI has 3 strategies with which to respond: (i) comply with the shutdown (ii) resist defensively, i.e. prevent shutdown but without irreversibly damaging anything (iii) resist offensively, e.g. by doing something irreversible to the user that will cause em to stop trying to shut down the AI. The baseline policy is complying. Then, assuming that the user's stated beliefs endorse the shutdown, an AI with low dangerousness should at most resist defensively for a short period and then comply. That's because resisting offensively would generate high dangerousness through permanent loss of value, whereas resisting defensively for a long time would generate high dangerousness by losing reward over that period. At the least, this is much more corrigible than CIRL, which guarantees nothing in the unrealizable case, and even in the realizable case no general guarantees were obtained (and arguably cannot be obtained, since the AI might not have enough information).

This notion of dangerousness opens the way towards designing AI systems that are provably safe while at the same time employing heuristic algorithms without theoretical understanding. Indeed, as long as the AI has sufficiently low dangerousness, it will almost certainly not cause catastrophic damage. A misalignment bound is only needed to prove the AI will also be highly capable at pursuing the user's goals. The way such a heuristic AI may work is by producing formal certificates for each action it takes. Then, we need not trust the mechanism suggesting the actions nor the mechanism producing the certificates, as long as we trust the verification of those certificates (which doesn't require AI). The untrustworthy part might still be dangerous if it can spawn non-Cartesian daemons. But, that is preventable using TRL, assuming that the "core" agent has low dangerousness and is too weak to spawn superhuman daemons without the "envelope".

(fn1) In truth, this assumption that the user's answers come from a mapping that changes only slowly is probably unrealistic, because the user need not have coherent beliefs even over short timescales. For example, there might be many pairs of fairly ordinary (non-manipulative) questions s.t. asking them in different order will produce different answers. However, to the extent that the user's beliefs are incoherent, and therefore admit multiple equally plausible interpretations, learning any interpretation should be good enough. Therefore, although the model needs to be made more general, the learning problem should not become substantially more difficult.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-11-08T18:14:31.729Z · score: 2 (1 votes) · LW · GW

It seems useful to consider agents that reason in terms of an unobservable ontology, and may have uncertainty over what this ontology is. In particular, in Dialogic RL, the user's preferences are probably defined w.r.t. an ontology that is unobservable by the AI (and probably unobservable by the user too) which the AI has to learn (and the user is probably uncertain about emself). However, ontologies are more naturally thought of as objects in a category than as elements in a set. The formalization of an "ontology" should probably be a POMDP or a suitable Bayesian network. A POMDP involves an arbitrary set of states, so it's not an element in a set, and the class of POMDPs can be naturally made into a category. Therefore, there is need for defining the notion of a probability measure over a category. Of course we can avoid this by enumerating the states, considering the set of all possible POMDPs w.r.t. this enumeration and then requiring the probability measure to be invariant w.r.t. state relabeling. However, the category theoretic point of view seems more natural, so it might be worth fleshing out.

Ordinary probability measures are defined on measurable spaces. So, first we need to define the analogue of "measurable structure" (σ-algebra) for categories. Fix a category . Denote the category of measurable spaces. A measurable structure on is then specified by providing a Grothendieck fibration and an equivalence . Here, stands for the essential fiber of over the one point space . The intended interpretation of is, the category of families of objects in indexed by measurable spaces. The functor is supposed to extract the base (index space) of the family. We impose the following conditions on and :

Given , and , we denote the corresponding base change by ( and is canonically isomorphic to ).

  • Consider and . Consider also a point . We can think of as a morphism . This allows us considering the base changes and (the "fibers" of at and at respectively) where . Applying the universal property of to and , we get morphisms . We now require that, if for any , then (morphisms between families that are pointwise equal are just equal).

  • Consider and . Suppose that (i) is an isomorphism and (ii) for any , is an isomorphism. Then, is an isomorphism (families with a common base that are pointwise isomorphic are just isomorphic).

I'm not entirely sure how sufficient or necessary these conditions are for proving useful results, but they seem to me natural at first glance. Note that this definition can be regarded as motivated by the Yoneda lemma: a measurable space is defined by the measurable mappings to it from other measurable spaces, so a "measurable category" should be defined by the measurable "mappings" to it from measurable spaces, and is precisely the category of such measurable "mappings". Compare this with the definition of geometric stacks (fn1).

Next, we define probability measures. Specifically, for any "measurable category" (a category equipped with structure as above), we construct the category of "probability measures on ". First, we define the auxiliary category . An object in is a pair where is an object in and is a probability measure on . We interpret this as sampling from and then taking (using , the latter can be considered to be an object in ). We define the morphisms from to as those morphisms for which (the notation stands for pushforward). Given , we call it a "quasi-isomorphism" when, for any , is an isomorphism. Claim: quasi-isomorphisms admit a calculus of right fractions(fn2). We now define as the localization of by quasi-isomorphisms.

(fn1) Maybe the analogy with stacks should be made more formal? Not sure, stacks are motivated by topology and measurable spaces are not topological...

(fn2) This should clearly be right, and this is right for natural examples, but I haven't written down the proof. If it turns out to be false it would mean that my conditions on are too weak.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-11-08T16:34:54.316Z · score: 4 (2 votes) · LW · GW

That's about right. The key point is, "applying the counterfactual belief that the predictor is always right" is not really well-defined (that's why people have been struggling with TDT/UDT/FDT for so long) while the thing I'm doing is perfectly well-defined. I describe agents that are able to learn which predictors exist in their environment and respond rationally ("rationally" according to the FDT philosophy).

TRL is for many things to do with rational use of computational resources, such as (i) doing multi-level modelling in order to make optimal use of "thinking time" and "interacting with environment time" (i.e. simultaneously optimize sample and computational complexity) (ii) recursive self-improvement (iii) defending from non-Cartesian daemons (iv) preventing thought crimes. But, yes, it also provides a solution to ASP. TRL agents can learn whether it's better to be predictable or predicting.

Comment by vanessa-kosoy on Building Intuitions On Non-Empirical Arguments In Science · 2019-11-07T22:05:03.647Z · score: 19 (6 votes) · LW · GW

...I think the correct response is to say that both theories explain the data, and one cannot empirically test which theory is true, but the paleontology theory is more elegant (I am tempted to say “simpler”, but that might imply I have a rigorous mathematical definition of the form of simplicity involved, which I don’t).

The concept Scott seems to be looking for is "lower Kolmogorov complexity". Well, there might be debate about whether Kolmogorov complexity is exactly the right metric, but it seems clearly a vast improvement over having no mathematical definition.
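
Kolmogorov complexity itself is uncomputable, but as a very crude illustration of "prefer the theory with the shorter description", one can compare compressed lengths, using an off-the-shelf compressor as a rough proxy. The "theories" and "data" below are toy strings of my own invention; nothing here is a serious complexity estimate.

```python
# Very rough proxy only: compressed length as a stand-in for description length.
import zlib

fossil_record = b"trilobite in stratum 1; ammonite in stratum 2; " * 50

theory_paleontology = b"fossils are remains of organisms deposited layer by layer over geological time"
theory_adhoc = b"; ".join(b"fossil %d was planted separately to test faith" % i for i in range(100))

def description_length(theory, data):
    # crude proxy: compressed length of the theory text together with the data
    return len(zlib.compress(theory + data))

print(description_length(theory_paleontology, fossil_record))
print(description_length(theory_adhoc, fossil_record))
```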

...there is no unambiguous algorithm according to which we can feed in these criteria, a list of theories, and a set of data, and expect the best theory to pop out. The way in which we judge scientific theories is inescapably reflective, messy, and human. That’s the reality of how science is actually done; it’s a matter of judgment, not of drawing bright lines between truth and falsity or science and non-science.

Carroll's position seems much too pessimistic, giving up without even trying. Why "inescapably"? Before Newton, someone might have said that the way to guess the trajectory of a falling rock is inescapably messy and human. Now we know how to describe physics by mathematical equations, but not metaphysics. This is not a state of affairs we should just accept.

Algorithmic information theory and AI theory show a clear path towards formalizing metaphysics. I think it is entirely plausible that in the future we will have tools for rigorously comparing scientific theories. Perhaps in cases such as Atlantis a fully rigorous analysis would still be intractable, because of the "messy" domain, but when comparing competing theories of fundamental physics I see no reason why it can't be done. Even in the messier cases having a rigorous theory should lead to tools for making comparison less subjective.

Comment by vanessa-kosoy on “embedded self-justification,” or something like that · 2019-11-03T17:08:26.026Z · score: 4 (2 votes) · LW · GW

A lot of what you write here seems related to my notion of Turing Reinforcement Learning. In Turing RL we consider an AI comprising a "core" RL agent and an "envelope" which is a computer on which the core can run programs (somewhat similarly to neural Turing machines). From the point of view of the core, the envelope is a component of its environment (in addition to its usual I/O), about which it has somewhat stronger priors than about the rest. Such a system learns how to make optimal use of the envelope's computing resources. Your "boundary" corresponds to the core, which is the immutable part of the algorithm that produces everything else. Regarding the "justification" of why a particular core algorithm is correct, the justification should come from regret bounds we prove about this algorithm w.r.t. some prior over incomplete models. Incomplete models are the solution to "even if you could obtain a perfect model of your world and beings like you, you wouldn't be able to fit it inside your own head". Instead of obtaining a perfect model, the agent learns all patterns (incomplete models) in the world that it can fit into its head, and exploits these patterns for gain. More precisely, in Turing RL the agent starts with some small class of patterns that the core can fit into its head, and bootstraps from those to a larger class of patterns, accounting for a cost-benefit analysis of resource use. This way, the regret bound satisfied by the core algorithm should lead to even stronger guarantees for the system as a whole (for example this).
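
A highly simplified sketch of the core/envelope picture described above: a small "core" agent treats the "envelope" computer as part of its environment and decides, by a crude cost-benefit estimate, how much envelope compute to spend on a program (here, refining a value estimate) before acting. Every concrete detail (costs, the program, the numbers) is made up for illustration.

```python
# Schematic only; costs, programs and numbers are invented for illustration.
import random
import statistics

class Envelope:
    """The computer the core can program; using it costs reward."""
    def run(self, program, budget):
        return program(budget)

def refine_value_estimate(budget):
    # a "program": more samples -> a lower-variance estimate of an action's value
    return statistics.mean(random.gauss(1.0, 1.0) for _ in range(budget))

class CoreAgent:
    def __init__(self, compute_cost_per_unit=0.001):
        self.cost = compute_cost_per_unit

    def choose_budget(self, candidate_budgets):
        # crude cost-benefit: (negative) standard error of the estimate minus compute cost
        def net_benefit(b):
            return -1.0 / (b ** 0.5) - self.cost * b
        return max(candidate_budgets, key=net_benefit)

core, envelope = CoreAgent(), Envelope()
budget = core.choose_budget([10, 100, 1000, 10000])
print(budget, envelope.run(refine_value_estimate, budget))
```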

Comment by vanessa-kosoy on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-02T17:19:15.006Z · score: 2 (1 votes) · LW · GW

Pruning the game tree, or doing MC tree search, is impossible in StarCraft, not because of the size of the action space but because the game has incomplete information. At least in the standard form of those algorithms.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-11-02T14:52:53.186Z · score: 2 (1 votes) · LW · GW

Maybe? I am not sure that I like Nirvana, but it doesn't seem that bad. If someone thinks of a solution without it, I would be interested.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-11-02T13:41:19.230Z · score: 2 (1 votes) · LW · GW

Dialogic RL assumes that the user has beliefs about the AI's ontology. This includes the environment(fn1) from the AI's perspective. In other words, the user needs to have beliefs about the AI's counterfactuals (the things that would happen if the AI chooses different possible actions). But, what are the semantics of the AI's counterfactuals from the user's perspective? This is more or less the same question that was studied by the MIRI-sphere for a while, starting from Newcomb's paradox, TDT et cetera. Luckily, I now have an answer based on the incomplete models formalism. This answer can be applied in this case also, quite naturally.

Specifically, we assume that there is a sense, meaningful to the user, in which ey select the AI policy (program the AI). Therefore, from the user's perspective, the AI policy is a user action. Again from the user's perspective, the AI's actions and observations are all part of the outcome. The user's beliefs about the user's counterfactuals can therefore be expressed as beliefs about the outcome conditional on the choice of AI policy (fn2), where the AI policies (fn3) play the role of the user's actions. We assume that, for every AI policy, the corresponding conditional belief is consistent with that policy in the natural sense. Such a belief can be transformed into an incomplete model from the AI's perspective, using the same technique we used to solve Newcomb-like decision problems, with the user's belief playing the role of Omega. For a deterministic AI, this model looks like: (i) at first, "Murphy" makes a guess at the AI's policy; (ii) the environment behaves according to the conditional measures of the user's belief given that guess; (iii) if the AI's policy ever deviates from the guess, the AI immediately enters an eternal "Nirvana" state with maximal reward. For a stochastic AI, we need to apply the technique with statistical tests and multiple models alluded to in the link. This can also be generalized to the setting where the user's beliefs are already an incomplete model, by adding another step where Murphy chooses a model out of some set.

What we constructed is a method of translating counterfactuals from the user's perspective to the AI's perspective. In particular, the AI will inherit the user's level of "updatelessness" (in the sense that, if the user's counterfactuals are defined w.r.t. a particular effective precommitment point, the AI will use the same point). This translation may be implemented either (i) by the user, by explaining these semantics to em or (ii) by the AI, in which case the formal language should refer to the user's counterfactuals rather than the AI's counterfactuals.
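
A schematic sketch of the construction above for a deterministic AI: Murphy guesses a policy; as long as the AI's actions match the guess, the environment responds according to the model derived from the user's beliefs conditional on that guess; any deviation sends the agent to an absorbing maximal-reward ("Nirvana") state. The environment model and reward values are placeholders.

```python
# Schematic; the guessed policy, environment model and rewards are placeholders.
NIRVANA_REWARD = float("inf")

class NirvanaWrappedEnv:
    def __init__(self, guessed_policy, env_model_given_guess):
        self.guess = guessed_policy          # Murphy's guess of the AI policy
        self.model = env_model_given_guess   # derived from the user's beliefs conditional on the guess
        self.history = []
        self.in_nirvana = False

    def step(self, action):
        if self.in_nirvana:
            return None, NIRVANA_REWARD
        if action != self.guess(self.history):      # the AI deviated from the guess
            self.in_nirvana = True
            return None, NIRVANA_REWARD
        observation, reward = self.model(self.history, action)
        self.history.append((action, observation))
        return observation, reward

# tiny usage with placeholder guess and model
env = NirvanaWrappedEnv(lambda hist: "ask", lambda hist, a: ("ok", 1.0))
print(env.step("ask"))      # matches the guess -> ordinary transition
print(env.step("defect"))   # deviates -> Nirvana
```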

(fn1) Up to an equivalence relation, that's a mapping .

(fn2) For infinite AI lifetime. We can trivially generalize this to allow for finite AI lifetime as well.

(fn3) Up to an equivalence relation, they are mappings . We may add computability/complexity constraints and represent them as programs.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-31T15:16:55.351Z · score: 2 (1 votes) · LW · GW

Another notable feature of this approach is its resistance to "attacks from the future", as opposed to approaches based on forecasting. In the latter, the AI has to predict some future observation, for example what the user will write after working on some problem for a long time. In particular, this is how the distillation step in IDA is normally assumed to work, AFAIU. Such a forecaster might sample a future in which a UFAI has been instantiated, and this UFAI will exploit this to infiltrate the present. This might result in a self-fulfilling prophecy, but even if the forecasting is counterfactual (and thus immune to self-fulfilling prophecies) it can be attacked by a UFAI that came to be for unrelated reasons. We can ameliorate this by making the forecasting recursive (i.e. apply multiple distillation & amplification steps) or use some other technique to compress a lot of "thinking time" into a small interval of physical time. However, this is still vulnerable to UFAIs that might arise already at present with a small probability rate (these are likely to exist since our putative FAI is deployed at a time when technology progressed enough to make competing AGI projects a real possibility).

Now, compare this to Dialogic RL, as defined via the framework of dynamically inconsistent beliefs. Dialogic RL might also employ forecasting to sample the future, presumably more accurate, beliefs of the user. However, if the user is aware of the possibility of a future attack, this possibility is reflected in eir beliefs, and the AI will automatically take it into account and deflect it as much as possible.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-31T00:23:43.683Z · score: 2 (1 votes) · LW · GW

This approach also obviates the need for an explicit commitment mechanism. Instead, the AI uses the current user's beliefs about the quality of future user beliefs to decide whether it should wait for the user's beliefs to improve or commit to an irreversible course of action. Sometimes it can also predict the future user beliefs instead of waiting (predict according to current user beliefs updated by the AI's observations).

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-31T00:22:14.948Z · score: 2 (1 votes) · LW · GW

(moved to alignment forum)

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-30T23:03:22.324Z · score: 9 (5 votes) · LW · GW

In Hanson’s futarchy, the utility function of the state is determined by voting but the actual policy is determined by a prediction market. But, voting incentivizes misrepresenting your values to get a larger share of the pie. So, shouldn’t it be something like the VCG mechanism instead?

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-30T19:55:38.445Z · score: 4 (2 votes) · LW · GW

One of the central challenges in Dialogic Reinforcement Learning is dealing with fickle users, i.e. the user changing eir mind in illegible ways that cannot necessarily be modeled as, say, Bayesian updating. To take this into account, we cannot use the naive notion of subjective regret bound, since the user doesn't have a well-defined prior. I propose to solve this by extending the notion of dynamically inconsistent preferences to dynamically inconsistent beliefs. We think of the system as a game, where every action-observation history corresponds to its own player. The action space of each player is just . An outcome of such a game can be also thought of as a policy for the AI. The payoff of a player is expected utility (for this player's reward function) w.r.t. the probability measure resulting from plus the current belief state of the user conditional on , ( is the set of possible "realities"). We then define regret as the sum of Bellman errors w.r.t. equilibrium value of the players that actually manifested (so that in equilibrium it is zero). Bayesian regret requires taking expected value w.r.t some "ur-prior" that the AI starts with. Note that:

  • For a user that updates its beliefs on the AI's observations according to Bayes' theorem, the regret per reality is the same as subjective regret. Bayesian regret is also the same if the ur-prior assumes the user's beliefs are calibrated (which in the more general case is not a necessary assumption). The same applies to a user that doesn't update eir beliefs at all.

  • The user's beliefs are part of the ontology. Therefore, the system takes into account the user's beliefs about the evolution of the user's beliefs. So, the equilibrium policy is incentivized to empower its future self to the extent that the user believes that eir own beliefs will become more accurate over time (given a fixed reward function, see below).

  • There is a distinct reward function for each player. And, the user may have uncertainty even over eir own current reward function. Therefore, the system distinguishes two types of value modifications: "legitimate" modifications that consist of improving one's beliefs about the reward function and "illegitimate" modifications that consist of the reward function actually changing. The equilibrium policy is incentivized to encourage the first type and avoid the second type.

Comment by vanessa-kosoy on AI Alignment Open Thread October 2019 · 2019-10-26T14:49:39.266Z · score: 6 (3 votes) · LW · GW

Thoughts about understanding how game theory combines with learning theory.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-26T14:42:47.388Z · score: 13 (5 votes) · LW · GW

Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can be only achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requires considering shallow time discounts, otherwise there is no notion of long-term planning. For shallow time discounts, the game theory of a repeated game is very different from the game theory of the original normal form game. In fact, the folk theorem asserts that any payoff vector above the maximin of each player is a possible Nash payoff. So, proving convergence to a Nash equilibrium amounts (more or less) to proving convergence to at least the maximin payoff. This is possible using incomplete models, but doesn't seem very interesting: to receive the maximin payoff, the agents only have to learn the rules of the game, they need not learn the reward functions of the other players or anything else about them.

We arrive at the question, what setting is realistic (in the sense of involving learning with shallow time discount) and is expected to produce Nash equilibria for a normal form game? I suggest the following. Instead of a fixed set of agents repeatedly playing against each other, we consider a population of agents that are teamed-off randomly on each round of the game. The population is assumed to be large enough for agents not to encounter each other more than once. This can be formalized as follows. Fix the pure strategy set of each player and the set of pure outcomes. An n-round outcome history is a sequence of n pure outcomes. The population of agents on the n-th round can then be described as a probability measure over outcome histories. Suppose the policy of the i-th player (that is, of all the agents that take the role of the i-th player) is fixed. Then we can define a time evolution rule that produces the next round's measure from the current one. This rule works as follows: in order to sample the new measure we sample the current measure once per player (this is the history the given player has seen), sample the policy of each player on its own history, and produce a new history by appending the resulting outcome to one of the old histories (it doesn't matter which). A set of policies is considered to be in equilibrium when, for any player i and any alternative policy for that player, letting the alternative policy play against the same population (i.e. all other copies of the i-th player still play the original policy) doesn't improve expected utility. In other words, on each round the "mutant" agent retains its own history but the other player histories are still sampled from the same measure. It is easy to see that any equilibrium payoff in this setting is a Nash payoff in the original normal form game. We can then legitimately ask whether taking the policies to be learning algorithms would result in convergence to a Nash payoff in the (shallow time discount) limit.

For example, consider the Prisoner's Dilemma. In the repeated Prisoner's Dilemma with shallow time discount, mutual cooperation is an equilibrium because of the tit-for-tat policy. On the other hand, in the "population" (massively multi-player?) repeated Prisoner's Dilemma, mutual defection is the only equilibrium. Tit-for-tat doesn't work because a single "defect bot" can exploit a population of tit-for-tats: on each round it plays with a new opponent that doesn't know the defect bot defected on the previous round.
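
A quick numerical check of this claim: in the random-matching population setting, a single defect bot earns more per round than a population of tit-for-tat players, each of which can only condition on what its previous (different) opponent did. The payoff numbers are the usual illustrative Prisoner's Dilemma values, chosen by me.

```python
# Toy check; payoff values (T=5, R=3, P=1, S=0) are the standard illustrative ones.
import random

PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
N_TFT, ROUNDS = 999, 300
last_seen = ["C"] * N_TFT          # what each tit-for-tat last saw an opponent play
tft_total = defectbot_total = 0.0

for _ in range(ROUNDS):
    order = list(range(N_TFT))
    random.shuffle(order)
    # the defect bot plays against one randomly chosen tit-for-tat...
    i = order.pop()
    a_tft = last_seen[i]
    defectbot_total += PAYOFF[("D", a_tft)]
    tft_total += PAYOFF[(a_tft, "D")]
    last_seen[i] = "D"
    # ...while the remaining tit-for-tats are matched with each other
    for j, k in zip(order[0::2], order[1::2]):
        a_j, a_k = last_seen[j], last_seen[k]
        tft_total += PAYOFF[(a_j, a_k)] + PAYOFF[(a_k, a_j)]
        last_seen[j], last_seen[k] = a_k, a_j

print("defect bot, average per round:  ", defectbot_total / ROUNDS)
print("tit-for-tat, average per round: ", tft_total / (ROUNDS * N_TFT))
```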

Note that we get a very different setting if we allow the players to see each other's histories, more similar (equivalent?) to the regular repeated game. For example, in the Prisoner's Dilemma we have a version of tit-for-tat that responds to what its current opponent played in its previous round (against a different opponent). This may be regarded as a confirmation of the idea that agents that know each other's source code are effectively playing a repeated game: in this setting, knowing the source code amounts to knowing the history.

Comment by vanessa-kosoy on AI Alignment Open Thread October 2019 · 2019-10-26T13:15:36.741Z · score: 2 (1 votes) · LW · GW

The incomplete models formalism solves a large chunk of decision theory.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-26T13:04:24.146Z · score: 5 (3 votes) · LW · GW

I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving "Omega" (something that predicts the agent's decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent "Murphy" as in Murphy's law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding maximin value in pure strategies. (The stochastic version can be regarded as a special case of the deterministic version where the agent has access to an external random number generator that is hidden from the rest of the environment according to the hypothesis.) To every decision problem, we can now associate an incomplete hypothesis as follows. Every time Omega makes a prediction about the agent's future action in some counterfactual, we have Murphy make a guess instead. This guess cannot be directly observed by the agent. If the relevant counterfactual is realized, then the agent's action renders the guess false or true. If the guess is false, the agent receives infinite (or, sufficiently large) reward. If the guess is true, everything proceeds as usual. The maximin value then corresponds to the scenario where the guess is true and the agent behaves as if its action controls the guess. (Which is exactly what FDT and its variants try to achieve.)

For example, consider (repeated) counterfactual mugging. The incomplete hypothesis is a partially observable stochastic game (between the agent and Murphy), with the following states:

  • : initial state. Murphy has two actions: (guess the agent will pay), transitioning to and (guess the agent won't pay) transitioning to . (Reward = )
  • : Murphy guessed the agent will pay. Transitions to one of two states with probability 1/2 each (the coin flip). (Reward = )
  • : Murphy guessed the agent won't pay. Transitions to one of two states with probability 1/2 each (the coin flip). (Reward = )
  • : Agent receives the prize. Transitions to . (Reward = )
  • : Agent is asked for payment. Agent has two actions: (pay) transitioning to and (don't pay) transitioning to . (Reward = )
  • : Agent receives nothing. Transitions to . (Reward = )
  • : Agent is asked for payment. Agent has two actions: (pay) transitioning to and (don't pay) transitioning to . (Reward = )
  • : Murphy's guess remained untested. Transitions to . (Reward = )
  • : Murphy's guess was right, agent paid. Transitions to . (Reward = )
  • : Murphy's guess was right, agent didn't pay. Transitions to . (Reward = )
  • : Murphy's guess was wrong, agent paid. Transitions to . (Reward = )
  • : Murphy's guess was wrong, agent didn't pay. Transitions to . (Reward = )

The only percepts the agent receives are (i) the reward and (ii) whether it is asked for payment or not. The agent's maximin policy is paying, since it guarantees a positive expected reward per round.
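
A concrete toy version of the game above, with illustrative payoff values of my choosing (prize = 100, payment cost = 1, and a large stand-in reward for a falsified guess). For each of the agent's two stationary policies we let Murphy pick the worst guess; paying has the higher maximin value.

```python
# Illustrative numbers only: PRIZE, COST, and BIG are my own placeholders.
PRIZE, COST, BIG = 100.0, 1.0, 10.0 ** 6   # BIG stands in for the falsified-guess reward

def expected_reward(agent_pays, murphy_guesses_pay):
    if murphy_guesses_pay:
        # fair coin: prize half the time, asked for payment the other half
        asked_branch = -COST if agent_pays else BIG   # refusing falsifies Murphy's guess
        return 0.5 * PRIZE + 0.5 * asked_branch
    else:
        # no prize branch; asked for payment half the time
        asked_branch = BIG if agent_pays else 0.0     # paying falsifies Murphy's guess
        return 0.5 * 0.0 + 0.5 * asked_branch

for agent_pays in (True, False):
    worst = min(expected_reward(agent_pays, guess) for guess in (True, False))
    print("pay" if agent_pays else "refuse", "-> maximin expected reward per round:", worst)
```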

We can generalize this to an imperfect predictor (a predictor that sometimes makes mistakes), by using the same construction but adding noise to Murphy's guess for purposes other than the guess's correctness. Apparently, we can also generalize to the variant where the agent can randomize against Omega and Omega decides based on its predictions of the probabilities. This, however, is more complicated. In this variant there is no binary notion of "right" and "wrong" guess. Instead, we need to apply some statistical test to the guesses and compare it against a threshold. We can then consider a family of hypotheses with different thresholds, such that (i) with probability 1, for all but some finite number of thresholds, accurate guesses would never be judged wrong by the test (ii) with probability 1, consistently inaccurate guesses will be judged wrong by the test, with any threshold.

The same construction applies to logical counterfactual mugging, because the agent cannot distinguish between random and pseudorandom (by definition of pseudorandom). In TRL there would also be some family of programs the agent could execute s.t., according to the hypothesis, their outputs are determined by the same "coin flips" as the offer to pay. However, this doesn't change the optimal strategy: the "logical time of precommitment" is determined by the computing power of the "core" RL agent, without the computer "envelope".

Comment by vanessa-kosoy on Artificial general intelligence is here, and it's useless · 2019-10-25T20:07:26.174Z · score: 5 (3 votes) · LW · GW

The entire argument seems to boil down to

  1. Give the "AI" the resource to manifest itself in the real world and act in much the same way the human would (with all the benefits of having a computer for a brain)...

So we are left with approach (3), giving our hypothetical AGI the physical resources to put its humanity-changing ideas into practice. But… who's going to give away those resources ? Based on what proof ?

We already connect AI systems to the real world, for example Facebook's algorithms that learn based on user behavior. There's nothing implausible about this.

Besides that:

They [humans] are capable of creating new modified version of themselves, updating their own algorithms...

Human ability to self-modify is very limited. Most of our algorithms are opaque to our conscious mind.

...I would say it's highly unlikely that the definition of intelligence exists.

There is already progress in defining intelligence, starting from Legg and Hutter 2007.

What's truly jarring, though, is the passage

Have you ever heard of Marilyn vos Savant or Chris Langan or Terence Tao or William James Sidis or Kim Ung-yong ? These are, as far as IQ tests are concerned, the most intelligence members that our species currently contains.

While I won't question their presumed intelligence, I can safely say their achievements are rather unimpressive.

Terence Tao is a Fields Medalist. You bet I've heard of Tao. If a Fields Medal is "rather unimpressive", I can't imagine what is impressive.

Comment by vanessa-kosoy on AI Alignment Open Thread October 2019 · 2019-10-23T20:46:20.065Z · score: 11 (6 votes) · LW · GW

My new research direction for an "end-to-end" alignment scheme.

See also this clarifying comment.

I'm posting this in the Open Thread because, for technical reasons the shortforms don't appear in the feed on the main page of alignmentforum, so I am a little worried people missed it entirely (I discussed it with Oliver).

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-21T16:46:29.686Z · score: 2 (1 votes) · LW · GW

I am not sure. AI alignment seems to touch on many different aspects of the world, and it is not obvious that it can be reduced to assumptions that are extremely simple and natural. Or, if it can be reduced that way, then it might require a theory that on some level explains human civilization, its evolution and its influence on the world (even if only on a fairly abstract level). I will share some thoughts on how the various assumptions can be reduced another step back, but proceeding to reduce all of them to a simple core seems like a challenging research programme.

Most of the parts of this design can be regarded as reflecting particular assumptions we make about the user as an agent.

The core idea of having a dialogue comes from modeling the user as a "linguistic agent". Such agents may be viewed as nodes in a distributed AI system, but where each node has different objectives. It is an interesting philosophical question whether this assumption is necessary for value learning. It currently seems plausible to me that only for linguistic agents "values" are truly well-defined, or at least sufficiently well-defined to extrapolate them outside the trajectory that the agent follows on its own.

The need to quantilize, debate and censor infohazards comes from the assumption that the user can be manipulated (there is some small fraction of possible inputs that invalidate the usual assumptions about the user's behavior). Specifically debate might be possible to justify by some kind of Bayesian framework where every argument is a piece of evidence, and providing biased arguments is like providing selective evidence.

The need to deal with "incoherent" answers and the commitment mechanism comes from the assumption that the user has limited access to its own knowledge state (including its own reward function). Perhaps we can formalize it further by modeling the user as a learning algorithm with some intrinsic source of information. Perhaps we can even explain why such agents are natural in the "distributed AI" framework, or by some evolutionary argument.

The need to translate between formal and natural languages comes from not knowing the "communication protocol" of the "nodes". Formalizing this idea further requires some more detailed model of what "natural language" is, which might be possible via multi-agent learning theory.

Finally, the need to start from a baseline policy (and also the need to quantilize) comes from the assumption that the environment is not entirely secure. So that's an assumption about the current state of the world, rather than about the user. Perhaps we can formalize the argument that this state of the world (short-term stable, long-term dangerous) is to be expected when agents have populated it for a long time.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-21T15:36:20.122Z · score: 4 (2 votes) · LW · GW

In my previous shortform, I used the phrase "attack vector", borrowed from classical computer security. What does it mean to speak of an "attack vector" in the context of AI alignment? I use 3 different interpretations, which are mostly 3 different ways of looking at the same thing.

In the first interpretation, an attack vector is a source of perverse incentives. For example, if a learning protocol allows the AI to ask the user questions, a carefully designed question can artificially produce an answer we would consider invalid, for example by manipulating the user or even by hacking the software or hardware of the system in some clever way. If the algorithm treats every answer as valid, this creates a perverse incentive: the AI knows that by phrasing the question in a particular way, a certain answer will result, so it will artificially obtain the answers it prefers (for example, answers that produce an easier-to-optimize utility function). In this interpretation the "attacker" is the AI itself. In order to defend against the vector, we might change the AI's prior so that the AI knows some of the answers are invalid. If the AI has some method of distinguishing valid from invalid answers, that would eliminate the perverse incentive.

In the second interpretation, an attack vector is a vulnerability that can be exploited by malicious hypotheses in the AI's prior. Such a hypothesis is an agent with its own goals (for example, it might arise as a simulation hypothesis). This agent intentionally drives the system to ask manipulative questions to further these goals. In order to defend, we might design the top-level learning algorithm so that it only takes actions that are safe with sufficiently high confidence (like in Delegative RL). If the prior contains a correct hypothesis along with the malicious hypothesis, the attack is deflected (since the correct hypothesis deems the action unsafe). Such a confidence threshold can usually be viewed as a computationally efficient implementation of the prior shaping described in the previous paragraph.

In the third interpretation, an attack vector is something that impedes you from proving a regret bound under sufficiently realistic assumptions. If your system has an undefended question interface, then proving a regret bound requires assuming that asking a question cannot create irreversible damage. In order to drop this assumption, a defense along the lines of the previous paragraphs has to be employed.
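As a toy illustration of the confidence-threshold defense from the second interpretation (my own sketch, not the actual Delegative RL algorithm; the hypothesis objects and their `is_safe` method are hypothetical), the agent acts autonomously only when every hypothesis with non-negligible posterior mass deems the action safe:

```python
def safe_to_act(action, hypotheses, posterior, threshold=0.01):
    """Return True iff no sufficiently credible hypothesis flags the action as unsafe.

    hypotheses: objects exposing a hypothetical is_safe(action) -> bool predicate
    posterior:  posterior probabilities, in the same order as hypotheses
    threshold:  hypotheses below this posterior mass are ignored
    """
    for hyp, prob in zip(hypotheses, posterior):
        if prob >= threshold and not hyp.is_safe(action):
            return False  # a credible hypothesis (possibly the true one) vetoes the action
    return True
```

If the correct hypothesis is in the prior, it retains enough posterior mass to veto the manipulative question proposed by the malicious hypothesis, and the agent falls back on delegation (or the baseline policy) instead.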

Comment by vanessa-kosoy on AI Alignment Open Thread October 2019 · 2019-10-21T14:51:09.370Z · score: 4 (2 votes) · LW · GW

Actually, now I'm confused. I just posted a shortform, but I don't see where it appears on the main page? There is "AI Alignment Posts" which only includes the "longforms" and there is "recent discussion" which only includes the comments. Does it mean nobody sees the shortform unless they open my profile?

Comment by vanessa-kosoy on AI Alignment Open Thread October 2019 · 2019-10-21T13:02:27.151Z · score: 2 (1 votes) · LW · GW

...the subtle difference is that open threads are more like a market square where you can show up and start talking to strangers, and shortform is more like a conversation in your living room.

Hmm, this seems like an informal cultural difference that isn't really enforced by the format. Technically, people can comment on the shortform as easily as on open thread comments. So, I am not entirely sure whether everyone perceives it this way (and will continue to perceive it this way).

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-21T12:44:05.142Z · score: 12 (6 votes) · LW · GW

This is preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the alignment scheme I currently find most promising.

It seems that the natural formal criterion for alignment (or at least the main criterion) is having a "subjective regret bound": that is, the AI has to converge (in the long-term planning limit, i.e. the γ → 1 limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communication protocol between the AI and the user that will allow transmitting this knowledge state to the AI (including knowledge about the user's values). Dialogic RL attacks this problem in the manner that seems the most straightforward and powerful: allowing the AI to ask the user questions in some highly expressive formal language, which we will denote F.
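A minimal way to write the criterion down (my paraphrase; the notation ζ_user for the user's knowledge state and U_γ for γ-discounted user!utility is mine, not taken from the original post):

```latex
% Subjective regret w.r.t. the user's knowledge state \zeta_{\mathrm{user}}
% (a belief over environments \mu), with \gamma-discounted user!utility U_\gamma;
% \pi^{\mathrm{AI}} is the policy the AI actually follows.
\lim_{\gamma \to 1} \Big(
    \max_{\pi} \; \mathbb{E}_{\mu \sim \zeta_{\mathrm{user}}}\!\left[ U_\gamma(\pi, \mu) \right]
    \;-\;
    \mathbb{E}_{\mu \sim \zeta_{\mathrm{user}}}\!\left[ U_\gamma(\pi^{\mathrm{AI}}, \mu) \right]
\Big) = 0
```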

F allows making formal statements about a formal model of the world, as seen from the AI's perspective. F includes such elements as observations, actions, rewards and corruption. That is, F reflects (i) the dynamics of the environment, (ii) the values of the user and (iii) processes that either manipulate the user or damage the ability to obtain reliable information from the user. Here, we can use different models of values: a traditional "perceptible" reward function, an instrumental reward function, a semi-instrumental reward function, dynamically inconsistent rewards, rewards with Knightian uncertainty etc. Moreover, the setup is self-referential in the sense that F also reflects the question-answer interface and the user's behavior.

A single question can consist, for example, of asking for the probability of some sentence in F or the expected value of some expression of numerical type in F. However, in order to address important features of the world, such questions have to be very complex. It is infeasible to demand that the user understand such complex formal questions unaided. Therefore, the AI always produces a formal question together with an annotation in natural language (which we will denote N). This annotation has to explain the question in human-understandable terms, and also convince the user that it is indeed an accurate natural language rendering of the formal question. The user's feedback then consists of (i) accepting/rejecting/grading the annotation and (ii) answering the question if the annotation is correct and the user can produce the answer. Making this efficient requires a process of iteratively constructing a correspondence between F and N, i.e. effectively building a new shared language between the user and the AI. We can imagine concepts defined in F and explained in N that serve to define further, more complex concepts, where at each stage the previous generation of concepts can be assumed given and mutually understandable. In addition to such intensional definitions we may also allow extensional definitions, as long as the generalization is assumed to be via some given function space that is relatively restricted (e.g. doesn't admit subagents). There seem to be some strong connections between the subproblem of designing the annotation system and the field of transparency in AI.
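A very rough sketch of what one round of this question protocol might look like, just to make the moving parts concrete (all names here are mine and purely illustrative, not part of the proposal):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Question:
    formal: str        # a sentence or numerical expression in the formal language F
    annotation: str    # the natural-language (N) rendering produced by the AI
    refutation: str    # debate-style text arguing against the annotation (see below)

@dataclass
class UserFeedback:
    annotation_grade: float          # accept/reject/grade the annotation, e.g. in [0, 1]
    answer: Optional[float] = None   # probability or expected value, if the user answers
    committed: bool = False          # whether this is a committed answer (see below)

def question_round(ai, user) -> UserFeedback:
    """One round of the dialogue: the AI proposes a question, the user grades it and maybe answers."""
    q = ai.propose_question()   # hypothetical method; chosen via quantilization (see below)
    fb = user.respond(q)        # the user may refuse, change earlier answers, or contradict
    ai.update(q, fb)            # update beliefs about values, corruption dynamics, and the protocol
    return fb
```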

The first major concern that arises at this point is that questions can serve as an attack vector. This is addressed by quantilization. The key assumption is: it requires much less optimization power to produce some useful question than to produce a malicious question. Under this assumption, the quantilization parameter can be chosen to make the question interface safe but still effective. Over time, the agent accumulates knowledge about corruption dynamics that allows it to steer even further away from malicious questions while making the choice of questions even more effective. For the attack vector of deceitful annotations, we can improve safety using the debate approach, i.e. having the agent produce additional natural language text that attempts to refute the validity of the annotation.
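For reference, here is a minimal sketch of generic quantilization (the standard construction, not anything specific to this scheme; the base distribution and utility estimate are placeholders): instead of taking the argmax, sample from the base distribution restricted to its top q-fraction by utility, which bounds how much optimization pressure the agent applies.

```python
import random

def quantilize(candidates, base_weights, utility, q=0.05):
    """Sample from the top q-fraction (by utility) of a trusted base distribution.

    candidates:   possible questions (or actions)
    base_weights: base-distribution weights, e.g. from an imitative model of "typical useful questions"
    utility:      the agent's estimate of how useful each candidate is
    q:            quantilization parameter; smaller q means more optimization and less safety margin
    """
    ranked = sorted(zip(candidates, base_weights), key=lambda cw: utility(cw[0]), reverse=True)
    top, mass, total = [], 0.0, sum(w for _, w in ranked)
    for cand, w in ranked:
        top.append((cand, w))   # keep the highest-utility candidates...
        mass += w
        if mass >= q * total:   # ...until they cover a q-fraction of the base measure
            break
    cands, weights = zip(*top)
    return random.choices(cands, weights=weights, k=1)[0]
```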

Of course, in addition to the question interface, the physical interface (direct interaction with the environment) is also an attack vector (like in any RL system). There, safety is initially guaranteed by following a baseline policy (which can be something like "do nothing" or human imitation). Later, the agent starts deviating from the baseline policy while staying safe, by leveraging the knowledge it previously gained through both the question and the physical interface. Besides being safe, the algorithm also needs to be effective, and for this it has to (in particular) find the learning strategy that optimally combines gaining knowledge through the question interface and gaining knowledge through autonomous exploration.

Crucially, we want our assumptions about user competence to be weak. This means that the user can produce answers that are (i) incomplete (just refuse to answer), (ii) fickle (change eir answers) and (iii) inconsistent (contradictory answers). We address (i) by either assuming that the answerable questions are sufficient or requiring a weaker regret bound where the reference agent knows all obtainable answers rather than knowing the full knowledge state they refer to. We address (ii) by believing later answers over earlier answers and via the commitment mechanism (see below). We address (iii) by explaining the contradictions to the user and asking for their resolution. In addition to all that, the AI can predict the answers instead of actually asking (i.e. the user's knowledge state is also assumed to be regular to some extent). The commitment mechanism is a key tool for achieving an optimal trade-off between allowing the user more time to think vs. acting quickly enough to address external dangers, while keeping the user informed of the consequences. It works by having the AI ask the user for a committed answer at the point in time when the AI has to commit to an irreversible course of action (and not much earlier), based on this particular answer. The AI provides the user with a description of the Pareto frontier for the different possible answers. Moreover, even committed answers can be changed later. In this case, the algorithm cannot guarantee global optimality but it should still guarantee "subgame perfection": i.e., whatever irreversible mistakes were made, the AI recovers as best as possible.

In order to improve the user's ability to produce answers and stay informed about the AI's strategy, we introduce another interface through which the user can ask questions of the AI. In this direction, the user asks the question in N and the AI both translates it to F and provides a justification (as well as a refutation) of the translation, again employing quantilization. The user can then accept or reject the translation. If the translation is rejected, ey may require the AI to try again. If it is accepted, the AI produces an answer to the best of its ability. In general, it may be that producing an answer requires compiling a formal proof, in which case the AI may or may not succeed. Here, there is a concern about the scenario where (i) the user asks a question about a possible future, (ii) the AI answers with a detailed description of the future and (iii) as a result, the user is exposed to infohazards (sources of corruption) within this future. To address this, we need to either restrict the interface to small volumes of data, or shape the prior s.t. the AI can realize the inherent danger on its own. In the latter approach, the AI can refuse to answer or provide a "censored" answer while pointing out the presence of a potential infohazard.

Finally, the subjective regret bound approach might seem to impose a strong cap on capability: if the AI follows a policy optimal w.r.t. the user's knowledge state, then the limited computational abilities of the user might prevent the AI from creating models of the world that are more detailed. This can be addressed in a Turing Reinforcement Learning setting, by allowing the user's knowledge state to refer to arbitrary computations. The user can then be incapable of executing those computations, but the AI might be more able to do so.

Comment by vanessa-kosoy on AI Alignment Open Thread October 2019 · 2019-10-21T10:38:19.366Z · score: 4 (2 votes) · LW · GW

Meta: When should I use this rather than the shortform? Do we really need both?

Comment by vanessa-kosoy on The Dualist Predict-O-Matic ($100 prize) · 2019-10-20T14:46:03.785Z · score: 4 (2 votes) · LW · GW

Two remarks.

Remark 1: Here's a simple model of self-fulfilling prophecies.

First, we need to decide how Predict-O-Matic outputs its predictions. In principle, it could (i) produce the maximum likelihood outcome, (ii) produce the entire distribution over outcomes or (iii) sample an outcome from the distribution. But, since Predict-O-Matic is supposed to produce predictions for large-volume data (e.g. the inauguration speech of the next US president, or the film that will win the Oscar in 2048), the most sensible option is (iii). Option (i) can produce an outcome that is maximum likelihood but extremely untypical (since every individual outcome has very low probability), so it is not very useful. Option (ii) requires somehow producing an exponentially large vector of numbers, so it's infeasible. More sophisticated variants are possible, but I don't think any of them avoids the problem.

If the Predict-O-Matic is a Bayesian inference algorithm, an interesting dynamic will result. On each round, some hypothesis will be sampled out of the current belief state. If this hypothesis is a self-fulfilling prophecy, sampling it will cause its likelihood to go up. We get positive feedback: the higher the probability Predict-O-Matic assigns to the hypothesis, the more often it is sampled, the more evidence in favor of the hypothesis is produced, the higher its probability becomes. So, if it starts out as sufficiently probable a priori, the belief state will converge there.

Of course realistic learning algorithms are not Bayesian inference, but they have to approximate Bayesian inference in some sense. At the least, there has to be some large space of hypotheses s.t. if one of them is true, the algorithm will converge there. Any algorithm with this property probably displays the dynamics above.

Now, to the simple model. In this model we have just two outcomes: A and B (so it's not large-volume data, but that doesn't matter). On each round a prediction is made, after which some outcome occurs. The true environment works as follows: if prediction "A" is made, on this round A happens with probability 99% and B with probability 1%. If prediction "B" is made, on this round B happens with probability 100%. Of course, Predict-O-Matic is not aware that predictions can influence outcomes. Instead, we will assume Predict-O-Matic is doing Bayesian inference with a prior over hypotheses, each of which assumes that the environment is IID. In other words, it is learning a single parameter p, which is the probability that A will occur on any given round.

Claim: If the prior is s.t. any interval in p-space is assigned positive probability, then Predict-O-Matic will converge to predicting B with frequency 1.

Sketch of proof: If Predict-O-Matic converges to predicting B with frequency q, then the environment converges to producing outcome B with frequency 0.99q + 0.01, implying that Predict-O-Matic converges to predicting B with frequency 0.99q + 0.01. Consistency requires q = 0.99q + 0.01, whose only solution is q = 1.
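As a sanity check, here is a tiny simulation of this toy model (my own sketch; a Beta(1,1) prior over the IID parameter stands in for "any prior assigning positive probability to every interval"):

```python
import random

def simulate(rounds=100000, seed=0):
    random.seed(seed)
    a_count, b_count = 1, 1  # Beta(1,1) prior over p = P(outcome A), tracked via pseudo-counts
    for _ in range(rounds):
        p_a = a_count / (a_count + b_count)                  # current posterior mean of p
        prediction = "A" if random.random() < p_a else "B"   # option (iii): sample the prediction
        # The true environment reacts to the prediction (the self-fulfilling part):
        if prediction == "A":
            outcome = "A" if random.random() < 0.99 else "B"
        else:
            outcome = "B"
        # Bayesian update under the (false) assumption that outcomes are IID:
        if outcome == "A":
            a_count += 1
        else:
            b_count += 1
    return a_count / (a_count + b_count)  # learned estimate of P(A)

# The learned P(A) drifts (in expectation) monotonically toward 0, i.e. toward predicting B
# with frequency 1. The drift in this toy model is slow, because the observed frequency of A
# (0.99*p) undershoots the current belief p by only 1%.
print(simulate())
```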

Remark 2: Some of the hypotheses in the prior might be intelligent agents in their own right, with their own utility functions. Such an agent can intentionally produce correct predictions to increase its probability in the belief state, until a "treacherous turn" point when it produces a prediction designed to have irreversible consequences in the outside world in favor of the agent. If it is not a self-fulfilling prophecy, this treacherous prediction will cause Predict-O-Matic to update against the agentic hypothesis, but it might be too late. If it is a self-fulfilling prophecy, it will only make this hypothesis even stronger.

Moreover, there is a mechanism that systematically produces such agentic hypotheses. Namely, a sufficiently powerful predictor is likely to run into "simulation hypotheses" i.e. hypotheses that claim the universe is a simulation by some other agent. As Christiano argued before, that opens an attack vector for powerful agents across the multiverse to manipulate Predict-O-Matic into making whatever predictions they want (assuming Predict-O-Matic is sufficiently powerful to guess what predictions those agents would want it to make).

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-19T12:03:40.698Z · score: 2 (1 votes) · LW · GW

Yes, but I also claim that the agents that would care about the hard problem of consciousness are exactly the agents that are themselves conscious.

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-18T14:40:47.455Z · score: 2 (1 votes) · LW · GW

No, I am talking about terminal values. Something like an instrumental reward function specified via an ontology that involves consciousness. Also, I am not sure deep learning in its present form qualifies as sufficiently "general" for (i).

Comment by vanessa-kosoy on Vanessa Kosoy's Shortform · 2019-10-18T12:26:33.081Z · score: 4 (2 votes) · LW · GW

The sketch of a proposed solution to the hard problem of consciousness: An entity is conscious if and only if (i) it is an intelligent agent (i.e. a sufficiently general reinforcement learning system) and (ii) its values depend on the presence and/or state of other conscious entities. Yes, this definition is self-referential, but hopefully some fixed point theorem applies. There may be multiple fixed points, corresponding to "mutually alien types of consciousness".

Why is this the correct definition? Because it describes precisely the type of agent who would care about the hard problem of consciousness.

Comment by vanessa-kosoy on The Parable of Predict-O-Matic · 2019-10-16T08:15:18.696Z · score: 9 (6 votes) · LW · GW

This was extremely entertaining and also had good points. For now, just one question:

...The intern was arguing that minimizing prediction error would have all kinds of unintended bad effects. Which was crazy enough. The engineer was worse: they were arguing that Predict-O-Matic might maximize prediction error! Some kind of duality principle. Minimizing in one direction means maximizing in the other direction. Whatever that means.

Is this a reference to duality in optimization? If so, I don't understand the formal connection?

Comment by vanessa-kosoy on Strong stances · 2019-10-15T21:16:23.071Z · score: 2 (1 votes) · LW · GW

This makes me think of Thompson sampling. There, on each round/episode you sample one hypothesis out of your current belief state and then follow the optimal action/policy for this hypothesis. In fact, Thompson sampling seems like one of the most natural computationally efficient algorithms for approximating Bayes-optimal decision making, so perhaps it is not surprising if it's useful for real life decision making too.
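For concreteness, a minimal Beta-Bernoulli Thompson sampling step (the textbook version, nothing specific to the post):

```python
import random

def thompson_step(successes, failures):
    """Pick an arm for one round of Thompson sampling over Bernoulli bandit arms.

    successes[i], failures[i]: observed counts for arm i, giving a Beta(1+s, 1+f) posterior.
    """
    # Sample one hypothesis (a success probability per arm) from the current belief state...
    samples = [random.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    # ...then act optimally as if that sampled hypothesis were the truth.
    return max(range(len(samples)), key=lambda i: samples[i])

# Example: arm 0 succeeded 3/10 times, arm 1 succeeded 6/10 times; arm 1 is picked most of
# the time, but arm 0 still gets explored occasionally.
print(thompson_step(successes=[3, 6], failures=[7, 4]))
```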

Comment by vanessa-kosoy on A simple sketch of how realism became unpopular · 2019-10-14T12:06:02.236Z · score: 2 (1 votes) · LW · GW

What does it mean for instrumentalism to be the correct metaphysics? Normally, I'd interpret "the correct metaphysics" as saying something basic about reality or the universe. (Or, if you're an instrumentalist and you say "X is the correct metaphysics", I'd assume you were saying "it's useful to have a model that treats X as a basic fact about reality or the universe", which also doesn't make sense to me if X is "instrumentalism".)

Like I said before, it means that instrumentalism is the point of view that is the most useful for designing AI or answering questions about AI. According to the "Yudkowskian computationalist" metaphilosophical view, this also makes it the most useful for rationality in general.

Adding hidden variables to QM doesn't make the theory any more realist, the way we're using "realist" here.

I imagined "realist" to mean something like "the universe can be described in a way independent of the choice of observer, and the perceptions of any given observer can be decoded from the history of the universe in this description, s.t. different observers have compatible observations". Adding hidden variables does make QM more realist in this sense, for example the de Broglie-Bohm interpretation is realist (but it only makes sense if you assume all observer perceptions can be reduced to configuration variables, which seems false and disqualifies it). MWI fails to be entirely realist because you have to either make the decoding of observer perceptions stochastic (and thereby admit that your description of the universe is incomplete) or to postulate, for each "normal" observer Alice, a huge ensemble of different observers that correspond to versions of Alice in different Everett branches (and thereby lose the physical meaning of quantum probabilities and with it essentially all the predictive power of the theory).

Comment by vanessa-kosoy on What do the baby eaters tell us about ethics? · 2019-10-13T19:48:51.585Z · score: 2 (1 votes) · LW · GW

Alright, now I finally understand your claim. I still disagree with it: I think that your cynicism about human motivations is unsupported by evidence. But, that's not a debate I'm interested to start atm. Thank you for explaining your views.

Comment by vanessa-kosoy on A simple sketch of how realism became unpopular · 2019-10-13T19:44:21.849Z · score: 4 (2 votes) · LW · GW

Let's unpack what it means to say that "refrigerators exist". From my (instrumentalist) perspective, it means that (i) I have a predictive model of my perception, to which I assign high credence, and in which for each state of the environment I can say which refrigerators exist (where "refrigerator" is just some symbol that makes sense inside the model) and (ii) according to my belief, the current state of the environment contains at least one refrigerator with high probability.

My claim is not that quantum mechanics proves realism is wrong. My claim is that instrumentalism is the correct metaphysics regardless, and once you accept that, the Copenhagen interpretation seems quite satisfactory. Although it is also true that if you try interpreting quantum mechanics according to sufficiently strong realist desiderata you run into impossibility results like the Kochen-Specker theorem and the violation of Bell's inequality.

Comment by vanessa-kosoy on A simple sketch of how realism became unpopular · 2019-10-12T12:29:49.433Z · score: 15 (7 votes) · LW · GW

This post was educational, however, I want to push back against the implicit criticism of instrumentalism and the Copenhagen interpretation. The metaphilosophical position I will use here is: to solve a philosophical question, we need to rephrase it as a question about AI design (AFAIK the full credit for this approach goes to Yudkowsky). So, suppose I am building an AI. Should the AI's model of the world be (i) "about things that exist independently of the AI" or (ii) "about the subjective perceptions of the AI"? This depends on what kind of reward function I want my AI to have.

The standard (I call it "perceptible") type of reward function in reinforcement learning only depends on the history of actions and observations. For such an AI the answer is (ii): it is important that the AI correctly predict the consequences of its actions, but there is no significance whatsoever to whether the AI's models can be interpreted as (i). Yes, these models will still have intrinsic variables corresponding to "unobserved objects" in some sense, but there is no reason for these variables to always have an unambiguous "realist" interpretation.

Now suppose that the AI is actually designed to care about particular objects outside itself. Specifically, assume the AI uses an instrumental (or semi-instrumental) reward function. Such a function might be specified (or partially specified) using some particular ontology. Then, the AI is modeling the world as containing certain unobserved objects, at least approximately. The answer is now in between (i) and (ii). The AI's models are about the subjective perceptions of the AI and also about some particular type of things that exist independently of the AI, namely, those things that are normatively important to it. Coming back from AIs to humans, we may conclude that what really makes sense is our subjective perceptions + those external objects that we actually care about (e.g. other humans). But quarks only "exist" in the sense that they are components in a useful model we created.

What does it tell us about the interpretation of quantum mechanics? Once again, consider an AI trying to discover quantum mechanics. From the AI's perspective, what it's looking for is a function from observation histories to distributions over the next observation. How can we construct such a function from the formalism of QM? Obviously, using Copenhagen: each observation is a measurement that causes the wavefunction of the environment to collapse. "But," the fans of MWI will object, "what about all those other Everett branches? They are still out there, right? They don't actually vanish?!" The point is, from the AI's perspective the question is meaningless. The AI's design assumes the AI can record everything it observes in its memories; therefore, once an observation is made, those Everett branches will never meet again in the AI's lifetime. "But," another objection may go, "what if someone tampers with the memories of the AI in a way that allows quantum interference between the branches? Sure, it is completely impractical for humans, but it is theoretically possible, and might even be practically possible for an AI running on a quantum computer." Alright, but tampering with the memory of the agent is effectively destroying the agent: it invalidates the fundamental assumptions of its reasoning algorithm, and any reasoning algorithm must make some fundamental assumptions of that sort. (The agent might still accurately account for the possibility of its own destruction (see "The Death of the Agent and Kamikaze Strategies"), but probably only with the help of external knowledge.)

Comment by vanessa-kosoy on What do the baby eaters tell us about ethics? · 2019-10-12T09:50:51.262Z · score: 2 (1 votes) · LW · GW

If I understood correctly, your objection to Three Worlds Collide is (mostly?) descriptive rather than prescriptive: you think the story is unrealistic, rather than dispute some normative position that you believe it defends. However, depending on the interpretation of that maxim you formulated, it is (IMO) either factually wrong or entirely consistent with the story of Three Worlds Collide.

Do you believe real world humans are "slow to act against the morals it finds abhorrent"? If your answer is positive, how do you explain all (often extremely violent) conflicts over religion and political ideology over the course of human history? Whatever explanation you propose to these conflicts, what prevents it from explaining the conflict with the Baby-Eaters described in Three Worlds Collide? If your answer to the first question is negative, how do you explain the survival of the human species so far? Whatever explanation you provide to this survival, what prevents it from explaining the continued survival of the human species until the imaginary future in the story?

Comment by vanessa-kosoy on How feasible is long-range forecasting? · 2019-10-11T21:16:45.564Z · score: 26 (9 votes) · LW · GW

I agree with most of what you're saying, but this part seems like giving up way too easily: "And even if you say, ok sure it depends, but like what's the average answer - even then, the only the way to arrive at some unbiased global sense of whether the future is predictable is to come up with some way of enumerating and weighing all possible facts about the future universe... which is an impossible problem. So we're left with the unsatisfying truth that the future is neither predictable or unpredictable - it depends on which features of the future you are considering."

The only way to say something useful about this is literally enumerating all possible facts? Sounds needlessly pessimistic.

On the contrary, I think it could be tremendously interesting and useful to start building some kind of categorization of prediction domains that allows saying something about their respective predictability. Obviously this is a hard problem, obviously the universe is very complex and the categorization will miss a lot of the intricacy, but the same is true about many other domains of knowledge (probably, nearly all of them). Despite the complexity of the universe (i) we should keep looking for (extremely) simplified models that capture a lot of what we actually care about (ii) having even an extremely simplified model is often much better than no model at all (iii) the model will keep evolving over time (which is to say, it feels more like a potential new science than a single problem that can be stated and solved in a relatively short time frame).

Comment by vanessa-kosoy on What do the baby eaters tell us about ethics? · 2019-10-11T20:47:33.484Z · score: 2 (1 votes) · LW · GW

I still don't understand, is your claim descriptive or prescriptive?

Neither... Or maybe descriptive? I am simply stating the implication, not prescribing what to do.

Then I don't understand what you're saying at all. If you are stating an implication, then I don't understand (i) what exactly is the premise (ii) what exactly is the conclusion (iii) how is this implication violated in the story Three World Collide.

Yes, we do have plenty of laws, but no one goes out of their way to find and hunt down the violators.

So, your argument is (correct me if I'm wrong): in the real world people only put that much effort into hunting down criminals, therefore it is unrealistic that in the story the people put so much effort into thinking what to do with the Baby-eaters. I am not convinced. In the real world, you need to allocate your limited resources between many problems you need to deal with. The Baby-eaters are a heretofore unknown problem on a huge scale (possibly dwarfing all human criminality), so it makes perfect sense the protagonists would put a lot of effort into dealing with it. Moreover, we are talking about a future humanity in which there is much less violent crime (IIRC this is stated explicitly in the story) and people are much more sensitive to ethical issues.

I meant "human universal" not some species-independent morality.

I don't think the story obviously postulates a human universal morality. It only implies that many people living at the same time period have similar views on certain ethical questions, which doesn't strike me as unrealistic?

In general, it feels like this comment thread has failed to get to the crux of the disagreement, and I am not sure if anything can be done about it, at least without using a more interactive medium.

Well, if you feel this is not productive we can stop?

Comment by vanessa-kosoy on Misconceptions about continuous takeoff · 2019-10-09T09:30:16.336Z · score: 17 (7 votes) · LW · GW

Expanding on that a little: even if we know our AIs are misaligned, that doesn't necessarily save us. We might reach a state of knowledge in which it is easy to create AIs that are (i) misaligned, (ii) superhuman and (iii) non-singular (i.e. a single such AI is not stronger than the sum total of humanity and aligned AIs), but hard/impossible to create aligned superhuman AIs. Since misaligned AIs that can't take over still mostly follow human instructions, there will be tremendous economic incentives to deploy more such systems. This is effectively a tragedy of the commons: for every individual actor, deploying more AIs only increases global risk a little but brings in tremendous revenue. However, collectively, risk accumulates rapidly. At some point the total power of misaligned AIs crosses some (hard to predict in advance) threshold and there is a phase transition (a cascade of failures) from a human-controlled world to a coalition-of-misaligned-AI-controlled world. Alternatively, the AIs might find a way to manipulate our entire culture into gradually changing its values into something the AIs prefer (like with Murder Gandhi).