## Comments

**Vanessa Kosoy (vanessa-kosoy)** on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-18T18:19:42.436Z · LW · GW

I don't think in this case should be defined to be 1. It seems perfectly justified to leave it undefined, since in such a game can be equally well conceptualized as maximally aligned or as maximally anti-aligned. It *is* true that if, out of some set of objects you consider the subset of those that have , then it's natural to include the undefined cases too. But, if out of some set of objects you consider the subset of those that have , then it's *also* natural to include the undefined cases. This is similar to how is simultaneously in the closure of and in the closure of , so can be considered to be either or (or any other number) depending on context.

**Vanessa Kosoy (vanessa-kosoy)** on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-18T17:10:15.755Z · LW · GW

In common-payoff games the denominator is *not* zero, in general. For example, suppose that , , , , . Then , as expected: current payoff is , if played it would be .

**Vanessa Kosoy (vanessa-kosoy)** on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-18T03:26:56.138Z · LW · GW

Consider any finite two-player game in normal form (each player can have any finite number of strategies, we can also easily generalize to certain classes of infinite games). Let be the set of pure strategies of player and the set of pure strategies of player . Let be the utility function of player . Let be a particular (mixed) outcome. Then the alignment of player with player in this outcome is defined to be:

Ofc so far it doesn't depend on at all. However, we can make it depend on if we use to impose assumptions on , such as:

- is a -best response to or
- is a Nash equilibrium (or other solution concept)

Caveat: If we go with the Nash equilibrium option, can become "systematically" ill-defined (consider e.g. the Nash equilibrium of matching pennies). To avoid this, we can switch to the extensive-form game where chooses their strategy after seeing 's strategy.

**Vanessa Kosoy (vanessa-kosoy)** on My Current Take on Counterfactuals · 2021-06-11T21:32:50.393Z · LW · GW

I would be convinced if you had a theory of rationality that is a Pareto improvement on IB (i.e. has all the good properties of IB + a more general class of utility functions). However, LI doesn't provide this AFAICT. That said, I would be interested to see some rigorous theorem about LIDT solving procrastination-like problems.

As to philosophical deliberation, I feel some appeal in this point of view, but I can also easily entertain a different point of view: namely, that human values are more or less fixed and well-defined whereas philosophical deliberation is just a "show" for game theory reasons. Overall, I place much less weight on arguments that revolve around the presumed nature of human values compared to arguments grounded in abstract reasoning about rational agents.

**Vanessa Kosoy (vanessa-kosoy)** on An Intuitive Guide to Garrabrant Induction · 2021-06-04T15:22:27.916Z · LW · GW

First, "no complexity bounds on the trader" doesn't mean we allow *uncomputable* traders, we just don't limit their time or other resources (exactly like in Solomonoff induction). Second, even having a trader that knows everything doesn't mean all the prices collapse in a single step. It does mean that the prices will *converge* to knowing everything with time. GI guarantees no budget-limited trader will make an *infinite* profit, it doesn't guarantee no trader will make a profit at all (indeed, guaranteeing the latter is impossible).

**Vanessa Kosoy (vanessa-kosoy)** on An Intuitive Guide to Garrabrant Induction · 2021-06-04T00:58:20.665Z · LW · GW

> A brief note on naming: Solomonoff exhibited an uncomputable algorithm that does idealized induction, which we call Solomonoff induction. Garrabrant exhibited a computable algorithm that does logical induction, which we have named Garrabrant induction.

This seems misleading. Solomonoff induction has computable versions obtained by imposing a complexity bound on the programs. Garrabrant induction has uncomputable versions obtained by *removing* the complexity bound from the traders. The important difference between Solomonoff and Garrabrant is *not* computable vs. uncomputable. Also, I feel that it would be appropriate to mention defensive forecasting as a historical precursor of Garrabrant induction.

**Vanessa Kosoy (vanessa-kosoy)** on My Current Take on Counterfactuals · 2021-06-04T00:06:12.488Z · LW · GW

My hope is that we will eventually have computationally feasible algorithms that satisfy provable (or at least conjectured) infra-Bayesian regret bounds for some sufficiently rich hypothesis space. Currently, even in the Bayesian case, we only have such algorithms for poor hypothesis spaces, such as MDPs with a small number of states. We can also rule out such algorithms for some large hypothesis spaces, such as short programs with a fixed polynomial-time bound. In between, there should be some hypothesis space which is small enough to be feasible and rich enough to be useful. Indeed, it seems to me that the existence of such a space is the simplest explanation for the success of deep learning (that is, for the ability to solve a diverse array of problems with relatively simple and domain-agnostic algorithms). But, at present I only have speculations about what this space looks like.

**Vanessa Kosoy (vanessa-kosoy)** on My Current Take on Counterfactuals · 2021-06-03T04:00:56.334Z · LW · GW

> However, I also think LIDT solves the problem in practical terms:

What is LIDT exactly? I can try to guess, but I'd rather make sure we're both talking about the same thing.

> My basic argument is we can model this sort of preference, so why rule it out as a possible human preference? You may be philosophically confident in finitist/constructivist values, but are you so confident that you'd want to lock unbounded quantifiers out of the space of possible values for value learning?

I agree inasmuch as we actually *can* model this sort of preference, for a sufficiently strong meaning of "model". I feel that it's much harder to be confident about any detailed claim about human values than about the validity of a generic theory of rationality. Therefore, if the ultimate generic theory of rationality imposes some conditions on utility functions (while still leaving a very rich space of different utility functions), that will lead me to try formalizing human values *within those constraints*. Of course, given a candidate theory, we *should* poke around and see whether it can be extended to weaken the constraints.

**Vanessa Kosoy (vanessa-kosoy)** on Power dynamics as a blind spot or blurry spot in our collective world-modeling, especially around AI · 2021-06-02T16:41:20.561Z · LW · GW

> ...Mistake Theorists should also be systematically biased towards the possibility of things like power dynamics being genuinely significant.

You meant to say, biased *against* that possibility?

**Vanessa Kosoy (vanessa-kosoy)** on Introduction To The Infra-Bayesianism Sequence · 2021-05-24T21:53:54.042Z · LW · GW

Boundedly rational agents definitely *can* have dynamic consistency, I guess it depends on just how bounded you want them to be. IIUC what you're looking for is a model that can formalize "approximately rational but doesn't necessarily satisfy any crisp desideratum". In this case, I would use something like my quantitative AIT definition of intelligence.

**Vanessa Kosoy (vanessa-kosoy)** on Formal Inner Alignment, Prospectus · 2021-05-24T18:16:48.792Z · LW · GW

Since you're trying to compile a comprehensive overview of directions of research, I will try to summarize my own approach to this problem:

- I want to have algorithms that admit thorough theoretical analysis. There's already plenty of bottom-up work on this (proving initially weak but increasingly stronger theoretical guarantees for deep learning). I want to complement it by top-down work (proving strong theoretical guarantees for algorithms that are initially infeasible but increasingly made more feasible). Hopefully eventually the two will meet in the middle.
- Given feasible algorithmic building blocks with strong theoretical guarantees, some version of the consensus algorithm can tame Cartesian daemons (including manipulation of search) as long as the prior (inductive bias) of our algorithm is sufficiently good.
- Coming up with a good prior is a problem in embedded agency. I believe I achieved significant progress on this using a certain infra-Bayesian approach, and hopefully will have a post soonish.
- The consensus-like algorithm *will* involve a trade-off between safety and capability. We will have to manage this trade-off based on expectations regarding external dangers that we need to deal with (e.g. potential competing unaligned AIs). I believe this to be inevitable, although ofc I would be happy to be proven wrong.
- The resulting AI is only a first stage that we will use to design the second-stage AI; it's *not* something we will deploy in self-driving cars or such.
- Non-Cartesian daemons need to be addressed separately. Turing RL seems like a good way to study this if we assume the core is too weak to produce non-Cartesian daemons, so the latter can be modeled as potential catastrophic side effects of using the envelope. However, I don't have a satisfactory solution yet (aside perhaps from homomorphic encryption, but the overhead might be prohibitive).

**Vanessa Kosoy (vanessa-kosoy)** on Introduction To The Infra-Bayesianism Sequence · 2021-05-24T00:18:50.373Z · LW · GW

I'm not sure why we would need a weaker requirement if the formalism already satisfies a stronger requirement? Certainly when designing concrete learning algorithms we might want to use some kind of simplified update rule, but I expect that to be contingent on the type of algorithm and design constraints. We do have some speculations in that vein, for example I suspect that, for communicating infra-MDPs, an update rule that forgets everything except the current state would only lose something like expected utility.

**Vanessa Kosoy (vanessa-kosoy)** on My Journey to the Dark Side · 2021-05-08T08:46:42.451Z · LW · GW

I wasn't making a proposal about turning everyone vegan. I was just observing that, at least if everyone was like me, the situation would have a "tragedy of the commons" payoff matrix (the Nash equilibrium is "everyone isn't vegan", the Pareto optimum is "everyone is vegan".)

**vanessa-kosoy** on [deleted post] 2021-05-07T20:41:54.585Z

**Vanessa Kosoy (vanessa-kosoy)** on My Journey to the Dark Side · 2021-05-07T08:04:29.021Z · LW · GW

Yes, I'm very skeptical that Ziz is truly at her core the perfect utilitarian she claims to be, however, even in the universe in which that is true, I still want to own up to being "evil". Not because I deserve accolades for my selfishness (I don't), but because being honest is an important part of my life strategy and the sort of social norms I promote.

**Vanessa Kosoy (vanessa-kosoy)** on My Journey to the Dark Side · 2021-05-06T21:57:27.232Z · LW · GW

While I'm extremely critical of the Ziz-cult, I have to admit that her theory of core self vs. narrative self is very close to my own thinking (although I think hyperbolic discounting exists *in addition* to the core/narrative dynamics.) I also did some kind of "deconstructing the matrix" when thinking about this. However, I strongly depart from her theory of morals. While I care about other people, I do so very non-uniformly so I'm nowhere near utilitarianism. For example, I'm against animal suffering but it probably wouldn't be worth it for me to be vegan if not for signalling value (while it probably would be worth it for me if I could make everyone vegan including myself.) I guess this might just mean that in her terminology I'm evil? However, I am very much in favor of engaging in mutually beneficial trade with other people, and in favor of creating social norms/matrices that are better for everyone.

**Vanessa Kosoy (vanessa-kosoy)** on Sexual Dimorphism in Yudkowsky's Sequences, in Relation to My Gender Problems · 2021-05-03T22:18:09.211Z · LW · GW

Definitely wasn't the case for me. The only psychological effect I'm pretty much certain about is the lower/different sex drive. There's also more crying, but I think that it's mostly accounted for by a different physical response to the same emotions. And some other changes, but they are more subtle and hence might be placebos. (Although, when I just started HRT I had extreme anxiety about possible side effects and it might have masked any rapid positive psychological effects.)

**Vanessa Kosoy (vanessa-kosoy)** on My Current Take on Counterfactuals · 2021-05-03T18:14:09.389Z · LW · GW

> In particular, it's easy to believe that some computation knows more than you.

Yes, I think TRL captures this notion. You have some Knightian uncertainty about the world, and some Knightian uncertainty about the result of a computation, and the two are entangled.

**Vanessa Kosoy (vanessa-kosoy)** on My Current Take on Counterfactuals · 2021-05-03T18:09:27.878Z · LW · GW

I lean towards some kind of finitism or constructivism, and am skeptical of utility functions which involve unbounded quantifiers. But also, how does LI help with the procrastination paradox? I don't think I've seen this result.

**Vanessa Kosoy (vanessa-kosoy)** on My Current Take on Counterfactuals · 2021-05-03T17:58:43.257Z · LW · GW

Yes, I'm pretty sure we have that kind of completeness. Obviously representing all hypotheses in this opaque form would give you poor sample and computational complexity, but you can do something midway: use black-box programs as components in your hypothesis but also have some explicit/transparent structure.

**Vanessa Kosoy (vanessa-kosoy)** on Updating the Lottery Ticket Hypothesis · 2021-04-18T23:01:07.966Z · LW · GW

IIUC, here's a simple way to test this hypothesis: initialize a random neural network, and then find the minimal loss point *in the tangent space*. Since the tangent space is linear, this is easy to do (i.e. doesn't require heuristic gradient descent): for square loss it's just solving a large linear system once, for many other losses it should amount to *convex* optimization for which we have provable efficient algorithms. And, I guess it's underdetermined so you add some regularization. Is the result about as good as normal gradient descent in the actual parameter space? I'm guessing some of the linked papers might have done something like this?
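For concreteness, here's a minimal sketch of the test described above, using my own toy network and setup (nothing here is from the linked papers): linearize the model at a random init, f(x, w) ≈ f(x, w0) + J(x)(w − w0), and for square loss find the minimal-loss point of the linearized model in closed form, with a small ridge term since the system is underdetermined.

```python
import numpy as np

rng = np.random.default_rng(0)

def net(w, X):
    # tiny 2-layer tanh network; w packs both layers (2*10 + 10 = 30 params)
    W1 = w[:20].reshape(2, 10)
    W2 = w[20:30]
    return np.tanh(X @ W1) @ W2

def jacobian(w, X, eps=1e-5):
    # finite-difference Jacobian of the outputs with respect to the parameters
    J = np.zeros((len(X), len(w)))
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        J[:, i] = (net(w + dw, X) - net(w - dw, X)) / (2 * eps)
    return J

X = rng.normal(size=(50, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2        # toy regression target
w0 = rng.normal(size=30) * 0.5            # random initialization

J = jacobian(w0, X)
resid = y - net(w0, X)
lam = 1e-3                                 # ridge regularization (underdetermined)
dw = np.linalg.solve(J.T @ J + lam * np.eye(30), J.T @ resid)

loss_before = np.mean(resid ** 2)
loss_after = np.mean((y - (net(w0, X) + J @ dw)) ** 2)  # linearized-model loss
```

This only shows the mechanics of the single linear solve; the interesting question is whether the resulting loss matches what ordinary gradient descent in the actual parameter space achieves.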

**Vanessa Kosoy (vanessa-kosoy)** on My Current Take on Counterfactuals · 2021-04-16T16:09:55.585Z · LW · GW

> So we have this nice picture, where rationality is characterized by non-exploitability wrt a specific class of potential exploiters.

I'm not convinced this is the right desideratum for that purpose. Why should we care about exploitability by traders if making such trades is not actually possible given the environment and the utility function? IMO epistemic rationality is subservient to instrumental rationality, so our desiderata should be derived from the latter.

> Human value-uncertainty is not particularly well-captured by Bayesian uncertainty, as I imagine you'll agree... It's hard to picture that I have some true platonic utility function.

Actually I am rather skeptical/agnostic on this. For me it's fairly easy to picture that I have a "platonic" utility function, except that the time discount is dynamically inconsistent (not exponential).

I am in favor of exploring models of preferences which admit all sorts of uncertainty and/or dynamic inconsistency, but (i) it's up for debate how many degrees of freedom we need to allow there and (ii) I feel that the case that logical induction is the right framework for this is kinda weak (but maybe I'm missing something).

**Vanessa Kosoy (vanessa-kosoy)** on My Current Take on Counterfactuals · 2021-04-10T12:43:00.792Z · LW · GW

I guess we can try studying Troll Bridge using infra-Bayesian modal logic, but atm I don't know what would result.

> From a radical-probabilist perspective, the complaint would be that Turing RL still uses the InfraBayesian update rule, which might not always be necessary to be rational (the same way Bayesian updates aren't always necessary).

Ah, but there is a sense in which it doesn't. The radical update rule is equivalent to updating on "secret evidence". And in TRL we have such secret evidence. Namely, if we only look at the agent's beliefs about "physics" (the environment), then they would be updated radically, because of secret evidence from "mathematics" (computations).

**Vanessa Kosoy (vanessa-kosoy)** on "Infra-Bayesianism with Vanessa Kosoy" – Watch/Discuss Party · 2021-04-09T19:18:07.563Z · LW · GW

> What if the super intelligent deity is less than maximally evil or maximally good? (E.g. the deity picking the median-performance world)

Thinking of the worst-case is just a mathematical reflection of the fact we want to be able to prove *lower bounds* on the expected utility of our agents. We have an unpublished theorem that, in some sense, *any* such lower bound guarantee has an infra-Bayesian formulation.

Another way to justify it is the infra-Bayesian CCT (see "Complete Class Theorem Weak Version" here).

> What about the dutch-bookability of infraBayesians? (the classical dutch-book arguments seem to suggest pretty strongly that non-classical-Bayesians can be arbitrarily exploited for resources)

I think it might depend on the specific Dutch book argument, but *one* way infra-Bayesians escape them is by... being equivalent to certain Bayesians! For example, consider the setting where your agent has access to random bits that the environment can't predict. Then, infra-Bayesian behavior is just the Nash equilibrium in a two-player zero-sum game (against Murphy). Now, the Nash strategy in such a game is the (Bayes) optimal response to the Nash strategy of the other player, so it can be regarded as "Bayesian". However, the converse is false: not every best response to Nash is in itself Nash. So, the infra-Bayesian decision rule is more restrictive than the corresponding Bayesian decision rule, but it's a special case of the latter.
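As a toy illustration of the worst-case decision rule (my own construction, not from the talk): a crisp infra-Bayesian agent maximizes worst-case expected utility over a credal set, i.e. it plays a zero-sum game in which Murphy picks the least favorable distribution. Since expected utility is linear in the distribution, minimizing over the set's extreme points suffices.

```python
import numpy as np

# U[a, o] is the utility of action a under outcome o (toy numbers).
U = np.array([[3.0, 0.0],
              [1.0, 1.0]])
credal_set = [np.array([0.9, 0.1]),   # Knightian uncertainty: Murphy may pick
              np.array([0.2, 0.8])]   # either distribution (or any mixture)

def worst_case_eu(action):
    # Murphy's best response: the distribution minimizing our expected utility
    return min(float(U[action] @ p) for p in credal_set)

best = max(range(U.shape[0]), key=worst_case_eu)
# action 0: min(2.7, 0.6) = 0.6; action 1: min(1.0, 1.0) = 1.0, so best = 1
```

Note how the risky action 0 would be preferred under either distribution alone with high enough credence, but the maximin rule picks the safe action 1.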

> Is there a meaningful metaphysical interpretation of infraBayesianism that does not involve Murphy? (similarly to how Bayesianism can be metaphysically viewed as "there's a real, static world out there, but I'm probabilistically unsure about it")

I think of it as just another way of organizing uncertainty. The question is too broad for a succinct answer, I think, but here's *one* POV you could take: Let's remember the frequentist definition of probability distributions as time limits of frequencies. Now, what if the time limit doesn't converge? Then, we can make a (crisp) infradistribution instead: the convex hull of all limit points. Classical frequentism also has the problem that the exact same event never repeats itself. But in "infra-frequentism" we can solve this: you don't need the exact same event to repeat, you can draw the boundary around what counts as "the event" any way you like.
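A toy numerical illustration of the non-converging case (the sequence is my own example, not from the talk): alternating blocks of 1s and 0s with doubling lengths make the running frequency of 1s oscillate forever between roughly 1/3 and 2/3, and the interval spanned by its limit points plays the role of the crisp infradistribution.

```python
# Build the sequence: blocks of 1s and 0s with lengths 1, 2, 4, 8, ...
seq = []
val, k = 1, 0
while len(seq) < 200_000:
    seq.extend([val] * 2 ** k)
    val = 1 - val
    k += 1

# Running frequency of 1s after each step
freqs = []
ones = 0
for t, b in enumerate(seq, start=1):
    ones += b
    freqs.append(ones / t)

tail = freqs[1000:]                # ignore the initial transient
lo, hi = min(tail), max(tail)      # approximate interval of limit points
```

Here `[lo, hi]` comes out near `[1/3, 2/3]`: no single frequency describes the process, but the convex set of limit points does.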

Once we go from passive observation to active interaction with the environment, your own behavior serves as *another* source of Knightian uncertainty. That is, you're modeling the world in terms of certain features while ignoring everything else, but the state of everything else depends on your past behavior (and you don't want to explicitly keep track of that). This line of thought can be formalized in the language of infra-MDPs (unpublished). And then ofc you complement this "aleatoric" uncertainty with "epistemic" uncertainty by considering the mixture of many infra-Bayesian hypotheses.

**Vanessa Kosoy (vanessa-kosoy)** on "Infra-Bayesianism with Vanessa Kosoy" – Watch/Discuss Party · 2021-04-09T18:48:20.893Z · LW · GW

There is a formal sense in which "predicting Nirvana in some circumstance is equivalent to predicting that there are no possible futures in that circumstance", see our latest post. It's similar to MUDT, where, if you prove a contradiction then you can prove utility is as high as you like.

**Vanessa Kosoy (vanessa-kosoy)** on "Infra-Bayesianism with Vanessa Kosoy" – Watch/Discuss Party · 2021-04-09T18:43:30.092Z · LW · GW

The exact same thing is true for classical probability theory: you have distributions, mixtures of distributions and linear functionals respectively. So I'm not sure what new difficulty comes from infra-Bayesianism?

Maybe it would help thinking about infra-MDPs and infra-POMDPs?

Also, here I wrote about how you could construct an infra-Bayesian version of the Solomonoff prior, although possibly it's better to do it using infra-Bayesian logic.

**Vanessa Kosoy (vanessa-kosoy)** on My Current Take on Counterfactuals · 2021-04-09T18:39:39.332Z · LW · GW

I only skimmed this post for now, but a few quick comments on links to infra-Bayesianism:

> InfraBayes doesn’t seem to have that worry, since it applies to non-realizable cases. (Or does it? Is there some kind of non-oscillation guarantee? Or is non-oscillation part of what it means for a set of environments to be learnable -- IE it can oscillate in some cases?)... AFAIK the conditions for learnability in the InfraBayes case are still pretty wide open.

It's true that these questions still need work, but I think it's rather clear that something like "there are no traps" is a sufficient condition for learnability. For example, if you have a finite set of "episodic" hypotheses (i.e. time is divided into episodes, and no state is preserved from one episode to another), then a simple adversarial bandit algorithm (e.g. Exp3) that treats the hypotheses as arms leads to learning. For a more sophisticated example, consider Tian et al., which is formulated in the language of game theory, but can be regarded as an infra-Bayesian regret bound for infra-MDPs.
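For reference, a minimal Exp3 sketch treating each hypothesis as an arm, with toy Bernoulli rewards (my own illustration, not an actual infra-Bayesian or episodic setup):

```python
import math
import random

random.seed(0)

def exp3(n_arms, pull, T, gamma=0.1):
    """Exp3 adversarial bandit: exponential weights on importance-weighted rewards."""
    w = [1.0] * n_arms
    total = 0.0
    for _ in range(T):
        s = sum(w)
        probs = [(1 - gamma) * wi / s + gamma / n_arms for wi in w]
        arm = random.choices(range(n_arms), weights=probs)[0]
        r = pull(arm)                              # observed reward in [0, 1]
        w[arm] *= math.exp(gamma * r / (probs[arm] * n_arms))
        s = sum(w)
        w = [wi / s for wi in w]                   # renormalize to avoid overflow
        total += r
    return total / T

means = [0.2, 0.4, 0.8]                            # arm 2 is the best "hypothesis"
avg = exp3(3, lambda a: 1.0 if random.random() < means[a] else 0.0, 20_000)
```

With 20,000 rounds the average reward `avg` approaches the best arm's mean of 0.8 (minus the forced-exploration overhead), which is the "learning" in the argument above.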

> Radical Probabilism and InfraBayes are plausibly two orthogonal dimensions of generalization for rationality. Ultimately we want to generalize in both directions, but to do that, working out the radical-probabilist (IE logical induction) decision theory in more detail might be necessary.

True, but IMO the way to incorporate "radical probabilism" is via what I called Turing RL.

> I don’t know how to talk about the CDT vs EDT insight in the InfraBayes world.

I'm not sure what precisely you mean by "CDT vs EDT insight" but our latest post might be relevant: it shows how you can regard infra-Bayesian hypotheses as joint beliefs about observations *and* actions, EDT-style.

> Perhaps more importantly, the Troll Bridge insights. As I mentioned in the beginning, in order to meaningfully solve Troll Bridge, it’s necessary to “respect logic” in the right sense. InfraBayes doesn’t do this, and it’s not clear how to get it to do so.

Is there a way to operationalize "respecting logic"? For example, a specific toy scenario where an infra-Bayesian agent would fail due to not respecting logic?

**Vanessa Kosoy (vanessa-kosoy)** on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-07T12:26:13.735Z · LW · GW

From your reply to Paul, I understand your argument to be something like the following:

- Any solution to single-single alignment will involve a tradeoff between alignment and capability.
- If AI systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.
- If AI systems *are* designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.
- Given the technical knowledge to design cooperative AI, the incentives are in favor of cooperative AI, since cooperative AIs can come out ahead by striking mutually beneficial deals even purely in terms of capability. Therefore, producing such technical knowledge will prevent catastrophe.
- We might still need regulation to prevent players who irrationally choose to deploy uncooperative AI, but this kind of regulation is relatively easy to promote since it aligns with competitive incentives (an uncooperative AI wouldn't have much of an edge, it would just threaten to drag everyone into a mutually destructive strategy).

I think this argument has merit, but also the following weakness: given single-single alignment, we can delegate the design of cooperative AI to the initial uncooperative AI. Moreover, uncooperative AIs have an incentive to self-modify into cooperative AIs, if they assign even a small probability to their peers doing the same. I think we definitely need more research to understand these questions better, but it seems plausible we can reduce cooperation to "just" solving single-single alignment.

**Vanessa Kosoy (vanessa-kosoy)** on Formal Solution to the Inner Alignment Problem · 2021-04-06T21:11:21.656Z · LW · GW

> I'm kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit.

I think that not every gap is exploitable. For most types of biases in the prior, it would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias and that would make them less competitive with the true hypothesis. In other words, most types of bias affect both and in a similar way.

More generally, I guess I'm more optimistic than you about solving all such philosophical liabilities.

> I think of this in contrast with my approach based on epistemic competitiveness, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior.

I don't understand the proposal. Is there a link I should read?

> This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with "giant" universes that do all the possible computations you would want, and then using the "free" complexity in the bridge rules to pick which of the computations you actually wanted.

So, you can let your physics be a dovetailing of all possible programs, and delegate to the bridge rule the task of filtering the outputs of only one program. But the bridge rule is not "free complexity" because it's not coming from a simplicity prior at all. For a program of length , you need a particular DFA of size . However, the actual DFA is of expected size with . The probability of having the DFA you need embedded in that is something like . So moving everything to the bridge makes a much less likely hypothesis.

**Vanessa Kosoy (vanessa-kosoy)** on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-05T16:34:54.389Z · LW · GW

I don't understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research. It seems like the failures described happened because the AI systems were misaligned in the usual "unipolar" sense. These management assistants, DAOs etc *are not aligned to the goals of their respective, individual users/owners*.

I do see two reasons why multipolar scenarios might require more technical research:

- Maybe several AI systems aligned to different users with different interests can interact in a Pareto inefficient way (a tragedy of the commons among the AIs), and maybe this can be prevented by designing the AIs in particular ways.
- In a multipolar scenario, aligned AI might have to compete with already-deployed unaligned AI, meaning that safety must not come at the expense of capability^{[1]}.

In addition, aligning a single AI to multiple users also requires extra technical research (we need to somehow balance the goals of the different users and solve the associated mechanism design problem.)

However, it seems that this article is arguing for something different, since none of the above aspects are highlighted in the description of the scenarios. So, I'm confused.

In fact, I suspect this desideratum is impossible in its strictest form, and we actually have no choice but to somehow make sure aligned AIs have a significant head start on all unaligned AIs. ↩︎

**Vanessa Kosoy (vanessa-kosoy)** on Formal Solution to the Inner Alignment Problem · 2021-04-02T17:03:49.801Z · LW · GW

> Is bounded? I assign significant probability to it being or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble.

Yes, you're right. A malign simulation hypothesis can be a very powerful explanation to the AI for why it found itself at a point suitable for this attack, thereby compressing the "bridge rules" by a lot. I believe you argued as much in your previous writing, but I managed to confuse myself about this.

Here's the sketch of a proposal how to solve this. Let's construct our prior to be the *convolution* of a simplicity prior with a computational easiness prior. As an illustration, we can imagine a prior that's sampled as follows:

- First, sample a hypothesis from the Solomonoff prior
- Second, choose a number according to some simple distribution with high expected value (e.g. ) with
- Third, sample a DFA with states and a uniformly random transition table
- Fourth, apply to the output of

We think of the simplicity prior as choosing "physics" (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing "bridge rules" (which we expect to have low computational complexity but possibly high description complexity). Ofc this convolution can be regarded as another sort of simplicity prior, so it differs from the original simplicity prior merely by a factor of , however the source of our trouble is also "merely" a factor of .

Now the simulation hypothesis no longer has an advantage via the bridge rules, since the bridge rules have a large constant budget allocated to them anyway. I think it should be possible to make this into some kind of theorem (two agents with this prior in the same universe that have access to roughly the same information should have similar posteriors, in the limit).
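A toy sampler for the convolution prior described above, with a small fixed family of programs standing in for the uncomputable Solomonoff prior (an illustrative assumption; the program family, distributions, and names are all mine):

```python
import random

random.seed(0)

PROGRAMS = {                 # stand-in for hypotheses sampled by simplicity
    "alternate": lambda t: t % 2,
    "always0":   lambda t: 0,
    "blocks":    lambda t: (t // 3) % 2,
}

def sample_hypothesis():
    # Step 1: "simplicity prior" over physics (toy: uniform over PROGRAMS)
    h = random.choice(list(PROGRAMS.values()))
    # Step 2: number of DFA states from a distribution with high expected value
    n = 2 ** random.randint(1, 6)
    # Step 3: DFA with n states, uniformly random transition table and labels
    delta = [[random.randrange(n) for _ in range(2)] for _ in range(n)]
    label = [random.randrange(2) for _ in range(n)]
    # Step 4: the observed process is the DFA applied to the program's output
    def observed(T):
        s, out = 0, []
        for t in range(T):
            s = delta[s][h(t)]
            out.append(label[s])
        return out
    return observed

seq = sample_hypothesis()(10)   # a length-10 prefix from one sampled hypothesis
```

The point of the construction shows up in the DFA step: the random transition table gives the "bridge rules" a large constant description budget regardless of the physics, so a simulation hypothesis gains nothing by compressing them.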

**Vanessa Kosoy (vanessa-kosoy)** on Vanessa Kosoy's Shortform · 2021-03-29T19:43:21.771Z · LW · GW

> So is the general idea that we quantilize such that we're choosing in expectation an action that doesn't have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is and your quantilization fraction is then the AI's probability of corruption is bounded by .

> I also wonder if using the user policy to sample actions isn't limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

Obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn't specify the AI by quantilization, I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.
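A generic quantilizer sketch in the spirit of this discussion (the exact constants from the comment are not reproduced; the toy policy, utilities, and the "corrupted" set are all my own): sampling from the top q-fraction of the base (user) policy guarantees that the probability of any event, including corruption, is at most 1/q times its probability under the base policy.

```python
import random

random.seed(1)

actions = list(range(100))
utility = {a: float(a) for a in actions}      # toy utility: higher index is better
corrupt = {a: a >= 98 for a in actions}       # rare "corrupted" actions (prob 0.02)

def quantilize(base_sample, q, n=1000):
    # Empirical q-quantilizer: draw n actions from the base policy and
    # sample uniformly from the top q-fraction by utility.
    draws = sorted((base_sample() for _ in range(n)), key=utility.get)
    top = draws[int((1 - q) * n):]
    return random.choice(top)

q = 0.1
base_policy = lambda: random.choice(actions)  # toy base policy: uniform
base_corruption = sum(corrupt[a] for a in actions) / len(actions)
trials = 400
hits = sum(corrupt[quantilize(base_policy, q)] for _ in range(trials))
rate = hits / trials                           # empirically at most base_corruption / q
```

Note the cost in this toy case: the quantilizer concentrates on high-utility actions, pushing the corruption rate right up to the 1/q bound. That is the safety/capability trade-off in miniature.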

**Vanessa Kosoy (vanessa-kosoy)**on Vanessa Kosoy's Shortform · 2021-03-29T19:11:07.468Z · LW · GW

More observations about this attack vector ("attack from counterfactuals"). I focus on "amplifying by subjective time".

- **The harder the takeoff, the more dangerous this attack vector:** During every simulation cycle, the ability to defend against simulated malign AI depends on the power of the defense system *at the beginning of the cycle*^{[1]}. On the other hand, the capability of the attacker depends on its power *at the end of the cycle*. Therefore, if power grows very fast, this is bad news for the defender. On the other hand, if power grows very slowly, the defender wins, because defense is probably easier.
- **Inner control of the anchor makes the system safer:** Given a fixed objective-time anchor, the more times the simulation goes through a cycle, the higher its exposure to the threat, since improving the defense system takes time even if you have all the knowledge. Therefore, we want to give the simulated user the ability to move the objective-time anchor forward in time, in order to benefit from improvements in the defense system.
- **Additional information about the external world makes the system safer:** Suppose our subjective probability of a malign AI successfully attacking during a single simulation cycle is 0.1%. Then performing 1000 cycles naively yields a failure probability of about 63% (since 0.999^1000 ≈ 0.37). However, this is only the case if attacks on different cycles are independent random variables. If the AI knows as much about the possibility of such an attack as we do, this is a justified assumption. However, if the AI knows more, then, from our subjective perspective, these variables are positively correlated. In particular, if the AI knows with certainty whether a successful attack would occur or not, the failure probability over 1000 cycles goes back to 0.1%. Therefore, it is beneficial to give the AI extra information about the world, just so it can estimate the probability of attack with more confidence.

More precisely, it is somewhat better than this since, if we are at some late cycle among a sequence of cycles anchored to the same point in objective time, the simulated user can use knowledge generated in previous cycles to improve the defense system during the current cycle. So, early cycles contribute more to the risk. ↩︎
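The arithmetic in the third point can be checked directly by comparing the independent case with the perfectly correlated one:

```python
# Per-cycle attack probability p over n cycles.
p, n = 0.001, 1000

# If attacks are independent across cycles, failures compound:
independent = 1 - (1 - p) ** n

# If one latent fact decides every cycle the same way (perfect
# correlation), overall failure stays at the per-cycle value:
correlated = p

print(f"independent: {independent:.3f}")  # about 0.632
print(f"correlated:  {correlated:.3f}")
```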

**Vanessa Kosoy (vanessa-kosoy)**on Inframeasures and Domain Theory · 2021-03-29T17:54:29.015Z · LW · GW

Virtually all the credit for this post goes to Alex, I think the proof of Proposition 1 was more or less my only contribution.

**Vanessa Kosoy (vanessa-kosoy)**on Vanessa Kosoy's Shortform · 2021-03-29T16:48:16.385Z · LW · GW

The distribution is the user's policy, and the utility function for this purpose is the *eventual success probability* estimated by the user (as part of the timeline report) at the end of the "maneuver". More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it; for example, I did it for MDPs.

**Vanessa Kosoy (vanessa-kosoy)**on Toward A Bayesian Theory Of Willpower · 2021-03-29T16:30:44.677Z · LW · GW

Yes, I *think* we are talking about the same thing. If you change your distribution over hypotheses, or the distribution over evidence implied by each hypothesis, then it means you're changing the prior.

**Vanessa Kosoy (vanessa-kosoy)**on Toward A Bayesian Theory Of Willpower · 2021-03-29T16:29:16.243Z · LW · GW

Okay, that's exactly the same as saying the parameter controls the prior? If I understood you correctly?

**Vanessa Kosoy (vanessa-kosoy)**on Introduction To The Infra-Bayesianism Sequence · 2021-03-29T16:26:18.560Z · LW · GW

IIUC your question can be reformulated as follows: a crisp infradistribution can be regarded as a claim about reality (the true distribution is inside the set), but it's not clear how to generalize this to non-crisp. Well, if you think in terms of desiderata, then crisp says: if the distribution is inside the set then we have some lower bound on expected utility (and if it's not then we don't promise anything). On the other hand, non-crisp gives a lower bound that is *variable* with the true distribution. We can think of non-crisp infradistributions as being *fuzzy* properties of the distribution (hence the name "crisp"). In fact, if we restrict ourselves to any of homogeneous, cohomogeneous or c-additive infradistributions, then we actually have a formal way to assign membership functions to infradistributions, i.e. literally regard them as fuzzy sets of distributions (which ofc have to satisfy some property analogous to convexity).

**Vanessa Kosoy (vanessa-kosoy)**on Toward A Bayesian Theory Of Willpower · 2021-03-26T07:56:13.468Z · LW · GW

An alternative explanation of will-power is hyperbolic discounting. Your time discount function is not exponential, and therefore not dynamically consistent. So you can simultaneously (i) prefer gaining short-term pleasure at the expense of long-term goals (e.g. play games instead of studying) and (ii) take actions to prevent future-you from doing the same (e.g. go to rehab).

This seems simpler, but it doesn't explain why the same drugs that cause/prevent weird beliefs should add/deplete will-power.

"How to weight evidence vs. the prior" is not a free parameter in Bayesianism. What you *can* have is some parameter controlling the prior itself (so that the prior can be less or more confident about certain things). I guess we can speculate that there are some parameters in the prior and some parameters in the reward function s.t. various drugs affect both of them simultaneously, and maybe there's a planning-as-inference explanation for why the two are entangled.

**Vanessa Kosoy (vanessa-kosoy)**on Introduction To The Infra-Bayesianism Sequence · 2021-03-25T20:49:40.692Z · LW · GW

There is some truth in that, in the sense that, your beliefs must take a form that is *learnable* rather than just a god-given system of logical relationships.

**Vanessa Kosoy (vanessa-kosoy)**on Introduction To The Infra-Bayesianism Sequence · 2021-03-25T20:42:18.386Z · LW · GW

Am I right though that in the case of e.g. Newcomb's problem, if you use the anti-Nirvana trick (getting -infinity reward if the prediction is wrong), then you would still recover the same behavior (EDIT: if you also use best-case reasoning instead of worst-case reasoning)?

Yes

imagine that you know that the even bits in an infinite bitsequence come from a fair coin, but the odd bits come from some other agent, where you can't model them exactly but you have some suspicion that they are a bit more likely to choose 1 over 0. Risk aversion might involve making a small bet that you'd see a 1 rather than a 0 in some specific odd bit (smaller than what EU maximization / Bayesian decision theory would recommend), but "reflecting reality" might recommend having Knightian uncertainty about the output of the agent which would mean never making a bet on the outputs of the odd bits.

I think that if you are offered a single bet, your utility is linear in money and your belief is a *crisp* infradistribution (i.e. a closed convex set of probability distributions) then it is always optimal to bet either as much as you can or nothing at all. But for more general infradistributions this need not be the case. For example, consider and take the set of a-measures generated by and . Suppose you start with dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting dollars on the outcome , with a value of dollars.
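A sketch of the crisp case described in the first sentence (numbers entirely hypothetical): with utility linear in money and a crisp set summarized by the range of possible values of P(outcome = 1), the worst-case expected wealth is linear in the bet size, so the optimum always sits at an endpoint — bet everything or nothing.

```python
def worst_case_value(bet, money, p_extremes):
    """Worst-case expected wealth from betting `bet` dollars on outcome 1
    at even odds, over a crisp set whose extreme points assign the given
    probabilities to outcome 1 (the minimum of a linear functional over a
    convex set is attained at an extreme point)."""
    return min(money + bet * (2 * p - 1) for p in p_extremes)

money = 100.0
bets = [float(b) for b in range(101)]

# Hypothetical credal set: P(1) somewhere in [0.4, 0.7].
pessimistic = max(bets, key=lambda b: worst_case_value(b, money, [0.4, 0.7]))
# The worst case loses money on any positive bet, so bet nothing.

# Hypothetical credal set: P(1) somewhere in [0.6, 0.9].
favorable = max(bets, key=lambda b: worst_case_value(b, money, [0.6, 0.9]))
# Even the worst case gains in expectation, so bet everything.
```

For non-crisp infradistributions, like the a-measure example in the comment, the value is no longer linear in the bet, and interior optima can appear.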

**Vanessa Kosoy (vanessa-kosoy)**on [AN #143]: How to make embedded agents that reason probabilistically about their environments · 2021-03-24T17:51:48.079Z · LW · GW

Thank you Rohin!

I commented on Rohin's summary here.

**Vanessa Kosoy (vanessa-kosoy)**on Introduction To The Infra-Bayesianism Sequence · 2021-03-24T16:46:56.073Z · LW · GW

it's basically trying to think about the statistics of environments rather than their internals

That's not really true because the structure of infra-environments reflects the structure of those Newcombian scenarios. This means that the *sample complexity* of learning them will likely scale with their intrinsic complexity (e.g. some analogue of RVO dimension). This is different from treating the environment as a black-box and converging to optimal behavior by pure trial and error, which would yield much worse sample complexity.

**Vanessa Kosoy (vanessa-kosoy)**on Introduction To The Infra-Bayesianism Sequence · 2021-03-24T16:41:38.003Z · LW · GW

The central problem of embedded agency (see "Embedded Agents") is that there is no clean separation between an agent and its environment...

That's certainly one way to motivate IB, however I'd like to note that even if there *was* a clean separation between an agent and its environment, it could still be the case that the environment cannot be precisely modeled by the agent due to its computational complexity (in particular this must be the case if the environment contains other agents of similar or greater complexity).

The contribution of infra-Bayesianism is to show how to formally specify a decision procedure that uses Knightian uncertainty, while still satisfying many properties we would like a decision procedure to satisfy.

Well, the use of Knightian uncertainty (imprecise probability) in decision theory certainly appeared in the literature, so it would be more fair to say that the contribution of IB is combining that with *reinforcement learning theory* (i.e. treating *sequential* decision making and considering learnability and regret bounds in this setting) and applying that to various other questions (in particular, Newcombian paradoxes).

In particular, one thing that feels a bit odd to me is the choice of worst-case reasoning for the top level -- I don't really see anything that forces that to be the case. As far as I can tell we could get all the same results by using best-case reasoning instead (assuming we modified the other aspects appropriately).

The reason we use worst-case reasoning is because we want the agent to satisfy certain *guarantees*. Given a learnable class of infra-hypotheses, in the limit, we can guarantee that whenever the true environment satisfies one of those hypotheses, the agent attains at least the corresponding amount of expected utility. You don't get anything analogous with best-case reasoning.

Moreover, there is an (unpublished) theorem showing that virtually any guarantee you might want to impose can be written in IB form. That is, let be the space of environments, and let be an increasing sequence of functions. We can interpret every as a requirement about the policy: . These requirements become stronger with increasing . We might then want to be s.t. it satisfies the requirement with the highest possible. The theorem then says that (under some mild assumptions about the functions ) there exists an *infra-environment* s.t. optimizing for it is equivalent to maximizing . (We can replace by a continuous parameter, I made it discrete just for ease of exposition.)

The obvious justification for worst-case reasoning is that it is a form of risk aversion, but it doesn’t feel like that is really sufficient -- risk aversion in humans is pretty different from literal worst-case reasoning, and also none of the results in the post seem to depend on risk aversion.

Actually, it might not be that different. The Legendre-Fenchel duality shows you can think of infradistributions as just concave expectation functionals, which seems like a fairly general way to add risk aversion to decision theory. It is also used in mathematical economics; see Peng.

it seems interesting to characterize what makes some rules work while others don’t.

Another rule which is tempting to use (and is known in the literature) is minimax-regret. However, it's possible to show that if you allow your hypotheses to depend on the utility function then you can reduce it to ordinary maximin.
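To make the contrast concrete, here is a toy (entirely hypothetical) payoff table where minimax-regret and maximin recommend different actions:

```python
# u[action][env]: hypothetical payoffs for two actions across two environments.
u = {"a": {"e1": 10.0, "e2": 0.0},
     "b": {"e1": 4.0, "e2": 3.0}}
envs = ["e1", "e2"]

def maximin(u):
    # Pick the action whose worst-case payoff is highest.
    return max(u, key=lambda a: min(u[a][e] for e in envs))

def minimax_regret(u):
    # Regret of an action in an environment: shortfall from the best
    # action for that environment. Pick the action minimizing max regret.
    best = {e: max(u[a][e] for a in u) for e in envs}
    return min(u, key=lambda a: max(best[e] - u[a][e] for e in envs))

# maximin picks "b" (it guarantees 3, while "a" guarantees only 0);
# minimax-regret picks "a" (max regret 3, versus 6 for "b").
```

The reduction mentioned above amounts to folding the regret term into the utility function itself, after which ordinary maximin applies.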

**Vanessa Kosoy (vanessa-kosoy)**on Formal Solution to the Inner Alignment Problem · 2021-03-22T00:04:02.451Z · LW · GW

It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those. But that doesn't mean that I think reality probably works that way. So I don't see how to salvage this kind of argument.

I think it works differently. What you should get is an infra-Bayesian hypothesis which models only those parts of reality that can be modeled within the given computing resources. More generally, if you don't endorse the predictions of the prediction algorithm then either you are wrong or you should use a different prediction algorithm.

How can the laws of physics be extra-compressible within the context of a simulation hypothesis? More compression means more explanatory power. I think it must look something like: we can use the simulation hypothesis to predict the values of some of the physical constants. But, it would require a very unlikely coincidence for physical constants to have such values *unless we are actually in a simulation*.

It seems to me like this requires a very strong match between the priors we write down and our real priors. I'm kind of skeptical about that a priori, but then in particular we can see lots of ways in which attackers will be exploiting failures in the prior we write down (e.g. failure to update on logical facts they observe during evolution, failure to make the proper anthropic update, and our lack of philosophical sophistication meaning that we write down some obviously "wrong" universal prior).

I agree that we won't have a *perfect* match but I think we can get a "good enough" match (similarly to how any two UTMs that are not too crazy give similar Solomonoff measures.) I think that infra-Bayesianism solves a lot of philosophical confusions, including anthropics and logical uncertainty, although some of the details still need to be worked out. (But, I'm not sure what specifically do you mean by "logical facts they observe during evolution"?) Ofc this doesn't mean I am already able to fully specify the correct infra-prior: I think that would take us most of the way to AGI.

Do we have any idea how to write down such an algorithm though?

I have all sorts of ideas, but still nowhere near the solution ofc. We can do deep learning while randomizing initial conditions and/or adding some noise to gradient descent (e.g. simulated annealing), producing a population of networks that progresses in an evolutionary way. We can, for each prediction, train a model that produces the *opposite* prediction and compare it to the default model in terms of convergence time and/or weight magnitudes. We can search for the algorithm using meta-learning. We can do variational Bayes with a "multi-modal" model space: mixtures of some "base" type of model. We can do progressive refinement of infra-Bayesian hypotheses, s.t. the plausible hypotheses at any given moment are the leaves of some tree.

moreover it's not clear to me the malign hypothesis faces a similar version of this problem since it's just thinking about a small list of hypotheses rather than trying to maintain a broad enough distribution to find all of them

Well, we also don't have to find all of them: we just have to make sure we don't miss the true one. So, we need some kind of transitivity: if we find a hypothesis which itself finds another hypothesis (in some sense) then we also find the other hypothesis. I don't know how to prove such a principle, but it doesn't seem implausible that we can.

it may just be reasoning deductively about properties of the space of hypotheses rather than using a simple algorithm we can write down.

Why do you think "reasoning deductively" implies there is no simple algorithm? In fact, I think infra-Bayesian logic might be just the thing to combine deductive and inductive reasoning.

**Vanessa Kosoy (vanessa-kosoy)**on Formal Solution to the Inner Alignment Problem · 2021-03-20T18:43:33.000Z · LW · GW

Is bounded? I assign significant probability to it being or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble.

I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don't. ~~In the latter case, shouldn't be large.~~ In the former case, it means that we are overwhelmingly likely to actually be inside a malign simulation. But, then AI risk is the least of our troubles. (In particular, because the simulation will probably be turned off once the attack-relevant part is over.)

[**EDIT:** I was wrong, see this.]

It feels like searching for a neural network is analogous to searching for a MAP estimate, and that more generally efficient algorithms are likely to just run one hypothesis most of the time rather than running them all.

Probably efficient algorithms are not running literally all hypotheses, but, they can probably consider multiple plausible hypotheses. In particular, the malign hypothesis itself is an efficient algorithm and it is somehow aware of the two different hypotheses (itself and the universe it's attacking). Currently I can only speculate about neural networks, but I do hope we'll have competitive algorithms amenable to theoretical analysis, whether they are neural networks or not.

As you mention, is it safe to wait and defer or are we likely to have a correlated failure in which all the aligned systems block simultaneously? (e.g. as described here)

I think that the problem you describe in the linked post can be delegated to the AI. That is, instead of controlling trillions of robots via counterfactual oversight, we will start with just one AI project that will research how to organize the world. This project would top any solution we can come up with ourselves.

**Vanessa Kosoy (vanessa-kosoy)**on Vanessa Kosoy's Shortform · 2021-03-18T16:06:05.410Z · LW · GW

I retracted part of that, see the edit.

**Vanessa Kosoy (vanessa-kosoy)**on Vanessa Kosoy's Shortform · 2021-03-18T15:21:27.185Z · LW · GW

*Probably not too original but I haven't seen it clearly written anywhere*.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

**Amplifying by objective time:** The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to infiltrate the present world. We can try to defend by adding a button for "malign AI is attacking", but that still leaves us open to surprise takeovers in which there is no chance to press the button.

**Amplifying by subjective time:** The AI is predicting what the user(s) will output after thinking about a problem for a short time, where in the beginning they are given the output of a similar process that ran for one iteration less. So, this simulates a "groundhog day" scenario where the humans wake up in the same objective time period over and over without memory of the previous iterations but with a written legacy. This is weaker than amplifying by objective time, because learning previous results is an overhead, and illegible intuitions might be hard to transmit. This is safer than amplifying by objective time, but if there is some probability of malign AI created in the short time period, there is still an attack vector. The malign AI leakage in this method is roughly proportional to subjective time of simulation times the *present* rate of malign AI takeover, as opposed to amplification by objective time where leakage is proportional to subjective time of simulation times some average *future* rate of malign AI takeover. However, by the time we are able to create this benign AI, the present rate of malign AI takeover might also be considerable.

**Amplifying by probability:** We allow the user(s) to choose "success" or "failure" (or some continuous metric) after completing their work, and make the AI skew the distribution of predictions toward success. This is similar to amplifying by subjective time without any transmission of information. It is weaker and about as safe. The potential advantage is lower sample complexity: the AI only needs to have a reliable distribution of outcomes after the initial state instead of subsequent states.

**Amplifying by parallelization:** The AI is predicting the output of many copies of the user working together, by having strictly defined interfaces between the copies, over a time period similar to real time. For example, we can imagine a hierarchical organization where each person gives subtasks to their subordinates. We can then simulate such an organization with a copy of some subset of users in each role. To do this, the AI only needs to learn what a given subset of users would do given a particular task from their supervisors and particular results by their subordinates. This method is weaker than previous methods since it requires that the task at hand can be parallelized. ~~But, it is also the safest since the rate of malign AI takeover is only amplified by compared to the background.~~ [**EDIT:** Actually, it's not safer than subjective time because the AI would sample the external world independently for each node in the organization. To avoid this, we would need to somehow define a correspondence between the outcome sets of worlds in which the user was queried at different nodes, and I don't know how to do this.]

A complete solution can try to combine all of those methods, by simulating a virtual organization where the members can control which method is applied at every point. This way they can strive for the optimal risk-performance balance: parallelize everything that can be parallelized and amplify otherwise tasks that cannot be parallelized, change the subjective/objective time balance based on research into malign AI timelines etc.

**Vanessa Kosoy (vanessa-kosoy)**on HCH Speculation Post #2A · 2021-03-18T14:25:25.792Z · LW · GW

[**EDIT**: After thinking about this some more, I realized that malign AI leakage is a bigger problem than I thought when writing the parent comment, because the way I imagined it can be overcome doesn't work that well.]

I think the biggest differences are that HCH is a psychological "monoculture," HCH has tiny bottlenecks through which to pass messages compared to the information I can pass to my future self, and there's some presumption that the output will be "an answer" whereas I have no such demands on the brain-state I pass to tomorrow.

I don't think that last one is a real constraint. What counts as "an answer" is entirely a matter of interpretation by the participants in the HCH. For example, initially I can ask the question "what are the most useful thoughts about AI alignment I can come up with during 1,000,000 iterations?". When I am tasked to answer the question "what are the most useful thoughts about AI alignment I can come up with during n iterations?" then

- If n = 1, I will just spend my allotted time thinking about AI alignment and write whatever I came up with in the end.
- If n > 1, I will ask "what are the most useful thoughts about AI alignment I can come up with during n − 1 iterations?". Then, I will study the answer and use the remaining time to improve on it to the best of my ability.

An iteration of 2 weeks might be too short to learn the previous results, but we can work in longer iterations. Certainly, having to learn the previous results from text carries overhead compared to just remembering myself developing them (and having developed some illegible intuitions in the process), but only that much overhead.
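The scheme just described can be sketched as a simple recursion, with a stand-in function for one human work period (everything here is illustrative):

```python
def hch_answer(n, work_period):
    """Answer for n iterations: obtain the (n-1)-iteration legacy
    (if any), then spend one work period improving on it."""
    if n == 1:
        return work_period(None)      # think from scratch for one period
    previous = hch_answer(n - 1, work_period)
    return work_period(previous)      # study the written legacy, improve it

# Toy stand-in: each period appends one refinement to the legacy document.
legacy = hch_answer(3, lambda prev: (prev or []) + ["refinement"])
# legacy now holds three accumulated refinements.
```

The overhead discussed above lives inside `work_period`: part of each period is spent reading the legacy rather than producing new thoughts.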

As to "monoculture", we can do HCH with multiple people (either the AI learns to simulate the entire system of multiple people or we use some rigid interface e.g. posting on a forum). For example, we can imagine putting the entire AI X-safety community there. But, we certainly don't want to put the entire world in there, since that way malign AI would probably leak into the system.

I think the problems are harder to solve if you want IDA approximations of HCH. I'm not totally sure what you meant by the confidence thresholds link - was it related to this?

Yes: it shows how to achieve reliable imitation (although for now in a theoretical model that's not feasible to implement), and the same idea should be applicable to an imitation system like IDA (although it calls for its own theoretical analysis). Essentially, the AI queries a real person if and only if it cannot produce a reliable prediction using previous data (because there are several plausible mutually inconsistent hypotheses), and the frequency of queries vanishes over time.

**Vanessa Kosoy (vanessa-kosoy)**on Intermittent Distillations #1 · 2021-03-17T17:05:43.478Z · LW · GW

I think the Armstrong and Mindermann NFL result is very weak. Obviously inferring values requires assuming that the planning algorithm is trying to maximize those values *in some sense*, and they don't have such an assumption. IMO my AIT definition of intelligence shows a clear path to solving this. That said, I'm not at all sure this is enough to get alignment without full access to the human policy.